Levenshtein, soundex, similar_text — fuzzy string matching
Concept
Fuzzy string matching functions measure similarity between strings rather than looking for exact or pattern-based matches. They're used for spell-checking, autocorrect suggestions, deduplication, and search-as-you-type features.
levenshtein($str1, $str2) computes the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into the other. Known as edit distance. O(n·m) time complexity. Optional parameters allow different costs for each operation — useful when deletions are cheaper than insertions for your domain. Returns an integer: 0 means identical strings, higher means more different.
similar_text($str1, $str2, &$percent) calculates the number of matching characters and optionally the percentage similarity. Uses a different algorithm than Levenshtein — it finds the longest common substring recursively. Less symmetric than Levenshtein (order of arguments matters for the percentage). Slower than Levenshtein for long strings.
soundex($str) / metaphone($str): phonetic encoding — returns a code representing how a word sounds. Soundex is a simple 4-character code. Metaphone is more accurate for English. Used to find words that sound similar regardless of spelling. double_metaphone (not built-in, but available via PECL) handles more languages.
Practical use: For production search-as-you-type or autocorrect, these built-in functions are too slow for large datasets — use Elasticsearch, Algolia, or a dedicated fuzzy search library. These functions shine for small-scale matching: validating city names, correcting command-line argument typos, finding near-duplicate entries in a small list.
Code Example
<?php
declare(strict_types=1);
// Levenshtein — edit distance
echo levenshtein('kitten', 'sitting'); // 3
echo levenshtein('php', 'PHP'); // 3 (case-sensitive, 3 substitutions)
echo levenshtein('hello', 'hello'); // 0
// Spell-correction: find closest match in a dictionary
function findClosest(string $word, array $dictionary): string
{
$minDist = PHP_INT_MAX;
$closest = $dictionary[0];
foreach ($dictionary as $candidate) {
$dist = levenshtein(strtolower($word), strtolower($candidate));
if ($dist < $minDist) {
$minDist = $dist;
$closest = $candidate;
}
}
return $closest;
}
$cities = ['Paris', 'London', 'Berlin', 'Madrid', 'Bucharest'];
echo findClosest('Londnn', $cities); // "London"
echo findClosest('Bucharst', $cities); // "Bucharest"
// Levenshtein with custom costs
// Cost 1 for insert/delete, 2 for substitution (substitutions are expensive)
echo levenshtein('abc', 'axc', 1, 2, 1); // 2 (one substitution at cost 2)
// similar_text — percentage similarity
similar_text('World of Warcraft', 'World of Warcraft II', $percent);
echo round($percent, 1); // 94.7%
similar_text('PHP', 'Python', $pct);
echo round($pct, 1); // 40.0%
// Note: argument order affects the result
similar_text('abc', 'xyz', $pct1);
similar_text('xyz', 'abc', $pct2);
// $pct1 may != $pct2 — similar_text is not fully symmetric
// Soundex / Metaphone — phonetic matching
echo soundex('Smith'); // "S530"
echo soundex('Smyth'); // "S530" — same code, sounds the same
echo metaphone('Thompson'); // "TMPSN"
echo metaphone('Thomson'); // "TMSN" — different (more accurate than soundex)
// Find phonetically similar names
$names = ['Smith', 'Schmidt', 'Smyth', 'Jones', 'Jonez'];
$target = soundex('Smythe');
$matches = array_filter($names, fn($n) => soundex($n) === $target);
print_r($matches); // ['Smith', 'Smyth']