0

Levenshtein, soundex, similar_text — fuzzy string matching

Intermediate5 min read·php-03-014

Concept

Fuzzy string matching functions measure similarity between strings rather than looking for exact or pattern-based matches. They're used for spell-checking, autocorrect suggestions, deduplication, and search-as-you-type features.

levenshtein($str1, $str2) computes the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into the other. Known as edit distance. O(n·m) time complexity. Optional parameters allow different costs for each operation — useful when deletions are cheaper than insertions for your domain. Returns an integer: 0 means identical strings, higher means more different.

similar_text($str1, $str2, &$percent) calculates the number of matching characters and optionally the percentage similarity. Uses a different algorithm than Levenshtein — it finds the longest common substring recursively. Less symmetric than Levenshtein (order of arguments matters for the percentage). Slower than Levenshtein for long strings.

soundex($str) / metaphone($str): phonetic encoding — returns a code representing how a word sounds. Soundex is a simple 4-character code. Metaphone is more accurate for English. Used to find words that sound similar regardless of spelling. double_metaphone (not built-in, but available via PECL) handles more languages.

Practical use: For production search-as-you-type or autocorrect, these built-in functions are too slow for large datasets — use Elasticsearch, Algolia, or a dedicated fuzzy search library. These functions shine for small-scale matching: validating city names, correcting command-line argument typos, finding near-duplicate entries in a small list.

Code Example

php
<?php
declare(strict_types=1);

// Levenshtein — edit distance
echo levenshtein('kitten', 'sitting'); // 3
echo levenshtein('php',    'PHP');     // 3 (case-sensitive, 3 substitutions)
echo levenshtein('hello',  'hello');   // 0

// Spell-correction: find closest match in a dictionary
function findClosest(string $word, array $dictionary): string
{
    $minDist = PHP_INT_MAX;
    $closest = $dictionary[0];
    foreach ($dictionary as $candidate) {
        $dist = levenshtein(strtolower($word), strtolower($candidate));
        if ($dist < $minDist) {
            $minDist = $dist;
            $closest = $candidate;
        }
    }
    return $closest;
}

$cities = ['Paris', 'London', 'Berlin', 'Madrid', 'Bucharest'];
echo findClosest('Londnn', $cities);    // "London"
echo findClosest('Bucharst', $cities);  // "Bucharest"

// Levenshtein with custom costs
// Cost 1 for insert/delete, 2 for substitution (substitutions are expensive)
echo levenshtein('abc', 'axc', 1, 2, 1); // 2 (one substitution at cost 2)

// similar_text — percentage similarity
similar_text('World of Warcraft', 'World of Warcraft II', $percent);
echo round($percent, 1); // 94.7%

similar_text('PHP', 'Python', $pct);
echo round($pct, 1); // 40.0%

// Note: argument order affects the result
similar_text('abc', 'xyz', $pct1);
similar_text('xyz', 'abc', $pct2);
// $pct1 may != $pct2 — similar_text is not fully symmetric

// Soundex / Metaphone — phonetic matching
echo soundex('Smith');   // "S530"
echo soundex('Smyth');   // "S530" — same code, sounds the same

echo metaphone('Thompson'); // "TMPSN"
echo metaphone('Thomson');  // "TMSN"  — different (more accurate than soundex)

// Find phonetically similar names
$names = ['Smith', 'Schmidt', 'Smyth', 'Jones', 'Jonez'];
$target = soundex('Smythe');
$matches = array_filter($names, fn($n) => soundex($n) === $target);
print_r($matches); // ['Smith', 'Smyth']