ListMatchGenie

Understanding Fuzzy Matching: Jaro-Winkler vs Levenshtein vs Token Sort

A clear comparison of the three most common fuzzy matching algorithms, when to use each one, and how they perform on different types of data.

If you have looked into fuzzy matching, you have encountered terms like Levenshtein distance, Jaro-Winkler similarity, and token sort ratio. These are all algorithms for measuring how similar two strings are, but they work differently and excel at different tasks. Choosing the wrong algorithm for your data type can mean the difference between a 70% match rate and a 90% match rate.

This article explains each algorithm in plain language, shows how they score the same example strings, and tells you which one to use for different types of data.

Levenshtein Distance (Edit Distance)

Levenshtein distance counts the minimum number of single-character operations needed to transform one string into another. The three allowed operations are: insert a character, delete a character, or substitute one character for another.

Example: Transforming "kitten" into "sitting" requires three operations: substitute k with s, substitute e with i, insert g at the end. The Levenshtein distance is 3.

To convert this into a similarity percentage, use the formula: 1 - (distance / max(length_of_string_a, length_of_string_b)). For "kitten" (6 chars) vs "sitting" (7 chars): 1 - (3/7) = 0.571, or about 57% similar.
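
The distance and the similarity formula above can be sketched in a few lines of Python. This is a minimal illustration (the function names are ours); real projects usually reach for a library such as rapidfuzz:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # 1 - (distance / max length), as described above.
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))                       # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # 0.571
```

Note that an adjacent swap like "ab" vs "ba" costs 2 operations here (one deletion plus one insertion, or two substitutions), which matters in the weaknesses below.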

Strengths

  • Intuitive and easy to understand
  • Good at catching typos (single character insertions, deletions, or substitutions)
  • Works well on short to medium-length strings
  • Well-supported in every programming language and most tools

Weaknesses

  • Sensitive to string length differences. "Bob" vs "Robert" has a high edit distance despite being the same name.
  • Does not handle transpositions efficiently. Swapping two adjacent characters (like "ab" to "ba") costs 2 operations, not 1.
  • Penalizes differences at the beginning of the string the same as differences at the end, even though prefix matches are often more meaningful.

Best for

Fields with expected typos: email addresses, company names, product codes, and any string where the most common errors are single-character mistakes.

Jaro-Winkler Similarity

The Jaro similarity metric was designed specifically for comparing short strings like names. It considers the number of matching characters and the number of transpositions. Two characters are considered matching if they are the same and no farther apart than floor(max(length_of_string_a, length_of_string_b) / 2) - 1 positions, roughly half the length of the longer string.

The Winkler modification adds a bonus for strings that share a common prefix, up to the first 4 characters. The idea: if two strings start the same way, they are more likely to be the same word with a variation later in the string.

Example scores:

  • "Martha" vs "Marhta": Jaro = 0.944, Jaro-Winkler = 0.961 (one transposition; the shared "Mar" prefix adds a boost)
  • "Dixon" vs "Dicksonx": Jaro = 0.767, Jaro-Winkler = 0.813 (the shared "Di" prefix boosts the score)
  • "Catherine" vs "Katherine": Jaro = 0.926, Jaro-Winkler = 0.926 (first characters differ, so no Winkler boost)
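
These scores can be reproduced with a sketch of the textbook Jaro and Winkler formulas. Function names are ours and edge cases (empty or single-character strings) are handled only minimally:

```python
def jaro(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match if equal and within this window of each other.
    window = max(len(a), len(b)) // 2 - 1
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(i + window + 1, len(b))
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(a)):
        if a_matched[i]:
            while not b_matched[k]:
                k += 1
            if a[i] != b[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    m = matches
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    j = jaro(a, b)
    # Winkler bonus: length of the common prefix, capped at 4 characters.
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("Martha", "Marhta"), 3))   # 0.961
print(round(jaro_winkler("Dixon", "Dicksonx"), 3))  # 0.813
```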

Strengths

  • Excellent for person names and short strings
  • Handles transpositions naturally (swapped adjacent characters)
  • The prefix bonus aligns well with how names typically vary (differences tend to be in the middle or end)
  • Score range is 0 to 1, which is intuitive as a percentage

Weaknesses

  • Less effective on long strings where the prefix is a small portion of the total
  • Does not handle word reordering. "John Smith" vs "Smith John" gets a low score even though they are clearly the same name.
  • The prefix bonus can occasionally cause false positives on names that share a common beginning but are different (like "Johnson" vs "Johnston")

Best for

Person name fields (first name and last name separately), city names, and any short string field where character-level similarity with prefix weighting makes sense.

Token Sort Ratio

Token sort ratio takes a completely different approach. Instead of comparing character sequences, it splits both strings into tokens (words), sorts the tokens alphabetically, joins them back into a string, and then compares the resulting strings using edit distance.

Example: "John Robert Smith" vs "Smith, John R." First, normalize (lowercase, strip punctuation), tokenize, and sort: "John Robert Smith" becomes "john robert smith" and "Smith, John R." becomes "john r smith". Then compare the sorted, normalized strings using edit distance.

This approach handles word reordering, which is common in:

  • Names stored as "Last, First" vs "First Last"
  • Company names: "International Business Machines" vs "Business Machines International"
  • Addresses: "Suite 100, 123 Main St" vs "123 Main St Suite 100"
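
A sketch of the whole pipeline, scoring the sorted strings with the edit-distance similarity formula from earlier (library implementations such as rapidfuzz compute the final ratio slightly differently, so exact scores will vary):

```python
import re

def levenshtein(a: str, b: str) -> int:
    # Same dynamic-programming edit distance as in the Levenshtein section.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def token_sort_ratio(a: str, b: str) -> float:
    def normalize(s: str) -> str:
        # Lowercase, replace punctuation with spaces, sort the tokens.
        tokens = re.sub(r"[^\w\s]", " ", s.lower()).split()
        return " ".join(sorted(tokens))
    a, b = normalize(a), normalize(b)
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(token_sort_ratio("John Smith", "Smith, John"))                     # 1.0
print(round(token_sort_ratio("John Robert Smith", "Smith, John R."), 3))  # 0.706
```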

Strengths

  • Handles word reordering transparently
  • Works well on multi-word strings (full names, company names, addresses)
  • Combined with normalization (lowercasing, removing punctuation), it is very effective at matching real-world data

Weaknesses

  • Sorting destroys positional information. If word order matters (like in a product name where "Model X Pro" is different from "Pro X Model"), token sort is inappropriate.
  • Short strings with one or two tokens do not benefit from sorting.
  • Extra tokens significantly reduce the score. "John Smith" vs "John Robert Smith III" gets a lower score than you might expect because the extra tokens add edit distance.

Best for

Full name fields (first + last combined), company names, addresses, and any multi-word field where word order may vary between sources.

Which Algorithm Should You Use?

The honest answer: use all of them on different fields. A well-designed matching system does not pick one algorithm. It picks the right algorithm for each column type:

  • First name: Jaro-Winkler (optimized for short name strings)
  • Last name: Jaro-Winkler (same reasoning, plus prefix bonus helps with common prefixes like "Mac" and "Mc")
  • Full name (combined): Token sort ratio (handles "Last, First" vs "First Last")
  • Email: Levenshtein (typos are the primary error type in emails)
  • Company name: Token sort ratio (word order varies, abbreviations like "Inc" and "LLC" are common)
  • Address: Token sort ratio (component order varies widely)
  • Phone number: Exact match on normalized digits (fuzzy matching is not appropriate for phone numbers)
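
As a rough sketch, per-field dispatch can be a simple table mapping each column to a scorer. The field names here are illustrative, and difflib's SequenceMatcher stands in for the algorithms above; this is not ListMatchGenie's actual implementation:

```python
import re
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    # Character-level stand-in for Levenshtein/Jaro-Winkler scorers.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_sort_similarity(a: str, b: str) -> float:
    # Normalize, sort tokens, then compare character-level.
    def norm(s: str) -> str:
        return " ".join(sorted(re.sub(r"[^\w\s]", " ", s.lower()).split()))
    return char_similarity(norm(a), norm(b))

def phone_exact(a: str, b: str) -> float:
    # Exact match on normalized digits only.
    def digits(s: str) -> str:
        return re.sub(r"\D", "", s)
    return 1.0 if digits(a) and digits(a) == digits(b) else 0.0

FIELD_SCORERS = {
    "full_name": token_sort_similarity,
    "company": token_sort_similarity,
    "email": char_similarity,
    "phone": phone_exact,
}

def score_pair(rec_a: dict, rec_b: dict) -> dict:
    """Score two records field by field with the right algorithm each."""
    return {field: scorer(rec_a.get(field, ""), rec_b.get(field, ""))
            for field, scorer in FIELD_SCORERS.items()}
```

The design point is the dispatch table, not the individual scorers: each column type gets the algorithm suited to its error patterns, and the per-field scores can then be weighted into an overall match decision.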

ListMatchGenie applies this multi-algorithm approach automatically. When the AI detects column types during upload, it assigns the optimal matching algorithm to each field. Name columns get Jaro-Winkler plus phonetic matching, address columns get token-based matching, and identifier columns get exact matching with normalization. You get the benefit of algorithm selection without needing to configure anything.

Topics

fuzzy matching, Jaro-Winkler, Levenshtein distance, token sort ratio, string similarity, record matching
