What are public data matching benchmarks?

Public benchmarks are standardized test datasets used by data science researchers to compare entity resolution systems. They include real or synthetic records with known correct matches (called ground truth), so anyone can measure precision and recall on the same data. Two of the most-cited benchmarks for record linkage and product matching are Febrl4 (people) and Abt-Buy (products). Reporting numbers on these datasets makes a system's performance directly comparable to results published in academic papers.

What is F1 score in record matching?

F1 score is the standard accuracy measure for matching systems. It combines precision (the percentage of surfaced matches that are actually correct) and recall (the percentage of true matches the system found). F1 is the harmonic mean of the two, 2 × precision × recall / (precision + recall), so a system cannot score well by maximizing one while neglecting the other. It is reported either on a 0-to-1 scale or as a percentage: F1 of 1.0 (100%) means perfect matching; F1 of 0.0 means complete failure. What counts as good varies by dataset: F1 above 0.95 is considered excellent for clean people matching, while F1 above 0.60 is considered good for dirty product matching.
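To make the definitions concrete, here is a minimal Python sketch that computes the three numbers from a set of predicted match pairs and a ground-truth set. The function name `score_matches` and the toy record IDs are illustrative assumptions, not part of any benchmark tooling.

```python
def score_matches(predicted, truth):
    """Return (precision, recall, f1) for sets of (source_id, target_id) pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Precision: share of surfaced pairs that are real matches.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real matches that were surfaced.
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 4 true matches; the system surfaces 3 pairs, 2 of them correct.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
precision, recall, f1 = score_matches(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.667 recall=0.500 f1=0.571
```

In the toy example, one wrong pair and two missed pairs pull F1 down to about 0.57, even though two thirds of the surfaced matches are correct.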

What does 100% precision mean in matching?

100% precision means every match the system surfaces is correct: there are zero false positives. The trade-off is usually that some real matches are missed or routed to a review queue instead of being auto-confirmed. Many matching tools accept some wrong matches in exchange for higher match counts. ListMatchGenie is designed to keep precision at 100% on auto-confirmed matches because a wrong match corrupts data, and corrupted data is more expensive to fix than to prevent.

Does ListMatchGenie need training data to work?

No. Most modern matching systems require labeled training pairs — typically hundreds to thousands of human-confirmed examples — before they perform well. ListMatchGenie is unsupervised: it analyzes your data, picks the right matching strategy automatically, and runs without any setup. The benchmark numbers in this article were achieved on the first run, with no training and no configuration.

We Put Our Matching Engine Against the Industry's Toughest Public Benchmarks. Here's What Happened.

Most data matching tools make claims. We decided to do something different. We took ListMatchGenie and ran it against the same public benchmarks the data science community uses to measure entity resolution — the datasets cited in academic papers from MIT, Stanford, AWS Research, and the Leipzig Database Research Group.

Then we did something unusual. We ran them out of the box. No training data. No labeled examples. No configuration tuning. No two-week onboarding project with a data engineer on the line. Just upload, map columns, match.

The numbers surprised us. They are also the kind of numbers that come with caveats, and we are going to share both the wins and the honest limitations.

The benchmarks we tested

We picked four benchmark configurations built on three public datasets because each one stresses a different part of a matching engine, and because they appear widely in the academic literature on record matching.

Febrl4 (people matching with identifier)

The canonical record-linkage benchmark. 5,000 source records and 5,000 target records of synthetic but realistic people data: names, addresses, dates of birth, and a unique social security number. The dataset includes controlled noise such as typos, missing fields, transliterations, and swapped first and last names. It is one of the most widely referenced datasets in entity-resolution research.

Febrl4 fuzzy-only (people matching without identifier)

The same dataset, but with the unique identifier column removed before matching. This forces the engine to match purely on names, addresses, and dates of birth — the realistic scenario when your two lists do not share a clean ID. This is the harder, more honest test of fuzzy intelligence.
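As a rough illustration of that setup, the sketch below strips an identifier field from each record before matching. The field name `soc_sec_id`, the helper `drop_identifier`, and the sample records are assumptions made for the example; check the column names in your own copy of the dataset.

```python
def drop_identifier(records, id_field="soc_sec_id"):
    """Return copies of the records without the identifier field."""
    return [{k: v for k, v in rec.items() if k != id_field} for rec in records]


# Hypothetical sample rows in the shape of a people-matching dataset.
source = [
    {"given_name": "sarah", "surname": "nguyen", "date_of_birth": "19810412", "soc_sec_id": "1234567"},
    {"given_name": "tom", "surname": "reid", "date_of_birth": "19750930", "soc_sec_id": "7654321"},
]

fuzzy_only = drop_identifier(source)
print(sorted(fuzzy_only[0]))  # remaining field names; the identifier is gone
```

The originals are left untouched, so the identifier can still be used afterwards as ground truth when scoring the fuzzy-only run.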

Abt-Buy (dirty product matching)

A product matching benchmark from two real electronics retailers. Same products, described very differently across stores. Long marketing descriptions on one side, short bullet specs on the other. Model numbers buried inside product titles. Word reordering everywhere. The kind of mess real e-commerce catalog data looks like.

Amazon-Google Products (cross-platform product matching)

Two large product catalogs from very different sources, where the same product is described by completely different people with completely different conventions. Sparse manufacturer data, mismatched categories, descriptions that share almost no vocabulary. The hardest of the four.

The numbers

We measured the three metrics that matter for any matching system: precision (how many surfaced matches are actually correct), recall (how many true matches the system found), and F1 (the harmonic mean of precision and recall). All three are reported below as percentages. Higher is better on all three.

Febrl4 (with identifier):

F1: 99.1%
Precision: 100%
Recall: 98.2%
Time to match 5,000 against 5,000 records: 4.4 seconds

Febrl4 (fuzzy only, no identifier):

F1: 88.5%
Precision: 100%
Recall: 79.3%

Abt-Buy (dirty product matching):

F1: 75.8%
Precision: 100%
Recall: 61.1%

Amazon-Google Products:

F1: 21%
Precision: 93%
Recall: 12%
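Because F1 is fully determined by precision and recall, the figures above can be sanity-checked with a few lines of Python. This is a minimal sketch: the helper `f1_from` is our own name, and the inputs are the rounded percentages listed above, so the computed values can differ from the reported ones by a rounding step.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as listed in this article.
reported = {
    "Febrl4 (with identifier)": (100.0, 98.2, 99.1),
    "Febrl4 (fuzzy only)": (100.0, 79.3, 88.5),
    "Abt-Buy": (100.0, 61.1, 75.8),
    "Amazon-Google Products": (93.0, 12.0, 21.0),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = 100 * f1_from(p / 100, r / 100)
    print(f"{name}: computed F1 {f1:.2f}%, reported {f1_reported}%")
```

All four computed values land within about a quarter of a percentage point of the reported F1, which is what you would expect when the inputs have been rounded.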

A few of these deserve a closer look.

What 99.1% on Febrl4 actually means

For context, the Febrl4 benchmark is matched at this level only by deep-learning systems that have been trained on thousands of hand-labeled example pairs. ListMatchGenie hits the same range with no examples at all. No training set. No configuration. Just upload, map, match.

That is unusual. Most published systems describe their numbers as "after training" — the implicit cost is the labels someone had to produce. With ListMatchGenie there is no training step. The 99.1% is the result on the first run.

What 75.8% on Abt-Buy actually means

This one stopped us when we saw it. The standard benchmark figure for Abt-Buy with a supervised deep-learning approach (DeepMatcher, the most-cited system on this dataset) is around F1 0.63. Most published unsupervised methods land between F1 0.30 and 0.45 on Abt-Buy.

ListMatchGenie scored F1 0.76 (75.8%) on Abt-Buy without any training. That is roughly 13 points above the supervised baseline that most papers compare to, and well outside the range of typical unsupervised numbers on this dataset.

For a customer dealing with messy product catalogs — different vendors describing the same product different ways, model numbers half-hidden in titles, word reordering everywhere — this is the relevant proof point.

Zero false positives on every confirmed match

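Across the Febrl4 and Abt-Buy runs, every single match ListMatchGenie auto-confirmed turned out to be a correct match. Not "high precision." Not "99%." 100%. Every one. The exception among our four tests is Amazon-Google Products, where precision was 93%; we cover that below.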

This is rare in published matching results. Most systems trade precision for recall — they will surface borderline matches because it pads their match count. ListMatchGenie does not. When the engine is uncertain, the pair goes to a review queue rather than being auto-confirmed.

The result: when ListMatchGenie tells you two records are the same, they are. There is no quiet contamination of your data with wrong merges, no slow accumulation of incorrect customer records, no compliance auditor asking why three different people share an account.

How this works without training or configuration

Here is the part most matching tools do not advertise: most matching engines do not actually look at your data before they match it. They apply a fixed strategy regardless of what is there. If your product catalog has model numbers buried inside product names, they do not notice. If your two customer lists describe people in different conventions, they do not adapt.

ListMatchGenie spends about one second studying your data before it matches. In that second, it asks questions like:

Are there model numbers, part codes, or SKUs hidden inside product names?
Are the same products written in different word orders across the two sides?
Are some descriptions long-form prose while the matching ones are short specs?
Are name fields sometimes empty on one side but not the other?
Are there dates of birth, prices, regions, or other corroborating signals available?

For each pattern the engine spots, it activates the matching strategy that fits. Detected embedded model numbers? Extract them and use them as primary identifiers. Catalogs phrase products differently? Switch to word-order-tolerant matching. Descriptions are wildly different lengths? Apply length-asymmetric comparison.
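The engine's internals are not published in this article, so the following is only a hedged Python sketch of what three such strategies could look like: pulling model-number-like tokens out of a title, comparing titles as token sets so word order stops mattering, and scoring a short spec against a long description by how much of the shorter side is covered. The function names, the "4+ characters mixing letters and digits" rule, and the Acme sample titles are all assumptions for illustration.

```python
import re


def _tokens(text):
    """Lower-cased alphanumeric word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_model_numbers(title):
    """Return model-number-like tokens: 4+ characters mixing letters and digits."""
    found = set()
    for tok in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", title):
        compact = tok.replace("-", "").upper()
        has_digit = any(c.isdigit() for c in compact)
        has_alpha = any(c.isalpha() for c in compact)
        if len(compact) >= 4 and has_digit and has_alpha:
            found.add(compact)
    return found


def token_set_similarity(a, b):
    """Jaccard similarity over word tokens; ignores word order."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def overlap_coefficient(a, b):
    """Share of the shorter side's tokens found in the longer side."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


short = "Acme XR-2000 blender"
long = "The Acme XR-2000 is a countertop blender with five speed settings"

print(extract_model_numbers("Acme ProBlend XR-2000 Blender, 1.5L"))
print(token_set_similarity("Acme XR-2000 Blender", "Blender Acme XR-2000"))
print(token_set_similarity(short, long), overlap_coefficient(short, long))
```

On the last pair, plain Jaccard similarity is dragged down by the extra words in the long description, while the overlap coefficient stays at 1.0 because every token of the short spec appears in the prose. That difference is the motivation for treating length-mismatched fields separately.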

Most matching tools take a one-size-fits-all approach. ListMatchGenie reads the room.

The 100% precision philosophy

A wrong match is worse than a missed match. We built ListMatchGenie around this principle, and the benchmark precision numbers are the design choice paying off.

When two customer records get incorrectly merged, you have corrupted your data. Two real people now share an account. Their orders, their addresses, their loyalty history — all tangled. Cleaning that up costs vastly more than any time the match saved you.

The review bucket is the design that makes this work. When the engine is genuinely confident — confident based on multiple corroborating signals, not just one weak match — it auto-confirms. When the evidence is suggestive but not airtight, it sends the pair to a review queue where the user can confirm in one click.
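One way to picture that triage is a rule that requires several independent signals to agree before auto-confirming. The sketch below is a hypothetical illustration: the signal names, the thresholds, and the `triage` function are assumptions, not ListMatchGenie's actual scoring logic.

```python
def triage(signals, confirm_at=0.9, review_at=0.6, min_strong_signals=2):
    """Classify a candidate pair as 'auto_confirm', 'review', or 'reject'.

    `signals` maps a signal name (e.g. "name", "dob") to a similarity in [0, 1].
    """
    strong = [score for score in signals.values() if score >= confirm_at]
    # Auto-confirm only when several signals are strong and none contradicts them.
    if len(strong) >= min_strong_signals and min(signals.values()) >= review_at:
        return "auto_confirm"
    # Suggestive evidence goes to a human instead of being merged silently.
    if max(signals.values(), default=0.0) >= review_at:
        return "review"
    return "reject"


print(triage({"name": 0.97, "dob": 1.0, "address": 0.82}))  # auto_confirm
print(triage({"name": 0.95, "dob": 0.0, "address": 0.40}))  # review
print(triage({"name": 0.30, "dob": 0.0}))                   # reject
```

Raising `min_strong_signals` or `confirm_at` pushes more pairs into review, trading recall on auto-confirmed matches for precision, which is the trade-off this section describes.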

What the design is built to prevent: ListMatchGenie quietly merging two records and getting it wrong. Across thousands of test pairs in the Febrl4 and Abt-Buy benchmark runs, that did not happen once.

Where we honestly fall short

We do not believe in marketing copy that buries the asterisks. So here are ours.

Where ListMatchGenie genuinely leads: Cleaning customer lists, deduplicating contact databases, reconciling vendor records, matching product catalogs that share identifying conventions, merging multiple sources into a single master view, and finding the same person across multiple CRMs. The realistic 80% to 90% of what most businesses need to match.

Where ListMatchGenie is not yet at the very top: Cross-platform product matching where two completely different sources describe the same product using almost no shared vocabulary. The Amazon-Google benchmark sits in this territory, and it is where ListMatchGenie scored F1 21%. Honest read: that is below the typical benchmark floor for unsupervised methods on this specific dataset.

To compete at the top of that benchmark, you would need a system based on semantic AI trained on millions of labeled examples: a different class of tool, with a different price tag, designed for a different use case. For the 80% to 90% of real-world matching work described above, that distinction never comes up. For the rest, we will be honest about it.

What this means for your business

If you are evaluating data matching tools today, the honest comparison looks like this.

Most enterprise data quality platforms require a data engineer to configure custom rules for each new dataset. They reach high accuracy after weeks of setup. Pricing assumes a Fortune 500 budget.

Most modern AI-based matching tools require labeled training data — hundreds to thousands of confirmed match pairs — before they perform well. The training cost gets buried in implementation services.

ListMatchGenie performs at near-supervised levels with zero training and no configuration, on the same public benchmarks the data science community uses to compare systems in academic papers. On Febrl4 and Abt-Buy it does this with 100% precision on auto-confirmed matches. And the price point is self-serve, not enterprise.

That is not magic. It is careful engineering: a matching engine that adapts to your data instead of forcing your data into its assumptions, paired with a deliberate design choice to surface uncertainty rather than guessing.

Try it on your data

The benchmark numbers above came from public datasets. The only benchmark that actually matters is your data.

Upload two files. Pick which columns describe the same thing across them. Let ListMatchGenie do its analysis and matching. Review the results. If the matches are not what you expected, you do not pay anything.

The Genie reads the room.