What's the difference between entity resolution and deduplication?

Deduplication is one file against itself — finding rows that describe the same entity within a single dataset. Entity resolution is the broader discipline: matching records across any two or more sources, with dedup as a special case. Same math, bigger scope.

How is entity resolution different from Master Data Management (MDM)?

Entity resolution is a capability. MDM is enterprise software that uses that capability to own the system of record for resolved entities. MDM platforms like Informatica or Reltio carry six-figure licenses and multi-year implementations. You can do entity resolution without MDM — in a spreadsheet, in Python, or in a focused tool like ListMatchGenie — and for most teams that's the right choice.

When do I need entity resolution vs just matching on email or ID?

Simple ID-based matching works for roughly 30% of real-world cases — rows where both records share a clean, populated unique identifier. The other 70% breaks against typos, format drift (phone number styles), nicknames, missing values, and abbreviations. Entity resolution is what catches those — and studies of CRM hygiene consistently find 15–30% of 'unique' records are actually duplicates that exact-match can't detect.

What Is Entity Resolution? A Practical Guide for Data Teams

Q: What is entity resolution?

Entity resolution is the process of figuring out when two records describe the same real-world thing — the same person, company, product, or provider — even when the records don't agree on every field. It combines deterministic matching (exact agreement on a unique ID like email or NPI) with probabilistic matching (weighing combined evidence across every available field) to decide with confidence whether two records refer to the same entity.

Every team that touches customer or partner data eventually runs into the same problem: the same real-world person or company appears in your systems under more than one record. Two slightly different spellings of a name. An email at work, an email at home. A phone number with parentheses, a phone number without. The math problem of figuring out which records describe the same thing is called entity resolution, and getting it right is the difference between a CRM that helps your team and a CRM that lies to them.

If you've ever heard "record linkage" (academic and healthcare), "identity resolution" (martech), "fuzzy matching" (developers), or "data deduplication" (everyone), those are all names for closely related versions of entity resolution. The scope differs from term to term, but the underlying problem is the same: one real-world thing, different representations, and a decision to make with confidence.

Why entity resolution is hard

The naive answer is "just match on a unique identifier." Email, phone, account number — pick one and use it as the key. This works for the easy 30% of cases. The other 70% breaks against:

Nicknames and spelling variants: Robert vs Bob, Catherine vs Kathy, Garcia vs García
Format drift: (617) 555-1234 vs 617.555.1234 vs +16175551234
Truncation and abbreviation: Massachusetts General Hospital vs Mass Gen
Field reordering: First Last vs Last, First
Partial overlap: someone moved, changed jobs, got married, or shortened their name
Missing values: they gave you a phone number in one system, an email in another
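
Some of these breakages (format drift, accents, field reordering) can be reduced by normalizing values before any comparison happens. The sketch below is one minimal way to do that in Python. The function names and the assumption of US-style ten-digit phone numbers are illustrative, not taken from any particular tool.

```python
import re
import unicodedata


def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Reduce a phone number to digits, assuming US-style numbers."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:  # no country code supplied, so prepend the default
        digits = default_country + digits
    return digits


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents, and turn 'Last, First' into 'first last'."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().lower()
    if "," in text:
        last, first = [part.strip() for part in text.split(",", 1)]
        text = f"{first} {last}"
    return re.sub(r"\s+", " ", text)


phones = ["(617) 555-1234", "617.555.1234", "+16175551234"]
print({normalize_phone(p) for p in phones})
print(normalize_name("García, María"))
```

Normalization only goes so far. It does nothing for nicknames (Robert vs Bob), abbreviations (Mass Gen), or missing values, which is why the fuzzier techniques described later are still needed.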

Studies of CRM hygiene routinely find that 15-30% of "unique" records are actually duplicates that exact-string matching can't catch. That's where the real money is — and where naive solutions fail.

Deterministic vs probabilistic matching

There are two basic approaches, and the right answer is usually both.

Deterministic matching says: two records match if they agree exactly on a specific column. Email match, NPI match, account-number match. It's fast, transparent, and easy to audit. It also breaks the moment someone has a typo, an old email, or a missing value.
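
As a rough illustration, deterministic matching on email can be sketched as a dictionary lookup. The record layout (plain dicts with an "email" field) and the function name are assumptions made for this example.

```python
def deterministic_match(left, right, key="email"):
    """Pair records whose normalized key values are identical and non-empty."""

    def norm(value):
        return (value or "").strip().lower()

    # Index the right-hand records by their normalized key.
    index = {}
    for rec in right:
        k = norm(rec.get(key))
        if k:
            index.setdefault(k, []).append(rec)

    matches, unmatched = [], []
    for rec in left:
        k = norm(rec.get(key))
        hits = index.get(k, []) if k else []
        if hits:
            matches.extend((rec, hit) for hit in hits)
        else:
            unmatched.append(rec)
    return matches, unmatched


crm = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.com"},
]
leads = [
    {"id": "a", "email": " Ann@Example.com "},  # differs only in case and spaces
    {"id": "b", "email": "bob@exmaple.com"},    # typo in the domain
    {"id": "c", "email": None},                 # missing value
]
matches, unmatched = deterministic_match(leads, crm)
print([(l["id"], r["id"]) for l, r in matches])
print([rec["id"] for rec in unmatched])
```

Lead "a" pairs with CRM record 1 once case and whitespace are normalized. Leads "b" and "c" fall through, which is exactly the typo and missing-value failure described above.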

Probabilistic matching says: two records match if their combined evidence across every available field exceeds a confidence threshold. Names, addresses, phones, emails, dates of birth: each contributes evidence weighted by how informative it is. (A shared rare last name like "Nakagawa" is much stronger evidence than a shared common one like "Smith.") The math is calibrated weighted evidence, meaning each field's agreement or disagreement adds or subtracts a weight tuned to how well that field separates true matches from coincidences. It is the same foundation that academic record-linkage research has refined for decades.
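
One common formulation of this idea in the record-linkage literature is the Fellegi-Sunter model, which scores each field with a log-likelihood ratio. The sketch below assumes that model for illustration. The article does not say which formulation any given tool uses, and every probability, name frequency, and threshold here is invented for the example.

```python
import math

# Hypothetical parameters, invented for illustration (not from a real dataset):
#   m = P(field agrees | same entity)
#   u = P(field agrees | different entities)
FIELD_PARAMS = {
    "last_name": {"m": 0.95, "u": 0.01},
    "zip": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.97, "u": 0.02},
}
# Hypothetical share of the population carrying each surname.
LAST_NAME_FREQ = {"smith": 0.01, "nakagawa": 0.00001}


def field_weight(field, a_val, b_val):
    """Log2 likelihood ratio contributed by one field."""
    if not a_val or not b_val:
        return 0.0  # a missing value adds no evidence either way
    m, u = FIELD_PARAMS[field]["m"], FIELD_PARAMS[field]["u"]
    if a_val == b_val:
        if field == "last_name":
            # Rare names are less likely to agree by coincidence.
            u = LAST_NAME_FREQ.get(a_val, u)
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(a, b):
    """Total evidence across all configured fields."""
    return sum(field_weight(f, a.get(f), b.get(f)) for f in FIELD_PARAMS)


def is_match(a, b, threshold=10.0):
    """Declare a match when total evidence clears a hypothetical threshold."""
    return match_weight(a, b) >= threshold


rare = match_weight({"last_name": "nakagawa"}, {"last_name": "nakagawa"})
common = match_weight({"last_name": "smith"}, {"last_name": "smith"})
print(rare > common)
```

In practice the m and u values would be estimated from data rather than typed in by hand, and the threshold would be tuned against reviewed examples.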

The right architecture is to run deterministic first (catch the easy wins fast), then run probabilistic on whatever's left. ListMatchGenie's three-stage pipeline does exactly this.
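
The deterministic-then-probabilistic ordering can be sketched generically as follows. This is not ListMatchGenie's pipeline, whose internals the article does not describe. A plain string-similarity score from Python's standard difflib module stands in for the probabilistic stage, and the record layout and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_stage_match(left, right, threshold=0.85):
    """Stage 1: exact email. Stage 2: fuzzy name score on what is left."""
    by_email = {r["email"].lower(): r for r in right if r.get("email")}
    results, leftovers = [], []

    # Stage 1: cheap exact lookups catch the easy wins.
    for rec in left:
        hit = by_email.get((rec.get("email") or "").lower())
        if hit:
            results.append((rec["id"], hit["id"], "deterministic", 1.0))
        else:
            leftovers.append(rec)

    # Stage 2: only the unmatched remainder pays for pairwise scoring.
    for rec in leftovers:
        best, best_score = None, 0.0
        for cand in right:
            score = similarity(rec["name"], cand["name"])
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            results.append((rec["id"], best["id"], "fuzzy", round(best_score, 3)))
    return results


crm = [
    {"id": 1, "name": "Catherine Jones", "email": "cjones@example.com"},
    {"id": 2, "name": "Robert Nakagawa", "email": "rn@example.com"},
]
leads = [
    {"id": "a", "name": "Cathy Jones", "email": "CJONES@example.com"},
    {"id": "b", "name": "Robert Nakagwa", "email": ""},
    {"id": "c", "name": "Zed Quux", "email": None},
]
for row in two_stage_match(leads, crm):
    print(row)
```

Running the exact stage first matters for cost as well as accuracy: the second stage compares every leftover against every candidate, so the fewer leftovers, the less work. At larger scales that inner loop would usually be restricted to records sharing a coarse key (a step often called blocking).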

Entity resolution vs deduplication vs identity resolution vs MDM

These terms get used interchangeably. They aren't the same:

Deduplication is one file against itself — find rows that describe the same entity within a single dataset.
Entity resolution is the broader discipline. Match records across any two (or more) sources, including dedup as a special case.
Identity resolution is the marketing-tech version: stitch the same person across email, mobile ad ID, web cookie, CRM, loyalty program. Same math, different vocabulary, focused on customer touchpoints.
Record linkage is the academic / healthcare / government statistics term. Same problem.
Master Data Management (MDM) is enterprise software that owns the system of record for resolved entities. Six-figure license, two-year implementation, dedicated data team. Entity resolution is a capability MDM platforms provide; you don't need MDM to do entity resolution.

What you get from doing entity resolution well

The output of good entity resolution is a clean dataset where each unique entity has one canonical record. That canonical record is sometimes called a golden record (the "best" version of an entity, produced by applying survivorship rules that decide which source's value wins for each field), and the broader picture across all touchpoints is sometimes called Customer 360.
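
To make the golden record idea concrete, here is a small sketch of one possible survivorship rule: for each field, keep the non-empty value from the most recently updated record. The rule, the field names, and the "updated" timestamp are all assumptions for illustration. Real survivorship rules are usually set per field and per source.

```python
def build_golden_record(cluster):
    """Merge records already judged to be the same entity into one record.

    Hypothetical rule: for each field, the newest non-empty value wins.
    """
    # ISO-formatted date strings sort correctly as plain text.
    newest_first = sorted(cluster, key=lambda r: r["updated"], reverse=True)
    fields = {f for rec in cluster for f in rec if f not in ("id", "updated")}

    golden = {"source_ids": sorted(rec["id"] for rec in cluster)}
    for field in sorted(fields):
        golden[field] = next(
            (rec[field] for rec in newest_first if rec.get(field)), None
        )
    return golden


cluster = [
    {"id": 1, "updated": "2023-01-10", "name": "Bob Smith",
     "email": "bob@old.example", "phone": ""},
    {"id": 2, "updated": "2024-06-02", "name": "Robert Smith",
     "email": "", "phone": "16175551234"},
]
print(build_golden_record(cluster))
```

The merged record takes the name and phone from the newer record and falls back to the older record's email because the newer one is blank. Keeping the source IDs preserves the link back to the original rows.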

What this unlocks:

Marketing stops sending three identical emails to the same customer
Sales stops calling someone the team already won (or lost) last quarter
Support sees the customer's full history instead of one ticket in isolation
Finance reports accurate customer counts and lifetime values
Compliance can actually answer "what data do you have on this person" for GDPR/CCPA requests
Analytics stops double-counting and finally trusts the dashboards

When you need entity resolution (and when you don't)

You need it if any of these are true:

Your CRM has duplicate records you've never gotten around to cleaning
You buy or subscribe to external lead lists and need to know which entries you already have
You're matching healthcare provider rosters against the NPI registry or other authority files
Your marketing operations team spends hours per week on "list hygiene"
You're consolidating systems after an acquisition
You're building a loyalty program and need to unify identities across channels
You've ever exported a customer list to a spreadsheet and tried to dedupe it manually

You don't need it if your data is small, you control every input system, you use consistent identifiers everywhere, and your data quality is genuinely high. Most teams aren't in this position.

Build vs buy

Entity resolution can be built in-house — mature open-source record-linkage libraries exist and are free. Building it yourself makes sense if:

You have a dedicated data engineering team
You want full control over the matching logic
You're matching at scales (10M+ records per run) where ongoing tuning pays off

It doesn't make sense if you'd rather solve the problem in an afternoon than in a quarter. ListMatchGenie targets the gap between Excel/VLOOKUP (free but caps out at simple deterministic matching) and enterprise MDM platforms (powerful but $50K+/year and a team to operate). For someone in marketing ops, sales ops, or RevOps who needs to match a 50K-row lead list against a 200K-row CRM tonight, the answer is usually "buy something focused, not build something custom."

The TL;DR

Entity resolution is figuring out when two records describe the same real-world entity. It's harder than it looks because real data is messy. Deterministic matching catches the easy 30%; probabilistic matching catches most of the rest. The output — a clean, deduplicated, cross-referenced dataset — is the foundation for everything from accurate dashboards to GDPR compliance to a working CRM. Whether you build or buy depends on scale, team, and how soon you need the answer.

If "this afternoon" is the answer, give ListMatchGenie a try — free tier handles 1,000 rows, paid tiers scale to 500,000 per match, no annual contract required.

