Entity resolution is the discipline of figuring out when two records — across different systems, files, or formats — describe the same real-world thing. The "thing" is usually a person (customer, patient, provider) or a business (account, vendor, supplier), but it can be any entity: a product SKU, a building, a clinical trial, a household.
It sounds trivial. It is not. Real-world data is messy: people change names, addresses get abbreviated three different ways, vendors copy the same supplier into a CRM four times, healthcare rosters reuse provider IDs, and someone always types "St" instead of "Street". Entity resolution is the work of deciding, with calibrated confidence, whether "John Smith, 123 Main St, jsmith@gmail.com" and "John W Smith, 123 Main Street, john.smith@gmail.com" are the same John Smith.
Why it's hard
A naive approach — exact-string match on a few key columns — misses the majority of true duplicates in real data. Studies of CRM hygiene routinely find that 15–30% of "unique" customer records are actually duplicates that VLOOKUP and Excel filtering can't catch. Why:
- Nickname and spelling variants (Robert vs Bob, Catherine vs Kathy)
- Format drift ((617) 555-1234 vs 617.555.1234 vs +16175551234)
- Transliteration (García vs Garcia, Müller vs Mueller)
- Truncation and abbreviation (Massachusetts General Hospital vs Mass Gen)
- Field reordering (First Last vs Last, First)
- Partial overlap (someone moved, changed jobs, married, abbreviated their name)
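Some of these variations can be reduced by normalizing values before any comparison happens. The following is a minimal Python sketch, not ListMatchGenie's cleansing logic: the helper names are hypothetical, and the phone helper assumes US numbers.

```python
import re
import unicodedata


def normalize_phone(raw: str) -> str:
    """Keep digits only and drop a leading US country code (assumes US numbers)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def fold_accents(text: str) -> str:
    """Decompose characters and strip combining marks, so 'García' becomes 'Garcia'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(raw: str) -> str:
    """Lowercase, fold accents, and turn 'Last, First' into 'first last'."""
    text = fold_accents(raw).lower().strip()
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
        text = f"{first} {last}"
    return re.sub(r"\s+", " ", text)
```

Normalization only goes so far. Accent folding maps Müller to Muller rather than Mueller, and no formatting rule turns Bob into Robert, so those cases still need lookup tables or fuzzy comparison.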
Single-field exact matching catches only the easy cases, roughly 30% of true duplicates. The remaining 70% or so needs probabilistic matching: a model that scores per-field similarity, weights fields by how informative each is, and combines those scores into a single confidence value.
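One simple way to picture such a model is a weighted average of per-field string similarities. The sketch below uses Python's standard difflib module; the field names and weights are illustrative assumptions, not ListMatchGenie's actual configuration or scoring method.

```python
from difflib import SequenceMatcher

# Hypothetical weights: more informative fields count for more.
FIELD_WEIGHTS = {"name": 0.3, "address": 0.2, "email": 0.5}


def record_score(left: dict, right: dict) -> float:
    """Weighted average of per-field similarity ratios, scaled to 0..100."""
    weighted_sum = 0.0
    weight_used = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = left.get(field), right.get(field)
        if not a or not b:
            continue  # a missing value is skipped rather than counted as a mismatch
        weighted_sum += weight * SequenceMatcher(None, a.lower(), b.lower()).ratio()
        weight_used += weight
    return 100 * weighted_sum / weight_used if weight_used else 0.0


a = {"name": "John Smith", "address": "123 Main St", "email": "jsmith@gmail.com"}
b = {"name": "John W Smith", "address": "123 Main Street", "email": "john.smith@gmail.com"}
score = record_score(a, b)  # high, but below 100
```

An exact-match rule would reject this pair on every field, while the combined score stays high because each field is only slightly different.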
The two flavors: deterministic vs probabilistic
Deterministic matching says: two records match if (and only if) they agree on a specific column or combination. Email match, NPI match, account-number match. Fast, transparent, but brittle — any typo, formatting difference, or missing value breaks it.
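In code, deterministic matching amounts to an exact-key lookup. A minimal sketch, assuming rows are plain dictionaries and using a hypothetical deterministic_match helper:

```python
def deterministic_match(source_rows, master_rows, key="email"):
    """Pair each source row with the master row holding the exact same key value."""
    index = {row[key]: row for row in master_rows if row.get(key)}
    return [(row, index.get(row.get(key))) for row in source_rows]


master = [{"id": 1, "email": "jsmith@gmail.com"}]
source = [
    {"email": "jsmith@gmail.com"},   # exact agreement: matched
    {"email": "JSmith@gmail.com "},  # case and trailing space differ: missed
    {"email": ""},                   # missing value: missed
]
pairs = deterministic_match(source, master)
```

The second and third rows show the brittleness: a stray capital letter, a trailing space, or an empty cell is enough to lose the match.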
Probabilistic matching says: two records match if their combined evidence across all fields exceeds a confidence threshold. Names, addresses, phones, emails, dates of birth — each contributes evidence weighted by how rare its agreement is in the population. (A shared rare last name like "Nakagawa" is much stronger evidence than a shared common one like "Smith".)
ListMatchGenie does both:
- Stage 1 runs deterministic matching on identifier columns first — fast wins on email/NPI/account.
- Stage 2 runs probabilistic matching on everything else: the engine weighs each field's agreement or disagreement by how likely that outcome is among true matches versus non-matches, using the Fellegi-Sunter framework, the academic foundation for modern record linkage.
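For intuition, the textbook Fellegi-Sunter weight for a field is a log-likelihood ratio built from two probabilities: m, the chance the field agrees when two records are the same entity, and u, the chance it agrees when they are not. The sketch below is a generic illustration with made-up m and u values; it is not ListMatchGenie's engine, and the field names are hypothetical.

```python
import math

# Hypothetical (m, u) pairs, for illustration only:
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(agreements: dict) -> float:
    """Sum Fellegi-Sunter log2 weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        if agrees:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


strong = match_weight({"last_name": True, "zip": True, "birth_year": True})
mixed = match_weight({"last_name": True, "zip": False, "birth_year": True})
```

A smaller u means agreement is rarer by chance, so the agreement weight is larger. That is the formal version of the point that a shared "Nakagawa" is stronger evidence than a shared "Smith".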
See Three-stage pipeline for the full flow.
Entity resolution vs deduplication vs identity resolution vs MDM
These terms get used interchangeably but they're not the same:
- Deduplication is one file against itself — find rows that describe the same entity.
- Entity resolution is the broader discipline — match records across any two (or more) sources, including dedup as a special case.
- Identity resolution is the marketing-tech version: stitch together the same person across email, mobile ad ID, web cookie, CRM, loyalty program. Same math, different vocabulary.
- Record linkage is the academic term, used heavily in healthcare and government statistics. Same problem.
- Master Data Management (MDM) is enterprise software that owns the system of record for resolved entities — usually a six-figure license, a two-year implementation, and a dedicated data team. Entity resolution is a capability MDM platforms provide; you don't need MDM to do entity resolution.
What you get out of entity resolution
A clean, deduplicated, source-ranked dataset where each unique entity has one canonical record. That canonical record is sometimes called a golden record (the survivorship-rules-applied "best" version of a person/company) and the broader picture across all touchpoints is sometimes called Customer 360.
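A survivorship rule can be as simple as "take each field from the most trusted source that has a value". Here is a minimal sketch of that single rule, assuming each record carries a hypothetical source field; real survivorship rules are usually richer (recency, completeness, per-field overrides).

```python
def golden_record(records: list[dict], source_rank: list[str]) -> dict:
    """Merge duplicates field by field, preferring the highest-ranked source with a value."""
    ordered = sorted(records, key=lambda r: source_rank.index(r["source"]))
    merged = {}
    for record in ordered:
        for field, value in record.items():
            # Keep the first non-empty value seen, i.e. the one from the best-ranked source.
            if field != "source" and value and field not in merged:
                merged[field] = value
    return merged


records = [
    {"source": "web_form", "name": "J Smith", "phone": "6175551234", "email": ""},
    {"source": "crm", "name": "John W Smith", "phone": "", "email": "john.smith@gmail.com"},
]
golden = golden_record(records, source_rank=["crm", "web_form"])
```

Here the CRM wins on name and email, and the web form fills in the phone number the CRM was missing.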
ListMatchGenie produces this every time you run a match:
- Each source row gets a _lmg_match_status (match/review/unmatched)
- Matched rows include the matched master row's columns alongside the source's, so you have a single "wide" record per entity
- Confidence scores let you sort by certainty
- The dedup report flags duplicates within the file before matching even starts
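To make the status column concrete, here is a sketch of how a 0-100 confidence score could be bucketed into the three statuses and then sorted by certainty. The cutoffs and the confidence key are assumptions for illustration, not ListMatchGenie's actual thresholds or column names.

```python
# Hypothetical cutoffs, chosen only for this example.
MATCH_THRESHOLD = 85
REVIEW_THRESHOLD = 60


def match_status(confidence: float) -> str:
    """Map a 0-100 confidence score to match / review / unmatched."""
    if confidence >= MATCH_THRESHOLD:
        return "match"
    if confidence >= REVIEW_THRESHOLD:
        return "review"
    return "unmatched"


rows = [{"id": 1, "confidence": 71.0}, {"id": 2, "confidence": 96.5}, {"id": 3, "confidence": 12.0}]
for row in rows:
    row["_lmg_match_status"] = match_status(row["confidence"])

# Most certain rows first, so reviewers can work down the list.
rows.sort(key=lambda r: r["confidence"], reverse=True)
```

The middle band is the useful part: it separates the rows worth a human look from the ones that can be accepted or dismissed outright.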
When you should care
If any of these are true, entity resolution is solving a real problem for you:
- Your CRM has multiple records for the same customer
- You're trying to match an external lead list against your internal customer list
- You bought a supplier list and need to know which entries you already have
- Your healthcare roster has provider names that don't quite match the NPI registry
- Your nonprofit donor file has the same household entered under three spellings
- You're building a loyalty program and need to unify identities across web/mobile/in-store
If your data is small, clean, and uses consistent identifiers everywhere — congratulations, you don't need entity resolution. Most teams aren't that lucky.
Related reading
- Three-stage pipeline — how cleanse, match, and insights fit together
- Confidence scores — what the 0–100 number means
- Match profiles — picking the right preset for your entity type
- How matching works — the engine internals
- Glossary — terms used in this article
