ListMatchGenie

Entity resolution

What entity resolution is, how it differs from deduplication and identity resolution, and how ListMatchGenie does it without the data-engineering project.

Entity resolution is the discipline of figuring out when two records — across different systems, files, or formats — describe the same real-world thing. The "thing" is usually a person (customer, patient, provider) or a business (account, vendor, supplier), but it can be any entity: a product SKU, a building, a clinical trial, a household.

It sounds trivial. It is not. Real-world data is messy: people change names, addresses get abbreviated three different ways, the same supplier gets entered into a CRM four times, healthcare rosters reuse provider IDs, and someone always types St instead of Street. Entity resolution is the work of deciding, with calibrated confidence, whether "John Smith, 123 Main St, jsmith@gmail.com" and "John W Smith, 123 Main Street, john.smith@gmail.com" are the same John Smith.

Why it's hard

A naive approach (exact-string match on a few key columns) misses the majority of true duplicates in real data. Studies of CRM hygiene routinely find that 15–30% of "unique" customer records are actually duplicates that VLOOKUP and Excel filtering can't catch. The usual causes:

  • Spelling variants (Robert vs Bob, Catherine vs Kathy)
  • Format drift ((617) 555-1234 vs 617.555.1234 vs +16175551234)
  • Transliteration (García vs Garcia, Müller vs Mueller)
  • Truncation and abbreviation (Massachusetts General Hospital vs Mass Gen)
  • Field reordering (First Last vs Last, First)
  • Partial overlap (someone moved, changed jobs, married, abbreviated their name)
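
Some of these variants can be reduced by normalizing values before any comparison happens. The following is a minimal Python sketch, assuming US phone numbers and a tiny illustrative abbreviation table. The function names and the table are hypothetical and are not part of ListMatchGenie's API.

```python
import re
import unicodedata

# Illustrative only: a real table would be far larger, and "st" can also
# mean "saint", so production rules need more context than this.
STREET_ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize_phone(raw: str) -> str:
    """Keep digits only and drop a leading US country code (assumed)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def strip_accents(text: str) -> str:
    """Fold accented characters to their base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_address(raw: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(STREET_ABBREVIATIONS.get(tok, tok) for tok in tokens)
```

With these helpers, the three phone formats in the list above reduce to the same ten digits, and "123 Main St" and "123 Main Street" normalize to the same string. Accent folding turns García into Garcia but turns Müller into Muller rather than Mueller, and nicknames such as Bob for Robert are untouched, which is why normalization alone does not replace fuzzy or probabilistic comparison.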

Single-field exact matching catches the easy 30%. The remaining 70% needs probabilistic matching — a model that scores per-field similarity, weights fields by how informative each is, and combines those scores into a single confidence value.
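
To make "score per field, weight, combine" concrete, here is a rough sketch using only Python's standard library. The weights, field names, and function names are assumptions chosen for illustration, and difflib's similarity ratio stands in for whatever comparison functions a real engine uses.

```python
from difflib import SequenceMatcher

# Hypothetical weights: a higher value means the field is treated as
# more informative. Real systems estimate these from data.
FIELD_WEIGHTS = {"name": 0.3, "address": 0.3, "email": 0.4}


def field_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_confidence(left: dict, right: dict) -> float:
    """Weighted average of per-field similarities.

    Fields missing on either side are skipped, so two blank values are
    not mistaken for agreement.
    """
    score = 0.0
    used_weight = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = left.get(field, ""), right.get(field, "")
        if not a or not b:
            continue
        score += weight * field_similarity(a, b)
        used_weight += weight
    return score / used_weight if used_weight else 0.0
```

Applied to the two John Smith records from the introduction, this yields a confidence well above that of two unrelated records but below a perfect 1.0, which is the kind of graded answer exact matching cannot give.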

The two flavors: deterministic vs probabilistic

Deterministic matching says: two records match if (and only if) they agree on a specific column or combination. Email match, NPI match, account-number match. Fast, transparent, but brittle — any typo, formatting difference, or missing value breaks it.

Probabilistic matching says: two records match if their combined evidence across all fields exceeds a confidence threshold. Names, addresses, phones, emails, dates of birth — each contributes evidence weighted by how rare its agreement is in the population. (A shared rare last name like "Nakagawa" is much stronger evidence than a shared common one like "Smith".)
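
The "rare agreement is stronger evidence" idea can be sketched with Fellegi-Sunter style weights. In that framework, m is the probability that a field agrees when two records truly match, and u is the probability that it agrees between two unrelated records. The probabilities below are made-up illustrative values, not measured frequencies, and the function names are hypothetical.

```python
import math

# Assumed: surnames agree in 95% of true matches (typos explain the rest).
M_SURNAME = 0.95

# Assumed chance that two unrelated people share each surname.
U_BY_SURNAME = {"smith": 0.01, "nakagawa": 0.00001}


def agreement_weight(m: float, u: float) -> float:
    """Evidence (in bits) added when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Evidence (in bits, usually negative) added when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def surname_agreement_weight(surname: str) -> float:
    """Agreement weight for one specific shared surname."""
    return agreement_weight(M_SURNAME, U_BY_SURNAME[surname.lower()])
```

Under these assumed numbers, a shared "Nakagawa" contributes a much larger weight than a shared "Smith", because chance agreement on the rare name is far less likely. Summing such weights across fields gives the total evidence that is then compared against a threshold.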

ListMatchGenie does both:

  • Stage 1 runs deterministic matching on identifier columns first — fast wins on email/NPI/account.
  • Stage 2 runs probabilistic matching on everything else: for each field, the engine weighs how likely agreement (or disagreement) is among true matches against how likely it is among non-matches, using the Fellegi-Sunter framework, the academic foundation for modern record linkage.
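
The two stages can be sketched in a few lines of Python. This is an illustration of the general pattern only, not ListMatchGenie's implementation: the function name, the field names, the 0.85 threshold, and the use of difflib for fuzzy comparison on a single field are all assumptions.

```python
from difflib import SequenceMatcher


def link_records(source, master, id_field="email", fuzzy_field="name",
                 threshold=0.85):
    """Return (source_index, master_index_or_None, method) per source row."""
    # Stage 1 index: normalized identifier -> first master row that has it.
    index = {}
    for pos, row in enumerate(master):
        key = row.get(id_field, "").strip().lower()
        if key:
            index.setdefault(key, pos)

    results = []
    for i, row in enumerate(source):
        # Stage 1: deterministic lookup on the identifier.
        key = row.get(id_field, "").strip().lower()
        if key in index:
            results.append((i, index[key], "deterministic"))
            continue

        # Stage 2: fuzzy comparison against every master row.
        # Real engines limit candidates (blocking) and score many fields.
        value = row.get(fuzzy_field, "").lower()
        best_pos, best_score = None, 0.0
        if value:
            for pos, candidate in enumerate(master):
                other = candidate.get(fuzzy_field, "").lower()
                score = SequenceMatcher(None, value, other).ratio()
                if score > best_score:
                    best_pos, best_score = pos, score

        if best_pos is not None and best_score >= threshold:
            results.append((i, best_pos, "probabilistic"))
        else:
            results.append((i, None, "unmatched"))
    return results
```

Running the cheap identifier lookup first means the slower fuzzy comparison only has to handle rows that the identifier could not settle.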

See Three-stage pipeline for the full flow.

Entity resolution vs deduplication vs identity resolution vs MDM

These terms get used interchangeably but they're not the same:

  • Deduplication is one file against itself — find rows that describe the same entity.
  • Entity resolution is the broader discipline — match records across any two (or more) sources, including dedup as a special case.
  • Identity resolution is the marketing-tech version: stitch together the same person across email, mobile ad ID, web cookie, CRM, loyalty program. Same math, different vocabulary.
  • Record linkage is the academic term, used heavily in healthcare and government statistics. Same problem.
  • Master Data Management (MDM) is enterprise software that owns the system of record for resolved entities — usually a six-figure license, a two-year implementation, and a dedicated data team. Entity resolution is a capability MDM platforms provide; you don't need MDM to do entity resolution.

What you get out of entity resolution

A clean, deduplicated, source-ranked dataset where each unique entity has one canonical record. That canonical record is sometimes called a golden record (the survivorship-rules-applied "best" version of a person/company) and the broader picture across all touchpoints is sometimes called Customer 360.

ListMatchGenie produces this every time you run a match:

  • Each source row gets a _lmg_match_status (match / review / unmatched)
  • Matched rows include the matched master row's columns alongside the source's, so you have a single "wide" record per entity
  • Confidence scores let you sort by certainty
  • The dedup report flags duplicates within the file before matching even starts
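
One way to picture how a confidence score becomes a three-way status is a pair of cutoffs. The sketch below is hypothetical: the 0.9 and 0.6 thresholds and the "confidence" key are assumptions for illustration, and only the _lmg_match_status name and its three values come from the output described above.

```python
def match_status(confidence: float, accept: float = 0.9,
                 review: float = 0.6) -> str:
    """Map a 0..1 confidence score to match / review / unmatched."""
    if confidence >= accept:
        return "match"
    if confidence >= review:
        return "review"
    return "unmatched"


def review_queue(rows):
    """Rows flagged for review, most certain first."""
    flagged = [r for r in rows if r["_lmg_match_status"] == "review"]
    return sorted(flagged, key=lambda r: r["confidence"], reverse=True)
```

Sorting the review band by confidence lets a human reviewer start with the pairs most likely to be true matches.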

When you should care

If any of these are true, entity resolution is solving a real problem for you:

  • Your CRM has multiple records for the same customer
  • You're trying to match an external lead list against your internal customer list
  • You bought a supplier list and need to know which entries you already have
  • Your healthcare roster has provider names that don't quite match the NPI registry
  • Your nonprofit donor file has the same household entered under three spellings
  • You're building a loyalty program and need to unify identities across web/mobile/in-store

If your data is small, clean, and uses consistent identifiers everywhere — congratulations, you don't need entity resolution. Most teams aren't that lucky.