Every team that touches customer or partner data eventually runs into the same problem: one real-world person or company appears in your systems under more than one record. Two slightly different spellings of a name. An email at work, an email at home. A phone number with parentheses, a phone number without. Deciding which records describe the same thing is called entity resolution, and getting it right is the difference between a CRM that helps your team and a CRM that lies to them.
You may have heard it called "record linkage" (academic and healthcare), "identity resolution" (martech), "fuzzy matching" (developers), or "data deduplication" (everyone). These are all entity resolution under different names. The underlying problem is the same: one real-world thing, several representations, and a decision you have to make with confidence.
Why entity resolution is hard
The naive answer is "just match on a unique identifier." Email, phone, account number: pick one and use it as the key. This works for the easy 30% of cases. The other 70% break on:
- Spelling variants — Robert vs Bob, Catherine vs Kathy, Garcia vs García
- Format drift — (617) 555-1234 vs 617.555.1234 vs +16175551234
- Truncation and abbreviation — Massachusetts General Hospital vs Mass Gen
- Field reordering — First Last vs Last, First
- Partial overlap — someone moved, changed jobs, married, abbreviated their name
- Missing values — they gave you a phone in one system, an email in another
Studies of CRM hygiene routinely find that 15-30% of "unique" records are actually duplicates that exact-string matching can't catch. That's where the real money is — and where naive solutions fail.
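Some of the variation listed above (format drift, field reordering, accents) can be removed before any matching happens. Here is a minimal sketch of that normalization step, using only the Python standard library. The function names and the assumption that 10-digit numbers are US numbers are ours, for illustration only. Nicknames such as Robert vs Bob are not handled here; they need a lookup table or fuzzy comparison.

```python
import re
import unicodedata


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Reduce a phone number to '+' followed by digits only."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: a 10-digit number is a national number missing its country code.
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits


def normalize_name(raw: str) -> str:
    """Lowercase, strip accents, and turn 'Last, First' into 'first last'."""
    if "," in raw:
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse repeated whitespace and lowercase.
    return " ".join(unaccented.lower().split())


phones = ["(617) 555-1234", "617.555.1234", "+16175551234"]
print({normalize_phone(p) for p in phones})  # {'+16175551234'}
print(normalize_name("García, Ana"))         # ana garcia
```

Normalizing first means that later exact comparisons see one representation per value instead of three.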
Deterministic vs probabilistic matching
There are two basic approaches, and the right answer is usually both.
Deterministic matching says: two records match if they agree exactly on a specific column. Email match, NPI match, account-number match. It's fast, transparent, and easy to audit. It also breaks the moment someone has a typo, an old email, or a missing value.
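As a rough sketch, deterministic matching is an exact join on one key. The example below assumes records are plain dictionaries; the field names and sample data are hypothetical.

```python
def deterministic_match(left, right, key):
    """Pair records from two lists of dicts that agree exactly on `key`."""
    index = {}
    for rec in right:
        value = rec.get(key)
        if value:  # records with a missing key cannot match anything
            index.setdefault(value, []).append(rec)

    matches = []
    for rec in left:
        for candidate in index.get(rec.get(key), []):
            matches.append((rec, candidate))
    return matches


crm = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "bob@example.com"},
]
leads = [
    {"id": "a", "email": "ana@example.com"},  # exact match
    {"id": "b", "email": "bob@exmaple.com"},  # typo in the domain: missed
    {"id": "c", "email": None},               # missing value: missed
]

for lead, customer in deterministic_match(leads, crm, "email"):
    print(lead["id"], "->", customer["id"])
```

Only lead "a" is paired. The typo and the missing value both fall through, which is exactly the failure mode described above.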
Probabilistic matching says: two records match if their combined evidence across every available field exceeds a confidence threshold. Names, addresses, phones, emails, dates of birth — each contributes evidence weighted by how informative it is. (A shared rare last name like "Nakagawa" is much stronger evidence than a shared common one like "Smith.") The math comes from the Fellegi-Sunter framework, the academic foundation for record linkage since 1969.
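To make the weighting idea concrete, here is a small sketch of Fellegi-Sunter-style scoring. For each field, m is the probability that two records agree on it when they really are the same entity, and u is the probability that they agree by coincidence. Agreement adds log2(m/u) to the score and disagreement adds log2((1-m)/(1-u)). The m and u values below are invented for illustration; in practice they are estimated from your data.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Evidence (in bits) gained when two records agree on a field."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Evidence (in bits, usually negative) when two records disagree."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical parameters, not estimated from any real dataset.
PARAMS = {
    "last_name": {"m": 0.95, "u": 0.01},
    "email": {"m": 0.90, "u": 0.0001},
    "zip": {"m": 0.85, "u": 0.05},
}


def match_score(rec_a: dict, rec_b: dict, params: dict = PARAMS) -> float:
    """Sum the per-field weights; missing values contribute no evidence."""
    total = 0.0
    for field, p in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue
        if a == b:
            total += agreement_weight(p["m"], p["u"])
        else:
            total += disagreement_weight(p["m"], p["u"])
    return total


# A rare surname agrees by chance less often (smaller u), so it carries more weight.
print(agreement_weight(0.95, 0.0001) > agreement_weight(0.95, 0.02))  # True
```

A pair is declared a match when its total score clears a threshold you choose. Pairs that land just under the threshold are good candidates for manual review.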
The right architecture is to run deterministic first (catch the easy wins fast), then run probabilistic on whatever's left. ListMatchGenie's three-stage pipeline does exactly this.
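The sketch below shows the general shape of "deterministic first, probabilistic on the rest." It is not ListMatchGenie's actual pipeline. It runs two passes only, and it uses `difflib.SequenceMatcher` from the standard library as a simple stand-in for a real probabilistic scorer. The field names and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a real probabilistic score."""
    return SequenceMatcher(None, a, b).ratio()


def two_pass_match(left, right, key="email", fuzzy_field="name", threshold=0.85):
    by_key = {r[key]: r for r in right if r.get(key)}
    matched, leftover = [], []

    # Pass 1: exact agreement on the key.
    for rec in left:
        hit = by_key.get(rec.get(key))
        if hit is not None:
            matched.append((rec, hit, "deterministic"))
        else:
            leftover.append(rec)

    # Pass 2: fuzzy comparison, only for records pass 1 could not place.
    used = {id(hit) for _, hit, _ in matched}
    candidates = [r for r in right if id(r) not in used]
    for rec in leftover:
        best = max(
            candidates,
            key=lambda r: similarity(rec[fuzzy_field], r[fuzzy_field]),
            default=None,
        )
        if best is not None and similarity(rec[fuzzy_field], best[fuzzy_field]) >= threshold:
            matched.append((rec, best, "probabilistic"))
    return matched


leads = [
    {"name": "ana garcia", "email": "ana@example.com"},
    {"name": "jon smith", "email": None},
    {"name": "zed quux", "email": None},
]
crm = [
    {"name": "ana garcia", "email": "ana@example.com"},
    {"name": "john smith", "email": "john@example.com"},
]
for lead, customer, method in two_pass_match(leads, crm):
    print(lead["name"], "->", customer["name"], f"({method})")
```

The second pass compares every leftover record against every remaining candidate, which is fine for a demo but slow on large files. Real systems usually restrict comparisons to records that share a cheap blocking key, such as a zip code or the first letters of a surname.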
Entity resolution vs deduplication vs identity resolution vs MDM
These terms get used interchangeably. They aren't the same:
- Deduplication is one file against itself — find rows that describe the same entity within a single dataset.
- Entity resolution is the broader discipline. Match records across any two (or more) sources, including dedup as a special case.
- Identity resolution is the marketing-tech version: stitch the same person across email, mobile ad ID, web cookie, CRM, loyalty program. Same math, different vocabulary, focused on customer touchpoints.
- Record linkage is the academic / healthcare / government statistics term. Same problem.
- Master Data Management (MDM) is enterprise software that owns the system of record for resolved entities. Six-figure license, two-year implementation, dedicated data team. Entity resolution is a capability MDM platforms provide; you don't need MDM to do entity resolution.
What you get from doing entity resolution well
The output of good entity resolution is a clean dataset where each unique entity has one canonical record. That canonical record is sometimes called a golden record (the "best" version of an entity, assembled by applying survivorship rules that decide which source value wins for each field), and the broader picture across all touchpoints is sometimes called Customer 360.
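A survivorship rule can be as simple as "take the most recently updated non-empty value for each field." The sketch below applies that single rule to a cluster of records already judged to be the same entity. The field names, the `updated` column (ISO dates, so they sort correctly as strings), and the sample data are hypothetical, and real rule sets are usually richer, for example by ranking sources on trust.

```python
def build_golden_record(cluster, fields):
    """Merge duplicate records, preferring the newest non-empty value per field."""
    newest_first = sorted(cluster, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        # Take the first non-empty value, walking from newest to oldest.
        golden[field] = next((r[field] for r in newest_first if r.get(field)), None)
    return golden


cluster = [
    {"name": "Bob Lee", "email": "bob@old.example", "phone": None,
     "updated": "2021-03-01"},
    {"name": "Robert Lee", "email": None, "phone": "+16175551234",
     "updated": "2023-08-15"},
]
print(build_golden_record(cluster, ["name", "email", "phone"]))
```

Here the newer record supplies the name and phone, while the email survives from the older record because the newer one has none.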
What this unlocks:
- Marketing stops sending three identical emails to the same customer
- Sales stops calling someone the team already won (or lost) last quarter
- Support sees the customer's full history instead of one ticket in isolation
- Finance reports accurate customer counts and lifetime values
- Compliance can actually answer "what data do you have on this person" for GDPR/CCPA requests
- Analytics stops double-counting and finally trusts the dashboards
When you need entity resolution (and when you don't)
You need it if any of these are true:
- Your CRM has duplicate records you've never gotten around to cleaning
- You buy or subscribe to external lead lists and need to know which entries you already have
- You're matching healthcare provider rosters against the NPI registry or other authority files
- Your marketing operations team spends hours per week on "list hygiene"
- You're consolidating systems after an acquisition
- You're building a loyalty program and need to unify identities across channels
- You've ever exported a customer list to a spreadsheet and tried to dedupe it manually
You don't need it if your data is small, you control every input system, you use consistent identifiers everywhere, and your data quality is genuinely high. Most teams aren't in this position.
Build vs buy
Entity resolution can be built in-house: Splink, dedupe (the open-source Python library from the dedupe.io project), and similar libraries are mature and free. Building it yourself makes sense if:
- You have a dedicated data engineering team
- You want full control over the matching logic
- You're matching at scales (10M+ records per run) where ongoing tuning pays off
It doesn't make sense if you'd rather solve the problem in an afternoon than in a quarter. ListMatchGenie targets the gap between Excel/VLOOKUP (free but limited to simple deterministic matching) and enterprise MDM platforms (powerful but $50K+/year and a team to operate). For a marketing ops, sales ops, or RevOps person who needs to match a 50K-row lead list against a 200K-row CRM tonight, the answer is usually "buy something focused, not build something custom."
The TL;DR
Entity resolution is figuring out when two records describe the same real-world entity. It's harder than it looks because real data is messy. Deterministic matching catches the easy 30%; probabilistic matching catches most of the rest. The output — a clean, deduplicated, cross-referenced dataset — is the foundation for everything from accurate dashboards to GDPR compliance to a working CRM. Whether you build or buy depends on scale, team, and how soon you need the answer.
If "this afternoon" is the answer, give ListMatchGenie a try: the free tier handles 1,000 rows, paid tiers scale to 500,000 rows per match, and no annual contract is required.

