This page explains what happens between clicking Run match and seeing results. You don't need this to use the product, but understanding it helps you tune profiles, set thresholds, and interpret results when they surprise you.
The match engine runs in passes. Each pass has a specific job and runs in a specific order.
The stages, in order
Exact identifier match
If both files share a high-quality identifier column (email, account number, NPI, SSN, etc.), the engine first matches rows that agree on that column. These are matched immediately with a score of 100 and are the fastest, most reliable matches in the run.
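As a rough illustration, an exact identifier pass behaves like a hash join on the identifier column. The sketch below is not the engine's code: the function name, the row format (a list of dicts), and the trim-and-lowercase normalization are all assumptions made for the example.

```python
def exact_id_matches(source_rows, master_rows, id_field):
    """Pair rows that agree on a shared identifier column (hypothetical helper)."""
    # Index master rows by normalized identifier; the first occurrence wins.
    master_index = {}
    for master_pos, row in enumerate(master_rows):
        key = (row.get(id_field) or "").strip().lower()  # assumed normalization
        if key and key not in master_index:
            master_index[key] = master_pos

    matches = []
    for source_pos, row in enumerate(source_rows):
        key = (row.get(id_field) or "").strip().lower()
        if key in master_index:
            # Identifier agreement is scored 100, as described above.
            matches.append((source_pos, master_index[key], 100))
    return matches


source = [{"email": "Ann@Example.com"}, {"email": "bob@example.com"}, {"email": ""}]
master = [{"email": "bob@example.com"}, {"email": "ann@example.com"}]
print(exact_id_matches(source, master, "email"))  # [(0, 1, 100), (1, 0, 100)]
```

Rows with an empty identifier are never matched here; they fall through to the later passes.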
Deterministic match
For pairs not matched by identifier, the engine checks whether all comparable fields agree exactly (after cleansing). If first name, last name, address, and ZIP all match exactly, the pair is classified as deterministic with a score in the 95–100 range.
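A minimal sketch of the deterministic check, assuming a simple stand-in for cleansing (trim, collapse whitespace, ignore case). The real cleansing rules are richer than this, and the field names are illustrative only.

```python
def cleanse(value):
    # Stand-in cleansing: trim, collapse internal whitespace, ignore case.
    return " ".join(str(value or "").split()).lower()


def is_deterministic_match(source_row, master_row, fields):
    """True when every comparable field is present and agrees exactly after cleansing."""
    for field in fields:
        left = cleanse(source_row.get(field))
        right = cleanse(master_row.get(field))
        if not left or left != right:
            return False
    return True


FIELDS = ["first_name", "last_name", "address", "zip"]  # illustrative field names
a = {"first_name": "Maria ", "last_name": "LOPEZ", "address": "12  Oak St", "zip": "30301"}
b = {"first_name": "maria", "last_name": "Lopez", "address": "12 Oak St", "zip": "30301"}
c = dict(b, zip="30302")

print(is_deterministic_match(a, b, FIELDS))  # True
print(is_deterministic_match(a, c, FIELDS))  # False
```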
Blocking
Fuzzy comparison of every source row against every master row would be prohibitively slow on large files. To avoid this, the engine groups rows into blocks using a cheap discriminating key (ZIP code, phonetic code of last name, domain, or another signal based on the match profile).
Within a block, pairs are compared; across blocks, they aren't. This reduces the comparison space by orders of magnitude while preserving almost all legitimate matches.
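To make the saving concrete, here is a small sketch of blocking on ZIP code. The function name and row format are assumptions for the example; the engine chooses its blocking key from the match profile.

```python
from collections import defaultdict


def candidate_pairs(source_rows, master_rows, block_key):
    """Return (source_pos, master_pos) pairs that share a blocking key."""
    # Group master rows by their blocking key.
    blocks = defaultdict(list)
    for master_pos, row in enumerate(master_rows):
        blocks[block_key(row)].append(master_pos)

    # Each source row is only paired with master rows in its own block.
    pairs = []
    for source_pos, row in enumerate(source_rows):
        for master_pos in blocks.get(block_key(row), []):
            pairs.append((source_pos, master_pos))
    return pairs


source = [{"zip": "30301"}, {"zip": "10001"}, {"zip": "94105"}]
master = [{"zip": "10001"}, {"zip": "30301"}, {"zip": "30301"}, {"zip": "60601"}]

pairs = candidate_pairs(source, master, lambda row: row["zip"])
print(pairs)                                  # [(0, 1), (0, 2), (1, 0)]
print(len(pairs), "of", len(source) * len(master))  # 3 of 12
```

Only 3 of the 12 possible pairs are compared. The trade-off is visible too: a true match whose ZIP differs between the files would never be compared under this key.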
Candidate scoring
Within each block, every pair is scored on each comparable field:
- Name fields — fuzzy string similarity, boosted by phonetic and nickname matches
- Address fields — token-based comparison with abbreviation normalization
- ZIP — exact, or distance-based if ZIP radius matching is enabled
- State / country — binary (match or no match)
- Numeric fields (prices, dimensions) — relative-tolerance comparison
Per-field scores are combined using the profile's weights into a single composite score (0–100).
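One plausible reading of "combined using the profile's weights" is a weighted average. The sketch below assumes that, and the weights and field names are invented for illustration; actual profile weights are not shown on this page.

```python
def composite_score(field_scores, weights):
    """Weighted average of per-field scores (each 0-100), giving a 0-100 composite."""
    total_weight = sum(weights[field] for field in field_scores)
    weighted_sum = sum(weights[field] * score for field, score in field_scores.items())
    return round(weighted_sum / total_weight, 1)


weights = {"name": 5, "address": 3, "zip": 2}       # hypothetical profile weights
scores = {"name": 90, "address": 70, "zip": 100}    # per-field scores for one pair

# (5*90 + 3*70 + 2*100) / 10 = 860 / 10
print(composite_score(scores, weights))  # 86.0
```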
Classification
For each source row, the engine takes its best-scoring candidate and classifies:
- Score ≥ match threshold (default 70) → match
- Score ≥ review threshold (default 55), below match threshold → review
- Below review threshold → unmatched
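The classification rule can be sketched as a pair of threshold checks. The function name is an assumption; the default thresholds are the ones stated above.

```python
def classify(score, match_threshold=70, review_threshold=55):
    """Map a composite score to match / review / unmatched."""
    if score >= match_threshold:
        return "match"
    if score >= review_threshold:
        return "review"
    return "unmatched"


for score in (86, 70, 69.9, 55, 54):
    print(score, classify(score))
```

Both thresholds are inclusive in this sketch, so a score of exactly 70 is a match and exactly 55 goes to review.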
Assignment optimization (optional)
When one-to-one matching is enabled, the engine runs a final pass that finds the globally optimal 1:1 pairing between source and master — preventing the case where two source rows both claim the same master as their best match.
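The example below shows why a global pass is needed. It uses brute force over all possible pairings, which only works for tiny inputs; the engine's actual algorithm is not documented here, and a production system would typically use something like the Hungarian algorithm instead. The score table is made up.

```python
from itertools import permutations


def best_one_to_one(scores):
    """scores[s][m] is the composite score of source row s against master row m.

    Brute force over every 1:1 assignment; assumes len(source) <= len(master).
    """
    n_source, n_master = len(scores), len(scores[0])
    best_total, best_pairs = -1, []
    for masters in permutations(range(n_master), n_source):
        total = sum(scores[s][m] for s, m in enumerate(masters))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(masters))
    return best_pairs, best_total


scores = [
    [90, 85],  # source 0: best candidate is master 0
    [88, 40],  # source 1: best candidate is also master 0
]
print(best_one_to_one(scores))  # ([(0, 1), (1, 0)], 173)
```

Taken row by row, both source rows claim master 0. The global pairing gives master 0 to source 1 and master 1 to source 0, for a total of 173 instead of the 130 you get if source 0 keeps master 0.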
The scoring mechanics
Composite score
Per-field scores are combined using the profile's weighting. Each profile assigns more weight to the fields that carry more identity signal for that entity type — names and identifiers are weighted heavily for people; domain and company name for organizations; SKU for products.
Weight normalization for missing fields
If a comparable field is missing from one or both records, its weight is redistributed proportionally to the remaining fields. Records with less data don't get artificially low scores.
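Redistributing a missing field's weight proportionally is equivalent to dividing by the sum of the weights that remain. A sketch, again with hypothetical weights and field names:

```python
def composite_with_missing(field_scores, weights):
    """field_scores maps field -> score, or None when the field is missing on either record."""
    present = {field: score for field, score in field_scores.items() if score is not None}
    if not present:
        return None  # nothing comparable
    # Dividing by the remaining weight spreads the missing weight proportionally.
    remaining_weight = sum(weights[field] for field in present)
    weighted_sum = sum(weights[field] * score for field, score in present.items())
    return round(weighted_sum / remaining_weight, 1)


weights = {"name": 5, "address": 3, "zip": 2}          # hypothetical profile weights
scores = {"name": 90, "address": None, "zip": 100}     # address missing on one record

print(composite_with_missing(scores, weights))  # 92.9
```

Without normalization, counting the missing address as zero would give (5*90 + 2*100) / 10 = 65.0, which would push a strong pair below the default match threshold.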
Tie-breaking
If two candidates score identically, the tie is broken by row order in the master file: the lower master row ID wins. The exception is one-to-one matching. When it is enabled, global assignment optimization may resolve the tie differently.
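The default tie-break can be expressed as a sort key: highest score first, then lowest master row ID. The tuple format is an assumption for the example.

```python
def best_candidate(candidates):
    """candidates is a list of (master_row_id, score) tuples."""
    # Highest score wins; among equal scores, the lower master row ID wins.
    return min(candidates, key=lambda c: (-c[1], c[0]))


print(best_candidate([(42, 88), (7, 88), (3, 61)]))  # (7, 88)
```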
Passes you might not think about
Pre-filter
Before expensive fuzzy comparison, a quick-reject pass eliminates obvious non-matches:
- Last-name similarity below 80%: drop (a true match rarely diverges that much on last name)
- Gender mismatch on names (Eric vs Erica): drop unless configured to ignore
This pass can reduce the scored pair count by 70–90% at almost no accuracy cost.
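A quick-reject pass along these lines might look like the sketch below. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure (the engine's own measure is not specified here), and the gender lookup is a tiny hypothetical table for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical lookup; a real table would be far larger.
GENDER_BY_NAME = {"eric": "M", "erica": "F"}


def passes_prefilter(source, master, min_last_name_sim=0.80, check_gender=True):
    """Return False for pairs that can be dropped before expensive scoring."""
    similarity = SequenceMatcher(
        None, source["last_name"].lower(), master["last_name"].lower()
    ).ratio()
    if similarity < min_last_name_sim:
        return False
    if check_gender:
        g_source = GENDER_BY_NAME.get(source["first_name"].lower())
        g_master = GENDER_BY_NAME.get(master["first_name"].lower())
        # Only reject when both genders are known and they differ.
        if g_source and g_master and g_source != g_master:
            return False
    return True


eric = {"first_name": "Eric", "last_name": "Johnson"}
print(passes_prefilter(eric, {"first_name": "Eric", "last_name": "Johnsen"}))   # True
print(passes_prefilter(eric, {"first_name": "Eric", "last_name": "Jones"}))     # False
print(passes_prefilter(eric, {"first_name": "Erica", "last_name": "Johnson"}))  # False
```

Passing `check_gender=False` corresponds to the "configured to ignore" case.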
Phonetic indexing
If phonetic matching is enabled, each name is indexed by a phonetic code representing how it sounds. Names with matching phonetic codes can block together even if their spelling diverges significantly (Smith / Smyth code identically).
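This page doesn't say which phonetic algorithm the engine uses. As an illustration of the idea, here is American Soundex, one widely used phonetic code, under which Smith and Smyth do code identically.

```python
# Letter groups that share a Soundex digit.
SOUNDEX_CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":
            # H and W do not separate same-coded letters; vowels do.
            prev = digit
    return (code + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```

Using the phonetic code as a blocking key puts both spellings in the same block, so the pair still reaches candidate scoring.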
Nickname expansion
If nickname matching is enabled (default for Person), the engine maintains a canonical-form table (Bill → William, Jen → Jennifer, etc.) and accepts matches where source and master agree on the canonical form even if their original strings differ.
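In sketch form, nickname expansion is a lookup followed by an equality check. The table below is a tiny hypothetical sample, not the engine's actual table.

```python
# Hypothetical nickname -> canonical-form table.
CANONICAL = {
    "bill": "william",
    "will": "william",
    "jen": "jennifer",
    "jenny": "jennifer",
}


def canonical_first_name(name):
    key = name.strip().lower()
    # Names not in the table are treated as their own canonical form.
    return CANONICAL.get(key, key)


def first_names_agree(a, b):
    return canonical_first_name(a) == canonical_first_name(b)


print(first_names_agree("Bill", "William"))  # True
print(first_names_agree("Jen", "Jenny"))     # True
print(first_names_agree("Bill", "Jen"))      # False
```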
What the engine does not do
- Machine learning. The matching engine is rules-based. Weights, thresholds, and comparison algorithms are fixed per profile. This makes results predictable and explainable — every score can be reproduced from first principles.
- Cross-row inference. The engine doesn't infer "record X must match Y because every other row matches their obvious counterpart". Each pair is scored independently.
- Implicit duplicate collapse. If your source file has duplicates, the engine matches them independently — it's up to you to dedupe the source first.
Why stages matter for tuning
When you look at a match result and wonder why, the method breakdown on the job detail page tells you:
- Most matches came from exact_id? Your identifier column did the heavy lifting. Data is clean.
- Most came from fuzzy? Data is messy or there's no strong identifier. Consider enriching with one.
- Most came from phonetic? Lots of spelling variation. Normal for international data.
The method distribution is also what you tune against when adjusting profiles and thresholds.
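If you export results and want to reproduce the breakdown yourself, a tally like the one below is enough. It assumes each result row carries a method label such as exact_id, fuzzy, or phonetic; the export format is not described on this page, so the input here is a plain list.

```python
from collections import Counter


def method_breakdown(methods):
    """Return each match method's share of the results as a percentage."""
    counts = Counter(methods)
    total = sum(counts.values())
    return {method: round(100 * n / total, 1) for method, n in counts.most_common()}


methods = ["exact_id"] * 6 + ["fuzzy"] * 3 + ["phonetic"]
print(method_breakdown(methods))  # {'exact_id': 60.0, 'fuzzy': 30.0, 'phonetic': 10.0}
```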
Related reading
- Match profiles — how weights are set per entity type
- Confidence scores — how the composite score is used
- Setting the confidence threshold — tuning the classification dials
