How matching works

This page explains what happens between clicking Run match and seeing results. You don't need this to use the product, but understanding it helps you tune profiles, set thresholds, and interpret results when they surprise you.

The match engine runs as a sequence of passes. Each pass has a specific job, and the passes always run in the same order.

The stages, in order

Exact identifier match

If both files share a high-quality identifier column (email, account number, NPI, SSN, etc.), the engine first matches rows that agree on that column. These are matched immediately with a score of 100 and are the fastest, most reliable matches in the run.
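As a rough illustration of this pass, the sketch below indexes the master file by identifier and looks up each source row. The function names, the row-as-dictionary shape, and the trim-and-lowercase normalization are assumptions made for the example, not the engine's actual code.

```python
def normalize_id(value):
    """Trim and lowercase an identifier (an assumed, minimal cleansing step)."""
    return str(value).strip().lower() if value is not None else ""


def exact_id_match(source_rows, master_rows, id_field):
    """Pair source rows with master rows that agree on id_field.

    Returns (matches, remaining): matches are (source, master, 100) tuples,
    remaining are source rows that go on to the later passes.
    """
    # Index the master file once so each source lookup is a dictionary hit.
    master_index = {}
    for master in master_rows:
        key = normalize_id(master.get(id_field))
        if key and key not in master_index:  # first occurrence wins (assumption)
            master_index[key] = master

    matches, remaining = [], []
    for source in source_rows:
        key = normalize_id(source.get(id_field))
        if key and key in master_index:
            matches.append((source, master_index[key], 100))
        else:
            remaining.append(source)
    return matches, remaining
```

Rows with an empty identifier are never matched here; they fall through to the next pass.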

Deterministic match

For pairs not matched by identifier, the engine checks whether all comparable fields agree exactly (after cleansing). If first name, last name, address, and ZIP all match exactly, the pair is classified as deterministic with a score in the 95–100 range.
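A minimal sketch of the deterministic check, assuming cleansing means lowercasing and collapsing whitespace (the real cleansing rules are likely richer). The field names and helper functions are hypothetical.

```python
def cleanse(value):
    """Lowercase and collapse runs of whitespace (assumed cleansing rules)."""
    return " ".join(str(value).lower().split())


def is_deterministic_match(source, master, fields):
    """True when every listed field is present and equal after cleansing."""
    for field in fields:
        left, right = source.get(field), master.get(field)
        if left in (None, "") or right in (None, ""):
            return False  # a missing value cannot agree exactly
        if cleanse(left) != cleanse(right):
            return False
    return True
```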

Blocking

Fuzzy comparison of every source row against every master row would be prohibitively slow on large files. To avoid this, the engine groups rows into blocks using a cheap discriminating key (ZIP code, phonetic code of last name, domain, or another signal based on the match profile).

Within a block, pairs are compared; across blocks, they aren't. This reduces the comparison space by orders of magnitude while preserving almost all legitimate matches.
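The idea can be sketched in a few lines. Here the blocking key is passed in as a function, and ZIP code is used in the usage note; both the function name and the row shape are assumptions for illustration.

```python
from collections import defaultdict


def block_pairs(source_rows, master_rows, key_fn):
    """Yield only the (source, master) pairs that share a blocking key."""
    # Group master rows by key so each source row sees only its own block.
    blocks = defaultdict(list)
    for master in master_rows:
        blocks[key_fn(master)].append(master)

    for source in source_rows:
        for master in blocks.get(key_fn(source), []):
            yield source, master
```

With 2 source rows and 3 master rows, a full comparison would score 6 pairs. If only two master rows share a ZIP with a source row, `block_pairs(source, master, lambda r: r["zip"])` yields 2.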

Candidate scoring

Within each block, every pair is scored on each comparable field:

Name fields — fuzzy string similarity, boosted by phonetic and nickname matches
Address fields — token-based comparison with abbreviation normalization
ZIP — exact, or distance-based if ZIP radius matching is enabled
State / country — binary (match or no match)
Numeric fields (prices, dimensions) — relative-tolerance comparison

Per-field scores are combined using the profile's weights into a single composite score (0–100).
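To make the per-field comparisons concrete, here is a simplified sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in for the engine's fuzzy string algorithm, a three-entry abbreviation table, and a 5% numeric tolerance. All of these, along with the function names and weights, are assumptions for the example. Phonetic and nickname boosts are omitted.

```python
from difflib import SequenceMatcher

# Illustrative abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def name_score(a, b):
    """Fuzzy string similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def address_score(a, b):
    """Token overlap (Jaccard) after expanding common abbreviations."""
    def tokens(text):
        words = text.lower().replace(".", "").replace(",", "").split()
        return {ABBREVIATIONS.get(w, w) for w in words}

    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 100 * len(ta & tb) / len(ta | tb)


def exact_score(a, b):
    """Binary comparison, as used for state, country, or exact ZIP."""
    return 100.0 if a == b else 0.0


def numeric_score(a, b, tolerance=0.05):
    """Full score when the relative difference is within the tolerance."""
    if a == b:
        return 100.0
    relative_diff = abs(a - b) / max(abs(a), abs(b))
    return 100.0 if relative_diff <= tolerance else 0.0


def composite(field_scores, weights):
    """Weighted average of per-field scores, on a 0-100 scale."""
    total_weight = sum(weights[f] for f in field_scores)
    return sum(field_scores[f] * weights[f] for f in field_scores) / total_weight
```

For example, with weights of 5, 3, and 2 for name, address, and ZIP, field scores of 90, 100, and 100 combine to (450 + 300 + 200) / 10 = 95.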

Classification

For each source row, the engine takes its best-scoring candidate and classifies the pair by comparing that score against two thresholds:

Score ≥ match threshold (default 70) → match
Score ≥ review threshold (default 55), below match threshold → review
Below review threshold → unmatched
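The three rules above amount to two comparisons. A minimal sketch, using the default thresholds quoted above (the function name is hypothetical):

```python
def classify(score, match_threshold=70, review_threshold=55):
    """Map a composite score (0-100) to match, review, or unmatched."""
    if score >= match_threshold:
        return "match"
    if score >= review_threshold:
        return "review"
    return "unmatched"
```

Note that both boundaries are inclusive: a score of exactly 70 is a match, and a score of exactly 55 goes to review.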

Assignment optimization (optional)

When one-to-one matching is enabled, the engine runs a final pass that finds the globally optimal 1:1 pairing between source and master — preventing the case where two source rows both claim the same master as their best match.
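The sketch below shows why a global pass can differ from picking each row's best candidate. It assumes "globally optimal" means maximizing the total composite score, and it brute-forces every pairing, which is only practical for tiny inputs. A production implementation would more plausibly use an assignment algorithm such as the Hungarian method. The function name and the square score matrix are assumptions.

```python
from itertools import permutations


def best_one_to_one(scores):
    """Return the 1:1 pairing with the highest total score.

    scores[i][j] is the composite score of source row i against master row j.
    A square matrix is assumed to keep the example short.
    """
    n = len(scores)
    best_total, best_perm = -1, None
    for perm in permutations(range(n)):  # perm[i] = master assigned to source i
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm))


# Both source rows score highest against master 0.
scores = [
    [92, 88],  # source 0
    [90, 40],  # source 1
]
pairing = best_one_to_one(scores)
```

Here `pairing` is `[(0, 1), (1, 0)]`, for a total of 88 + 90 = 178. Giving master 0 to source 0 first would leave source 1 with a score of 40, for a total of only 132.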

The scoring mechanics

Composite score

Per-field scores are combined using the profile's weighting. Each profile assigns more weight to the fields that carry more identity signal for that entity type — names and identifiers are weighted heavily for people; domain and company name for organizations; SKU for products.

Weight normalization for missing fields

If a comparable field is missing from one or both records, its weight is redistributed proportionally to the remaining fields. Records with less data don't get artificially low scores.
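Proportional redistribution is equivalent to dividing by the sum of the weights that are actually present. A minimal sketch, where a missing field is represented as `None` and the weights are made up for the example:

```python
def composite_score(field_scores, weights):
    """Weighted average over the fields that are present (score is not None)."""
    present = {f: s for f, s in field_scores.items() if s is not None}
    total_weight = sum(weights[f] for f in present)
    if total_weight == 0:
        return 0.0  # nothing comparable at all
    return sum(s * weights[f] for f, s in present.items()) / total_weight


weights = {"name": 50, "address": 30, "zip": 20}  # illustrative weights
scores = {"name": 90, "address": None, "zip": 100}  # address is missing
score = composite_score(scores, weights)
```

With the address missing, the remaining weight is 70, so the score is (90 × 50 + 100 × 20) / 70, roughly 92.9. Counting the missing field as a zero would instead give 65, which is the artificially low score that redistribution avoids.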

Tie-breaking

If two candidates score identically, the tie is broken by row order in the master file: the lower master row ID wins.

The exception is one-to-one matching. When it is enabled, global assignment optimization may resolve the tie differently.
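For the default case (one-to-one disabled), the rule can be expressed as a single sort key. This is a sketch with a hypothetical candidate shape of `(master_row_id, score)`:

```python
def best_candidate(candidates):
    """Pick the highest score; on a tie, the lower master row ID wins."""
    # Negating the score sorts highest first; the row ID then breaks ties.
    return min(candidates, key=lambda c: (-c[1], c[0]))
```

Given candidates `(7, 88.0)`, `(3, 88.0)`, and `(5, 80.0)`, the function returns `(3, 88.0)`.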

Passes you might not think about

Pre-filter

Before expensive fuzzy comparison, a quick-reject pass eliminates obvious non-matches:

Last-name similarity below 80%: drop (true matches rarely differ that much on last name)
Gender mismatch on first names (Eric vs Erica): drop unless configured to ignore

This pass can reduce the scored pair count by 70–90% at almost no accuracy cost.
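A sketch of such a quick-reject check follows. It uses `difflib.SequenceMatcher` as a stand-in similarity measure and a two-entry name-to-gender table. Both are assumptions for illustration, as are the field names and parameters.

```python
from difflib import SequenceMatcher

# Tiny illustrative lookup; a real table would cover many more names.
NAME_GENDER = {"eric": "M", "erica": "F"}


def passes_prefilter(source, master, min_last_name_sim=0.80, check_gender=True):
    """Return False for pairs that can be dropped before full scoring."""
    similarity = SequenceMatcher(
        None, source["last_name"].lower(), master["last_name"].lower()
    ).ratio()
    if similarity < min_last_name_sim:
        return False

    if check_gender:
        g1 = NAME_GENDER.get(source["first_name"].lower())
        g2 = NAME_GENDER.get(master["first_name"].lower())
        # Only reject when both genders are known and they differ.
        if g1 and g2 and g1 != g2:
            return False
    return True
```

Pairs that pass still go through full candidate scoring; the pre-filter only removes pairs, it never confirms a match.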

What the engine doesn't do

Machine learning. The matching engine is rules-based: weights, thresholds, and comparison algorithms are fixed per profile. This makes results predictable and explainable, and every score can be reproduced from first principles.
Cross-row inference. The engine doesn't infer "record X must match Y because every other row matches their obvious counterpart". Each pair is scored independently.
Implicit duplicate collapse. If your source file has duplicates, the engine matches each one independently. It's up to you to dedupe the source first.

Why stages matter for tuning

When you look at a match result and wonder why, the method breakdown on the job detail page tells you:

Most matches came from exact_id? Your identifier column did the heavy lifting. Data is clean.
Most came from fuzzy? Data is messy or there's no strong identifier. Consider enriching with one.
Most came from phonetic? Lots of spelling variation. Normal for international data.

The method distribution is also what you tune against when adjusting profiles and thresholds.
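If you export results and want to compute the same breakdown yourself, a few lines are enough. The record shape below (a `status` and a `method` key per row) is an assumption about the export, not a documented format.

```python
from collections import Counter


def method_shares(results):
    """Percentage of matched rows attributed to each match method."""
    counts = Counter(r["method"] for r in results if r["status"] == "match")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {method: round(100 * n / total, 1) for method, n in counts.items()}
```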

Related pages

Match profiles: how weights are set per entity type
Confidence scores: how the composite score is used
Setting the confidence threshold: tuning the classification dials