ListMatchGenie

How matching works

A user-facing tour of the match engine — what each pass does, how candidates are scored, and how decisions are made — without requiring an algorithms background.

This page explains what happens between clicking Run match and seeing results. You don't need this to use the product, but understanding it helps you tune profiles, set thresholds, and interpret results when they surprise you.

The match engine runs in passes. Each pass has a specific job and runs in a specific order.

The stages, in order

Exact identifier match

If both files share a high-quality identifier column (email, account number, NPI, SSN, etc.), the engine first matches rows that agree on that column. These are matched immediately with a score of 100 and are the fastest, most reliable matches in the run.
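This pass behaves like a keyed join. A minimal Python sketch, with hypothetical field names and the simplifying assumption that each identifier appears at most once in the master:

```python
def exact_id_pass(source_rows, master_rows, id_field="email"):
    """Match source rows to master rows that share the same identifier value."""
    index = {}
    for row in master_rows:
        key = row.get(id_field)
        if key:  # skip blank identifiers
            index.setdefault(key, row)
    matches = []
    for row in source_rows:
        key = row.get(id_field)
        if key and key in index:
            matches.append((row, index[key], 100))  # identifier hits score 100
    return matches

source = [{"email": "a@x.com", "name": "Ann"}, {"email": "b@y.com", "name": "Bob"}]
master = [{"email": "a@x.com", "name": "Ann Lee"}]
result = exact_id_pass(source, master)
assert [(s["name"], m["name"], sc) for s, m, sc in result] == [("Ann", "Ann Lee", 100)]
```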

Deterministic match

For pairs not matched by identifier, the engine checks whether all comparable fields agree exactly (after cleansing). If first name, last name, address, and ZIP all match exactly, the pair is classified as deterministic with a score in the 95–100 range.
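The check itself is simple: every comparable field must agree exactly after cleansing. A sketch, using lowercasing and whitespace stripping as a stand-in for the engine's actual cleansing rules:

```python
def is_deterministic_match(src, mst, fields=("first", "last", "address", "zip")):
    """True if all comparable fields agree exactly after cleansing."""
    clean = lambda v: str(v).strip().lower()
    return all(clean(src[f]) == clean(mst[f]) for f in fields if f in src and f in mst)

src = {"first": "Ann", "last": "Lee", "address": "1 Main St", "zip": "02139"}
mst = {"first": "ann ", "last": "LEE", "address": "1 main st", "zip": "02139"}
assert is_deterministic_match(src, mst)
assert not is_deterministic_match(src, {**mst, "zip": "94110"})
```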

Blocking

Fuzzy comparison of every source row against every master row would be prohibitively slow on large files. To avoid this, the engine groups rows into blocks using a cheap discriminating key (ZIP code, phonetic code of last name, domain, or another signal based on the match profile).

Within a block, pairs are compared; across blocks, they aren't. This reduces the comparison space by orders of magnitude while preserving almost all legitimate matches.
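The grouping step can be sketched as a dictionary keyed by the blocking signal (ZIP code in this illustration; the real profile may use a different key):

```python
from collections import defaultdict

def block_by_key(rows, key_fn):
    """Group rows into blocks; only pairs inside the same block get compared."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[key_fn(row)].append(row)
    return blocks

rows = [
    {"last": "Smith", "zip": "02139"},
    {"last": "Smyth", "zip": "02139"},
    {"last": "Jones", "zip": "94110"},
]
blocks = block_by_key(rows, lambda r: r["zip"])
assert len(blocks["02139"]) == 2   # Smith and Smyth can still be compared
assert len(blocks["94110"]) == 1   # Jones is never compared against them
```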

Candidate scoring

Within each block, every pair is scored on each comparable field:

  • Name fields — fuzzy string similarity, boosted by phonetic and nickname matches
  • Address fields — token-based comparison with abbreviation normalization
  • ZIP — exact, or distance-based if ZIP radius matching is enabled
  • State / country — binary (match or no match)
  • Numeric fields (prices, dimensions) — relative-tolerance comparison

Per-field scores are combined using the profile's weights into a single composite score (0–100).
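The combination step is a weighted average of per-field scores. A sketch with made-up weights (actual weights come from the match profile):

```python
def composite_score(field_scores, weights):
    """Weighted average of per-field scores, each on a 0-100 scale."""
    total_weight = sum(weights[f] for f in field_scores)
    return sum(field_scores[f] * weights[f] for f in field_scores) / total_weight

weights = {"name": 0.5, "address": 0.3, "zip": 0.2}
score = composite_score({"name": 90, "address": 80, "zip": 100}, weights)
assert round(score, 1) == 89.0  # 90*0.5 + 80*0.3 + 100*0.2
```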

Classification

For each source row, the engine takes its best-scoring candidate and classifies the pair:

  • Score ≥ match threshold (default 70) → match
  • Score ≥ review threshold (default 55), below match threshold → review
  • Below review threshold → unmatched
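The classification rule above, expressed as a small function using the default thresholds:

```python
def classify(score, match_threshold=70, review_threshold=55):
    """Bucket a composite score into match / review / unmatched."""
    if score >= match_threshold:
        return "match"
    if score >= review_threshold:
        return "review"
    return "unmatched"

assert classify(83) == "match"
assert classify(60) == "review"
assert classify(40) == "unmatched"
```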

Assignment optimization (optional)

When one-to-one matching is enabled, the engine runs a final pass that finds the globally optimal 1:1 pairing between source and master — preventing the case where two source rows both claim the same master as their best match.
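The key idea is that the best global pairing can differ from each row's individual best match. The brute-force sketch below makes that visible on a tiny example; it is illustration only (the document doesn't name the engine's algorithm, and real implementations use efficient methods such as the Hungarian algorithm rather than enumerating permutations):

```python
from itertools import permutations

def optimal_assignment(score_matrix):
    """Find the 1:1 pairing of source rows to master rows maximizing total score."""
    n = len(score_matrix)
    best, best_total = None, -1
    for perm in permutations(range(n)):
        total = sum(score_matrix[i][perm[i]] for i in range(n))
        if total > best_total:
            best, best_total = perm, total
    return best

# Both source rows score highest against master 0, but the global optimum
# gives master 0 to source 0 and master 1 to source 1 (total 170 vs 125).
scores = [[90, 40],
          [85, 80]]
assert optimal_assignment(scores) == (0, 1)
```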

The scoring mechanics

Composite score

Per-field scores are combined using the profile's weighting. Each profile assigns more weight to the fields that carry more identity signal for that entity type — names and identifiers are weighted heavily for people; domain and company name for organizations; SKU for products.

Weight normalization for missing fields

If a comparable field is missing from one or both records, its weight is redistributed proportionally across the remaining fields, so records with less data don't receive artificially low scores.
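Proportional redistribution just means rescaling the surviving weights to sum to 1. A sketch:

```python
def effective_weights(weights, present_fields):
    """Rescale weights over the fields actually present so they sum to 1."""
    present = {f: w for f, w in weights.items() if f in present_fields}
    total = sum(present.values())
    return {f: w / total for f, w in present.items()}

w = effective_weights({"name": 0.5, "address": 0.3, "zip": 0.2}, {"name", "zip"})
# address's 0.3 is shared out proportionally: name gets 0.5/0.7, zip gets 0.2/0.7
assert round(w["name"], 4) == round(0.5 / 0.7, 4)
assert round(sum(w.values()), 6) == 1.0
```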

Tie-breaking

If two candidates score identically, the tie is broken by row order in the master file (the lower master row ID wins). When one-to-one matching is enabled, the global assignment optimization may resolve the tie differently.

Passes you might not think about

Pre-filter

Before expensive fuzzy comparison, a quick-reject pass eliminates obvious non-matches:

  • Last-name similarity below 80% — drop (genuine matches are rarely that divergent on last name)
  • Gender mismatch on names (Eric vs Erica) — drop unless configured to ignore

This pass can reduce the scored pair count by 70–90% at almost no accuracy cost.
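The last-name check can be sketched as follows, with `difflib.SequenceMatcher` standing in for the engine's actual similarity measure (which the document doesn't specify):

```python
from difflib import SequenceMatcher

def quick_reject(src_last, mst_last, min_similarity=0.8):
    """True if the pair should be dropped before expensive fuzzy scoring."""
    sim = SequenceMatcher(None, src_last.lower(), mst_last.lower()).ratio()
    return sim < min_similarity

assert quick_reject("Smith", "Jones")       # too dissimilar: skip fuzzy scoring
assert not quick_reject("Smith", "Smith")   # identical: keep the pair
```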

Phonetic indexing

If phonetic matching is enabled, each name is indexed by a phonetic code representing how it sounds. Names with matching phonetic codes can block together even if their spelling diverges significantly (Smith / Smyth code identically).
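The document doesn't name the phonetic algorithm; Soundex is a common choice and shows the idea. A simplified Soundex-style sketch (real implementations handle more edge cases):

```python
def soundex(name):
    """Simplified Soundex-style code: first letter plus up to three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w don't separate adjacent equal codes
            prev = code
    return (result + "000")[:4]

# Spelling diverges, sound doesn't: both names land in the same block.
assert soundex("Smith") == soundex("Smyth") == "S530"
```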

Nickname expansion

If nickname matching is enabled (default for Person), the engine maintains a canonical-form table (Bill → William, Jen → Jennifer, etc.) and accepts matches where source and master agree on the canonical form even if their original strings differ.
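The lookup reduces to mapping each name to its canonical form before comparing. A sketch with a tiny hypothetical table (the product's table is far larger):

```python
# Hypothetical canonical-form table; entries are illustrative only.
CANONICAL = {"bill": "william", "billy": "william", "will": "william",
             "jen": "jennifer", "jenny": "jennifer"}

def canonical(name):
    """Map a name to its canonical form; unknown names map to themselves."""
    n = name.lower()
    return CANONICAL.get(n, n)

def nickname_match(src_name, mst_name):
    return canonical(src_name) == canonical(mst_name)

assert nickname_match("Bill", "William")
assert nickname_match("Jenny", "Jen")       # both canonicalize to "jennifer"
assert not nickname_match("Bill", "Jennifer")
```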

What the engine does not do

  • Machine learning. The matching engine is rules-based. Weights, thresholds, and comparison algorithms are fixed per profile. This makes results predictable and explainable — every score can be reproduced from first principles.
  • Cross-row inference. The engine doesn't infer "record X must match Y because every other row matches their obvious counterpart". Each pair is scored independently.
  • Implicit duplicate collapse. If your source file has duplicates, the engine matches them independently — it's up to you to dedupe the source first.

Why stages matter for tuning

When you look at a match result and wonder why, the method breakdown on the job detail page tells you:

  • Most matches came from exact_id? Your identifier column did the heavy lifting. Data is clean.
  • Most came from fuzzy? Data is messy or there's no strong identifier. Consider enriching with one.
  • Most came from phonetic? Lots of spelling variation. Normal for international data.

The method distribution is also what you tune against when adjusting profiles and thresholds.