The three-stage pipeline

ListMatchGenie runs every file through three stages, always in the same order. Once you understand what each stage does and which signals the next one consumes, you can trace an accuracy problem to the stage that caused it, and you'll know which setting to adjust to improve results.

The three stages:

Cleanse — turn whatever you uploaded into clean, comparable data.
Match — find the best master record for every source row.
Insights — summarize what happened so you can act on it.

Every screen in the product maps to one of these stages. The match wizard steps through them in sequence; the job detail page reports on all three after a match completes.

Stage 1: Cleanse

Goal: remove every source of noise that would make matching harder than it needs to be.

Raw spreadsheets are messy. The same ZIP code can appear as 01841, 1841, and 01841-2100. The same company can be Acme, Inc., ACME INC, and acme incorporated. The same phone number can be (555) 123-4567, 5551234567, and +15551234567. If you try to match these as-is, most of your near-matches will fail on formatting differences — not on anything that actually matters.

The cleanse stage runs automatically on both your source and master files. It does three things: profile, standardize, and deduplicate.

Profile

The Genie inspects every column and records:

Detected type (email, phone, date, currency, identifier, free text, etc.)
Null rate and distinct value count
Distribution (most common values, outliers)
Anomalies (mixed casing, stray characters, formatting inconsistencies)

The column profile is what the match engine uses later to pick the right comparison method per column.
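To make the profile concrete, here is a minimal sketch of the kind of statistics it could hold for one column. The function name `profile_column` and the output keys are hypothetical and chosen for illustration; they are not ListMatchGenie's actual profile format.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: null rate, distinct count, top values (illustrative only)."""
    total = len(values)
    # Treat None and blank strings as nulls.
    non_null = [v.strip() for v in values if v is not None and v.strip()]
    counts = Counter(non_null)
    return {
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(counts),
        "most_common": counts.most_common(3),
        # Mixed casing: some value appears under more than one casing.
        "mixed_casing": len({v.lower() for v in non_null}) < len(counts),
    }

profile = profile_column(["Acme", "ACME", "", None, "Globex"])
```

In this sample, two of the five values are empty and "Acme" appears under two casings, which is the kind of anomaly a profile would surface before standardization runs.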

Standardize

Based on the profile, cleansing rules apply to each column. The defaults handle:

Whitespace — trim and collapse internal runs
Casing — uppercase/lowercase based on column type (emails → lowercase, names → title case)
Phone numbers — digits only, with country code normalized
Dates — ISO 8601 (YYYY-MM-DD) regardless of input format
ZIP codes — pad to 5 digits, strip ZIP+4 suffix unless kept intentionally
Identifiers — strip prefixes, leading zeros, and punctuation based on the ID type
Abbreviations — expand St to Street, Inc to Incorporated, etc. (configurable per column)
Accents — transliterate to ASCII for matching, preserve original for display (García ↔ Garcia)
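A few of the defaults above can be sketched in plain Python. This is an illustration under stated assumptions, not the product's rule engine: the function names are hypothetical, the phone rule assumes 10-digit North American numbers with country code 1, and the accent rule only strips combining marks rather than performing full transliteration.

```python
import re
import unicodedata

def clean_whitespace(text):
    """Trim the ends and collapse internal runs of whitespace."""
    return " ".join(text.split())

def clean_zip(text, keep_plus4=False):
    """Pad the base ZIP to 5 digits; drop the ZIP+4 suffix unless asked to keep it."""
    digits = re.sub(r"[^0-9-]", "", text)
    base, _, plus4 = digits.partition("-")
    base = base.zfill(5)
    return f"{base}-{plus4}" if keep_plus4 and plus4 else base

def clean_phone(text, default_country="1"):
    """Keep digits only; prepend the assumed country code to 10-digit numbers."""
    digits = re.sub(r"\D", "", text)
    if len(digits) == 10:
        digits = default_country + digits
    return digits

def ascii_fold(text):
    """Strip combining accent marks for matching (the original stays available for display)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With these rules, the three ZIP spellings from the earlier example (01841, 1841, and 01841-2100) all reduce to 01841, and the three phone spellings all reduce to 15551234567.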

Cleansing is reversible

The original file is never overwritten. Cleansed output is stored alongside raw input, and exports always include your original column values. If cleansing produces something you don't expect, you can override or disable specific rules without re-uploading.

Deduplicate

Exact duplicates and near-duplicates within each file are flagged and (optionally) collapsed before matching runs. The dedup report shows you what was removed so you can spot-check.

Why this matters: a single duplicated record in your master file causes every source row that matches it to be classified as "review", because the Genie can't tell which of the two master records is the intended one. Deduplicating the master before matching prevents that entire class of ambiguity.
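As a rough sketch of how duplicates within a file can be flagged, the example below groups rows by a normalized key. The function name `find_duplicate_groups` and the sample rows are hypothetical, and the sketch covers only duplicates that become exact after normalization; near-duplicate detection needs fuzzy comparison and is not shown.

```python
from collections import defaultdict

def find_duplicate_groups(rows, key_columns):
    """Return lists of row indexes that share the same normalized key."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        # Normalize case and whitespace so formatting differences do not hide duplicates.
        key = tuple(" ".join(row[col].lower().split()) for col in key_columns)
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]

master = [
    {"name": "Acme Inc", "zip": "01841"},
    {"name": "ACME  INC", "zip": "01841"},
    {"name": "Globex", "zip": "10001"},
]
duplicates = find_duplicate_groups(master, ["name", "zip"])
```

Here rows 0 and 1 form one duplicate group. Left in the master, they would give any matching source row two equally good candidates.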

Stage 2: Match

Goal: for every source row, find the master row that most likely represents the same real-world entity — or decide that no match exists.

This is the stage most tools call "matching" and leave as a black box. ListMatchGenie makes it transparent: every step has a name, every pair has a score, and every decision is explained.

The match engine runs multiple passes:

Exact match on identifiers — if both files share a unique ID column (email, SSN, account number, NPI), candidates that agree on that ID are matched first. These never go to the review queue.
Blocking — the remaining rows are grouped into "blocks" by a cheap key (ZIP code, phonetic name code) so the engine doesn't have to compare every source row against every master row.
Candidate scoring — within each block, every pair is scored on each comparable column. Scores are weighted per the match profile you selected.
Classification — the best-scoring candidate per source row is assigned a status based on how the score compares to your confidence threshold.
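The four passes can be sketched in a few lines. This is a simplified illustration, not the actual engine: it assumes email as the shared identifier, ZIP code as the blocking key, a single name column scored with Python's `difflib` instead of a weighted multi-column profile, and a hypothetical `review_floor` below the confidence threshold to separate "review" from "unmatched".

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score for two strings."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_rows(source, master, threshold=85, review_floor=60):
    # Pass 1 lookup: exact match on a shared identifier (assumed here: email).
    by_email = {m["email"]: i for i, m in enumerate(master) if m["email"]}
    # Pass 2 lookup: block master rows by a cheap key (assumed here: ZIP code).
    blocks = defaultdict(list)
    for i, m in enumerate(master):
        blocks[m["zip"]].append(i)

    results = []
    for row in source:
        if row["email"] and row["email"] in by_email:
            results.append(("match", 100.0, by_email[row["email"]]))
            continue
        # Pass 3: score only the candidates that share the row's block.
        best_score, best_id = 0.0, None
        for i in blocks.get(row["zip"], []):
            score = similarity(row["name"], master[i]["name"])
            if score > best_score:
                best_score, best_id = score, i
        # Pass 4: classify the best candidate against the thresholds.
        if best_score >= threshold:
            status = "match"
        elif best_score >= review_floor:
            status = "review"
        else:
            status, best_id = "unmatched", None
        results.append((status, round(best_score, 1), best_id))
    return results

master = [
    {"email": "ann@acme.com", "name": "Ann Lee", "zip": "01841"},
    {"email": "", "name": "Globex Corporation", "zip": "10001"},
]
source = [
    {"email": "ann@acme.com", "name": "Anne Leigh", "zip": "01841"},
    {"email": "", "name": "Globex Corporation", "zip": "10001"},
    {"email": "", "name": "Initech", "zip": "73301"},
]
results = match_rows(source, master)
```

The first source row matches on email despite the different name spelling, the second matches by name inside its ZIP block, and the third has no candidates in its block, so it is left unmatched.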

The output is your source file with three new columns added to every row:

_lmg_match_status — match / review / unmatched
_lmg_match_score — the score of the best candidate (0–100)
_lmg_master_row_id — the row ID in the master file this matched to

Every column from the matched master row is also appended to the source row, so you never have to VLOOKUP your own data.
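If you post-process an export yourself, the three status columns make it easy to split rows by outcome. The sketch below uses a small made-up CSV; only the `_lmg_*` column names come from the description above, and the assumption that unmatched rows carry an empty `_lmg_master_row_id` is mine.

```python
import csv
import io
from collections import Counter

# Illustrative export; the rows and the blank master ID for unmatched rows are assumptions.
export_csv = """name,_lmg_match_status,_lmg_match_score,_lmg_master_row_id
Acme Inc,match,97,12
Globex,review,74,48
Initech,unmatched,31,
"""

rows = list(csv.DictReader(io.StringIO(export_csv)))
# Count rows per status and pull out the rows that need a human decision.
status_counts = Counter(row["_lmg_match_status"] for row in rows)
review_queue = [row["name"] for row in rows if row["_lmg_match_status"] == "review"]
```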

Why multiple passes?

A single-pass fuzzy match on every column would take minutes on a 100,000-row file and still miss obvious things — like two records with the same email but different name spellings. The pass order catches the cheap certainties first, then progressively relaxes to handle the harder cases. See How matching works.

Stage 3: Insights

Goal: convert the match results into something you can act on, share, and defend.

Raw match counts are useful. "3,147 matches, 842 review, 823 unmatched" is a fact. But it doesn't tell you:

Why the unmatched rows didn't match
Where the review cases cluster (geography, data quality, entity type)
What you should do next (fix the master, re-upload with a tighter profile, accept the review queue as-is)

The insights stage produces answers to those questions in three forms:

The Genie's Take

A short narrative summary that always appears at the top of the job detail page for a completed match. The Genie writes it from aggregate statistics, not raw rows, so it's safe to share without exposing PII.

Structured reports

A full analytical document with executive summary, match method breakdown, per-dimension pivots (match rate by state, by company, by data quality), sample rows, and an inline follow-up Q&A with the Genie. Generate one from any completed match on the reports page.
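A per-dimension pivot is simply the match rate computed within each group. The sketch below shows the idea for a hypothetical `state` column; the function name `match_rate_by` and the sample rows are illustrative and do not describe how the report is generated internally.

```python
from collections import defaultdict

def match_rate_by(rows, dimension):
    """Return {dimension value: percent of rows whose status is 'match'}."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += 1
        if row["_lmg_match_status"] == "match":
            matched[row[dimension]] += 1
    return {key: round(100 * matched[key] / totals[key], 1) for key in totals}

rows = [
    {"state": "MA", "_lmg_match_status": "match"},
    {"state": "MA", "_lmg_match_status": "review"},
    {"state": "NY", "_lmg_match_status": "match"},
    {"state": "NY", "_lmg_match_status": "match"},
]
by_state = match_rate_by(rows, "state")
```

A low rate in one group (MA in this sample) points you at where to look first, for example a data quality problem specific to that state.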

Shareable links

Generate a tokenized link to any report, with optional password protection (Pro+). Share the analysis without giving someone app access or exposing the underlying files. See Sharing reports.

How the stages feed each other

The three stages are not independent steps. Each one shapes what the next one can do well:

Good cleansing makes matching faster (smaller blocks) and more accurate (fewer false negatives from formatting).
Transparent matching (per-pass, per-pair scores) gives the insights stage enough signal to write a narrative instead of a guess.
Structured insights tell you what to change next time — a tighter threshold, a different profile, a re-upload with an added column.

This is why the wizard walks you through them in order and why the job detail page reports on all three. Treating any one stage as a black box costs you accuracy downstream.

Related articles

Cleansing report — every rule the cleanse stage applies
How matching works — the match engine in more detail
Reading a report — how to interpret the insights stage