ListMatchGenie runs every file through three stages, always in the same order. If you understand what each stage does and what signals the next one consumes, you'll diagnose accuracy problems in seconds instead of hours — and you'll know which knob to turn to improve results.
The three stages:
- Cleanse — turn whatever you uploaded into clean, comparable data.
- Match — find the best master record for every source row.
- Insights — summarize what happened so you can act on it.
Every screen in the product maps to one of these stages. The match wizard steps through them in sequence; the job detail page reports on all three after a match completes.
Stage 1: Cleanse
Goal: remove every source of noise that would make matching harder than it needs to be.
Raw spreadsheets are messy. The same ZIP code can appear as 01841, 1841, and 01841-2100. The same company can be Acme, Inc., ACME INC, and acme incorporated. The same phone number can be (555) 123-4567, 5551234567, and +15551234567. If you try to match these as-is, most of your near-matches will fail on formatting differences — not on anything that actually matters.
The cleanse stage runs automatically on both your source and master files. It does three things:
Profile
The Genie inspects every column and records:
- Detected type (email, phone, date, currency, identifier, free text, etc.)
- Null rate and distinct value count
- Distribution (most common values, outliers)
- Anomalies (mixed casing, stray characters, formatting inconsistencies)
The column profile is what the match engine uses later to pick the right comparison method per column.
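As a rough illustration of what a column profile might contain, here is a minimal sketch in Python. It is not ListMatchGenie's implementation: the function name `profile_column`, the email regex, and the two-type detection ("email" or "free text") are assumptions made for the example.

```python
from collections import Counter
import re

# Deliberately loose email pattern, used only for this illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile_column(values):
    """Summarize one column: type guess, null rate, distinct count, top values."""
    non_null = [v.strip() for v in values if v is not None and v.strip() != ""]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0
    counts = Counter(non_null)
    # Hypothetical type detection: "email" only if every non-null value looks like one.
    is_email = bool(non_null) and all(EMAIL_RE.match(v) for v in non_null)
    # Mixed-casing anomaly: the same value appears under more than one casing.
    folded = Counter(v.lower() for v in counts)
    return {
        "detected_type": "email" if is_email else "free text",
        "null_rate": round(null_rate, 2),
        "distinct": len(counts),
        "most_common": counts.most_common(2),
        "mixed_casing": any(n > 1 for n in folded.values()),
    }

profile = profile_column(["a@x.com", "A@X.com", "b@y.org", "", None, "a@x.com"])
print(profile)
```

In this sample, two of the six values are empty, and `a@x.com` appears under two casings, so the profile flags a casing anomaly that a lowercase rule would resolve.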
Standardize
Based on the profile, cleansing rules apply to each column. The defaults handle:
- Whitespace — trim and collapse internal runs
- Casing — uppercase/lowercase based on column type (emails → lowercase, names → title case)
- Phone numbers — digits only, with country code normalized
- Dates — ISO 8601 (YYYY-MM-DD) regardless of input format
- ZIP codes — pad to 5 digits, strip ZIP+4 suffix unless kept intentionally
- Identifiers — strip prefixes, leading zeros, and punctuation based on the ID type
- Abbreviations — expand St to Street, Inc to Incorporated, etc. (configurable per column)
- Accents — transliterate to ASCII for matching, preserve original for display (García ↔ Garcia)
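A few of these defaults can be sketched with the Python standard library. This is an illustration under stated assumptions, not the product's actual rules: the function names are hypothetical, and the phone rule assumes a default country code of 1 for 10-digit numbers.

```python
import re
import unicodedata

def clean_whitespace(s):
    """Trim the ends and collapse internal runs of whitespace."""
    return " ".join(s.split())

def clean_zip(s):
    """Drop a hyphenated ZIP+4 suffix, keep digits, left-pad to 5."""
    base = s.strip().split("-")[0]
    return re.sub(r"\D", "", base).zfill(5)

def clean_phone(s, default_country="1"):
    """Digits only; prepend the assumed country code to 10-digit numbers."""
    digits = re.sub(r"\D", "", s)
    if len(digits) == 10:
        digits = default_country + digits
    return digits

def ascii_fold(s):
    """Strip combining accents for matching; keep the original for display."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

zips = {clean_zip(z) for z in ["01841", "1841", "01841-2100"]}
phones = {clean_phone(p) for p in ["(555) 123-4567", "5551234567", "+15551234567"]}
print(zips, phones, ascii_fold("García"), clean_whitespace("  Acme   Inc "))
```

After standardizing, the three ZIP spellings and the three phone spellings from the earlier examples each collapse to a single value, which is what lets later comparisons focus on real differences.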
Cleansing is reversible
The original file is never overwritten. Cleansed output is stored alongside raw input, and exports always include your original column values. If cleansing produces something you don't expect, you can override or disable specific rules without re-uploading.
Deduplicate
Exact duplicates and near-duplicates within each file are flagged and (optionally) collapsed before matching runs. The dedup report shows you what was removed so you can spot-check.
Why this matters: a single duplicate in your master file will cause every matching source row to classify as "review" because the Genie can't tell which of the two master records is the intended one. Deduping the master before matching prevents that entire class of ambiguity.
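To make the idea concrete, the sketch below groups rows that become identical once casing and whitespace are normalized. The helper name `find_duplicate_groups` and the sample fields are hypothetical, and true near-duplicate detection would need a fuzzy comparison that is not shown here.

```python
from collections import defaultdict

def find_duplicate_groups(rows, key_fields):
    """Return lists of row indexes that share the same normalized key."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        # Normalize casing and whitespace so formatting differences don't hide duplicates.
        key = tuple(" ".join(row[f].lower().split()) for f in key_fields)
        groups[key].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

master = [
    {"name": "Acme Inc", "zip": "01841"},
    {"name": "ACME  INC", "zip": "01841"},
    {"name": "Globex", "zip": "10001"},
]
dupes = find_duplicate_groups(master, ["name", "zip"])
print(dupes)
```

Here rows 0 and 1 form one duplicate group. Left in the master, they would compete for every source row that matches either of them.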
Stage 2: Match
Goal: for every source row, find the master row that most likely represents the same real-world entity — or decide that no match exists.
This is the stage most tools call "matching" and leave as a black box. ListMatchGenie makes it transparent: every step has a name, every pair has a score, and every decision is explained.
The match engine runs multiple passes:
- Exact match on identifiers — if both files share a unique ID column (email, SSN, account number, NPI), candidates that agree on that ID are matched first. These never go to the review queue.
- Blocking — the remaining rows are grouped into "blocks" by a cheap key (ZIP code, phonetic name code) so the engine doesn't have to compare every source row against every master row.
- Candidate scoring — within each block, every pair is scored on each comparable column. Scores are weighted per the match profile you selected.
- Classification — the best-scoring candidate per source row is assigned a status based on how the score compares to your confidence threshold.
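The four passes can be sketched end to end in a few dozen lines. This is a simplified illustration, not the engine itself: the weights, the two cut-offs (`MATCH_AT`, `REVIEW_AT`), the field names, the score of 100 for identifier matches, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.7, "city": 0.3}  # hypothetical match profile
MATCH_AT, REVIEW_AT = 90, 70          # hypothetical cut-offs on a 0-100 scale

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_pair(src, mst):
    """Weighted per-column similarity, scaled to 0-100."""
    return round(100 * sum(w * similarity(src[f], mst[f]) for f, w in WEIGHTS.items()))

def match(source, master):
    by_email = {m["email"]: i for i, m in enumerate(master) if m["email"]}
    blocks = defaultdict(list)
    for i, m in enumerate(master):
        blocks[m["zip"]].append(i)  # cheap blocking key: ZIP code
    results = []
    for src in source:
        # Pass 1: exact match on a shared identifier skips the review queue.
        if src["email"] and src["email"] in by_email:
            results.append(("match", 100, by_email[src["email"]]))
            continue
        # Pass 2: only master rows in the same block are candidates.
        candidates = blocks.get(src["zip"], [])
        # Pass 3: score every candidate pair in the block.
        scored = [(score_pair(src, master[i]), i) for i in candidates]
        if not scored:
            results.append(("unmatched", 0, None))
            continue
        best_score, best_id = max(scored)
        # Pass 4: classify the best candidate against the cut-offs.
        if best_score >= MATCH_AT:
            results.append(("match", best_score, best_id))
        elif best_score >= REVIEW_AT:
            results.append(("review", best_score, best_id))
        else:
            results.append(("unmatched", best_score, None))
    return results

master = [
    {"email": "ann@acme.com", "name": "Ann Lee", "city": "Lawrence", "zip": "01841"},
    {"email": "", "name": "Robert Smith", "city": "Boston", "zip": "02108"},
]
source = [
    {"email": "ann@acme.com", "name": "Anne Leigh", "city": "Lawrence", "zip": "01841"},
    {"email": "", "name": "Robert Smith", "city": "Boston", "zip": "02108"},
    {"email": "", "name": "Zed Q", "city": "Nowhere", "zip": "99999"},
]
results = match(source, master)
print(results)
```

Each result tuple holds a status, a score, and a master row index. The first source row matches on email despite the different name spelling, the second matches through blocking and scoring, and the third has no candidates in its block.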
The output is your source file with three new columns per row:
- _lmg_match_status — match / review / unmatched
- _lmg_match_score — the score of the best candidate (0–100)
- _lmg_master_row_id — the row ID in the master file this matched to
Plus every column from the matched master row, appended to the source. You never have to VLOOKUP your own data.
Why multiple passes?
A single-pass fuzzy match on every column would take minutes on a 100,000-row file and still miss obvious things — like two records with the same email but different name spellings. The pass order catches the cheap certainties first, then progressively relaxes to handle the harder cases. See How matching works.
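A quick back-of-the-envelope calculation shows why blocking matters. The figures below are assumptions chosen for round numbers (a master file the same size as the source, and 1,000 evenly sized blocks); real block sizes are uneven.

```python
source_rows = 100_000
master_rows = 100_000  # assumption: master is the same size as the source
blocks = 1_000         # assumption: rows spread evenly across blocks

# Without blocking, every source row is compared with every master row.
all_pairs = source_rows * master_rows

# With blocking, comparisons happen only inside each block.
blocked_pairs = blocks * (source_rows // blocks) * (master_rows // blocks)

reduction = all_pairs // blocked_pairs
print(all_pairs, blocked_pairs, reduction)
```

Under these assumptions, blocking cuts the work from 10 billion pair comparisons to 10 million, a thousandfold reduction, before any fuzzy scoring runs.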
Stage 3: Insights
Goal: convert the match results into something you can act on, share, and defend.
Raw match counts are useful. "3,147 matches, 842 review, 823 unmatched" is a fact. But it doesn't tell you:
- Why the unmatched rows didn't match
- Where the review cases cluster (geography, data quality, entity type)
- What you should do next (fix the master, re-upload with a tighter profile, accept the review queue as-is)
The insights stage produces answers to those questions in three forms:
The Genie's Take
A short narrative summary on the job detail page. Always at the top of a completed match. Written by the Genie from aggregate statistics — not raw rows — so it's safe to share without exposing PII.
Structured reports
A full analytical document with executive summary, match method breakdown, per-dimension pivots (match rate by state, by company, by data quality), sample rows, and an inline follow-up Q&A with the Genie. Generate one from any completed match on the reports page.
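A per-dimension pivot such as match rate by state can be approximated from the output columns alone. The sketch below is an illustration, not the report generator: the helper `match_rate_by` and the `state` column are hypothetical, while `_lmg_match_status` is the status column described under Stage 2.

```python
from collections import Counter

def match_rate_by(rows, dimension):
    """Percentage of rows with status 'match', grouped by one column."""
    totals, matched = Counter(), Counter()
    for row in rows:
        totals[row[dimension]] += 1
        if row["_lmg_match_status"] == "match":
            matched[row[dimension]] += 1
    return {k: round(100 * matched[k] / totals[k], 1) for k in totals}

rows = [
    {"state": "MA", "_lmg_match_status": "match"},
    {"state": "MA", "_lmg_match_status": "review"},
    {"state": "NY", "_lmg_match_status": "match"},
    {"state": "NY", "_lmg_match_status": "match"},
    {"state": "NY", "_lmg_match_status": "unmatched"},
]
rates = match_rate_by(rows, "state")
print(rates)
```

Because the pivot holds only counts and rates per group, it carries no row-level values, which is the same property that makes aggregate summaries safe to share.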
Shareable links
Generate a tokenized link to any report, with optional password protection (Pro+). Share the analysis without giving someone app access or exposing the underlying files. See Sharing reports.
How the stages feed each other
The three stages are not independent steps. Each one shapes what the next one can do well:
- Good cleansing makes matching faster (smaller blocks) and more accurate (fewer false negatives from formatting).
- Transparent matching (per-pass, per-pair scores) gives the insights stage enough signal to write a narrative instead of a guess.
- Structured insights tell you what to change next time — a tighter threshold, a different profile, a re-upload with an added column.
This is why the wizard walks you through them in order and why the job detail page reports on all three. Treating any one stage as a black box costs you accuracy downstream.
Related reading
- Cleansing report — every rule the cleanse stage applies
- How matching works — the match engine in more detail
- Reading a report — how to interpret the insights stage
