ListMatchGenie

The dedup report

Before matching runs, the Genie finds duplicates inside each file and reports on them. Use the dedup report to understand — and optionally fix — the duplicate problems in your source data.

The dedup report is generated alongside the cleansing report during the cleanse stage. While cleansing fixes per-cell quality issues, the dedup report focuses on one specific kind of problem: rows that are duplicates of other rows in the same file.

Duplicates matter because they poison matching. A single duplicate in your master file means every source row that matches the duplicated record ends up in the review queue, because the engine can't tell which of the two master rows is the intended match.

What counts as a duplicate

The dedup report classifies duplicates into three tiers:

Exact duplicates

Every cell in every column is identical to another row. These are almost always safe to remove — they usually come from copy-paste errors or repeated imports. The Genie removes them automatically by default and reports the count.

Near-exact duplicates

All "identity" columns (those profiled as name, email, phone, identifier) are identical, but supplementary columns (notes, tags, timestamps) differ. These usually represent the same entity recorded at different times or through different channels. By default the Genie keeps one row and merges supplementary columns from the others — the specifics are shown in the report.
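As a rough illustration of the default merge, the sketch below keeps the first row of a near-exact group and fills its blank supplementary fields from the later rows. The function name `merge_near_exact` and the "survivor's value wins, blanks are filled" policy are assumptions made for this example; the Genie's actual merge rules are whatever the report shows.

```python
def merge_near_exact(rows, identity_columns):
    """Merge a group of rows whose identity columns are identical.

    Assumed policy (not confirmed by the product docs): the first row
    survives, and only its blank supplementary fields are filled from
    later rows.
    """
    survivor = dict(rows[0])
    for other in rows[1:]:
        for column, value in other.items():
            if column in identity_columns:
                continue  # identity values are identical by definition
            is_blank = survivor.get(column) in (None, "")
            if is_blank and value not in (None, ""):
                survivor[column] = value
    return survivor


group = [
    {"email": "jane@acme.com", "notes": "", "tags": "vip"},
    {"email": "jane@acme.com", "notes": "called in March", "tags": "lead"},
]
merged = merge_near_exact(group, identity_columns={"email"})
```

Here `merged` keeps the survivor's `tags` value and picks up `notes` from the second row, which is the kind of detail the report lets you verify.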

Fuzzy duplicates

Identity columns are similar but not identical — e.g. Jane Smith, jane@acme.com and Janet Smith, jane@acme.com. These are flagged but not removed automatically, because merging them is a judgment call. You decide case-by-case.
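The three tiers can be sketched as a small classifier over a pair of rows. This is a minimal sketch, not the Genie's implementation: `classify_pair` and `similarity` are hypothetical names, Python's `difflib.SequenceMatcher` stands in for whatever similarity measure the engine uses, and averaging the per-column scores is an assumption. The threshold of 80 mirrors the documented default.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(row_a, row_b, identity_columns, fuzzy_threshold=80.0):
    """Return 'exact', 'near-exact', 'fuzzy', or 'distinct' for two rows."""
    # Exact: every cell in every column is identical.
    if row_a == row_b:
        return "exact"
    # Near-exact: identity columns identical, supplementary columns differ.
    if all(row_a.get(c) == row_b.get(c) for c in identity_columns):
        return "near-exact"
    # Fuzzy: identity columns similar but not identical.
    # Assumption: the pair score is the mean of the per-column scores.
    scores = [
        similarity(str(row_a.get(c, "")), str(row_b.get(c, "")))
        for c in identity_columns
    ]
    if sum(scores) / len(scores) >= fuzzy_threshold:
        return "fuzzy"
    return "distinct"


jane = {"name": "Jane Smith", "email": "jane@acme.com", "notes": "web form"}
janet = {**jane, "name": "Janet Smith"}
tier = classify_pair(jane, janet, identity_columns=("name", "email"))
```

With the example rows from the text, the email matches exactly and the names differ by one letter, so the pair lands in the fuzzy tier rather than being merged.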

The report structure

Every dedup report has four parts:

Summary

A one-paragraph narrative from the Genie. Example:

"The Genie found 23 exact duplicates, 47 near-exact duplicates, and 61 fuzzy duplicates across your 4,812-row file. Exact and near-exact duplicates were removed automatically; fuzzy duplicates are listed below for your review. Post-dedup row count: 4,742."

Counts by tier

A table showing how many rows fell into each tier and what action was taken:

Tier       | Detected | Action  | Remaining
Exact      | 23       | Removed | 0
Near-exact | 47       | Merged  | 0 (47 survivors with merged data)
Fuzzy      | 61       | Flagged | 61 (pairs shown below)
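The post-dedup row count in the example summary can be checked against these counts. The sketch below assumes each detected exact or near-exact duplicate removes one row from the file and that flagged fuzzy duplicates remove none until you act on them; `post_dedup_row_count` is a hypothetical helper, not part of the product.

```python
def post_dedup_row_count(total_rows, exact_removed, near_exact_merged):
    """Rows left after automatic dedup; fuzzy pairs are only flagged."""
    return total_rows - exact_removed - near_exact_merged


remaining = post_dedup_row_count(
    total_rows=4812, exact_removed=23, near_exact_merged=47
)
# 4812 - 23 - 47 = 4742, matching the example summary
```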

Fuzzy duplicate pairs

A list of flagged pairs, each showing:

  • Both rows side-by-side with differing fields highlighted
  • A similarity score (0–100) — higher means more similar
  • Which fields matched exactly and which fuzzy-matched
  • Three actions: merge, keep both, delete one

Decisions here persist through the match job and are captured in the _lmg_ columns for audit.

Impact on matching

A callout describing what changes as a result of dedup: how many rows remain to be matched, how many unique identities are represented, and whether any cluster of duplicates would have caused match ambiguity.

Why the Genie handles the three tiers differently

Exact duplicates are objectively noise — there is no information loss in removing them. Near-exact duplicates carry a judgment ("these are the same entity, merge the metadata") that's almost always safe, so the Genie defaults to yes but shows you exactly what merged. Fuzzy duplicates carry real risk — two similar rows might be a typo or might be two different real people — so the Genie refuses to make that call unilaterally.

Fuzzy duplicates in the master file are high-stakes

If your master file has fuzzy duplicates, the matching engine will produce review-queue cases for every source row that might match either one. Resolving fuzzy master duplicates before running a big match dramatically cuts review workload.

Contact dedupe mode

When you run a contact dedupe match profile, the dedup report is the primary output. Instead of running a match against a second file, the engine runs the dedup-detection logic at every confidence threshold you've configured and exports clusters of probable duplicates with their scores.

In contact dedupe mode:

  • Every row gets a _lmg_cluster_id — rows sharing an ID are probable duplicates of each other.
  • The _lmg_match_score for each pair is the similarity score.
  • You can merge clusters in bulk from the review queue, or export the clusters and merge in your source system.

See Deduplicate a customer list for a full walkthrough.

Where to find the report

  • Live — during the cleanse step of the match wizard, alongside the cleansing report
  • Persisted — on the job detail page under "Dedup"
  • Exported — XLSX exports include a dedicated "Duplicates" sheet

Controlling dedup behavior

Per-file settings let you override defaults:

Remove exact duplicates (boolean, default: true)

When false, exact duplicates are flagged but not removed.

Merge near-exact duplicates (boolean, default: true)

When false, near-exact duplicates are flagged for review like fuzzy duplicates.

Fuzzy threshold (0–100, default: 80)

Minimum similarity score for flagging a pair as a fuzzy duplicate. Lower it to catch more candidates; raise it to flag only obvious cases.

Identity columns (column selector)

Override which columns count as "identity" for near-exact detection. By default the Genie picks columns profiled as name, email, phone, or identifier — override if your file has a composite key not captured automatically.
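The four settings and their defaults can be summarised as a small data structure. The class and field names below are hypothetical and chosen for illustration; they are not a documented API, and only the defaults and the 0–100 range come from the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DedupSettings:
    """Per-file dedup overrides (illustrative names, documented defaults)."""

    remove_exact_duplicates: bool = True
    merge_near_exact_duplicates: bool = True
    fuzzy_threshold: int = 80
    # None means "use the columns profiled as name, email, phone, identifier".
    identity_columns: Optional[Tuple[str, ...]] = None

    def __post_init__(self):
        # The threshold is a similarity score, so it must stay within 0-100.
        if not 0 <= self.fuzzy_threshold <= 100:
            raise ValueError("fuzzy_threshold must be between 0 and 100")


defaults = DedupSettings()
# Stricter fuzzy flagging plus a composite key as the identity columns.
strict = DedupSettings(
    fuzzy_threshold=90,
    identity_columns=("account_id", "region"),
)
```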