ListMatchGenie

Encoding and characters

How ListMatchGenie handles international character sets, encoding detection, accent normalization, and transliteration for matching.

Data from international systems is messy. Customer names may arrive in Latin-1, JIS, Big5, or Windows-1252 — sometimes in the same file. Accented characters display correctly in the source system and become mojibake in the export. Matching fails because García doesn't equal Garcia by string comparison even though they clearly should match.

ListMatchGenie handles this by (1) normalizing everything to UTF-8 at upload, and (2) keeping the original and the normalized value side-by-side — original for display, normalized for matching.

Encoding detection

Every upload goes through encoding detection:

  1. BOM check. If the file starts with a byte-order mark (UTF-8, UTF-16 LE/BE), the encoding is read directly from it.
  2. Content sniffing. If no BOM, statistical encoding detection analyzes the byte distribution and guesses. Accuracy is high on files with ≥1000 bytes of text.
  3. Declared override. If you know the encoding, you can override detection on the upload screen.

The detected encoding is reported in the cleansing report. If detection was uncertain, a warning appears; you can re-upload with an explicit encoding if matching produces garbled results.
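The detection steps above can be sketched roughly as follows. This is an illustration, not ListMatchGenie's implementation: the function name `detect_encoding` and the labels it returns are hypothetical, the strict UTF-8 check with a Windows-1252 fallback is only a simple stand-in for real statistical detection, and giving the declared override precedence over the other steps is an assumption.

```python
import codecs
from typing import Optional, Tuple

# BOMs named in the text. Python's "utf-8-sig" and "utf-16" codecs
# consume the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, how_it_was_chosen). Hypothetical helper."""
    # Declared override: assumed here to win over every other step.
    if declared is not None:
        codecs.lookup(declared)  # raises LookupError for an unknown name
        return declared, "declared"
    # BOM check: the encoding is read directly from the first bytes.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, "bom"
    # Stand-in for statistical sniffing: strict UTF-8, else Windows-1252.
    try:
        data.decode("utf-8")
        return "utf-8", "sniffed"
    except UnicodeDecodeError:
        return "cp1252", "fallback"
```

In this sketch, a result labelled "fallback" plays the role of an uncertain detection, the case in which a warning would be worth surfacing.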

Supported source encodings

  • UTF-8 (with or without BOM) — preferred
  • UTF-16 LE/BE (with BOM)
  • Latin-1 (ISO-8859-1)
  • Windows-1252 (Latin-1 superset — most common for exported Excel files)
  • ISO-8859-2 (Central European — covers Polish diacritics)
  • ISO-8859-15 (Western European with euro sign)

Other encodings can be detected at upload (including Shift-JIS, GB2312, Big5, Windows-1251) and converted to UTF-8 so your files load without errors — but native matching for CJK and Cyrillic scripts is on the roadmap rather than in today's release. See Handling international names for what's supported for matching.

Conversion to UTF-8

All detected content is converted to UTF-8 at upload. Unrecognized byte sequences are replaced with the Unicode replacement character (U+FFFD, displayed as �). Rows containing replacement characters are flagged in the cleansing report so you can:

  • Re-upload with an explicit encoding if detection was wrong
  • Spot-check those rows for lost information
  • Drop them if they're a small minority and not worth the effort
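As a minimal sketch of the conversion and flagging described above: Python's `errors="replace"` handler substitutes U+FFFD for undecodable bytes, and any row containing that character can then be flagged. The function name `to_utf8_rows` is hypothetical, and decoding row by row is a simplification for illustration.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character


def to_utf8_rows(raw_rows, encoding):
    """Decode each raw row; return (decoded_rows, flagged_row_indexes)."""
    decoded = []
    flagged = []
    for index, raw in enumerate(raw_rows):
        # Undecodable byte sequences become U+FFFD instead of raising.
        text = raw.decode(encoding, errors="replace")
        decoded.append(text)
        if REPLACEMENT in text:
            flagged.append(index)
    return decoded, flagged


rows = ["García".encode("utf-8"), "Müller".encode("cp1252")]
decoded, flagged = to_utf8_rows(rows, "utf-8")
```

Here the second row was written as Windows-1252 but decoded as UTF-8, so it is the one that ends up flagged.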

Accent handling

Accented characters (diacritics) present a matching challenge: Müller, Mueller, and Muller all represent the same name under different transliteration conventions. We need to match them, but we also need to display them correctly.

The Genie handles this with two internal representations per value:

Display column

The original character-preserving value: Müller, García, Åberg. This is what you see in the app, what exports contain, and what shares display. Nothing is lost.

Match column

An ASCII-transliterated value: Mueller, Garcia, Aberg. The match engine compares on this so Müller ↔ Mueller ↔ Muller all align.

Transliteration is country-sensitive. ListMatchGenie has validated fold tables for 20 supported regions — see Supported regions for the complete reference. A few examples:

  • German / Austrian / Swiss German: ä → ae, ö → oe, ü → ue, ß → ss
  • Spanish / Mexican: ñ → n, á/é/í/ó/ú → a/e/i/o/u
  • French / Québécois: é/è/ê/ë → e, ç → c, à/â → a
  • Portuguese / Brazilian: ã/õ → a/o, ç → c, accent folding
  • Swedish / Norwegian / Danish: region-specific å/æ/ø conventions (Åke ↔ Ake, Søren ↔ Soren)
  • Polish: ł/ń/ś/ż/ć/ź → base-character folds
  • Generic fallback (non-supported regions): Unicode NFKD decomposition, strip combining marks
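The idea of region-specific folds with a generic fallback can be sketched as below. The table `REGION_FOLDS`, its region codes, and the function names are hypothetical and cover only a few of the folds listed above; the product's validated fold tables are not reproduced here. The generic path uses NFKD decomposition and strips combining marks, as the last bullet describes.

```python
import unicodedata

# Hypothetical, deliberately tiny fold tables (not the product's tables).
REGION_FOLDS = {
    "DE": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "PL": {"ł": "l"},  # ł has no Unicode decomposition, so NFKD alone misses it
    "DK": {"ø": "o"},  # same for ø
}


def fold_generic(text):
    """Generic fallback: NFKD decomposition, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_value(text, region=None):
    """Apply the region's explicit folds first, then the generic fallback."""
    table = REGION_FOLDS.get(region, {})
    text = unicodedata.normalize("NFC", text)  # compose so table keys match
    out = []
    for ch in text:
        folded = table.get(ch.lower())
        if folded is None:
            out.append(ch)
        else:
            out.append(folded.capitalize() if ch.isupper() else folded)
    return fold_generic("".join(out))
```

In this sketch `match_value("Müller", "DE")` gives Mueller, while the generic path gives Muller. Aligning all three spellings would therefore need a further step, such as comparing against both folds; how the product handles that is not described here.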

Mixed-script values

Some fields contain mixed scripts — a name in both Latin and native script (e.g. García (ガルシア)). The Genie:

  • Keeps the full string in the display column
  • Picks the Latin/ASCII portion for the match column, or the phonetic transliteration of the native portion if no Latin is present
  • Flags the value for review if the two portions disagree significantly
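One way to pick out the Latin portion of a mixed-script value is to keep only characters whose Unicode name starts with "LATIN". This is a sketch under that assumption, with hypothetical function names; it does not attempt the phonetic transliteration of the native portion or the disagreement check from the list above.

```python
import unicodedata


def is_latin_letter(ch):
    """True for letters whose Unicode name begins with 'LATIN'."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def latin_portion(value):
    """Return the Latin-script part of a value, or None if there is none."""
    # Keep Latin letters plus a few name punctuation marks; blank out the rest.
    kept = [ch if (is_latin_letter(ch) or ch in " -'") else " " for ch in value]
    collapsed = " ".join("".join(kept).split())
    return collapsed or None
```

For the example above, `latin_portion("García (ガルシア)")` keeps García, which would then go through accent folding to produce the match value.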

Supported languages for matching

Matching is supported today for 20 Latin-script regions: US, UK, Ireland, Canada, Australia, New Zealand, Germany, Austria, Switzerland, Netherlands, France, Spain, Italy, Portugal, Sweden, Norway, Denmark, Poland, Mexico, and Brazil. Each has a validated per-region module with naming conventions, particles, diacritic folds, and postal formats. See Supported regions.

Not yet supported for matching

The following scripts and language families are on our roadmap but aren't in the product today:

  • CJK — Chinese (Simplified and Traditional), Japanese, Korean
  • Right-to-left scripts — Arabic, Hebrew, Persian
  • Indic languages — Hindi, Bengali, Tamil, and others
  • Thai and Vietnamese
  • Finnish — specific linguistic structure

Files containing these scripts still upload and store correctly (we detect and convert the byte-level encoding), and a generic Latin fallback will attempt matching on any Latin-transliterated portions. We'll ship native matching for these when we can do it as well as the current 20 regions.

Column-specific behavior

Not every column needs transliteration. A column profiled as identifier (SKU, account number) is never transliterated — character preservation is more important than soft matching. Cleansing decisions are per-column based on the detected type.
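A per-column decision of this kind might look like the following sketch. The type label `"identifier"` comes from the text; the function names and the simple accent-stripping fold are assumptions for illustration only.

```python
import unicodedata


def strip_accents(text):
    """Generic fold: NFKD decomposition, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_column(values, column_type):
    """Build match values for one column based on its detected type."""
    if column_type == "identifier":
        # Identifiers keep their exact characters.
        return list(values)
    return [strip_accents(v) for v in values]
```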

Troubleshooting

Output shows � characters

Encoding detection misidentified the source. Re-upload with an explicit encoding (the dropdown on the Upload step). If you don't know the encoding, try Windows-1252 first, since it is the most common source encoding that detection gets wrong.
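Before re-uploading, it can help to preview a few bytes of the file under each candidate encoding. The helper below is a hypothetical local check, not a ListMatchGenie feature: it decodes a sample strictly and reports which candidates fail.

```python
def try_encodings(data, candidates=("utf-8", "cp1252")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Bytes as a Windows-1252 export would contain them.
sample = "Müller".encode("cp1252")
preview = try_encodings(sample)
```

In this example the UTF-8 attempt fails and the Windows-1252 attempt yields Müller, which points to Windows-1252 as the encoding to select on re-upload.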

Accented characters match but shouldn't

Transliteration is too aggressive for your use case. Customize the match profile to disable transliteration on specific columns, or use exact-match rules (identifier-type columns) for fields that must match character-for-character.

Character-for-character match fails

Normalization may be collapsing characters you want distinct. Disable cleansing on the affected column to preserve exact byte content.