Encoding and characters

Data from international systems is messy. Customer names may arrive in Latin-1, Shift-JIS, Big5, or Windows-1252, sometimes within the same file. Accented characters that display correctly in the source system turn into mojibake in the export. Matching fails because García and Garcia are different strings under plain comparison, even though they clearly refer to the same name.

ListMatchGenie handles this by (1) normalizing everything to UTF-8 at upload, and (2) keeping the original and the normalized value side-by-side — original for display, normalized for matching.

Encoding detection

Every upload goes through encoding detection:

BOM check. If the file starts with a byte-order mark (UTF-8, UTF-16 LE/BE), the encoding is read directly from it.
Content sniffing. If there is no BOM, a statistical detector analyzes the byte distribution and picks the most likely encoding. Accuracy is high on files with at least 1,000 bytes of text.
Declared override. If you know the encoding, you can override detection on the upload screen.

The detected encoding is listed in the cleansing report. If detection was uncertain, the report shows a warning. If matching then produces garbled results, re-upload the file with an explicit encoding.
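The detection order can be illustrated with a short Python sketch. This is not ListMatchGenie's implementation: the function name detect_encoding is hypothetical, the declared override is assumed to take precedence, and the "sniffing" step is a deliberately crude stand-in (try strict UTF-8, otherwise assume Windows-1252) for real statistical detection.

```python
import codecs
from typing import Optional, Tuple

# BOM signatures named in the text. Python's "utf-16" codec consumes the BOM
# and picks the byte order itself; "utf-8-sig" strips the UTF-8 BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def detect_encoding(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (encoding, how_it_was_chosen). Hypothetical helper."""
    if declared:
        codecs.lookup(declared)  # raises LookupError for an unknown codec name
        return declared, "declared"
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, "bom"
    try:
        # Stand-in for statistical sniffing: valid UTF-8 is accepted as UTF-8.
        raw.decode("utf-8")
        return "utf-8", "sniffed"
    except UnicodeDecodeError:
        # Assumed fallback; a real detector would score several candidates.
        return "cp1252", "fallback"
```

A result tagged "fallback" corresponds to the uncertain case, where a warning and an explicit override are the safer path.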

Supported source encodings

UTF-8 (with or without BOM) — preferred
UTF-16 LE/BE (with BOM)
Latin-1 (ISO-8859-1)
Windows-1252 (Latin-1 superset — most common for exported Excel files)
ISO-8859-2 (Central European — covers Polish diacritics)
ISO-8859-15 (Western European with euro sign)

Other encodings, including Shift-JIS, GB2312, Big5, and Windows-1251, can be detected at upload and converted to UTF-8 so that your files load without errors. Native matching for CJK and Cyrillic scripts, however, is on the roadmap rather than in today's release. See Handling international names for what is supported for matching.

Conversion to UTF-8

All detected content is converted to UTF-8 at upload. Unrecognized byte sequences are replaced with the Unicode replacement character (U+FFFD, displayed as �). Rows containing replacement characters are flagged in the cleansing report so you can:

Re-upload with an explicit encoding if detection was wrong
Spot-check those rows for lost information
Drop them if they're a small minority and not worth the effort
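To see how replacement characters arise and how affected rows can be found, here is a minimal Python sketch. The function name decode_and_flag is hypothetical, and splitting on line breaks stands in for real CSV parsing; it only illustrates the behavior described above, not the product's code.

```python
from typing import List, Tuple

REPLACEMENT = "\ufffd"  # Unicode replacement character, displayed as �

def decode_and_flag(raw: bytes, encoding: str) -> Tuple[List[str], List[int]]:
    """Decode raw bytes and return (rows, indexes of rows containing U+FFFD)."""
    # errors="replace" substitutes U+FFFD for bytes that are invalid
    # in the chosen encoding instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    rows = text.splitlines()
    flagged = [i for i, row in enumerate(rows) if REPLACEMENT in row]
    return rows, flagged

# A Windows-1252 file wrongly decoded as UTF-8: the byte for "ü" is invalid.
raw = "Müller\nSmith\n".encode("cp1252")
rows, flagged = decode_and_flag(raw, "utf-8")
```

Decoding the same bytes with "cp1252" yields no flagged rows, which is the situation where re-uploading with an explicit encoding recovers the data.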

Accent handling

Accented characters (diacritics) present a matching challenge: Müller, Mueller, and Muller are the same name written under different transliteration conventions. They need to match, but each also needs to display exactly as it was entered.

The Genie handles this with two internal representations per value:

Display column

The original character-preserving value: Müller, García, Åberg. This is what you see in the app, what exports contain, and what shares display. Nothing is lost.

Match column

An ASCII-transliterated value: Mueller, Garcia, Aberg. The match engine compares on this so Müller ↔ Mueller ↔ Muller all align.

Transliteration is country-sensitive. ListMatchGenie has validated fold tables for 20 supported regions — see Supported regions for the complete reference. A few examples:

German / Austrian / Swiss German: ä → ae, ö → oe, ü → ue, ß → ss
Spanish / Mexican: ñ → n, á/é/í/ó/ú → a/e/i/o/u
French / Québécois: é/è/ê/ë → e, ç → c, à/â → a
Portuguese / Brazilian: ã/õ → a/o, ç → c, accent folding
Swedish / Norwegian / Danish: region-specific å/æ/ø conventions (Åke ↔ Ake, Søren ↔ Soren)
Polish: ł/ń/ś/ż/ć/ź → base-character folds
Generic fallback (non-supported regions): Unicode NFKD decomposition, strip combining marks
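The difference between a region-specific fold and the generic fallback can be sketched in Python. The table below contains only the German folds listed above; the uppercase entries, the region codes, and the function names are assumptions made for illustration, not ListMatchGenie's validated fold tables.

```python
import unicodedata

# German folds from the list above; uppercase variants are an assumption.
GERMAN_FOLDS = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}
GERMAN_REGIONS = {"DE", "AT", "CH"}  # hypothetical region codes

def generic_fold(value: str) -> str:
    """Generic fallback: NFKD decomposition, then strip combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_key(value: str, region: str = "") -> str:
    """Build a match-column value; the display value is left untouched."""
    # Compose first so "u" + combining diaeresis is seen as a single "ü".
    value = unicodedata.normalize("NFC", value)
    if region in GERMAN_REGIONS:
        value = "".join(GERMAN_FOLDS.get(ch, ch) for ch in value)
    return generic_fold(value)
```

With this sketch, Müller becomes Mueller under a German region and Muller under the generic fallback. Characters such as ł and ø have no Unicode decomposition, so NFKD alone leaves them unchanged, which is one reason explicit per-region tables are useful.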

Mixed-script values

Some fields mix scripts, for example a name written in both Latin and native script, such as García (ガルシア). The Genie:

Keeps the full string in the display column
Picks the Latin/ASCII portion for the match column, or the phonetic transliteration of the native portion if no Latin is present
Flags the value for review if the two portions disagree significantly
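As a rough illustration of the first two steps, the following Python sketch pulls the Latin portion out of a mixed-script value. The function name latin_portion and the approach of checking Unicode character names are assumptions; phonetic transliteration of the native portion and the disagreement check are not covered here.

```python
import unicodedata

def latin_portion(value: str) -> str:
    """Keep Latin letters plus spaces, hyphens, and apostrophes."""
    kept = []
    for ch in value:
        if ch.isalpha():
            # Unicode names of Latin letters start with "LATIN",
            # e.g. "LATIN SMALL LETTER I WITH ACUTE".
            if unicodedata.name(ch, "").startswith("LATIN"):
                kept.append(ch)
        elif ch in " -'":
            kept.append(ch)
    # Collapse whitespace left behind by the removed characters.
    return " ".join("".join(kept).split())

display_value = "García (ガルシア)"        # stored unchanged for display
match_source = latin_portion(display_value)  # input to accent folding
```

An empty result signals that no Latin portion is present, which is the case where the text above says the native portion is transliterated instead.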

Supported languages for matching

Matching is supported today for 20 Latin-script regions: US, UK, Ireland, Canada, Australia, New Zealand, Germany, Austria, Switzerland, Netherlands, France, Spain, Italy, Portugal, Sweden, Norway, Denmark, Poland, Mexico, and Brazil. Each has a validated per-region module with naming conventions, particles, diacritic folds, and postal formats. See Supported regions.

Not yet supported for matching

The following scripts and language families are on our roadmap but aren't in the product today:

CJK — Chinese (Simplified and Traditional), Japanese, Korean
Right-to-left scripts — Arabic, Hebrew, Persian
Indic languages — Hindi, Bengali, Tamil, and others
Thai and Vietnamese
Finnish — specific linguistic structure

Files containing these scripts still upload and store correctly (we detect and convert the byte-level encoding), and a generic Latin fallback will attempt matching on any Latin-transliterated portions. We'll ship native matching for these when we can do it as well as the current 20 regions.

Supported file formats — encoding detection at upload
Handling international names — matching across languages
Cleansing report — where encoding warnings appear
