Matching people across regions is genuinely hard. García might appear in one system as Garcia, in another as García, and in a third as GARCIA. A Spanish customer list often has María Isabel García López in a single "name" column: four name tokens, two surnames, one Marian given-name compound. Dutch surnames carry particles (van, van der, van den). German records fold umlauts per convention (Müller ↔ Mueller).
ListMatchGenie handles this with per-region modules for 20 regions, plus layered strategies: encoding normalization, diacritic folding, phonetic matching, and profile-level tuning. International matching is a first-class part of the engine, not an afterthought.
Supported regions
Twenty regions are supported today, each with validated handling of naming conventions, particles, compound surnames, diacritics, and local postal-code formats. See Supported regions for the per-region reference.
- English-speaking (6): United States, United Kingdom, Ireland, Canada, Australia, New Zealand
- Western Europe — DACH + Benelux (4): Germany, Austria, Switzerland, Netherlands
- Southern Europe (4): France, Spain, Italy, Portugal
- Nordic (3): Sweden, Norway, Denmark
- Eastern Europe (1): Poland
- Latin America (2): Mexico, Brazil
Not yet supported (on the roadmap)
The following are on our roadmap but aren't in the product today — we'll ship them when we can do them as well as we handle the current 20 regions:
- CJK — Chinese (Simplified and Traditional), Japanese, Korean
- Right-to-left scripts — Arabic, Hebrew, Persian
- Indic languages — Hindi, Bengali, Tamil, and others (transliteration complexity)
- Thai and Vietnamese
- Finnish — specific linguistic structure; may add later if demand warrants
The foundation: encoding normalization
Every file is normalized to UTF-8 at upload (see Encoding and characters). This means byte-level representation isn't a concern — all names are represented as UTF-8 Unicode strings.
This is the minimum: byte-level differences (Latin-1 café vs UTF-8 café) no longer cause false negatives.
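As a minimal sketch of why this step matters, the snippet below decodes the same word from Latin-1 and UTF-8 bytes and shows that the two only compare equal once both are Unicode strings. The `to_text` helper, its Latin-1 fallback, and the extra NFC step are assumptions made for illustration, not a description of ListMatchGenie's upload code.

```python
import unicodedata

def to_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, fall back to Latin-1, then apply NFC (assumed policy)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this fallback never fails.
        text = raw.decode("latin-1")
    # NFC merges "e" + combining acute into a single precomposed "é".
    return unicodedata.normalize("NFC", text)

latin1_bytes = "café".encode("latin-1")  # 4 bytes
utf8_bytes = "café".encode("utf-8")      # 5 bytes

print(latin1_bytes == utf8_bytes)                    # False: the bytes differ
print(to_text(latin1_bytes) == to_text(utf8_bytes))  # True: the text is the same
```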
Layer 2: diacritic handling
For each string, the engine computes a diacritic-stripped form used for matching:
- García → match key Garcia
- Müller → match key Mueller (German convention)
- Lénárd → match key Lenard
- Åberg → match key Aberg (Scandinavian convention)
Display preserves the accented form. Matching uses the stripped form. Both records agree on the stripped form, so they match.
The specific transliteration is country-aware where possible:
- German: ä/ö/ü/ß → ae/oe/ue/ss
- Scandinavian: å/æ/ø → a/ae/o
- Eastern European: various specific mappings
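The mappings above can be sketched as a two-step fold: apply the region's own transliteration first, then strip whatever combining marks remain. The function name `match_key` and the region labels `"de"` and `"nordic"` are hypothetical; only the character mappings come from the list above.

```python
import unicodedata
from typing import Optional

# Region-specific transliterations from the list above. The region labels
# are hypothetical names chosen for this sketch.
REGION_MAPS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
    "nordic": {"å": "a", "æ": "ae", "ø": "o",
               "Å": "A", "Æ": "Ae", "Ø": "O"},
}

def match_key(name: str, region: Optional[str] = None) -> str:
    """Return a diacritic-free key; the original string is kept for display."""
    if region in REGION_MAPS:
        name = name.translate(str.maketrans(REGION_MAPS[region]))
    # NFKD splits "í" into "i" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(match_key("García"))           # Garcia
print(match_key("Müller", "de"))     # Mueller
print(match_key("Åberg", "nordic"))  # Aberg
```

Without a region, Müller folds to Muller rather than Mueller, which is why the transliteration is country-aware where possible.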
Layer 3: particle and compound-surname handling
Many of the 20 supported regions use particles or compound surnames that need to be preserved in display but treated correctly in matching:
- Dutch: van, van der, van den, de, den (Johan van der Berg matches J. van der Berg)
- Spanish / Mexican: paternal + maternal surnames (García López), particles like de la, del
- Portuguese / Brazilian: connector particles (da, dos, de) as part of the surname chain (Ana da Silva dos Santos)
- German: nobility particles (von, zu)
- Italian: di, della, del
- French: de, du, de la
- French-Canadian: saint-prefix abbreviations (St-Pierre ↔ Saint-Pierre)
See Supported regions for per-region specifics.
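To make the particle idea concrete, here is a rough sketch that separates particles from the core surname, expands a leading St- to Saint-, and lets an initial stand in for a full given name. The particle set is taken from the list above; the function names and the comparison rule are assumptions for illustration and are much simpler than per-region handling would need to be.

```python
import re

# Particles drawn from the list above; a real per-region table would be larger.
PARTICLES = {"van", "der", "den", "de", "von", "zu",
             "di", "della", "del", "du", "la", "da", "dos"}

def expand_saint(surname: str) -> str:
    """Expand a leading 'St-' or 'St.' to 'Saint-'."""
    return re.sub(r"^st[.\-]\s*", "Saint-", surname, flags=re.IGNORECASE)

def split_name(full_name: str):
    """Return (given, particles, core surname); the surname starts at the first particle."""
    tokens = full_name.split()
    for i in range(1, len(tokens)):
        if tokens[i].lower() in PARTICLES:
            j = i
            # Consume consecutive particles but always leave one token as the core.
            while j < len(tokens) - 1 and tokens[j].lower() in PARTICLES:
                j += 1
            particles = " ".join(t.lower() for t in tokens[i:j])
            return " ".join(tokens[:i]), particles, " ".join(tokens[j:])
    return " ".join(tokens[:-1]), "", tokens[-1] if tokens else ""

def given_compatible(a: str, b: str) -> bool:
    """Treat an initial such as 'J.' as compatible with any name starting with J."""
    a, b = a.rstrip(".").casefold(), b.rstrip(".").casefold()
    if len(a) == 1 or len(b) == 1:
        return a[:1] == b[:1]
    return a == b

def names_agree(a: str, b: str) -> bool:
    given_a, part_a, core_a = split_name(a)
    given_b, part_b, core_b = split_name(b)
    same_surname = (part_a == part_b and
                    expand_saint(core_a).casefold() == expand_saint(core_b).casefold())
    return same_surname and given_compatible(given_a, given_b)

print(names_agree("Johan van der Berg", "J. van der Berg"))    # True
print(names_agree("Marie St-Pierre", "Marie Saint-Pierre"))    # True
print(names_agree("Johan van der Berg", "Johan van den Berg")) # False
```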
Layer 4: phonetic matching
On top of transliteration, enabling phonetic matching catches further spelling variations. Phonetic coding operates on Latin-script input, so after transliteration it handles the remaining noise.
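This page does not say which phonetic algorithm the engine uses, so the sketch below uses classic American Soundex purely as an illustration of what a phonetic code does to already-transliterated, Latin-script input. The `soundex` function here is a stand-in, not ListMatchGenie's implementation.

```python
# Letter groups of classic American Soundex.
_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name: str) -> str:
    """Return the 4-character Soundex code of an ASCII name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (first + "".join(digits) + "000")[:4]

print(soundex("Meyer"), soundex("Meier"))    # M600 M600
print(soundex("Garcia"), soundex("Garsia"))  # G620 G620
```

Because the filter keeps only A to Z, accented letters must already have been folded; feeding it raw García would silently drop the í.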
Layer 5: name-order and token-assignment conventions
Even within the 20 supported Latin-script regions, name conventions differ:
- Western convention (most regions): First Last (John Smith)
- Spanish / LatAm convention: Given name(s) + paternal surname + maternal surname (María Isabel García López); the engine treats the paternal surname as the primary key and the maternal surname as a secondary signal
- Portuguese convention: Given name(s) + maternal family name + paternal family name, often with particles (Ana da Silva dos Santos)
When a file has a single "full name" column with 4–5 tokens, the engine infers token roles (given vs. paternal vs. maternal surname) using regional rules. When in doubt, mapping first-name and last-name columns explicitly produces the most reliable results.
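A simplified sketch of role inference under the Spanish convention: attach particles to the token that follows them, then read the last two units as paternal and maternal surnames. The function names and the particle set are assumptions, and real regional rules would need to handle many more cases than this does.

```python
# Assumed particle set for the Spanish convention; illustrative only.
PARTICLES_ES = {"de", "del", "la", "las", "los"}

def group_particles(tokens):
    """Attach particles to the following token: ['del', 'Río'] -> ['del Río']."""
    units, pending = [], []
    for tok in tokens:
        if tok.lower() in PARTICLES_ES:
            pending.append(tok)
        else:
            units.append(" ".join(pending + [tok]))
            pending = []
    if pending:  # trailing particle with nothing to attach to
        units.append(" ".join(pending))
    return units

def assign_roles_es(full_name: str) -> dict:
    """Guess given / paternal / maternal roles from a single full-name string."""
    units = group_particles(full_name.split())
    if len(units) < 3:
        # Too few units for two surnames: treat the last one as the only surname.
        return {"given": " ".join(units[:-1]),
                "paternal": units[-1] if units else "",
                "maternal": ""}
    return {"given": " ".join(units[:-2]),
            "paternal": units[-2],
            "maternal": units[-1]}

print(assign_roles_es("María Isabel García López"))
print(assign_roles_es("Ana María del Río Sánchez"))
```

The two-unit fallback is exactly where inference is weakest, which is why mapping first-name and last-name columns explicitly is the more reliable route.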
Eastern name-order conventions (Chinese, Japanese, Korean, Vietnamese) and patronymic conventions (Arabic, Persian) are on the roadmap alongside CJK and RTL script support.
Practical strategies
Strategy 1: normalize before upload
If you know your data has transliteration variance (e.g. the same Spanish name written with and without accents across two files), pre-normalize where possible:
- Pick one convention for accents and particles
- Apply it consistently across both files
- Then match normally
This is optional — the engine handles variance automatically — but it can reduce the review queue.
Strategy 2: use identifiers
For international data, relying on name matching alone is risky. When available, identifiers (email, national ID, passport number) are far more reliable. Use the Identifier profile, with name and address as tie-breakers.
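The idea of "identifier first, name as tie-breaker" can be sketched as follows. The record layout, the `email` field, and the use of `difflib.SequenceMatcher` for name similarity are assumptions for this example; they are not the Identifier profile's actual logic.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough 0..1 similarity between two names, case-insensitive."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def match_by_identifier(record: dict, candidates: list, id_field: str = "email"):
    """Match on the identifier; use the name only to pick among several hits."""
    key = (record.get(id_field) or "").strip().casefold()
    if not key:
        return None  # no identifier: leave the record to name-based matching
    hits = [c for c in candidates
            if (c.get(id_field) or "").strip().casefold() == key]
    if not hits:
        return None
    return max(hits, key=lambda c: name_similarity(record["name"], c["name"]))

record = {"name": "Maria Garcia", "email": "Familia.Garcia@example.com"}
candidates = [
    {"name": "Pedro Garcia", "email": "familia.garcia@example.com"},
    {"name": "Maria Garcia Lopez", "email": "familia.garcia@example.com "},
    {"name": "Maria Garcia", "email": "someone.else@example.com"},
]
print(match_by_identifier(record, candidates)["name"])  # Maria Garcia Lopez
```

The third candidate has the closest name but a different email, so it is never considered: the identifier decides, and the name only breaks the tie between the two shared-address records.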
Strategy 3: reduce threshold for international data
Expect lower match rates on international data than on domestic data. Running at threshold 65 instead of 70 is often appropriate, provided you review the expanded queue carefully.
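To see what lowering the threshold does to the review queue, the sketch below splits a handful of scored pairs at 70 and at 65. The pairs and their scores are invented for illustration and do not come from the engine.

```python
# (left name, right name, score on a 0-100 scale); all values are made up.
scored_pairs = [
    ("García", "Garcia", 92),
    ("Müller", "Mueller", 88),
    ("Åberg", "Oberg", 67),
    ("van der Berg", "Vandenberg", 66),
    ("Smith", "Schmidt", 58),
]

def accepted(pairs, threshold):
    """Pairs whose score meets or exceeds the threshold."""
    return [p for p in pairs if p[2] >= threshold]

at_70 = accepted(scored_pairs, 70)
at_65 = accepted(scored_pairs, 65)
newly_admitted = [p for p in at_65 if p not in at_70]

print(len(at_70), len(at_65))  # 2 4
print(newly_admitted)          # the pairs that now need careful review
```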
Strategy 4: separate passes per language
For datasets with clean language segmentation (e.g. all records tagged with language code), consider running separate matches per language segment. This lets you tune thresholds and profiles per language without compromise.
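One way to prepare such passes is to split the records by language code and attach per-segment settings before uploading each segment separately. The `lang` field, the settings dictionary, and the `"und"` label for untagged records are all assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical per-segment settings; the thresholds are examples only.
SEGMENT_SETTINGS = {"es": {"threshold": 65}, "de": {"threshold": 70}}
DEFAULT_SETTINGS = {"threshold": 70}

def segment_by_language(records, field="lang"):
    """Group records by language code; untagged records go to 'und'."""
    segments = defaultdict(list)
    for rec in records:
        segments[(rec.get(field) or "und").lower()].append(rec)
    return dict(segments)

def plan_passes(records):
    """Return one (language, settings, rows) tuple per segment, sorted by language."""
    return [(lang, SEGMENT_SETTINGS.get(lang, DEFAULT_SETTINGS), rows)
            for lang, rows in sorted(segment_by_language(records).items())]

records = [
    {"name": "María García López", "lang": "es"},
    {"name": "Jürgen Müller", "lang": "DE"},
    {"name": "Lucía Fernández", "lang": "es"},
    {"name": "John Smith"},
]
for lang, settings, rows in plan_passes(records):
    print(lang, settings["threshold"], len(rows))
```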
Common pitfalls
Order of operations matters
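ListMatchGenie's default order is: normalize → region-detect → fold diacritics → particle-handle → phonetic → score. Each stage works on the output of the one before it, so changing the order (possible in custom profiles) changes the keys that later stages see. Test any reordering against a known sample before relying on it.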
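A small illustration of why the order matters, assuming a region-specific German transliteration and a generic accent-stripping step (both written here for the example, not taken from the engine): if generic stripping runs first, the umlaut is gone before the German rule can see it.

```python
import unicodedata
from functools import reduce

def german_translit(s: str) -> str:
    """Region-specific step: apply the German ae/oe/ue/ss convention."""
    table = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
             "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
    return s.translate(str.maketrans(table))

def strip_accents(s: str) -> str:
    """Generic step: drop all combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def run(stages, value):
    """Apply the stages left to right."""
    return reduce(lambda acc, stage: stage(acc), stages, value)

region_first = run([german_translit, strip_accents], "Müller")
generic_first = run([strip_accents, german_translit], "Müller")

print(region_first)   # Mueller
print(generic_first)  # Muller: the ü was stripped before the German rule ran
```

With the second ordering, Müller and Mueller end up with different keys and would only match through a fuzzier layer.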
Cross-lingual nicknames
Mike, Miguel, Michel, Michele are all forms of the same name across languages. The English nickname table covers Mike/Mikey/Michael; Spanish/Portuguese variants are handled through the region modules. For data spanning many regions, a custom nickname table is often worth maintaining.
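A custom nickname table can be as simple as a mapping from each variant to one canonical form. The table below is a hypothetical example built from the names in this paragraph; the format ListMatchGenie expects for custom tables is not shown here.

```python
# Hypothetical cross-lingual table: every variant maps to one canonical form.
CANONICAL_GIVEN = {
    "mike": "michael",
    "mikey": "michael",
    "michael": "michael",
    "miguel": "michael",
    "michel": "michael",
    "michele": "michael",
}

def canonical_given(name: str) -> str:
    """Return the canonical form, or the casefolded name if it is not in the table."""
    key = name.strip().casefold()
    return CANONICAL_GIVEN.get(key, key)

def same_given(a: str, b: str) -> bool:
    return canonical_given(a) == canonical_given(b)

print(same_given("Mike", "Miguel"))    # True
print(same_given("Michel", "Michele")) # True
print(same_given("Mike", "Marco"))     # False
```

Collapsing variants this aggressively trades precision for recall (Michele is also a common feminine name, for instance), so a table like this is best paired with review of the resulting matches.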
Names from unsupported regions
Files with records from regions not in the supported list (e.g. CJK or RTL script names) still match with a generic Latin fallback — Unicode NFKD normalization, accent stripping, generic particle handling. Results are reasonable on Latin-script input but don't benefit from per-country validation.
Related reading
- Supported regions — per-region reference for all 20 supported regions
- Encoding and characters — the byte-level foundation
- Phonetic matching — the name-sound layer
- Handling nicknames and abbreviations — the name-canonical layer
