Handling international names

Matching people across regions is genuinely hard. García might appear in one system as Garcia, in another as García, and in a third as GARCIA. A Spanish customer list often has María Isabel García López in a single "name" column: four name tokens, two surnames, and one compound given name built on María. Dutch surnames carry particles (van, van der, van den). German records fold umlauts per convention (Müller ↔ Mueller).

ListMatchGenie handles this with per-region modules for 20 regions, plus layered strategies: encoding normalization, diacritic folding, phonetic matching, and profile-level tuning. International matching is a first-class part of the engine, not an afterthought.

Supported regions

Twenty regions are supported today, each with validated handling of naming conventions, particles, compound surnames, diacritics, and local postal-code formats. See Supported regions for the per-region reference.

English-speaking (6): United States, United Kingdom, Ireland, Canada, Australia, New Zealand
Western Europe — DACH + Benelux (4): Germany, Austria, Switzerland, Netherlands
Southern Europe (4): France, Spain, Italy, Portugal
Nordic (3): Sweden, Norway, Denmark
Eastern Europe (1): Poland
Latin America (2): Mexico, Brazil

Not yet supported (on the roadmap)

The following are on our roadmap but aren't in the product today — we'll ship them when we can do them as well as we handle the current 20 regions:

CJK — Chinese (Simplified and Traditional), Japanese, Korean
Right-to-left scripts — Arabic, Hebrew, Persian
Indic languages — Hindi, Bengali, Tamil, and others (transliteration complexity)
Thai and Vietnamese
Finnish — specific linguistic structure; may add later if demand warrants

The foundation: encoding normalization

Every file is normalized to UTF-8 at upload (see Encoding and characters). This means byte-level representation isn't a concern — all names are represented as UTF-8 Unicode strings.

This is the minimum: byte-level differences (Latin-1 café vs UTF-8 café) no longer cause false negatives.
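
To make the byte-level point concrete, here is a minimal sketch of the idea, not the engine's actual upload code. The function name decode_to_text and the UTF-8-then-Latin-1 fallback order are assumptions chosen for illustration.

```python
def decode_to_text(raw: bytes) -> str:
    """Decode raw bytes to text, trying UTF-8 first (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every possible byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

# The bytes differ, but the decoded strings are identical.
assert latin1_bytes != utf8_bytes
assert decode_to_text(latin1_bytes) == decode_to_text(utf8_bytes)
```

Once both files are decoded to the same Unicode text, later layers compare characters rather than bytes.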

Layer 2: diacritic handling

For each string, the engine computes a diacritic-folded match key. Most accents are simply stripped; some characters are expanded according to regional convention:

García → match key Garcia
Müller → match key Mueller (German convention)
Lénárd → match key Lenard
Åberg → match key Aberg (Scandinavian convention)

Display preserves the accented form, while matching uses the folded key. Two records that differ only in diacritics produce the same key, so they match.

The specific transliteration is country-aware where possible:

German: ä/ö/ü/ß → ae/oe/ue/ss
Scandinavian: å/æ/ø → a/ae/o
Eastern European: various specific mappings
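
The mappings above can be sketched in a few lines. This is an illustrative approximation, not the engine's implementation: the names REGION_MAPS and match_key and the region labels are assumptions, and only the German and Scandinavian mappings listed above are included.

```python
import unicodedata
from typing import Optional

# Region-specific expansions copied from the list above (hypothetical table layout).
REGION_MAPS = {
    "german": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "scandinavian": {"å": "a", "æ": "ae", "ø": "o"},
}


def match_key(name: str, region: Optional[str] = None) -> str:
    """Apply region-specific expansions first, then strip remaining accents."""
    mapping = REGION_MAPS.get(region, {})
    pieces = []
    for ch in unicodedata.normalize("NFC", name):
        repl = mapping.get(ch.lower())
        if repl is None:
            pieces.append(ch)
        else:
            # Preserve capitalization of the original character.
            pieces.append(repl.capitalize() if ch.isupper() else repl)
    decomposed = unicodedata.normalize("NFKD", "".join(pieces))
    # Drop combining marks left over after decomposition.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


assert match_key("García") == "Garcia"
assert match_key("Müller", "german") == "Mueller"
assert match_key("Åberg", "scandinavian") == "Aberg"
```

Note that æ, ø, and ß have no Unicode decomposition, so generic accent stripping leaves them untouched; they need an explicit mapping like the one sketched here.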

Layer 3: particle and compound-surname handling

Many of the 20 supported regions use particles or compound surnames that need to be preserved in display but treated correctly in matching:

Dutch: van, van der, van den, de, den — Johan van der Berg matches J. van der Berg
Spanish / Mexican: paternal + maternal surnames (García López), particles like de la, del
Portuguese / Brazilian: connector particles (da, dos, de) as part of the surname chain (Ana da Silva dos Santos)
German: nobility particles (von, zu)
Italian: di, della, del
French: de, du, de la
French-Canadian: saint-prefix abbreviations (St-Pierre ↔ Saint-Pierre)
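
One way to picture "preserved in display but treated correctly in matching" is to split a surname into its particles and its core. The sketch below is an assumption-laden illustration: the PARTICLES set is assembled from the examples above, and split_surname and surnames_agree are hypothetical names, not product functions.

```python
from typing import Tuple

# Hypothetical particle list drawn from the examples above.
PARTICLES = {"van", "der", "den", "de", "la", "del", "da", "dos",
             "von", "zu", "di", "della", "du"}


def split_surname(surname: str) -> Tuple[str, str]:
    """Return (core, particles), both lowercased, for comparison purposes."""
    tokens = surname.lower().split()
    core = [t for t in tokens if t not in PARTICLES]
    particles = [t for t in tokens if t in PARTICLES]
    if not core:
        # A surname made only of particle-like words is kept whole.
        core, particles = tokens, []
    return " ".join(core), " ".join(particles)


def surnames_agree(a: str, b: str) -> bool:
    """Compare surnames on their core, ignoring particle casing and presence."""
    return split_surname(a)[0] == split_surname(b)[0]


assert split_surname("van der Berg") == ("berg", "van der")
assert surnames_agree("Van Der Berg", "van der Berg")
```

Whether a missing or different particle (Berg vs. van der Berg) should count as full agreement or only as a weaker signal is a policy choice; this sketch treats the core as decisive. The original string stays untouched for display.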

See Supported regions for per-region specifics.

Layer 4: phonetic matching

Phonetic matching, when enabled, catches spelling variations that transliteration alone does not resolve. Phonetic coding operates on Latin-script input, so it runs after transliteration and handles the remaining noise.
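
This page does not say which phonetic algorithm the engine uses. As a stand-in, the sketch below implements classic American Soundex to show how a phonetic code collapses spelling variants of already-folded, Latin-script names. Treat it as an illustration of the general technique only.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    prev = codes.get(first, "")
    digits = []
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so repeated codes across a vowel count
    return (first + "".join(digits) + "000")[:4]


assert soundex("Robert") == soundex("Rupert") == "R163"
assert soundex("Meyer") == soundex("Meier") == "M600"
```

Because the code ignores non-ASCII letters, it only behaves sensibly on input that has already been through diacritic folding.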

Layer 5: name-order and token-assignment conventions

Even within the 20 supported Latin-script regions, name conventions differ:

Western convention (most regions): First Last (John Smith)
Spanish / LatAm convention: Given name(s) + paternal surname + maternal surname (María Isabel García López) — the engine treats the paternal surname as the primary key and the maternal surname as a secondary signal
Portuguese convention: Given name(s) + maternal family name + paternal family name, often with particles (Ana da Silva dos Santos)

When a file has a single "full name" column with 4–5 tokens, the engine infers token roles (given vs. paternal vs. maternal surname) using regional rules. When in doubt, mapping first-name and last-name columns explicitly produces the most reliable results.
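
The role inference described above can be sketched as follows. This is a simplified illustration under stated assumptions: the convention labels, the CONNECTORS set, and the names NameRoles and assign_roles are hypothetical, and real regional rules are considerably richer.

```python
from dataclasses import dataclass

# Hypothetical connector list, limited to particles mentioned on this page.
CONNECTORS = {"de", "la", "del", "da", "dos"}


@dataclass
class NameRoles:
    given: str
    paternal: str   # for the "western" convention this holds the single surname
    maternal: str


def group_particles(tokens):
    """Attach connector particles to the name token that follows them."""
    groups, pending = [], []
    for tok in tokens:
        if tok.lower() in CONNECTORS:
            pending.append(tok)
        else:
            groups.append(" ".join(pending + [tok]))
            pending = []
    if pending:
        groups.append(" ".join(pending))
    return groups


def assign_roles(full_name: str, convention: str) -> NameRoles:
    groups = group_particles(full_name.split())
    if not groups:
        raise ValueError("empty name")
    if convention == "western" or len(groups) < 3:
        return NameRoles(" ".join(groups[:-1]), groups[-1], "")
    *given, first_surname, second_surname = groups
    if convention == "spanish":      # paternal surname comes first
        return NameRoles(" ".join(given), first_surname, second_surname)
    if convention == "portuguese":   # maternal family name comes first
        return NameRoles(" ".join(given), second_surname, first_surname)
    raise ValueError(f"unknown convention: {convention}")


roles = assign_roles("María Isabel García López", "spanish")
assert (roles.given, roles.paternal, roles.maternal) == ("María Isabel", "García", "López")
```

The sketch assumes exactly two surnames whenever three or more groups are present, which is why explicitly mapped first-name and last-name columns remain the more reliable input.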

Eastern name-order conventions (Chinese, Japanese, Korean, Vietnamese) and patronymic conventions (Arabic, Persian) are on the roadmap alongside CJK and RTL script support.

Practical strategies

Strategy 1: normalize before upload

If you know your data has transliteration variance (e.g. the same Spanish name written with and without accents across two files), pre-normalize where possible:

Pick one convention for accents and particles
Apply it consistently across both files
Then match normally

This is optional — the engine handles variance automatically — but it can reduce the review queue.

Strategy 2: use identifiers

For international data, relying on name matching alone is risky. When available, identifiers (email, national ID, passport number) are dramatically more reliable. Use the Identifier profile, with name and address as tie-breakers.
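
The identifier-first idea can be sketched outside the product as well. The example below is a rough illustration, not the Identifier profile itself: the record layout, the function names, and the use of difflib's SequenceMatcher as the name-similarity measure are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for a real name scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_by_identifier(record, candidates, id_field="email"):
    """Match on the identifier; use name similarity only to break ties."""
    key = record.get(id_field, "").strip().lower()
    if not key:
        return None
    hits = [c for c in candidates
            if c.get(id_field, "").strip().lower() == key]
    if not hits:
        return None
    return max(hits, key=lambda c: name_similarity(record["name"], c["name"]))


record = {"name": "Maria Garcia", "email": "MG@example.com"}
candidates = [
    {"name": "M. Garcia Lopez", "email": "mg@example.com"},
    {"name": "Mei Goh", "email": "mg@example.com"},
    {"name": "Maria Garcia", "email": "other@example.com"},
]
best = match_by_identifier(record, candidates)
assert best["name"] == "M. Garcia Lopez"
```

The exact-name candidate with a different email is never considered here, which is the point: the identifier decides, and the name only ranks records that already share it.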

Strategy 3: reduce threshold for international data

Expect lower match rates on international data than on domestic data. Lowering the threshold from 70 to 65 is often appropriate, provided you carefully review the larger queue that results.

Strategy 4: separate passes per language

For datasets with clean language segmentation (e.g. every record tagged with a language code), consider running a separate match for each language segment. This lets you tune thresholds and profiles per language without compromise.
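
If you prepare the segments yourself before upload, the split might look like the sketch below. The "lang" field name, the per-language threshold table, and the function names are hypothetical; 70 and 65 are simply the values mentioned on this page, not tuned recommendations.

```python
from collections import defaultdict

DEFAULT_THRESHOLD = 70               # assumed baseline
THRESHOLDS = {"es": 65, "pt": 65}    # hypothetical per-language overrides


def segment_by_language(records):
    """Group records by language tag; untagged records go to 'und'."""
    segments = defaultdict(list)
    for rec in records:
        segments[rec.get("lang", "und")].append(rec)
    return dict(segments)


def plan_passes(records):
    """Return (language, threshold, record_count) for each matching pass."""
    return [
        (lang, THRESHOLDS.get(lang, DEFAULT_THRESHOLD), len(rows))
        for lang, rows in sorted(segment_by_language(records).items())
    ]


records = [
    {"name": "García López", "lang": "es"},
    {"name": "Müller", "lang": "de"},
    {"name": "Hernández", "lang": "es"},
    {"name": "Smith"},
]
assert plan_passes(records) == [("de", 70, 1), ("es", 65, 2), ("und", 70, 1)]
```

Each tuple then corresponds to one upload-and-match run with its own threshold.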

Common pitfalls

Order of operations matters

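ListMatchGenie's default order is: normalize → region-detect → fold diacritics → particle-handle → phonetic → score. Custom profiles can change this order, but each step assumes the output of the one before it (for example, phonetic coding expects Latin-script input that has already been transliterated), so reorder only when you understand the downstream effect.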
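
A small experiment shows why order matters. The sketch below is an illustration rather than the engine's pipeline: it runs two hypothetical steps, a German-specific fold and a generic accent strip, in both orders and gets different match keys for the same name.

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Generic fold: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_german(s: str) -> str:
    """German convention: expand umlauts and sharp s (lowercase forms only)."""
    s = unicodedata.normalize("NFC", s)
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        s = s.replace(src, dst)
    return s


def run(steps, value):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        value = step(value)
    return value


# Region-specific fold first: the umlaut is expanded as German convention expects.
assert run([fold_german, strip_accents], "Müller") == "Mueller"
# Generic strip first: the umlaut is already gone, so the German fold has nothing to do.
assert run([strip_accents, fold_german], "Müller") == "Muller"
```

A record keyed as Muller will not agree with one keyed as Mueller, so a reordered profile can silently lose matches that the default order finds.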

Cross-lingual nicknames

Mike, Miguel, Michel, and Michele are all forms of the same name across languages. The English nickname table covers Mike/Mikey/Michael; Spanish and Portuguese variants are handled through the region modules. For data spanning many regions, a custom nickname table is often worth maintaining.
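
Conceptually, a custom nickname table maps every variant to one canonical key. The sketch below shows the idea only; the table contents come from the example above, and the structure and function names are assumptions rather than the product's table format.

```python
# Hypothetical custom table: each variant points at one canonical given name.
CUSTOM_NICKNAMES = {
    "mike": "michael",
    "mikey": "michael",
    "michael": "michael",
    "miguel": "michael",
    "michel": "michael",
    "michele": "michael",
}


def canonical_given_name(name: str) -> str:
    """Look up the canonical form; unknown names map to themselves."""
    key = name.strip().lower()
    return CUSTOM_NICKNAMES.get(key, key)


def same_given_name(a: str, b: str) -> bool:
    return canonical_given_name(a) == canonical_given_name(b)


assert same_given_name("Mike", "Miguel")
assert same_given_name("Michel", "Michele")
assert not same_given_name("Mike", "Mark")
```

Collapsing variants this aggressively trades precision for recall, so entries are best added only for pairs you have actually seen refer to the same person in your data.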

Names from unsupported regions

Files with records from regions outside the supported list (e.g. names from CJK or RTL-script regions) still match through a generic Latin fallback: Unicode NFKD normalization, accent stripping, and generic particle handling. Results are reasonable when the input is in Latin script but don't benefit from per-country validation.

Supported regions — per-region reference for all 20 supported regions
Encoding and characters — the byte-level foundation
Phonetic matching — the name-sound layer
Handling nicknames and abbreviations — the name-canonical layer