Localization Engineering Vocabulary: i18n, l10n, XLIFF, and TMS Explained

The complete localization engineering vocabulary — i18n vs l10n, BCP 47 language tags, ICU MessageFormat, CLDR plural rules, XLIFF files, translation memory, and TMS tools like Phrase and Crowdin.

When engineers talk about shipping a product in multiple languages, they often conflate two distinct engineering disciplines. The vocabulary is precise and the distinction matters: getting it wrong in a technical interview or a planning meeting signals unfamiliarity with the domain. This guide covers the core localization engineering vocabulary, from the abbreviations to the tooling ecosystem.

i18n vs l10n — The Core Distinction

The IT industry compresses long words by keeping the first and last letter and counting the characters in between. Internationalization has 18 characters between the i and the n, hence i18n. Localization has 10 between the l and the n, hence l10n. The same pattern gives us g11n (globalization, the overarching business strategy) and t9n (translation, one component of l10n).
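The counting rule is easy to check mechanically. The sketch below is illustrative only; the function name numeronym is made up for this example and is not part of any library.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of inner letters + last letter."""
    if len(word) < 3:
        return word  # nothing between the first and last letter to count
    return f"{word[0]}{len(word) - 2}{word[-1]}".lower()


for term in ("internationalization", "localization", "globalization", "translation"):
    print(term, "->", numeronym(term))
```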

These are not synonyms. They describe different phases of the same pipeline:

i18n is the engineering work of designing and building software so that it can be localized — externalising user-facing strings into resource files, supporting Unicode and multiple character sets, building flexible date, number, and currency formatting, and avoiding hardcoded assumptions about text direction or cultural context.

l10n is the work that comes after: producing a locale-specific version of the product — translating strings, adapting images and colour choices for local cultural norms, handling locale-specific legal requirements, and adjusting UI layout for text expansion.

“We’ve finished the i18n framework — all strings are externalized and the date formatting is locale-aware. The l10n work starts when we hand the XLIFF files to the translation team and add French.”

You build i18n once. You do l10n for every new locale.


BCP 47 Language Tags

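BCP 47 (IETF Best Current Practice 47) defines the standard format for language tags. The structure you will meet most often is language-Script-REGION, where everything beyond the base language is optional. The full grammar also allows variant, extension, and private-use subtags, which are rarely needed in product work: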

  • en — English (unspecified region)
  • en-GB — English as used in the United Kingdom
  • zh-Hant-TW — Traditional Chinese (Hant script) as used in Taiwan
  • sr-Latn-RS — Serbian written in the Latin script, as used in Serbia
  • pt-BR — Brazilian Portuguese

The critical point for engineering conversations: en alone is not a complete locale for a product. American English, British English, and Australian English differ in spelling, date format convention, and vocabulary. A product shipping to all three should explicitly target en-US, en-GB, and en-AU.
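To make the language-Script-REGION shape concrete, here is a minimal parsing sketch. It deliberately handles only those three subtags and rejects everything else (variants, extensions, private use); LanguageTag and parse_tag are hypothetical names chosen for this example, and production code should rely on a maintained locale library instead.

```python
import re
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified pattern: 2-3 letter language, optional 4-letter script,
# optional 2-letter or 3-digit region. Nothing else is accepted.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str) -> LanguageTag:
    """Split a tag into subtags and apply the conventional casing."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script, region = match["script"], match["region"]
    return LanguageTag(
        language=match["language"].lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )


print(parse_tag("zh-Hant-TW"))
print(parse_tag("pt-BR"))
print(parse_tag("en"))
```

Subtags are case-insensitive when matched, so the sketch normalises to the conventional casing (lowercase language, title-case script, uppercase region) before returning.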

Locale Fallback Chains

In practice, not every string will have a translation for every locale. Systems implement locale fallback chains to degrade gracefully:

“If a user requests pt-BR, we fall back to pt (generic Portuguese), then fall back to en-US as the base locale.”
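A fallback chain like the one in that quote can be sketched by dropping subtags from the right and finishing with the base locale. This is a simplified illustration: the function names and the catalogs dictionary are assumptions made for the example, and real i18n libraries often layer extra rules (such as parent-locale data) on top of plain truncation.

```python
from typing import Dict, List


def fallback_chain(tag: str, base: str = "en-US") -> List[str]:
    """Return the locales to try, most specific first, ending with the base locale."""
    parts = tag.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if base not in chain:
        chain.append(base)
    return chain


def lookup(key: str, locale: str, catalogs: Dict[str, Dict[str, str]],
           base: str = "en-US") -> str:
    """Return the first translation found along the fallback chain."""
    for candidate in fallback_chain(locale, base):
        message = catalogs.get(candidate, {}).get(key)
        if message is not None:
            return message
    return key  # last resort: surface the key so the gap is visible


# Hypothetical catalogs: generic Portuguese has one string, en-US has both.
catalogs = {
    "pt": {"nav.home": "Início"},
    "en-US": {"nav.home": "Home", "nav.settings": "Settings"},
}
print(fallback_chain("pt-BR"))
print(lookup("nav.home", "pt-BR", catalogs))
print(lookup("nav.settings", "pt-BR", catalogs))
```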

In planning conversations, it is standard to distinguish tier-1 locales (fully translated and QA’d at launch) from tier-2 locales (machine-translated or community-contributed):

“We’re targeting tier-1 locales for launch: en-US, fr-FR, de-DE, ja-JP, zh-CN, pt-BR, and es-MX. Tier-2 will ship in Q3.”


ICU MessageFormat and CLDR Plural Rules

The most common i18n mistake in English-speaking engineering teams is building pluralization logic that works in English and breaks in every other language. The canonical bad example:

"You have %d message(s)" // untranslatable — "message(s)" is an English hack

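This pattern is not localizable. In CLDR terms, Polish and Russian each use four plural categories (one, few, many, and other, where other covers fractional values), and Arabic uses all six. Hardcoding one string key per integer count (message_0, message_1, message_5, etc.) does not scale, because the counts that select each form differ from language to language.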

ICU MessageFormat

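ICU MessageFormat is the standard solution. It comes from the International Components for Unicode (ICU) project, which IBM open-sourced and the Unicode Consortium now hosts, and it is supported by many i18n libraries, including ICU4J and ICU4C themselves and FormatJS in the JavaScript ecosystem. It uses a declarative syntax inside the message string itself: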

{count, plural,
  one {You have # message}
  other {You have # messages}
}

For a language with more plural forms (Russian):

{count, plural,
  one {У вас # сообщение}
  few {У вас # сообщения}
  many {У вас # сообщений}
  other {У вас # сообщения}
}

The plural categories — zero, one, two, few, many, other — are defined by the CLDR (Unicode Common Locale Data Repository). CLDR is the authoritative data source for locale-specific rules: plural categories, number formatting patterns, date ordering, and more.
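To show how a count maps to a category, the sketch below hand-codes the CLDR integer rules for English, Russian, and Arabic and uses them to pick a form. It is a teaching aid under stated assumptions: it accepts non-negative integers only (so the Russian other category, which covers fractions, is never selected), it replaces # with a plain str() instead of locale-aware number formatting, and format_plural is a hypothetical helper, not an ICU API. Real projects should take these rules from CLDR data through a library.

```python
from typing import Callable, Dict


def plural_category_en(n: int) -> str:
    return "one" if n == 1 else "other"


def plural_category_ru(n: int) -> str:
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"
    return "many"  # every remaining integer; "other" is reserved for fractions


def plural_category_ar(n: int) -> str:
    mod100 = n % 100
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    if n == 2:
        return "two"
    if 3 <= mod100 <= 10:
        return "few"
    if 11 <= mod100 <= 99:
        return "many"
    return "other"


RULES: Dict[str, Callable[[int], str]] = {
    "en": plural_category_en,
    "ru": plural_category_ru,
    "ar": plural_category_ar,
}


def format_plural(locale: str, forms: Dict[str, str], count: int) -> str:
    """Pick the form for this count, falling back to the mandatory 'other' form."""
    category = RULES[locale](count)
    template = forms.get(category, forms["other"])
    return template.replace("#", str(count))


RU_MESSAGES = {
    "one": "У вас # сообщение",
    "few": "У вас # сообщения",
    "many": "У вас # сообщений",
    "other": "У вас # сообщения",
}
for count in (1, 3, 5, 11, 21):
    print(format_plural("ru", RU_MESSAGES, count))
```

Note that 21 selects the one form in Russian while 11 selects many, which is exactly the kind of rule a count-based key scheme cannot express.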

“We switched to ICU MessageFormat to handle plural rules correctly across all 20 target locales. Previously we had a hardcoded count-based key system that was already breaking in Polish and would never have worked for Arabic.”


Translation Files and TMS Vocabulary

XLIFF

XLIFF (XML Localization Interchange File Format) is the standard exchange format for localization data. It is maintained by OASIS, and the 2.x line is also published as ISO 21720. It structures source strings alongside their translations in a portable XML format that most TMS platforms can consume and produce.

A minimal <trans-unit> element, the per-string record inside an XLIFF 1.2 file (XLIFF 2.x uses <unit> and <segment> instead), looks like this:

<trans-unit id="nav.home">
  <source>Home</source>
  <target state="translated">Accueil</target>
</trans-unit>

The state attribute tracks progress. XLIFF 1.2 defines values such as new, needs-translation, translated, and final, among others. In engineering conversations, you’ll hear “how many units are untranslated” as a launch readiness metric.
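That readiness metric can be computed with a few lines of standard-library XML parsing. The sketch below assumes XLIFF 1.2 (hence the namespace URI) and a made-up three-unit sample file; counting a unit with no <target> element as new is a choice made for this example, not something the format mandates.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# XLIFF 1.2 namespace; XLIFF 2.x uses a different namespace and element names.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical sample file for illustration.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="app.json" source-language="en-US" target-language="fr-FR" datatype="plaintext">
    <body>
      <trans-unit id="nav.home">
        <source>Home</source>
        <target state="translated">Accueil</target>
      </trans-unit>
      <trans-unit id="nav.settings">
        <source>Settings</source>
        <target state="needs-translation"/>
      </trans-unit>
      <trans-unit id="nav.help">
        <source>Help</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def state_counts(xliff_text: str) -> Counter:
    """Count trans-units by the state of their target element."""
    root = ET.fromstring(xliff_text)
    counts: Counter = Counter()
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        target = unit.find("x:target", XLIFF_NS)
        if target is None:
            counts["new"] += 1  # assumption: no target yet means untouched
        else:
            counts[target.get("state", "unspecified")] += 1
    return counts


counts = state_counts(SAMPLE)
done = counts["translated"] + counts["final"]
print(f"{done} of {sum(counts.values())} units translated")
```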

Translation Memory

Translation memory (TM) is a database of previously translated source-target string pairs. When a new string is submitted for translation, the TMS checks the TM for similar prior translations:

  • 100% match: the string is identical to a previously translated string; the stored translation is applied automatically, usually at no or sharply reduced cost
  • Fuzzy match: the string is similar but not identical, scoring above a configurable similarity threshold (75% is a common floor, with 85–94% and 95–99% often priced as separate bands); a translator reviews and adjusts the suggestion rather than translating from scratch
  • Repetitions: the same string appears multiple times in the source files; it is translated once and the result is reused

TM reduces cost and enforces consistency. A phrase translated one way in the app’s onboarding flow should appear identically in the help documentation.
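The lookup logic can be approximated in a few lines. In this sketch, difflib's SequenceMatcher ratio stands in for a TMS similarity score; commercial tools use their own, usually word-based, algorithms, so the numbers here will not match any particular product. The 0.75 threshold, the function name, and the sample memory are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def best_match(source: str, memory: Dict[str, str],
               threshold: float = 0.75) -> Tuple[str, Optional[str], float]:
    """Classify the closest TM entry as 'exact', 'fuzzy', or 'none'."""
    best_score, best_target = 0.0, None
    for tm_source, tm_target in memory.items():
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    if best_score == 1.0:
        return "exact", best_target, best_score
    if best_score >= threshold:
        return "fuzzy", best_target, best_score
    return "none", None, best_score


# Hypothetical translation memory: English source -> French target.
memory = {
    "Save changes": "Enregistrer les modifications",
    "Delete account": "Supprimer le compte",
}
for text in ("Save changes", "Save your changes", "Zoom"):
    kind, suggestion, score = best_match(text, memory)
    print(f"{text!r}: {kind} ({score:.0%}) -> {suggestion}")
```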

TMS Tools

A TMS (Translation Management System) orchestrates the localization workflow: file ingestion, translator assignment, TM lookup, machine translation drafting, QA checks, and export back to the repository. The major platforms in engineering contexts are Phrase (whose developer-facing product, Phrase Strings, was formerly PhraseApp), Crowdin, and Lokalise.

Key TMS features referenced in technical conversations:

  • GitHub integration — XLIFF or JSON files are automatically pulled from and pushed back to the repository, eliminating manual file handling
  • MT + human review workflow — a machine translation engine (DeepL, Google, or a custom model) drafts a translation; a human translator reviews and confirms it
  • Glossary enforcement — product-specific terms (brand names, feature names) are locked to approved translations across all strings
  • QA checks — automated checks for missing placeholders, inconsistent punctuation, number format mismatches, and overly long translated strings that will break the UI layout
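As a rough illustration of the QA checks in the last bullet, the sketch below compares placeholders between source and target and flags translations that exceed a length budget. The regular expression covers only simple {name} and printf-style %s / %d placeholders, the 1.5x budget is an arbitrary value picked for the example, and check_unit is a hypothetical helper; real TMS checks are more thorough and configurable.

```python
import re
from collections import Counter
from typing import List

# Matches {name}, %s, %d, and positional forms such as %1$s. ICU plural or
# select blocks are out of scope for this sketch.
_PLACEHOLDER_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}|%(?:\d+\$)?[sd]")


def placeholders(text: str) -> Counter:
    return Counter(_PLACEHOLDER_RE.findall(text))


def check_unit(source: str, target: str, max_expansion: float = 1.5) -> List[str]:
    """Return a list of human-readable QA issues for one translated string."""
    issues = []
    missing = placeholders(source) - placeholders(target)
    extra = placeholders(target) - placeholders(source)
    if missing:
        issues.append("missing placeholders: " + ", ".join(sorted(missing)))
    if extra:
        issues.append("unexpected placeholders: " + ", ".join(sorted(extra)))
    if len(target) > len(source) * max_expansion:
        issues.append("target exceeds length budget")
    return issues


print(check_unit("Hello, {name}!", "Bonjour, {name} !"))
print(check_unit("You have %d messages", "Vous avez des messages"))
print(check_unit("Save", "Enregistrer"))
```

A fixed ratio is a blunt instrument for short labels, which is why the third call flags a perfectly good translation; a real check would more likely use per-string pixel or character limits.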

Pseudo-localization

Pseudo-localization is a development technique for testing i18n infrastructure before any real translation exists. The pseudo-locale replaces source characters with visually distinct accented or extended equivalents and wraps strings in brackets:

Hello, World! → [Ĥéļļö, Ŵöŗļð!]

This single technique catches four classes of localization bugs mechanically:

  1. Hardcoded strings: any string that bypasses the localization pipeline stays in plain, unaccented English, so it stands out against the pseudo-localized strings around it
  2. Layout breakage from text expansion: translated strings are often 30–50% longer than English, and short labels can grow by more; the pseudo-locale pads or lengthens each string to simulate this expansion, surfacing truncation and overflow bugs immediately
  3. String concatenation bugs: dynamically assembled strings (e.g., "You have " + count + " messages") cannot be correctly localized because the translator cannot reorder the fragments; each fragment is wrapped in its own brackets, so brackets appearing mid-sentence make these bugs visually obvious
  4. RTL rendering bugs: pseudo-locales can include Unicode bidirectional override characters to simulate right-to-left layout, catching RTL-specific rendering regressions
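A pseudo-locale generator covering the first three bug classes can be sketched as follows. The character map, the 40% padding default, and the ~ padding character are arbitrary choices for this example, and only simple {name} placeholders are protected from substitution; the RTL simulation is left out.

```python
import math
import re

# Partial accent map; unmapped characters pass through unchanged.
_ACCENTS = str.maketrans({
    "a": "å", "c": "ç", "d": "ð", "e": "é", "i": "î", "l": "ļ", "n": "ñ",
    "o": "ö", "r": "ŗ", "s": "š", "u": "û", "w": "ŵ", "y": "ý",
    "A": "Å", "E": "É", "H": "Ĥ", "I": "Î", "O": "Ö", "U": "Û", "W": "Ŵ",
})

# Capturing group so re.split keeps the placeholders in its result.
_PLACEHOLDER_RE = re.compile(r"(\{[^{}]*\})")


def pseudo_localize(text: str, expansion: float = 0.4) -> str:
    """Accent the text, pad it to simulate expansion, and wrap it in brackets."""
    parts = _PLACEHOLDER_RE.split(text)
    converted = [
        part if _PLACEHOLDER_RE.fullmatch(part) else part.translate(_ACCENTS)
        for part in parts
    ]
    padding = "~" * math.ceil(len(text) * expansion)
    return "[" + "".join(converted) + padding + "]"


print(pseudo_localize("Hello, World!"))
print(pseudo_localize("Welcome back, {name}"))
```

Leaving placeholders untouched matters: if {name} were accented too, the runtime substitution would fail and mask the bugs the pseudo-locale is meant to reveal.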

“We run the pseudo-locale as part of our CI visual regression suite — it catches layout regressions and hardcoded strings before real translation work begins. It’s saved us from at least three embarrassing production bugs.”


Localization engineering is a discipline where precise vocabulary directly affects implementation quality: a team that conflates i18n and l10n will architect systems that solve only half the problem. For practice, explore the CoderLingo localization vocabulary exercises and i18n phrasebook to reinforce these terms in context.