Internationalization Headaches - Part 1

Working on the CATS project has taken me from developing primarily English software into a whole new realm of excitement: internationalization and localization (i18n / L10n). Suddenly I’ve got people from 120 countries (and not a handful, hundreds of paying customers!) wanting to see full support for their native tongues.

I could probably talk for hours about the enormous effort it took to bring CATS to the level of i18n support it has today; instead, I’m going to talk about the top 10 headaches I ran into.

Before I get started: if you’re looking at adopting i18n / L10n, either pre-development or on an existing project, UTF-8 is the way to go. There are alternatives, but unless your user base is overwhelmingly non-Latin, stop looking. UTF-8 is backwards compatible with ASCII, it supports just about everything (and is supported by just about everything) and it’s the best thing since sliced bread.
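To make the "backwards compatible with ASCII" claim concrete, here is a minimal sketch that dumps the raw bytes of two strings. The variable names are mine, and the "\u{...}" escape assumes PHP 7 or newer; nothing here is specific to CATS.

```php
<?php
// Plain ASCII text: every character is a single byte, identical in ASCII and UTF-8.
$ascii = 'umlaut';

// The same word with a capital U-umlaut (U+00DC), written as an escape
// so the example does not depend on how this file itself is saved.
$accented = "\u{00DC}mlaut";

echo bin2hex($ascii), "\n";    // 756d6c617574
echo bin2hex($accented), "\n"; // c39c6d6c617574 (the umlaut takes two bytes: c3 9c)
echo strlen($accented), "\n";  // 7, because strlen() counts bytes, not characters
```

The ASCII letters keep their one-byte values in both strings; only the non-ASCII character grows to two bytes. That is also why byte-oriented functions like strlen() can surprise you once real UTF-8 data shows up.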

Now let’s get started!

1) My umlaut looks like a question mark in a fancy diamond!

This is the first step: declare the encoding on all of your rendered HTML pages. Hopefully, you use a CMS or have a single header file where you can add this to the top of your pages, between the <head> and </head> tags:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

The handful of bytes you save by serving ISO-8859-1 isn’t worth it. Get on the bandwagon and start implementing UTF-8, even if you don’t need it yet. There’s a reason the IETF (read: Internet Police) requires that all Internet protocols be able to use it.

Once you cover the HTML, don’t forget about other content types. Make sure that your Ajax, RSS feeds and XML responses all include the UTF-8 identifiers or there will be some jumbling going on.
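For the non-HTML responses, the "UTF-8 identifiers" usually mean a charset on the Content-Type header and, for XML or RSS, an encoding in the XML declaration. Here is a rough sketch of how that might look in PHP. The helper names utf8ContentType(), xmlDeclaration() and jsonBody() are made up for illustration, and the code assumes PHP 7.3 or newer.

```php
<?php
// Hypothetical helper: builds a Content-Type header value that always names UTF-8.
function utf8ContentType(string $mimeType): string
{
    return $mimeType . '; charset=UTF-8';
}

// Hypothetical helper: the declaration that should open every XML or RSS body.
function xmlDeclaration(): string
{
    return '<?xml version="1.0" encoding="UTF-8"?>';
}

// Hypothetical helper: encodes an Ajax payload as JSON and keeps non-ASCII
// characters as real UTF-8 instead of \uXXXX escape sequences.
function jsonBody(array $data): string
{
    return json_encode($data, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
}

// Example Ajax endpoint. header() has to run before any output is sent,
// and it is skipped here when the script runs from the command line.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: ' . utf8ContentType('application/json'));
}
echo jsonBody(['word' => "\u{00DC}mlaut"]), "\n";
```

For an XML or RSS response, the same pattern would send utf8ContentType('application/xml') or utf8ContentType('application/rss+xml') and start the body with xmlDeclaration().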

2) My XML or HTML doesn’t validate! It says invalid entity, but the document declares encoding="UTF-8" or includes the above tag, and the character is valid UTF-8!

Unless your DTD declares the named entities you’re using (such as &Uuml;), you’re going to get yelled at during validation. The whole point of encoding="UTF-8" is that you don’t need entities. Luckily, this is an easy fix in PHP. Use the built-in html_entity_decode function to turn those entities into their actual characters:

$value = html_entity_decode('fancy &Uuml;', ENT_COMPAT, 'UTF-8');
// $value is now 'fancy Ü'

Just run your string data through it prior to exporting it to your XML writer. On a side note, if you haven’t noticed, my examples of Unicode data almost always include one of my favorite words: umlaut.
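One way the decode-then-export step could look, as a sketch rather than the actual CATS code: the function name entityFreeXml() is hypothetical, and I’m assuming PHP’s bundled XMLWriter extension is what does the writing. Note that html_entity_decode also turns &amp; back into a bare ampersand, so the writer has to re-escape it, which XMLWriter does on its own.

```php
<?php
// Hypothetical helper: decodes HTML entities in $htmlText and wraps the
// result in a single XML element, returning the whole document as a string.
function entityFreeXml(string $elementName, string $htmlText): string
{
    // Turn named and numeric HTML entities into real UTF-8 characters.
    // ENT_QUOTES decodes both double- and single-quote entities.
    $text = html_entity_decode($htmlText, ENT_QUOTES, 'UTF-8');

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument('1.0', 'UTF-8'); // writes the XML declaration with the encoding
    // writeElement() escapes &, < and > in the text for us,
    // so the decoded string stays well-formed XML.
    $writer->writeElement($elementName, $text);
    $writer->endDocument();

    return $writer->outputMemory();
}

echo entityFreeXml('note', 'fancy &Uuml;mlaut &amp; friends');
```

The &Uuml; comes out as a literal Ü, while the ampersand is written back as &amp;, the one entity every XML parser knows without a DTD.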

Stay tuned for more “Internationalization Headaches” from Andrew Kandels!