This post is about the dangers of tying a specification, protocol or application to a specific version of Unicode.

For example, I was in a discussion last week about XML, and the problems caused by the fact that XML 1.0 is currently tied to a specific version of Unicode, and a very old version at that (2.0). This affects what characters you can use for things such as element and attribute names, enumerated lists for attribute values, and ids. Note that I’m not talking about the content, just those names.

I spoke about this at a W3C Technical Plenary some time back in terms of how this bars people from using certain aspects of XML applications in their own language if they use scripts that have been added to Unicode since version 2.0. This includes over 150 million people speaking languages written with Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

This means, for example, that if your language is written with one of these scripts, and you write some XHTML that you want to be valid (so you can use it with AJAX or XSLT, etc.), you can’t use the same language for an id attribute value as for the content of your page. (Try validating this page now. The previous link used some Ethiopic for the name and id attribute values.)
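As a rough illustration of the difference, here is a minimal Python sketch of the broad, script-independent NameStartChar ranges that XML 1.1 adopted in order to sidestep exactly this problem. An Ethiopic letter sails through these ranges; it simply does not appear anywhere in the Unicode 2.0-derived, script-by-script whitelist that XML 1.0 uses for names and ids. (This is a sketch of the ranges only, not the full name grammar.)

    # NameStartChar ranges from XML 1.1 (a sketch, not the full grammar)
    NAME_START_RANGES = [
        (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
        (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
        (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
        (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
        (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
    ]

    def is_name_start(ch):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)

    # U+1200 ETHIOPIC SYLLABLE HA: fine as the start of a name or id under
    # the XML 1.1 rules, but absent from the Unicode 2.0-based tables of XML 1.0.
    print(is_name_start('\u1200'))   # True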

But there’s another issue that hasn’t received so much press – and yet I think that, in its own way, it can be just as problematic. Scripts that were already supported by Unicode 2.0 have not stood still, and additional characters are added to them with every new Unicode release. In some cases these characters see very general use. Take, for example, the Bengali character U+09CE BENGALI LETTER KHANDA TA.

With the release of Unicode 4.1 this character was added to the standard, with a clear admonition that it, rather than the workaround people had previously been using, should be the form used in text from then on.

This is not a rarely used character. It is a common part of the alphabet. Put Bengali in a link and you’re generally ok. Include a khanda ta letter in it, though, and you’re in trouble. It’s as if English speakers could use any word in an id, as long as it didn’t have a ‘q’ in it. It’s a recipe for confusion and frustration.
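For the curious, the workaround mentioned above was the sequence TA + VIRAMA + ZERO WIDTH JOINER. The new letter is not canonically equivalent to that sequence, so normalization will not quietly map one spelling onto the other: real-world Bengali data will contain U+09CE itself. A quick check with Python's unicodedata module makes the point.

    import unicodedata

    khanda_ta = '\u09CE'               # BENGALI LETTER KHANDA TA, new in Unicode 4.1
    workaround = '\u09A4\u09CD\u200D'  # TA + VIRAMA + ZWJ, the pre-4.1 spelling

    # Not canonically equivalent: normalization leaves both forms alone.
    print(unicodedata.normalize('NFC', workaround) == khanda_ta)   # False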

Similar, but much more far-reaching, changes will be introduced to the Myanmar script (used for Burmese) in the upcoming version 5.1. Unlike the khanda ta, these changes will affect almost every word. So if your application or protocol froze its Unicode support at a version between 3.0 and 5.0, like IDNA (which is pinned to Unicode 3.2), you will suddenly be disenfranchising Burmese users who had been perfectly happy until now.
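You can see the effect of that freeze from Python, whose standard-library stringprep module contains the RFC 3454 tables that IDNA builds on. Table A.1 lists the code points that were unassigned in Unicode 3.2, and a stored IDNA label containing any of them has to be rejected – which is exactly where the new Myanmar characters land. A small sketch (the character choices are mine):

    import stringprep

    # U+103A MYANMAR SIGN ASAT is one of the Unicode 5.1 Myanmar additions;
    # it was unassigned in Unicode 3.2, so it falls in RFC 3454 table A.1
    # and cannot appear in a stored IDNA label.
    print(stringprep.in_table_a1('\u103A'))   # True

    # U+1000 MYANMAR LETTER KA has been around since Unicode 3.0, so it is fine.
    print(stringprep.in_table_a1('\u1000'))   # False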

Here are a few more examples (provided by Ken Whistler) of characters added to Unicode after the initial script adoption that will raise eyebrows for people who speak the relevant language:

  • 01F9 LATIN SMALL LETTER N WITH GRAVE: shows up in NFC pinyin data for Chinese (see the sketch after this list).
  • 0219 LATIN SMALL LETTER S WITH COMMA BELOW: Romanian data.
  • 0450 CYRILLIC SMALL LETTER IE WITH GRAVE: Macedonian in NFC.
  • 0653..0655 Arabic combining maddah and hamza: now implicated in the NFC normalization of common Arabic letters.
  • 0972 DEVANAGARI LETTER CANDRA A: Marathi.
  • 097B DEVANAGARI LETTER GGA: Sindhi.
  • 0B35 ORIYA LETTER VA: Oriya.
  • 0BB6 TAMIL LETTER SHA: Needed to spell sri.
  • 0D7A..0D7F Malayalam chillu letters: Those will be ubiquitous in Malayalam data, post Unicode 5.1.
  • and a bunch of Chinese additions.
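To make the normalization point in that list concrete: NFC composes perfectly ordinary keystrokes into some of these later additions, so a system pinned to an old Unicode version will run into them whether or not its users ever type them directly. A quick sketch with Python's unicodedata module (the character choices are mine):

    import unicodedata

    # "n" + combining grave composes to U+01F9 (the pinyin case), and
    # Cyrillic "е" + combining grave composes to U+0450 (the Macedonian case);
    # both letters were only added to Unicode in version 3.0.
    print('U+%04X' % ord(unicodedata.normalize('NFC', 'n\u0300')))        # U+01F9
    print('U+%04X' % ord(unicodedata.normalize('NFC', '\u0435\u0300')))   # U+0450

    # And NFD of the everyday Arabic letter alef with madda above (U+0622)
    # involves U+0653, one of the combining marks listed above.
    print(['U+%04X' % ord(c) for c in unicodedata.normalize('NFD', '\u0622')])
    # ['U+0627', 'U+0653']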

So the moral is this: decouple your application, protocol or specification from a specific version of the Unicode Standard. Allow people to use new characters as they are added to the standard, and users all around the world will thank you.