
>> Use it !


The default arrangement for this picker is still shape-based (though with some small improvements), but I have added a new view that is arranged by sound.

Update: After some initial feedback, I decided to change the phonic view of the picker so that vowels are entered with a single click. This will probably disconcert people familiar with typing Thai. Revised description follows.

Another update (2008-03-03): I have added additional ways of viewing the characters, and re-architected the picker as a basis for extending this to other pickers in the future. I also changed the way of dealing with initial clusters in the phonic view. I changed the text below again to reflect what’s new:

Alphabetic view: By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Obsolete and rare characters are only displayed if you click on the grey arrow, top right. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.

Comparison view: This was the original view for the Thai picker. Characters are grouped by shape or type to enable easy identification by people who are unfamiliar with the Thai script. Vowels are shown near the bottom. Digits are on the right, in keypad order.

Phonic view: Characters are grouped and ordered by sound. I set this up for myself, because I wanted to enter Thai text that was accompanied by a transcription.

Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)

Dashes representing consonants indicate which vowels are non-final or occur before the consonant.

Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Thai, however, since it makes Thai behave like Khmer and Indic scripts. You should add any tone mark before the vowel and the picker will automatically reorder characters as needed.

If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.
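The reordering described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not the picker's actual code: the function name `add_vowel`, the `cluster` flag and the three character classes are my own assumptions, and it only handles vowels made of a single code point (the picker also deals with multi-part vowels).

```python
# Hypothetical sketch of the phonic-view reordering; not the picker's real code.
PRE_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # sara e, ae, o, ai maimuan, ai maimalai
ABOVE_BELOW = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")  # combining vowels
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")  # mai ek, mai tho, mai tri, mai chattawa


def add_vowel(buffer: str, vowel: str, cluster: bool = False) -> str:
    """Add a clicked vowel to the text, reordering around the syllable start.

    Assumes buffer ends with the syllable-initial consonant(s), optionally
    followed by one tone mark. cluster=True stands in for 'flag as cluster':
    the last two consonants are treated as a single unit.
    """
    tone = ""
    if buffer and buffer[-1] in TONE_MARKS:
        buffer, tone = buffer[:-1], buffer[-1]
    initial_len = 2 if cluster else 1
    head, initial = buffer[:-initial_len], buffer[-initial_len:]
    if vowel in PRE_VOWELS:
        # Stored before the consonant(s), so move it in front of them.
        return head + vowel + initial + tone
    if vowel in ABOVE_BELOW:
        # Combining vowels are stored before the tone mark.
        return head + initial + vowel + tone
    # Following vowels stay after the tone mark.
    return head + initial + tone + vowel
```

For instance, `add_vowel("\u0e01\u0e25", "\u0e40")` puts the vowel between the two consonants, whereas passing `cluster=True` wraps it around both.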

Font grid view: Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in any one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, comparison, phonic or fontgrid.
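As a rough illustration of how such a parameter might be read, here is a minimal Python sketch. The function name, the example address and the choice of `alphabet` as the fallback are assumptions for the sake of the example; the picker itself runs in the browser and may well handle this differently.

```python
from urllib.parse import parse_qs, urlsplit

VIEWS = ("alphabet", "comparison", "phonic", "fontgrid")
DEFAULT_VIEW = "alphabet"  # assumed fallback, not confirmed by the picker


def initial_view(uri: str) -> str:
    """Return the view named by ?view=..., or the default if absent or unknown."""
    query = parse_qs(urlsplit(uri).query)
    requested = query.get("view", [DEFAULT_VIEW])[0]
    return requested if requested in VIEWS else DEFAULT_VIEW


# Hypothetical address, used only for illustration.
print(initial_view("https://example.org/pickers/thai/?view=phonic"))
```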


>> Use it !


This latest picker includes characters used for writing Vietnamese. Characters are taken from various Latin Unicode blocks.

Tones are separated from base characters in the selection area, but the output you create is always fully precomposed. If you copy and paste text into the output area, you can normalize the Vietnamese text as NFC by selecting the tab below. The Vietnamese text in the output area is also normalized when you select one of the transcription tabs.
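If you want to reproduce the NFC step outside the picker, a minimal sketch using Python's standard `unicodedata` module looks like this. The function name `to_precomposed` is my own; the picker itself does this in the browser.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Normalize text to NFC, so each base letter and its marks become one character where possible."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by a separate dot below and a separate circumflex.
decomposed = "Vie\u0323\u0302t"
composed = to_precomposed(decomposed)
print(len(decomposed), len(composed))
```

The order in which the two combining marks were typed does not matter here, because normalization sorts them into canonical order before composing.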

The IPA N and IPA S tabs provide a basic, mostly phonemic-level, transcription of the pronunciation. N means North Vietnamese, S is for South. The sources I used for this varied a great deal, particularly in the choice of symbols to represent vowels. There are also more than two main dialects. So this is a synthesis and a rough guide. Some rare vowel combinations may be missing, although I have covered quite a number.

There are a large number of UVN fonts – so many that I didn’t know which ones to pick for the font pulldown. I chose the two that show up on Alan Wood’s page. If you think certain others are so common that they ought to be there, please let me know.


This post is about the dangers of tying a specification, protocol or application to a specific version of Unicode.

For example, I was in a discussion last week about XML, and the problems caused by the fact that XML 1.0 is currently tied to a specific version of Unicode, and a very old version at that (2.0). This affects what characters you can use for things such as element and attribute names, enumerated lists for attribute values, and ids. Note that I’m not talking about the content, just those names.

I spoke about this at a W3C Technical Plenary some time back in terms of how this bars people from using certain aspects of XML applications in their own language if they use scripts that have been added to Unicode since version 2.0. This includes over 150 million people speaking languages written with Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

This means, for example, that if your language is written with one of these scripts, and you write some XHTML that you want to be valid (so you can use it with AJAX or XSLT, etc.), you can’t use the same language for an id attribute value as for the content of your page. (Try validating this page now. The previous link used some Ethiopic for the name and id attribute values.)

But there’s another issue that hasn’t received so much press – and yet I think, in its own way, it can be just as problematic. Scripts that were supported by Unicode 2.0 have not stood still, and additional characters are being added to such scripts with every new Unicode release. In some cases these characters will see very general use. Take, for example, the Bengali character U+09CE BENGALI LETTER KHANDA TA.

With the release of Unicode 4.1 this character was added to the standard, with a clear admonition that it should in future be used in text, rather than the workaround people had been using previously.

This is not a rarely used character. It is a common part of the alphabet. Put Bengali in a link and you’re generally ok. Include a khanda ta letter in it, though, and you’re in trouble. It’s as if English speakers could use any word in an id, as long as it didn’t have a ‘q’ in it. It’s a recipe for confusion and frustration.

Similar, but much more far-reaching, changes will be introduced to the Myanmar script (used for Burmese) in the upcoming version 5.1. Unlike the khanda ta, these changes will affect almost every word. So if your application or protocol froze its Unicode support at a version between 3.0 and 5.0, like IDNA, you will suddenly be disenfranchising Burmese users who had been perfectly happy until now.
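To make the effect of freezing concrete, here is a small Python sketch. The table is hand-built and deliberately tiny (three entries taken from the examples in this post), and the function name is hypothetical; a real check would need the full character age data from the Unicode Character Database.

```python
# Illustrative table only: (first code point, last code point, version that added it).
ADDED_IN = [
    (0x09CE, 0x09CE, (4, 1)),  # BENGALI LETTER KHANDA TA
    (0x0D7A, 0x0D7F, (5, 1)),  # Malayalam chillu letters
    (0x1200, 0x1206, (3, 0)),  # first Ethiopic syllables (ha..ho)
]


def rejected_characters(name: str, frozen_at: tuple) -> list:
    """List the characters in name that a spec frozen at frozen_at does not know.

    Characters missing from the table are assumed to predate every version,
    which is only true for this toy example.
    """
    rejected = []
    for ch in name:
        for first, last, version in ADDED_IN:
            if first <= ord(ch) <= last and version > frozen_at:
                rejected.append(ch)
    return rejected


# A spec frozen at Unicode 3.2 rejects khanda ta; one that tracks 5.1 accepts it.
print(rejected_characters("\u09ce", (3, 2)), rejected_characters("\u09ce", (5, 1)))
```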

Here are a few more examples (provided by Ken Whistler) of characters added to Unicode after the initial script adoption that will raise eyebrows for people who speak the relevant language:

  • 01F9 LATIN SMALL LETTER N WITH GRAVE: shows up in NFC pinyin data for Chinese.
  • 0653..0655 Arabic combining maddah and hamza: Implicated in NFC normalization of common Arabic letters now.
  • 0B35 ORIYA LETTER VA: Oriya.
  • 0BB6 TAMIL LETTER SHA: Needed to spell sri.
  • 0D7A..0D7F Malayalam chillu letters: Those will be ubiquitous in Malayalam data, post Unicode 5.1.
  • and a bunch of Chinese additions.
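The normalization cases in this list are easy to check for yourself. The following Python sketch uses the standard `unicodedata` module, which relies on whatever version of the Unicode data ships with your interpreter; the variable names are just for illustration.

```python
import unicodedata

# 'n' plus a combining grave accent, as it might appear in decomposed pinyin data.
pinyin_n = unicodedata.normalize("NFC", "n\u0300")

# ARABIC LETTER ALEF WITH MADDA ABOVE, taken apart into its canonical pieces.
alef_madda_parts = unicodedata.normalize("NFD", "\u0622")

print([f"U+{ord(c):04X}" for c in pinyin_n])
print([f"U+{ord(c):04X}" for c in alef_madda_parts])
```

So text that contained only long-established characters before normalization can contain U+01F9 afterwards, and a very common Arabic letter involves U+0653 as soon as it is decomposed.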

So the moral is this: decouple your application, protocol or specification from a specific version of the Unicode Standard. Allow new characters to be used by people as they come along, and users all around the world will thank you.
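One way to do that, sketched here in Python, is to define what counts as a name in terms of character properties rather than a fixed list of code points. The rule below is a made-up example, not the rule of any particular specification, and `is_name` is a hypothetical helper.

```python
import unicodedata


def is_name(candidate: str) -> bool:
    """Hypothetical rule: a letter or '_' first, then letters, marks, numbers, '-', '_' or '.'."""
    if not candidate:
        return False
    first = candidate[0]
    if first != "_" and unicodedata.category(first)[0] != "L":
        return False
    return all(
        ch in "-_." or unicodedata.category(ch)[0] in "LMN"
        for ch in candidate[1:]
    )


# The answer tracks the Unicode data bundled with the interpreter.
print(unicodedata.unidata_version, is_name("\u09ce"))
```

Because the test asks the Unicode database for each character's general category, letters added in later versions are accepted as soon as the underlying data is updated, with no change to the rule itself.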