Dochula Pass, Bhutan

Characters in the Unicode Bengali block.

If you’re interested, I just did a major overhaul of my script notes on Bengali in Unicode. There’s a new section about which characters to use when there are multiple options (eg. RRA vs. DDA+nukta), and the page provides information about more characters from the Bengali block in Unicode (including those used in Bengali’s amazingly complicated currency notation prior to 1957).

In addition, this has all been squeezed into the latest look and feel for script notes pages.

The new page is at a new location. There is a redirect on the old page.

Hope it’s useful.

>> Read it

Picture of the page in action.

I finally got to the point, after many long early morning hours, where I felt I could remove the ‘Draft’ from the heading of my Myanmar (Burmese) script notes.

This page is the result of my explorations into how the Myanmar script is used for the Burmese language in the context of the Unicode Myanmar block. It takes into account the significant changes introduced in Unicode version 5.1 in April of this year.

Btw, if you have JavaScript running you can get a list of characters in the examples by mousing over them. If you don’t have JS, you can follow a link to the same information.

There’s also a PDF version, if you don’t want to install the (free) fonts pointed to for the examples.

Here is a summary of the script:

Burmese is a tonal, syllable-based language. The Myanmar script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs.

Spaces are used to separate phrases, rather than words. Words can be separated with ZWSP to allow for easy wrapping of text.

Words are composed of syllables. These start with a consonant or an initial vowel. An initial consonant may be followed by a medial consonant, which adds the sound j or w. After the vowel, a syllable may end with a nasalisation of the vowel or an unreleased glottal stop, though these final sounds can be represented by various different consonant symbols.

At the end of a syllable a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel.

In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used.

Text runs from left to right.

There is a set of Myanmar numerals, which are used just like Latin digits.

So, what next? I’m quite keen to get to Mongolian. That looks really complicated. But I’ve been telling myself for a while that I ought to look at Malayalam or Tamil, so I think I’ll try Malayalam.

I’m sitting here watching a video of Timbl talking on a BBC news page and I suddenly realised how good this was.

The page design helps give the impression – there are no clunky boxes around the video itself – but there’s also no need to view in a different area, or switch to another tool, or even wait for a download to get started – it’s just there as part of the page, but a part that moves and produces sound. Kind of like the moving paper in Harry Potter’s world.

It’s great how technology marches on sometimes.

[Update: Since I wrote the above the video has acquired grey panels around the edges for controls, which I think is a shame. It’s still pretty good technology though. ]

This post is about the dangers of tying a specification, protocol or application to a specific version of Unicode.

For example, I was in a discussion last week about XML, and the problems caused by the fact that XML 1.0 is currently tied to a specific version of Unicode, and a very old version at that (2.0). This affects what characters you can use for things such as element and attribute names, enumerated lists for attribute values, and ids. Note that I’m not talking about the content, just those names.

I spoke about this at a W3C Technical Plenary some time back in terms of how this bars people from using certain aspects of XML applications in their own language if they use scripts that have been added to Unicode since version 2.0. This includes over 150 million people speaking languages written with Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

This means, for example, that if your language is written with one of these scripts, and you write some XHTML that you want to be valid (so you can use it with AJAX or XSLT, etc.), you can’t use the same language for an id attribute value as for the content of your page. (Try validating this page now. The previous link used some Ethiopic for the name and id attribute values.)

But there’s another issue that hasn’t received so much press – and yet I think, in its own way, it can be just as problematic. Scripts that were supported by Unicode 2.0 have not stood still, and additional characters are being added to such scripts with every new Unicode release. In some cases these characters will see very general use. Take, for example, the Bengali character U+09CE BENGALI LETTER KHANDA TA.

With the release of Unicode 4.1 this character was added to the standard, with a clear admonition that it should in future be used in text, rather than the workaround people had been using previously.

This is not a rarely used character. It is a common part of the alphabet. Put Bengali in a link and you’re generally ok. Include a khanda ta letter in it, though, and you’re in trouble. It’s as if English speakers could use any word in an id, as long as it didn’t have a ‘q’ in it. It’s a recipe for confusion and frustration.
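You can see the problem directly in code. The sketch below compares the new character with the workaround sequence (TA + VIRAMA + ZWJ); note that normalization does not bridge the two, so an identifier using one form will never match the other:

```javascript
// U+09CE BENGALI LETTER KHANDA TA, added in Unicode 4.1.
const khandaTa = "\u09CE";

// The earlier workaround: TA + VIRAMA + ZERO WIDTH JOINER.
const legacy = "\u09A4\u09CD\u200D";

// Khanda ta has no canonical decomposition, so NFC leaves both alone...
console.log(khandaTa.normalize("NFC") === khandaTa); // true
console.log(legacy.normalize("NFC") === legacy);     // true

// ...and the two spellings never compare equal.
console.log(khandaTa === legacy); // false
```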

Similar, but much more far reaching, changes will be introduced to the Myanmar script (used for Burmese) in the upcoming version 5.1. Unlike the khanda ta, these changes will affect almost every word. So if your application or protocol froze its Unicode support to a version between 3.0 and 5.0, like IDNA, you will suddenly be disenfranchising Burmese users who had been perfectly happy until now.

Here are a few more examples (provided by Ken Whistler) of characters added to Unicode after the initial script adoption that will raise eyebrows for people who speak the relevant language:

  • 01F9 LATIN SMALL LETTER N WITH GRAVE: shows up in NFC pinyin data for Chinese.
  • 0653..0655 Arabic combining maddah and hamza: Implicated in NFC normalization of common Arabic letters now.
  • 0B35 ORIYA LETTER VA: Oriya.
  • 0BB6 TAMIL LETTER SHA: Needed to spell sri.
  • 0D7A..0D7F Malayalam chillu letters: Those will be ubiquitous in Malayalam data, post Unicode 5.1.
  • and a bunch of Chinese additions.

So the moral is this: decouple your application, protocol or specification from a specific version of the Unicode Standard. Allow new characters to be used by people as they come along, and users all around the world will thank you.

I just read a post by Ivan Herman about how Hungary has joined the Schengen Agreement, and will soon be removing border controls on the EU side. That put me in mind of the first time I tried to pass through the Iron Curtain.

I was travelling from Vienna to Budapest (probably about 25 years ago) and I had decided to go through Sopron, rather than Hegyeshalom, so I could see something a little more off the beaten track. This was a month-long InterRail trip, so I was able to follow my whim and jump on whatever train I wanted. The train connections worked, and I found myself heading south from Vienna.

Eventually, the train passed into Hungary and stopped. I needed a visa, so I got off with a bunch of other (Hungarian looking) people, and traipsed over to a small outbuilding, where I found myself at the back of a queue of people jostling bags of various sizes and dressed and coiffed in what looked to me to be a very Eastern European fashion. Looking out of the window, everything was grey. I could see rail tracks and points and small, grey buildings but also several very tall towers with machine gun nests perched on top (quite large looking machine guns). The queue moved slowly, and I was surprised at one point to see my train pulling away and disappearing. It seemed a bit odd (and I was glad I’d brought all my stuff with me), but I figured this was probably normal, and I’d just have to catch another train.

I finally arrived at the desk and asked for a visa. The guy behind the desk started talking to me in a somewhat animated fashion, but I had no idea what he was saying. I hadn’t learned German yet, and Hungarian was completely incomprehensible to me. I kept trying to explain, politely, in English, that I needed a visa. Finally, he gave me an exasperated look and called someone out of a nearby room. The guy who emerged was huge, bald and intimidatingly business-like. (Some time later I saw the film Midnight Express, and realised that the prison guard and he could have been the same person.) He shouted at me “Nicht visa!”. And I tried to explain, in English, that, yes, I had no visa, but would like to obtain one, please. This didn’t appear to get across clearly, because he simply repeated “Nicht visa!!” several times, increasing in volume.

Finally, the tension broke and gave way to action. He motioned for me to follow him out of the building, and we started walking away across a couple of sets of railway tracks. I noticed, feeling slightly less at ease but still hopeful, that I was flanked by a soldier with a gun on either side. They weren’t exactly giving me encouraging looks, and as I glanced up at the machine gun towers and at the surrounding barbed wire, I began to wish I knew what was happening.

Soon we arrived at the end of a short train. The very last carriage of this train looked like something you’d expect to see in a Wild West film. It had a kind of standing area at each end with a railing, a door into the carriage and steps leading down to the ground on either side. I was ushered up one set of steps and into what turned out to be an empty carriage. The door was shut behind me, and within a minute or so, as I remember it, the train started moving off, in the same direction my earlier train had disappeared. So I wasn’t just being sent back across the border.

That last realisation started to trouble me a little, since I still had no visa and no idea what was happening. It didn’t help that there was a small round window in the door at each end of the carriage, through which I could see guards sitting on the steps at each corner, all holding machine guns at the ready. As the towers slid away behind us, night started to fall.

Twenty-five years has dulled the memory of some of what happened next, but eventually I got off at a small station, having reached the end of the line. The guards were gone, and the station turned out to be quite modern and clean looking. I still couldn’t understand anything anyone was saying, so I still had no idea where I was, but I was able to figure out that I was somehow back in Austria. It was much later that I was to realise that Sopron is on a peninsula that sticks into Austria, and I had come in one side and been sent out the other.

I slept that night on the floor of the main station building, and the next morning set off to find someone who spoke English and could tell me where I was – and just as importantly how to get into Hungary. The town was quite small, maybe just a village. In spite of that it took me a while, but I eventually came across a chap in a supermarket who was able to explain to me that visas are not issued on entry into Hungary by train via Sopron. I was ahead of him there. He also offered to drive me to the border, telling me that I would be able to get a visa at the road entry point.

It’s nice to think about that person whenever I relive this story. He really went out of his way, leaving work to assist a complete stranger, with no fuss or thought for reward. I wonder whether he remembers me. I doubt it. Of course, these days he may even be reading this blog post…

So it was that, eventually, I got the stamp in my passport that I needed, and somehow found my way onto another train heading for Budapest. Well, it wasn’t quite the end of the fun. That continued when I tried to meet up with my father in the capital. But that, as they say, is another story…

Tim Greenwood just pointed out to me a ‘bug’ in my converter program, which I think is actually a bug in Firefox (although I imagine it was implemented by someone as a feature).

If you type A0 (the hex code for a non-breaking space) in the Hexadecimal code points field, then press Convert, you will get a blank space in the Characters field that should be U+00A0 NO-BREAK SPACE. Then press Convert or View Names above this Characters field and you’ll find that what was supposed to be a NBSP has changed into an ordinary space. IE7, Opera and Safari all continue to show the character in the field as a NBSP.

(However, all four browsers substitute an ordinary space when you copy and paste the text from the Characters field into something else.)

I tried this with a range of other types of space, but saw no such behaviour (try it). They all remained themselves.

Anyone know what this is about?

The word Mandalay in Myanmar script.

I’ve been brushing up on the Myanmar script, since major changes are on the way with Unicode 5.1.

I upgraded my Myanmar picker to handle the new characters, and edited my notes on how the script works.

The new characters will make a big difference to how you author text in Unicode, and people will need to update currently existing pages to bring them in line with the new approach. The changes should make it much easier to create content in Burmese, in addition to addressing some niggly problems with making the script work correctly. One reason the changes were sanctioned is that there is currently very little Burmese content out there in Unicode.

I’ll be updating my character by character notes later too.

The only problem with all this is that existing fonts will all need to be changed to support the new world order (or Myanmar order). I found one font that is already 5.1-ready, from the Myanmar Unicode & NLP Research Center. So if you don’t want to download that font, you’ll need to read the PDF version of my notes on the script.

That would be a pity, however, since I had some fun adding JavaScript to the article today, so that it displays a breakdown, character by character, of each example as you mouse over it (using images, so you see it properly).

I’m at the ITS face-to-face meeting in Prague, Czech Republic and I’ve been trying to learn to read Czech words. Jirka Kosek showed me a Czech tongue-twister last night at dinner.

Strč prst skrz krk.

How amazing is that? A whole sentence without vowels! (Means “Put your finger down your throat.” – I’m wondering whether that has something to do with the missing vowels…)

See a video of Jirka pronouncing it.

Multiple scripts in XMetal’s tags-on view (click to enlarge).

I received a query from someone asking:

I try to edit lao and thai text with XMetal 5.0, but nothing is displayed but squares. In fact, Unicode characters seems to be correctly saved in the XML file and displayed in Firefox (for example), but i can’t get a correct display in XMetal. Is it a font problem ?

There are two places this needs to be addressed:

  1. in the plain text view
  2. in the tags-on view

For the plain text view, it is a question of setting a font that shows Lao and Thai (or whatever other language/script you need) in Tools>Options>Plain Text View>Font. You can only set one font at a time, so a wide ranging Unicode font like Arial Unicode MS or Code2000 may be useful for Windows users.

For the tags-on view (which is the view I use most of the time) you need to edit the CSS file that sets the editor’s styling for the DOCTYPE you are working with. This may be in one of a number of places. The one I edit is C:\Program Files\Blast Radius\XMetaL 4.6\Author\Display\xhtml1-transitional.css.

I added the following to mine. I chose fonts I have on my PC and sets font sizes relative to the size I set for my body element. You should, of course, choose your own fonts and sizes.

[lang="am"] { font-family: "Code2000", serif; font-size: 120%; }
[lang="ar"] { font-family: "Traditional Arabic", sans-serif; font-size: 200%; }
[lang="bn"] { font-family: SolaimanLipi, sans-serif; font-size: 200%; }
[lang="dz"] { font-family: "Tibetan Machine Uni", serif; font-size: 140%; }
[lang="he"] { font-family: "Arial Unicode MS", sans-serif; font-size: 120%; }
[lang="hi"] { font-family: Mangal, sans-serif; font-size: 120%; }
[lang="kk"] { font-family: "Arial Unicode MS", sans-serif; }
[lang="iu"] { font-family: Pigiarniq, Uqammaq, sans-serif; font-size: 120%; }
[lang="ko"] { font-family: Batang, sans-serif; font-size: 120%; }
[lang="ne"] { font-family: Mangal, sans-serif; font-size: 120%; }
[lang="pa"] { font-family: Raavi, sans-serif; font-size: 120%; }
[lang="te"] { font-family: Gautami, sans-serif; font-size: 140%; }
[lang="my"] { font-family: Myanmar1, sans-serif; font-size: 200%; }
[lang="th"] { font-family: "Cordia New", sans-serif; font-size: 200%; }
[lang="ur"] { font-family: "Nafees Nastaleeq", serif; font-size: 130%; }
[lang="ve"] { font-family: "Arial Unicode MS", sans-serif; }
[lang="zh-Hans"] { font-family: "Simsun", sans-serif; font-size: 140%; }
[lang="zh-Hant"] { font-family: "Mingliu", sans-serif; font-size: 140%; }

Note that I would have preferred to say :lang(am) { font-family… } etc, but XMetal 4.6 seems to require you to specify the attribute value as shown above. (You also have to specify class selectors as [class="myclass"] {…} rather than .myclass {…}.)

I see from a recent bugzilla report and some cursory testing that a (very) long-standing bug in Mozilla related to complex scripts has now been fixed.

Complex scripts include many non-Latin scripts that use combining characters or ligatures, or that apply shaping to adjacent characters, as the Arabic script does.

It used to be that, when you highlighted text in a complex script, as you extended the edges of the highlighted area you would break apart combining characters from their base character, split ligatures and disrupt the joining behaviour of Arabic script characters.

The good news is that this no longer happens – it was fixed by the new text frame code. The bad news is that the highlighting still happens character by character, rather than at grapheme boundaries – which can make it tricky to know whether you got the combining characters or not.

UPDATE: I hear from Kevin Brosnan that this will be fixed in Firefox 3. Hurrah! And thank you, Mozilla team.

What doesn’t appear to be fixed is the behaviour of Asian scripts when the CSS text-align: justify is applied. 🙁

I raised a bug report about this. I was amazed, after hearing about this from Indians and Pakistanis too, that there didn’t seem to be a bug report already. Come on users, don’t leave this up to the W3C!

Basically, the issue is that if you apply text-align: justify to some text in an Indian or Tibetan script the combining characters all get rendered alongside their base characters, ie. you go from this (showing, respectively, Tibetan, Devanagari (Hindi and Nepali), Punjabi, Telugu and Thai text):

Picture of text with no alignment.

to this:

Picture of text with justify alignment.

Strangely, the effect doesn’t seem to apply to the Thai text, nor to other text with combining characters that I’ve tried.

That’s a pretty big bug for people in the affected region because it effectively means that text-align:justify can’t be used.

Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, and this work may lead to similar approaches in a number of other countries.

Sarmad writes: “We are trying to make the URL enabled in Urdu for people who are not literate in any other language (a large majority of literate population in Pakistan). ICANN has only given specs for the Domain Name in other languages (through its RFCs). Until they allow the TLDs in Urdu, we are considering an application end solution: have a plug-in for a browser for people who want to use it, which [takes a] URL in Urdu, strips and maps all the TLD information to .com, .pk, etc. and converts the domain name to punycode. Thus, people can type URLs in pure Urdu which are converted to the mixed English-Urdu URLs by the application layer which ICANN currently allows.”

“We are currently trying to figure out what would be the ‘academic’ requirements/solutions for a language. To practically solve the problem, organizations like ICANN would need to come up with the solutions.”

There are some aspects to Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (eg. vowel diacritics) or converted to other characters (eg. Arabic characters) during the conversion to punycode.

This is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet. There is much more potential ambiguity in the choice of characters for Urdu. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, ie. Arabic or Persian (also written with the Arabic script) would need slightly different rules.

I find myself wondering whether you could use a plug-in to strip out or convert the characters while converting to punycode. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type in IDNs if they found themselves using a browser that didn’t have the plug-in (eg. the businessman who is visiting a corporation in the US that prevents ad hoc downloads of software). On the one hand, I wonder whether we can expect a user who sees a URI on a hard copy brochure containing vowel diacritics to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email would not be able to guarantee that users would have access to the plug-in. In that case, they would be unwise to use things like short vowel diacritics, since the user cannot easily change the link if they don’t have a plug-in. Imagine a vowelled IDN coming through in a plain text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for any language. But there is an additional problem in that the language would need to be determined correctly before such rules were applied – that is, the language of the original URI. That too seems a bit difficult.

So I can see the problem, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might create more trouble than benefit, by replacing the problems of errors and ambiguities with the problems of uninteroperable IDNs.

I have posted this to the www-international list for discussion.

Follow this link to see lists of characters that may be removed or converted.

Ruby text above and below Japanese characters.

My last post mentioned an extension that takes care of Thai line breaking. In this post I want to point to another useful extension that handles ruby annotation.

Typically ruby is used in East Asian scripts to provide phonetic transcriptions of obscure characters, or characters that the reader is not expected to be familiar with. For example it is widely used in education materials and children’s texts. It is also occasionally used to convey information about the meaning of ideographic characters. For more information see Ruby Markup and Styling.

Ruby markup (called 振り仮名 [furigana] in Japan) is described by the W3C’s Ruby Annotation spec. It comes in two flavours, simple and complex.

Ruby markup is a part of XHTML 1.1 (served as XML), but native support is not widely available. IE doesn’t support XHTML 1.1, but it does support simple ruby markup in HTML and XHTML 1.0. This extension provides support in Firefox for both simple and complex ruby, in HTML, XHTML 1.0 and XHTML 1.1.

It passes all the I18n Activity ruby tests, with the exception of one *very* minor nit related to spacing of complex ruby annotation.

Before and after applying the extension.

Samphan Raruenrom has produced a Firefox extension based on ICU to handle Thai line breaking.

Thai line breaks respect word boundaries, but there are no spaces between words in written Thai. Spaces are used instead as phrase separators (like English comma and full stop). This means that dictionary-based lookup is needed to properly wrap Thai text.

The current release works on Windows with the current Firefox release. The next release will also support Linux and future Mozilla Firefox/Thunderbird releases.

You can test this on our i18n articles translated into Thai.

This replaces work on a separate Thai version of Firefox.

UPDATE: This post has now been updated, reviewed and released as part of a W3C article. See

Here are some more thoughts on dealing with multi-cultural names in web forms, databases, or ontologies. See the previous post.


The first thing that English speakers must remember about other people’s names is that a large majority of people don’t write their names with the Latin alphabet, and a majority of those that do use accents and characters that don’t occur in English. It seems obvious, once I’ve said it, but it has some important consequences for designers that are often overlooked.

If you are designing an English form you need to decide whether you are expecting people to enter names in their own script or in an ASCII-only transcription. What people will type into the form will often depend on whether the form and its page is in their language or not. If the page is in their language, don’t be surprised to get back non-Latin or accented Latin characters.

If you hope to get ASCII-only, you need to tell the user.

The decision about which is most appropriate will depend to some extent on what you are collecting people’s names for, and how you intend to use them.

  • Are you collecting the person’s name just to have an identifier in your system? If so, it may not matter whether the name is stored in ASCII-only or native script.
  • Or do you plan to call them by name on a welcome page or in correspondence? If you will correspond using their name on pages written in their language, it would seem sensible to have the name in the native script.
  • Is it important for people in your organization who handle queries to be able to recognise and use the person’s name? If so, you may want to ask for a transcription.
  • Will their name be displayed or searchable (for example Flickr optionally shows people’s names as well as their user name on their profile page)? If so, you may want to store the name in both ASCII and native script, in which case you probably need to ask the user to submit their name in both native script and ASCII-only form, using separate fields.

Note that if you intend to parse a name, you may need to use country or language-specific algorithms to do so correctly (see the previous blog on personal names).

If you do accept non-ASCII names, you should use UTF-8 encoding in your pages, your back end databases and in all the scripts in between. This will significantly simplify your life.

Icons chosen by McDonald’s to represent, from left to right, Calories, Protein, Fat, Carbohydrates and Salt.

I just read a fascinating article about how McDonald’s set about testing the cultural acceptability of a range of icons intended to point to nutritional information. It talks about the process and gives examples of some of the issues. Very nice.

Interesting, also, that they still ended up with local variants in some cases.

Creating a New Language for Nutrition: McDonald’s Universal Icons for 109 Countries

Picture of Tibetan emphasis.

Christopher Fynn of the National Library of Bhutan raised an interesting question on the W3C Style and I18n lists. Tibetan emphasis is often achieved using one of two small marks below a Tibetan syllable, a little like Japanese wakiten. The picture shows U+0F35 TIBETAN MARK NGAS BZUNG NYI ZLA in use. The other form is U+0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS.

Chris was arguing that using CSS, rather than Unicode characters, to render these marks could be useful because:

  • the mark applies to, and is centred below a whole ‘syllable’ – not just the stack of the syllable – this may be easier to achieve with styling than font positioning where, say, a syllable has an even number of head characters (see examples to the far right in the picture)
  • it would make it easier to search for text if these characters were not interspersed in it
  • it would allow for flexibility in approaches to the visual style used for emphasis – you would be able to change between using these marks or alternatives such as use of red colour or changes in font size just by changing the CSS style sheet (as we can for English text).

There are potential issues with this approach too. These include things like the fact that the horizontal centring of glyphs within the syllable is not trivial. The vertical placement is also particularly difficult. You will notice from the attached image that the height depends on the depth of the text it falls below. On the other hand, it isn’t easy to achieve this with diacritics either, given the number of possible permutations of characters in a syllable. Such positioning is much more complicated than that of the Japanese wakiten.

A bigger issue may turn out to be that the application for this is fairly limited, and user agent developers have other priorities – at least for commercial applications.

To follow along with, and perhaps contribute to, the discussion follow the thread on the style list or the www-international list.

UPDATE: This post has now been updated, reviewed and released as a W3C article. See

People who create web forms, databases, or ontologies in English-speaking countries are often unaware how different people’s names can be in other countries. They build their forms or databases in a way that assumes too much on the part of foreign users.

I’m going to explore some of the potential issues in a series of blog posts. This content will probably go through a number of changes before settling down to something like a final form. Consider it more like a set of wiki pages than a typical blog post.


A form that asks for your name in a single field.
A form that asks for separate first and last names.

It seems to me that there are a couple of key scenarios to consider.

A You are designing a form in a single language (let’s assume English) that people from around the world will be filling in.

B You are designing a form in one language but the form will be adapted to suit the cultural differences of a given locale when the site is translated.

In reality, you will probably not be able to localise for every different culture, so even if you rely on approach B, some people will still use a form that is not intended specifically for their culture.

Examples of differences

To get started, let’s look at some examples of how people’s names are different around the world.

Given name and patronymic

In the name Björk Guðmundsdóttir, Björk is the given name. The second part of the name indicates the father’s (or sometimes the mother’s) name, followed by -sson for a male and -sdóttir for a female, and is more of a description than a family name in the Western sense. Björk’s father, Guðmundur, was the son of Gunnar, so is known as Guðmundur Gunnarsson.

Icelanders prefer to be called by their given name (Björk), or by their full name (Björk Guðmundsdóttir). Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.

Other cultures where a person has one given name followed by a patronymic include parts of Southern India, Malaysia and Indonesia.

Different order of parts

In the name 毛泽东 [mao ze dong] the family name is Mao, ie. the first name when reading left to right. The given name is Dong. The middle character, Ze, is a generational name, and is common to all his siblings (such as his brothers and sister, 毛泽民 [mao ze min], 毛泽覃 [mao ze tan], and 毛澤紅 [mao ze hong]).

Among acquaintances Mao may be referred to as 毛泽东先生 [mao ze dong xiān shēng] or 毛先生 [mao xiān shēng]. Not everyone uses generational names these days, especially in Mainland China. If you are on familiar terms with someone called 毛泽东, you would normally refer to them using 泽东 [ze dong], not just 东 [dong].

Note also that the names are not separated by spaces.

The order family name followed by given name(s) is common in other countries, such as Japan, Korea and Hungary.

Chinese people who deal with Westerners will often adopt an additional given name that is easier for Westerners to use. For example, Yao Ming (family name Yao, given name Ming) may write his name for foreigners as Fred Yao Ming or Fred Ming Yao.

Multiple family names

Spanish-speaking people will commonly have two family names. For example, Maria-Jose Carreño Quiñones may be the daughter of Antonio Carreño Rodríguez and María Quiñones Marqués.

You would refer to her as Señorita Carreño, not Señorita Quiñones.

Variant forms

We already saw that the patronymic in Iceland ends in -son or -dóttir, depending on whether the child is male or female. Russians use patronymics as their middle name but also use family names, in the order given-patronymic-family. The endings of the patronymic and family names will indicate whether the person in question is male or female. For example, the wife of Борис Никола́евич Ельцин (Boris Nikolayevich Yeltsin) is Наина Иосифовна Ельцина (Naina Iosifovna Yeltsina) – note how the husband’s names end in consonants, while the wife’s names (even the patronymic from her father) end in a.

Mixing it up

Many cultures mix and match these differences from Western personal names, and add their own novelties.

For example, Velikkakathu Sankaran Achuthanandan is a Kerala name from Southern India, usually written V. S. Achuthanandan which follows the order familyName-fathersName-givenName. In many parts of the world, parts of names are derived from titles, locations, genealogical information, caste, religious references, and so on, eg. the Arabic Abu Karim Muhammad al-Jamil ibn Nidal ibn Abdulaziz al-Filistini.

In Vietnam, names such as Nguyễn Tấn Dũng follow the order family-middle-given name. Although this seems similar to the Chinese example above, even in a formal situation this Prime Minister of Vietnam is referred to using his given name, ie. Mr. Dung, not Mr. Nguyen.

Further reading

Wikipedia sports a large number of fascinating articles about how people’s names look in various cultures around the world. I strongly recommend a perusal of the following links.



If designing a form or database that will accept names from people with a variety of backgrounds, you should ask yourself whether you really need to have separate fields for given name and family name.

This will depend on what you need to do with the data, but obviously it will be simpler to just use the full name as the user provides it, where possible.

Note that if you have separate fields because you want to use the person’s given name to communicate with them, you may run into problems not only with name syntax but also with varying expectations of formality around the world. It may be better to ask separately, for example when setting up a profile, how that person would like you to address them.

If you do still feel you need to ask for constituent parts of a name separately, try to avoid using the labels ‘first name’ and ‘last name’, since these can be confusing for people who normally write their family name followed by given names.

Be careful, also, about assumptions built into algorithms that pull out the parts of a name automatically. For example, the vCard and hCard approach of implied “n” optimization could have difficulties with, say, Chinese names. You should be as clear as possible about telling people how to specify their name so that you capture the data you think you need.
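To see why such automatic extraction is fragile, here is a minimal Python sketch of the kind of “last token is the family name” heuristic an algorithm might apply to a single name field. (naive_split is a hypothetical helper written for illustration, not part of any vCard or hCard implementation.)

```python
# A hypothetical name-splitting heuristic: assume the final space-separated
# token is the family name. This mirrors the kind of guess an automatic
# given/family extraction might make, and shows where it breaks down.
def naive_split(full_name):
    parts = full_name.rsplit(" ", 1)
    if len(parts) == 1:
        # No space at all (eg. 毛泽东): there is nothing sensible to split on.
        return {"given": "", "family": parts[0]}
    return {"given": parts[0], "family": parts[1]}

# Plausible for many Western names:
print(naive_split("Jane Smith"))  # {'given': 'Jane', 'family': 'Smith'}

# Wrong for family-name-first order: the family name is Nguyễn, not Dũng.
print(naive_split("Nguyễn Tấn Dũng"))

# Wrong for double family names: Carreño belongs to the family name,
# but ends up attached to the given name instead.
print(naive_split("Maria-Jose Carreño Quiñones"))
```

The heuristic never fails loudly; it just silently files people under the wrong name, which is exactly why being explicit with users beats guessing.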

If you are designing forms that will be localised on a per culture basis, don’t forget that atomised name parts may still need to be stored in a central database, which therefore needs to be able to represent all the various complexities that you dealt with by relegating the form design to the localisation effort.
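One way to keep such a central store flexible is to treat the full name, exactly as entered, as the canonical value, and make any atomized parts optional. A minimal sketch in Python (the field names here are illustrative assumptions, not a standard schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonName:
    # The name exactly as the user entered it: always present, never derived.
    full_name: str
    # Atomized parts, filled in only when a localised form collects them.
    given: Optional[str] = None
    family: Optional[str] = None
    # How the person asked to be addressed, collected as a separate question.
    preferred_address: Optional[str] = None

# A record from a non-localised, single-field form:
bjork = PersonName(full_name="Björk Guðmundsdóttir", preferred_address="Björk")

# A record from a localised form that collected separate parts:
mao = PersonName(full_name="毛泽东", family="毛", given="泽东")
```

The point of the design is that records from localised and non-localised forms can live in the same store without forcing every culture’s names through the same given/family split.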

I’ll post some further issues and thoughts about personal names when time allows.

[See part 2.]

This morning I came across an interesting set of principles for site design. It was developed as part of the BBC 2.0 project.

That led me to the BBC Director General’s “BBC 2.0: why on demand changes everything”. Also a very interesting read as a case study for the web as part of a medium of mass communication.

One particular topic out of several I found of interest:

Interestingly, on July 7th last year, which was by far the biggest day yet for the use of rich audio-visual content from our news site, the content most frequently demanded was the eyewitness user generated content (UGC) from the bomb scenes.

Shaky, blurry images uploaded by one member of the public, downloaded by hundreds of thousands of other members of the public.

It’s a harbinger of a very different, more collaborative, more involving kind of news.

Here, as at so many other points in the digital revolution, the public are moving very quickly now – at least as quickly as the broadcasters.

I also find it interesting to see how news spreads through channels like Flickr. Eighteen months ago we were mysteriously bounced out of bed at 6am, but there was nothing on the TV to explain what had happened. I went up to the roof patio and took some of the first photos of the Buncefield explosion, including one taken just 20 minutes after the blast, and uploaded them to Flickr. A slightly later photo hit the number one spot for interestingness for that day. And as many other people’s photos appeared it was possible to get a lot of information about what had happened, including eyewitness accounts, even ahead of the national news.

Over the past 24 hours I’ve been exploring with interest the localization of the Flickr UI.

One difference you’ll notice, if you switch to a language other than English, is that the icons above a photo such as on this page have no text in them.

I checked with Flickr staff, and they confirmed that this is because of the difficulty of squeezing in the translated text in the space available. A classic localization issue, and one that developers and designers should always consider when designing the UI.

For example, here’s the relevant part of the English UI:

Picture of the English version, with text embedded in graphics.

and here is what it looks like when using the Spanish (or indeed any other) UI:

Picture of the Spanish version with only icons, no text.

The text has been dropped in favour of just icons. Note, however, how the text appears in the tooltips that pop up as you mouse over the icons.

This can be an effective way of addressing the problems of text expansion during translation, as long as the icons are understandable, memorable, and free from cultural bias or offence. Using the tooltips to clarify the meaning is useful too. I think these icons work well, and I’d actually like the Flickr folks to make the English version look like this, too. It detracts less from the photos, to my mind.

Here’s what it may have looked like if Flickr had done the Portuguese in the same way as the English:

Picture of a hypothetical Portuguese version with text in the graphics.

There are a number of problems here. The text is quite dense, it overshoots the width of the photo (there are actually still two letters missing on the right), and it is quite hard to see the accented characters. The text would have to be in a much bigger font to support the complexity of the characters in Chinese and Korean (and of course many other future languages).

Of course, in many situations where text appears in graphics the available width for the text is strictly fixed, and usually within a space that only just fits the English (if that’s the source).

Text will usually expand when translating from English, in particular. This expansion can be particularly pronounced for short pieces of text like icon labels.

So the moral of this story: Think several times before using text in graphics, and in particular icons. If you need to localise your page later, you could have problems. Better just avoid it from the start if you can.

That’s called internationalization 😉


>> Get my slides !

The long-awaited @media conference is finally over. It went ok, I thought. I’d been looking forward to carrying the i18n gospel to the heathens of the design and development community. 😉

It was great to have a single track in San Fran. Of course, given that there were two tracks in London, my audience there wasn’t huge – though I’m guessing that about one third of the 700-odd attendees came, which isn’t too bad – especially since I was up against Dan Cederholm (even I wanted to see Dan again). It’s always frustrating that people don’t know how much they’d find talks on i18n useful until they have accidentally been to one.

Anyway, I got a lot out of the other excellent conference talks and enjoyed meeting, or getting to know better, many new people. I’m looking forward to next year already. It will be different, however, and probably a little quieter, given that Molly Holzschlag announced that she was leaving the web conference circuit, and Joe Clark announced at the very end of the conference that he was retiring (‘pretty much’) from accessibility. Good luck to them both.

As usual, there are lots of photos.

About the presentation

Check out slide 77 for a list of practical takeaways from the presentation.

The presentation was not designed to give you a thorough overview of potential internationalization and localization issues – we would need much longer for that. It aims to provide you with a few practical takeaways, but more importantly it aims to get you thinking about what internationalization is all about – to take you out of your comfort zone, and help you realize that if you want your content to wow people outside your own culture and language, you need to build in certain flexibilities and adopt certain approaches during the design and development – not as an afterthought. Otherwise you are likely to be creating substantial barriers for worldwide use.

The presentation also aims to show that, although using Unicode is an extremely good start to making your stuff world-ready, using a Unicode encoding such as UTF-8 throughout your content, scripts and databases is only a start. You need to worry about whether translators will be able to adapt your stuff linguistically, but you also need to consider whether graphics and design are going to be culturally appropriate or can be adapted, and whether your approaches and methodologies fit with those of your target users.

