
Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, and this may lead to similar approaches in a number of other countries.

Sarmad writes: “We are trying to make the URL enabled in Urdu for people who are not literate in any other language (a large majority of the literate population in Pakistan). ICANN has only given specs for the Domain Name in other languages (through its RFCs). Until they allow the TLDs in Urdu, we are considering an application end solution: have a plug-in for a browser for people who want to use it, which takes a URL in Urdu, strips and maps all the TLD information to .com, .pk, etc., and converts the domain name to punycode. Thus, people can type URLs in pure Urdu, which are converted by the application layer to the mixed English-Urdu URLs which ICANN currently allows.”

“We are currently trying to figure out what would be the ‘academic’ requirements/solutions for a language. To practically solve the problem, organizations like ICANN would need to come up with the solutions.”

There are some aspects to Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (eg. vowel diacritics) or converted to other characters (eg. Arabic characters) during the conversion to punycode.
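
To make that concrete, here is a small Perl sketch of the kind of preprocessing being described, applied to a single Urdu label before it is handed to a punycode converter. The specific characters removed and mapped are purely illustrative assumptions on my part, not CRULP or ICANN rules, and the script assumes the CPAN module Net::IDN::Punycode is installed.

#!/usr/bin/perl
# Illustrative sketch only: strip vowel diacritics and fold letter variants
# in an Urdu label, then convert it to its punycode (ACE) form.
use strict;
use warnings;
use Net::IDN::Punycode qw(encode_punycode);   # assumed to be installed from CPAN

my $label = "\x{0645}\x{0650}\x{062B}\x{0627}\x{0644}";   # an Urdu word containing a short vowel (zer, U+0650)

# Remove short vowel diacritics (an illustrative range, not an agreed list).
$label =~ s/[\x{064B}-\x{0652}]//g;

# Map an Urdu letter variant to a single canonical character
# (HEH GOAL U+06C1 to HEH U+0647 here, purely as an example of the idea).
$label =~ s/\x{06C1}/\x{0647}/g;

# The cleaned-up label can now be converted to the ASCII form used in DNS.
print 'xn--' . encode_punycode($label) . "\n";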

Stripping or converting characters like this is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet; in Urdu there is much more potential for ambiguity in which characters are used. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, ie. Arabic or Persian (also written with the Arabic script) would need slightly different rules.

I find myself wondering whether you could rely on a plug-in to strip out or convert the characters while converting to punycode. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type IDNs if they found themselves using a browser that didn’t have it (eg. the businessman visiting a corporation in the US that prevents ad hoc downloads of software). On the one hand, I wonder whether we can expect a user who sees a URI containing vowel diacritics on a hard-copy brochure to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email cannot guarantee that readers will have access to the plug-in. In that case, they would be unwise to use things like short vowel diacritics, since a user without the plug-in cannot easily change the link. Imagine a vowelled IDN coming through in a plain-text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for each language. But there is an additional problem: the language of the original URI would need to be correctly determined before such rules could be applied. That too seems a bit difficult.
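
Just to illustrate the shape of that problem, such tables might look something like the sketch below (the rules shown are invented placeholders, not anything proposed by ICANN or CRULP). The point to notice is that nothing can be applied until the language of the label is known:

#!/usr/bin/perl
# Hypothetical per-language preprocessing tables; the contents are placeholders.
use strict;
use warnings;
binmode STDOUT, ':utf8';

my %idn_rules = (
    ur => {                                           # Urdu
        remove => qr/[\x{064B}-\x{0652}]/,            # short vowel diacritics
        map    => { "\x{06C1}" => "\x{0647}" },       # fold letter variants
    },
    fa => {                                           # Persian: same script, slightly different rules
        remove => qr/[\x{064B}-\x{0652}]/,
        map    => {},
    },
);

sub preprocess_label {
    my ($label, $lang) = @_;                          # the language must already be known
    my $rules = $idn_rules{$lang} or return $label;   # no rules for this language: leave as is
    $label =~ s/$rules->{remove}//g;
    while (my ($from, $to) = each %{ $rules->{map} }) {
        $label =~ s/\Q$from\E/$to/g;
    }
    return $label;                                    # now ready for punycode conversion
}

print preprocess_label("\x{0645}\x{0650}\x{062B}\x{0627}\x{0644}", 'ur'), "\n";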

So I can see the problem, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might create more trouble than benefit, by replacing the problems of errors and ambiguities with the problems of uninteroperable IDNs.

I have posted this to the www-international list for discussion.

Follow this link to see lists of characters that may be removed or converted.


Ruby text above and below Japanese characters.

My last post mentioned an extension that takes care of Thai line breaking. In this post I want to point to another useful extension that handles ruby annotation.

Typically ruby is used in East Asian scripts to provide phonetic transcriptions of obscure characters, or characters that the reader is not expected to be familiar with. For example, it is widely used in educational materials and children’s texts. It is also occasionally used to convey information about the meaning of ideographic characters. For more information see Ruby Markup and Styling.

Ruby markup is described by the W3C’s Ruby Annotation spec (in Japanese the annotations themselves are called 振り仮名 [furigana]). It comes in two flavours, simple and complex.
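
For anyone who hasn’t seen it, a simple ruby annotation giving the reading of 東京 (‘Tokyo’), marked up along the lines the spec describes, looks like this; the rp elements supply fallback parentheses for user agents that can’t render the ruby text. Complex ruby adds rbc and rtc container elements so that two annotations can be attached to one base, or one annotation can span several base characters.

<ruby>
  <rb>東京</rb>
  <rp>(</rp><rt>とうきょう</rt><rp>)</rp>
</ruby>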

Ruby markup is a part of XHTML 1.1 (served as XML), but native support is not widely available. IE doesn’t support XHTML 1.1, but it does support simple ruby markup in HTML and XHTML 1.0. This extension provides support in Firefox for both simple and complex ruby, in HTML, XHTML 1.0 and XHTML 1.1.

It passes all the I18n Activity ruby tests, with the exception of one *very* minor nit related to spacing of complex ruby annotation.


Before and after applying the extension.

Samphan Raruenrom has produced a Firefox extension based on ICU to handle Thai line breaking.

Thai line breaks respect word boundaries, but there are no spaces between words in written Thai. Spaces are used instead as phrase separators (like English comma and full stop). This means that dictionary-based lookup is needed to properly wrap Thai text.
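
To see why a dictionary is involved, here is a toy Perl sketch of greedy longest-match segmentation of the phrase สวัสดีชาวโลก (‘hello, people of the world’) over a three-word dictionary. Real break iterators, such as the ICU one Samphan’s extension relies on, are far more sophisticated, handling ambiguity and unknown words; this is only meant to show the principle.

#!/usr/bin/perl
# Toy illustration of dictionary-based word segmentation: greedy longest
# match over a tiny word list, reporting where a line break would be allowed.
use strict;
use warnings;

my @dictionary = (
    "\x{0E2A}\x{0E27}\x{0E31}\x{0E2A}\x{0E14}\x{0E35}",   # สวัสดี "hello"
    "\x{0E0A}\x{0E32}\x{0E27}",                           # ชาว "people (of)"
    "\x{0E42}\x{0E25}\x{0E01}",                           # โลก "world"
);

# The phrase written, as usual, with no spaces between the words.
my $text = "\x{0E2A}\x{0E27}\x{0E31}\x{0E2A}\x{0E14}\x{0E35}"
         . "\x{0E0A}\x{0E32}\x{0E27}"
         . "\x{0E42}\x{0E25}\x{0E01}";

my @breaks;                     # character offsets where a line break is allowed
my $pos = 0;
while ($pos < length $text) {
    my $match = '';
    foreach my $word (@dictionary) {                       # longest word starting at $pos
        $match = $word
            if index($text, $word, $pos) == $pos && length($word) > length($match);
    }
    last unless $match;                                    # unknown text: give up (toy code)
    $pos += length $match;
    push @breaks, $pos;
}
print "Line-break opportunities after character offsets: @breaks\n";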

The current release works on Windows with the current Firefox release, 2.0.0.4. The next release will also support Linux, as well as future Mozilla Firefox/Thunderbird releases.

You can test this on our i18n articles translated into Thai.

This replaces work on a separate Thai version of Firefox.

UPDATE: This post has now been updated, reviewed and released as part of a W3C article. See http://www.w3.org/International/questions/qa-personal-names.

Here are some more thoughts on dealing with multi-cultural names in web forms, databases, or ontologies. See the previous post.

Script

The first thing that English speakers must remember about other people’s names is that a large majority of them don’t use the Latin alphabet, and that, of those who do, a majority use accents and characters that don’t occur in English. It seems obvious, once I’ve said it, but it has some important consequences for designers that are often overlooked.

If you are designing an English form, you need to decide whether you expect people to enter names in their own script or in an ASCII-only transcription. What people will type into the form will often depend on whether the form and its page are in their language or not. If the page is in their language, don’t be surprised to get back non-Latin or accented Latin characters.

If you hope to get ASCII-only, you need to tell the user.

The decision about which is most appropriate will depend to some extent on what you are collecting people’s names for, and how you intend to use them.

  • Are you collecting the person’s name just to have an identifier in your system? If so, it may not matter whether the name is stored in ASCII-only or native script.
  • Or do you plan to call them by name on a welcome page or in correspondence? If you will correspond using their name on pages written in their language, it would seem sensible to have the name in the native script.
  • Is it important for people in your organization who handle queries to be able to recognise and use the person’s name? If so, you may want to ask for a transcription.
  • Will their name be displayed or searchable (for example Flickr optionally shows people’s names as well as their user name on their profile page)? If so, you may want to store the name in both ASCII and native script, in which case you probably need to ask the user to submit their name in both native script and ASCII-only form, using separate fields.

Note that if you intend to parse a name, you may need to use country or language-specific algorithms to do so correctly (see the previous blog on personal names).

If you do accept non-ASCII names, you should use UTF-8 encoding in your pages, your back end databases and in all the scripts in between. This will significantly simplify your life.
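
As a minimal sketch of what that means for a form-handling script, the following Perl CGI fragment sends its page as UTF-8 and decodes the submitted bytes into character strings before doing anything else with them. The field name fullname and the response text are just assumptions for the example.

#!/usr/bin/perl
# Minimal sketch: serve the page as UTF-8 and decode the UTF-8 bytes
# submitted by the form before validating or storing them.
use strict;
use warnings;
use CGI;
use Encode qw(decode_utf8);

my $q = CGI->new;

binmode STDOUT, ':encoding(UTF-8)';                       # everything we print goes out as UTF-8
print $q->header(-type => 'text/html', -charset => 'utf-8');

# Form values arrive as raw bytes; turn them into Perl character strings
# before passing them to a UTF-8 database or anything else downstream.
my $name_bytes = $q->param('fullname');                   # raw bytes from the form
my $full_name  = decode_utf8($name_bytes // '');          # now a proper character string

print '<p>Thank you, ', $q->escapeHTML($full_name), "</p>\n";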


Icons chosen by McDonald’s to represent, from left to right, Calories, Protein, Fat, Carbohydrates and Salt.

I just read a fascinating article about how McDonald’s set about testing the cultural acceptability of a range of icons intended to point to nutritional information. It talks about the process and gives examples of some of the issues. Very nice.

Interesting, also, that they still ended up with local variants in some cases.

Creating a New Language for Nutrition: McDonald’s Universal Icons for 109 Countries

Some applications insert a signature or Byte Order Mark (BOM) at the beginning of UTF-8 text. For example, Notepad always adds a BOM when saving as UTF-8.

Older text editors or browsers will display the BOM as a blank line on-screen; others will display unexpected characters, such as ï»¿ (the BOM bytes interpreted as Latin 1). This may also occur in the latest browsers if a file that starts with a BOM is included into another file by PHP.

For more information, see the article Unexpected characters or blank lines and the test pages and results on the W3C site.

If you have problems that you think might be related to this, the following may help.

Checking for the BOM

I created a small utility that checks for a BOM at the beginning of a file. Just type in the URI for the file and it will take a look. (Note, if it’s a file included by PHP that you think is causing the problem, type in the URI of the included file.)
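
If you would rather check from the command line and have Perl available, a one-liner along these lines does much the same job (yourfile.html is just a placeholder for the file you want to test):

perl -ne 'print(/^\xEF\xBB\xBF/ ? "BOM found.\n" : "No BOM found.\n"); exit' yourfile.html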

Removing the BOM

If there is a BOM, you will probably want to remove it. One way would be to save the file using a BOM-aware editor that allows you to specify that you don’t want a BOM at the start of the file. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text “Include Unicode Signature (BOM)”. Just uncheck the box and save.

Another way would be to run a script on your file. Here is some simple Perl scripting to check for a BOM and remove it if it exists (developed by Martin Dürst and tweaked a little by myself).

# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
    }

my @file;   # file content
my $lineno = 0;

my $filename = $ARGV[0];
if ($filename) {
    open( BOMFILE, $filename ) || die "Could not open source file for reading.";
    while (<BOMFILE>) {
        if ($lineno++ == 0) {
            if ( index( $_, "\xEF\xBB\xBF" ) == 0 ) {
                s/^\xEF\xBB\xBF//;
                print "BOM found and removed.\n";
                }
            else { print "No BOM found.\n"; }
            }
        push @file, $_ ;
        }
    close (BOMFILE)  || die "Can't close source file after reading.";

    open (NOBOMFILE, ">$filename") || die "Could not open source file for writing.";
    foreach $line (@file) {
        print NOBOMFILE $line;
        }
    close (NOBOMFILE)  || die "Can't close source file after writing.";
    }
else {  # STDIN -> STDOUT
    while (<>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;
            }
        push @file, $_ ;
        }

    foreach $line (@file) {
        print $line;
        }
    }
Picture of Tibetan emphasis.

Christopher Fynn of the National Library of Bhutan raised an interesting question on the W3C Style and I18n lists. Tibetan emphasis is often achieved using one of two small marks below a Tibetan syllable, a little like Japanese wakiten. The picture shows U+0F35: TIBETAN MARK NGAS BZUNG NYI ZLA in use. The other form is U+0F37: TIBETAN MARK NGAS BZUNG SGOR RTAGS.

Chris was arguing that using CSS, rather than Unicode characters, to render these marks could be useful because:

  • the mark applies to, and is centred below, a whole ‘syllable’ – not just the stack of the syllable – this may be easier to achieve with styling than with font positioning where, say, a syllable has an even number of head characters (see examples to the far right in the picture)
  • it would make it easier to search for text if these characters were not interspersed in it
  • it would allow for flexibility in approaches to the visual style used for emphasis – you would be able to change between using these marks or alternatives such as use of red colour or changes in font size just by changing the CSS style sheet (as we can for English text).

There are a number of potential issues with this approach too. These include things like the fact that the horizontal centring of glyphs within the syllable is not trivial. The vertical placement is also particularly difficult. You will notice from the attached image that the height depends on the depth of the text it falls below. On the other hand, it isn’t easy to achieve this with diacritics either, given the number of possible permutations of characters in a syllable. Such positioning is much more complicated than that of the Japanese wakiten.

A bigger issue may turn out to be that the application for this is fairly limited, and user agent developers have other priorities – at least for commercial applications.

To follow along with, and perhaps contribute to, the discussion, follow the thread on the style list or the www-international list.

UPDATE: This post has now been updated, reviewed and released as a W3C article. See http://www.w3.org/International/questions/qa-personal-names.

People who create web forms, databases, or ontologies in English-speaking countries are often unaware how different people’s names can be in other countries. They build their forms or databases in a way that assumes too much on the part of foreign users.

I’m going to explore some of the potential issues in a series of blog posts. This content will probably go through a number of changes before settling down to something like a final form. Consider it more like a set of wiki pages than a typical blog post.

Scenarios

A form that asks for your name in a single field.
A form that asks for separate first and last names.

It seems to me that there are a couple of key scenarios to consider.

A. You are designing a form in a single language (let’s assume English) that people from around the world will be filling in.

B. You are designing a form in one language, but the form will be adapted to suit the cultural differences of a given locale when the site is translated.

In reality, you will probably not be able to localise for every different culture, so even if you rely on approach B, some people will still use a form that is not intended specifically for their culture.

Examples of differences

To get started, let’s look at some examples of how people’s names are different around the world.

Given name and patronymic

In the name Björk Guðmundsdóttir, Björk is the given name. The second part of the name indicates the father’s (or sometimes the mother’s) name, followed by -sson for a male and -sdóttir for a female, and is more of a description than a family name in the Western sense. Björk’s father, Guðmundur, was the son of Gunnar, so is known as Guðmundur Gunnarsson.

Icelanders prefer to be called by their given name (Björk), or by their full name (Björk Guðmundsdóttir). Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.

Other cultures where a person has one given name followed by a patronymic include parts of Southern India, Malaysia and Indonesia.

Different order of parts

In the name 毛泽东 [mao ze dong] the family name is Mao, ie. the first name when reading left to right. The given name is Dong. The middle character, Ze, is a generational name, and is common to all his siblings (such as his brothers and sister, 毛泽民 [mao ze min], 毛泽覃 [mao ze tan], and 毛澤紅 [mao ze hong]).

Among acquaintances Mao may be referred to as 毛泽东先生 [mao ze dong xiān shēng] or 毛先生 [mao xiān shēng]. Not everyone uses generational names these days, especially in Mainland China. If you are on familiar terms with someone called 毛泽东, you would normally refer to them using 泽东 [ze dong], not just 东 [dong].

Note also that the names are not separated by spaces.

The order family name followed by given name(s) is common in other countries, such as Japan, Korea and Hungary.

Chinese people who deal with Westerners will often adopt an additional given name that is easier for Westerners to use. For example, Yao Ming (family name Yao, given name Ming) may write his name for foreigners as Fred Yao Ming or Fred Ming Yao.

Multiple family names

Spanish-speaking people will commonly have two family names. For example, María-José Carreño Quiñones may be the daughter of Antonio Carreño Rodríguez and María Quiñones Marqués.

You would refer to her as Señorita Carreño, not Señorita Quiñones.

Variant forms

We already saw that the patronymic in Iceland ends in -son or -dóttir, depending on whether the child is male or female. Russians use patronymics as their middle name but also use family names, in the order given-patronymic-family. The endings of the patronymic and family names will indicate whether the person in question is male or female. For example, the wife of Борис Никола́евич Ельцин (Boris Nikolayevich Yeltsin) is Наина Иосифовна Ельцина (Naina Iosifovna Yeltsina) – note how the husband’s names end in consonants, while the wife’s names (even the patronymic from her father) end in a.

Mixing it up

Many cultures mix and match these differences from Western personal names, and add their own novelties.

For example, Velikkakathu Sankaran Achuthanandan is a Kerala name from Southern India, usually written V. S. Achuthanandan, which follows the order familyName-fathersName-givenName. In many parts of the world, parts of names are derived from titles, locations, genealogical information, caste, religious references, and so on, eg. the Arabic Abu Karim Muhammad al-Jamil ibn Nidal ibn Abdulaziz al-Filistini.

In Vietnam, names such as Nguyễn Tấn Dũng follow the order family-middle-given name. Although this seems similar to the Chinese example above, even in a formal situation this Prime Minister of Vietnam is referred to using his given name, ie. Mr. Dung, not Mr. Nguyen.

Further reading

Wikipedia sports a large number of fascinating articles about how people’s names look in various cultures around the world. I strongly recommend a perusal of the following links.

Akan, Arabic, Balinese, Bulgarian, Czech, Chinese, Dutch, Fijian, French, German, Hawaiian, Hebrew, Hungarian, Icelandic, Indian, Indonesian, Irish, Italian, Japanese, Javanese, Korean, Lithuanian, Malaysian, Mongolian, Persian, Philippine, Polish, Portuguese, Russian, Spanish, Taiwanese, Thai, Vietnamese

Consequences

If designing a form or database that will accept names from people with a variety of backgrounds, you should ask yourself whether you really need to have separate fields for given name and family name.

This will depend on what you need to do with the data, but obviously it will be simpler to just use the full name as the user provides it, where possible.

Note that if you have separate fields because you want to use the person’s given name to communicate with them, bear in mind that name syntax is not the only problem: expectations with regard to formality also vary around the world and need to be accounted for. It may be better to ask separately, when setting up a profile for example, how that person would like you to address them.

If you do still feel you need to ask for constituent parts of a name separately, try to avoid using the labels ‘first name’ and ‘last name’, since these can be confusing for people who normally write their family name followed by given names.

Be careful, also, about assumptions built into algorithms that pull out the parts of a name automatically. For example, the v-card and h-card approach of implied “n” optimization could have difficulties with, say, Chinese names. You should be as clear as possible about telling people how to specify their name so that you capture the data you think you need.

If you are designing forms that will be localised on a per culture basis, don’t forget that atomised name parts may still need to be stored in a central database, which therefore needs to be able to represent all the various complexities that you dealt with by relegating the form design to the localisation effort.

I’ll post some further issues and thoughts about personal names when time allows.

[See part 2.]

This morning I came across an interesting set of principles for site design. It was developed as part of the BBC 2.0 project.

That led me to the BBC Director General’s “BBC 2.0: why on demand changes everything”. Also a very interesting read as a case study for the web as part of a medium of mass communication.

One particular topic out of several I found of interest:

Interestingly, on July 7th last year, which was by far the biggest day yet for the use of rich audio-visual content from our news site, the content most frequently demanded was the eyewitness user generated content (UGC) from the bomb scenes.

Shaky, blurry images uploaded by one member of the public, downloaded by hundreds of thousands of other members of the public.

It’s a harbinger of a very different, more collaborative, more involving kind of news.

Here, as at so many other points in the digital revolution, the public are moving very quickly now – at least as quickly as the broadcasters.

I also find it interesting to see how news spreads through channels like Flickr. Eighteen months ago we were mysteriously bounced out of bed at 6am, but there was nothing on the TV to explain what had happened. I went up to the roof patio and took some of the first photos of the Buncefield explosion, including a photo taken just 20 minutes after the blast, and uploaded it to Flickr. A slightly later photo hit the number one spot for interestingness for that day. And as many other people’s photos appeared it was possible to get a lot of information, even ahead of the national news, including eye witness accounts, about what had happened.