Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, and this work may lead to similar approaches in a number of other countries.

Sarmad writes: “We are trying to make the URL enabled in Urdu for people who are not literate in any other language (a large majority of the literate population in Pakistan). ICANN has only given specs for the Domain Name in other languages (through its RFCs). Until they allow the TLDs in Urdu, we are considering an application end solution: have a plug-in for a browser for people who want to use it, which takes a URL in Urdu, strips and maps all the TLD information to .com, .pk, etc. and converts the domain name to punycode. Thus, people can type URLs in pure Urdu which are converted to the mixed English-Urdu URLs by the application layer which ICANN currently allows.”

“We are currently trying to figure out what would be the ‘academic’ requirements/solutions for a language. To practically solve the problem, organizations like ICANN would need to come up with the solutions.”
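The application-layer conversion Sarmad describes could be sketched roughly as follows. This is only an illustration under stated assumptions: the Urdu TLD labels in `URDU_TLDS` and the function name `urdu_host_to_ascii` are invented for the example (the quote does not say which Urdu strings would stand for .com or .pk), and the sketch skips the normalisation that real IDNA processing performs before Punycode encoding.

```python
# Hypothetical table: Urdu strings a user might type in place of a TLD.
# These labels are invented for illustration; they are not from the proposal.
URDU_TLDS = {
    "\u067E\u0627\u06A9": "pk",   # پاک
    "\u06A9\u0648\u0645": "com",  # کوم
}

def urdu_host_to_ascii(host: str) -> str:
    """Map the Urdu TLD to its ASCII form and Punycode-encode the other labels."""
    *labels, tld = host.split(".")
    if tld not in URDU_TLDS:
        raise ValueError(f"no mapping for TLD {tld!r}")
    ascii_labels = []
    for label in labels:
        if label.isascii():
            ascii_labels.append(label)  # already usable as-is
        else:
            # Python's built-in "punycode" codec does the raw encoding;
            # the "xn--" prefix marks the label as an IDN label.
            ascii_labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(ascii_labels + [URDU_TLDS[tld]])
```

Calling `urdu_host_to_ascii` on a host typed entirely in Urdu would return an `xn--…` label followed by `.pk` or `.com`, which is the mixed form the quote refers to.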

There are some aspects to Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (eg. vowel diacritics) or converted to other characters (eg. Arabic characters that are not used in Urdu) during the conversion to punycode.

This is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet. Urdu offers much more scope for ambiguity in the choice of characters. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, ie. Arabic or Persian (also written with the Arabic script) would need some slightly different rules.

I find myself wondering whether you could use a plug-in to strip out or convert the characters while converting to punycode. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type in IDNs if they found themselves using a browser that didn’t have the plug-in (eg. the businessman who is visiting a corporation in the US that prevents ad hoc downloads of software). On the one hand, I wonder whether we can expect a user who sees a URI on a hard copy brochure containing vowel diacritics to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email would not be able to guarantee that users would have access to the plug-in. In that case, they would be unwise to use things like short vowel diacritics, since the user cannot easily change the link if they don’t have a plug-in. Imagine a vowelled IDN coming through in a plain text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for any language. But there is an additional problem in that the language would need to be determined correctly before such rules were applied – that is, the language of the original URI. That too seems a bit difficult.
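A small sketch may make the language-detection problem concrete. The `RULES` table and `ace_label` function below are hypothetical: only the idea of an Urdu-specific rule for ARABIC LETTER KAF comes from the proposal, and the replacement character (U+06A9 KEHEH) is my assumption.

```python
# Hypothetical per-language rule tables (code point -> code point).
RULES = {
    "ur": {0x0643: 0x06A9},  # Urdu: ARABIC KAF -> KEHEH (assumed target)
    "ar": {},                # Arabic: KAF is kept as typed
}

def ace_label(label: str, lang: str) -> str:
    """Apply the rules for `lang`, then Punycode-encode the label."""
    mapped = label.translate(RULES[lang])
    return "xn--" + mapped.encode("punycode").decode("ascii")

typed = "\u0643\u062A\u0627\u0628"  # a word typed with ARABIC LETTER KAF
# The same typed string yields two different domain labels,
# depending on which language the software assumes.
print(ace_label(typed, "ur") != ace_label(typed, "ar"))  # True
```

If the software guesses the wrong language, the user lands on a different name (or none at all), which is why the rules cannot be applied safely without knowing the language of the original URI.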

So I can see the problem, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might create more trouble than benefit, by replacing the problems of errors and ambiguities with the problems of non-interoperable IDNs.

I have posted this to the www-international list for discussion.

The lists below show the characters that may be removed or converted.

The following characters will be allowed in the IRI but removed before conversion to punycode.

These characters are optional in Arabic script, though they can sometimes be useful for disambiguating pronunciation and meaning – particularly useful for Urdu, which has more vowel sounds than Arabic.

064B: ً ARABIC FATHATAN
064C: ٌ ARABIC DAMMATAN
064D: ٍ ARABIC KASRATAN
064E: َ ARABIC FATHA
064F: ُ ARABIC DAMMA
0650: ِ ARABIC KASRA
0651: ّ ARABIC SHADDA
0652: ْ ARABIC SUKUN
0655: ٕ ARABIC HAMZA BELOW
0656: ٖ ARABIC SUBSCRIPT ALEF
0658: ٘ ARABIC MARK NOON GHUNNA
0670: ٰ ARABIC LETTER SUPERSCRIPT ALEF
0612: ؒ ARABIC SIGN RAHMATULLAH ALAYHE
0614: ؔ ARABIC SIGN TAKHALLUS
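As a minimal sketch of the removal step, the code points in the list above can be collected into a set and filtered out of a label before encoding. The names `REMOVABLE` and `strip_optional_marks` are mine, not part of the proposal.

```python
# Code points from the list above: allowed when typing, dropped before Punycode.
REMOVABLE = frozenset(
    list(range(0x064B, 0x0653))  # FATHATAN (064B) through SUKUN (0652)
    + [0x0655, 0x0656, 0x0658, 0x0670, 0x0612, 0x0614]
)

def strip_optional_marks(label: str) -> str:
    """Return the label without the optional marks listed above."""
    return "".join(ch for ch in label if ord(ch) not in REMOVABLE)

vowelled = "\u0627\u064F\u0631\u062F\u064F\u0648"  # letters with DAMMA marks added
plain = "\u0627\u0631\u062F\u0648"                 # the same letters, unvowelled

# After stripping, both spellings encode to the same Punycode string.
same = strip_optional_marks(vowelled).encode("punycode") == plain.encode("punycode")
print(same)  # True
```

Without the stripping step the two spellings would produce different Punycode strings, and so different domain names.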

Some other characters used in Arabic but not Urdu will be allowed but will be converted to a character used in Urdu during conversion to punycode. They are included in the set of allowed characters, however, to avoid confusion when they are used incorrectly.

0629: ة ARABIC LETTER TEH MARBUTA
0643: ك ARABIC LETTER KAF
0649: ى ARABIC LETTER ALEF MAKSURA
064A: ي ARABIC LETTER YEH
0660: ٠ ARABIC-INDIC DIGIT ZERO
0661: ١ ARABIC-INDIC DIGIT ONE
0662: ٢ ARABIC-INDIC DIGIT TWO
0663: ٣ ARABIC-INDIC DIGIT THREE
0664: ٤ ARABIC-INDIC DIGIT FOUR
0665: ٥ ARABIC-INDIC DIGIT FIVE
0666: ٦ ARABIC-INDIC DIGIT SIX
0667: ٧ ARABIC-INDIC DIGIT SEVEN
0668: ٨ ARABIC-INDIC DIGIT EIGHT
0669: ٩ ARABIC-INDIC DIGIT NINE
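The conversion step could be sketched with a translation table. Note that the list above gives only the source characters: every target in `TO_URDU` below is my assumption about the likely Urdu equivalent (for instance, the Extended Arabic-Indic digits U+06F0 to U+06F9 for the digits), not something the proposal states.

```python
# Source code points come from the list above; ALL targets are assumptions.
TO_URDU = {
    0x0629: 0x06C3,  # TEH MARBUTA  -> TEH MARBUTA GOAL (assumed)
    0x0643: 0x06A9,  # KAF          -> KEHEH (assumed)
    0x0649: 0x06CC,  # ALEF MAKSURA -> FARSI YEH (assumed)
    0x064A: 0x06CC,  # YEH          -> FARSI YEH (assumed)
}
# Arabic-Indic digits 0660..0669 -> Extended Arabic-Indic digits 06F0..06F9 (assumed)
TO_URDU.update({0x0660 + d: 0x06F0 + d for d in range(10)})

def to_urdu_ace_label(label: str) -> str:
    """Convert Arabic-only characters to the assumed Urdu ones, then encode."""
    mapped = label.translate(TO_URDU)
    return "xn--" + mapped.encode("punycode").decode("ascii")
```

With such a table, a label typed with ARABIC LETTER KAF and the same label typed with KEHEH would resolve to one and the same encoded name, which is the confusion-avoidance the proposal is after.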