
I’ve been lucky enough to have access to a pre-publication electronic version of the new Unicode Standard 5 book, and though I’ve been terribly busy just lately, I’ve carved out a little time to read and even use some of it. And I like what I see.

I’ve always thought the Unicode book was a really useful thing to have if you need to understand the ins-and-outs of Unicode for implementation purposes, or if you are simply interested in how scripts work. It has always been relatively easy to read, and more like a guidebook than a standard, if you know what I mean. The good news is that that seems to be even more the case in the latest version. There are lots of small edits that improve the clarity of the text and make it more readable.

In simple terms, a grapheme cluster is a sequence of characters that need to be kept together for things like wrapping text at the end of a line, cursor movement, delete, etc.

There are, however, some more significant changes that are also very welcome. For example, I’ve been looking at first-letter styling in CSS recently, particularly in the context of Indian scripts, but despite a lot of searching I was unable to figure out where the Standard actually told me that a default grapheme cluster didn’t cover a whole Indic syllable. The grapheme cluster concept is really quite an important one for implementations, and it was frustrating to see it described so poorly.

All that has changed with extensive additions to Chapter 3. Section 3.6, Combination, now contains a substantial amount of new text that explains grapheme clusters quite clearly. Don’t be put off by the dour-sounding title of Chapter 3, Conformance: it contains lots of useful definitions and explanations in the typical clear and succinct style of the book.
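
To make the point about Indic syllables concrete, here is a rough Python sketch. It is only an approximation of default grapheme clusters: it assumes a cluster is a base character plus any following nonspacing or enclosing marks (general categories Mn and Me), and it ignores the other rules in the Standard (Hangul jamo, CR LF and so on). The function name is my own invention for illustration.

```python
import unicodedata


def approx_grapheme_clusters(text):
    """Rough approximation of default grapheme clusters.

    Assumption: a cluster is a base character followed by any nonspacing
    (Mn) or enclosing (Me) combining marks. The Standard's real rules
    cover more cases than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# One Devanagari syllable: KA, VIRAMA, SSA, VOWEL SIGN I
syllable = "\u0915\u094D\u0937\u093F"
print(len(approx_grapheme_clusters(syllable)))
```

Under these assumptions the single syllable comes out as three clusters, because the virama attaches to KA but the spacing vowel sign does not attach to SSA. That is why first-letter styling can't simply rely on the default cluster for Indic scripts.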

I have to admit to a tinge of disappointment that the Standard Annexes which are now included in the book have simply been added as appendices, rather than integrated into the text proper. My evaluation copy didn’t actually contain this text, so I can’t comment further, however.

Also, I had decided a short while ago that I need to finally get to grips with Tibetan script, and some urgency has been added to that given that I will visit Bhutan in January. I was disappointed, therefore, to find that the section on Tibetan script had not been edited at all. That section has always been substandard, to my mind, in terms of clarity and writing style.

On the other hand, I see that helpful additions have been made to existing block descriptions elsewhere (such as a new section on Rendering of Thai Combining Marks in the Thai description). I see similar additions to block descriptions such as Lao, Gujarati and Gurmukhi, and the Bengali block description seems to have been largely rewritten. I’m looking forward to getting my teeth into those and also the numerous, enticing new block descriptions, such as Phags-pa, N’Ko, Sumero-Akkadian (cuneiform) and the like.

So would I recommend it? Certainly. The Unicode Standard is a mine of useful and accessible information if, as I said, you are implementing Unicode-based applications or you are interested in how scripts work. And it’s worth replacing your previous version, not only because the new smaller format will make it much easier to handle and keep on your bookshelf, but because of the value of the many useful additions. I’ll be picking up my copy at the Unicode Conference in Washington next month.


Start the app

This dynamic HTML app helps you convert between Unicode character numbers, characters, UTF-8 and UTF-16 code units in hex, percent escapes, and Numeric Character References (hex and decimal).

This new version adds some useful things:

  • You can now convert to and from percent escaped forms. When converting to percent escapes, characters allowed in URI syntax are not converted. When converting from percent escapes you can only use characters allowed in URIs.
  • You can also now convert from a mixture of characters and escapes in the bottom two fields.
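
For readers who want to see what these conversions amount to, here is a minimal Python sketch. It is not the app's code: the helper names `utf16_units`, `describe` and `from_percent` are hypothetical, and I assume that only the unreserved URI characters (letters, digits and `-._~`) are left unescaped when producing percent escapes.

```python
from urllib.parse import quote, unquote


def utf16_units(text):
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def describe(text):
    """Return several escaped forms of text (illustrative helper only)."""
    return {
        "code_points": " ".join(f"U+{ord(c):04X}" for c in text),
        "utf8": " ".join(f"{b:02X}" for b in text.encode("utf-8")),
        "utf16": " ".join(f"{u:04X}" for u in utf16_units(text)),
        # Assumption: everything except unreserved characters is escaped.
        "percent": quote(text, safe=""),
        "ncr_hex": "".join(f"&#x{ord(c):X};" for c in text),
        "ncr_dec": "".join(f"&#{ord(c)};" for c in text),
    }


def from_percent(escaped):
    """Decode percent escapes, interpreting the bytes as UTF-8."""
    return unquote(escaped)


for name, value in describe("\u00E9").items():
    print(f"{name}: {value}")
```

A supplementary character such as U+1D11E shows why the UTF-16 column matters: it needs two code units (a surrogate pair) rather than one.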

Some people have construed this as an attack on IE7. It is absolutely not. I’m trying to be helpful. Microsoft has always taken great care not to break things for their customers when releasing new browser versions. I’m just trying to point out an issue I think they may have missed. The title summarises the issue.

The IE7 blog just announced Microsoft’s intention to change the way browser preferences for Accept-Language are set up by default. Basically, your preferences will no longer be set by default to fr if you’re French, but to fr-FR instead, i.e. your locale as determined by Windows settings.

I think this is going to cause major problems with content negotiation on the Web.

To give a practical example:
Set your language settings to just es-MX and/or es-ES and point your browser to this article on the W3C site (an article explaining how to set language preferences).

You’ll get back the English version, even though there’s a Spanish version there. Someone with es set in IE6, Opera or Firefox will see the Spanish version automatically – even if their preferences are es-MX then es.

This is down to the way language negotiation is done on the Apache server.

In the article linked to above we explain that “Some of the server-side language selection mechanisms require an exact match to the Accept-Language header. If a document on the server is tagged as fr (French) then a request for a document matching fr-CH (Swiss French) will fail. To ensure success you should configure your browser to request both fr-CH and fr.”

This is from the Apache 2 documentation:

The server will also attempt to match language-subsets when no other match can be found. For example, if a client requests documents with the language en-GB for British English, the server is not normally allowed by the HTTP/1.1 standard to match that against a document that is marked as simply en. (Note that it is almost surely a configuration error to include en-GB and not en in the Accept-Language header, since it is very unlikely that a reader understands British English, but doesn’t understand English in general. Unfortunately, many current clients have default configurations that resemble this.)

Apache 2 introduces “some exceptions … to the negotiation algorithm to allow graceful fallback when language negotiation fails to find a match”, but those using Apache 1 don’t have that luxury.
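
The difference between the two behaviours can be sketched in a few lines of Python. This is a simplified model, not Apache's actual algorithm: it ignores q-values and every other negotiation dimension, and the `negotiate` function and its `allow_prefix_fallback` switch are names I made up for illustration.

```python
def negotiate(accept_languages, available, allow_prefix_fallback=False):
    """Pick a document language for a request (simplified model).

    accept_languages: language tags in the user's order of preference.
    available: language tags the server has documents for.
    allow_prefix_fallback: hypothetical switch modelling a server that,
        when no exact match exists, also tries the primary subtag
        (es for es-MX).
    Returns the chosen tag, or None if the server must use its default.
    """
    by_lower = {tag.lower(): tag for tag in available}

    # Pass 1: exact matches only, in order of preference.
    for tag in accept_languages:
        if tag.lower() in by_lower:
            return by_lower[tag.lower()]

    # Pass 2 (optional): fall back to the primary language subtag.
    if allow_prefix_fallback:
        for tag in accept_languages:
            primary = tag.lower().split("-")[0]
            if primary in by_lower:
                return by_lower[primary]

    return None


documents = ["en", "es"]
print(negotiate(["es-MX"], documents))                              # exact match only
print(negotiate(["es-MX", "es"], documents))                        # user also lists es
print(negotiate(["es-MX"], documents, allow_prefix_fallback=True))  # server falls back
```

In the first call nothing matches, so a server that requires exact matches serves its default (English, in the example above). The second and third calls both find the Spanish version, one because the browser asks for es as well, the other because the server is willing to fall back.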

Apart from the fact that most users wouldn’t even know that they can set their browser preferences differently, not to mention knowing how to do that, IE7 RC1 doesn’t even provide a preset selection for es rather than es-ES – you have to enter it manually. Not likely to happen much.

It seems to me that a simple fix to this would be for IE7 to set the user’s default preferences to *also* include the bare language tag (i.e. es-ES, es for Spain, fr-FR, fr for France, etc.). Then, when no file tagged fr-FR is found, the server will find and return a French file. Those people who want to know where the user’s browser is (likely to be) physically located can still use the fr-FR information to get the locale.
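
As a sketch of what that default could look like, the following Python function takes a preference list and adds each missing bare language tag after the last regional variant of that language. The function name `with_base_languages` is hypothetical, and the ordering rule is my own assumption about what a sensible default would be; it also assumes the input list contains no duplicates.

```python
def with_base_languages(tags):
    """Return tags with each missing primary language tag added.

    The bare tag (e.g. es) is inserted directly after the last tag that
    shares its primary subtag, so regional preferences still come first.
    """
    result = list(tags)
    for tag in tags:
        base = tag.split("-")[0]
        if base in result:
            continue  # the user already lists the bare language
        # Index of the last tag whose primary subtag is this language.
        last = max(i for i, t in enumerate(result) if t.split("-")[0] == base)
        result.insert(last + 1, base)
    return result


print(", ".join(with_base_languages(["es-ES"])))
print(", ".join(with_base_languages(["fr-FR", "en-GB"])))
```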

I think that the result of ignoring this is that many people will be confused about why they no longer see a page in Spanish, when they did before, and a lot of hard work by content developers will go unnoticed on the Web. In short, I think Microsoft is about to introduce a serious bug into IE7.

Note, in passing, that the rules for specifying the lang attribute in HTML and xml:lang in XHTML are described by BCP 47. The latest syntax and matching specifications are RFC 4646 and RFC 4647, which obsolete RFC 3066 and RFC 1766, and which tell you to go to the IANA Language Subtag Registry to find out what language codes to use, rather than the ISO code lists.

Btw, I tried posting this as a comment on the IE7 blog page, but it didn’t work (site busy) so I did it here.


I got an email this morning asking for some use cases for the CSS :lang selector. Here are some ideas. This should help content authors understand how using :lang can sometimes be better than other approaches when selecting content for styling. Of course, not all user agents support :lang, and hopefully these use cases will also show how enabling support could be useful.

Use case 1

One of the main cases where I want to use :lang is when I have a page that includes numerous short pieces of text in a different script. Take, for example, my notes on the Myanmar script. In such cases I want to assign a particular font and perhaps font-size, etc, to the numerous Myanmar examples.

It does my head in trying to ensure that I’ve labelled all the Myanmar text with class attributes so that I get the right font and colour applied. And it’s frustrating, because all I’m doing is repeating information that’s there already in the lang attribute (and in the xml:lang attribute too, given that this is XHTML).

Adding class="my" everywhere also bulks up the document. Even in this smallish document, it adds over 1K to the page size.

It would make life a lot easier to just include a single CSS rule:

:lang(my) { font-family: myanmar1, sans-serif; color:red; font-size: 130%; }

Use case 2

Suppose you have the following Japanese text in an English document:

<blockquote lang="ja" xml:lang="ja">ワールド・ワイド・ウェッブを<em>世界中</em>に広げましょう</blockquote>

Now suppose you want to apply different emphasis styling to the Japanese text, since italicisation doesn’t work well for ideographic scripts in small font sizes. Let’s suppose we wanted to add the proposed wakiten emphasis style that CSS3 describes. How do you make that happen?

Well, ideally, you’d just add the following rule to your CSS, and all would be taken care of:

em:lang(ja) { font-emphasize: dot before; font-style: normal; }

(“When you encounter an em tag and the language is Japanese use wakiten and remove the italics.”)

If you’re dealing with IE6, :lang is not supported, and you’d actually have to add a special class to each and every emphasis tag embedded in Japanese text and use a rule such as

em.ja { ... }

How annoying is that!

IE7 RC1 supports the CSS attribute selectors [lang|=...] and [lang=...]. Aha! you might think, problem solved. We can use the following rule:

em[lang |= 'ja'] { ... }

But you’d be wrong. This only works if the language is declared on the em element itself. So you’d still have to go through and add lang="ja" xml:lang="ja" to each em element – even though you have already declared that the whole blockquote is in Japanese!
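
The difference between the two selectors is easy to model. The Python sketch below is not a CSS engine: it is a small simulation, with function names of my own choosing, of how an attribute selector looks only at the element itself while :lang takes the inherited language into account. For simplicity it assumes only the plain lang attribute is present and compares tags case-sensitively.

```python
import xml.etree.ElementTree as ET


def matches_lang_attr(element, code):
    """Model of em[lang|='ja']: only the element's own lang attribute counts."""
    value = element.get("lang")
    return value is not None and (value == code or value.startswith(code + "-"))


def matches_lang_pseudo(element, code, parents):
    """Model of em:lang(ja): the language is inherited from the nearest
    ancestor (or the element itself) that declares one."""
    node = element
    while node is not None:
        if node.get("lang") is not None:
            return matches_lang_attr(node, code)
        node = parents.get(node)
    return False


root = ET.fromstring(
    '<blockquote lang="ja">ワールド・ワイド・ウェッブを<em>世界中</em>に広げましょう</blockquote>'
)
# ElementTree has no parent pointers, so build a child-to-parent map.
parents = {child: parent for parent in root.iter() for child in parent}
em = root.find("em")

print(matches_lang_attr(em, "ja"))             # attribute selector model
print(matches_lang_pseudo(em, "ja", parents))  # :lang model
```

The attribute model fails for the em element because the declaration sits on the blockquote, whereas the :lang model succeeds by walking up to it.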

Use case 3

This use case is slightly less mainstream, but it is becoming more common with the increase in multilingual blogs and AJAX-powered pages. It applies when you include text in a page that comes from another environment, either by cut & paste or by automatic means, and you don’t have the styling information that was associated with it originally.

Assuming that the text has language attributes, or that you can apply those, you could have a set of default rules in your environment that, say, apply a nastaliq font with a percentage size scaling factor to all text in Urdu, so that it has some styling at least, and is a reasonable size relative to the Latin text.

For example, if I cut and paste some Urdu text into this blog, it could make the difference between seeing this:
[Image: text in English and Urdu without styling]

and this:
[Image: text in English and Urdu with styling]

Adding, once, a couple of rules in your blog css that say:

:lang(ur) { font-family: standardMSUrdufont, standardMacUrdufont, standardUnixUrdufont, serif; font-size: 140%; }
em:lang(ur) { font-weight: bold; font-style: normal; }

would be preferable to having to add extra inline markup to the text as you add it to your blog each time.

As a similar example, I just released the latest version of the UniView tool (a kind of web-based Character Map on steroids). It includes a facility that allows you to write your own notes about characters in a separate document and see the relevant notes when looking up a specific character. The information is sucked in using AJAX features. See [1].

At the moment we do not try to incorporate or recognize the other document’s style rules when the notes are displayed in UniView. However, while keeping things simple, it may be useful to allow the UniView user to switch on or off some very general default style rules that apply fonts and/or font sizing to text marked up for a particular language.

As long as the code is marked up for language, such defaults can be applied regardless of what class names or styling appeared in the original document. Of course, :lang would be very useful in this respect.

[1] To see this example
a. open UniView
b. where it says “Select a range to display” select Myanmar
c. click on character 1004 and see the description on the right
d. now click on the icon with a + sign between Notes: and Search string: fields
e. from the menu select Myanmar block and say ok, and dismiss the pop up
f. now click on character 1004 again, and see the notes added to the description on the right – these notes came from an XML file (see the same file served as xhtml)

(Anyone can write such a document, stick it on a server and include its information in UniView. The only requirement is that the notes you want to appear be surrounded by <div class="notes" id="C[hexCodepoint]"></div>. The example above is one such file supplied with UniView.)
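
To show how little is needed to consume such a file, here is a Python sketch that pulls the notes out of a document following that convention. It is not UniView's code (UniView does this in the browser); the `load_notes` function is hypothetical, and for brevity it assumes a well-formed document whose elements are not in an XML namespace.

```python
import xml.etree.ElementTree as ET


def load_notes(xhtml_text):
    """Map code points to note text for every <div class="notes" id="C...">.

    Assumption: elements carry no XML namespace, so the tag is plain 'div'.
    """
    notes = {}
    root = ET.fromstring(xhtml_text)
    for div in root.iter("div"):
        classes = (div.get("class") or "").split()
        ident = div.get("id") or ""
        if "notes" not in classes or not ident.startswith("C"):
            continue
        try:
            code_point = int(ident[1:], 16)  # id is 'C' + hex code point
        except ValueError:
            continue  # not a code point id, skip it
        notes[code_point] = "".join(div.itertext()).strip()
    return notes


sample = (
    "<body>"
    '<div class="notes" id="C1004"><p>Sample note for U+1004.</p></div>'
    '<div id="other">Not a notes block.</div>'
    "</body>"
)
for cp, text in load_notes(sample).items():
    print(f"U+{cp:04X}: {text}")
```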

Other useful stuff

At the W3C Internationalization site you can find:

  1. an article that answers the question: “What is the most appropriate way to associate CSS styles with text in a particular language in a multilingual XHTML/HTML document?”
  2. a set of test pages relating to user agent support of :lang, lang|= and lang= and a fairly recent summary of results

New version

This is a major new release of UniView, bringing it up to date with the Unicode Standard version 5.0.0, but also improving the user interface and adding AJAX links to supplementary notes.


  • Updated to support Unicode 5.0.0.
  • Restyled the menu panels, moving some less used functions to pop up windows to save on horizontal space.
  • Implemented an AJAX approach for incorporating notes files. This means that the page no longer has to be reloaded to add notes. It is now also possible to add more than one set of notes at a time. Note that these changes require a small change to the markup of notes files – the div containing the notes for display has to have a class name ‘notes’ as well as the id for the character.
  • I added some bundled notes files – most notably Myanmar. Note that these are subject to change on an ongoing basis.

Most of the properties displayed in the character-detail panel on the right are taken from the UnicodeData file at the moment. I plan to incorporate additional property information over the coming months, but wanted to release this now so that you can get information about Unicode 5 characters sooner rather than later.