Dochula Pass, Bhutan

Three bopomofo letters with tone mark.

Light tone mark in annotation.

A key issue for handling of bopomofo (zhùyīn fúhào) is the placement of tone marks. When bopomofo text runs vertically (either on its own, or as a phonetic annotation), some smarts are needed to display tone marks in the right place. This may also be required (though with different rules) for bopomofo when used horizontally for phonetic annotations (ie. above a base character), but not in all such cases. However, when bopomofo is written horizontally in any other situation (ie. when not written above a base character), the tone mark typically follows the last bopomofo letter in the syllable, with no special handling.

From time to time questions are raised on W3C mailing lists about how to implement phonetic annotations in bopomofo. Participants in these discussions need a good understanding of the various complexities of bopomofo rendering.

To help with that, I just uploaded a new Web page Bopomofo on the Web. The aim is to provide background information, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I intend to update the page from time to time, as new information becomes available.

Screen Shot 2015-01-18 at 07.42.56

Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between latin and bengali. You can use these buttons to transcribe to and from latin transcriptions using ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

There is some confusion about which shapes should be produced by fonts for Mongolian characters. Most letters have at least one isolated, initial, medial and final shape, but other shapes are produced by contextual factors, such as vowel harmony.

Unicode has a list of standardised variant shapes, dating from 27 November 2013, but that list is not complete and contains what are currently viewed by some as errors. It also doesn’t specify the expected default shapes for initial, medial and final positions.

The original list of standardised variants was based on 蒙古文编码 by Professor Quejingzhabu in 2000.

A new proposal was published on 20 January 2014, which attempts to resolve the current issues, although I think that it introduces one or two issues of its own.

The other factor in this is what the actual fonts do. Sometimes they follow the Unicode standardised variants list, other times they diverge from it. Occasionally a majority of implementations appear to diverge in the same way, suggesting that the standardised list should be adapted to reality.

To help unravel this, I put together a page called Notes on Mongolian variant forms that visually shows the changes between the various proposals, and compares the results produced by various fonts.

This is still an early draft. The information only covers the basic Mongolian range – Todo, Sibe, etc still to come. Also, I would like to add information about other fonts, if I can obtain them.

Update: 16 Apr 2015, The Todo, Sibe, Manchu, Sanskrit and Tibetan characters are now all done, and font information added for them. (And the document was moved to github.)

See the Tibetan Script Notes

Last March I pulled together some notes about the Tibetan script overall, and detailed notes about Unicode characters used in Tibetan.

I am writing these pages as I explore the Tibetan script as used for the Tibetan language. They may be updated from time to time and should not be considered authoritative. Basically I am mostly simplifying, combining, streamlining and arranging the text from the sources listed at the bottom of the page.

The first half of the script notes page describes how Unicode characters are used to write Tibetan. The second half looks at text layout in Tibetan (eg. line-breaking, justification, emphasis, punctuation, etc.)

The character notes page lists all the characters in the Unicode Tibetan block, and provides specific usage notes for many of them per their use for writing the Tibetan language.

See the Tibetan Character Notes

Tibetan is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs. Text runs from left to right.

There are various different Tibetan scripts, of two basic types: དབུ་ཙན་ dbu-can, pronounced /uchen/ (with a head), and དབུ་མེད་ dbu-med, pronounced /ume/ (headless). This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.

The pronunciation of Tibetan words is typically much simpler than the orthography, which involves patterns of consonants. These reduce ambiguity and can affect pronunciation and tone. In the notes I try to explain how that works, in an approachable way (though it’s still a little complicated, at first).

Traditional Tibetan text was written on pechas (དཔེ་ཆ་ dpe-cha), loose-leaf sheets. Some of the characters used and formatting approaches are different in books and pechas.

For similar notes on other scripts, see my docs list.

It’s disappointing to see that non-standard implementations of UTF-8 are being used by the BBC on their BBC Burmese Facebook page.

Take, for example, the following text.

On the actual BBC site it looks like this (click on the burmese text to see a list of the characters used):

အိန္ဒိယ မိန်းမငယ် ၂ဦး အမှု ဆေးစစ်ချက် ကွဲလွဲနေ

As far as I can tell, this is conformant use of Unicode codepoints.

Look at the same title on the BBC’s Facebook page, however, and you see:

အိႏၵိယ မိန္းမငယ္ ၂ဦး အမႈ ေဆးစစ္ခ်က္ ကြဲလြဲေန

Depending upon where you are reading this (as long as you have some Burmese font and rendering support), one of the two lines of Burmese text above will contain lots of garbage. For me, it’s the second (non-standard).

This non-standard approach uses visual encoding for combining characters that appear before or on both sides of the base, uses Shan or Rumai Palaung codepoints for subjoining consonants, uses the wrong codepoints for medial consonants, and uses the virama instead of the asat at the end of a word.

I assume that this is because of prevalent use of the non-standard approach on mobile devices (and that the BBC is just following that trend), caused by hacks that arose when people were impatient to get on the Web but script support was lagging in applications.

However, continuing this divergence does nobody any long-term good.

[ Find fonts and other resources for the Myanmar script ]

I’ve been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.

The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It’s very high level, and basically just draws out some of the uncertainties that seem to surround the topic.

Let’s suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.

Arabic justification #1

Unjustified Arabic text

Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).

To keep it simple, lets just focus on the top two lines.

One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.

Arabic justification #2

Justification using tatweels

The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.

Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:

  • extensions can only appear between certain characters, and are forbidden around other characters
  • the number of allowable extensions per word and per line is usually kept to a minimum
  • words vary in appropriateness for extension, depending on word length
  • there are rules about where in the line extensions can appear – usually not at the beginning
  • different font styles have different rules

An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.

A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.

In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word to long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.

Arabic justification #3

Tatweels in the wrong place due to just a font change

To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)

Arabic justification #4: Nastaliq

Same text in the Nastaliq font style

And fonts in the Ruqah style never use elongation at all. (We’ll come back to how you justify text using Ruqah-style fonts in a moment.)

Arabic justification #5: Ruqah

Same text in the Ruqah font style

In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We’ll again focus on the first two lines, and consider how to justify them.

Arabic justification #6: Naskh

Same text in the Naskh font style

High end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.

Arabic justification #7: kashida elongation

Justification using letter elongation (kashida)

In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:

  1. the rules for applying the right-sized elongations to the right characters has to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc, the location and size of elongations needs to be reconfigured
  2. there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.

The latter is the fundamental issue we face. There is very little, high-quality information available about how to do this, and a lack of consensus about, not only what the rules are, but how justification should be done.

Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.

Arabic justification #8: space based

Justification using inter-word spacing

The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don’t currently indicate this information.

Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I’ll show some examples.

By the way, for all the complexity so far described this is all quite a simplistic overview of what’s involved in Arabic justification. For example, high end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.

The key messages:

  1. We need an Arabic Layout Requirements document to capture the script needs.
  2. Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
  3. To start all this, we need experts to provide information and develop consensus.

Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!

Picture of the page in action.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script.

I’ve updated the page to show the fonts added in Windows8. This is the list:

  • Aldhabi (Urdu Nastiliq)
  • Urdu Typesetting (Urdu Nastiliq)
  • Gadugi (Cherokee/Unified Canadian Aboriginal Syllabics)
  • Myanmar Text (Myanmar)
  • Nirmala UI (10 Indic scripts)

There were also two additional UI fonts for Chinese, Jhenghei UI (Traditional) and Yahei UI (Simplified), which I haven’t listed. Also Microsoft Uighur acquired a bold font.

>> See the list

See the blog post for the first version or the page for more information.

Update, 25 Jan 2013

Patrick Andries pointed out that Tifinagh using the Windows Ebrima font was missing from the list. Not any more.

Characters in the Unicode Balinese block.

I just uploaded an initial draft of an article Balinese Script Notes. It lists the Unicode characters used to represent Balinese text, and briefly describes their use. It starts with brief notes on general script features and discussions about which Unicode characters are most appropriate when there is a choice.

The script type is abugida – consonants carry an inherent vowel. It’s a complex script derived from Brahmi, and has lots of contextual shaping and positioning going on. Text runs left-to-right, and words are not separated by spaces.

I think it’s one of the most attractive scripts in Unicode, and for that reason I’ve been wanting to learn more about it for some time now.

>> Read it

Picture of the page in action.

I’ve wanted to get around to this for years now. Here is a list of fonts that come with Windows7 and Mac OS X Snow Leopard/Lion, grouped by script.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script. I’m still working on the list, and there are some caveats.

>> See the list

Some of the fonts listed above may be disabled on the user’s system. I’m making an assumption that someone who reads tibetan will have the Tibetan font turned on, but for my articles that explain writing systems to people in English, such assumptions may not hold.

The list I used to identify Windows fonts is Windows7-specific and fairly stable, but the Mac font spans more than one version of Mac OS X, and I could only find an unofficial list of fonts for Snow Leopard, and there were some fonts on that list that I didn’t have on my system. Where a Mac font is new with Lion (and there are a significant number) it is indicated. See the official list of fonts on Mac OS X Lion.

There shouldn’t be any fonts listed here for a given script that aren’t supplied with Windows7 or Mac OS X Snow Leopard/Lion, but there are probably supplied fonts that are not yet listed here (typically these will be large fonts that cover multiple scripts). In particular, note that I haven’t yet made a list of fonts that support Latin, Greek and Cyrillic (mainly because there are so many of them and partly because I’m wondering how useful it will be.)

The text used is as much as would fit on one line of article 1 of the Universal Declaration of Human Rights, taken from this Unicode page, wherever I could find it. I created a few instances myself, where it was missing, and occasionally I resorted to arbitrary lists of characters.

You can obtain a character-based version of the text used by looking at the source text: look for the title attribute on the section heading.

Things still to do:

  • create sections for Latin, Greek and Cyrillic fonts
  • check for fonts covering multiple Unicode blocks
  • figure out how to tell, and how to show which is the system default
  • work out and show what’s not available in Windows XP
  • work out what’s new in Lion, and whether it’s worth including them
  • figure out whether people with different locale setups see different things
  • recapture all font images that need it at 36px, rather than varying sizes

Update, 19 Feb 2012

I uploaded a new version of the font list with the following main changes:

  • If you click on an image you see text with that font applied (if you have it on your system, of course). The text can be zoomed from 14px to 100px (using a nice HTML5 slider, if you have the right browser! [try Chrome, Safari or Opera]). This text includes a little Latin text so you can see the relationship between that and the script.
  • All font graphics are now standardised so that text is imaged at a font size of 36px. This makes it more difficult to see some fonts (unless you can use the zoom text feature), but gives a better idea of how fonts vary in default size.
  • I added a few extra fonts which contained multiple script support.
  • I split Chinese into Simplified and Traditional sections.
  • Various other improvements, such as adding real text for N’Ko, correcting the Traditional Chinese text, flipping headers to the left for RTL fonts, reordering fonts so that similar ones are near to each other, etc.

Webfonts, and WOFF in particular, have been in the news again recently, so I thought I should mention that a few days ago I changed my pages describing Myanmar and Arabic-for-Urdu scripts so that you can download the necessary font support for the foreign text, either as a TTF linked font or as WOFF font.

You can find the Myanmar page at Look for the links n the side bar to the right, under the heading “Fonts”.

The Urdu page, using the beautiful Nastaliq script, is at

(Note that the examples of short vowels don’t use the nastiliq style. Scroll down the page a little further.)

I haven’t had time to check whether all the opentype features are correctly rendered, but I’ve been doing Mac testing of the i18n webfonts tests, and it looks promising. (More on that later.) The Urdu font doesn’t rely on OS rendering, which should help.

Here are some examples of the text on the page:
Examples of Urdu script and Myanmar script.

I finally got around to refreshing this article, by converting the Bengali, Malayalam and Oriya examples to Unicode text. Back when I first wrote the article, it was hard to find fonts for those scripts.

I also added a new feature: In the HTML version, click on any of the examples in indic text and a pop-up appears at the bottom right of the page, showing which characters the example is composed of. The pop-up lists the characters in order, with Unicode names, and shows the characters themselves as graphics.

I have not yet updated this article’s incarnation as Unicode Technical Note #10. The Indian Government also used this article, and made a number of small changes. I have yet to incorporate those, too.

Characters in the Unicode Bengali block.

If you’re interested, I just did a major overhaul of my script notes on Bengali in Unicode. There’s a new section about which characters to use when there are multiple options (eg. RRA vs. DDA+nukta), and the page provides information about more characters from the Bengali block in Unicode (including those used in Bengali’s amazingly complicated currency notation prior to 1957).

In addition, this has all been squeezed into the latest look and feel for script notes pages.

The new page is at a new location. There is a redirect on the old page.

Hope it’s useful.

>> Read it

Picture of the page in action.
Picture of the page in action.

About the tools: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility

Latest changes: The Urdu and Tamil pickers have been upgraded to version 10. This provides new views of the data, but also involved a thorough overhaul and redesign of the pickers. Transliteration functions have also been added for the Tamil picker.

In addition, the Urdu notes page was updated and a new Tamil notes page was created. Database entries were also updated or, in the case of Tamil, created to support the notes pages. These notes pages are the first to use a new look and feel, based on the analyse-string tool I produced earlier this year. This adds information about each character from the Unicode descriptions data to that from my own database.

>> Read the notes

Today I put the finishing touches to and uploaded my first draft notes about the long lost ishidic script. See what you think of it.

Here’s a small section of the sample text shown at the bottom of the post. Click on it to see the whole transcript.

Part of a sample of text written in ishidic script.

>> Read it !

Picture of the page in action.

I finally got to the point, after many long early morning hours, where I felt I could remove the ‘Draft’ from the heading of my Myanmar (Burmese) script notes.

This page is the result of my explorations into how the Myanmar script is used for the Burmese language in the context of the Unicode Myanmar block. It takes into account the significant changes introduced in Unicode version 5.1 in April of this year.

Btw, if you have JavaScript running you can get a list of characters in the examples by mousing over them. If you don’t have JS, you can link to the same information.

There’s also a PDF version, if you don’t want to install the (free) fonts pointed to for the examples.

Here is a summary of the script:

Myanmar is a tonal language and is syllable-based. The script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs.

Spaces are used to separate phrases, rather than words. Words can be separated with ZWSP to allow for easy wrapping of text.

Words are composed of syllables. These start with an consonant or initial vowel. An initial consonant may be followed by a medial consonant, which adds the sound j or w. After the vowel, a syllable may end with a nasalisation of the vowel or an unreleased glottal stop, though these final sounds can be represented by various different consonant symbols.

At the end of a syllable a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel.

In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used.

Text runs from left to right.

There are a set of Myanmar numerals, which are used just like Latin digits.

So, what next. I’m quite keen to get to Mongolian. That looks really complicated. But I’ve been telling myself for a while that I ought to look at Malayalam or Tamil, so I think I’ll try Malayalam.

New article

In-progress draft of notes that list the symbols used to represent Bengali, describe their use, and relate them to appropriate characters for representation in Unicode. There is an index of shapes you can use to look up Bengali glyphs and track them down to their constituent Unicode codepoints.