Dochula Pass, Bhutan

I have uploaded a new version of the Thai character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible, and dispenses with the transcription view. If you prefer, you can still access the previous version.

Other changes include:

  • Significant rearrangement of the default selection table. The new arrangement makes it easy to choose the right characters if you have a Latin transcription to hand; this made the previous transcription view redundant, while also speeding up that type of picking.
  • Addition of Latin prompts to help locate letters (standard with v15).
  • Automatic transcription from Thai into ISO 11940-1, ISO 11940-2 and IPA. Note that for the last two there are some corner cases where the results are not quite correct, due to the ambiguity of the script, and note also that you need to show syllable boundaries with spaces before transcribing. (There’s a way to remove those spaces quickly afterwards.) See below for more information.
  • Hints! When switched on, mousing over a character highlights other similar characters, or characters incorporating the shape you moused over. This is particularly useful for people who don’t know the script well and may miss small differences, but it’s also handy for finding a character when you first spot something similar.
  • It also comes with the new v15 features that are standard, such as shape-based picking without losing context, range-selectable codepoint information, a rehabilitated escapes button, the ability to change the font of the table and the line-height of the output, and the ability to turn off autofocus on mobile devices to stop the keyboard jumping up all the time, etc.

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

More about the transcriptions: There are three buttons that allow you to convert from Thai text to Latin transcriptions. If you highlight part of the text, only that part will be transcribed.

The toISO-1 button produces an ISO 11940-1 transliteration that latinises the Thai characters without changing their order. The result doesn’t normally tell you how to pronounce the Thai text, but it can be converted back to Thai, since each Thai character is represented by a unique sequence in Latin. This transcription should produce fully conformant output. There is no need to identify syllable boundaries first.
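To illustrate the principle of a reversible, order-preserving transliteration, here is a minimal Python sketch. It is only an illustration: the Latin values in the mapping are placeholders I have invented, not the actual ISO 11940-1 tables, and the helper names are made up.

```python
# Illustrative placeholder mapping only - not the real ISO 11940-1 values.
THAI_TO_LATIN = {
    "\u0E01": "k",   # THAI CHARACTER KO KAI (placeholder value)
    "\u0E19": "n",   # THAI CHARACTER NO NU (placeholder value)
    "\u0E32": "a",   # THAI CHARACTER SARA AA (placeholder value)
}
# Because each Thai character maps to a unique Latin sequence, the mapping
# can be inverted, which is what makes the transliteration reversible.
LATIN_TO_THAI = {latin: thai for thai, latin in THAI_TO_LATIN.items()}

def transliterate(text):
    # Characters are converted one by one, in their original order,
    # so no syllable segmentation is needed.
    return "".join(THAI_TO_LATIN.get(ch, ch) for ch in text)
```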

The toISO-2 and toIPA buttons produce output that is intended to approximately reflect actual pronunciation. They work fine most of the time, but there are occasional ambiguities and idiosyncrasies in Thai which will cause the converter to render certain less common syllables incorrectly. The converter also doesn’t automatically add accent marks to the phonetic version (though that may be added later). So the output of these buttons should be treated as something that gets you 90% of the way. NOTE: Before using these two buttons you need to add spaces or hyphens between each syllable of the Thai text. Syllable boundaries are important for correct interpretation of the text, and they are not detected automatically.

The condense button removes the spaces from the highlighted range (or the whole output area, if nothing is highlighted).
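As a rough sketch of that behaviour (the picker itself is a JavaScript web page; this is just a Python illustration, and the condense helper and its selection parameters are made-up names):

```python
def condense(text, sel_start=None, sel_end=None):
    """Remove spaces from the selected range, or from the whole text
    if nothing is selected, mirroring the behaviour described above."""
    if sel_start is None or sel_end is None:
        return text.replace(" ", "")
    return text[:sel_start] + text[sel_start:sel_end].replace(" ", "") + text[sel_end:]
```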

Note: For the toISO-2 transcription I use a macron over long vowels. This is non-standard.

I have uploaded a new version of the Tibetan character picker.

The new version dispenses with the images for the selection table. If you don’t have a suitable font to display the new version of the picker, you can still access the previous version, which uses images.

Other changes include:

  • Significant rearrangement of the default table, with many less common symbols moved into a location that you need to click on to reveal. This declutters the selection table.
  • Addition of Latin prompts to help locate letters (standard with v15).
  • Hints (when switched on, mousing over a character highlights other similar characters, or characters incorporating the shape you moused over; particularly useful for people who don’t know the script well and may miss small differences, but also handy for finding a character when you first spot something similar).
  • A new Wylie button that converts Tibetan text into an extended Wylie Latin transcription. There are still some uncommon characters that don’t work, but it should cover most normal needs. I used diacritics over lowercase letters rather than uppercase letters, except for the fixed form characters. I also didn’t provide conversions for many of the symbols – they will appear without change in the transcription. See the notes on the page for more information.
  • The Codepoints button, which produces a list of characters in the output box, now has a new feature. If you have highlighted some text in the output box, only the highlighted characters are listed; if there is no highlight, the contents of the whole output box are listed. (A rough sketch of this behaviour appears after this list.)
  • Don’t forget, if you are using the picker on an iPad or mobile device, to set Autofocus to Off before tapping on characters. This stops the device keypad popping up every time you select a character. (This is also standard for v15.)
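Here is a rough sketch of the kind of processing behind that Codepoints listing. The picker itself is JavaScript; this Python illustration (with a made-up helper name, list_codepoints) just shows the idea of turning the selection, or the whole output box contents, into “U+XXXX NAME” lines.

```python
import unicodedata

def list_codepoints(text):
    """Produce a 'U+XXXX NAME' line for each character in the given text.
    Pass in the highlighted text if there is a selection, otherwise the
    whole contents of the output box."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return "\n".join(lines)

print(list_codepoints("ཀ་"))
# U+0F40 TIBETAN LETTER KA
# U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG
```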


There is some confusion about which shapes should be produced by fonts for Mongolian characters. Most letters have at least one isolated, initial, medial and final shape, but other shapes are produced by contextual factors, such as vowel harmony.

Unicode has a list of standardised variant shapes, dating from 27 November 2013, but that list is not complete and contains what are currently viewed by some as errors. It also doesn’t specify the expected default shapes for initial, medial and final positions.

The original list of standardised variants was based on 蒙古文编码 (Mongolian Encoding), published by Professor Quejingzhabu in 2000.

A new proposal was published on 20 January 2014, which attempts to resolve the current issues, although I think that it introduces one or two issues of its own.

The other factor in this is what the actual fonts do. Sometimes they follow the Unicode standardised variants list, other times they diverge from it. Occasionally a majority of implementations appear to diverge in the same way, suggesting that the standardised list should be adapted to reality.

To help unravel this, I put together a page called Notes on Mongolian variant forms that visually shows the changes between the various proposals, and compares the results produced by various fonts.

This is still an early draft. The information only covers the basic Mongolian range – Todo, Sibe, etc. are still to come. Also, I would like to add information about other fonts, if I can obtain them.

If you use my Unicode character pickers, you may have noticed some changes recently. I’ve moved several pickers on to version 14. Most of the noticeable changes are in the location and styling of elements on the UI – the features remain pretty much unchanged.

Pages have acquired a header at the top (which is typically hidden) that provides links to related pages, and integrates the style into that of the rest of the site. What you don’t see is a large effort to tidy the code base and style sheets.

So far, I have changed the following: Arabic block, Armenian, Balinese, Bengali, Khmer, IPA, Lao, Mongolian, Myanmar, and Tibetan.

I will convert more as and when I get time.

However, in parallel, I have already made a start on version 15, which is a significant rewrite. Gone are the graphics, to be replaced by characters and webfonts. This makes a huge improvement to the loading time of the page. I’m also hoping to introduce more automated transcription methods, and simpler shape matching approaches.

Some of the pickers I already upgraded to version 14 have mechanisms for transcription and shape-based identification that took a huge effort to create, and will take a substantial effort to upgrade to version 15. So they may stay as they are for a while. However, pickers that are easier to convert, as well as new ones, will move to the new format.

Actually, I already made a start with Gurmukhi v15, which yanks that picker out of the stone age and into the future. There’s also a new picker for the Uighur language that uses v15 technology. I’ll write separate blogs about those.

 

[By the way, if you are viewing the pickers on a mobile device such as an iPad, don't forget to turn Autofocus off (click on 'more controls' to find the switch). This will stop the onscreen keyboard popping up, annoyingly, each time you try to tap on a character.]

See the Tibetan Script Notes

Last March I pulled together some notes about the Tibetan script overall, and detailed notes about Unicode characters used in Tibetan.

I am writing these pages as I explore the Tibetan script as used for the Tibetan language. They may be updated from time to time and should not be considered authoritative. Basically I am mostly simplifying, combining, streamlining and arranging the text from the sources listed at the bottom of the page.

The first half of the script notes page describes how Unicode characters are used to write Tibetan. The second half looks at text layout in Tibetan (eg. line-breaking, justification, emphasis, punctuation, etc.)

The character notes page lists all the characters in the Unicode Tibetan block, and provides specific usage notes for many of them, describing how they are used for writing the Tibetan language.

See the Tibetan Character Notes

Tibetan is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs. Text runs from left to right.

There are various Tibetan scripts, of two basic types: དབུ་ཅན་ dbu-can, pronounced /uchen/ (with a head), and དབུ་མེད་ dbu-med, pronounced /ume/ (headless). This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.

The pronunciation of Tibetan words is typically much simpler than the orthography, which involves patterns of consonants. These reduce ambiguity and can affect pronunciation and tone. In the notes I try to explain how that works, in an approachable way (though it’s still a little complicated, at first).

Traditional Tibetan text was written on pechas (དཔེ་ཆ་ dpe-cha), loose-leaf sheets. Some of the characters used and formatting approaches are different in books and pechas.

For similar notes on other scripts, see my docs list.


The W3C needs to make sure that the typographic needs of scripts and languages around the world are built into technologies such as HTML, CSS, SVG, etc. so that Web pages and eBooks can look and behave as expected for people around the world.

To that end we have experts in various parts of the world documenting typographic requirements and gaps between what is needed and what is currently supported in browsers and ebook readers.

The flagship document is Requirements for Japanese Text Layout. The information in this document has been widely used, and the process used for creating it was extremely effective. It was developed in Japan by a task force using mailing lists and holding meetings in Japanese, then converted to English for review. It was published in both languages.

We now have groups working on Indic Layout Requirements and Requirements for Hangul Text Layout and Typography, and this month I was in Beijing to discuss ongoing work on Chinese layout requirements (URL coming soon), and we heard from experts in Mongolian, Tibetan, and Uyghur who are keen to also participate in the Chinese task force and produce similar documents for their part of the world.

The Internationalization (i18n) Working Group at the W3C has also been working on other aspects of the multilingual user experience, for example improvements for bidirectional text support (Arabic, Hebrew, Thaana, etc) in HTML and CSS, and supporting the work on counter styles in CSS.

To support local relevance of Web pages and eBook formats we need local experts to participate in gathering information in these task forces, to review the task force outputs, and to lobby for, or support through coding, the implementation of features in browsers and ereaders. If you are one of these people, or know some, please get in touch!

We particularly need more information about how to handle typographic features of the Arabic script.

In the hope that it will help, I have put together some information on current areas of activity at the W3C, with pointers to useful existing requirements, specifications and tests. It is not exhaustive, and I expect it to be added to and improved over time.

Look through the list and check whether your needs are being adequately covered. If not, write to www-international@w3.org (you need to subscribe first) and make the case. If the spec does cover your needs, but the browsers don’t support them, raise bugs against the browsers.

It’s disappointing to see that non-standard implementations of UTF-8 are being used by the BBC on their BBC Burmese Facebook page.

Take, for example, the following text.

On the actual BBC site it looks like this (click on the Burmese text to see a list of the characters used):

အိန္ဒိယ မိန်းမငယ် ၂ဦး အမှု ဆေးစစ်ချက် ကွဲလွဲနေ

As far as I can tell, this is conformant use of Unicode codepoints.

Look at the same title on the BBC’s Facebook page, however, and you see:

အိႏၵိယ မိန္းမငယ္ ၂ဦး အမႈ ေဆးစစ္ခ်က္ ကြဲလြဲေန

Depending upon where you are reading this (as long as you have some Burmese font and rendering support), one of the two lines of Burmese text above will contain lots of garbage. For me, it’s the second (non-standard).

This non-standard approach uses visual encoding for combining characters that appear before or on both sides of the base, uses Shan or Rumai Palaung codepoints for subjoining consonants, uses the wrong codepoints for medial consonants, and uses the virama instead of the asat at the end of a word.
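For what it’s worth, here is a very rough sketch of the kind of heuristic that could flag the non-standard encoding, based only on two of the symptoms just mentioned (the vowel sign E stored visually before its base, and a word-final virama where the asat would be expected). It is illustrative rather than a reliable detector, and the function name is my own.

```python
import re

def looks_nonstandard(text):
    """Heuristically flag Burmese text that looks visually encoded."""
    # U+1031 MYANMAR VOWEL SIGN E at the start of the text or straight after
    # whitespace: in standard Unicode it always follows the consonant it
    # combines with, so this suggests visual (display-order) encoding.
    visual_e = re.search(r"(?:^|\s)\u1031", text)
    # U+1039 MYANMAR SIGN VIRAMA at the end of a word: standard Unicode uses
    # U+103A (asat) there, and keeps the virama only between stacked consonants.
    final_virama = re.search(r"\u1039(?=\s|$)", text)
    return bool(visual_e or final_virama)
```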

I assume that this is because of prevalent use of the non-standard approach on mobile devices (and that the BBC is just following that trend), caused by hacks that arose when people were impatient to get on the Web but script support was lagging in applications.

However, continuing this divergence does nobody any long-term good.

[ Find fonts and other resources for the Myanmar script ]

Picture of the page in action.

>> Use UniView

This version updates the app per the changes made during the beta phase of the specification, so that it now reflects the finalised Unicode 7.0.0.

The initial in-app help information displayed for new users was significantly updated, and the help tab now links directly to the help page.

A more significant improvement was the addition of links to character descriptions (on the right) where such details exist. This finally reintegrates the information that was previously pulled in from a database. Links are only provided where additional data actually exists. To see an example, go here and click on See character notes at the bottom right.

Rather than pull the data into the page, the link opens a new window containing the appropriate information. This has advantages for comparing data, but it was also the best solution I could find without using PHP (which is no longer available on the server I use). It also makes it easier to edit the character notes, so the amount of such detail should grow faster. In fact, some additional pages of notes were added along with this upgrade.

A pop-up window containing resource information used to appear when you used the query to show a block. This no longer happens.

Changes in version 7beta

I forgot to announce this version on my blog, so for good measure, here are the (pretty big) changes it introduced.

This version adds the 2,834 new characters encoded in the Unicode 7.0.0 beta, including characters for 23 new scripts. It also simplifies the user interface, and eliminates most of the bugs introduced in the quick port to JavaScript that was the previous version.

Some features that were available in version 6.1.0a are still not available, but they are minor.

Significant changes to the UI include the removal of the ‘popout’ box, and the merging of the search input box with that of the other features listed under Find.

In addition, the buttons that used to appear when you select a Unicode block have changed. Now the block name appears near the top right of the page with an information icon. Clicking on the icon takes you to a page listing resources for that block, rather than listing the resources in the lower right part of UniView’s interface.

UniView no longer uses a database to display additional notes about characters. Instead, the information is being added to HTML files.

Korean justification

The editors of the CSS3 Text module specification are currently trying to get more information about how to handle Korean justification, particularly when hanja (Chinese ideographic characters) are mixed with Korean hangul.

Should the hanja be stretched with inter-character spacing at the same time as the Korean inter-word spaces are stretched? If only a single word fits on a line, should it be stretched using inter-character spacing? Are there more sophisticated rules involving choices of justification location, as there are in Japanese, where you adjust punctuation first and do inter-character spacing later? What if the whole text is hanja, such as in an ancient document?

If you are able to provide information, take a look at what’s in the CSS3 Text module and follow the thread on public-i18n-cjk@w3.org (subscribe).

Factoids listed at the start of the EURid/UNESCO World Report on IDN Deployment 2013

  • 5.1 million IDN domain names
  • Only 2% of the world’s domain names are in non-Latin script
  • The 5 most popular browsers have strong support for IDNs in their latest versions
  • Poor support for IDNs in mobile devices
  • 92% of the world’s most popular websites do not recognise IDNs as URLs in links
  • 0% of the world’s most popular websites allow IDN email addresses as user accounts
  • 99% correlation between IDN scripts and language of websites (Han, Hangul, Hiragana, Katakana)

About two weeks ago I attended the part of a 3-day Asia Pacific Top Level Domain Association (APTLD) meeting in Oman related to ‘Universal Acceptance’ of Internationalized Domain Names (IDNs), ie. domain names using non-ASCII characters. This refers to the fact that, although IDNs work reasonably well in the browser context, they are problematic when people try to use them in the wider world for things such as email and social media ids, etc. The meeting was facilitated by Don Hollander, GM of APTLD.

Here’s a summary of information from the presentations and discussions.

(By the way, Don Hollander and Dennis Tan Tanaka, Verisign, each gave talks about this during the MultilingualWeb workshop in Madrid the week before. You can find links to their slides from the event program.)

Basic proposition

Internationalized Domain Names (IDNs) provide much improved accessibility to the web for local communities using non-Latin scripts, and are expected to particularly smooth entry for the 3 billion people not yet web-enabled. For example, in advertising (such as on the side of a bus) they are easier and much faster to recognise and remember, and they are also easier to note down and type into a browser.

The biggest collection of IDNs is under .com and .net, but there are new Brand TLDs emerging as well as IDN country codes. On the Web there is a near-perfect correlation between use of IDNs and the language of a web site.

The problems tend to arise where IDNs are used across cultural/script boundaries. These cross-cultural boundaries are encountered not just by users but by implementers/companies that create tools, such as email clients, that are deployed across multilingual regions.

It seems to be accepted that there is a case for IDNs, and that they already work pretty well in the context of the browser, but problems in widespread usage of internationalized domain names beyond the browser are delaying demand, and this apparently slow demand doesn’t convince implementers to make changes – it’s a chicken and egg situation.

The main question asked at the meeting was how to break the vicious cycle. The general opinion seemed to lean towards getting major players like Google, Microsoft and Apple to provide end-to-end support for IDNs throughout their product range, to encourage adoption by others.

Problems

Domain names are used beyond the browser context. Problem areas include:

  • email
    • email clients generally don’t support use of non-ASCII email addresses
    • standards don’t address the username part of email addresses as well as they do the domain name
    • there’s an issue to do with SMTPUTF8 not being visible in all the right places
    • you can’t be sure that your email will get through; it may be dropped on the floor even if only one cc’d address is an IDN
  • applications that accept email IDs or IDNs
    • even Russian PayPal IDs fail for the .рф domain
    • things to be considered include:
      • plain text detection: you currently need http or www at the start in Google Docs to detect that something is a domain name
      • input validation: no central validation repository of TLDs
      • rendering: what if the user doesn’t have a font?
      • storage & normalization: IDs that exist as either IDN or Punycode forms are not unique IDs
      • security and spam controls: Google won’t launch a solution without resolving phishing issues; some spam filters or anti-virus scanners think IDNs are dangerous abnormalities
      • other integrations: add contact, create mail and send mail all show different views of an IDN email address
  • search: how do you search for IDNs in contacts list?
    • search in general already works pretty well on Google
    • I wasn’t clear about how equivalent IDN and Latin domain names will be treated
  • mobile devices: surprisingly for the APTLD folks, it’s harder to find the needed fonts and input mechanisms to allow typing IDNs in mobile devices
  • consistent rendering:
    • some browsers display as punycode in some circumstances – not very user friendly
    • there are typically differences between full and hybrid (ie. partial) internationalized domain names
    • IDNs typed in Twitter are sent as punycode (mouse over the link in the tweet on a Twitter page)

Initiatives

Google are working on enabling IDNs throughout their application space, including Gmail but also many other applications – they pulled back from fixing many small, unconnected bugs in order to develop a company-wide strategy and roll out fixes across all engineering teams. The Microsoft speaker echoed the same concerns and approaches.

In my talk, I expressed the hope that Google and Microsoft and others would collaborate to develop synergies and standards wherever feasible. Microsoft also called for a standard approach, rather than in-house proprietary solutions, to ensure interoperability.

However, progress is slow because changes need to be made in so many places, not just the email client.

Google expects to have some support for international email addresses this summer. You won’t be able to sign up for Arabic/Chinese/etc email addresses yet, but you will be able to use Gmail to communicate with users on other providers who have internationalized addresses. Full implementation will take a little longer because there’s no real way to test things without raising inappropriate user expectations if the system is live.

SaudiNIC has been running Arabic emails for some time, but it’s a home-grown and closed system – they created their own protocols, because there were no IETF protocols at the time – the addresses are actually converted to punycode for transmission, but displayed as Arabic to the user (http://nic.sa).

Google uses system information about language preferences of the user to determine whether or not to display the IDN rather than punycode in Chrome’s address bar, but this could cause problems for people using a shared computer, for example in an internet café, a conference laptop etc. They are still worrying about users’ reactions if they can’t read/display an email address in non-ASCII script. For email, currently they’re leaning towards just always showing the Unicode version, with the caveat that they will take a hard line on mixed script (other than something mixed with ASCII) where they may just reject the mail.

A trend to note is a growing number of redirects from IDN to ASCII, eg. http://правительство.рф page shows http://government.ru in the address bar when you reach the site.
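As an aside on the earlier point about storage and normalization, the Unicode and ASCII (Punycode) forms of a name such as правительство.рф need to be treated as the same identifier. Here is a minimal sketch of converting between the two, using Python’s built-in IDNA codec (which implements the older IDNA 2003 rules; the separate idna package covers IDNA 2008):

```python
name = "правительство.рф"

# Convert to the ASCII-compatible (xn--...) form that is used on the wire,
# and back again. The two forms identify the same domain and should be
# normalized before being compared or stored.
ascii_form = name.encode("idna").decode("ascii")
unicode_form = ascii_form.encode("ascii").decode("idna")

print(ascii_form)    # the xn-- form
print(unicode_form)  # back to the Unicode form
```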

Other observations

All the Arabic email addresses I saw were shown fully right to left, ie. <tld><domain>@<username>. I wonder whether this may dislodge some of the hesitation in the IETF about the direction in which web addresses should be displayed – perhaps they should therefore also flow right-to-left?? (especially if people write domain names without http://, which these guys seem to think they will).

Many of the people in the room wanted to dispense with the http:// for display of web addresses, to eliminate the ASCII altogether, and also to get rid of www. The problem is how to identify the string as a domain name – is the dot sufficient? We saw some examples of this, but they had something like “see this link” alongside.

By the way, Google is exploring the idea of showing the user, by default, only the domain name of a URL in future versions of the Chrome browser address bar. A Google employee at the workshop said “I think URLs are going away as far as something to be displayed to users – the only thing that matters is the domain name … users don’t understand the rest of the URL”. I personally don’t agree with this.

One participant proposed that government mandates could be very helpful in encouraging adaptation of technologies to support international domain names.

My comments

I gave a talk and was on a panel. Basically my message was:

Most of the technical developments for IDN and IRIs were developed at the IETF and the Unicode Consortium, but with significant support by people involved in the W3C Internationalization Working Group. Although the W3C hasn’t been leading this work, it is interested in understanding the issues and providing support where appropriate. We are, however, also interested in wider issues surrounding the full path name of the URL (not just the domain name), 3rd level domain labels, frag ids, IRI vs punycode for domain name escaping, etc. We also view domain names as general resource identifiers (eg. for use in linked data), not just for a web presence and marketing.

I passed on a message that groups such as the Wikimedia folks I met with in Madrid the week before are developing a very wide range of fonts and input mechanisms that may help users input non-Latin IDs on terminals, mobile devices and such like, especially when travelling abroad. It’s something to look into. (For more information about Wikimedia’s jQuery extensions, see here and here.)

I mentioned bidi issues related to both the overall direction of Arabic/Hebrew/etc URLs/domain names, and the more difficult question of how to handle mixed-direction text, which can make the logical http://www.oman/muscat render to the user as http://www.muscat/oman when ‘muscat’ and ‘oman’ are in Arabic, due to the default properties of the Unicode bidi algorithm. Community guidance would be a help in resolving this issue.

I said that the W3C is all about getting people together to find interoperable solutions via consensus, and that we could help with networking to bring the right people together. I’m not proposing that we should take on ownership of the general problem of Universal Acceptance, but I did suggest that if they can develop specific objectives for a given aspect of the problem, and identify a natural community of stakeholders for that issue, then they could use our Community Groups to give some structure to and facilitate discussions.

I also suggested that we all engage in grass-roots lobbying, requesting that service/tool providers allow us to use IDNs.

Conclusions

At the end of the first day, Don Hollander summed up what he had gathered from the presentations and discussions as follows:

People want IDNs to work, they are out there, and they are not going away. Things don’t appear quite so dire as he had previously thought, given that browser support is generally good, closed email communities are developing, and search and indexing works reasonably well. Also Google and Microsoft are working on it, albeit perhaps slower than people would like (but that’s because of the complexity involved). There are, however, still issues.

The question is how to go forward from here. He asked whether APTLD should coordinate all communities at a high level with a global alliance. After comments from panelists and participants, he concluded that APTLD should hold regular meetings to assess and monitor the situation, but should focus on advocacy. The objective would be to raise visibility of the issues and solutions. “The greatest contribution from Google and Microsoft may be to raise the awareness of their thousands of geeks.” ICANN offered to play a facilitation role and to generate more publicity.

One participant warned that we need a platform for forward motion, rather than just endless talking; I made the same point in my panel contributions. I was a little disappointed (though not particularly surprised) that APTLD didn’t try to grasp the nettle and set up subcommittees to bring players together to take practical steps towards interoperable solutions, but hopefully the advocacy will help move things forward, and developments by companies such as Google and Microsoft will help start a ball rolling that will eventually break the deadlock.

I’ve been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.

The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It’s very high level, and basically just draws out some of the uncertainties that seem to surround the topic.

Let’s suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.

Arabic justification #1

Unjustified Arabic text

Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).

To keep it simple, let’s just focus on the top two lines.

One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.

Arabic justification #2

Justification using tatweels

The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.

Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:

  • extensions can only appear between certain characters, and are forbidden around other characters
  • the number of allowable extensions per word and per line is usually kept to a minimum
  • words vary in appropriateness for extension, depending on word length
  • there are rules about where in the line extensions can appear – usually not at the beginning
  • different font styles have different rules

An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.

A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.
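To make the fragility concrete, here is a deliberately naive sketch of tatweel-based padding. It ignores every one of the placement rules listed above and, crucially, it counts characters rather than measuring rendered width, which is exactly why hard-coded tatweels fall apart when the font or text size changes (as the next picture shows). The function name and approach are mine, purely for illustration.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the baseline extension character

def pad_line(line, target_len):
    """Naively pad a line to target_len characters by inserting tatweels
    before the last letter of the longest word."""
    if len(line) >= target_len:
        return line
    words = line.split(" ")
    longest = max(range(len(words)), key=lambda i: len(words[i]))
    word = words[longest]
    if len(word) < 2:
        return line
    needed = target_len - len(line)
    words[longest] = word[:-1] + TATWEEL * needed + word[-1]
    return " ".join(words)
```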

In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word too long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.

Arabic justification #3

Tatweels in the wrong place due to just a font change

To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)

Arabic justification #4: Nastaliq

Same text in the Nastaliq font style

And fonts in the Ruqah style never use elongation at all. (We’ll come back to how you justify text using Ruqah-style fonts in a moment.)

Arabic justification #5: Ruqah

Same text in the Ruqah font style

In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We’ll again focus on the first two lines, and consider how to justify them.

Arabic justification #6: Naskh

Same text in the Naskh font style

High end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.

Arabic justification #7: kashida elongation

Justification using letter elongation (kashida)

In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:

  1. the rules for applying the right-sized elongations to the right characters have to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc, the location and size of elongations need to be reconfigured
  2. there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.

The latter is the fundamental issue we face. There is very little high-quality information available about how to do this, and a lack of consensus not only about what the rules are, but about how justification should be done.

Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.

Arabic justification #8: space based

Justification using inter-word spacing

The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don’t currently indicate this information.

Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I’ll show some examples.

By the way, for all the complexity described so far, this is still quite a simplistic overview of what’s involved in Arabic justification. For example, high end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.

The key messages:

  1. We need an Arabic Layout Requirements document to capture the script needs.
  2. Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
  3. To start all this, we need experts to provide information and develop consensus.

Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!

Following up on a suggestion by Nathan Hill of SOAS, I added a la-swe glyph to the default view of the picker alongside the medial consonants. If you click on it, it produces U+1039 MYANMAR SIGN VIRAMA + U+101C MYANMAR LETTER LA.

I also rearranged the font pull-down list a little, adding information about which fonts are available on your Mac OS X or Windows 7 system, and added a placeholder, like I did recently for the Khmer picker.

You can find the Myanmar picker at http://rishida.net/scripts/pickers/myanmar/

Picture of the page in action.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script.

I’ve updated the page to show the fonts added in Windows 8. This is the list:

  • Aldhabi (Urdu Nastaliq)
  • Urdu Typesetting (Urdu Nastaliq)
  • Gadugi (Cherokee/Unified Canadian Aboriginal Syllabics)
  • Myanmar Text (Myanmar)
  • Nirmala UI (10 Indic scripts)

There were also two additional UI fonts for Chinese, Jhenghei UI (Traditional) and Yahei UI (Simplified), which I haven’t listed. Also Microsoft Uighur acquired a bold font.

>> See the list

See the blog post for the first version or the page for more information.

Update, 25 Jan 2013

Patrick Andries pointed out that Tifinagh using the Windows Ebrima font was missing from the list. Not any more.

Following up on a very good suggestion by Roger Sperberg, I added two webfonts to the Khmer picker and arranged the font selection list so that you can see which fonts are available on your Mac OS X or Windows 7 system.

The webfonts make it possible to use the picker on an iPad or other device that doesn’t have a Khmer font installed. I added two webfonts because one worked on my iPad and the other didn’t, and it was vice versa on my Snow Leopard Macbook.

I also added an HTML5 placeholder for the output box. (I’m wishing you could style that differently from the standard content – and wishing that markup designers would think about this sort of thing and stop using attributes for natural language text…).

You can find the Khmer picker at http://rishida.net/scripts/pickers/khmer/

Picture of the page in action.

>> Use UniView

The main addition in this version is a couple of buttons that appear when you ask UniView to display a block.

Clicking on Show annotated list generates a list of all characters in the block, with annotations.

Clicking on Show script links displays a list of links to key sources of information about the script of the block, links to relevant articles and apps on the rishida.net site, and related fonts and input methods. This provides a very quick way of finding this information. One particularly useful link (‘Historical documentation’, which links to a Scriptsource.org page) allows you to find the proposals for all additions to Unicode related to the relevant script. These proposals are a mine of useful information about the individual characters in a block, and SIL staff should get a medal for trawling through all the relevant data to provide this.

In addition, there were some changes to the user interface, including the following:

  • The order of information in the lower right panel (detailed character information) was slightly changed, and two alternative representations of the character were added: an HTML escape and a URI escape. (A small sketch of both appears after this list.)
  • The search box at the top left was constrained to appear closer to the other controls when the window is stretched wide.
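For reference, here’s a small sketch (in Python, rather than the JavaScript UniView actually uses) of what those two added representations look like for a given character; the helper names are mine.

```python
from urllib.parse import quote

def html_escape(ch):
    """Hexadecimal numeric character reference, e.g. &#xE01;"""
    return f"&#x{ord(ch):X};"

def uri_escape(ch):
    """Percent-encoded UTF-8, e.g. %E0%B8%81"""
    return quote(ch, safe="")

print(html_escape("ก"), uri_escape("ก"))  # &#xE01; %E0%B8%81
```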

Various bugs were also fixed.