Dochula Pass, Bhutan

initial-letter-tibetan-01

The CSS WG needs advice on initial letter styling in non-Latin scripts, ie. enlarged letters or syllables at the start of a paragraph like those shown in the picture. Most of the current content of the recently published Working Draft, CSS Inline Layout Module Level 3 is about styling of initial letters, but the editors need to ensure that they have covered the needs of users of non-Latin scripts.

The spec currently describes drop, sunken and raised initial characters, and allows you to manipulate them using the initial-letter and the initial-letter-align properties. You can apply those properties to text selected by ::first-letter, or to the first child of a block (such as a span).

The editors are looking for

any examples of drop initials in non-western scripts, especially Arabic and Indic scripts.

I have scanned some examples from newspapers (so, not high quality print).

In the section about initial-letter-align the spec says:

Input from those knowledgeable about non-Western typographic traditions would be very helpful in describing the appropriate alignments. More values may be required for this property.

Do you have detailed information about initial letter styling in a non-Latin script that you can contribute? If so, please write to www-style@w3.org (how to subscribe).

I’m struggling to show combining characters on a page in a consistent way across browsers.

For example, while laying out my pickers, I want users to be able to click on a representation of a character to add it to the output field. In the past I resorted to pictures of the characters, but now that webfonts are available, I want to replace those with font glyphs. (That makes for much smaller and more flexible pages.)

Take the Bengali picker that I’m currently working on. I’d like to end up with something like this:

comchacon0

I put a no-break space before each combining character, to give it some width, and because that’s what the Unicode Standard recommends (p60, Exhibiting Nonspacing Marks in Isolation). The result is close to what I was looking for in Chrome and Safari except that you can see a gap for the nbsp to the left.

comchacon1

But in IE and Firefox I get this:

comchacon2

This is especially problematic since it messes up the overall layout, but in some cases it also causes text to overlap.

I tried using a dotted circle Unicode character, instead of the no-break space. On Firefox this looked ok, but on Chrome it resulted in two dotted circles per combining character.

I considered using a consonant as the base character. It would work ok, but it would possibly widen the overall space needed (not ideal) and would make it harder to spot a combining character by shape. I tried putting a span around the base character to grey it out, but the various browsers reacted differently to the span. Vowel signs that appear on both sides of the base character no longer worked – the vowel sign appeared after. In other cases, the grey of the base character was inherited by the whole grapheme, regardless of the fact that the combining character was outside the span. (Here are some examples ে and ো.)

In the end, I settled for no preceding base character at all. The combining character was the first thing in the table cell or span that surrounded it. This gave the desired result for the font I had been using, albeit that I needed to tweak the occasional character with padding to move it slightly to the right.

On the other hand, this was not to be a complete solution either. Whereas most of the fonts I planned to use produce the dotted circle in these conditions, one of my favourites (SolaimanLipi) doesn’t produce it. This leads to significant problems, since many combining characters appear far to the left, and in some cases it is not possible to click on them, in others you have to locate a blank space somewhere to the right and click on that. Not at all satisfactory.

comchacon3

I couldn’t find a better way to solve the problem, however, and since there were several Bengali fonts to choose from that did produce dotted circles, I settled for that as the best of a bad lot.

However, then i turned my attention to other pickers and tried the same solution. I found that only one of the many Thai fonts I tried for the Thai picker produced the dotted circles. So the approach here would have to be different. For Khmer, the main Windows font (Daunpenh) produced dotted circles only for some of the combining characters in Internet Explorer. And on Chrome, a sequence of two combining characters, one after the other, produced two dotted circles…

I suspect that I’ll need to choose an approach for each picker based on what fonts are available, and perhaps provide an option to insert or remove base characters before combining characters when someone wants to use a different font.

It would be nice to standardise behaviour here, and to do so in a way that involves the no-break space, as described in the Unicode Standard, or some other base character such as – why not? – the dotted circle itself. I assume that the fix for this would have to be handled by the browser, since there are already many font cats out of the bag.

Does anyone have an alternate solution? I thought I heard someone at the last Unicode conference mention some way of controlling the behaviour of dotted circles via some script or font setting…?

Update: See Marc Durdin’s blog for more on this topic, and his experiences while trying to design on-screen keyboards for Lao and other scripts.

Screen shot 2014-09-26 at 16.36.47

The W3C needs to make sure that the typographic needs of scripts and languages around the world are built in to technologies such as HTML, CSS, SVG, etc. so that Web pages and eBooks can look and behave as expected for people around the world.

To that end we have experts in various parts of the world documenting typographic requirements and gaps between what is needed and what is currently supported in browsers and ebook readers.

The flagship document is Requirements for Japanese Text Layout. The information in this document has been widely used, and the process used for creating it was extremely effective. It was developed in Japan, by a task force using mailing lists and holding meetings in japanese, then converted to english for review. It was published in both languages.

We now have groups working on Indic Layout Requirements and Requirements for Hangul Text Layout and Typography, and this month I was in Beijing to discuss ongoing work on Chinese layout requirements (URL coming soon), and we heard from experts in Mongolian, Tibetan, and Uyghur who are keen to also participate in the Chinese task force and produce similar documents for their part of the world.

The Internationalization (i18n) Working Group at the W3C has also been working on other aspects of the mutlilingual user experience. For example, improvements for bidirectional text support (Arabic, Hebrew, Thaana, etc) for HTML and CSS, and supporting the work on counter styles at CSS.

To support local relevance of Web pages and eBook formats we need local experts to participate in gathering information in these task forces, to review the task force outputs, and to lobby or support via coding the implementation of features in browsers and ereaders. If you are one of these people, or know some, please get in touch!

We particularly need more information about how to handle typographic features of the Arabic script.

In the hope that it will help, I have put together some information on current areas of activity at the W3C, with pointers to useful existing requirements, specifications and tests. It is not exhaustive, and I expect it to be added to and improved over time.

Look through the list and check whether your needs are being adequately covered. If not, write to www-international@w3.org (you need to subscribe first) and make the case. If the spec does cover your needs, but the browsers don’t support your needs, raise bugs against the browsers.

Korean justification

The editors of the CSS3 Text module specification are currently trying to get more information about how to handle Korean justification, particularly when hanja (chinese ideographic characters) are mixed with Korean hangul.

Should the hanja be stretched with inter-character spacing at the same time as the Korean inter-word spaces are stretched? What if only a single word fits on a line, should it be stretched using inter-character spacing? Are there more sophisticated rules involving choices of justification location, as there are in Japanese where you adjust punctuation first and do inter-character spacing later? What if the whole text is hanja, such as for an ancient document?

If you are able to provide information, take a look at what’s in the CSS3 Text module and follow the thread on public-i18n-cjk@w3.org (subscribe).

I’ve been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.

The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It’s very high level, and basically just draws out some of the uncertainties that seem to surround the topic.

Let’s suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.

Arabic justification #1

Unjustified Arabic text

Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).

To keep it simple, lets just focus on the top two lines.

One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.

Arabic justification #2

Justification using tatweels

The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.

Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:

  • extensions can only appear between certain characters, and are forbidden around other characters
  • the number of allowable extensions per word and per line is usually kept to a minimum
  • words vary in appropriateness for extension, depending on word length
  • there are rules about where in the line extensions can appear – usually not at the beginning
  • different font styles have different rules

An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.

A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.

In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word to long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.

Arabic justification #3

Tatweels in the wrong place due to just a font change

To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)

Arabic justification #4: Nastaliq
Same text in the Nastaliq font style

And fonts in the Ruqah style never use elongation at all. (We’ll come back to how you justify text using Ruqah-style fonts in a moment.)

Arabic justification #5: Ruqah
Same text in the Ruqah font style

In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We’ll again focus on the first two lines, and consider how to justify them.

Arabic justification #6: Naskh
Same text in the Naskh font style

High end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.

Arabic justification #7: kashida elongation
Justification using letter elongation (kashida)

In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:

  1. the rules for applying the right-sized elongations to the right characters has to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc, the location and size of elongations needs to be reconfigured
  2. there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.

The latter is the fundamental issue we face. There is very little, high-quality information available about how to do this, and a lack of consensus about, not only what the rules are, but how justification should be done.

Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.

Arabic justification #8: space based

Justification using inter-word spacing

The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don’t currently indicate this information.

Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I’ll show some examples.

By the way, for all the complexity so far described this is all quite a simplistic overview of what’s involved in Arabic justification. For example, high end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.

The key messages:

  1. We need an Arabic Layout Requirements document to capture the script needs.
  2. Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
  3. To start all this, we need experts to provide information and develop consensus.

Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!

When it comes to wrapping text at the end of a line in a web page, there are some special rules that should be applied if you know the language of the text is either Chinese or Japanese (ie. if the markup contains a lang attribute to that effect).

The CSS3 Text module attempts to describe these rules, and we have some tests to check what browsers currently do for Japanese and Chinese.

There’s an open question in the editor’s draft about whether Korean has any special behaviours that need to be documented in the spec, when the markup uses lang to identify the content as Korean.

If you want to provide information, take a look at what’s in the CSS3 Text module and write to www-international@w3.org and copy public-i18n-cjk@w3.org.

If you put a span tag around one or two letters in an Arabic word, say to change the colour, it breaks the cursiveness in WebKit and Blink browsers. You can change things like colour in Mozilla and IE, but changing the font breaks the connection.

Breaking on colour change makes it hard to represent educational texts and things such as the Omantel logo, which I saw all over Muscat recently. (Omantel is the largest internet provider in Oman.) Note how, despite the colour change, the Arabic letters in the logo below (on the left) still join.

Picture of the Omantel logo.
Multi-coloured Omantel Arabic logo on a building in Muscat.

Here’s an example of an educational page that colours parts of words. You currently have to use Firefox or IE to get the desired effect.

This lead to questions about what to do if you convert block elements, such as li, into inline elements that sit side by side. You probably don’t want the character at the end of one li tag to join with the next one. What if there is padding or margins between them, should this cause bidi isolation as well as preventing joining behaviour?

See a related thread on the W3C Internationalization and CSS lists.