The CSS WG needs advice on initial letter styling in non-Latin scripts, ie. enlarged letters or syllables at the start of a paragraph like those shown in the picture. Most of the current content of the recently published Working Draft, CSS Inline Layout Module Level 3 is about styling of initial letters, but the editors need to ensure that they have covered the needs of users of non-Latin scripts.
The spec currently describes drop, sunken and raised initial characters, and allows you to manipulate them using the initial-letter and the initial-letter-align properties. You can apply those properties to text selected by ::first-letter, or to the first child of a block (such as a span).
The editors are looking for
any examples of drop initials in non-western scripts, especially Arabic and Indic scripts.
The W3C needs to make sure that the typographic needs of scripts and languages around the world are built in to technologies such as HTML, CSS, SVG, etc. so that Web pages and eBooks can look and behave as expected for people around the world.
To that end we have experts in various parts of the world documenting typographic requirements and gaps between what is needed and what is currently supported in browsers and ebook readers.
The flagship document is Requirements for Japanese Text Layout. The information in this document has been widely used, and the process used for creating it was extremely effective. It was developed in Japan, by a task force using mailing lists and holding meetings in japanese, then converted to english for review. It was published in both languages.
We now have groups working on Indic Layout Requirements and Requirements for Hangul Text Layout and Typography, and this month I was in Beijing to discuss ongoing work on Chinese layout requirements (URL coming soon), and we heard from experts in Mongolian, Tibetan, and Uyghur who are keen to also participate in the Chinese task force and produce similar documents for their part of the world.
The Internationalization (i18n) Working Group at the W3C has also been working on other aspects of the mutlilingual user experience. For example, improvements for bidirectional text support (Arabic, Hebrew, Thaana, etc) for HTML and CSS, and supporting the work on counter styles at CSS.
To support local relevance of Web pages and eBook formats we need local experts to participate in gathering information in these task forces, to review the task force outputs, and to lobby or support via coding the implementation of features in browsers and ereaders. If you are one of these people, or know some, please get in touch!
We particularly need more information about how to handle typographic features of the Arabic script.
Look through the list and check whether your needs are being adequately covered. If not, write to email@example.com (you need to subscribe first) and make the case. If the spec does cover your needs, but the browsers don’t support your needs, raise bugs against the browsers.
I’ve been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.
The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It’s very high level, and basically just draws out some of the uncertainties that seem to surround the topic.
Let’s suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.
Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).
To keep it simple, lets just focus on the top two lines.
One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.
The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.
Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:
extensions can only appear between certain characters, and are forbidden around other characters
the number of allowable extensions per word and per line is usually kept to a minimum
words vary in appropriateness for extension, depending on word length
there are rules about where in the line extensions can appear – usually not at the beginning
different font styles have different rules
An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.
A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.
In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word to long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.
To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)
And fonts in the Ruqah style never use elongation at all. (We’ll come back to how you justify text using Ruqah-style fonts in a moment.)
In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We’ll again focus on the first two lines, and consider how to justify them.
High end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.
In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:
the rules for applying the right-sized elongations to the right characters has to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc, the location and size of elongations needs to be reconfigured
there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.
The latter is the fundamental issue we face. There is very little, high-quality information available about how to do this, and a lack of consensus about, not only what the rules are, but how justification should be done.
Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.
The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don’t currently indicate this information.
Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I’ll show some examples.
By the way, for all the complexity so far described this is all quite a simplistic overview of what’s involved in Arabic justification. For example, high end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.
The key messages:
We need an Arabic Layout Requirements document to capture the script needs.
Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
To start all this, we need experts to provide information and develop consensus.
Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!
In the phrase “Zusätzlich erleichtert PLS die Eingrenzung von Anwendungen, indem es Aussprachebelange von anderen Teilen der Anwendung abtrennt.” (“In addition, PLS facilitates the localization of applications by separating pronunciation concerns from other parts of the application.”) there are many long words. To fit these in narrow columns (coming soon to the Web via CSS) or on mobile devices, it would help to automatically hyphenate them.
Other major browsers already supported soft-hyphens when Firefox 5 implemented FF support. Soft hyphens provide a manual workaround for breaking long words, but more recently browsers such as Firefox, Safari and Chrome have begun to support the CSS3 hyphens property, with hyphenation dictionaries for a range of languages, to support (or disable, if needed) automatic hyphenation. (Note, however, that Aussprachebelange is incorrectly hyphenated in the example from Safari 5.1 on Lion OS shown above. It is hyphenated as Aussprac- hebelange. Some refinement is clearly still needed at this stage.)
For hyphenation to work correctly, the text must be marked up with language information, using the language tags described earlier. This is because hyphenation rules vary by language, not by script. The description of the hyphens property in CSS says “Correct automatic hyphenation requires a hyphenation resource appropriate to the language of the text being broken. The UA is therefore only required to automatically hyphenate text for which the author has declared a language (e.g. via HTML lang or XML xml:lang) and for which it has an appropriate hyphenation resource.”
This post is a place for me to dump a few URIs related to this topic, so that i can find them again later.
I see from a recent bugzilla report and some cursory testing that a (very) long-standing bug in Mozilla related to complex scripts has now been fixed.
Complex scripts include many non-Latin scripts that use combining characters or ligatures, or that apply shaping to adjacent characters like Arabic script.
It used to be that, when you highlighted text in a complex script, as you extended the edges of the highlighted area you would break apart combining characters from their base character, split ligatures and disrupt the joining behaviour of Arabic script characters.
The good news is that this no longer happens – it was fixed by the new text frame code. The bad news is that the highlighting still happens character by character, rather than at grapheme boundaries – which can make it tricky to know whether you got the combining characters or not.
UPDATE: I hear from Kevin Brosnan that the following will be fixed in Firefox 3. Hurrah! And thank you Mozilla team.
What doesn’t appear to be fixed is the behaviour of asian scripts when the CSS text-align:justify is applied. 🙁
I raised a bug report about this. I was amazed, after hearing about this from Indians and Pakistanis too, that there didn’t seem to be a bug report already. Come on users, don’t leave this up to the W3C!
Basically, the issue is that if you apply text-align: justify to some text in an Indian or Tibetan script the combining characters all get rendered alongside their base characters, ie. you go from this (showing, respectively, tibetan, devanagari (hindi and nepali), punjabi, telegu and thai text):
Strangely the effect doesn’t seem to apply to the Thai text, nor to other text with combining characters that I’ve tried.
That’s a pretty big bug for people in the affected region because it effectively means that text-align:justify can’t be used.
Typically ruby is used in East Asian scripts to provide phonetic transcriptions of obscure characters, or characters that the reader is not expected to be familiar with. For example it is widely used in education materials and children’s texts. It is also occasionally used to convey information about the meaning of ideographic characters. For more information see Ruby Markup and Styling.
Ruby markup (called 振り仮名 [furigana] in Japan) is described by the W3C’s Ruby Annotation spec. It comes in two flavours, simple and complex.
Ruby markup is a part of XHTML 1.1 (served as XML), but native support is not widely available. IE doesn’t support XHTML 1.1, but it does support simple ruby markup in HTML and XHTML 1.0. This extension provides support in Firefox for both simple and complex ruby, in HTML, XHTML 1.0 and XHTML 1.1.
It passes all the I18n Activity ruby tests, with the exception of one *very* minor nit related to spacing of complex ruby annotation.
Samphan Raruenrom has produced a Firefox extension based on ICU to handle Thai line breaking.
Thai line breaks respect word boundaries, but there are no spaces between words in written Thai. Spaces are used instead as phrase separators (like English comma and full stop). This means that dictionary-based lookup is needed to properly wrap Thai text.
The current release works on Windows and the current Firefox release, 126.96.36.199. The next release will also support Linux and will support future Mozilla Firefox/Thunderbird releases.
Chris was arguing that using CSS, rather than Unicode characters, to render these marks could be useful because:
the mark applies to, and is centred below a whole ‘syllable’ – not just the stack of the syllable – this may be easier to achieve with styling than font positioning where, say, a syllable has an even number of head characters (see examples to the far right in the picture)
it would make it easier to search for text if these characters were not interspersed in it
it would allow for flexibility in approaches to the visual style used for emphasis – you would be able to change between using these marks or alternatives such as use of red colour or changes in font size just by changing the CSS style sheet (as we can for English text).
There are of potential issues with this approach too. These include things like the fact that the horizontal centring of glyphs within the syllable is not trivial. The vertical placement is also particularly difficult. You will notice from the attached image that the height depends on the depth of the text it falls below. On the other hand, it isn’t easy to achieve this with diacritics either, given the number of possible permutations of characters in a syllable. Such positioning is much more complicated than that of the Japanese wakiten.
A bigger issue may turn out to be that the application for this is fairly limited, and user agent developers have other priorities – at least for commercial applications.