
This came up again recently in a discussion on the W3C i18n Interest Group list, and I decided to put my thoughts in this post so that I can point people to them easily.

I think HTML4 and HTML5 should continue to support <b> and <i> tags, for backwards compatibility, but we should urge caution regarding their use and strongly encourage people to use <em> and <strong> or elements with class="…" where appropriate. (I reworded this 2008-02-01)

Here are a couple of reasons I say that:

  1. I constantly see people misusing these tags in ways that can make localization of content difficult.

    For example, just because an English document may use italicisation for emphasis, document titles and foreign words, it doesn’t follow that a Japanese translation of the document will use a single presentational convention for all three. Japanese authors may avoid both italicisation and bolding, since their characters are too complicated to look good in small sizes with these effects. Japanese translators may find that the content communicates better if they use wakiten (boten marks) for emphasis, but corner brackets for 『 document names 』 and guillemets for 《 foreign words 》. These are common Japanese typographic approaches that we don’t use in English.

    The problem is that, if the English author has used <i> tags everywhere (thinking about the presentational rendering he/she wants in English), the Japanese localizer will be unable to easily apply different styling to the different types of text.

    The problem could be avoided if semantic markup were used. If the English author had used <em>…</em>, <span class="doctitle">…</span> and <span class="foreignword">…</span> to distinguish the three cases, the localizer could easily change the CSS to achieve different effects for these items, one at a time (see the sketch after this list).

    Of course, over time this is equally relevant to pages that are monolingual. Suppose your new corporate publishing guidelines change, and proclaim that bolding is better than italics for document names. With semantically marked up HTML, you can easily change a whole site with one tiny edit to the CSS. In the situation described above, however, you’d have to hunt through every page for relevant <i> tags and change them individually, so that you didn’t apply the same style change to emphasis and foreign words too.

  2. Allowing authors to use <b> and <i> tags is also problematic, in my mind, because it keeps authors thinking in presentational terms, rather than helping them move to properly semantic markup. At the very least, it blurs the ideas. To an author in a hurry, it is also tempting to just slap one of these tags on the text to make it look different, rather than to stop and think about things like consistency and future-proofing. (Yes, I’ve often done it too…)
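
To make the contrast in point 1 concrete, here is a minimal sketch. The sentence and the CSS rules are made up for illustration; only the class names doctitle and foreignword come from the example above.

<!-- presentational markup: the three kinds of text are indistinguishable -->
<p>As <i>The Style Manual</i> notes, the word <i>wabi-sabi</i> is <i>not</i> easy to translate.</p>

<!-- semantic markup: each kind of text can be restyled independently -->
<p>As <span class="doctitle">The Style Manual</span> notes, the word
<span class="foreignword">wabi-sabi</span> is <em>not</em> easy to translate.</p>

/* English stylesheet: all three happen to look the same */
.doctitle, .foreignword, em { font-style: italic; }

/* a Japanese localizer could override a couple of rules, without touching the markup */
.doctitle     { font-style: normal; }
.doctitle:before    { content: "『"; }
.doctitle:after     { content: "』"; }
.foreignword:before { content: "《"; }
.foreignword:after  { content: "》"; }

A site-wide change such as “document names are now bold rather than italic” then becomes a one-line CSS edit instead of a hunt through every page for the right <i> tags.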

I always forget how to get around the namespace issue when transforming XHTML files to XHTML using XSL, and it always takes ages for me to figure it out again. So I’m going to make a note here to remind me. This seems to work:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="2.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:html="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/02/xpath-functions"
    xmlns:xdt="http://www.w3.org/2005/02/xpath-datatypes"
    xmlns:saxon="http://icl.com/saxon"
    exclude-result-prefixes="saxon fn xs xdt html">

<xsl:output method="xhtml" encoding="UTF-8"
    doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    indent="no" />

Then you need to refer to elements in the source document using the html: namespace prefix, eg. <xsl:template match="html:div">…</xsl:template>.
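
Because the stylesheet declares xmlns="http://www.w3.org/1999/xhtml" as the default namespace, literal result elements come out in the XHTML namespace without a prefix, while source elements are matched with the html: prefix. A minimal sketch of such a template (my illustration, not part of the original stylesheet):

<!-- match a div in the source document and rebuild it in the output; together with the
     copy-everything template shown below, this keeps the div's attributes and content -->
<xsl:template match="html:div">
    <div>
        <xsl:apply-templates select="@*|node()"/>
    </div>
</xsl:template>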

I always have to look up the template that copies everything not fiddled with in the other templates, too, so here it is, for good measure:

<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

>> Use it!

Picture of the page in action.

Although I already have a picker for Arabic, Persian and Urdu, I have developed another that is specifically for inputting Urdu. One reason for this is to reduce the choice of characters so that the user is more likely to select the right character for Urdu (eg. heh goal rather than Arabic heh). Another is to provide shortcuts for things like aspirated letters and some common combinations (like the word ‘Allah’).

It includes characters used for Urdu in Unicode 5.0. Most of the characters in the Urdu standard UZT 1.01 are included.

The aspirated letters of the alphabet can be entered with a single click. Also, base characters with diacritics can be inserted into the text with a single click where NFC normalisation would produce a single precomposed character.
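
As an illustration of the second point (my example; the post does not list specific characters): a base letter plus a combining mark that NFC would compose can be inserted as the single precomposed character instead.

<!-- decomposed sequence: U+0627 ARABIC LETTER ALEF followed by U+0653 ARABIC MADDAH ABOVE -->
&#x0627;&#x0653;
<!-- what one click inserts: the NFC-equivalent precomposed character U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE -->
&#x0622;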

Letters of the alphabet are shown in alphabetic order at the top left, digits are in keypad order, and combining characters related to vowel sounds are shown along the bottom. The lower middle section contains useful but non-alphabetic characters and punctuation. To the right are various symbols. Hinting is implemented for visually similar glyphs.

>> Use it!

Picture of the page in action.

Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes a picker much more usable than a regular character map utility.

The Bengali picker includes all the characters in the Unicode 5.0 Bengali block. Note: an important addition to the Bengali block in version 4.1, a single character for khanda ta, may not yet be supported in fonts, but it has been added to this version of the picker.

Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. Hinting is implemented for visually similar glyphs.

A function has also been added to transliterate Bengali text to Latin, though the scheme used is not standard, and may change at short notice. Don’t use it in anger yet.

I’ve been wanting to improve the editing behaviour of my pickers for quite some time, so that users can interact more easily with the keyboard and insert characters into the middle of a composition, not just at the end. In fact, the output area now keeps the focus all the time, which is a major improvement to the usability of the pickers.

This week I made those things happen, and created a new template with some other changes, too.

An updated Bengali picker is first out of the box, but look out for a brand new Urdu-specific picker to follow close on its heels. I will retrofit the new template to other pickers as time allows, or need dictates.

I also beefed up the font selection list with a large number of TT and OT fonts, and improved the reference material at the bottom.

I improved the mechanism that highlights similar characters, to give more fine-grained control to the associations between characters.

I also added a field just under the title that gives information about the character the user is mousing over, and added a search field to help users find characters for which they know the Unicode name or number. I plan to extend the information associated with characters in future to include native names (eg. e-kar) and other useful search info.

I also changed the scripting and HTML so that a single click can now produce multiple characters in the composition field. This will allow users to input ligatures like the Indic ‘ksha’ or Urdu aspirated consonants, or complex sequences tied to ligatures (like the word ‘Allah’), with a single click.
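
For instance (my illustration of the kind of mapping involved, not a list taken from the picker), a single aspirated-consonant click could emit a two-character sequence such as:

<!-- ‘bha’: U+0628 ARABIC LETTER BEH followed by U+06BE ARABIC LETTER HEH DOACHASHMEE -->
&#x0628;&#x06BE;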

Some things have also been removed. There is no DEL button now, since you can do that more easily from the keyboard. Spaces are available from the (now rationalised) character area, rather than from a button. And there is no longer an option to switch between graphics and characters for the selection. This is partly for simplicity, and partly to make it easier to represent some of the slightly more complicated selection options I want to add in future. For example, specific shapes are appropriate for the Arabic characters used in Urdu, and I don’t want to leave it to chance whether the user’s system has the right fonts to produce the desired shapes.

Getting to this actually required a huge amount of unseen work, since I had to wrap all the images in button markup and move and change attributes, etc., so that the composition box retains the focus in IE (it already worked fine in Firefox, Opera and Safari). I also, of course, made significant, but probably not noticeable, changes to the JavaScript and CSS.
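
The markup change was roughly of this shape; this is a guess at the pattern rather than the picker’s actual code, and the real attributes and event handlers differ:

<!-- before: a bare image acting as a clickable character -->
<img src="beh.gif" alt="beh" />

<!-- after: the same image wrapped in button markup, which is what lets the composition box keep the focus in IE -->
<button type="button"><img src="beh.gif" alt="beh" /></button>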