Dochula Pass, Bhutan

Tim Greenwood just pointed out to me a ‘bug’ in my converter program, which I think is actually a bug in Firefox (though I imagine someone implemented it as a feature).

If you type A0 (the hex code for a non-breaking space) in the Hexadecimal code points field, then press Convert, you will get a blank space in the Characters field that should be U+00A0 NO-BREAK SPACE. Then press Convert or View Names above this Characters field and you’ll find that what was supposed to be an NBSP has changed into an ordinary space. IE7, Opera and Safari all continue to show the character in the field as an NBSP.

(However, all four browsers substitute an ordinary space when you copy and paste the text from the Characters field into something else.)

I tried this with a range of other types of space, but saw no such behaviour (try it). They all remained themselves.

Anyone know what this is about?




Blue Beanie Day

Originally uploaded by r12a

Monday, November 26, 2007 is the day thousands of Standardistas (people who support web standards) will wear a Blue Beanie to show their support for accessible, semantic, and hopefully internationalized web content.

I haven’t got a blue hat, so I cheated a little by borrowing bits of the cover of Jeffrey Zeldman’s great book, “Designing with Web Standards”. That’s me under the hat though.

(If you’re wondering, the text on the left says the same as the text top right, in Arabic, Urdu, Inuktitut, Simplified Chinese, Traditional Chinese, Kazakh, Greek, Dzongkha, Ethiopian, Hebrew, Hindi, Nepali, Japanese, Korean, Hungarian, Punjabi, Thai and Venda.)

See the Flickr pool.


The word Mandalay in Myanmar script.

I’ve been brushing up on the Myanmar script, since major changes are on the way with Unicode 5.1.

I upgraded my Myanmar picker to handle the new characters, and I edited my notes on how the script works.

The new characters will make a big difference to how you author text in Unicode, and people will need to update existing pages to bring them in line with the new approach. The changes should make it much easier to create content in Burmese, in addition to addressing some niggly problems with making the script work correctly. One reason the changes were sanctioned is that there is currently very little Burmese content out there in Unicode.

I’ll be updating my character by character notes later too.

The only problem with all this is that existing fonts will all need to be changed to support the new world order (or Myanmar order). I found one font from the Myanmar Unicode & NLP Research Center that is already 5.1-ready. If you don’t want to download that font, you’ll need to read the PDF version of my notes on the script.

That would be a pity, however, since I had some fun adding JavaScript to the article today, so that it displays a breakdown, character by character, of each example as you mouse over it (using images, so you see it properly).

I’m at the ITS face-to-face meeting in Prague, Czech Republic and I’ve been trying to learn to read Czech words. Jirka Kosek showed me a Czech tongue-twister last night at dinner.

Strč prst skrz krk.

How amazing is that? A whole sentence without vowels! (Means “Put your finger down your throat.” – I’m wondering whether that has something to do with the missing vowels…)

See a video of Jirka pronouncing it.


Multiple scripts in XMetaL’s tags-on view.

I received a query from someone asking:

I try to edit Lao and Thai text with XMetaL 5.0, but nothing is displayed but squares. In fact, the Unicode characters seem to be correctly saved in the XML file and are displayed correctly in Firefox (for example), but I can’t get a correct display in XMetaL. Is it a font problem?

There are two places this needs to be addressed:

  1. in the plain text view
  2. in the tags-on view

For the plain text view, it is a question of setting a font that shows Lao and Thai (or whatever other language/script you need) in Tools>Options>Plain Text View>Font. You can only set one font at a time, so a wide-ranging Unicode font like Arial Unicode MS or Code2000 may be useful for Windows users.

For the tags-on view (which is the view I use most of the time) you need to edit the CSS file that sets the editor’s styling for the DOCTYPE you are working with. This may be in one of a number of places. The one I edit is C:\Program Files\Blast Radius\XMetaL 4.6\Author\Display\xhtml1-transitional.css.

I added the following to mine. I chose fonts I have on my PC, and set the font sizes relative to the size I set for my body element. You should, of course, choose your own fonts and sizes.

[lang="am"] { font-family: "Code2000", serif; font-size: 120%; }
[lang="ar"] {font-family: "Traditional Arabic", sans-serif; font-size: 200%; }
[lang="bn"] {font-family: SolaimanLipi, sans-serif; font-size: 200%; }
[lang="dz"] { font-family: "Tibetan Machine Uni", serif; font-size: 140%; }
[lang="he"] {font-family: "Arial Unicode MS", sans-serif; font-size: 120%;}
[lang="hi"] {font-family: Mangal, sans-serif;  font-size: 120%;}
[lang="kk"] {font-family: "Arial Unicode MS", sans-serif;  }
[lang="iu"] {font-family: Pigiarniq, Uqammaq, sans-serif; font-size: 120%; }
[lang="ko"] { font-family: Batang, sans-serif; font-size: 120%;}
[lang="ne"] {font-family: Mangal, sans-serif;  font-size: 120%; }
[lang="pa"] { font-family: Raavi, sans-serif; font-size: 120%;}
[lang="te"] {font-family: Gautami, sans-serif; font-size: 140%;}
[lang="my"] {font-family: Myanmar1, sans-serif; font-size: 200%;}
[lang="th"] {font-family: "Cordia New", sans-serif; font-size: 200%; }
[lang="ur"] { font-family: "Nafees Nastaleeq", serif; font-size: 130%;}
[lang="ve"] { font-family: "Arial Unicode MS", sans-serif; }
[lang="zh-Hans"] { font-family: "Simsun", sans-serif; font-size: 140%; }
[lang="zh-Hant"] { font-family: "Mingliu", sans-serif; font-size: 140%; }

Note that I would have preferred to say :lang(am) { font-family… } etc., but XMetaL 4.6 seems to require you to specify the attribute value as shown above. (You also have to specify class selectors as [class="myclass"] {…} rather than .myclass {…}.)
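For reference, here is what the first of those rules would look like with the selector I would have preferred, plus a class selector in its usual form, in an editor or browser with fuller selector support (the class rule is an invented illustration):

/* the Amharic rule above, using the preferred selector */
:lang(am) { font-family: "Code2000", serif; font-size: 120%; }

/* a class selector in its usual form */
.myclass { font-family: Mangal, sans-serif; }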

I see from a recent Bugzilla report and some cursory testing that a (very) long-standing bug in Mozilla related to complex scripts has now been fixed.

Complex scripts include many non-Latin scripts that use combining characters or ligatures, or that, like the Arabic script, apply shaping to adjacent characters.

It used to be that, when you highlighted text in a complex script, as you extended the edges of the highlighted area you would break apart combining characters from their base character, split ligatures and disrupt the joining behaviour of Arabic script characters.

The good news is that this no longer happens – it was fixed by the new text frame code. The bad news is that highlighting still proceeds character by character, rather than at grapheme boundaries – which can make it tricky to know whether you have selected the combining characters or not.

UPDATE: I hear from Kevin Brosnan that the following issue will also be fixed in Firefox 3. Hurrah! And thank you, Mozilla team.

What doesn’t appear to be fixed is the behaviour of Asian scripts when the CSS text-align: justify is applied. :(

I raised a bug report about this. I was amazed, having heard about the problem from Indians and Pakistanis too, that there didn’t seem to be one already. Come on users, don’t leave this up to the W3C!

Basically, the issue is that if you apply text-align: justify to text in an Indic or Tibetan script, the combining characters all get rendered alongside their base characters, ie. you go from this (showing, respectively, Tibetan, Devanagari (Hindi and Nepali), Punjabi, Telugu and Thai text):

Picture of text with no alignment.

to this:

Picture of text with justify alignment.

Strangely, the effect doesn’t seem to apply to the Thai text, nor to other text with combining characters that I’ve tried.

That’s a pretty big bug for people in the affected region because it effectively means that text-align:justify can’t be used.
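If you want to check the behaviour in your own browser, a minimal test case needs nothing more than the following sketch (the class name is mine, any Devanagari text containing combining vowel signs should do, and you’ll want to repeat the text enough to fill a few lines):

<style type="text/css">
  .test { text-align: justify; }
</style>
<!-- The combining vowel signs in this Hindi text should stay attached
     to their base consonants when the paragraph is justified. -->
<p class="test" lang="hi" xml:lang="hi">नमस्ते दुनिया</p>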

Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, and this work may lead to similar approaches in a number of other countries.

Sarmad writes: “We are trying to make URLs usable in Urdu for people who are not literate in any other language (a large majority of the literate population in Pakistan). ICANN has only given specs for the domain name in other languages (through its RFCs). Until they allow TLDs in Urdu, we are considering an application-end solution: have a plug-in for the browser, for people who want to use it, which takes a URL in Urdu, strips and maps all the TLD information to .com, .pk, etc., and converts the domain name to punycode. Thus, people can type URLs in pure Urdu, which are converted to the mixed English-Urdu URLs by the application layer, which ICANN currently allows.”

“We are currently trying to figure out what would be the ‘academic’ requirements/solutions for a language. To practically solve the problem, organizations like ICANN would need to come up with the solutions.”

There are some aspects to Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (eg. vowel diacritics) or converted to other characters (eg. Arabic characters) during the conversion to punycode.

This is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet. There is much more potential for ambiguity in the choice of characters in Urdu. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, ie. Arabic or Persian (also written with the Arabic script) would need slightly different rules.
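To make the punycode step concrete: in IDNA, each non-ASCII label of a domain name is converted separately into an ASCII string prefixed with xn--. A familiar non-Urdu example, since the Urdu mappings are precisely what is still under discussion:

What the user types:   http://bücher.example/
After conversion:      http://xn--bcher-kva.example/

The question Sarmad raises is what should happen to characters such as vowel diacritics before that conversion takes place.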

I find myself wondering, though, whether a plug-in that strips out or converts the characters during punycode conversion would really work in practice. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type IDNs if they found themselves using a browser that didn’t have the plug-in (eg. the businessman visiting a corporation in the US that prevents ad hoc downloads of software). On the one hand, I wonder whether we can expect a user who sees a URI containing vowel diacritics on a hard-copy brochure to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email would not be able to guarantee that users would have access to the plug-in. In that case, they would be unwise to use things like short vowel diacritics, since the user cannot easily change the link if they don’t have the plug-in. Imagine a vowelled IDN coming through in a plain-text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for any language. But there is an additional problem: the language of the original URI would need to be correctly determined before such rules could be applied. That too seems a bit difficult.

So I can see the problem, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might cause more trouble than benefit, by replacing the problems of errors and ambiguities with the problem of non-interoperable IDNs.

I have posted this to the www-international list for discussion.

Follow this link to see lists of characters that may be removed or converted.


Ruby text above and below Japanese characters.

My last post mentioned an extension that takes care of Thai line breaking. In this post I want to point to another useful extension that handles ruby annotation.

Typically, ruby is used in East Asian scripts to provide phonetic transcriptions of obscure characters, or characters that the reader is not expected to be familiar with. For example, it is widely used in educational materials and children’s texts. It is also occasionally used to convey information about the meaning of ideographic characters. For more information see Ruby Markup and Styling.

Ruby annotation (called 振り仮名 [furigana] in Japanese) is described by the W3C’s Ruby Annotation spec. The markup comes in two flavours, simple and complex.

Ruby markup is a part of XHTML 1.1 (served as XML), but native support is not widely available. IE doesn’t support XHTML 1.1, but it does support simple ruby markup in HTML and XHTML 1.0. This extension provides support in Firefox for both simple and complex ruby, in HTML, XHTML 1.0 and XHTML 1.1.
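To make the difference concrete, here is roughly what the two flavours look like, using 東京 [Tokyo] as the base text:

<!-- Simple ruby: one annotation for the whole base; the rp elements
     provide fallback parentheses for browsers without ruby support -->
<ruby><rb>東京</rb><rp>(</rp><rt>とうきょう</rt><rp>)</rp></ruby>

<!-- Complex ruby: the annotation is split so that each part lines up
     with its own base character -->
<ruby>
  <rbc><rb>東</rb><rb>京</rb></rbc>
  <rtc><rt>とう</rt><rt>きょう</rt></rtc>
</ruby>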

It passes all the I18n Activity ruby tests, with the exception of one *very* minor nit related to spacing of complex ruby annotation.


Before and after applying the extension.

Samphan Raruenrom has produced a Firefox extension based on ICU to handle Thai line breaking.

Thai line breaks respect word boundaries, but there are no spaces between words in written Thai. Spaces are used instead as phrase separators (like English comma and full stop). This means that dictionary-based lookup is needed to properly wrap Thai text.
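By way of comparison, without dictionary support the only way for an author to get correct wrapping is to hand-insert invisible break opportunities, eg. U+200B ZERO WIDTH SPACE, between words – exactly the drudgery the extension removes:

<!-- สวัสดีครับ is two words; the &#x200B; marks the only point
     at which the line may wrap -->
<p lang="th" xml:lang="th">สวัสดี&#x200B;ครับ</p>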

The current release works on Windows with the current Firefox release, 2.0.0.4. The next release will also support Linux, as well as future Firefox/Thunderbird releases.

You can test this on our i18n articles translated into Thai.

This replaces work on a separate Thai version of Firefox.

UPDATE: This post has now been updated, reviewed and released as part of a W3C article. See http://www.w3.org/International/questions/qa-personal-names.

Here are some more thoughts on dealing with multi-cultural names in web forms, databases, or ontologies. See the previous post.

Script

The first thing that English speakers must remember about other people’s names is that a large majority of them don’t use the Latin alphabet, and a majority of those that do use accents and characters that don’t occur in English. It seems obvious once I’ve said it, but it has some important consequences for designers that are often overlooked.

If you are designing an English form, you need to decide whether you expect people to enter names in their own script or in an ASCII-only transcription. What people will type into the form will often depend on whether the form and its page are in their language. If the page is in their language, don’t be surprised to get back non-Latin or accented Latin characters.

If you hope to get ASCII-only, you need to tell the user.

The decision about which is most appropriate will depend to some extent on what you are collecting people’s names for, and how you intend to use them.

  • Are you collecting the person’s name just to have an identifier in your system? If so, it may not matter whether the name is stored in ASCII-only or native script.
  • Or do you plan to call them by name on a welcome page or in correspondence? If you will correspond using their name on pages written in their language, it would seem sensible to have the name in the native script.
  • Is it important for people in your organization who handle queries to be able to recognise and use the person’s name? If so, you may want to ask for a transcription.
  • Will their name be displayed or searchable (for example, Flickr optionally shows people’s names as well as their user names on their profile pages)? If so, you may want to store the name in both ASCII and native script, in which case you probably need to ask the user to submit their name in both native-script and ASCII-only form, using separate fields.

Note that if you intend to parse a name, you may need to use country- or language-specific algorithms to do so correctly (see the previous blog post on personal names).

If you do accept non-ASCII names, you should use the UTF-8 encoding in your pages, your back-end databases and in all the scripts in between. This will significantly simplify your life.
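Putting those pieces together, a form for the last scenario in the list above might look something like this sketch (the field names and wording are just for illustration):

<form action="/register" method="post" accept-charset="utf-8">
  <label for="name-native">Name (in your own script)</label>
  <input type="text" id="name-native" name="name-native" />

  <label for="name-ascii">Name (Latin letters only, please)</label>
  <input type="text" id="name-ascii" name="name-ascii" />
</form>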


Icons chosen by McDonald’s to represent, from left to right, Calories, Protein, Fat, Carbohydrates and Salt.

I just read a fascinating article about how McDonald’s set about testing the cultural acceptability of a range of icons intended to point to nutritional information. It talks about the process and gives examples of some of the issues. Very nice.

Interesting, also, that they still ended up with local variants in some cases.

Creating a New Language for Nutrition: McDonald’s Universal Icons for 109 Countries

Picture of Tibetan emphasis.

Christopher Fynn of the National Library of Bhutan raised an interesting question on the W3C Style and I18n lists. Tibetan emphasis is often achieved using one of two small marks below a Tibetan syllable, a little like Japanese wakiten. The picture shows U+0F35 TIBETAN MARK NGAS BZUNG NYI ZLA in use. The other form is U+0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS.

Chris was arguing that using CSS, rather than Unicode characters, to render these marks could be useful because:

  • the mark applies to, and is centred below, a whole ‘syllable’ – not just the stack of the syllable – and this may be easier to achieve with styling than with font positioning where, say, a syllable has an even number of head characters (see the examples at the far right in the picture)
  • it would make it easier to search for text if these characters were not interspersed in it
  • it would allow for flexibility in approaches to the visual style used for emphasis – you would be able to change between using these marks or alternatives, such as red colour or changes in font size, just by changing the CSS style sheet, as we can for English text (see the sketch after this list).
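That last point is easy to picture. Assuming the emphasised syllables were marked up with em and tagged as Dzongkha, switching between conventions would be a one-line change in the style sheet. This is a sketch only – actually positioning the marks themselves is precisely what CSS could not yet do:

/* one convention: show emphasis in red */
em:lang(dz) { color: red; font-style: normal; }

/* an alternative convention: show emphasis slightly larger */
em:lang(dz) { font-size: 115%; font-style: normal; }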

There are potential issues with this approach too. These include the fact that the horizontal centring of glyphs within the syllable is not trivial. The vertical placement is also particularly difficult: you will notice from the picture that the height depends on the depth of the text the mark falls below. On the other hand, it isn’t easy to achieve this with diacritics either, given the number of possible permutations of characters in a syllable. Such positioning is much more complicated than that of the Japanese wakiten.

A bigger issue may turn out to be that the application for this is fairly limited, and user agent developers have other priorities – at least for commercial applications.

To follow along with, and perhaps contribute to, the discussion, follow the thread on the style list or the www-international list.

UPDATE: This post has now been updated, reviewed and released as a W3C article. See http://www.w3.org/International/questions/qa-personal-names.

People who create web forms, databases, or ontologies in English-speaking countries are often unaware how different people’s names can be in other countries. They build their forms or databases in a way that assumes too much on the part of foreign users.

I’m going to explore some of the potential issues in a series of blog posts. This content will probably go through a number of changes before settling down to something like a final form. Consider it more like a set of wiki pages than a typical blog post.

Scenarios

A form that asks for your name in a single field.
A form that asks for separate first and last names.

It seems to me that there are a couple of key scenarios to consider.

A. You are designing a form in a single language (let’s assume English) that people from around the world will be filling in.

B. You are designing a form in one language, but the form will be adapted to suit the cultural differences of a given locale when the site is translated.

In reality, you will probably not be able to localise for every different culture, so even if you rely on approach B, some people will still use a form that is not intended specifically for their culture.

Examples of differences

To get started, let’s look at some examples of how people’s names are different around the world.

Given name and patronymic

In the name Björk Guðmundsdóttir, Björk is the given name. The second part of the name indicates the father’s (or sometimes the mother’s) name, followed by -son for a male and -dóttir for a female, and is more of a description than a family name in the Western sense. Björk’s father, Guðmundur, was the son of Gunnar, so is known as Guðmundur Gunnarsson.

Icelanders prefer to be called by their given name (Björk), or by their full name (Björk Guðmundsdóttir). Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.

Other cultures where a person has one given name followed by a patronymic include parts of Southern India, Malaysia and Indonesia.

Different order of parts

In the name 毛泽东 [mao ze dong], the family name is Mao, ie. the first name reading left to right. The given name is Dong. The middle character, Ze, is a generational name, common to all his siblings (such as his brothers and sister, 毛泽民 [mao ze min], 毛泽覃 [mao ze tan] and 毛泽红 [mao ze hong]).

Among acquaintances Mao may be referred to as 毛泽东先生 [mao ze dong xiān shēng] or 毛先生 [mao xiān shēng]. Not everyone uses generational names these days, especially in Mainland China. If you are on familiar terms with someone called 毛泽东, you would normally refer to them using 泽东 [ze dong], not just 东 [dong].

Note also that the names are not separated by spaces.

The order family name followed by given name(s) is common in other countries, such as Japan, Korea and Hungary.

Chinese people who deal with Westerners will often adopt an additional given name that is easier for Westerners to use. For example, Yao Ming (family name Yao, given name Ming) may write his name for foreigners as Fred Yao Ming or Fred Ming Yao.

Multiple family names

Spanish-speaking people will commonly have two family names. For example, Maria-Jose Carreño Quiñones may be the daughter of Antonio Carreño Rodríguez and María Quiñones Marqués.

You would refer to her as Señorita Carreño, not Señorita Quiñones.

Variant forms

We already saw that the patronymic in Iceland ends in -son or -dóttir, depending on whether the child is male or female. Russians use patronymics as their middle name, but also use family names, in the order given-patronymic-family. The endings of the patronymic and family names indicate whether the person in question is male or female. For example, the wife of Борис Никола́евич Ельцин (Boris Nikolayevich Yeltsin) is Наина Иосифовна Ельцина (Naina Iosifovna Yeltsina) – note how the husband’s names end in consonants, while the wife’s names (even the patronymic from her father) end in a.

Mixing it up

Many cultures mix and match these differences from Western personal names, and add their own novelties.

For example, Velikkakathu Sankaran Achuthanandan is a Kerala name from Southern India, usually written V. S. Achuthanandan, which follows the order familyName-fathersName-givenName. In many parts of the world, parts of names are derived from titles, locations, genealogical information, caste, religious references, and so on, eg. the Arabic Abu Karim Muhammad al-Jamil ibn Nidal ibn Abdulaziz al-Filistini.

In Vietnam, names such as Nguyễn Tấn Dũng follow the order family-middle-given name. Although this seems similar to the Chinese example above, even in a formal situation this Prime Minister of Vietnam is referred to using his given name, ie. Mr. Dung, not Mr. Nguyen.

Further reading

Wikipedia sports a large number of fascinating articles about how people’s names look in various cultures around the world. I strongly recommend a perusal of the following links.

Akan, Arabic, Balinese, Bulgarian, Czech, Chinese, Dutch, Fijian, French, German, Hawaiian, Hebrew, Hungarian, Icelandic, Indian, Indonesian, Irish, Italian, Japanese, Javanese, Korean, Lithuanian, Malaysian, Mongolian, Persian, Philippine, Polish, Portuguese, Russian, Spanish, Taiwanese, Thai, Vietnamese

Consequences

If you are designing a form or database that will accept names from people with a variety of backgrounds, you should ask yourself whether you really need separate fields for given name and family name.

This will depend on what you need to do with the data, but obviously it will be simpler to just use the full name as the user provides it, where possible.

Note that if you have separate fields because you want to use the person’s given name to communicate with them, you may have problems due not only to name syntax: there are also varying expectations around the world with regard to formality that need to be taken into account. It may be better to ask separately, when setting up a profile for example, how that person would like you to address them.

If you do still feel you need to ask for constituent parts of a name separately, try to avoid using the labels ‘first name’ and ‘last name’, since these can be confusing for people who normally write their family name followed by given names.
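For example, a sketch of the two options (the wording is just a suggestion):

<!-- Simplest: a single field makes no assumptions about name structure -->
<label for="fullname">Full name</label>
<input type="text" id="fullname" name="fullname" />

<!-- If you must split the name, prefer these labels to
     'first name' and 'last name' -->
<label for="given">Given name(s)</label>
<input type="text" id="given" name="given" />
<label for="family">Family name</label>
<input type="text" id="family" name="family" />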

Be careful, also, about assumptions built into algorithms that pull out the parts of a name automatically. For example, the vCard and hCard approach of implied “n” optimization could have difficulties with, say, Chinese names. You should be as clear as possible about telling people how to specify their name, so that you capture the data you think you need.

If you are designing forms that will be localised on a per-culture basis, don’t forget that atomised name parts may still need to be stored in a central database, which therefore needs to be able to represent all the various complexities that you dealt with by relegating the form design to the localisation effort.

I’ll post some further issues and thoughts about personal names when time allows.

[See part 2.]

This morning I came across an interesting set of principles for site design. It was developed as part of the BBC 2.0 project.

That led me to the BBC Director General’s “BBC 2.0: why on demand changes everything”. Also a very interesting read, as a case study of the web as a medium of mass communication.

One particular topic out of several I found of interest:

Interestingly, on July 7th last year, which was by far the biggest day yet for the use of rich audio-visual content from our news site, the content most frequently demanded was the eyewitness user generated content (UGC) from the bomb scenes.

Shaky, blurry images uploaded by one member of the public, downloaded by hundreds of thousands of other members of the public.

It’s a harbinger of a very different, more collaborative, more involving kind of news.

Here, as at so many other points in the digital revolution, the public are moving very quickly now – at least as quickly as the broadcasters.

I also find it interesting to see how news spreads through channels like Flickr. Eighteen months ago we were mysteriously bounced out of bed at 6am, but there was nothing on the TV to explain what had happened. I went up to the roof patio and took some of the first photos of the Buncefield explosion, including one taken just 20 minutes after the blast, and uploaded them to Flickr. A slightly later photo hit the number one spot for interestingness that day. And as many other people’s photos appeared, it was possible to get a lot of information about what had happened, even ahead of the national news, including eyewitness accounts.

Over the past 24 hours I’ve been exploring with interest the localization of the Flickr UI.

One difference you’ll notice, if you switch to a language other than English, is that the icons above a photo such as on this page have no text in them.

I checked with Flickr staff, and they confirmed that this is because of the difficulty of squeezing the translated text into the space available. A classic localization issue, and one that developers and designers should always consider when designing a UI.

For example, here’s the relevant part of the English UI:

Picture of the English version, with text embedded in graphics.

and here is what it looks like when using the Spanish (or indeed any other) UI:

Picture of the Spanish version with only icons, no text.

The text has been dropped in favour of just icons. Note, however, how the text appears in the tooltips that pop up as you mouse over the icons.

This can be an effective way of addressing the problems of text expansion during translation, as long as the icons are understandable, memorable and free from cultural bias or offence. Using the tooltips to clarify the meaning is useful too. I think these icons work well, and I’d actually like the Flickr folks to make the English version look like this too. It detracts less from the photos, to my mind.

Here’s what it may have looked like if Flickr had done the Portuguese in the same way as the English:

Picture of a hypothetical Portuguese version with text in the graphics.

There are a number of problems here. The text is quite dense, it overshoots the width of the photo (there are actually still two letters missing on the right), and it is quite hard to see the accented characters. The text would have to be in a much bigger font to support the complexity of the characters in Chinese and Korean (and of course many other future languages).

Of course, in many situations where text appears in graphics the available width for the text is strictly fixed, usually to a space that only just fits the English (if that’s the source).

Text will usually expand when translating from English, in particular. This expansion can be particularly pronounced for short pieces of text like icon labels.

So the moral of this story: Think several times before using text in graphics, and in particular icons. If you need to localise your page later, you could have problems. Better just avoid it from the start if you can.

That’s called internationalization ;-)





Presenting in San Francisco
Flickr photostream.

>> Get my slides!

The long-awaited @media conference is finally over. It went ok, I thought. I’d been looking forward to carrying the i18n gospel to the heathens of the design and development community. ;-)

It was great to have a single track in San Francisco. Of course, given that there were two tracks in London, my audience there wasn’t huge – though I’m guessing that about a third of the 700-odd attendees came, which isn’t too bad – especially since I was up against Dan Cederholm (even I wanted to see Dan again). It’s always frustrating that people don’t know how useful they’d find talks on i18n until they have accidentally been to one.

Anyway, I got a lot out of the other excellent conference talks and enjoyed meeting, or getting to know better, many new people. I’m looking forward to next year already. It will be different, however, and probably a little quieter, given that Molly Holzschlag announced that she was leaving the web conference circuit, and Joe Clark announced at the very end of the conference that he was retiring (‘pretty much’) from accessibility. Good luck to them both.

As usual, there are lots of photos.

About the presentation

Check out slide 77 for a list of practical takeaways from the presentation.

The presentation was not designed to give you a thorough overview of potential internationalization and localization issues – we would need much longer for that. It aims to provide you with a few practical takeaways, but more importantly it aims to get you thinking about what internationalization is all about – to take you out of your comfort zone, and to help you realize that if you want your content to wow people outside your own culture and language, you need to build in certain flexibilities and adopt certain approaches during design and development, not as an afterthought. Otherwise you are likely to be creating substantial barriers to worldwide use.

The presentation also aims to show that, although using Unicode is an extremely good start to making your stuff world-ready, using a Unicode encoding such as UTF-8 throughout your content, scripts and databases is only a start. You need to worry about whether translators will be able to adapt your stuff linguistically, but you also need to consider whether graphics and design are going to be culturally appropriate or adaptable, and whether your approaches and methodologies fit with those of your target users.



There’s one piece of music I like so much I’d like it played at my funeral. I’ve not had an opportunity to tell anyone that yet, but this seems as good a time as any.

It’s a piece of music I listen to regularly, and never get tired of.

In the days when I did lots of acting, I would always lie down and try to completely relax immediately before my first entry onto the stage as Figaro, Iago, Pauvre Bitos, Biedermann, Trissotin, Ariel – whatever role. You’d find me lying in some quiet, dark corner with my headphones on, and always listening to this piece of music.

It is calming but rich, it’s exotic but deep, it mingles melody with tone poem. It would wake up my sensitivity to culture and feeling, and still does, while helping me wash away the worries and concerns of the day.

Ok, so you want to know what it is? Alright. It’s one of the pieces in the Trittico Botticelliano (Three Botticelli Pictures) by Respighi. Specifically, L’Adorazione dei Magi (The Adoration of the Magi). I found an audio track on the Web. The track is a bit rushed and lacks the refinement of the version I usually listen to, but it may give you an idea.

Enjoy.

(By the way, runners-up worth a mention would include the short but sweet Arabian Dance from Tchaikovsky’s Nutcracker Suite, and the very different, but also brilliant, Masquerade Waltz by Khachaturian.)

I was just pointed to an article entitled “Missing the PowerPoint of public speaking” by Guy Kewney (the URI ends with ‘death_by_powerpoint’).

I’m sorry, but I thought that was a fairly lame article.

First off, it is not about PowerPoint but about the use of slides: you could very easily replace ‘PowerPoint’ with ‘Slidy’ [an XHTML-based slide presentation tool used at the W3C] throughout the article.

Next, the author seems to assume that ‘visual notes’ have to be written sentences of text. In his conclusion he says that his ‘fundamental assumption’ is:

“If you engage the audience, get them to respond in some way, you’ll hold their attention and get good feedback. I don’t know how you’d set up a PowerPoint show that would allow that flexibility.”

While I’ll concede that a very large proportion of speakers do plaster their slides with text, it’s not inevitable. I try to use slides mostly to show pictures or illustrations, which I then talk about and point at, but document elsewhere in notes. (See, for example, my presentation “Practical & Cultural Issues in Designing International Web Sites”.) I find slides, used that way, a very powerful tool to assist communication. A picture conveys more than words, and facilitates understanding. I also find the examples used to be particularly effective in helping a listener remember things that were said – particularly when they later come across a similar example in their own work.

You can also use slides to show the audience, graphically, where they are in the discussion and what the current topic is. That can be a significant help in keeping people on board or guiding them towards your conclusion (and even in keeping the attention of people who already know the topic of one part of the talk, by showing them how it fits in with other things you will say that perhaps they don’t know).

Slides can also help you in structuring your discussion, bringing it in on time, and summarising key points. Of course, they have to be designed with care and used properly to do that.

I’ve seen plenty of people in conferences get a third of the way through their slides and realise that they have run out of time. What’s bad there is that people are just spewing out poorly designed slide content, and then not, as the author says, engaging with the audience.

But done right, creating a slide presentation can really help you structure (and, importantly, weed out, compact and bring down to an appropriate size) your thoughts, even if you don’t plan to project the slides. If you have 50 text-heavy slides for a 20-minute presentation, you can see, if you look, that it might be difficult to fit all your points into the allotted time. And because slides make you think of a presentation in terms of components, you can often easily see where slides (and therefore information) can be pruned to fit the allotted time while still getting the message across.

For what it’s worth, in my mind the critical part of creating an effective presentation is always deciding what to leave out, after you’ve decided what to put in. I usually start with my intended conclusion/effect, then decide what major points I need to make to get to that conclusion, and then use my time constraints and knowledge of the audience to decide how many points I need and how long I have to make each one. The argument is often more important than the data. You can’t be persuasive if you disrespect your audience.

By the way, if I do have a page with bullets on it, I try to always use builds. This keeps the audience focused on what you are currently saying, reduces the stress of trying to match text with speech, and stops them reading ahead when the earlier points still need their attention if they are to be properly understood.

PowerPoint, by the way, is a very handy tool for creating and organizing the types of slide I like to make. Note, however, that I try to prepare slide notes or articles so that people can rediscover the information later, and make them and the slides available in a portable format such as valid XHTML with CSS, or PDF.

Having said that, I suppose it also comes down to individual speaking styles. I can think of many occasions when I sat in a conference listening to a poor speaker and thanked my lucky stars that they had committed the main ideas to their slides in succinct and well-organized form, because it helped me understand what they were actually trying to say.

Another thought about the article. Guy says

“There are indeed a few situations – of the ‘blackboard notes’ lecture type – where PowerPoint is actually useful – where you want to have your audience stop and write down what you’re saying.”

I can’t think of any such occasions. Memories come back to me of one lecturer at Cambridge (and others elsewhere) who would spend lesson after lesson writing notes on speech synthesis on the blackboard, which we then copied into our notebooks. It seemed then, and seems now, like a monumental waste of time. I’d much rather have been given a copy of the notes: apart from sparing my aching wrist, I could have digested the content in a fraction of the time.

The value of being in the presence of the person conveying the knowledge is the time it makes available for the presenter to bring the topic to life through their presentation style (including, possibly, unwritten humour or memorable anecdotes), and the opportunity for the audience to ask and debate questions. If I knew in advance that someone at a conference was going to just repeat their paper or slides verbatim, I’d much rather spend my time in another track.

Another thought, unrelated to Guy’s article: the Takahashi-style presentations I’ve seen don’t always impress me. It’s good to get the presenter away from writing long sentences on the slides, and a single large word on a page can have a strong impact, temporarily. But after a few slides like that the impact begins to be lost; I find myself less able to distinguish words that are meant to convey impact from those that are just fillers, and I become lost too, in terms of where we are in the presentation. I think it is more useful to help the audience see (useful) relationships between concepts diagrammatically, rather than just flash concepts up in serial fashion. Like a site map, showing how ideas interconnect can help your listener better assimilate the relationships between the ideas you are putting before them.

You should always use the lang and/or xml:lang attributes in HTML or XHTML to identify the human language of the content, so that applications such as voice browsers, and mechanisms such as style sheets, can process that text appropriately. (See Declaring Language in XHTML and HTML for the details.)

You can override that language setting for a part of the document that is in a different language, eg. a French quotation in an English document, by using the same attribute(s) around the relevant bit of text.
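For an XHTML 1.0 page served as text/html, where you need both attributes, that looks like this:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
...
<!-- a French quotation inside the English document -->
<p>As they say in France, <q lang="fr" xml:lang="fr">c’est la vie</q>.</p>
...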

Suppose you have some text that is not in any language, such as type samples, part numbers, or perhaps program code. How would you say that this text is in no language in particular?

There are a number of possible approaches:

  1. A few years ago we introduced into the XML spec the idea that xml:lang="" conveys that ‘there is no language information available’. (See 2.12 Language Identification.)

  2. An alternative is to use the value ‘und’, for ‘undetermined’.

  3. In the IANA Subtag Registry there is another tag, ‘zxx’, that means ‘No linguistic content’. Perhaps this is a better choice. It has my vote at the moment.

xml:lang="": Is ‘no language information available’ suitable to express ‘this is not a language’? My feeling is not.

If it were appropriate, there are some other questions to be answered here. With HTML, an empty string value for the lang or xml:lang attribute produces a validation error.

It seems to me that the validator should not produce an error for xml:lang="", and that this needs to be fixed.

I’m not clear whether the HTML DTD supports an empty string value for lang. If so, then presumably the validator needs to be fixed there too. If not, then this is not a viable option, since you’d really want both lang and xml:lang to have the same values.

und: Would the description ‘undetermined’ fit this case, given that it is not a language at all? Again, it doesn’t seem right to me, since ‘undetermined’ suggests that it is a language of some sort, but we’re not sure which.

zxx: This seems to be the right choice to me. It would produce no validation issues. The only issue is perhaps that it’s not terribly memorable.
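So, assuming zxx does win the day, marking up one of the examples mentioned earlier would look something like this (the part number itself is invented):

<p>Quote the part number
<code lang="zxx" xml:lang="zxx">XG-3702-K</code> in all correspondence.</p>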

This is an attempt to summarise and move forward some ideas from a thread on www-international@w3.org by Christophe Strobbe, Martin Duerst, Björn Höhrmann and Tex Texin. I am sending this to that list once more.

I use XMetaL 4.6 for all my XHTML and XML authoring. As someone who has been advocating for some time that you should always declare the human language of your content when creating Web content, I’m finding XMetaL’s spell checker both exciting and frustrating. Here are a few tips that might help others.

The exciting part is that XMetaL figures out which spell-check dictionary to use based on the xml:lang language declarations. Given the following code:

<html xml:lang="en-us" lang="en-us" ... > 
...
<p>behavior localization color</p>
<p>behaviour localisation colour</p>
<p xml:lang="fr" lang="fr">ceci est français</p>
<p lang="gr" xml:lang="gr">Κάνοντας τον Παγκόσμιο Ιστό πραγματικά Παγκόσμιο</p>
...

The spell checker will recognize three errors (behaviour, localisation, colour). The en-us value in the html tag causes it to use the US English spell-check dictionary, and the fr and gr values in the last two paragraphs cause it to use a French and a Greek dictionary, respectively, for the words in those elements. Great!

Picture of the spell checker in action.

Note that, since XMetaL is an XML editor rather than an HTML editor, it is the value in the xml:lang attribute, rather than the one in the lang attribute, that counts here. For XHTML 1.0 content served as text/html, of course, you should use both.

The following, however, are things you need to watch out for:

  1. If your html tag contains just xml:lang="en", your spell checking won’t be terribly effective, since all the English dictionaries (US, UK, Australian and Canadian) will be used. This means that for the code above you will receive no error notifications, since each spelling occurs in at least one dictionary.

    This is logical enough, though it’s something you may not think about when spell checking. (Even if you go into the spell checker options and set, say, the US English spell checker, the language declaration will override that manual choice.)

  2. If you want to write British English, you would normally put en-GB in the xml:lang attribute (because that’s what BCP 47 says you should do). Unfortunately this produces no errors with our test case above! XMetaL doesn’t recognise the GB subtag, and falls back to the xml:lang="en" behaviour. To get the behaviour you are expecting, you have to put en-UK in xml:lang (see the example below). This is really bad: it means you are marking up your content incorrectly. Presumably the same holds true for other languages – I see CF for Canadian French rather than CA, SD for Swiss German rather than CH, etc.
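In other words, of the following two paragraphs only the second gets the UK dictionary in XMetaL 4.6, even though the first is the correctly tagged one:

<!-- correct per BCP 47, but XMetaL falls back to all English dictionaries -->
<p lang="en-GB" xml:lang="en-GB">behaviour localisation colour</p>

<!-- incorrect subtag, but this is what XMetaL needs for UK spell checking -->
<p lang="en-UK" xml:lang="en-UK">behaviour localisation colour</p>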

It’s good to see that the language markup is being used for spell-checking. However, it’s a case of two steps forward, one step back. Which is a shame.

UPDATE: JustSystems have worked on this some more. See my later blog post for details.
