Dochula Pass, Bhutan

>> Use it

Inspired by some comments on John Well’s blog, I decided to add a keyboard layout to the IPA picker today. It follows the layout of Mark Huckvale’s Unicode Phonetic Keyboard (UCL) v1.01.

I can’t say I understand why many of the characters are allocated to the keys they are, but I figured that if John Wells uses this keyboard it would be probably worth using its layout.

Bopomofo, or zhùyīn fúhào, is an alphabet that is used for phonetic transliteration of Chinese text. It is usually only used in dictionaries or educational texts, to clarify the pronunciation of the Chinese ideographic characters.

This post is intended to evolve over time. I’ll post other blog posts or tweets as it changes. The current content is to the best of my knowledge correct. Please contribute comments (preferably with pointers to live examples) to help build an accurate picture if you spot something that needs correcting or expanding.

The name bopomofo is equivalent to saying “ABCD” in English, ie. it strings together the pronunciation of the first four characters in the zhuyin fuhao alphabet.

For more information about bopomofo, see Wikipedia and the Unicode Standard.

In this post we will summarise how bopomofo is displayed, to assist people involved in developing the CSS3 Ruby specification. These notes will focus on typical usage for Mandarin Chinese, rather than the extended usage for Minnan and Hakka languages.

Characters and tone marks

These are the bopomofo characters in the basic Unicode Bopomofo block.

One of these characters, U+3127 BOPOMOFO LETTER I, can appear as either a horizontal or vertical line, depending on the context.

In addition to the base characters, there are a set of Unicode characters that are used to express tones. For Mandarin Chinese, these characters are :

02C9 MODIFIER LETTER MACRON
02CA MODIFIER LETTER ACUTE ACCENT
02C7 CARON
02CB MODIFIER LETTER GRAVE ACCENT
02D9 DOT ABOVE

See the list in UniView.

It is important to understand that bopomofo tone marks are not combining characters. They are regular spacing characters that are stored after the sequence of bopomofo letters that make up a syllable. These tone marks can be displayed alongside bopomofo base characters in one of two ways.

Bopomofo used as ruby

When used to describe the phonetics of Chinese ideographs in running text (ie. ruby), bopomofo can be rendered in different ways. A bopomofo transliteration is always done on a character by character basis (ie. mono-ruby).

Horizontal base, horizontal ruby

In this approach the bopomofo is generally written above horizontal base text.

There appear to be two ways of displaying tone marks: (1) following the bopomofo characters for each ideograph, and (2) above the bopomofo characters, as if they were combining characters. We need clarity on which of these approaches is most common, and which needs to be supported. For details about tone placement in (2) see the next section.

Tones following:
bopo-horiz-toneafter
Tones above:
bopo-horiz-tone-above

Horizontal base, vertical ruby

This is a common configuration. The bopomofo appears in a vertical line to the right of each base character. In general, tone marks then appear to the right of the bopomofo characters, however there are some complications with regard to the actual positioning of these marks (see the next section for details).

Example.

Vertical base, vertical ruby

This works just like horizontal base+vertical ruby.

Example.

Vertical base, horizontal ruby

I don’t believe that this exists.

Tones in bopomofo ruby

In ruby text, tones 2-4 are displayed in their own vertical column to the right of the bopomofo letters, and tone 1 is displayed above the column of bopomofo letters.

The first tone

The first tone is not displayed. Here is an example of a syllable with the first tone. There are two bopomofo letters, but no tone mark.

Picture showing the dot above other bopomofo characters.

Tones 2 to 4

The position of tones 2-4 depends on the number of bopomofo characters the tone modifies.

The Ministry of Education in Taiwan has issued charts indicating the expected positioning for vertically aligned bopomofo that conform roughly to this diagram:

Picture showing relative positions of tone marks and vertical bopomofo.

Essentially, about half of the tone glyph box extends upwards from the top of the last bopomofo character box.

Tones in horizontal ruby are placed differently, relative to the bopomofo characters, according to the Ministry charts. Essentially, about half the width of the tone glyph extends to the right of the last bopomofo character in the sequence.

The charts cover alignment for vertical text (here, here and here) and for horizontal text (here, here and here).

In some cases the tone appears to be simply displayed alongside the last character in vertical text, as shown in these examples:

Three characters with the MODIFIER LETTER ACUTE ACCENT tone, but different numbers of bopomofo letters.

The light tone

When a light tone is used (U+02D9 DOT ABOVE). This appears at the top of the column of bopomofo letters, even though when written it appears after these in memory. The image just below illustrates this.

Picture showing the dot above other bopomofo characters.

Note that the actual sequence of characters in memory is:

3109: BOPOMOFO LETTER D
3127: BOPOMOFO LETTER I
02D9: DOT ABOVE

The apparent placing of the dot above the first bopomofo letter is an artifact of rendering only.

Bopomofo written on its own

It is not common to see text written only in bopomofo, but it does occur from time to time for Chinese, and sometimes it is used for aboriginal Taiwanese languages.

In horizontal text

When written on its own in horizontal layout any tone marks are displayed as spacing characters after the syllable they modify.

Example: Example.

In vertical text

I haven’t seen bopomofo used in its own right in vertical text, so I don’t know whether in that case one puts the tone marks below the bopomofo letters for a syllable, or to the side like when bopomofo is used as ruby.

In horizontal text

I have also come across instances where a bopomofo character has been included among Chinese ideographs. It may be that this reflects slang or colloquial usage.

Example 1. Example 2.

Picture of the page in action.

>> Use it

This picker contains characters from the Unicode Mongolian block needed for writing the Mongolian language. It doesn’t include Sibe, Todo or Manchu characters. Mongolian is a complex script, and I am still familiarising myself with it. This is an initial trial version of a Mongolian picker, and as people use it and raise feedback I may need to make changes.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility.

About this picker: The output area for this picker is set up for vertical text. However, only Internet Explorer currently supports vertical text display, and only IE8 supports Mongolian’s left-to-right column progression. In addition, it seems that IE doesn’t support ltr columns in textarea elements. The bottom line is that, although the output area is the right shape and position for vertical text, mostly the output will be horizontal. You will see vertical text in IE, but the column positions will look wrong. Nevertheless, in any of these cases, when you cut and paste text into another document, the characters will still be correctly ordered.

Consonants are to the left, and in the order listed in the Wikipedia article about Mongolian text. To their right are vowels, then punctuation, spaces and control characters, and number digits. The variation selectors are positioned just below the consonants.

As you mouse over the letters, the various combining forms appear in a column to the far left. This is to help identify characters, for those less familiar with the alphabet.

In this post I’m hoping to make clearer some of the concepts and issues surrounding jukugo ruby. If you don’t know what ruby is, see the article Ruby for a very quick introduction, or see Ruby Markup and Styling for a slightly longer introduction to how it was expected to work in XHTML and CSS.

You can find an explanation of jukugo ruby in Requirements for Japanese Text Layout, sections 3.3 Ruby and Emphasis Dots and Appendix F Positioning of Jukugo-ruby (you need to read both).

What is jukugo ruby?

Jukugo refers to a Japanese compound noun, ie. a word made up of more than one kanji character. We are going to be talking here about how to mark up these jukugo words with ruby.

There are three types of ruby behaviour.

Mono ruby is commonly used for phonetic annotation of text. In mono-ruby all the ruby text for a given character is positioned alongside a single base character, and doesn’t overlap adjacent base characters. Jukugo are often marked up using a mono-ruby approach. You can break a word that uses mono ruby at any point, and the ruby text just stays with the base character.

Group ruby is often used where phonetic annotations don’t map to discreet base characters, or for semantic glosses that span the whole base text. You can’t split text that is annotated with group ruby. It has to wrap a single unit onto the next line.

Jukugo ruby is a term that is used not to describe ruby annotations over jukugo text, but rather to describe ruby with a slightly different behaviour than mono or group ruby. Jukugo ruby behaves like mono ruby, in that there is a strong association between ruby text and individual base characters. This becomes clear when you split a word at the end of a line: you’ll see that the ruby text is split so that the ruby annotating a specific base character stays with that character. What’s different about jukugo ruby is that when the word is NOT split at the end of the line, there can be some significant amount of overlap of ruby text with adjacent base characters.

Example of ruby text.

The image to the right shows three examples of ruby annotating jukugo words.

In the top two examples, mono ruby can be used to produce the desired effect, since neither of the base characters are overlapped by ruby text that doesn’t relate to that character.

The third example is where we see the difference that is referred to as jukugo ruby. The first three ruby characters are associated with the first kanji character. Just the last ruby character is associated with the second kanji character. And yet the ruby text has been arranged evenly across both kanji characters.

Note, however, that we aren’t simply spreading the ruby over the whole word, as we would with group ruby. There are rules that apply, and in some cases gaps will appear. See the following examples of distribution of ruby text over jukugo words.

Various examples of jukugo ruby.

In the next part of this post I will look at some of the problems encountered when trying to use HTML and CSS for jukugo ruby.

If you want to discuss this or contribute thoughts, please do so on the public-i18n-cjk@w3.org list. You can see the archive and subscribe at http://lists.w3.org/Archives/Public/public-i18n-cjk/

Webfonts, and WOFF in particular, have been in the news again recently, so I thought I should mention that a few days ago I changed my pages describing Myanmar and Arabic-for-Urdu scripts so that you can download the necessary font support for the foreign text, either as a TTF linked font or as WOFF font.

You can find the Myanmar page at http://rishida.net/scripts/myanmar/. Look for the links n the side bar to the right, under the heading “Fonts”.

The Urdu page, using the beautiful Nastaliq script, is at http://rishida.net/scripts/urdu/.

(Note that the examples of short vowels don’t use the nastiliq style. Scroll down the page a little further.)

I haven’t had time to check whether all the opentype features are correctly rendered, but I’ve been doing Mac testing of the i18n webfonts tests, and it looks promising. (More on that later.) The Urdu font doesn’t rely on OS rendering, which should help.

Here are some examples of the text on the page:
Examples of Urdu script and Myanmar script.

Analyser: http://rishida.net/tools/analysestring/

Converter: http://rishida.net/tools/conversion/

The string analyser tool provides information about the characters in a string. One difference in this version is a new section “Data input as graphics”, where you see a horizontal sequence of graphics for each of the characters in the string you are analysing. This can be useful to get a screen snap of the characters. Of course, there is no combining or ligaturing behaviour involved – just a graphic per character.

You can reverse the character order for right-to-left scripts.

Another difference is that you can explode example text in the notes. Take this example: if you click on the Arabic word for Koran (red word near the bottom of the notes), you’ll see a pop-up window in the bottom right corner of the window that lists the characters in that word.

The other change is that the former “Related links” section in the sidebar is now called “Do more”, and the links carry the string you are analysing to the Converter or UniView apps.

Oh, and the page now remembers the options you set between refreshes, which makes life much easier.

The converter tool converts between characters and various escaped character formats. It was changed so that the “View names” button sends the characters to the string analyser tool. This means that you’ll now see graphics for the characters, and that, once on the string analyser page, you can change the amount of information displayed for each character (including showing font-based characters, if you need to).

I also fixed a bug related to the UTF-8 and UTF-16 input. Including spaces after the code values no longer fires off a bug.

PS: The string analyser tool has graphics for all new Unicode 6.0 characters, however I haven’t updated the data for those characters yet. I was planning to do so with the next release of UniView, which should be in October, when the final Unicode database is available.

http://rishida.net/scripts/indic-overview/

I finally got around to refreshing this article, by converting the Bengali, Malayalam and Oriya examples to Unicode text. Back when I first wrote the article, it was hard to find fonts for those scripts.

I also added a new feature: In the HTML version, click on any of the examples in indic text and a pop-up appears at the bottom right of the page, showing which characters the example is composed of. The pop-up lists the characters in order, with Unicode names, and shows the characters themselves as graphics.

I have not yet updated this article’s incarnation as Unicode Technical Note #10. The Indian Government also used this article, and made a number of small changes. I have yet to incorporate those, too.

I recently came across an email thread where people were trying to understand why they couldn’t see Indian content on their mobile phones. Here are some notes that may help to clarify the situation. They are not fully developed! Just rough jottings, but they may be of use.

Let’s assume, for the sake of an example, that the goal is to display a page in Hindi, which is written using the devanagari script. These principles, however, apply to one degree or another to all languages that use characters outside the ASCII range.

Let’s start by reviewing some fundamental concepts: character encodings and fonts. If you are familiar with these concepts, skip to the next heading.

Character encodings and fonts

Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding.

There are many character encodings to choose from.

The person who created the content of the page you want to read should have used a character encoding that supports devanagari characters, but it should also be a character encoding that is widely recognised by browsers and available in editors. By far the best character encoding to use (for any language in the world) is called UTF-8.

UTF-8 is strongly recommended by the HTML5 draft specification.

There should be a character encoding declaration associated with the HTML code of your page to say what encoding was used. Otherwise the browser may not interpret the bytes correctly. It is also crucial that the text is actually stored in that encoding too. That means that the person creating the content must choose that encoding when they save the page from their editor. It’s not possible to change the encoding of text simply by changing the character encoding declaration in the HTML code, because the declaration is there just to indicate to the browser what key to use to get at the already encoded text.

It’s one thing for the browser to know how to interpret the bytes to represent your text, but the browser must also have a way to make those characters stored in memory appear on the screen.

A font is essential here. Fonts contain instructions for displaying a character or a sequence of characters so that you can read them. The visual representation of a character is called a glyph. The font converts characters to glyphs.

The font has tables to map the bytes in memory to text. To do this, the font needs to recognise the character encoding your page uses, and have the necessary tables to convert the characters to glyphs. It is important that the font used can work with the character encoding used in the page you want to view. Most fonts these days support UTF-8 encoded text.

Very simple fonts contain one glyph for each letter of the alphabet. This may work for English, but it wouldn’t work for a complex script such as devanagari. In these scripts the positioning and interaction of characters has to be modified according to the context in which they are displayed. This means that the font needs additional information about how to choose and postion glyphs depending on the context. That information may be built into the font itself, or the font may rely on information on your system.

Character encoding support

The browser needs to be able to recognise the character encoding used in order to correctly interpret the mapping between bytes and characters.

If the character encoding of the page is incorrectly declared, or not declared at all, there will be problems viewing the content. Typically, a browser allows the user to manually apply a particular encoding by selecting the encoding from the menu bar.

All browsers should support the UTF-8 character encoding.

Sometimes people use an encoding that is not designed for devanagari support with a font that produces the right glyphs nevertheless. Such approaches are fraught with issues and present poor interoperability on several levels. For example, the content can only be interpreted correctly by applying the specifically designed font; no other font will do if that font is not available. Also, the meaning of the text cannot be derived by machine processing, for web searches, etc., and the data cannot be easily copied or merged with other text (eg. to quote a sentence in another article that doesn’t use the same encoding). This practise seriously damages the openness of the Web and should be avoided at all costs.

System font support

Usually, a web page will rely on the operating system to provide a devanagari font. If there isn’t one, users won’t be able to see the Hindi text. The browser doesn’t supply the font, it picks it up from whatever platform the browser is running on.

If browser is running on a desktop computer, there may be a font already installed. If not, it should be possible to download free or commercial fonts and install them. If the user is viewing the page on a mobile device, it may currently be difficult to download and install one.

If there are several devanagari fonts on a system, the browser will usually pick one by default. However, if the web page uses CSS to apply styling to the page, the CSS code may specify one or more particular fonts to use for a given piece of content. If none of these are available on the system, most browsers will fall back to the default, however Internet Explorer will show square boxes instead.

Webfonts

Another way of getting a font onto the user’s system is to download it with the page, just like images are downloaded with the page. This is done using CSS code. The CSS code to do this has been defined for some years, but unfortunately most browsers implementation of this feature is still problematic.

Recently a number of major browsers have begun to support download of raw truetype or opentype fonts. Internet Explorer is not one of those. This involves simply loading the ordinary font onto a server and downloading to the browser when the page is displayed. Although the font may be cached as the user moves from page to page, there may still be some significant issues when dealing with complex scripts or Far Eastern languages (such as Chinese, Japanese and Korean) due to the size of the fonts used. The size of these fonts can often be counted in megabytes rather than kilobytes.

It is important to observe licencing restrictions when making fonts available for download in this way. The CSS mechanism doesn’t contain any restrictions related to font licences, but there are ways of preparing fonts for download that take into consideration some aspects of this issue – though not enough to provide a watertight restriction on font usage.

Microsoft makes available a program to create .eot fonts from ordinary true/opentype fonts. Eot font files can apply some usage restrictions and also subset the font to include only the characters used on the page. The subsetting feature is useful when only a small amount of text appears in a given font, but for a whole page in, say, devanagari script it is of little use – particularly if the user is to input text in forms. The biggest problem with .eot files, however, is that they are only supported by Internet Explorer, and there are no plans to support .eot format on other browsers.

The W3C is currently working on the WOFF format. Fonts converted to WOFF format can have some gentle protection with regard to use, and also apply significant compression to the font being downloaded. WOFF is currently only supported by Firefox, but all other major browsers are expected to provide support for the new format.

For this to work well, all browsers must support the same type of font download.

Beyond fonts

Complex scripts, such as those used for Indic and South East Asian languages, need to choose glyph shapes and positions and substitute ligatures, etc. according to the context in which characters are used. These adjustments can be acoomplished using the features of OpenType fonts. The browser must be able to implement those opentype features.

Often a font will also rely on operating system support for some subset of the complex script rendering. For example, a devanagari font may rely on the Windows uniscribe dll for things like positioning of left-appended vowel signs, rather than encoding that behaviour into the font itself. This reduces the size and complexity of the font, but exposes a problem when using that font on a variety of platforms. Unless the operating system can provide the same rendering support, the text will look only partially correct. Mobile devices must either provide something similar to uniscribe, or fonts used on the mobile device must include all needed rendering features.

Browsers that do font linking must also support the necessary opentype features and obtain functionality from the OS rendering support where needed.

If tools are developed to subset webfonts, the subsetting must not remove the rendering logic needed for correct display of the text.

Picture of the page in action.

>> Use it

In 1992 the Chinese government recognised the Fraser alphabet as the official script for the Lisu language and has encouraged its use since then. There are 630,000 Lisu people in China, mainly in the regions of Nujiang, Diqing, Lijiang, Dehong, Baoshan, Kunming and Chuxiong in the Yunnan Province. Another 350,000 Lisu live in Myanmar, Thailand and India. Other user communities are mostly Christians from the Dulong, the Nu and the Bai nationalities in China.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility.

Latest changes: This picker is new. The default view was modified from an original proposal by Benjamin Lee, and is likely to be more useful to people who are somewhat familiar with the alphabet and characters of Lisu. Characters are arranged to simplify entry, with consonants to the left, vowels to their right, and tone marks to their right.

There is also a keyboard view. Many of the positions of characters are based on keyboard layouts I have seen. Those keyboards, however, tended to use some ASCII characters for punctuation, when the Unicode Standard recommends other characters (in particular, MODIFIER LETTER LOW MACRON and MODIFIER LETTER APOSTROPHE) or omit some punctuation characters mentioned in the Unicode Standard. The current version of this keyboard, therefore adds some extra characters.

The layout is adequate, given that pickers assume availability of a QWERTY keyboard, however if a real standardised keyboard layout is to be made, it should involve some further changes. For example, people wanting to use syntax characters such as comma, period, semi-colon, single quote, etc, while writing the text in Lisu will need direct access to those characters. They are missing from this layout.

Picture of the page in action.

>> Use UniView lite

>> Use UniView

About the tool: Look up and see characters (using graphics or fonts) and property information, view whole character blocks or custom ranges, select characters to paste into your document, paste in and discover unknown characters, search for characters, do hex/dec/ncr conversions, highlight character types, etc. etc. Supports Unicode 5.2 and written with Web Standards to work on a variety of browsers. No need to install anything.

Latest changes: The major change in this update is the addition of an alternative UniView lite interface for the tool that makes it easier to use UniView in restricted screen sizes, such as on mobile devices. The lite interface offers a subset of the functionality provided in the full version, rearranges the user interface and sets up some different defaults (eg. list view is the default, rather than the matrix view). However, the underlying code is the same – only the initial markup and the CSS are different.

Another significant change is that when you click on a character in a list or matrix that character is either added to the text area or detailed information for that character is displayed, but not now both at the same time. You switch between the two possibilities by clicking on the icon. When the background is white (default) details are shown for the character. When the background is orange the character will be added to the text area (like a character map or picker).

Information from my character database is now shown by default when you are shown detailed information for a character. The switch to disable this has been moved to the Options panel.

Text highlighted in red in information from the character database contains examples. In case you don’t have a font for viewing such examples, or in case you just want to better understand the component characters, you can now click on these and the component characters will be listed in a new window (using the String Analyzer tool).

Access to Settings panel has been moved slightly downwards and renamed Options in the full version.

The default order for items in lists is now <character><codepoint><name>, rather than the previous <codepoint><character><name>. This can still be changed in the Options panel, or by setting query parameters.

I changed the Next and Previous functions in the character detail pane so that it moves one codepoint at a time through the Unicode encoding space. The controls are now buttons rather than images.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more useable than a regular character map utility.

Latest changes: This picker has been upgraded to use the version 10 look and feel, and incorporate new characters from Unicode version 5.2. Characters whose use is discouraged in Unicode have been moved to the advanced section – similar looking images in the main section put multiple characters into the output, as per NFC normalization.

>> Use it

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more useable than a regular character map utility.

Latest changes: Both pickers have been upgraded to use the version 10 look and feel.

The Arabic block picker now includes the latest characters added to the Arabic and Arabic Supplement blocks in Unicode 5.1. Characters are displayed using the shape view of version 10 pickers. This saves a lot of space on-screen.

The Ethiopic picker was also updated to include more recent characters from the Unicode Ethiopic block (added in version 4.1), and the layout was improved to make it easier to locate a character. It still covers only the basic Ethiopic block.

>> Use the Arabic Block picker

>> Use the Ethiopic picker

The new characters.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more useable than a regular character map utility

Latest changes: I recently added U+2C71 LATIN SMALL LETTER V WITH RIGHT HOOK (labiodental tap or flap) to the IPA picker. This was in the IPA chart for a long time, but was only added to Unicode in version 5.1.

Today I also added, at the request of Dan McCloy, four prosodic markers: prosodic phrase, prosodic word, syllable and mora (see the second line of the picture).

Regular users will also notice that I recently upgraded the picker chrome to version 10, too.

>> Use it


Picture of the page in action.

About the tool: Look up and see characters (using graphics or fonts) and property information, view whole character blocks or custom ranges, select characters to paste into your document, paste in and discover unknown characters, search for characters, do hex/dec/ncr conversions, highlight character types, etc. etc. Supports Unicode 5.2 and written with Web Standards to work on a variety of browsers. No need to install anything.

Latest changes: The major change in this update is the addition of a function, Show age, to show the version of Unicode where a character was added (after version 1.1). The same information is also listed in the details given for a character in the lower right panel.

The trigger for context-sensitive help was reduced to the first character of a command name, rather than the whole command name. This improves behaviour for commands under More actions by allowing you to click on the command name rather than just the icon alongside to activate the command.

Some ‘quick start’ instructions were also added to the initial display to orient people new to the tool, and this help text was updated in various areas.

The highlighting mechanism was changed. Rather than highlight characters using a coloured border (which is typically not very visible), highlighting now works by greying out characters that are not highlighted. This also makes it clearer when nothing is highlighted.

In the recent past, when you converted a matrix to a list in the lower left panel, greyed-out rows would be added for non-characters. These are no longer displayed. Consequently, the command to remove such rows from the list (previously under More actions) has been removed.

A lot of invisible work went into replacing style attributes in the code with class names. This produces better source code, but doesn’t affect the user experience.

>> Use it


Characters in the Unicode Bengali block.

If you’re interested, I just did a major overhaul of my script notes on Bengali in Unicode. There’s a new section about which characters to use when there are multiple options (eg. RRA vs. DDA+nukta), and the page provides information about more characters from the Bengali block in Unicode (including those used in Bengali’s amazingly complicated currency notation prior to 1957).

In addition, this has all been squeezed into the latest look and feel for script notes pages.

The new page is at a new location. There is a redirect on the old page.

Hope it’s useful.

>> Read it


Picture of the page in action.
 
Picture of the page in action.

About the tools: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility

Latest changes: The Urdu and Tamil pickers have been upgraded to version 10. This provides new views of the data, but also involved a thorough overhaul and redesign of the pickers. Transliteration functions have also been added for the Tamil picker.

In addition, the Urdu notes page was updated and a new Tamil notes page was created. Database entries were also updated or, in the case of Tamil, created to support the notes pages. These notes pages are the first to use a new look and feel, based on the analyse-string tool I produced earlier this year. This adds information about each character from the Unicode descriptions data to that from my own database.

Picture of the page in action.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more useable than a regular character map utility

Latest changes: Over the Christmas break I’ve applied version 10 upgrades to the following pickers: Bengali, Hebrew, Khmer, Lao, Malayalam, Myanmar, Thai and Tifinagh. In the case of Hebrew and Tifinagh, this came down to completely rewriting the pickers.

Key changes in version 10 include the following:

  • The visible layout of the shape view has been reduced in the vertical direction by showing a group of characters only when you mouse over the orange keys at the top. This makes it easier and faster to locate characters, and also improves use on screens with restricted space. The way similar characters in other groups is handled has been reinvented to fit the new approach better, and enable faster creation of pickers in the future.
  • The visible layout of the transcription view has been adapted in a similar way to the shape view.
  • The button to dump the phonetic buffer has been moved to just below the output area.
  • The Detail button is now called the Analyse button, and both this and the Codepoints commands now bring up the new String Analyser utility, which provides much better results than the old pages.
  • A keyboard view has been added to the Tifinagh picker. This new view may pop up in other pickers in the future.

There were a number of other changes to the code, and not least to the instructions for use on the main picker page and each set of notes below the pickers themselves.

>> Use it


Picture of the page in action.

About the tool: This tool shows you what characters are in a string of Unicode characters, and gives you informaiton about each one. Either type/paste the string into the box on the right of the page, or send it in the URL. It’s especially useful if you have no font for the text, or you are trying to unravel a sequence of characters in a complex script, but also allows you to just dig out information about one or more characters.

Here’s an example

By default you see a large graphic image of each character, the Unicode code point number and name, the Unicode script block in which it occurs, any annotations in the Unicode Standard, and any notes for that character in my character database (which I also updated today with information about Hebrew, Malayalam, Lisu and other scripts).

However, the result can be tailored in terms of the level of information and various aspects of the presentation. Simply click on the options to the right of the page, or (again) include the relevant info in the URI.

For example, you can remove any of these items of information individually (except the codepoint and name), or add a text version of the character. You can also choose a smaller graphic.

In addition, notes from my character database contain examples (coloured red). By clicking on these examples you can list the characters in the example text without leaving the page. The list of characters shows up in the right margin.

Oh, and you can click on links to see a character in UniView (to explore its Unicode properties) or to show the whole block in which the character lives.

You’ll shortly see my other applications such as pickers, UniView, etc, linking to this app.

Hope it’s useful.

>> Use it


Picture of the page in action.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more useable than a regular character map utility

Latest changes: This is the first version 9 picker. Changes introduced in version 9 include moving the buttons that allow you to display different views to just below the page title. Also, in version 8 pickers, there was an icon in the phonic view that allowed you to dump to the output the phonetic transcription that builds up while selecting characters. This has been replaced with a button just below the output field. There were a number of other superficial changes.

A significant addition to the Malayalam picker is the ability to convert Malayalam text into a Latin transliteration, based on ISO 15919. There was already a way to convert Latin transliterations to Malayalam script.

This version also continues to allow you to type in chillu characters as either single characters as included in Unicode v5.1, or as a sequence of consonant+virama+zwj. Additions to the Malayalam repertoire added in v5.2 have not yet been added to the picker.

>> Use it


I just received email from Derek Reid in the XMetal team at JustSystems to say that they have significantly improved the way the XMetal XML editor uses xml:lang attributes in the source code in conjunction with its spell-checker.

Basically, XMetal will switch spell-checking dictionaries based on the xml:lang settings in the markup. It also supports xml:lang=”" and xml:lang=”zxx” for places you don’t want to spell-check. It even does this when using interactive red squiggles to highlight potential misspellings.

I wrote a blog post about this in 2007, when the capability was only partially developed. Derek says:

I read this post when you first wrote it, and after getting feedback from a large number of our clients I was finally able to convince our development and management teams to properly support language auto-switching for spell checking in conjunction with xml:lang attribute values in our product.

We made a big effort to deal with these limitations during the past year and our XMetaL Author Enterprise 6.0 release addresses most or all of them.

If you are interested, I have posted instructions on how to configure XMetaL Author Enterprise 6.0 to properly support this feature:
http://forums.xmetal.com/index.php/topic,539.msg1701

I haven’t had a chance to try it out yet, but it sounds exciting. Now how about DreamWeaver…

« Previous PageNext Page »