Examples of case conversion.

These are notes culled from various places. There may well be some copy-pasting involved, but I did it long enough ago that I no longer remember all the sources. These are notes, not an article.

Case conversions are not always possible in Unicode by applying an offset to a codepoint, although this can work for the ASCII range by adding 32, or by adding 1 for many other characters in the Latin extensions. There are many cases where the corresponding cased character is in another block, or in an irregularly offset location.
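To make the offset point concrete, here is a small Python sketch. The helper name `case_offset` is my own, for illustration only; it simply measures the distance between a lowercase letter and whatever Python's built-in `upper()` returns for it.

```python
def case_offset(lower: str) -> int:
    """Distance in code points between a lowercase letter and its uppercase form."""
    return ord(lower) - ord(lower.upper())


print(case_offset("a"))       # 32: ASCII, U+0061 vs U+0041
print(case_offset("\u0101"))  # 1: ā (U+0101) vs Ā (U+0100), Latin Extended-A
print(case_offset("\u00ff"))  # -121: ÿ (U+00FF) vs Ÿ (U+0178), a different block
```

The last case shows why a fixed offset cannot be relied on: the uppercase form lives in another block, at an irregular distance.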

In addition, there are linguistic issues that mean that simple mappings of one character to another are not sufficient for case conversion.

In German, the uppercase of ß is SS. German and Greek cannot, however, be easily transformed from upper to lower case: German because SS could be converted either to ß or ss, depending on the word; Greek because all tonos marks are omitted in upper case, eg. does ΑΘΗΝΑ convert to Αθηνά (the goddess) or Αθήνα (capital of Greece)? German may also uppercase ß to ẞ sometimes for things like signboards.
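The German round-trip problem can be seen with Python's built-in string methods, which apply the generic Unicode mappings without any language tailoring. This is only an illustration of the default behaviour, not a recommendation for handling German text.

```python
word = "Stra\u00dfe"            # Straße

upper = word.upper()            # ß expands to SS
round_trip = upper.lower()      # SS can only become ss; the ß is lost

print(upper)                    # STRASSE
print(round_trip)               # strasse
print(round_trip == word.lower())  # False: 'strasse' vs 'straße'

# The capital ẞ (U+1E9E) does lowercase to ß, but ß never uppercases to it
# under the generic mapping.
print("\u1e9e".lower() == "\u00df")  # True
```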

Also Greek converts uppercase sigma to either a final or non-final form, depending on the position in a word, eg. ΟΔΥΣΣΕΥΣ becomes οδυσσευς. This contextual difference is easy to manage, however, compared to the lexical issues in the previous paragraph.
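As an illustration of how manageable the sigma rule is, Python's `str.lower()` already applies the contextual final-sigma rule. The snippet below builds the word from escapes so that the code points are unambiguous.

```python
# ΟΔΥΣΣΕΥΣ, spelled out with escapes to avoid Latin look-alikes
odysseus_upper = "\u039f\u0394\u03a5\u03a3\u03a3\u0395\u03a5\u03a3"

lowered = odysseus_upper.lower()

print(lowered)                   # οδυσσευς
print(lowered[3] == "\u03c3")    # True: medial sigma σ
print(lowered[-1] == "\u03c2")   # True: final sigma ς
```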

In Serbo-Croatian there is an important distinction between uppercase and titlecase. The single letter dž converts to DŽ when the whole word is uppercased, but Dž when titlecased. Both of these forms revert to dž in lowercase, so there is no ambiguity here.
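The three-way distinction is visible in Python when the single-character digraph is used. The example word is just for illustration; the interesting part is the first character.

```python
DZ_LOWER = "\u01c6"  # ǆ  LATIN SMALL LETTER DZ WITH CARON
DZ_TITLE = "\u01c5"  # ǅ  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
DZ_UPPER = "\u01c4"  # Ǆ  LATIN CAPITAL LETTER DZ WITH CARON

word = DZ_LOWER + "ep"

print(word.upper() == DZ_UPPER + "EP")  # True: whole word uppercased
print(word.title() == DZ_TITLE + "ep")  # True: titlecase form differs from uppercase

# Both cased forms map back to the same lowercase letter.
print(DZ_UPPER.lower() == DZ_LOWER and DZ_TITLE.lower() == DZ_LOWER)  # True
```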

In Dutch, the titlecase of ijsvogel is IJsvogel, ie. the first two letters commonly have to be titlecased. There is a single character IJ (U+0132 LATIN CAPITAL LIGATURE IJ) in Unicode that will behave as expected, but this single character is very often not available on a keyboard, and so the word is commonly written with the two letters I+J.
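A rough Python sketch of the Dutch case follows. The function `dutch_titlecase` is a hypothetical helper of my own, and it only handles a word-initial i+j; it is not a complete treatment of Dutch.

```python
def dutch_titlecase(word: str) -> str:
    """Titlecase one word, treating a leading 'ij' as a unit (sketch only)."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word[:1].upper() + word[1:].lower()


print("ijsvogel".title())           # Ijsvogel  (generic rule, wrong for Dutch)
print(dutch_titlecase("ijsvogel"))  # IJsvogel

# With the single ligature character the generic rule already works.
print("\u0133svogel".title() == "\u0132svogel")  # True
```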

In Greek, tonos diacritics are dropped during uppercasing, but not dialytika. Greek diphthongs with tonos over the first vowel are converted during uppercasing to no tonos but a dialytika over the second vowel in the diphthong, eg. Νεράιδα becomes ΝΕΡΑΪΔΑ. A letter with both tonos and dialytika above drops the tonos but keeps the dialytika, eg. ευφυΐα becomes ΕΥΦΥΪΑ. Also, contrary to the initial rule mentioned here, Greek does not drop the tonos on the disjunctive eta (usually meaning ‘or’), eg. ήσουν ή εγώ ή εσύ becomes ΗΣΟΥΝ Ή ΕΓΩ Ή ΕΣΥ (note that the initial eta is not disjunctive, and so does drop the tonos). This is to maintain the distinction between ‘either/or’ ή from the η feminine form of the article, in the nominative case, singular number.
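Here is a partial Python sketch of the tonos rule. The name `greek_upper_sketch` is my own, and the approach (decompose, drop U+0301, uppercase, recompose) is an assumption about one simple way to do it. It keeps the dialytika, but it does not add a dialytika for the diphthong case, does not preserve the tonos on the disjunctive eta, and would also strip acute accents from non-Greek text.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # how the tonos appears after canonical decomposition


def greek_upper_sketch(text: str) -> str:
    """Uppercase Greek text, dropping tonos but keeping dialytika (sketch only)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_tonos = "".join(ch for ch in decomposed if ch != COMBINING_ACUTE)
    return unicodedata.normalize("NFC", without_tonos.upper())


athens = "\u0391\u03b8\u03ae\u03bd\u03b1"           # Αθήνα
genius = "\u03b5\u03c5\u03c6\u03c5\u0390\u03b1"     # ευφυΐα

print(athens.upper())               # ΑΘΉΝΑ: the generic mapping keeps the tonos
print(greek_upper_sketch(athens))   # ΑΘΗΝΑ
print(greek_upper_sketch(genius))   # ΕΥΦΥΪΑ: tonos dropped, dialytika kept
```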

Greek titlecased vowels, ie. vowels at the start of a word that are uppercased, retain their tonos accent, eg. Όμηρος.

Turkish, Azeri, Tatar and Bashkir pair dotted and undotted i’s, which requires language-specific handling for case conversion. For example, the name of the Turkish city “Diyarbakır” contains both the dotted and dotless letters i. When rendered into upper case, this word appears like this: DİYARBAKIR.
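Python's string methods are not locale-aware, so they make a handy demonstration of what goes wrong without tailoring. The helpers `turkish_upper` and `turkish_lower` below are hypothetical names for a minimal sketch that swaps the four i letters before applying the generic mapping.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_upper(text: str) -> str:
    """Uppercase with the Turkish i/İ and ı/I pairing (sketch only)."""
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish İ/i and I/ı pairing (sketch only)."""
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()


city = "Diyarbak\u0131r"

print(city.upper())                        # DIYARBAKIR: both i's collapse to I
print(turkish_upper(city))                 # DİYARBAKIR
print(turkish_lower(turkish_upper(city)))  # diyarbakır
```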

Lithuanian also has language-specific rules that retain the dot over i when combined with accents, eg. i̇̀ i̇́ i̇̃, whereas the capital I has no dot.

Sometimes European French omits accents from uppercase letters, whereas French Canadian typically does not. However, this is more of a stylistic than a linguistic rule. Sometimes French people uppercase œ to OE, but this is mostly due to issues with lack of keyboard support, it seems (as is the issue with French accents).

Capitalisation may ignore leading symbols and punctuation for a word, and titlecase the first casing letter. This applies not only to non-letters. A letter such as the (non-casing version of the) glottal stop, ʔ, may be ignored at the start of a word, and the following letter titlecased, in IPA or Americanist phonetic transcriptions. (Note that, to avoid confusion, there are separate case paired characters available for use in orthographies such as Chipewyan, Dogrib and Slavey. These are Ɂ and ɂ.)
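One way to sketch "titlecase the first casing letter" in Python is to skip anything whose lowercase and uppercase forms are identical. The function name is hypothetical and the test for "casing" is a simplification of the Unicode Cased property.

```python
def titlecase_first_cased(word: str) -> str:
    """Titlecase the first character that has distinct case forms (sketch only)."""
    for i, ch in enumerate(word):
        if ch.lower() != ch.upper():
            return word[:i] + ch.title() + word[i + 1:]
    return word  # nothing to titlecase


print(titlecase_first_cased("\u201chello"))  # “Hello: leading quote is skipped
print(titlecase_first_cased("\u0294anat"))   # ʔAnat: caseless glottal stop is skipped
```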

Another issue for titlecasing is that not all words in a sequence are necessarily titlecased. German uses capital letters to start noun words, but not verbs or adjectives. French and Italian may expect to titlecase the ‘A’ in “L’Action”, since that is the start of a word. In English, it is common not to titlecase words like ‘for’, ‘of’, ‘the’ and so forth in titles.
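For the English convention, a minimal Python sketch might look like this. The `SMALL_WORDS` set is an assumption of mine, far from complete, and real style guides differ on what belongs in it.

```python
# Hypothetical, incomplete list of words that are usually left in lowercase.
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def english_title(text: str) -> str:
    """Titlecase a title, leaving small words lowercase except at the start."""
    result = []
    for i, word in enumerate(text.lower().split()):
        if i > 0 and word in SMALL_WORDS:
            result.append(word)
        else:
            result.append(word[:1].upper() + word[1:])
    return " ".join(result)


print(english_title("the lord of the rings"))  # The Lord of the Rings
```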

Unicode provides only algorithms for generic case conversion and case folding. CLDR provides some more detail, though it is hard to programmatically achieve all the requirements for case conversion.

Case folding is a way of converting to a standard sequence of (lowercase) characters that can be used for comparisons of strings. (Note that this sequence may not represent normal lowercase text: for example, both the uppercase Greek sigma and lowercase final sigma are converted to a normal sigma, and the German ß is converted to ‘ss’.) There are also different flavours of case folding available: common, full, and simple.
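Python exposes full case folding as `str.casefold()`, which makes the difference from ordinary lowercasing easy to see. The helper `caseless_equal` is just an illustrative name, and a more careful comparison would normalize the strings as well.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding (no normalization applied)."""
    return a.casefold() == b.casefold()


print("Stra\u00dfe".lower())      # straße: lowercasing keeps ß
print("Stra\u00dfe".casefold())   # strasse: folding expands ß to ss

print(caseless_equal("STRASSE", "Stra\u00dfe"))  # True
print(caseless_equal("\u03a3", "\u03c2"))        # True: Σ and ς both fold to σ
```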

An updated version of the Unicode Character Converter web app is now available. This app allows you to convert characters between various different formats and notations.

Significant changes include the following:

  • It’s now possible to generate EcmaScript6 style escapes for supplementary characters in the JavaScript output field, eg. \u{10398} rather than \uD800\uDF98.
  • In many cases, clicking on a checkbox option now applies the change straight away if there is content in the associated output field. (There are 4 output fields where this doesn’t happen because we aren’t dealing with escapes and there are problems with spaces and delimiters.)
  • By default, the JavaScript output no longer escapes the ASCII characters that can be represented by \n, \r, \t, \' and \". A new checkbox is provided to force those transformations if needed. This should make the JS transform much more useful for general conversions.
  • The code to transform to HTML/XML can now replace RLI, LRI, FSI and PDI if the Convert bidi controls to HTML markup option is set.
  • The code to transform to HTML/XML can convert many more invisible or ambiguous characters to escapes if the Escape invisible characters option is set.
  • UTF-16 code units are all at least 4 hex digits long.
  • Fixed a bug related to U+00A0 when converting to HTML/XML.
  • The order of the output fields was changed, and various small improvements were made to the user interface.
  • Revamped and updated the notes.
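The relationship between the two JavaScript escape styles mentioned in the first item can be sketched in Python. This is not the converter's own code; `js_escape` is a hypothetical function that shows the standard surrogate-pair arithmetic.

```python
def js_escape(ch: str, es6: bool = False) -> str:
    """Return a JavaScript escape for one character (sketch only)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return f"\\u{cp:04X}"
    if es6:
        return f"\\u{{{cp:X}}}"
    # Supplementary character: split into a UTF-16 surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return f"\\u{high:04X}\\u{low:04X}"


ch = chr(0x10398)
print(js_escape(ch))            # \uD800\uDF98
print(js_escape(ch, es6=True))  # \u{10398}
```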

Many thanks to the people who wrote in with suggestions.

UniView now supports Unicode version 9, which is being released today, including all changes made during the beta period. (As before, images are not available for the Tangut additions, but the character information is available.)

This version of UniView also introduces a new filter feature. Below each block or range of characters is a set of links that allows you to quickly highlight characters with the property letter, mark, number, punctuation, or symbol. For more fine-grained property distinctions, see the Filter panel.

In addition, for some blocks there are other links available that reflect tags assigned to characters. This tagging is far from exhaustive! For instance, clicking on sanskrit will not show all characters used in Sanskrit.

The tags are just intended to be an aid to help you find certain characters quickly by exposing words that appear in the character descriptions or block subsection titles. For example, if you want to find the Bengali currency symbol while viewing the Bengali block, click on currency and all other characters but those related to currency will be dimmed.

(Since the highlight function is used for this, don’t forget that, if you happen to highlight a useful subset of characters and want to work with just those, you can use the Make list from highlights command, or click on the upwards pointing arrow icon below the text area to move those characters into the text area.)

UniView now supports the characters introduced for the beta version of Unicode 9. Any changes made during the beta period will be added when Unicode 9 is officially released. (Images are not available for the Tangut additions, but the character information is available.)

It also brings in notes for individual characters where those notes exist, if Show notes is selected. These notes are not authoritative, but are provided in case they prove useful.

A new icon was added below the text area to add commas between each character in the text area.

Links to the help page that used to appear on mousing over a control have been removed. Instead there is a noticeable, blue link to the help page, and the help page has been reorganised and uses image maps so that it is easier to find information. The reorganisation puts more emphasis on learning by exploration, rather than learning by reading.

Various tweaks were made to the user interface.

I just received a query from someone who wanted to know how to figure out what characters are in and what characters are not in a particular legacy character encoding. So rather than just send the information to her I thought I’d write it as a blog post so that others can get the same information. I’m going to write this quickly, so let me know if there are parts that are hard to follow, or that you consider incorrect, and I’ll fix it.

A few preliminary notes to set us up: When I refer to ‘legacy encodings’, I mean any character encoding that isn’t UTF-8. Though, actually, I will only consider those that are specified in the Encoding spec, and I will use the data provided by that spec to determine what characters each encoding contains (since that’s what it aims to do for Web-based content). You may come across other implementations of a given character encoding, with different characters in it, but bear in mind that those are unlikely to work on the Web.

Also, the tools I will use refer to a given character encoding using the preferred name. You can use the table in the Encoding spec to map alternative names to the preferred name I use.

What characters are in encoding X?

Let’s suppose you want to know what characters are in the character encoding you know as cseucpkdfmtjapanese. A quick check in the Encoding spec shows that the preferred name for this encoding is euc-jp.

Go to http://r12a.github.io/apps/encodings/ and look for the selection control near the bottom of the page labelled show all the characters in this encoding.

Select euc-jp. It opens a new window that shows you all the characters.

picture of the result

This is impressive, but so large a list that it’s not as useful as it could be.

So highlight and copy all the characters in the text area and go to https://r12a.github.io/apps/listcharacters/.

Paste the characters into the big empty box, and hit the button Analyse characters above.

This will now list for you those same characters, but organised by Unicode block. At the bottom of the page it gives a total character count, and adds up the number of Unicode blocks involved.

picture of the result

What characters are not in encoding X?

If instead you actually want to know what characters are not in the encoding for a given Unicode block you can follow these steps.

Go to UniView (http://r12a.github.io/uniview/) and select the block you are interested in where it says Show block, or alternatively type the range into the control labelled Show range (eg. 0370:03FF).

Let’s imagine you are interested in Greek characters and you have therefore selected the Greek and Coptic block (or typed 0370:03FF in the Show range control).

In the edit buffer area (top right) you’ll see a small icon with an arrow pointing upwards. Click on this to bring all the characters in the block into the edit buffer area. Then hit the icon just to its left to highlight all the characters and then copy them to the clipboard.

picture of the result

Next open http://r12a.github.io/apps/encodings/ and paste the characters into the input area labelled with Unicode characters to encode, and hit the Convert button.

picture of the result

The Encoding converter app will list all the characters in a number of encodings. If the character is part of the encoding, it will be represented as two-digit hex codes. If not, and this is what you’re looking for, it will be represented as decimal HTML escapes (eg. &#880;). This way you can get the decimal code point values for all the characters not in the encoding. (If all the characters exist in the encoding, the block will turn green.)
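If you prefer a script to the apps, the same question can be approximated in Python. Bear in mind that this sketch uses Python's own codec tables (here `iso8859_7`), which are an assumption on my part and may differ in places from the indexes in the Encoding spec. The function names are hypothetical.

```python
import unicodedata


def not_in_encoding(start: int, end: int, encoding: str) -> list[str]:
    """List assigned characters in a code point range that the codec cannot encode."""
    missing = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        if unicodedata.name(ch, None) is None:
            continue  # skip unassigned code points
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


def with_decimal_escapes(text: str, encoding: str) -> bytes:
    """Encode text, replacing unencodable characters with decimal HTML escapes."""
    return text.encode(encoding, errors="xmlcharrefreplace")


missing = not_in_encoding(0x0370, 0x03FF, "iso8859_7")
print("\u0370" in missing)   # True: U+0370 is not in the codec
print("\u03b1" in missing)   # False: alpha is

print(with_decimal_escapes("\u0370\u03b1", "iso8859_7"))  # b'&#880;\xe1'
```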

(If you want to see the list of characters, copy the results for the encoding you are interested in, go back to UniView and paste the characters into the input field labelled Find. Then click on Dec. Ignore all ASCII characters in the list that is produced.)

Note, by the way, that you can tailor the encodings that are shown by the Encoding converter by clicking on change encodings shown and then selecting the encodings you are interested in. There are 36 to choose from.

This app allows you to see how Unicode characters are represented as bytes in various legacy encodings, and vice versa. You can customise the encodings you want to experiment with by clicking on change encodings shown. The default selection excludes most of the single-byte encodings.

The app provides a way of detecting the likely encoding of a sequence of bytes if you have no context, and also allows you to see which encodings support specific characters. The list of encodings is limited to those described for use on the Web by the Encoding specification.

The algorithms used are based on those described in the Encoding specification, and thus describe the behaviour you can expect from web browsers. The transforms may not be the same as for other conversion tools. (In some cases the browsers may also produce a different result than shown here, while the implementation of the spec proceeds. See the tests.)

Encoding algorithms convert Unicode characters to sequences of double-digit hex numbers that represent the bytes found in the target character encoding. A character that cannot be handled by an encoder will be represented as a decimal HTML character escape.

Decoding algorithms take the byte codes just mentioned and convert them to Unicode characters. The algorithm returns replacement characters where it is unable to map a given byte to the encoding.

For the decoder input you can provide a string of hex numbers separated by space or by percent signs.
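The input and output formats described above can be imitated in a few lines of Python. This is a sketch using Python's codecs rather than the Encoding spec algorithms, so results may differ from the app in edge cases; the function names are my own.

```python
def to_hex_bytes(text: str, encoding: str) -> str:
    """Encode text and show each byte as a two-digit hex number."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def from_hex_bytes(hex_string: str, encoding: str) -> str:
    """Decode hex bytes separated by spaces or percent signs.

    Bytes that cannot be mapped become U+FFFD REPLACEMENT CHARACTER.
    """
    cleaned = hex_string.replace("%", " ")
    return bytes.fromhex(cleaned).decode(encoding, errors="replace")


print(to_hex_bytes("\u3042", "shift_jis"))             # 82 A0
print(from_hex_bytes("%82%A0", "shift_jis"))           # あ
print(from_hex_bytes("FF", "utf-8") == "\ufffd")       # True
```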

Green backgrounds appear behind sequences where all characters or bytes were successfully mapped to a character in the given encoding. Beware, however, that the character mapped to may not be the one you expect – especially in the single byte encodings.

To identify characters and look up information about them you will find UniView extremely useful. You can paste Unicode characters into the UniView Edit Buffer and click on the down-arrow icon below to find out what they are. (Click on the name that appears for more detailed information.) It is particularly useful for identifying escaped characters. Copy the escape(s) to the Find input area on UniView and click on Dec just below.

Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between Latin and Bengali. You can use these buttons to transcribe to and from Latin transcriptions using ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

I’m struggling to show combining characters on a page in a consistent way across browsers.

For example, while laying out my pickers, I want users to be able to click on a representation of a character to add it to the output field. In the past I resorted to pictures of the characters, but now that webfonts are available, I want to replace those with font glyphs. (That makes for much smaller and more flexible pages.)

Take the Bengali picker that I’m currently working on. I’d like to end up with something like this:

comchacon0

I put a no-break space before each combining character, to give it some width, and because that’s what the Unicode Standard recommends (p60, Exhibiting Nonspacing Marks in Isolation). The result is close to what I was looking for in Chrome and Safari except that you can see a gap for the nbsp to the left.

comchacon1

But in IE and Firefox I get this:

comchacon2

This is especially problematic since it messes up the overall layout, but in some cases it also causes text to overlap.

I tried using a dotted circle Unicode character, instead of the no-break space. On Firefox this looked ok, but on Chrome it resulted in two dotted circles per combining character.
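For reference, the two isolation strategies tried so far (no-break space and dotted circle) amount to something like the following Python sketch when generating the cell contents. `isolate_mark` is a hypothetical helper, and using the general category to detect combining characters is my assumption. It says nothing about how a given browser or font will render the result, which is the actual problem here.

```python
import unicodedata

NBSP = "\u00a0"           # NO-BREAK SPACE
DOTTED_CIRCLE = "\u25cc"  # DOTTED CIRCLE


def isolate_mark(ch: str, base: str = NBSP) -> str:
    """Prefix a combining character with a base so it can be shown in isolation."""
    if unicodedata.category(ch).startswith("M"):
        return base + ch
    return ch


vowel_sign_e = "\u09c7"   # BENGALI VOWEL SIGN E
letter_ka = "\u0995"      # BENGALI LETTER KA

print(isolate_mark(vowel_sign_e) == NBSP + vowel_sign_e)                            # True
print(isolate_mark(vowel_sign_e, DOTTED_CIRCLE) == DOTTED_CIRCLE + vowel_sign_e)    # True
print(isolate_mark(letter_ka) == letter_ka)                                         # True
```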

I considered using a consonant as the base character. It would work ok, but it would possibly widen the overall space needed (not ideal) and would make it harder to spot a combining character by shape. I tried putting a span around the base character to grey it out, but the various browsers reacted differently to the span. Vowel signs that appear on both sides of the base character no longer worked – the vowel sign appeared after. In other cases, the grey of the base character was inherited by the whole grapheme, regardless of the fact that the combining character was outside the span. (Here are some examples ে and ো.)

In the end, I settled for no preceding base character at all. The combining character was the first thing in the table cell or span that surrounded it. This gave the desired result for the font I had been using, albeit that I needed to tweak the occasional character with padding to move it slightly to the right.

On the other hand, this was not to be a complete solution either. Whereas most of the fonts I planned to use produce the dotted circle in these conditions, one of my favourites (SolaimanLipi) doesn’t produce it. This leads to significant problems, since many combining characters appear far to the left, and in some cases it is not possible to click on them, in others you have to locate a blank space somewhere to the right and click on that. Not at all satisfactory.

comchacon3

I couldn’t find a better way to solve the problem, however, and since there were several Bengali fonts to choose from that did produce dotted circles, I settled for that as the best of a bad lot.

However, then I turned my attention to other pickers and tried the same solution. I found that only one of the many Thai fonts I tried for the Thai picker produced the dotted circles. So the approach here would have to be different. For Khmer, the main Windows font (Daunpenh) produced dotted circles only for some of the combining characters in Internet Explorer. And on Chrome, a sequence of two combining characters, one after the other, produced two dotted circles…

I suspect that I’ll need to choose an approach for each picker based on what fonts are available, and perhaps provide an option to insert or remove base characters before combining characters when someone wants to use a different font.

It would be nice to standardise behaviour here, and to do so in a way that involves the no-break space, as described in the Unicode Standard, or some other base character such as – why not? – the dotted circle itself. I assume that the fix for this would have to be handled by the browser, since there are already many font cats out of the bag.

Does anyone have an alternate solution? I thought I heard someone at the last Unicode conference mention some way of controlling the behaviour of dotted circles via some script or font setting…?

Update: See Marc Durdin’s blog for more on this topic, and his experiences while trying to design on-screen keyboards for Lao and other scripts.

I have uploaded a new version of the Khmer character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible. If you prefer, you can still access the previous version.

Other than a small rearrangement of the default selection table to accommodate fonts rather than images, and the significant standard features that version 16 brings, there are no additional changes in this version.

For more information about the picker, see the notes at the bottom of the picker page.

I have updated the Devanagari picker, the Gurmukhi picker and the Uighur picker to version 16.

You may have spotted a previous, unannounced, version of the Devanagari and Uighur pickers on the site, but essentially these versions should be treated as new. The Gurmukhi picker has been updated from a very old version.

In addition to the standard features that version 16 of the character pickers brings, things to note include the addition of hints for all pickers, and automated transcription from Devanagari to ISO 15919, and vice versa for the Devanagari picker.

For more information about the pickers, see the notes at the bottom of the relevant picker page.

A couple of posts ago I mentioned that I had updated the Thai picker to version 16. I have now updated a few more. For ease of reference, I will list here the main changes between version 16 pickers and previous versions back to version 12.

  • Fonts rather than graphics. The main selection table in version 12 used images to represent characters. These have now gone, in favour of fonts. Most pickers include a web font download to ensure that you will see the characters. This reduces the size and download time significantly when you open a picker. Other source code changes have reduced the size of the files even further, so that the main file is typically only a small fraction of the size it was in version 14.

    It is also now possible, in version 16, to change the font of the main selection table and the font size.

  • UI. The whole look and feel of the user interface has changed from version 14 onwards, and includes useful links and explanations off the top of the normal work space.

    In particular, the vertical menu, introduced in version 14, has been adjusted so that input features can be turned on and off independently, and new panels appear alongside the others, rather than toggling the view from one mode to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.

  • Transcription panels. Some pickers had one or more transcription views in versions below 16. These enable you to construct some non-Latin text when working from a Latin transcription. In version 16 these alternate views are converted to panels that can be displayed at the same time as other information. They can be shown or hidden from the vertical menu. When there is ambiguity as to which characters to use, a pop up displays alternatives. Click on one to insert it into the output. There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. This panel is now hidden by default, but can be easily shown from the vertical menu.

  • Automated transcription. Version 16 pickers carry forward, and in some cases add, automated transcription converters. In some cases these are intended to generate only an approximation to the needed transcription, in order to speed up the transcription process. In other cases, they are complete. (See the notes for the picker to tell which is which.) Where there is ambiguity about how to transcribe a sequence of characters, the interface offers you a choice from alternatives. Just click on the character you want and it will replace all the options proposed. In some cases, particularly South-East Asian scripts, the text you want to transcribe has to be split into syllables first, using spaces and/or hyphens. Where this is necessary, a condense button is provided, to quickly strip out the separators after the transcription is done.

  • Layout. The default layout of the main selection table has usually been improved, to make it easier to locate characters. Rarely used, deprecated, etc, characters appear below the main table, rather than to the right.

  • Hints. Very early versions of the pickers used to automatically highlight similar and easily confusable characters when you hovered over a character in the main selection table. This feature is being reintroduced as standard for version 16 pickers. It can be turned on or off from the vertical menu. This is very helpful for people who don’t know the script well.

  • Shape-based selection. In previous versions the shape-based view replaced the default view. In version 16 the shape selectors appear below the main selection table and highlight the characters in that table. This arrangement has several advantages.

  • Applying actions to ranges of text. When clicking on the Codepoints and Escapes buttons, it is possible to apply the action to a highlighted range of characters, rather than all the characters in the output area. It is also possible to transcribe only highlighted text, when using one of the automated transcription features.

  • Phoneme bank. When composing text from a Latin transcription in previous versions you had to make choices about phonetics. Those choices were stored on the UI to speed up generation of phonetic transcriptions in addition to the native text, but this feature somewhat complicated the development and use of the transcription feature. It has been dropped in version 16. Hopefully, the transcription panels and automated transcription features will be useful enough in future.

  • Font grid. The font grid view was removed in version 16. It is of little value when the characters are already displayed using fonts.

I have uploaded another new version of the Thai character picker.

Sorry this follows so quickly on the heels of version 15, but as soon as I uploaded v15 several ideas on how to improve it popped into my head. This is the result. I will hopefully bring all the pickers, one by one, up to the new version 16 format. If you prefer, you can still access version 12.

The main changes include:

  • UI. Adjustment of the vertical menu, so that input features can be turned on and off independently, and new panels appear with the others, rather than toggling from one to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.
  • Transcription panels. Panels have been added to enable you to construct some Thai text when working from a Latin transcription. This brings the transcription inputs of version 12 into version 16, but in a more compact and simpler way, and in a way that gives you continued access to the standard table for special characters.

    There are currently options to transcribe from ISO 11940-2 (although there are some gaps in that), or from the transcription used by Benjawan Poomsan Becker in her book, Thai for Beginners. These are both transcriptions based on phonetic renderings of the Thai, so there is often ambiguity about how to transcribe a particular Latin letter into Thai. When such an ambiguity occurs, the interface offers you a choice via a small pop-up. Just click on the character you want and it will be inserted into the main output area.

    The transcription panels are useful because you can add a whole vowel at a time, rather than picking the individual vowel signs that compose it. An issue arises, however, when the vowel signs that make up a given vowel contain one that appears to the left of the syllable initial consonant(s). This is easily solved by highlighting the syllable in question and clicking on the reorder button. The vowel sign in question will then appear as the first item in the highlighted text.

    There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. (This was available in v15 too, but has been made into a panel like the others, which can be hidden when not needed.)

  • Tones for automatic IPA transcriptions. The automatic transcription to IPA now adds tone marks. These are usually correct, but, as with other aspects of the transcription, it doesn’t take into account the odd idiosyncrasy in Thai spelling, so you should always check that the output is correct. (Note that there is still an issue for some of the ambiguous transcription cases, mostly involving RA.)
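The reorder step described for the transcription panels above can be sketched roughly as follows. This is my guess at the logic, not the picker's actual code: it assumes the vowel signs that sit to the left of the initial consonant are the five Thai characters U+0E40 to U+0E44, and it moves the first one it finds to the front of the highlighted text.

```python
# Thai vowel signs that are written before the syllable-initial consonant:
# SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI.
PREPOSED_VOWELS = "\u0e40\u0e41\u0e42\u0e43\u0e44"


def reorder_syllable(syllable: str) -> str:
    """Move the first preposed vowel sign to the start of the syllable (sketch only)."""
    for i, ch in enumerate(syllable):
        if ch in PREPOSED_VOWELS:
            return ch + syllable[:i] + syllable[i + 1:]
    return syllable


typed = "\u0e01\u0e40"            # KO KAI followed by SARA E, in typed order
print(reorder_syllable(typed) == "\u0e40\u0e01")   # True: SARA E now comes first
```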

For more information about the picker, see the notes at the bottom of the picker page.

I have uploaded a new version of the Thai character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible, and dispenses with the transcription view. If you prefer, you can still access the previous version.

Other changes include:

  • Significant rearrangement of the default selection table. The new arrangement makes it easy to choose the right characters if you have a Latin transcription to hand, which allows the removal of the previous transcription view, at the same time as speeding up that type of picking.
  • Addition of Latin prompts to help locate letters (standard with v15).
  • Automatic transcription from Thai into ISO 11940-1, ISO 11940-2 and IPA. Note that for the last two there are some corner cases where the results are not quite correct, due to the ambiguity of the script, and note also that you need to show syllable boundaries with spaces before transcribing. (There’s a way to remove those spaces quickly afterwards.) See below for more information.
  • Hints! When this is switched on and you mouse over a character, similar characters, or characters incorporating the shape you moused over, are highlighted. This is particularly useful for people who don’t know the script well and may miss small differences, but it is also sometimes useful for finding a character if you first see something similar.
  • It also comes with the new v15 features that are standard, such as shape-based picking without losing context, range-selectable codepoint information, a rehabilitated escapes button, the ability to change the font of the table and the line-height of the output, and the ability to turn off autofocus on mobile devices to stop the keyboard jumping up all the time, etc.

More about the transcriptions: There are three buttons that allow you to convert from Thai text to Latin transcriptions. If you highlight part of the text, only that part will be transcribed.

The toISO-1 button produces an ISO 11940-1 transliteration that latinises the Thai characters without changing their order. The result doesn’t normally tell you how to pronounce the Thai text, but it can be converted back to Thai, as each Thai character is represented by a unique sequence in Latin. This transcription should produce fully conformant output. There is no need to identify syllable boundaries first.

The toISO-2 and toIPA buttons produce output that is intended to approximately reflect actual pronunciation. The conversion works fine most of the time, but there are occasional ambiguities and idiosyncrasies in Thai which will cause the converter to render certain less common syllables incorrectly. It also doesn’t automatically add accent marks to the phonetic version (though that may be added later). So the output of these buttons should be treated as something that gets you 90% of the way. NOTE: Before using these two buttons you need to add spaces or hyphens between the syllables of the Thai text. Syllable boundaries are important for correct interpretation of the text, and they are not detected automatically.

The condense button removes the spaces from the highlighted range (or the whole output area, if nothing is highlighted).
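
The behaviour of the condense button can be sketched in a few lines of Python. This is only an illustration of the idea, not the picker's code; the function name `condense` and the `start`/`end` selection offsets are assumptions made for the example.

```python
def condense(text, start=None, end=None):
    """Remove spaces from text[start:end], or from all of text if nothing is selected."""
    if start is None or end is None or start == end:
        # Nothing highlighted: condense the whole output area.
        return text.replace(" ", "")
    selected = text[start:end].replace(" ", "")
    return text[:start] + selected + text[end:]


print(condense("a b c d"))          # whole text
print(condense("a b c d", 0, 3))    # only the first three characters
```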

Note: For the toISO-2 transcription I use a macron over long vowels. This is non-standard.

It’s disappointing to see that non-standard uses of Unicode codepoints are being served by the BBC on their BBC Burmese Facebook page.

Take, for example, the following text.

On the actual BBC site it looks like this (click on the Burmese text to see a list of the characters used):

အိန္ဒိယ မိန်းမငယ် ၂ဦး အမှု ဆေးစစ်ချက် ကွဲလွဲနေ

As far as I can tell, this is conformant use of Unicode codepoints.

Look at the same title on the BBC’s Facebook page, however, and you see:

အိႏၵိယ မိန္းမငယ္ ၂ဦး အမႈ ေဆးစစ္ခ်က္ ကြဲလြဲေန

Depending upon where you are reading this (as long as you have some Burmese font and rendering support), one of the two lines of Burmese text above will contain lots of garbage. For me, it’s the second (non-standard).

This non-standard approach uses visual encoding for combining characters that appear before or on both sides of the base, uses Shan or Rumai Palaung codepoints for subjoining consonants, uses the wrong codepoints for medial consonants, and uses the virama instead of the asat at the end of a word.
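
One symptom of the visual encoding described above can be checked mechanically. The Python below is a minimal sketch, assuming that a word beginning with U+1031 MYANMAR VOWEL SIGN E is a sign of visual ordering (in conformant text that vowel sign is stored after its consonant). The function names are hypothetical, and the check is a heuristic for one feature only, not a full detector for the non-standard encoding.

```python
import unicodedata


def list_characters(text):
    """Return (codepoint, name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def looks_visually_encoded(text):
    """Heuristic: True if any space-separated word starts with U+1031."""
    return any(word.startswith("\u1031") for word in text.split())


# The same syllable in logical order (consonant first) and in visual order.
standard = "\u1006\u1031\u1038"
non_standard = "\u1031\u1006\u1038"

for label, sample in (("standard", standard), ("non-standard", non_standard)):
    print(label, looks_visually_encoded(sample))
    for codepoint, name in list_characters(sample):
        print("   ", codepoint, name)
```

Listing the codepoints side by side like this makes the reordering obvious even when a font renders both strings in a similar way.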

I assume that this is because of prevalent use of the non-standard approach on mobile devices (and that the BBC is just following that trend), caused by hacks that arose when people were impatient to get on the Web but script support was lagging in applications.

However, continuing this divergence does nobody any long-term good.

[ Find fonts and other resources for the Myanmar script ]

Picture of the page in action.

>> Use UniView

The main addition in this version is a couple of buttons that appear when you ask UniView to display a block.

Clicking on Show annotated list generates a list of all characters in the block, with annotations.

Clicking on Show script links displays a list of links to key sources of information about the script of the block, links to relevant articles and apps on the rishida.net site, and related fonts and input methods. This provides a very quick way of finding this information. One particularly useful link (‘Historical documentation’, which links to a Scriptsource.org page) allows you to find the proposals for all additions to Unicode related to the relevant script. These proposals are a mine of useful information about the individual characters in a block, and SIL staff should get a medal for trawling through all the relevant data to provide this.

In addition, there were some changes to the user interface, including the following:

  • The order of information in the lower right panel (detailed character information) was slightly changed, and two alternative representations of the character were added: an HTML escape, and a URI escape.
  • The search box at the top left was constrained to appear closer to the other controls when the window is stretched wide.
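
For readers wondering what the two escape forms mentioned above look like, here is a small Python sketch. It is not UniView's code; the function names are hypothetical, and it assumes the HTML escape is a hexadecimal numeric character reference and the URI escape is the percent-encoded UTF-8 bytes of the character.

```python
from urllib.parse import quote


def html_escape(ch):
    """Hexadecimal numeric character reference, e.g. &#xE9;"""
    return f"&#x{ord(ch):X};"


def uri_escape(ch):
    """Percent-encoded UTF-8 bytes, e.g. %C3%A9"""
    return quote(ch, safe="")


def conversions(ch):
    """Collect a few common representations of a single character."""
    codepoint = ord(ch)
    return {
        "hex": f"{codepoint:04X}",
        "dec": str(codepoint),
        "html": html_escape(ch),
        "uri": uri_escape(ch),
    }


print(conversions("\u00E9"))  # LATIN SMALL LETTER E WITH ACUTE
```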

Various bugs were also fixed.

>> Use it

This HTML page allows you to expand information in the lines of the UnicodeData.txt file, edit them and generate a new version. It also checks the data for validity in a number of areas.

It can be helpful if you have the misfortune to pore over the source code of the UnicodeData.txt file and find your eyes blurring as you count fields. And it is particularly useful for people submitting proposals for new scripts or characters to the Unicode Consortium, to help them generate correct lists of Unicode properties for inclusion in the proposal. (You can even build the whole thing in the UI, error free, by starting with a number of blank lines, such as 1111;NAME;;;;;;;;;;;;;.)
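
To give an idea of what expanding a line involves, here is a minimal Python sketch. It is not the code behind the page: the function names and the short field labels are made up for the example, and the two checks shown are only a small sample of the validation one might do. It relies on the fact that each UnicodeData.txt line has 15 semicolon-separated fields.

```python
# Hypothetical short labels for the 15 fields of a UnicodeData.txt line.
FIELD_NAMES = [
    "code", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal", "digit", "numeric",
    "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]


def expand_line(line):
    """Split one line into a dict keyed by field label."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, found {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def check_line(line):
    """Return a list of problems found (empty if none of these checks fail)."""
    record = expand_line(line)
    problems = []
    try:
        int(record["code"], 16)
    except ValueError:
        problems.append("code is not a hexadecimal number")
    if record["bidi_mirrored"] not in ("Y", "N"):
        problems.append("bidi_mirrored must be Y or N")
    return problems


# Built with join() so the field count is easy to see.
capital_a = ";".join([
    "0041", "LATIN CAPITAL LETTER A", "Lu", "0", "L",
    "", "", "", "", "N", "", "", "", "0061", "",
])
print(expand_line(capital_a)["simple_lowercase"])
print(check_line("1111;NAME" + ";" * 13))
```

Running the checks on the blank template line reports the empty mirrored field, which is the kind of thing that is easy to miss when counting semicolons by eye.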

The image below shows the page in action. I dropped in a couple of lines from the Ahom script proposal, and vandalised them slightly. The first panel shows that the app has spotted an error. I used the column to the right to edit out the error in the second panel, and regenerated the lines in the box below.

Picture of the page in action.

Having made edits, you can copy and paste the output back into the top box to send it through the sausage machine again, and check that there are no remaining errors.

You can add a whole script block at a time to the top box, or a single line – as you like.

Well, it’s a bit esoteric, but hopefully it will be useful to someone somewhere.

Characters in the Unicode Balinese block.

I just uploaded an initial draft of an article Balinese Script Notes. It lists the Unicode characters used to represent Balinese text, and briefly describes their use. It starts with brief notes on general script features and discussions about which Unicode characters are most appropriate when there is a choice.

The script type is abugida – consonants carry an inherent vowel. It’s a complex script derived from Brahmi, and has lots of contextual shaping and positioning going on. Text runs left-to-right, and words are not separated by spaces.

I think it’s one of the most attractive scripts in Unicode, and for that reason I’ve been wanting to learn more about it for some time now.

>> Read it

Picture of the page in action.

>> Use it

This picker contains characters from the Unicode Balinese block needed for writing the Balinese language. Characters needed for Sasak are also available in the Advanced section. Balinese musical notation characters are not included.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility.

About this picker: Characters are grouped to aid input. The consonant block includes characters needed for Kawi and Sanskrit as well as the native Balinese characters, all arranged according to the Brahmi pronunciation grid.

The picker has only a default view and a font grid view. It’s difficult to put in the time for the shape-based, keyboard-based, and various transcription-based views in some other pickers. In a new departure, however, I have included a list of Latin characters on the default view to assist in writing transcriptions alongside Balinese text.

There is, however, a significant issue with this picker, due to the lack of support for Balinese as a script in computers. The only Unicode-based Balinese font I know of is Aksara Bali, but that font seems to work as expected only in Firefox on Mac OS X. Furthermore, the Aksara Bali font doesn’t handle ra repa as described in the Unicode Standard. The sequence <consonant, adeg-adeg, ra repa> produces a visible adeg-adeg, rather than the post-fixed form of ra repa. The sequence <consonant, vowel sign ra repa> produces the post-fixed form of ra repa, rather than the subjoined form. You can produce the post-fixed form with this font by using <consonant, vowel sign ra repa> and the subjoined form by using <consonant, adeg-adeg, ra, pepet>, but these sequences will produce content that cannot be matched against sequences using the Unicode approach, and content that may fail with other Unicode-compliant fonts in the future.

Hopefully some new, fully Unicode-compliant fonts will come along soon. This is one of the most beautiful scripts I have come across.

(Btw, I’m working on a set of notes for Balinese characters, linked from UniView, with some feature innovations to get around the font issue. Look out for that later. And I’m thinking I should develop a Javanese picker to go with this one. Just need a bit of time…)

For the curious, here’s the first article of the Universal Declaration of Human Rights, as typed in the Balinese picker. Translation by Tri Ediwan (reproduced from Omniglot).

Picture of the page in action.

I’ve wanted to get around to this for years now. Here is a list of fonts that come with Windows 7 and Mac OS X Snow Leopard/Lion, grouped by script.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script. I’m still working on the list, and there are some caveats.

>> See the list

Some of the fonts listed above may be disabled on the user’s system. I’m making an assumption that someone who reads Tibetan will have the Tibetan font turned on, but for my articles that explain writing systems to people in English, such assumptions may not hold.

The list I used to identify Windows fonts is specific to Windows 7 and fairly stable, but the Mac font list spans more than one version of Mac OS X. I could only find an unofficial list of fonts for Snow Leopard, and some fonts on that list were not on my system. Where a Mac font is new with Lion (and there are a significant number) it is indicated. See the official list of fonts on Mac OS X Lion.

There shouldn’t be any fonts listed here for a given script that aren’t supplied with Windows 7 or Mac OS X Snow Leopard/Lion, but there are probably supplied fonts that are not yet listed here (typically these will be large fonts that cover multiple scripts). In particular, note that I haven’t yet made a list of fonts that support Latin, Greek and Cyrillic (mainly because there are so many of them, and partly because I’m wondering how useful it will be).

The text used is as much as would fit on one line of article 1 of the Universal Declaration of Human Rights, taken from this Unicode page, wherever I could find it. I created a few instances myself, where it was missing, and occasionally I resorted to arbitrary lists of characters.

You can obtain a character-based version of the text used by looking at the source text: look for the title attribute on the section heading.

Things still to do:

  • create sections for Latin, Greek and Cyrillic fonts
  • check for fonts covering multiple Unicode blocks
  • figure out how to tell, and how to show which is the system default
  • work out and show what’s not available in Windows XP
  • work out which fonts are new in Lion, and whether it’s worth including them
  • figure out whether people with different locale setups see different things
  • recapture all font images that need it at 36px, rather than varying sizes

Update, 19 Feb 2012

I uploaded a new version of the font list with the following main changes:

  • If you click on an image you see text with that font applied (if you have it on your system, of course). The text can be zoomed from 14px to 100px (using a nice HTML5 slider, if you have the right browser! [try Chrome, Safari or Opera]). This text includes a little Latin text so you can see the relationship between that and the script.
  • All font graphics are now standardised so that text is imaged at a font size of 36px. This makes it more difficult to see some fonts (unless you can use the zoom text feature), but gives a better idea of how fonts vary in default size.
  • I added a few extra fonts which contained multiple script support.
  • I split Chinese into Simplified and Traditional sections.
  • Various other improvements, such as adding real text for N’Ko, correcting the Traditional Chinese text, flipping headers to the left for RTL fonts, reordering fonts so that similar ones are near to each other, etc.

Picture of the page in action.

>> Use UniView

The major change in this version is that the data has been updated to support Unicode version 6.1.0, which should be released today. (See the list of links to new Unicode blocks below.)

There are also a number of feature and bug related changes.

What UniView does: Look up and see characters (using graphics or fonts) and property information, view whole character blocks or custom ranges, select characters to paste into your document, paste in and discover unknown characters, search for characters, do hex/dec/ncr conversions, highlight character types, etc. Supports Unicode 6.1 and is written with Web Standards to work on a variety of browsers. No need to install anything.

List of changes:

  • One significant change enables you to display information in a separate window, rather than overwriting the information currently displayed. This can be done by typing/pasting/dragging a set of characters or character code values into the new Popout area and selecting the icon alongside the Characters or Copy & paste input fields (depending on what you put in the popout window).

  • Two new icons were added to the Copy & paste area:

    Analyse: Clicking on this displays the characters from the Copy & paste area in the lower right part of the page, with all relevant characters converted to uppercase, lowercase and titlecase. Characters that have no case conversion information are also listed.

    Analyse: Clicking on this produces the same kind of output as clicking on the icon just above, but shows the mappings for those characters that have been changed, eg. e→E.

  • Where character information displayed in the lower right panel has a case or decomposition mapping, UniView now displays the characters involved, rather than just giving the hex value(s), eg. Uppercase mapping: 0043 C. You will need a font on your system to see the characters displayed in this way, but whether or not you have a font, this provides a quick and easy way to copy the case-changed character (rather than having to copy the hex value and convert it first).

  • There is also a new line, slightly further down, when UniView is in graphic mode. This line starts with ‘As text:’, and shows the character using whatever default font you have on your system. Of course, if you don’t have a font that includes that character you won’t see it. This has been added to make it easier to copy and paste a character into text.

  • Fixed some small bugs, such as problems with search when U+29DC INCOMPLETE INFINITY is returned.
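
The case conversion output described in the list above can be approximated with Python's built-in string methods. This is a sketch for illustration, not how UniView works: the function names are hypothetical, and Python applies the full Unicode case mappings, which may differ from the data UniView displays.

```python
def case_mappings(text):
    """Return upper, lower and title forms for each character in text."""
    rows = []
    for ch in text:
        row = {
            "char": ch,
            "upper": ch.upper(),
            "lower": ch.lower(),
            "title": ch.title(),
        }
        # A character with no case conversion maps to itself in all three forms.
        row["has_case_mapping"] = any(
            row[key] != ch for key in ("upper", "lower", "title")
        )
        rows.append(row)
    return rows


def changed_mappings(text, convert=str.upper):
    """List only the characters that change, in the form e→E."""
    return [f"{ch}→{convert(ch)}" for ch in text if convert(ch) != ch]


# e, sharp s, the dž digraph (U+01C6) and a digit with no case.
for row in case_mappings("e\u00DF\u01C61"):
    print(row)
print(changed_mappings("e1"))
```

The dž digraph is a handy test character because its uppercase and titlecase forms differ.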

Enjoy.

Here are direct links to the new blocks added to Unicode 6.1:
