
Picture of the page in action.

The page Mongolian variant forms has been significantly changed and moved to a new location. The main changes include the following:

  • The ‘NP’ column is now the ‘DS01’ column. It shows the shapes discussed by the Mongolian experts on the public-i18n-mongolian@w3.org list with a view to updating and standardising the results of using Unicode variants. The column reflects the state of that document as of 25 Nov 2016. Expected updates to the document are shown using “[current] ≫ [expected]”. As those and other changes are made to DS01 I will update this variant comparison page.
  • I added a column for L2/16-309 (the output of the Hohhot discussions) as of 26 Oct 2016. Differences between DS01 and L2/16-309 are all highlighted for quick reference.
  • I also added a column for the Unicode Standard v9 chart data, and again highlighted differences from DS01 or L2/16-309. To see this column, click on the vertical blue bar, bottom right, then set Hidden Columns to Show.
  • Showing the hidden columns also reveals a usage column and a notes column (information in both cases taken from L2/16-309).
  • All the font information has been rechecked, and some bugs fixed.
  • Other editorial changes include the hiding of former notes, and the simplification and update of the page intro.

Picture of the page in action.

Another significant development is the creation of a Shape Index. That document lists all shapes used in the page Mongolian Variants, and enables you to jump to the appropriate table in that page so that you can see how it’s used.

The classification used is not intended to be etymologically or philosophically pure. It is intended as a simple practical tool to help locate shapes, and one that also works for novices.

I hope these changes will help us identify and resolve the remaining differences between the documents. I will update these pages as the source documents change (please advise me when new versions are developed). Of course, comments are welcome.

If you want to join the discussion about Mongolian variants you could join this mailing list. (Hit the subscribe link.)

Some more notes for future reference. The following scripts in Unicode 9.0 have both upper and lower case characters.

In modern use

Latin
Greek
Cyrillic
Armenian
Cherokee
Adlam

Limited modern use

Coptic
Warang Citi
Old Hungarian
Osage

No longer used

Georgian*
Glagolitic
Deseret

Under Latin I include characters used for phonetic transcriptions.

There are also case correspondences in the following Unicode blocks: Number Forms, Enclosed Alphanumerics, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms. Mathematical Alphanumeric Symbols also has both upper and lower case forms, but they are not convertible from one to the other.
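
To illustrate that last point, here is a quick sketch using Python 3’s built-in string methods (results assume a reasonably recent Unicode database; the characters are just examples):

    # Fullwidth and enclosed letters carry ordinary case mappings...
    print('Ａ'.lower())          # ａ  (U+FF21 → U+FF41)
    print('Ⓐ'.lower())          # ⓐ  (U+24B6 → U+24D0)
    # ...but the Mathematical Alphanumeric Symbols have no case mappings in
    # UnicodeData, so lowercasing the bold capital A is a no-op.
    print('\U0001D400'.lower())  # 𝐀  (unchanged)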

Georgian is a little special. Modern Georgian is unicameral. An attempt was made to write it in a bicameral way in the 1950s, but it didn’t catch on.

See also the list of Right-to-left scripts.

Examples of case conversion.

These are notes culled from various places. There may well be some copy-pasting involved, but I did it long enough ago that I no longer remember all the sources. But these are notes, it’s not an article.

Case conversions are not always possible in Unicode by applying an offset to a codepoint, although this can work for the ASCII range by adding 32, or by adding 1 for many other characters in the Latin extensions. There are many cases where the corresponding cased character is in another block, or in an irregularly offset location.
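
A minimal sketch of why the offset approach breaks down, using Python 3 (the characters chosen are just illustrative):

    # In ASCII a fixed offset of 32 works...
    print(chr(ord('G') + 32))    # g
    # ...and in much of Latin Extended-A the lowercase is simply the next
    # codepoint, ie. an offset of 1.
    print(chr(ord('Ā') + 1))     # ā  (U+0100 → U+0101)
    # But there is no general arithmetic rule: Ɓ (U+0181, Latin Extended-B)
    # pairs with ɓ (U+0253) over in the IPA Extensions block, so a real
    # conversion has to consult the Unicode case mapping tables instead.
    print('Ɓ'.lower())           # ɓ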

In addition, there are linguistic issues that mean that simple mappings of one character to another are not sufficient for case conversion.

In German, the uppercase of ß is SS. German and Greek cannot, however, be easily transformed from upper to lower case: German because SS could be converted either to ß or ss, depending on the word; Greek because all tonos marks are omitted in upper case, eg. does ΑΘΗΝΑ convert to Αθηνά (the goddess) or Αθήνα (capital of Greece)? German may also uppercase ß to ẞ sometimes for things like signboards.

Also Greek converts uppercase sigma to either a final or non-final form, depending on the position in a word, eg. ΟΔΥΣΣΕΥΣ becomes οδυσσευς. This contextual difference is easy to manage, however, compared to the lexical issues in the previous paragraph.
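
Both behaviours show up with the default (language-neutral) Unicode mappings, which a recent Python 3 implements, including the final-sigma rule:

    print('Straße'.upper())      # STRASSE  (ß uppercases to SS)
    print('STRASSE'.lower())     # strasse  (no way back to ß without knowing the word)
    print('ΟΔΥΣΣΕΥΣ'.lower())    # οδυσσευς (word-final sigma becomes ς, the others σ)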

In Serbo-Croatian there is an important distinction between uppercase and titlecase. The single letter dž converts to DŽ when the whole word is uppercased, but Dž when titlecased. Both of these forms revert to dž in lowercase, so there is no ambiguity here.

In Dutch, the titlecase of ijsvogel is IJsvogel, ie. the first two letters have to be titlecased. There is a single character IJ (U+0132 LATIN CAPITAL LIGATURE IJ) in Unicode that behaves as expected, but this character is very often not available on a keyboard, and so the word is commonly written with the two letters I+J.

In Greek, tonos diacritics are dropped during uppercasing, but not dialytika. Greek diphthongs with tonos over the first vowel are converted during uppercasing to no tonos but a dialytika over the second vowel in the diphthong, eg. Νεράιδα becomes ΝΕΡΑΪΔΑ. A letter with both tonos and dialytika above drops the tonos but keeps the dialytika, eg. ευφυΐα becomes ΕΥΦΥΪΑ. Also, contrary to the initial rule mentioned here, Greek does not drop the tonos on the disjunctive eta (usually meaning ‘or’), eg. ήσουν ή εγώ ή εσύ becomes ΗΣΟΥΝ Ή ΕΓΩ Ή ΕΣΥ (note that the initial eta is not disjunctive, and so does drop the tonos). This maintains the distinction between the ‘either/or’ ή and η, the feminine nominative singular form of the article.

A Greek titlecased vowel, ie. a vowel that is uppercased at the start of a word, retains its tonos accent, eg. Όμηρος.

Turkish, Azeri, Tatar and Bashkir pair dotted and undotted i’s, which requires special, language-specific handling for case conversion. For example, the name of the second largest city in Turkey is “Diyarbakır”, which contains both the dotted and dotless letters i. When rendered into upper case, this word appears as DİYARBAKIR.
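
The default, untailored Unicode mappings get Turkish wrong, which is why the handling has to be language-specific. A quick Python 3 sketch (Python applies only the untailored mappings; a locale-aware library such as ICU would be needed for the correct result):

    print('Diyarbakır'.upper())  # DIYARBAKIR – both i and ı map to I, so İ never appears
    print('DİYARBAKIR'.lower())  # di̇yarbakir – İ becomes i + U+0307, and ı is lost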

Lithuanian also has language-specific rules that retain the dot over i when combined with accents, eg. i̇̀ i̇́ i̇̃, whereas the capital I has no dot.

Sometimes European French omits accents from uppercase letters, whereas French Canadian typically does not. However, this is more of a stylistic than a linguistic rule. Sometimes French people uppercase œ to OE, but this is mostly due to issues with lack of keyboard support, it seems (as is the issue with French accents).

Capitalisation may ignore leading symbols and punctuation for a word, and titlecase the first casing letter. This applies not only to non-letters: a letter such as the (non-casing version of the) glottal stop, ʔ, may be ignored at the start of a word, and the following letter titlecased, in IPA or Americanist phonetic transcriptions. (Note that, to avoid confusion, there are separate case-paired characters available for use in orthographies such as Chipewyan, Dogrib and Slavey. These are Ɂ and ɂ.)

Another issue for titlecasing is that not all words in a sequence are necessarily titlecased. German capitalises nouns, but not verbs or adjectives. French and Italian may expect to titlecase the ‘A’ in “L’Action”, since that is the start of a word. In English, it is common not to titlecase words like ‘for’, ‘of’, ‘the’ and so forth in titles.
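
Generic titlecasing algorithms know nothing of these language-specific conventions. Python 3’s str.title(), for example, simply titlecases every cased run of letters (a sketch, purely for illustration):

    print("l'action comble le vide".title())  # L'Action Comble Le Vide – fine for French
    print('the lord of the rings'.title())    # The Lord Of The Rings – English would keep 'of'/'the' lower
    print('ijsvogel'.title())                 # Ijsvogel – Dutch expects IJsvogel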

Unicode provides only algorithms for generic case conversion and case folding. CLDR provides some more detail, though it is hard to programmatically achieve all the requirements for case conversion.

Case folding is a way of converting to a standard sequence of (lowercase) characters that can be used for comparisons of strings. (Note that this sequence may not represent normal lowercase text: for example, both the uppercase Greek sigma and lowercase final sigma are converted to a normal sigma, and the German ß is converted to ‘ss’.) There are also different flavours of case folding available: common, full, and simple.
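
In Python 3 the difference between lowercasing and case folding looks like this:

    print('Straße'.lower())                                # straße  (lowercase keeps ß)
    print('Straße'.casefold())                             # strasse (case folding expands it)
    print('ΟΔΥΣΣΕΥΣ'.casefold() == 'Οδυσσευς'.casefold())  # True – every sigma folds to σ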

These are just some notes for future reference. The following scripts in Unicode 9.0 are normally written from right to left.

Scripts containing characters with the bidirectional property RIGHT-TO-LEFT ARABIC (AL) are marked with an asterisk. The remaining scripts have characters with the bidirectional property RIGHT-TO-LEFT (R):
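
These property values can be read straight from the Unicode Character Database, eg. with Python’s unicodedata module:

    import unicodedata
    print(unicodedata.bidirectional('א'))  # R  – Hebrew letters are Right-to-Left
    print(unicodedata.bidirectional('ا'))  # AL – Arabic letters are Right-to-Left Arabic
    print(unicodedata.bidirectional('A'))  # L  – Latin letters are Left-to-Right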

In modern use

Adlam
Arabic *
Hebrew
Nko
Syriac *
Thaana *

Limited modern use

Mende Kikakui (small numbers)
Old Hungarian
Samaritan (religious)

Archaic

Avestan
Cypriot
Hatran
Imperial Aramaic
Kharoshthi
Lydian
Manichaean
Meroitic
Mandaic
Nabataean
Old South Arabian
Old North Arabian
Old Turkic
Pahlavi (Inscriptional)
Palmyrene
Parthian (Inscriptional)
Phoenician

See also the list of bicameral scripts.

Picture of the page in action.

>> Use UniView

Unicode 8.0.0 is released today. This new version of UniView adds the new characters encoded in Unicode 8.0.0 (including 6 new scripts). The scripts listed in the block selection menu were also reordered to match changes to the Unicode charts page.

The URL for UniView is now https://r12a.github.io/uniview/. Please change your bookmarks.

The github site now holds images for all 28,000+ Unicode codepoints other than Han ideographs and Hangul syllables (in two sizes).

I also fixed the Show Age filter, and brought it up to date.

Three bopomofo letters with tone mark.

Light tone mark in annotation.

A key issue for handling of bopomofo (zhùyīn fúhào) is the placement of tone marks. When bopomofo text runs vertically (either on its own, or as a phonetic annotation), some smarts are needed to display tone marks in the right place. This may also be required (though with different rules) for bopomofo when used horizontally for phonetic annotations (ie. above a base character), but not in all such cases. However, when bopomofo is written horizontally in any other situation (ie. when not written above a base character), the tone mark typically follows the last bopomofo letter in the syllable, with no special handling.

From time to time questions are raised on W3C mailing lists about how to implement phonetic annotations in bopomofo. Participants in these discussions need a good understanding of the various complexities of bopomofo rendering.

To help with that, I just uploaded a new Web page Bopomofo on the Web. The aim is to provide background information, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I intend to update the page from time to time, as new information becomes available.


Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between Latin and Bengali. You can use these buttons to transcribe to and from Latin transcriptions using ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

There is some confusion about which shapes should be produced by fonts for Mongolian characters. Most letters have at least one isolated, initial, medial and final shape, but other shapes are produced by contextual factors, such as vowel harmony.

Unicode has a list of standardised variant shapes, dating from 27 November 2013, but that list is not complete and contains what are currently viewed by some as errors. It also doesn’t specify the expected default shapes for initial, medial and final positions.

The original list of standardised variants was based on 蒙古文编码 (Mongolian Encoding) by Professor Quejingzhabu in 2000.

A new proposal was published on 20 January 2014, which attempts to resolve the current issues, although I think that it introduces one or two issues of its own.

The other factor in this is what the actual fonts do. Sometimes they follow the Unicode standardised variants list, other times they diverge from it. Occasionally a majority of implementations appear to diverge in the same way, suggesting that the standardised list should be adapted to reality.

To help unravel this, I put together a page called Notes on Mongolian variant forms that visually shows the changes between the various proposals, and compares the results produced by various fonts.

This is still an early draft. The information only covers the basic Mongolian range – Todo, Sibe, etc still to come. Also, I would like to add information about other fonts, if I can obtain them.

Update, 16 Apr 2015: The Todo, Sibe, Manchu, Sanskrit and Tibetan characters are now all done, and font information has been added for them. (And the document was moved to github.)

See the Tibetan Script Notes

Last March I pulled together some notes about the Tibetan script overall, and detailed notes about Unicode characters used in Tibetan.

I am writing these pages as I explore the Tibetan script as used for the Tibetan language. They may be updated from time to time and should not be considered authoritative. Basically I am mostly simplifying, combining, streamlining and arranging the text from the sources listed at the bottom of the page.

The first half of the script notes page describes how Unicode characters are used to write Tibetan. The second half looks at text layout in Tibetan (eg. line-breaking, justification, emphasis, punctuation, etc.)

The character notes page lists all the characters in the Unicode Tibetan block, and provides specific usage notes for many of them as they are used for writing the Tibetan language.

See the Tibetan Character Notes

Tibetan is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs. Text runs from left to right.

There are various different Tibetan scripts, of two basic types: དབུ་ཙན་ dbu-can, pronounced /uchen/ (with a head), and དབུ་མེད་ dbu-med, pronounced /ume/ (headless). This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.

The pronunciation of Tibetan words is typically much simpler than the orthography, which involves patterns of consonants. These reduce ambiguity and can affect pronunciation and tone. In the notes I try to explain how that works, in an approachable way (though it’s still a little complicated, at first).

Traditional Tibetan text was written on pechas (དཔེ་ཆ་ dpe-cha), loose-leaf sheets. Some of the characters used and formatting approaches are different in books and pechas.

For similar notes on other scripts, see my docs list.

It’s disappointing to see that non-standard implementations of UTF-8 are being used by the BBC on their BBC Burmese Facebook page.

Take, for example, the following text.

On the actual BBC site it looks like this (click on the burmese text to see a list of the characters used):

အိန္ဒိယ မိန်းမငယ် ၂ဦး အမှု ဆေးစစ်ချက် ကွဲလွဲနေ

As far as I can tell, this is conformant use of Unicode codepoints.

Look at the same title on the BBC’s Facebook page, however, and you see:

အိႏၵိယ မိန္းမငယ္ ၂ဦး အမႈ ေဆးစစ္ခ်က္ ကြဲလြဲေန

Depending upon where you are reading this (as long as you have some Burmese font and rendering support), one of the two lines of Burmese text above will contain lots of garbage. For me, it’s the second (non-standard).

This non-standard approach uses visual encoding for combining characters that appear before or on both sides of the base, uses Shan or Rumai Palaung codepoints for subjoining consonants, uses the wrong codepoints for medial consonants, and uses the virama instead of the asat at the end of a word.

I assume that this is because of prevalent use of the non-standard approach on mobile devices (and that the BBC is just following that trend), caused by hacks that arose when people were impatient to get on the Web but script support was lagging in applications.

However, continuing this divergence does nobody any long-term good.

[ Find fonts and other resources for the Myanmar script ]

I’ve been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.

The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It’s very high level, and basically just draws out some of the uncertainties that seem to surround the topic.

Let’s suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.

Arabic justification #1

Unjustified Arabic text

Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).

To keep it simple, let’s just focus on the top two lines.

One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.

Arabic justification #2

Justification using tatweels

The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.

Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:

  • extensions can only appear between certain characters, and are forbidden around other characters
  • the number of allowable extensions per word and per line is usually kept to a minimum
  • words vary in appropriateness for extension, depending on word length
  • there are rules about where in the line extensions can appear – usually not at the beginning
  • different font styles have different rules

An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.

A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.

In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word too long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.

Arabic justification #3

Tatweels in the wrong place due to just a font change

To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)

Arabic justification #4: Nastaliq

Same text in the Nastaliq font style

And fonts in the Ruqah style never use elongation at all. (We’ll come back to how you justify text using Ruqah-style fonts in a moment.)

Arabic justification #5: Ruqah

Same text in the Ruqah font style

In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We’ll again focus on the first two lines, and consider how to justify them.

Arabic justification #6: Naskh

Same text in the Naskh font style

High end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.

Arabic justification #7: kashida elongation

Justification using letter elongation (kashida)

In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:

  1. the rules for applying the right-sized elongations to the right characters have to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc, the location and size of elongations need to be reconfigured
  2. there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.

The latter is the fundamental issue we face. There is very little high-quality information available about how to do this, and a lack of consensus not only about what the rules are, but also about how justification should be done.

Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.

Arabic justification #8: space based

Justification using inter-word spacing

The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don’t currently indicate this information.

Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I’ll show some examples.

By the way, for all the complexity so far described, this is still quite a simplistic overview of what’s involved in Arabic justification. For example, high end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.

The key messages:

  1. We need an Arabic Layout Requirements document to capture the script needs.
  2. Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
  3. To start all this, we need experts to provide information and develop consensus.

Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!

Picture of the page in action.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script.

I’ve updated the page to show the fonts added in Windows8. This is the list:

  • Aldhabi (Urdu Nastaliq)
  • Urdu Typesetting (Urdu Nastaliq)
  • Gadugi (Cherokee/Unified Canadian Aboriginal Syllabics)
  • Myanmar Text (Myanmar)
  • Nirmala UI (10 Indic scripts)

There were also two additional UI fonts for Chinese, Jhenghei UI (Traditional) and Yahei UI (Simplified), which I haven’t listed. Also Microsoft Uighur acquired a bold font.

>> See the list

See the blog post for the first version or the page for more information.

Update, 25 Jan 2013

Patrick Andries pointed out that Tifinagh using the Windows Ebrima font was missing from the list. Not any more.

Following up on a very good suggestion by Roger Sperberg, I added two webfonts to the Khmer picker and arranged the font selection list so that you can see which fonts are available on your Mac OS X or Windows7 system.

The webfonts make it possible to use the picker on an iPad or other device that doesn’t have a Khmer font installed. I added two webfonts because one worked on my iPad and the other didn’t, and it was vice versa on my Snow Leopard Macbook.

I also added an HTML5 placeholder for the output box. (I’m wishing you could style that differently from the standard content – and wishing that markup designers would think about this sort of thing and stop using attributes for natural language text…).

You can find the Khmer picker at http://rishida.net/scripts/pickers/khmer/

Characters in the Unicode Balinese block.

I just uploaded an initial draft of an article Balinese Script Notes. It lists the Unicode characters used to represent Balinese text, and briefly describes their use. It starts with brief notes on general script features and discussions about which Unicode characters are most appropriate when there is a choice.

The script type is abugida – consonants carry an inherent vowel. It’s a complex script derived from Brahmi, and has lots of contextual shaping and positioning going on. Text runs left-to-right, and words are not separated by spaces.

I think it’s one of the most attractive scripts in Unicode, and for that reason I’ve been wanting to learn more about it for some time now.

>> Read it

Picture of the page in action.

I’ve wanted to get around to this for years now. Here is a list of fonts that come with Windows7 and Mac OS X Snow Leopard/Lion, grouped by script.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script. I’m still working on the list, and there are some caveats.

>> See the list

Some of the fonts listed above may be disabled on the user’s system. I’m making an assumption that someone who reads Tibetan will have the Tibetan font turned on, but for my articles that explain writing systems to people in English, such assumptions may not hold.

The list I used to identify Windows fonts is Windows7-specific and fairly stable, but the Mac font list spans more than one version of Mac OS X: I could only find an unofficial list of fonts for Snow Leopard, and it included some fonts that I didn’t have on my system. Where a Mac font is new with Lion (and there are a significant number) this is indicated. See the official list of fonts on Mac OS X Lion.

There shouldn’t be any fonts listed here for a given script that aren’t supplied with Windows7 or Mac OS X Snow Leopard/Lion, but there are probably supplied fonts that are not yet listed here (typically these will be large fonts that cover multiple scripts). In particular, note that I haven’t yet made a list of fonts that support Latin, Greek and Cyrillic (mainly because there are so many of them, and partly because I’m wondering how useful it would be).

The text used is as much as would fit on one line of article 1 of the Universal Declaration of Human Rights, taken from this Unicode page, wherever I could find it. I created a few instances myself, where it was missing, and occasionally I resorted to arbitrary lists of characters.

You can obtain a character-based version of the text used by looking at the source text: look for the title attribute on the section heading.

Things still to do:

  • create sections for Latin, Greek and Cyrillic fonts
  • check for fonts covering multiple Unicode blocks
  • figure out how to tell, and how to show which is the system default
  • work out and show what’s not available in Windows XP
  • work out what’s new in Lion, and whether those fonts are worth including
  • figure out whether people with different locale setups see different things
  • recapture all font images that need it at 36px, rather than varying sizes

Update, 19 Feb 2012

I uploaded a new version of the font list with the following main changes:

  • If you click on an image you see text with that font applied (if you have it on your system, of course). The text can be zoomed from 14px to 100px (using a nice HTML5 slider, if you have the right browser! [try Chrome, Safari or Opera]). This text includes a little Latin text so you can see the relationship between that and the script.
  • All font graphics are now standardised so that text is imaged at a font size of 36px. This makes it more difficult to see some fonts (unless you can use the zoom text feature), but gives a better idea of how fonts vary in default size.
  • I added a few extra fonts which contained multiple script support.
  • I split Chinese into Simplified and Traditional sections.
  • Various other improvements, such as adding real text for N’Ko, correcting the Traditional Chinese text, flipping headers to the left for RTL fonts, reordering fonts so that similar ones are near to each other, etc.

The content of this blog post is now available in an updated and expanded form as an article, Bopomofo on the Web.

Webfonts, and WOFF in particular, have been in the news again recently, so I thought I should mention that a few days ago I changed my pages describing the Myanmar and Arabic-for-Urdu scripts so that you can download the necessary font support for the foreign text, either as a linked TTF font or as a WOFF font.

You can find the Myanmar page at http://rishida.net/scripts/myanmar/. Look for the links in the side bar to the right, under the heading “Fonts”.

The Urdu page, using the beautiful Nastaliq script, is at http://rishida.net/scripts/urdu/.

(Note that the examples of short vowels don’t use the nastaliq style. Scroll down the page a little further.)

I haven’t had time to check whether all the OpenType features are correctly rendered, but I’ve been doing Mac testing of the i18n webfonts tests, and it looks promising. (More on that later.) The Urdu font doesn’t rely on OS rendering, which should help.

Here are some examples of the text on the page:
Examples of Urdu script and Myanmar script.

http://rishida.net/scripts/indic-overview/

I finally got around to refreshing this article, by converting the Bengali, Malayalam and Oriya examples to Unicode text. Back when I first wrote the article, it was hard to find fonts for those scripts.

I also added a new feature: In the HTML version, click on any of the examples in indic text and a pop-up appears at the bottom right of the page, showing which characters the example is composed of. The pop-up lists the characters in order, with Unicode names, and shows the characters themselves as graphics.

I have not yet updated this article’s incarnation as Unicode Technical Note #10. The Indian Government also used this article, and made a number of small changes. I have yet to incorporate those, too.

Characters in the Unicode Bengali block.

If you’re interested, I just did a major overhaul of my script notes on Bengali in Unicode. There’s a new section about which characters to use when there are multiple options (eg. RRA vs. DDA+nukta), and the page provides information about more characters from the Bengali block in Unicode (including those used in Bengali’s amazingly complicated currency notation prior to 1957).

In addition, this has all been squeezed into the latest look and feel for script notes pages.

The new page is at a new location. There is a redirect on the old page.

Hope it’s useful.

>> Read it


Picture of the page in action.
 
Picture of the page in action.

About the tools: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility.

Latest changes: The Urdu and Tamil pickers have been upgraded to version 10. This provides new views of the data, but also involved a thorough overhaul and redesign of the pickers. Transliteration functions have also been added for the Tamil picker.

In addition, the Urdu notes page was updated and a new Tamil notes page was created. Database entries were also updated or, in the case of Tamil, created to support the notes pages. These notes pages are the first to use a new look and feel, based on the analyse-string tool I produced earlier this year. This adds information about each character from the Unicode descriptions data to that from my own database.
