Script features by language

This page describes script characteristics for a number of languages. The characteristics are based on the exemplarCharacters lists in CLDR, ie. just the core characters needed to represent each language. It is not intended to be exhaustively scientific, merely to give a basic idea of which languages require which types of feature support.

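Language | Script | Number of characters | Case sensitive? | Combining characters | Context-based positioning | Multiple combining characters | Contextual shaping | Cursive script | Text direction | Space is word separator | Baseline | Text wrap | Justification | Region | Feature count
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Amharic | Ethiopic | 288 | no | 1 | no | no | no | no | ltr | no | mid | char | space | Africa | 3
Arabic (details) | Arabic | 44 | no | 8 | yes | yes | yes | yes | rtl | yes | mid | space | word | M East | 7
Armenian | Armenian | 84 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
Bengali | Bengali | 68 | no | 19 | yes | yes | yes | no | ltr | yes | high | space | space | Asia S | 5
Cambodian | Khmer | 76 | no | 29 | yes | yes | yes | no | ltr | no | mid | word | cluster | Asia SE | 7
Chinese (details) | Simplified | 2128 | no | 0 | no | no | no | no | ltr / tbrl | no | low | char | char | Asia E | 5
Dzongkha | Tibetan | 54 | no | 26 | yes | yes | yes | no | ltr | no | high | special | char | Asia C | 8
English (details) | Latin | 52 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
French | Latin | 84 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
Greek (details) | Greek | 71 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
Gujarati | Gujarati | 68 | no | 18 | yes | yes | yes | no | ltr | yes | high | space | space | Asia S | 5
Hausa | Latin | 51 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Africa | 1
Hebrew (details) | Hebrew | 27 | no | 0 | no | no | no | no | rtl | yes | mid | space | space | M East | 1
Hindi (details) | Devanagari | 78 | no | 18 | yes | yes | yes | no | ltr | yes | high | space | space | Asia S | 5
Igbo | Latin | 56 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Africa | 1
Indonesian | Latin | 46 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Asia SE | 1
Inuktitut | UCAS | 109 | no | 0 | no | no | no | no | ltr | yes | mid | space | space | America | 0
Japanese (details) | Han, Kana | 2128 | no | 0 | no | no | no | no | ltr / tbrl | no | low | char | char | Asia E | 5
Kannada | Kannada | 82 | no | 19 | yes | yes | yes | no | ltr | yes | mid | space | space | Asia S | 4
Korean (details) | Hangul | 2350 | no | 0 | no | no | no | no | ltr / tbrl | yes | low | char | space | Asia E | 3
Lao | Lao | 55 | no | 15 | yes | yes | yes | no | ltr | no | mid | word | cluster | Asia SE | 7
Malayalam | Malayalam | 70 | no | 16 | yes | yes | no | no | ltr | yes | mid | space | space | Asia S | 3
Mongolian | Cyrillic | 35 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Asia C | 1
Mongolian | Mongolian | 45 | no | 0 | no | no | yes | yes | tblr | yes | vertical | space | space | Asia C | 4
Nepali | Devanagari | 68 | no | 18 | yes | yes | yes | no | ltr | yes | high | space | space | Asia S | 5
Oriya | Oriya | 62 | no | 15 | yes | yes | yes | no | ltr | yes | mid | space | space | Asia S | 4
Panjabi | Gurmukhi | 68 | no | 13 | yes | yes | yes | no | ltr | yes | high | space | space | Asia S | 5
Persian | Arabic | 43 | no | 5 | yes | no | yes | yes | rtl | yes | mid | space | word | M East | 6
Portuguese (BR) | Latin | 78 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
Russian (details) | Cyrillic | 66 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
Spanish | Latin | 66 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Europe | 1
Swahili | Latin | 48 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Africa | 1
Tamil | Tamil | 47 | no | 12 | yes | yes | yes | no | ltr | yes | mid | space | space | Asia S | 4
Telugu | Telugu | 70 | no | 19 | yes | yes | yes | no | ltr | yes | mid | space | space | Asia S | 4
Thai (details) | Thai | 73 | no | 16 | yes | yes | no | no | ltr | no | mid | word | cluster | Asia SE | 6
Tibetan | Tibetan | 87 | no | 51 | yes | yes | yes | no | ltr | no | high | special | char | Asia C | 8
Urdu | Arabic | 50 | no | 0 | no | no | yes | yes | rtl | no | slope | space | word | Asia S | 6
Vietnamese | Latin | 186 | yes | 0 | no | no | no | no | ltr | yes | mid | space | space | Asia SE | 1
Yoruba | Latin | 70 | yes | 2 | no | no | no | no | ltr | yes | mid | space | space | Africa | 2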

Notes

The table is intended to provide a general indication only. Some entries could be disputed, and sometimes the dispute goes back to the CLDR data. For example, the CLDR data for Burmese lists no combining characters, although the asat is clearly needed, as are the vowels. Similarly, no combining characters are listed for Urdu, although they are for Arabic. I went with the CLDR data for Urdu, but I didn't put Burmese in the table.

The word "details" after a language name marks a link to a sample page that gives more detail about the script used for that language.

Number of characters. This figure is based on the simplest list of exemplarCharacters in CLDR. It is therefore the set of core characters needed to represent the language, and typically doesn't include common punctuation, currency symbols, etc. Nor does it include the additional characters that you may find in publications. For example, the English set of characters doesn't include é, which you might use for 'résumé'.

Where a language uses a case sensitive script, uppercase versions of letters are included in this figure.

The characteristics described below are also based on this set of characters only.
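As a rough illustration of how case affects the figure, the sketch below counts a set of exemplar characters and adds the uppercase form of each cased letter. The function name `count_with_case` is hypothetical, and the input is assumed to be the 26 basic Latin letters that make up the English core set; this is not how the table was actually generated.

```python
def count_with_case(exemplars: str) -> int:
    """Count distinct characters, adding uppercase forms of cased letters."""
    chars = set(exemplars)
    for ch in exemplars:
        upper = ch.upper()
        # Skip letters whose uppercase form is more than one character
        # (for example, German sharp s uppercases to "SS").
        if len(upper) == 1:
            chars.add(upper)
    return len(chars)


english = "abcdefghijklmnopqrstuvwxyz"  # assumed English core set
print(count_with_case(english))  # 52
```

For a script without case, the uppercase form equals the character itself, so the count is unchanged.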

Case sensitive? Whether or not the script makes case distinctions.

Combining characters. This shows the subset of the number of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
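One way to approximate this count, assuming that "combining character" means any character whose Unicode general category is a Mark (Mn, Mc or Me), is sketched below. That definition and the helper name `count_combining` are assumptions made for illustration; the figures in the table may have been derived differently.

```python
import unicodedata


def count_combining(chars: str) -> int:
    """Count distinct characters whose general category is a Mark (M*)."""
    return sum(
        1 for ch in set(chars)
        if unicodedata.category(ch).startswith("M")
    )


# THAI CHARACTER KO KAI (a base letter), SARA I and MAI EK (both marks)
sample = "\u0E01\u0E34\u0E48"
print(count_combining(sample))  # 2
```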

Contextual positioning. This is typically related to combining characters, and indicates that a typical font uses OpenType rules to position a glyph according to the glyphs that surround it, eg. tone marks in Thai, or vowel signs in Arabic (if used).

Multiple combining characters. Whether more than one combining character can be associated with a given base character.

Contextual shaping. Whether different glyph shapes have to be used for a character depending on the visual context, eg. the RA in Myanmar that grows and shrinks to fit around the character it surrounds. Note that this does not include shaping for cursive scripts (see below).

Cursive script. Do the letters in this script join up, eg. as in Arabic?

Ligatures. Does the script require certain ligatures, ie. a single glyph for more than one underlying character?

Right-to-left. Is this a right-to-left script? In practice this usually means that bidirectional behaviour needs to be supported, for numbers and embedded foreign text.
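To see why right-to-left text usually implies bidirectional support, the sketch below looks up the Unicode bidirectional class of each character in a string. The helper names `bidi_classes` and `needs_bidi` are hypothetical, and treating any 'R' or 'AL' character as a sign that bidi handling is needed is a simplifying assumption, not a full implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata


def bidi_classes(text: str) -> set:
    """Return the set of Unicode bidirectional classes found in text."""
    return {unicodedata.bidirectional(ch) for ch in text}


def needs_bidi(text: str) -> bool:
    """True if text contains right-to-left letters (class R or AL)."""
    return bool(bidi_classes(text) & {"R", "AL"})


hebrew_with_number = "\u05e9\u05dc\u05d5\u05dd 123"  # Hebrew word, then digits
print(needs_bidi(hebrew_with_number))          # True
print("EN" in bidi_classes(hebrew_with_number))  # True: European digits
print(needs_bidi("hello 123"))                 # False
```

The Hebrew letters have class R while the digits have class EN, so a single line mixes runs that are laid out in different directions.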

Space not word separator. Is this a script like Thai, where spaces are used to separate phrases, not words, or like Japanese and Chinese, that don't use spaces, or Ethiopic, that has its own word separator?

Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like the Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.

Text wrap. At the end of a line, where is the typical break point? Is it between words, or between characters? Entries labelled 'special' wrap at a character that is not a space, eg. Tibetan, which uses a tsheg between words rather than a space.

Justification. What is the basic starting point for justifying text on a line? Typically this is related to the spaces between words ('space'). The other values are: 'char', typical of Chinese and Japanese, where justification starts with inter-character spaces; 'cluster', for scripts such as those of South East Asia, where word boundaries are taken into account but spaces are used as phrase separators; and 'word', for Arabic-based scripts, where justification is commonly achieved by stretching the baseline or using ligatures.

Region. This rough grouping places the language in the region where it originated, so English is in Europe, and Arabic is in the Middle East. It serves to get a very rough idea of how things stack up on a regional basis.

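Feature count. This is a very simplistic indicator that awards one point for each column after the first three (not counting Region) whose value departs from the default. The defaults are 'no', 'mid' and '0' and, as the table is currently laid out, 'ltr' for text direction, 'yes' for space as word separator, and 'space' for text wrap and justification.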
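The scoring can be sketched as follows. The rule as stated names 'no', 'mid' and '0' as the zero-point values; the sketch additionally assumes that 'ltr', 'yes' (space is word separator) and 'space' (text wrap, justification) score zero and that Region is not scored. Those extra defaults are inferred from the rows of the table rather than stated in the text, and the dictionary keys are hypothetical short names for the columns.

```python
# Assumed zero-point value for each scored column (inferred, see above).
DEFAULTS = {
    "case": "no",
    "combining": "0",
    "positioning": "no",
    "multiple_combining": "no",
    "shaping": "no",
    "cursive": "no",
    "direction": "ltr",
    "space_separates_words": "yes",
    "baseline": "mid",
    "wrap": "space",
    "justification": "space",
}


def feature_count(row: dict) -> int:
    """Award one point per column whose value differs from its default."""
    return sum(1 for col, default in DEFAULTS.items() if row[col] != default)


# Values copied from the Arabic and Thai rows of the table.
arabic = {
    "case": "no", "combining": "8", "positioning": "yes",
    "multiple_combining": "yes", "shaping": "yes", "cursive": "yes",
    "direction": "rtl", "space_separates_words": "yes",
    "baseline": "mid", "wrap": "space", "justification": "word",
}
thai = {
    "case": "no", "combining": "16", "positioning": "yes",
    "multiple_combining": "yes", "shaping": "no", "cursive": "no",
    "direction": "ltr", "space_separates_words": "no",
    "baseline": "mid", "wrap": "word", "justification": "cluster",
}

print(feature_count(arabic))  # 7
print(feature_count(thai))    # 6
```

Both results agree with the Feature count column for those rows.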

Author: Richard Ishida.

Content created 29 August, 2010. Last update 2012-02-23 10:24 GMT