Large character sets
Initially there was only one type of Chinese – what we now call Traditional Chinese. Then in the 1950s Mainland China introduced Simplified Chinese. It was simplified in two ways:
the more common character shapes were reduced in complexity,
a relatively smaller set of characters was defined for common usage than had traditionally been the case (resulting in the mapping of more than one character in Traditional Chinese to a single character in the Simplified Chinese set).
This slide shows Traditional Chinese above and Simplified Chinese below.
Traditional Chinese is still used to write characters in Taiwan and Hong Kong, and much of the Chinese diaspora. Simplified Chinese is used in Mainland China and Singapore. It is important to stress that people speaking many different, often mutually unintelligible, Chinese dialects would use one or other of these scripts to write Chinese – ie. the characters do not necessarily represent the sounds.
There are a few local characters, such as for Cantonese in Hong Kong, that are not in widespread use.
In Chinese these ideographs are called hanzi. They are often referred to as Han characters.
There is another script used with Traditional Chinese for annotations and transliteration during input. It is called zhuyin or bopomofo, and will be described in more detail later.
It is said that Chinese people typically use around 3-4,000 characters for most communication, but a reasonable word processor would need to support at least 10,000. Unicode supports over 70,000 Han characters.
This slide shows examples of contrasting shapes in Traditional and Simplified ideographs.
The characters on the left are one ideograph; the characters on the right are another. Characters at the top are Traditional shapes; characters at the bottom are Simplified.
Note that each of the large glyphs shown above is a separate code point in Unicode. The Simplified and Traditional shapes are not unified unless they are extremely similar. (Han unification will be explained in more detail later.)
Japanese uses three native scripts in addition to Latin (which is called romaji), and mixes them all together.
Top right on the slide is an example of ideographic characters, borrowed from Chinese, which in Japanese are called kanji. Kanji characters are used principally for the roots of words.
The example at the top left of the slide is written entirely in hiragana. Hiragana is a native Japanese syllabic script typically used for many indigenous Japanese words (as in this case) and for grammatical particles and endings. The example at the bottom of the slide shows its use to express grammatical information alongside a kanji character (the darker, initial character) that expresses the root meaning of the word.
Japanese everyday usage requires around 2,000 kanji characters – although Japanese character sets include many thousands more.
The example at the bottom of this slide shows the katakana script. This is used for foreign loan words in Japanese. The example reads ‘te-ki-su-to’, ie. ‘text’.
On the two slides above we see the more common characters from the hiragana (left) and katakana (right) syllabaries arranged in traditional order. A character in the same location in each table is pronounced exactly the same.
With the exception of the vowels on the top line and the letter ‘n’, all of the symbols represent a consonant followed by a vowel.
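The parallel layout of the two tables is mirrored in the Unicode code charts, where each common hiragana letter sits a fixed distance from its katakana counterpart. The following is a minimal sketch of that correspondence; the helper name `hiragana_to_katakana` is our own invention for illustration, not a library function.

```python
HIRAGANA_START = 0x3041  # first hiragana letter (small a)
HIRAGANA_END = 0x3096    # last hiragana letter with a katakana twin
KANA_OFFSET = 0x60       # distance between the two Unicode blocks

def hiragana_to_katakana(text: str) -> str:
    """Replace each hiragana letter with the katakana letter at the same chart position."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )

print(hiragana_to_katakana("てきすと"))  # テキスト
```

Characters outside the hiragana range, such as Latin letters or kanji, pass through unchanged.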
The first of the two slides highlights some script features (on the right) from hiragana. The second shows the correspondences in katakana.
Voiced consonants are indicated by attaching a dakuten mark (looks like a quote mark) to the unvoiced shape. The ‘p’ sound is indicated by the use of a han-dakuten (looks like a small circle). The slides show glyphs for ‘ha’, ‘ba’, and ‘pa’ on the top line.
A small ‘tsu’ (っ) is commonly used to lengthen a consonant sound.
Small versions of や, ゆ, and よ are used to form syllables such as ‘kya’ (きゃ), ‘kyu’ (きゅ), and ‘kyo’ (きょ) respectively.
When writing katakana the mark ー is used to indicate a lengthened vowel.
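The relationship between a base shape and its dakuten or han-dakuten form can also be seen in the character data. As a small sketch using Python's standard `unicodedata` module (the helper `decompose_kana` is a name we made up), canonical decomposition splits 'ba' and 'pa' into 'ha' plus a combining mark:

```python
import unicodedata

def decompose_kana(kana: str) -> list[str]:
    """Return the Unicode names of the code points in the NFD form of `kana`."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", kana)]

# U+306F is 'ha', U+3070 is 'ba', U+3071 is 'pa'
for kana in ("\u306F", "\u3070", "\u3071"):
    print(kana, decompose_kana(kana))
```

For 'ba' the list contains HIRAGANA LETTER HA followed by the combining voiced sound mark; for 'pa' the second item is the combining semi-voiced sound mark.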
The example at the top of the slide shows the small tsu being used in katakana to lengthen the ‘t’ sound that follows it. This can be transcribed as ‘intanetto’.
The bottom example shows usage of other small versions of katakana characters. The transcription is ‘konpyuutingu’. In the first case the small ‘yu’ combines with the preceding ‘pi’ to produce ‘pyu’. In the second case the small ‘i’ is used with the preceding ‘te’ syllable to produce ‘ti’ – a sound that is not native to Japanese. (Their equivalent would be ‘chi’.)
The bottom example also shows the use of the han-dakuten and dakuten to turn ‘hi’ into ‘pi’ and ‘ku’ into ‘gu’.
There is also a lengthening mark that lengthens the ‘u’ sound before it.
Korean uses a unique script called hangul. It is unique in that, although it is a syllabic script, the individual phonemes within a syllable are represented by individual shapes. The example shows how the word ‘ta-kuk-o’ is composed of 7 jamos, each expressing a single phoneme. The jamos are displayed as part of a two dimensional syllabic character.
Note that the initial jamo in the last syllable is not pronounced in initial position and serves purely to conform to the rule that hangul syllables always begin with a consonant.
Unicode allows hangul text to be stored either as sequences of jamos or as precomposed syllabic characters. The latter is more common.
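The two storage forms can be converted into each other with Unicode normalization. The sketch below assumes that the word 'ta-kuk-o' is spelled with the three syllables U+D0C0 U+AD6D U+C5B4; that spelling is our guess from the transcription, not something taken from the slide.

```python
import unicodedata

# Assumed spelling of 'ta-kuk-o' as three precomposed syllables.
word = "\uD0C0\uAD6D\uC5B4"

jamos = unicodedata.normalize("NFD", word)       # decompose into conjoining jamos
syllables = unicodedata.normalize("NFC", jamos)  # recompose into syllables

print(len(word), len(jamos))  # 3 7
print([f"U+{ord(j):04X}" for j in jamos])
```

Both strings display identically in a capable renderer, but they differ in length and in their code points, so comparisons are safest after normalizing to one form.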
South Korea also mixes ideographic characters borrowed from Chinese with hangul, though on nothing like the scale of Japanese. In fact, it is quite normal to find whole documents without any hanja, as the ideographic characters in Korean are called.
There are about 2,300 hangul characters in everyday use, but the Unicode Standard has code points for around 11,000.
Note how, because all the characters above are mono-spaced and fit within the same sized box, the text on the slide gives the appearance of a grid. Grid layouts are actually a common typographic convention in East Asian scripts.
When half-width or proportionally-spaced characters are introduced, there is a possibility of this grid being corrupted, but typographic devices are available to provide several possible solutions to this.
You can experiment with various types of grid setting using CSS on the following web pages. It only works on Internet Explorer 5+, and it only provides basic support for grid layouts:
Han and kana characters are usually full-width, whereas Latin text is half-width or proportionally spaced.
Half-width katakana characters do exist, and for compatibility reasons there is a Unicode block for half-width kana characters. These codes should not normally be used, however. They arise from the early computing days when Japanese had to be fitted into a Western-biased technology.
Similarly, it is common to find full-width Latin text, especially in tables. Again, there is a Unicode block dedicated to full-width Latin characters and punctuation, but it is better to use the ordinary Latin characters and select a font with full-width glyphs instead.
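When text containing these compatibility characters does arrive, it can be folded to the ordinary forms. The following is a rough sketch using compatibility normalization (NFKC) from Python's `unicodedata` module; the function name `to_standard_width` is ours. Note that NFKC also folds other compatibility characters, so it is an assumption here that this is acceptable for the data in question.

```python
import unicodedata

def to_standard_width(text: str) -> str:
    """Fold half-width kana and full-width Latin to their ordinary code points."""
    return unicodedata.normalize("NFKC", text)

halfwidth_kana = "\uFF83\uFF77\uFF7D\uFF84"  # half-width 'te-ki-su-to'
fullwidth_latin = "\uFF21\uFF22\uFF23"       # full-width 'ABC'

print(to_standard_width(halfwidth_kana))   # テキスト
print(to_standard_width(fullwidth_latin))  # ABC

# The East Asian Width property distinguishes the variants:
# 'H' = half-width, 'F' = full-width, 'W' = wide, 'Na' = narrow.
for ch in ("\uFF83", "\u30C6", "\uFF21", "A"):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))
```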
A radical is an ideograph or a component of an ideograph that is used for indexing dictionaries and word lists, and as the basis for creating new ideographs. The 214 radicals of the KangXi dictionary are universally recognised.
The examples enlarged on the slide show the ideographic character meaning ‘word’, ‘say’ or ‘speak’ (bottom left), and three more characters that use this as a radical on their left hand side.
The visual appearance of radicals may vary significantly.
Here the radical shown on the previous slide is seen as used in Simplified Chinese (top right). Although the shape differs somewhat it still represents the same radical.
On the bottom row we see the ‘water’ radical being used in two different positions in a character, and with two different shapes. This time the right-most example is found in both simplified and traditional forms.
Unicode dedicates two blocks to radicals. The KangXi radicals block depicted here contains the base forms of the 214 radicals.
The CJK Radicals Supplement block contains variant shapes of these radicals when they are used as parts of other characters or in simplified form. These have not been unified because they often appear independently in dictionary indices.
Characters in these blocks should never be used as ideographs.
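To illustrate why the distinction matters, the sketch below compares the KangXi radical for 'speech' with the ordinary ideograph of the same shape. They are separate code points, so a search for one will not find the other unless the text is first folded with compatibility normalization. The variable names are ours.

```python
import unicodedata

radical = "\u2F94"    # from the KangXi Radicals block
ideograph = "\u8A00"  # the ordinary ideograph meaning 'word', 'say' or 'speak'

print(unicodedata.name(radical))  # KANGXI RADICAL SPEECH
print(radical == ideograph)       # False

# Compatibility normalization maps the radical to the unified ideograph.
folded = unicodedata.normalize("NFKC", radical)
print(folded == ideograph)        # True
```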
A very early step in realizing the use of a script or set of scripts is to define the set of characters needed for its use.
The slide shows a set that was defined recently for the North African Tifinagh script. It includes characters for a number of variants of Tifinagh besides that used in Morocco, such as writing used by the Touareg.
At this stage, this is just a bag of characters with no formal structure. It is not necessarily computer-specific – it is just a list of characters needed for writing Tifinagh, one way or another.
This is called a character set, or repertoire.
Next the characters are ordered in a standard way and numbered. Each unique character has a unique number, called a code point. The code point of the number circled above is 33 in hexadecimal notation (a common way to represent code points), or 51 in decimal.
A set of characters ordered and numbered in this way is called a coded character set.
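As a quick illustration of the link between a character and its code point, the sketch below looks up one Tifinagh letter. The choice of U+2D30 is ours and is not necessarily the character shown on the slide.

```python
import unicodedata

letter = "\u2D30"         # a Tifinagh letter chosen for illustration
code_point = ord(letter)  # the character's number in the coded character set

print(unicodedata.name(letter))  # TIFINAGH LETTER YA
print(f"U+{code_point:04X}")     # U+2D30 (hexadecimal notation)
print(code_point)                # 11568 (the same number in decimal)

# chr() goes the other way, from code point back to character.
print(chr(code_point) == letter)  # True
```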
In the early days of computing a byte consisted of 7 bits, allowing for a code page containing 128 code points. This was the day of ASCII.
When bytes contained 8 bits they gave rise to code pages containing 256 code points. These code pages typically retain the ASCII characters in the lower 128 code points and add characters for additional languages to the upper reaches. On the slide we see a Latin1 code page, ISO 8859-1, containing code points for Western European languages.
Unfortunately, 256 code points was not enough to support the whole of Europe – not even Latin based languages such as Turkish, Hungarian, etc. To support Greek characters you might see the code points re-mapped as shown on the slide (left hand side). These alternative code pages forced you to maintain contextual information so that you could determine the intended character from the upper ranges of the code page. It also made localization difficult since you had to keep changing code pages.
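The need for contextual information is easy to demonstrate. In the sketch below the same byte value is decoded with two different code pages from the ISO 8859 family, using Python's built-in codecs; without knowing which code page was intended, there is no way to tell which character the byte stands for.

```python
byte = bytes([0xE1])  # one byte from the upper half of the code page

as_latin1 = byte.decode("iso8859_1")  # Western European code page
as_greek = byte.decode("iso8859_7")   # Greek code page

print(as_latin1)  # á
print(as_greek)   # α

# The ASCII range (lower 128 code points) is the same in both.
print(b"abc".decode("iso8859_1") == b"abc".decode("iso8859_7"))  # True
```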
East Asian computing immediately faced a much bigger problem than in Europe, as can be seen by the size of these common character sets. They resorted to double-byte coded character sets. Two-byte character sets provided 16 bits, and would allow for 2^16 (ie. 65,536) possible code points. In reality these character sets tended to be based on a 7-bit model, utilizing only a part of the total space available.
One particular problem persisted here – these character sets and their encodings were script specific. It was still difficult to represent Chinese, Korean and Japanese text simultaneously.
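The sketch below shows the problem with one legacy Japanese encoding, Shift_JIS, as implemented by Python's built-in codec. The helper `can_encode` and the sample strings are ours. Japanese text encodes at two bytes per kanji, but a string that also contains hangul cannot be encoded at all.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

japanese = "日本語"
korean = "\uD55C\uAD6D\uC5B4"  # three hangul syllables
mixed = japanese + " " + korean

print(len(japanese), len(japanese.encode("shift_jis")))  # 3 6
print(can_encode(mixed, "shift_jis"))  # False
print(can_encode(mixed, "utf-8"))      # True
```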
Unicode sets out to encompass all scripts and symbols needed for text in a single character set.
Most modern scripts and useful symbols are currently encoded in a coding space called the Basic Multilingual Plane or BMP. There is room for 65,536 characters on each plane.
Recently Unicode and ISO 10646 have defined 16 supplementary planes, each the same size as the BMP, for future expansion. Some of those planes are being populated already. There are code points defined for additional alphabets and a large number of math characters in the Supplementary Multilingual Plane (SMP). Also a large number of additional ideographic characters have been added to the Supplementary Ideographic Plane (SIP).
In total there are now over one million code points available. This means that all of the above scripts and more can be represented simultaneously with ease. Localization also becomes easier, since there is no need to enable new code pages or switch encodings – you simply begin using the characters that are available.
In addition to the normal code point allocations, there is additional space available in Unicode for privately defined character mappings. There is a Private Use Area in the BMP from code points E000–F8FF (6,400 code points). There are two additional, and much larger, private use areas in the supplementary character ranges.
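The arithmetic behind these figures is straightforward. The following sketch computes the size of the code space and works out which plane a code point belongs to; the function and constant names are ours.

```python
PLANE_SIZE = 0x10000  # code points per plane
PLANE_COUNT = 17      # the BMP plus 16 supplementary planes

def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP, 1 = SMP, 2 = SIP, ...)."""
    return code_point >> 16

def is_bmp_private_use(code_point: int) -> bool:
    """True for code points in the BMP Private Use Area, E000 to F8FF."""
    return 0xE000 <= code_point <= 0xF8FF

print(PLANE_SIZE)                # 65536
print(PLANE_SIZE * PLANE_COUNT)  # 1114112
print(0xF8FF - 0xE000 + 1)       # 6400
print(plane_of(0x4E2D), plane_of(0x1D400), plane_of(0x20000))  # 0 1 2
```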
Although the terms 'character set' and 'character encoding' are often treated as the same thing, we will use them to mean separate things in this tutorial.
We have already explained that a character set or repertoire comprises the set of atomic text elements you will use for a particular purpose. We also explained that the Unicode Standard assigns a unique scalar number to every character in its character set. The resulting numbered set is referred to as a coded character set. Units of a coded character set are known as code points.
The character encoding reflects the way these abstract characters are mapped to numbers for manipulation in a computer.
In a standard such as ISO 8859 encodings tend to use a single byte for a given character and the encoding is straightforwardly related to the position of the characters in the set.
The above distinction becomes helpful when discussing Unicode because the set of characters (ie. the character set) defined by the Unicode Standard can be encoded in a number of different ways. The type of encoding doesn’t change the number or nature of the characters in the Unicode set, just the way they are mapped into numbers for manipulation by the computer (see the next slide).
Similarly, on the Web, the document character set of an XML or HTML document is always Unicode. A particular XML or HTML document, however, can be encoded using any encoding, even encodings that don’t cover the full Unicode range such as ISO 8859-1 (Latin1). However, because the document character set is Unicode, even if a Web page uses Latin1 as its encoding, it can use special constructs called numeric character references (eg. &#x1234;) to include any Unicode character outside that encoding.
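A minimal sketch of this mechanism, using only Python's standard library: the `xmlcharrefreplace` error handler writes a numeric character reference for any character the target encoding lacks, and `html.unescape` resolves references back to characters. The sample text is ours.

```python
import html

text = "caf\u00E9 \u1234"  # U+00E9 exists in Latin1, U+1234 does not

# Characters missing from Latin1 become decimal numeric character references.
encoded = text.encode("iso8859_1", errors="xmlcharrefreplace")
print(encoded)  # b'caf\xe9 &#4660;'

# A parser resolves the reference against the document character set, Unicode.
decoded = html.unescape(encoded.decode("iso8859_1"))
print(decoded == text)  # True
```

The decimal reference produced here and the hexadecimal form `&#x1234;` are equivalent, since 0x1234 is 4660.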
Character encodings are the things that have names in the IANA registry.
This slide demonstrates a number of ways of encoding the same characters in Unicode. These encodings are UTF-8, UTF-16, and UTF-32.
UTF-8 uses 1 byte to represent characters in the old ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.
UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
UTF-32 uses 4 bytes everywhere.
In the chart on the slide, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.
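The byte counts described above can be checked with a short sketch. The sample characters and the helper name `encoded_lengths` are ours. The big-endian codec variants are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_lengths(ch: str) -> tuple[int, int, int]:
    """Return the number of bytes `ch` needs in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

samples = [
    "A",           # ASCII
    "\u00E9",      # Latin letter outside ASCII
    "\u65E5",      # Han character in the BMP
    "\U00020000",  # Han character in the Supplementary Ideographic Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+65E5 (3, 2, 4)
# U+20000 (4, 4, 4)
```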
This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17, Unicode Character Encoding Model.
Characters in the supplementary planes are addressed in UTF-16 using pairs of 16-bit code units called surrogates. There is a block of 1024 code points reserved for high surrogates, and another one of the same size for low surrogates. UTF-16 combines one of the high surrogates with one of the low surrogates to point into the supplementary character range. UTF-8 and UTF-32 encode supplementary characters directly and do not use surrogates.
Surrogates must not be treated as individual characters. Only pairs should be counted when wrapping or highlighting text, counting characters, displaying unknown character glyphs, and so on.
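The sketch below shows how a surrogate pair is derived from a supplementary code point, and how characters can be counted in a sequence of UTF-16 code units by skipping the second half of each pair. The function names are ours, and the counting helper assumes well-formed input.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def count_characters(utf16_units: list[int]) -> int:
    """Count code points in well-formed UTF-16 by ignoring low surrogates."""
    return sum(1 for unit in utf16_units if not 0xDC00 <= unit <= 0xDFFF)

high, low = to_surrogate_pair(0x20000)
print(f"{high:04X} {low:04X}")  # D840 DC00

units = [0x0041, high, low]     # 'A' followed by U+20000
print(len(units), count_characters(units))  # 3 2
```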
Unicode provides a superset of most character sets in use around the world, but tries not to duplicate characters unnecessarily. For example, there are several ISO character sets in the 8859 range that all duplicate the ASCII characters. Unicode doesn't have as many codes for the letter 'a' as there are character sets - that would make for a huge and confusing character set.
The same principle applies for Han (Chinese) characters. The initial set of sources for Han encoding in Unicode laid end to end comprised 121,000 characters, but there were many repeats, and the final Unicode tally for all these after elimination of duplicates was 20,902. (There are now over 70,000 Han characters encoded in Unicode.)
If Han characters had different meanings or etymologies, they were not unified. Han characters, however, are highly pictorial in nature. So the (dis-) unification process had to take into account the visual forms to some extent. Where there was a significant visual difference between Han characters that represented the same thing they were allotted to separate Unicode code points. (Unifying the Han characters is a sophisticated process, carried out over a long period by many East-Asian experts.)
Factors such as those shown on this slide prevent unification, ie.
What is left for unification are characters representing the same thing but exhibiting no visual differences, or relatively minor differences such as different sequence for writing strokes, differences in stroke overshoot and protrusion, differences in contact and bend of strokes, differences in accent and termination of strokes, etc.
The slide shows how a string of characters maps to byte codes in memory in UTF-8. In an encoding such as UTF-8 the number of bytes actually used depends on the character in question, and only a very small number of characters are encoded using a single byte.
This means that care has to be taken to recognize and respect the integrity of the character boundaries.
Applications cannot simply handle a fixed number of bytes when performing editing operations such as inserting, deleting, wrapping, cursor positioning, etc. Collation for searching and sorting, pointing into strings, and all other operations similarly need to work out where the boundaries of the characters lie in order to successfully process the text.
Such operations need to be based on characters, not bytes.
Similarly, string lengths should be based on characters rather than bytes.
This slide illustrates how things go wrong with technology that is not multi-byte aware. In this case the author attempted to delete a Chinese character on the last line, and the application translated that to "delete a single byte". This caused a misalignment of all the following bytes, and produced garbage.
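A small sketch of the same kind of failure in UTF-8 follows; the sample string and helper names are ours. Deleting one byte leaves an invalid sequence, whereas deleting one character keeps the text intact. In UTF-8 the damage is confined to the affected character, since the encoding resynchronizes at the next lead byte; in the older double-byte encodings the misalignment could run on as described above.

```python
def delete_first_byte(data: bytes) -> bytes:
    """What a non multi-byte aware application does: remove a single byte."""
    return data[1:]

def delete_first_character(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode, remove one whole character, then encode again."""
    return data.decode(encoding)[1:].encode(encoding)

text = "\u4E2D\u6587 ok"  # two Han characters, a space, two Latin letters
data = text.encode("utf-8")
print(len(text), len(data))  # 5 9

broken = delete_first_byte(data)
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid text:", err.reason)

fixed = delete_first_character(data)
print(fixed.decode("utf-8") == text[1:])  # True
```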
UniView is an unofficial HTML-based tool that I created for finding Unicode characters and looking up their properties. It also acts like a character map or character picker, allowing you to create strings of Unicode characters. You can also use it to discover the contents of a string or a sequence of codepoint values, to convert to NFC or NFD normalized forms, display ranges of characters as lists or tables, highlight properties, etc.
The Unibook Character browser is a downloadable utility for offline viewing of the character charts and character properties for The Unicode Standard, created by Asmus Freytag. It can also be used to copy&paste character codes. The utility was derived from the program used to print the character code charts for the Unicode Standard and ISO/IEC 10646.
If you need to convert Unicode characters between various escaped forms, you should try the web-based Unicode Code Converter tool.
There are also over 20 web-based Unicode Character Pickers available. These allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. They are likely to be most useful if you don't know a script well enough to use the native keyboard. The arrangement of characters makes it much more useable than a regular character map utility. The more advanced pickers provide ways to select characters from semantic or phonetic arrangements, select by shape, and select by association with a transcription.
We have noted that East Asian character sets number their characters in the thousands. So how do you quickly find the one character you want while typing?
In the past people have tried using extremely large keyboards, or forcing people to remember the code point numbers for the character. Not surprisingly these approaches were not very popular.
The answer is to use an IME (Input Method Editor). An IME (also called a front-end processor) is software that uses a number of strategies to help you search for the character you want.
This slide summarizes the typical steps when typing in Japanese using a standard IME for Windows.
The user types Japanese in romaji transcription using a QWERTY keyboard. As they type the transcription is automatically converted to hiragana or katakana. Ranges of characters are accepted by a key press as they go along. To convert a range of characters to kanji, the user presses a key such as the space bar. Typically the IME will automatically insert into the text the kanji that were last selected for the transcription that has been input. If this is not the desired kanji sequence, the user presses the key again and a selection list pops up, usually ordered in terms of frequency of selection. The user picks the kanji characters required, and confirms their choice, then moves on.
Note that there are only a few alternatives for the sequence かいぎ. If the user had looked up かい and ぎ separately they would have been faced each time with a large number of choices. The provision of a dictionary as part of the IME for lookup of longer phrases is one way of speeding up the process of text entry for the user.
Ordering by frequency and memory of the last conversion are additional methods of assisting the user to find the right character more quickly.
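The three aids just described (phrase dictionary, frequency ordering and memory of the last conversion) can be sketched as a toy lookup table. This is only an illustration under our own assumptions: the class `ToyIME`, its methods and the tiny dictionary are hypothetical, and a real IME is far more sophisticated.

```python
from collections import defaultdict

class ToyIME:
    """Hypothetical candidate lookup keyed by kana reading."""

    def __init__(self, dictionary: dict[str, list[str]]):
        self.dictionary = {reading: list(words) for reading, words in dictionary.items()}
        self.usage = defaultdict(int)  # (reading, candidate) -> times selected
        self.last_choice = {}          # reading -> most recent selection

    def candidates(self, reading: str) -> list[str]:
        options = self.dictionary.get(reading, [])
        # Most frequently selected first; the sort is stable, so ties keep dictionary order.
        ranked = sorted(options, key=lambda c: -self.usage[(reading, c)])
        last = self.last_choice.get(reading)
        if last in ranked:             # the last conversion is offered first
            ranked.remove(last)
            ranked.insert(0, last)
        return ranked

    def select(self, reading: str, candidate: str) -> None:
        self.usage[(reading, candidate)] += 1
        self.last_choice[reading] = candidate

# Illustrative entries only: the phrase has few candidates, its parts have many.
ime = ToyIME({
    "かいぎ": ["会議", "懐疑"],
    "かい": ["会", "回", "階", "貝", "海"],
    "ぎ": ["議", "義", "技", "儀"],
})
print(ime.candidates("かいぎ"))  # ['会議', '懐疑']
ime.select("かいぎ", "懐疑")
print(ime.candidates("かいぎ"))  # ['懐疑', '会議']
```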
Whereas the Japanese romaji input method predominates for Japanese, there are a number of different approaches available for Chinese.
Pinyin was introduced with Simplified Chinese, and is typically used in the same geographical areas, ie. Mainland China and Singapore.
It is essentially equivalent to the romaji input method. The numbers you see in the example above indicate tones. This dramatically reduces the ambiguity of the sounds in Chinese.
One of the problems of pinyin is that the transcription is based on the Mandarin or Putonghua dialect of spoken Chinese. So to use this method you need to be able to speak that dialect.
A more common input method in Taiwan uses an alphabet called zhuyin or bopomofo. This alphabet is only used for phonetic transcription of Chinese. Essentially it is the same idea as pinyin, but with different letters. The tones in this case are indicated by spacing accent marks (shown only in the top line on the slide) which in Unicode are unified with accents used in European languages.
A very different approach allows the user to create the desired character on the basis of its visual appearance rather than the underlying phonics.
Changjie input uses just such an approach. The keyboard provides access to primitive ideographic components which, when combined in the right sequence, lead to the desired ideograph.
An advantage of an approach such as changjie is that you don’t have to speak Mandarin. A drawback is the additional training required.
Note that pen-based input is another useful approach. In fact, this is particularly helpful for people who do not speak Chinese or Japanese. Once you master a few simple rules about stroke order and direction, you can use something like Microsoft’s IME Pad to draw and select characters without any knowledge of components or pronunciation.
The examples on this slide show the keystrokes required to enter the text used in the previous slides containing pinyin and bopomofo examples.
In some cases you may come across an ideograph that your font or your character set doesn’t support. Unicode provides a way of saying, “I can’t represent it, but it looks like this character.”
The approach requires you to add character U+303E immediately followed by a similar looking character. This is called an ideographic variation indicator. This at least gives the reader a chance to guess at the character that is missing.
Another way of addressing the same problem is to use the ideographic description characters introduced in Unicode 3.0.
This approach allows you to draw a picture showing the various components of the character you can’t represent, and where they appear. The lower line on the slide shows how you would describe the large character near the top. Note that this is interpreted recursively.
Note also, that this should not be treated as in any way equivalent to an existing ideograph when collating strings.
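The recursive interpretation can be sketched with a small parser. It assumes only the twelve ideographic description characters U+2FF0 to U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The function name `parse_ids` and the sample sequence are ours, not the example on the slide.

```python
TERNARY_IDCS = {"\u2FF2", "\u2FF3"}
BINARY_IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY_IDCS

def parse_ids(sequence: str):
    """Parse an ideographic description sequence into a nested tuple."""
    def parse(pos: int):
        if pos >= len(sequence):
            raise ValueError("description sequence ends too early")
        ch = sequence[pos]
        if ch in TERNARY_IDCS:
            arity = 3
        elif ch in BINARY_IDCS:
            arity = 2
        else:
            return ch, pos + 1         # an ordinary component
        pos += 1
        operands = []
        for _ in range(arity):         # each operand may itself be a description
            node, pos = parse(pos)
            operands.append(node)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("unexpected characters after the description")
    return tree

# Left-to-right: the water component, then (left-to-right: tree, eye).
sample = "\u2FF0\u6C35\u2FF0\u6728\u76EE"
print(parse_ids(sample))
```

The result is a nested tuple in which each description character is followed by its operands, so the inner left-to-right pair appears as the second operand of the outer one.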
Content created February, 2003. Last update 2005-03-29 16:29 GMT
Copyright © 2003-2012 Richard Ishida. All rights reserved.