I recently came across an email thread where people were trying to understand why they couldn’t see Indian content on their mobile phones. Here are some notes that may help to clarify the situation. They are not fully developed! Just rough jottings, but they may be of use.
Let’s assume, for the sake of an example, that the goal is to display a page in Hindi, which is written using the devanagari script. These principles, however, apply to one degree or another to all languages that use characters outside the ASCII range.
Let’s start by reviewing some fundamental concepts: character encodings and fonts. If you are familiar with these concepts, skip to the next heading.
Character encodings and fonts
Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding.
There are many character encodings to choose from.
The person who created the content of the page you want to read should have used a character encoding that supports devanagari characters, but it should also be a character encoding that is widely recognised by browsers and available in editors. By far the best character encoding to use (for any language in the world) is called UTF-8.
UTF-8 is strongly recommended by the HTML5 draft specification.
There should be a character encoding declaration associated with the HTML code of your page to say what encoding was used. Otherwise the browser may not interpret the bytes correctly. It is also crucial that the text is actually stored in that encoding too. That means that the person creating the content must choose that encoding when they save the page from their editor. It’s not possible to change the encoding of text simply by changing the character encoding declaration in the HTML code, because the declaration is there just to indicate to the browser what key to use to get at the already encoded text.
It’s one thing for the browser to know how to interpret the bytes to represent your text, but the browser must also have a way to make those characters stored in memory appear on the screen.
A font is essential here. Fonts contain instructions for displaying a character or a sequence of characters so that you can read them. The visual representation of a character is called a glyph. The font converts characters to glyphs.
The font has tables to map the bytes in memory to text. To do this, the font needs to recognise the character encoding your page uses, and have the necessary tables to convert the characters to glyphs. It is important that the font used can work with the character encoding used in the page you want to view. Most fonts these days support UTF-8 encoded text.
Very simple fonts contain one glyph for each letter of the alphabet. This may work for English, but it wouldn’t work for a complex script such as devanagari. In these scripts the positioning and interaction of characters has to be modified according to the context in which they are displayed. This means that the font needs additional information about how to choose and postion glyphs depending on the context. That information may be built into the font itself, or the font may rely on information on your system.
Character encoding support
The browser needs to be able to recognise the character encoding used in order to correctly interpret the mapping between bytes and characters.
If the character encoding of the page is incorrectly declared, or not declared at all, there will be problems viewing the content. Typically, a browser allows the user to manually apply a particular encoding by selecting the encoding from the menu bar.
All browsers should support the UTF-8 character encoding.
Sometimes people use an encoding that is not designed for devanagari support with a font that produces the right glyphs nevertheless. Such approaches are fraught with issues and present poor interoperability on several levels. For example, the content can only be interpreted correctly by applying the specifically designed font; no other font will do if that font is not available. Also, the meaning of the text cannot be derived by machine processing, for web searches, etc., and the data cannot be easily copied or merged with other text (eg. to quote a sentence in another article that doesn’t use the same encoding). This practise seriously damages the openness of the Web and should be avoided at all costs.
System font support
Usually, a web page will rely on the operating system to provide a devanagari font. If there isn’t one, users won’t be able to see the Hindi text. The browser doesn’t supply the font, it picks it up from whatever platform the browser is running on.
If browser is running on a desktop computer, there may be a font already installed. If not, it should be possible to download free or commercial fonts and install them. If the user is viewing the page on a mobile device, it may currently be difficult to download and install one.
If there are several devanagari fonts on a system, the browser will usually pick one by default. However, if the web page uses CSS to apply styling to the page, the CSS code may specify one or more particular fonts to use for a given piece of content. If none of these are available on the system, most browsers will fall back to the default, however Internet Explorer will show square boxes instead.
Another way of getting a font onto the user’s system is to download it with the page, just like images are downloaded with the page. This is done using CSS code. The CSS code to do this has been defined for some years, but unfortunately most browsers implementation of this feature is still problematic.
Recently a number of major browsers have begun to support download of raw truetype or opentype fonts. Internet Explorer is not one of those. This involves simply loading the ordinary font onto a server and downloading to the browser when the page is displayed. Although the font may be cached as the user moves from page to page, there may still be some significant issues when dealing with complex scripts or Far Eastern languages (such as Chinese, Japanese and Korean) due to the size of the fonts used. The size of these fonts can often be counted in megabytes rather than kilobytes.
It is important to observe licencing restrictions when making fonts available for download in this way. The CSS mechanism doesn’t contain any restrictions related to font licences, but there are ways of preparing fonts for download that take into consideration some aspects of this issue – though not enough to provide a watertight restriction on font usage.
Microsoft makes available a program to create .eot fonts from ordinary true/opentype fonts. Eot font files can apply some usage restrictions and also subset the font to include only the characters used on the page. The subsetting feature is useful when only a small amount of text appears in a given font, but for a whole page in, say, devanagari script it is of little use – particularly if the user is to input text in forms. The biggest problem with .eot files, however, is that they are only supported by Internet Explorer, and there are no plans to support .eot format on other browsers.
The W3C is currently working on the WOFF format. Fonts converted to WOFF format can have some gentle protection with regard to use, and also apply significant compression to the font being downloaded. WOFF is currently only supported by Firefox, but all other major browsers are expected to provide support for the new format.
For this to work well, all browsers must support the same type of font download.
Complex scripts, such as those used for Indic and South East Asian languages, need to choose glyph shapes and positions and substitute ligatures, etc. according to the context in which characters are used. These adjustments can be acoomplished using the features of OpenType fonts. The browser must be able to implement those opentype features.
Often a font will also rely on operating system support for some subset of the complex script rendering. For example, a devanagari font may rely on the Windows uniscribe dll for things like positioning of left-appended vowel signs, rather than encoding that behaviour into the font itself. This reduces the size and complexity of the font, but exposes a problem when using that font on a variety of platforms. Unless the operating system can provide the same rendering support, the text will look only partially correct. Mobile devices must either provide something similar to uniscribe, or fonts used on the mobile device must include all needed rendering features.
Browsers that do font linking must also support the necessary opentype features and obtain functionality from the OS rendering support where needed.
If tools are developed to subset webfonts, the subsetting must not remove the rendering logic needed for correct display of the text.