Text in a computer or on the Web is composed of characters. Characters represent letters of the alphabet, punctuation, or other symbols.

Each character is represented in the computer or Web page using one or more bytes. The key to which bytes (or sequences of bytes) represent which characters is called a character encoding.
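As a small illustration of the point above, the following Python sketch encodes the same character under two different encodings and shows that the resulting bytes differ. The choice of the character "é" and of the UTF-8 and Latin-1 (ISO 8859-1) encodings is just an assumption made for the example.

```python
# The same character maps to different bytes under different encodings.
text = "\u00e9"  # the single character "é"

utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # one byte in Latin-1 (ISO 8859-1)

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'
```

The character is identical in both cases; only the key that maps it to bytes has changed.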

Web browsers and computer applications need to understand any encoding used for your text, so that they can correctly associate bytes with characters and produce readable text.
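To see what can go wrong when an application assumes the wrong encoding, here is a minimal Python sketch. The sample word and the two encodings are assumptions chosen for illustration; real browsers use more elaborate detection and fallback rules than this.

```python
# Bytes written as UTF-8 but read as Latin-1 produce garbled text.
original = "caf\u00e9"              # "café"
raw = original.encode("utf-8")      # the bytes as stored or sent over the network

misread = raw.decode("latin-1")     # application guesses the wrong encoding
correct = raw.decode("utf-8")       # application uses the right encoding

print(misread)  # cafÃ©
print(correct)  # café
```

No bytes were lost in the misread case. The application simply associated them with the wrong characters.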

In the past, different organizations assembled different sets of characters and created encodings for them. One set may cover just Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), and others may be one of many sets devised in a rather ad hoc way for representing another language somewhere in the world.

Unfortunately, you can’t guarantee that your application will support all encodings, nor that a given encoding will support all your needs for representing a given language. In addition, it is usually impossible to combine different encodings on the same Web page or in a database, which makes it very difficult to support multilingual pages using ‘legacy’ approaches to encoding.
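The limitation described above can be sketched in Python as follows. The helper name can_encode and the sample strings are hypothetical, introduced only for this example; Latin-1 and ISO 8859-7 stand in for typical legacy encodings.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

french = "fran\u00e7ais"  # "français"
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # "Ελληνικά"
mixed = french + " " + greek

for encoding in ("latin-1", "iso8859-7", "utf-8"):
    print(encoding, can_encode(french, encoding),
          can_encode(greek, encoding), can_encode(mixed, encoding))
```

In this sketch each legacy encoding handles one of the two languages, but neither can represent the mixed text, while UTF-8 handles all three strings.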

The Unicode Consortium provides a single, large character set that aims to include all the characters needed for any writing system in the world, including ancient scripts (such as Cuneiform, Gothic and Egyptian Hieroglyphs). Unicode is now fundamental to the architecture of the Web and operating systems, and is supported by all major web browsers and applications. The Unicode Standard also describes properties and algorithms for working with characters.
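As a rough illustration, Python's standard unicodedata module exposes some of the character properties that the Unicode Standard defines. The three sample characters below are assumptions picked for the example: a Latin letter, a Gothic letter, and a Cuneiform sign.

```python
import unicodedata

# One character each from a modern script and two ancient scripts.
samples = ["A", "\U00010330", "\U00012000"]

for ch in samples:
    code_point = f"U+{ord(ch):04X}"        # the character's Unicode code point
    name = unicodedata.name(ch)            # its official Unicode name
    category = unicodedata.category(ch)    # its general category property
    utf8_length = len(ch.encode("utf-8"))  # bytes needed in UTF-8
    print(code_point, name, category, utf8_length)
```

All three characters live in the same character set and can be stored side by side in a single UTF-8 encoded page or database field.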

This approach makes it much easier to deal with multilingual pages or systems, and provides much better coverage of your needs than most traditional encoding systems. For more information, see the Unicode home page or my tutorial on Unicode.