The Internationalization Activity of the W3C (World Wide Web Consortium) has the mission of ensuring universal access to the Web, regardless of language, script or culture. In short, to help make the World Wide Web worldwide!
I’m the Internationalization Activity Lead at the W3C, and I contribute to the Unicode Editorial Committee.
The Internationalization Working Group advises working groups and reviews specifications (such as HTML5, CSS3 modules, the WOFF file format, Widgets, SVG, and many others) to ensure that those technologies can be used by people all around the world, regardless of writing system, language or culture. We also support specification work at the IETF (such as Internationalized Domain Names and Internationalized Resource Identifiers) and the Unicode Consortium.
With support from the European Commission, we have also been running a series of conferences, bringing together stakeholders in standards and best practices for the Multilingual Web.
I’m also on the advisory committee and review board of the Internationalization & Unicode Conference.
About the W3C
There are about 60 W3C staff, spread around the world but attached to MIT (USA), ERCIM (France), Keio University (Japan) and Beihang University (China). Then there are around 400 member organisations that provide guidance and resources for the numerous Working Groups.
In addition to HTML, CSS and XML, the W3C has created many fundamental Web standards related to such things as privacy, graphics (e.g. PNG and SVG), multimodal interaction, document styling (CSS, XSLT), voice, Web services, the Semantic Web, and more. We also have horizontal activities ensuring that principles of internationalization, accessibility and device independence are applied to the Web technologies we develop.
The W3C standards (called ‘Recommendations’) lead the Web forward, and are typically well ahead of existing practice. Their aim is to improve interoperability between users of the Web – i.e. to provide common formats that enable people to collaborate.
Text in a computer or on the Web is composed of characters. Characters represent letters of the alphabet, punctuation, or other symbols.
Each character is represented in the computer or Web page using one or more bytes. The key to which bytes (or sequences of bytes) represent which characters is called a character encoding.
Web browsers and computer applications need to understand any encoding used for your text, so that they can correctly associate bytes with characters and produce readable text.
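To make the bytes-to-characters relationship concrete, here is a small sketch using Python's built-in codec support: the same text maps to different byte sequences under different character encodings, and decoding only recovers the original text if the correct encoding is used.

```python
# The same string, encoded under two different character encodings,
# produces different byte sequences.
text = "café"

utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9' — é takes two bytes
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'     — é takes one byte

print(utf8_bytes)
print(latin1_bytes)

# Decoding reverses the mapping, but only with the right encoding.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text
```

This is why a Web page must declare (or a browser must correctly detect) its encoding: the bytes alone do not say which mapping was used.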
In the past, different organizations assembled different sets of characters and created encodings for them – one set may cover just Latin-based Western European languages (excluding not only languages such as Bulgarian or Greek, which use other scripts, but also Latin-script languages such as Turkish and Czech, which need additional characters), another may cover a particular Far Eastern language (such as Japanese), and others may be one of the many sets devised in a rather ad hoc way for representing another language somewhere in the world.
Using multiple encodings to support the range of languages needed for an application is problematic, and individual encodings may not even support all your needs for representing a given language. In addition, it is usually impossible to combine different encodings on the same Web page or in a database, and so it becomes very difficult to support multilingual Web pages.
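A hypothetical illustration of why mixing encodings fails: bytes written under one legacy encoding, then read under another, silently turn into the wrong characters (so-called "mojibake"). The encodings chosen here are just examples.

```python
# Text stored under one single-byte legacy encoding...
original = "naïve"
stored = original.encode("latin-1")  # b'na\xefve'

# ...but read back by software assuming a different single-byte encoding.
# In KOI8-R (a Cyrillic encoding), the byte 0xEF is a Cyrillic letter,
# so the text decodes "successfully" yet comes out wrong.
misread = stored.decode("koi8-r")
print(misread)

assert misread != original
```

Nothing raises an error here, which is exactly the danger: both encodings assign a character to every byte, so the corruption goes undetected.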
The Unicode Consortium provides a large, single character set that aims to include all the characters needed for any writing system in the world, including ancient scripts (such as Cuneiform, Gothic and Egyptian Hieroglyphs). It is now fundamental to the architecture of the Web and operating systems, and is supported by all major web browsers and applications. This Unicode Standard also describes properties and algorithms for working with characters.
This single, comprehensive character set makes supporting multilingual text much simpler.
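As a brief sketch of what that single character set buys you: characters from any script can coexist in one string, each with its own unique Unicode code point and properties, queryable here via Python's standard `unicodedata` module.

```python
import unicodedata

# Latin, Cyrillic, Hiragana and Cuneiform characters in one string —
# no need to switch encodings between scripts.
text = "aЖあ𒀀"

for ch in text:
    # Every Unicode character has a unique code point and a formal name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The same mechanism gives access to the character properties (category, directionality, and so on) that the Unicode algorithms mentioned above rely on.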