Dochula Pass, Bhutan

I just received a query from someone who wanted to know how to figure out what characters are in and what characters are not in a particular legacy character encoding. So rather than just send the information to her I thought I’d write it as a blog post so that others can get the same information. I’m going to write this quickly, so let me know if there are parts that are hard to follow, or that you consider incorrect, and I’ll fix it.

A few preliminary notes to set us up: When I refer to ‘legacy encodings’, I mean any character encoding that isn’t UTF-8. Though, actually, I will only consider those that are specified in the Encoding spec, and I will use the data provided by that spec to determine what characters each encoding contains (since that’s what it aims to do for Web-based content). You may come across other implementations of a given character encoding, with different characters in it, but bear in mind that those are unlikely to work on the Web.

Also, the tools I will use refer to a given character encoding using the preferred name. You can use the table in the Encoding spec to map alternative names to the preferred name I use.

What characters are in encoding X?

Let’s suppose you want to know what characters are in the character encoding you know as cseucpkdfmtjapanese. A quick check in the Encoding spec shows that the preferred name for this encoding is euc-jp.

Go to http://r12a.github.io/apps/encodings/ and look for the selection control near the bottom of the page labelled show all the characters in this encoding.

Select euc-jp. It opens a new window that shows you all the characters.

picture of the result

This is impressive, but so large a list that it’s not as useful as it could be.

So highlight and copy all the characters in the text area and go to https://r12a.github.io/apps/listcharacters/.

Paste the characters into the big empty box, and hit the button Analyse characters above.

This will now list for you those same characters, but organised by Unicode block. At the bottom of the page it gives a total character count, and adds up the number of Unicode blocks involved.

picture of the result

What characters are not in encoding X?

If instead you actually want to know what characters are not in the encoding for a given Unicode block you can follow these steps.

Go to UniView (http://r12a.github.io/uniview/) and select the block you are interested where is says Show block, or alternatively type the range into the control labelled Show range (eg. 0370:03FF).

Let’s imagine you are interested in Greek characters and you have therefore selected the Greek and Coptic block (or typed 0370:03FF in the Show range control).

On the edit buffer area (top right) you’ll see a small icon with an arrow point upwards. Click on this to bring all the characters in the block into the edit buffer area. Then hit the icon just to its left to highlight all the characters and then copy them to the clipboard.

picture of the result

Next open http://r12a.github.io/apps/encodings/ and paste the characters into the input area labelled with Unicode characters to encode, and hit the Convert button.

picture of the result

The Encoding converter app will list all the characters in a number of encodings. If the character is part of the encoding, it will be represented as two-digit hex codes. If not, and this is what you’re looking for, it will be represented as decimal HTML escapes (eg. Ͱ). This way you can get the decimal code point values for all the characters not in the encoding. (If all the characters exist in the encoding, the block will turn green.)

(If you want to see the list of characters, copy the results for the encoding you are interested in, go back to UniView and paste the characters into the input field labelled Find. Then click on Dec. Ignore all ASCII characters in the list that is produced.)

Note, by the way, that you can tailor the encodings that are shown by the Encoding converter by clicking on change encodings shown and then selecting the encodings you are interested in. There are 36 to choose from.

Picture of the page in action.
>> Use the picker

Following closely on the heels of the Old Norse and Runic pickers comes a new Old English (Anglo-Saxon) picker.

This Unicode character picker allows you to produce or analyse runs of Old English text using the Latin script.

In addition to helping you to type Old English latin-based text, the picker allows you to automatically generate phonetic and runic transcriptions. These should be used with caution! The transcriptions are only intended to be a rough guide, and there may occasionally be slight inaccuracies that need patching.

The picture in this blog post shows examples of old english text, and phonetic and runic transcriptions of the same, from the beginning of Beowulf. Click on it to see it larger, or copy-paste the following into the picker, and try out the commands on the top right: Hwæt! wē Gār-Dena in ġēar-dagum þēod-cyninga þrym gefrūnon, hūðā æþelingas ellen fremedon.

If you want to work more with runes, check out the Runic picker.

Picture of the page in action.
>> Use the picker

Character pickers are especially useful for people who don’t know a script well, as characters are displayed in ways that aid identification. These pickers also provide tools to manipulate the text.

The Runic character picker allows you to produce or analyse runs of Runic text. It allows you to type in runes for the Elder fuþark, Younger fuþark (both long-branch and short-twig variants), the Medieval fuþark and the Anglo-Saxon fuþork. To help beginners, each of the above has its own keyboard-style layout that associates the runes with characters on the keyboard to make it easier to locate them.

It can also produce a latin transliteration for a sequence of runes, or automatically produce runes from a latin transliteration. (Note that these transcriptions do not indicate pronunciation – they are standard latin substitutes for graphemes, rather than actual Old Norse or Old English, etc, text. To convert Old Norse to runes, see the description of the Old Norse pickers below. This will soon be joined by another picker which will do the same for Anglo-Saxon runes.)

Writing in runes is not an exact science. Actual runic text is subject to many variations dependent on chronology, location and the author’s idiosyncracies. It should be particularly noted that the automated transcription tools provided with this picker are intended as aids to speed up transcription, rather than to produce absolutely accurate renderings of specific texts. The output may need to be tweaked to produce the desired results.

You can use the RLO/PDF buttons below the keyboard to make the runic text run right-to-left, eg. ‮ᚹᚪᚱᚦᚷᚪ‬, and if you have the right font (such as Junicode, which is included as the default webfont, or a Babelstone font), make the glyphs face to the left also. The Bablestone fonts also implement a number of bind-runes for Anglo-Saxon (but are missing those for Old Norse) if you put a ZWJ character between the characters you want to ligate. For example: ᚻ‍ᛖ‍ᛚ. You can also produce two glyphs mirrored around the central stave by putting ZWJ between two identical characters, eg. ᚢ‍ᚢ. (Click on the picture of the picker in this blog post to see examples.)

Picture of the page in action.
>> Use the picker

The Old Norse picker allows you to produce or analyse runs of Old Norse text using the Latin script. It is based on a standardised orthography.

In addition to helping you to type Old Norse latin-based text, the picker allows you to automatically generate phonetic and runic transcriptions. These should be used with caution! The phonetic transcriptions are only intended to be a rough guide, and, as mentioned earlier, real-life runic text is often highly idiosyncratic, not to mention that it varies depending on the time period and region.

The runic transcription tools in this app produce runes of the Younger fuþark – used for Old Norse after the Elder and before the Medieval fuþarks. This transcription tool has its own idiosyncracies, that may not always match real-life usage of runes. One particular idiosyncracy is that the output always regularly conforms to the same set of rules, but others include the decision not to remove homorganic nasals before certain following letters. More information about this is given in the notes.

You can see an example of the output from these tools in the picture of the Old Norse picker that is attached to this blog post. Here’s some Old Norse text you can play with: Ok sem leið at jólum, gørðusk menn þar ókátir. Bǫðvarr spurði Hǫtt hverju þat sætti; hann sagði honum at dýr eitt hafi komit þar tvá vetr í samt, mikit ok ógurligt.

The picker also has a couple of tools to help you work with A New Introduction to Old Norse.