
This shows the durations of dynasties and kingdoms of China during the period known as the 16 Kingdoms. Click on the image below to see an interactive version with a guide that follows your cursor and indicates the year.

Chart of timelines

See a map of territories around 409 CE. The dates and ethnic data are from Wikipedia.

Update 2016-10-03: I found it easier to work with the chart if the kingdoms are grouped by name/proximity, so changed the default to that. You can, however, still access the strictly chronological version.

Picture of the page in action.

A new Persian Character Picker web app is now available. The picker allows you to produce or analyse runs of Persian text using the Arabic script. Character pickers are especially useful for people who don’t know a script well, as characters are displayed in ways that aid identification.

The picker is able to produce UN transcriptions of the text in the box. The transcription appears just below the input box, where you can copy it, move it into the input box at the caret, or delete it. To obtain a full transcription it is necessary to add short vowel diacritics in places that could have more than one pronunciation, but the picker can work out the vowels needed for many letter combinations.

See the help file for more information.

This shows the durations of dynasties and kingdoms of China in the 900s. Click on the image below to see an interactive version that shows a guide that follows your cursor and indicates the year.

Chart of timelines

See a map of territories around 944 CE.

Examples of case conversion.

These are notes culled from various places. There may well be some copy-pasting involved, but I did it long enough ago that I no longer remember all the sources. These are notes, not an article.

Case conversions are not always possible in Unicode by applying an offset to a codepoint, although this can work for the ASCII range (adding 32 to an uppercase letter gives its lowercase), or by adding 1 for many other characters in the Latin extensions. In many cases, however, the corresponding cased character is in another block, or at an irregularly offset location.
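As a rough illustration of the offset idea and where it breaks down, here is a minimal Python sketch. The helper name `offset_upper` is made up for this note; it simply subtracts a fixed offset from a code point, which is the naive approach described above.

```python
def offset_upper(ch: str, offset: int) -> str:
    """Naive uppercasing: subtract a fixed offset from the code point."""
    return chr(ord(ch) - offset)

# ASCII: lowercase letters sit 32 code points above their uppercase partners.
print(offset_upper("a", 32))        # A

# Latin Extended-A: many pairs are adjacent, so the offset is 1.
print(offset_upper("\u0101", 1))    # U+0100, capital A with macron

# U+00FF (y with diaeresis) has its uppercase at U+0178, in another block,
# so the Latin-1 offset of 32 lands on the wrong character (U+00DF, sharp s).
print(offset_upper("\u00ff", 32))
print("\u00ff".upper())             # U+0178, taken from the Unicode mapping data
```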

In addition, there are linguistic issues that mean that simple mappings of one character to another are not sufficient for case conversion.

In German, the uppercase of ß is SS. German and Greek cannot, however, be easily transformed from upper to lower case: German because SS could be converted either to ß or ss, depending on the word; Greek because all tonos marks are omitted in upper case, eg. does ΑΘΗΝΑ convert to Αθηνά (the goddess) or Αθήνα (capital of Greece)? German may also uppercase ß to ẞ sometimes for things like signboards.
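To see the asymmetry in practice, here is a small Python sketch. It relies on Python's built-in `str.upper()` and `str.lower()`, which as far as I know apply the language-independent Unicode mappings and so know nothing about German or Greek vocabulary.

```python
# German: uppercasing expands the sharp s, and lowercasing cannot undo it.
upper_strasse = "Straße".upper()
lower_strasse = upper_strasse.lower()
print(upper_strasse)   # STRASSE
print(lower_strasse)   # strasse (the sharp s is not restored)

# The capital sharp s (U+1E9E) does lowercase back to the ordinary sharp s.
capital_sharp_s_lower = "\u1e9e".lower()
print(capital_sharp_s_lower)

# Greek: the default mapping keeps the tonos when uppercasing, and it
# cannot restore a tonos that is already missing when lowercasing.
athens_upper = "Αθήνα".upper()
athens_lower = "ΑΘΗΝΑ".lower()
print(athens_upper)
print(athens_lower)
```

Choosing between ß and ss, or deciding where the tonos goes, needs a dictionary or other lexical knowledge that a character mapping cannot supply.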

Also Greek converts uppercase sigma to either a final or non-final form, depending on the position in a word, eg. ΟΔΥΣΣΕΥΣ becomes οδυσσευς. This contextual difference is easy to manage, however, compared to the lexical issues in the previous paragraph.
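The contextual sigma rule is one that Python's `str.lower()` handles by itself (CPython 3.3 and later, to the best of my knowledge). The snippet below is only a quick check of that behaviour, not a description of how any particular library implements it.

```python
word = "ΟΔΥΣΣΕΥΣ"
lowered = word.lower()
print(lowered)

FINAL_SIGMA = "\u03c2"
MEDIAL_SIGMA = "\u03c3"

# The last sigma becomes the final form, the two in the middle do not.
print(lowered.endswith(FINAL_SIGMA))
print(lowered.count(MEDIAL_SIGMA))

# A capital sigma on its own has no preceding letter, so it gets the
# ordinary (non-final) lowercase form.
lone_sigma = "\u03a3".lower()
print(lone_sigma == MEDIAL_SIGMA)
```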

In Serbo-Croatian there is an important distinction between uppercase and titlecase. The single letter dž converts to DŽ when the whole word is uppercased, but Dž when titlecased. Both of these forms revert to dž in lowercase, so there is no ambiguity here.
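The three forms are separate code points in Unicode, which makes the uppercase/titlecase distinction easy to check. A short Python sketch, using the word džep purely as sample text:

```python
DZ_LOWER = "\u01c6"  # single-character lowercase dz with caron
DZ_TITLE = "\u01c5"  # titlecase form: capital D, small z with caron
DZ_UPPER = "\u01c4"  # uppercase form: both parts capital

word = DZ_LOWER + "ep"

uppercased = word.upper()
titlecased = word.title()

print(uppercased == DZ_UPPER + "EP")   # True
print(titlecased == DZ_TITLE + "ep")   # True

# Both cased forms return to the same lowercase string.
print(uppercased.lower() == word and titlecased.lower() == word)   # True
```

Note that this only works when the text uses the single digraph characters. If the word is typed as the two letters d and ž, generic titlecasing already gives the right result, because only the first letter is capitalised.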

In Dutch, the titlecase of ijsvogel is IJsvogel, ie. both of the first two letters commonly have to be uppercased. There is a single character Ĳ (U+0132 LATIN CAPITAL LIGATURE IJ) in Unicode that will behave as expected, but this single character is very often not available on a keyboard, and so the word is commonly written with the two letters I+J.
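A language-aware titlecasing routine has to special-case the two-letter spelling. The following is a minimal sketch, assuming a single word as input; `dutch_titlecase` is a hypothetical helper written for this note, not a function from any library.

```python
def dutch_titlecase(word: str) -> str:
    """Titlecase one Dutch word, treating a leading 'ij' as a single letter."""
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word[:1].title() + word[1:].lower()

print("ijsvogel".title())            # Ijsvogel (generic rule, wrong for Dutch)
print(dutch_titlecase("ijsvogel"))   # IJsvogel

# With the ligature character U+0133 the generic mapping already works.
print(dutch_titlecase("\u0133svogel") == "\u0132svogel")   # True
```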

In Greek, tonos diacritics are dropped during uppercasing, but not dialytika. Greek diphthongs with tonos over the first vowel are converted during uppercasing to no tonos but a dialytika over the second vowel in the diphthong, eg. Νεράιδα becomes ΝΕΡΑΪΔΑ. A letter with both tonos and dialytika above drops the tonos but keeps the dialytika, eg. ευφυΐα becomes ΕΥΦΥΪΑ. Also, contrary to the initial rule mentioned here, Greek does not drop the tonos on the disjunctive eta (usually meaning ‘or’), eg. ήσουν ή εγώ ή εσύ becomes ΗΣΟΥΝ Ή ΕΓΩ Ή ΕΣΥ (note that the initial eta is not disjunctive, and so does drop the tonos). This maintains the distinction between the ‘either/or’ ή and η, the feminine nominative singular form of the article.
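The first three rules can be approximated by working on decomposed text. The sketch below is my own illustration under some loud assumptions: the input is Greek-only (it strips every combining acute accent, so accented Latin letters would lose theirs too), any accented vowel directly followed by iota or upsilon is treated as a diphthong, and the disjunctive eta rule is deliberately not handled. The name `greek_upper_sketch` is hypothetical.

```python
import re
import unicodedata

TONOS = "\u0301"      # combining acute accent: the tonos after NFD decomposition
DIALYTIKA = "\u0308"  # combining diaeresis

# A tonos directly followed by iota or upsilon (either case) that does not
# already carry a dialytika.
_ACCENT_BEFORE_I_OR_U = re.compile(
    TONOS + "([\u03b9\u03c5\u0399\u03a5])(?!" + DIALYTIKA + ")"
)

def greek_upper_sketch(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    # Diphthong rule: drop the tonos and mark the second vowel with a dialytika.
    decomposed = _ACCENT_BEFORE_I_OR_U.sub(r"\1" + DIALYTIKA, decomposed)
    # Basic rule: drop every remaining tonos; existing dialytika are kept.
    without_tonos = decomposed.replace(TONOS, "")
    return unicodedata.normalize("NFC", without_tonos.upper())

print(greek_upper_sketch("Νεράιδα"))
print(greek_upper_sketch("ευφυΐα"))
```

With this sketch the phrase containing the disjunctive eta would lose the tonos on every eta, which is exactly the case that needs word-level knowledge.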

A Greek titlecased vowel, ie. a vowel at the start of a word that is uppercased, retains its tonos accent, eg. Όμηρος.
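Since the generic Unicode mappings never remove accents, plain titlecasing already gives this result. A quick Python check, offered as an observation about `str.title()` rather than as a Greek-aware routine:

```python
name = "όμηρος"
titled = name.title()
print(titled)

# The initial capital still carries its tonos: U+038C, omicron with tonos.
print(titled[0] == "\u038c")
```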

Turkish, Azeri, Tatar and Bashkir pair dotted and undotted i’s, which requires language-specific handling for case conversion. For example, the name of the Turkish city “Diyarbakır” contains both the dotted and the dotless letter i. When rendered into upper case, this word appears like this: DİYARBAKIR.
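Python's built-in case methods take no locale, so the dotted and dotless letters have to be remapped by hand before calling them. This is a minimal sketch; `turkish_upper` and `turkish_lower` are hypothetical helpers, and real applications would more likely use a locale-aware library such as ICU.

```python
# Remap the letters whose pairing differs from the generic Unicode mapping.
TR_UPPER_FIX = str.maketrans({"i": "\u0130"})                  # i -> dotted capital I
TR_LOWER_FIX = str.maketrans({"I": "\u0131", "\u0130": "i"})   # I -> dotless i, dotted capital I -> i

def turkish_upper(text: str) -> str:
    return text.translate(TR_UPPER_FIX).upper()

def turkish_lower(text: str) -> str:
    return text.translate(TR_LOWER_FIX).lower()

city = "Diyarbakır"
print(city.upper())           # DIYARBAKIR (generic mapping loses the dot)
print(turkish_upper(city))    # DİYARBAKIR
print(turkish_lower(turkish_upper(city)) == city.lower())   # True
```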

Lithuanian also has language-specific rules that retain the dot over i when combined with accents, eg. i̇̀ i̇́ i̇̃, whereas the capital I has no dot.
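For lowercasing, the Lithuanian behaviour amounts to inserting an explicit combining dot above (U+0307) between the i and any accent placed above it. The sketch below is my reading of that rule, assuming it applies to I and J as base letters and to combining marks with canonical combining class 230; `lithuanian_lower_sketch` is a hypothetical name, and the result is returned in NFC.

```python
import unicodedata

DOT_ABOVE = "\u0307"

def lithuanian_lower_sketch(text: str) -> str:
    out = []
    pending_dot = False  # True after I or J until we see whether an accent follows
    for ch in unicodedata.normalize("NFD", text):
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            # A new base character starts; only I and J can need the dot.
            pending_dot = ch in ("I", "J")
        elif ccc == 230 and pending_dot:
            # First accent above the letter: keep the dot visible beneath it.
            if ch != DOT_ABOVE:
                out.append(DOT_ABOVE)
            pending_dot = False
        out.append(ch.lower())
    return unicodedata.normalize("NFC", "".join(out))

# Capital I with grave becomes i + dot above + grave.
print(lithuanian_lower_sketch("\u00cc") == "i\u0307\u0300")   # True
print(lithuanian_lower_sketch("VILNIUS"))                     # vilnius
```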

Sometimes European French omits accents from uppercase letters, whereas French Canadian typically does not. However, this is more of a stylistic than a linguistic rule. Sometimes French people uppercase œ to OE, but this is mostly due to issues with lack of keyboard support, it seems (as is the issue with French accents).

Capitalisation may ignore leading symbols and punctuation for a word, and titlecase the first casing letter. This applies not only to non-letters. A letter such as the (non-casing version of the) glottal stop, ʔ, may be ignored at the start of a word, and the following letter titlecased, in IPA or Americanist phonetic transcriptions. (Note that, to avoid confusion, there are separate case paired characters available for use in orthographies such as Chipewyan, Dogrib and Slavey. These are Ɂ and ɂ.)
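One way to approximate this is to skip forward to the first character that actually has a case mapping and titlecase that one. A minimal sketch follows; `titlecase_first_cased` is a hypothetical helper, and "has a case mapping" is tested here simply by checking whether any of Python's case methods changes the character.

```python
def titlecase_first_cased(word: str) -> str:
    """Titlecase the first character that has a case mapping, leaving the rest as-is."""
    for index, ch in enumerate(word):
        if ch.lower() != ch or ch.upper() != ch or ch.title() != ch:
            return word[:index] + ch.title() + word[index + 1:]
    return word  # nothing to titlecase

print(titlecase_first_cased("\u00a1hola"))   # inverted exclamation mark is skipped
print(titlecase_first_cased("\u0294aq"))     # caseless glottal stop is skipped
print(titlecase_first_cased("\u0242a"))      # cased glottal stop is itself titlecased
```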

Another issue for titlecasing is that not all words in a sequence are necessarily titlecased. German uses capital letters to start noun words, but not verbs or adjectives. French and Italian may expect to titlecase the ‘A’ in “L’Action”, since that is the start of a word. In English, it is common not to titlecase words like ‘for’, ‘of’, ‘the’ and so forth in titles.
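For the English convention, a word list is usually involved. The sketch below is only an illustration: the `SMALL_WORDS` set is a made-up sample, style guides differ on which words belong in it, and the hypothetical `english_title` helper always capitalises the first word.

```python
# Illustrative sample only; real style guides each have their own list.
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def english_title(text: str) -> str:
    out = []
    for index, word in enumerate(text.split()):
        lower = word.lower()
        if index > 0 and lower in SMALL_WORDS:
            out.append(lower)                          # leave small words in lowercase
        else:
            out.append(lower[:1].upper() + lower[1:])  # capitalise everything else
    return " ".join(out)

print(english_title("the lord of the rings"))   # The Lord of the Rings
print("the lord of the rings".title())          # The Lord Of The Rings
```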

Unicode provides only algorithms for generic case conversion and case folding. CLDR provides some more detail, though it is hard to programmatically achieve all the requirements for case conversion.

Case folding is a way of converting to a standard sequence of (lowercase) characters that can be used for comparisons of strings. (Note that this sequence may not represent normal lowercase text: for example, both the uppercase Greek sigma and lowercase final sigma are converted to a normal sigma, and the German ß is converted to ‘ss’.) There are also different flavours of case folding available: common, full, and simple.
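In Python, `str.casefold()` provides case folding (the full flavour, as far as I can tell from the documentation). A short sketch of caseless comparison with it; `same_ignoring_case` is a hypothetical helper, and note that it does no Unicode normalisation.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison based on case folding rather than lowercasing."""
    return a.casefold() == b.casefold()

print(same_ignoring_case("Straße", "STRASSE"))      # True
print("Straße".lower() == "STRASSE".lower())        # False: lowercasing keeps the sharp s

# Capital sigma and final sigma both fold to the ordinary small sigma.
print(same_ignoring_case("οδυσσευς", "ΟΔΥΣΣΕΥΣ"))   # True
print("\u03c2".casefold() == "\u03c3")              # True
```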