Here are some lists of characters that are useful for normalization. I’ll probably add some others later.

The lists apply to Unicode version 5.1.

The files below contain declarations for JavaScript sparse arrays. They are easy enough to convert to other formats using global search and replace. The verbose version provides character names and code points.

Combining characters with non-zero properties

Characters with a non-zero canonical combining class are assigned to a sparse array indexed by code point. The value at each index is that character's combining class.

http://rishida.net/code/normalization/nonzerocombiningchars.txt

http://rishida.net/code/normalization/nonzerocombiningchars-verbose.txt

There are 498 of these.
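
One thing these values are needed for is putting runs of combining marks into canonical order. Here is a minimal sketch of that step, assuming the array is called combiningClass (a hypothetical name, so check the file for the actual declaration) and using only four hand-copied entries rather than the full list:

```javascript
// Hypothetical excerpt: the variable name and the choice of entries are
// assumptions for illustration; the real file covers all 498 characters.
var combiningClass = [];
combiningClass[0x0300] = 230; // COMBINING GRAVE ACCENT
combiningClass[0x0301] = 230; // COMBINING ACUTE ACCENT
combiningClass[0x0323] = 220; // COMBINING DOT BELOW
combiningClass[0x0327] = 202; // COMBINING CEDILLA

// Anything missing from the sparse array has combining class 0.
function getCombiningClass(cp) {
  return combiningClass[cp] || 0;
}

// Stable insertion sort over an array of code points: a combining mark
// moves left past marks with a higher class, but never past a character
// of class 0 and never past a mark of the same class.
function reorder(codePoints) {
  var out = codePoints.slice();
  for (var i = 1; i < out.length; i++) {
    var cp = out[i];
    var cc = getCombiningClass(cp);
    if (cc === 0) continue;
    var j = i;
    while (j > 0 && getCombiningClass(out[j - 1]) > cc) {
      out[j] = out[j - 1];
      j--;
    }
    out[j] = cp;
  }
  return out;
}

// a + acute (230) + dot below (220) becomes a + dot below + acute
var result = reorder([0x61, 0x0301, 0x0323]);
```

Because the array is sparse, the lookup falls back to 0 for every character that isn't listed, which is what makes the list short enough to ship.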

Canonically decomposable characters for NFD

This list maps single characters to their canonical decompositions. The code point of the single character is the index into the array, and the value at that index is the sequence of characters it decomposes to.

http://rishida.net/code/normalization/canonicaldecomposables.txt

http://rishida.net/code/normalization/canonicaldecomposables-verbose.txt

There are 2042 of these characters.
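
As a rough illustration of how such a map might be used, the sketch below decomposes a string by looking up each character and decomposing the result again. The array name decomposition, the use of strings as values, and the four entries shown are all assumptions made for the example; the actual declarations in the files may differ.

```javascript
// Hypothetical excerpt: name, value format and entries are assumptions.
var decomposition = [];
decomposition[0x00C5] = '\u0041\u030A'; // A WITH RING ABOVE -> A + ring above
decomposition[0x212B] = '\u00C5';       // ANGSTROM SIGN -> A WITH RING ABOVE
decomposition[0x1E63] = '\u0073\u0323'; // s WITH DOT BELOW -> s + dot below
decomposition[0x1E69] = '\u1E63\u0307'; // s WITH DOT BELOW AND DOT ABOVE
                                        //   -> s WITH DOT BELOW + dot above

// Replace each character by its decomposition, recursing so that the
// result is fully decomposed whether the list stores one-step or full
// mappings. Hangul syllables, which decompose algorithmically, are not
// handled here.
function decompose(str) {
  var out = '';
  for (var ch of str) { // for...of iterates by code point
    var mapped = decomposition[ch.codePointAt(0)];
    out += (mapped === undefined) ? ch : decompose(mapped);
  }
  return out;
}

var angstrom = decompose('\u212B'); // 'A' followed by U+030A
var sDots = decompose('\u1E69');    // 's' followed by U+0323, U+0307
```

This is only the mapping half of NFD. The decomposed marks still have to be put into canonical order using the combining class values from the first list.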

The following code converts a hex codepoint to a sequence of bytes that represent the Unicode codepoint in UTF-8.

This is useful because PHP’s chr() function only ever returns a single byte, so on its own it can’t produce anything beyond ASCII in UTF-8 :((.

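// Takes a code point as a hex string (eg. '20AC') and returns
// the UTF-8 byte sequence for that code point as a string.
function cp2utf8 ($hexcp) {
	$outputString = '';
	$n = hexdec($hexcp);
	if ($n <= 0x7F) {
		// 1 byte: 0xxxxxxx
		$outputString .= chr($n);
		}
	else if ($n <= 0x7FF) {
		// 2 bytes: 110xxxxx 10xxxxxx
		$outputString .= chr(0xC0 | (($n>>6) & 0x1F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0xFFFF) {
		// 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		$outputString .= chr(0xE0 | (($n>>12) & 0x0F))
		.chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0x10FFFF) {
		// 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		$outputString .= chr(0xF0 | (($n>>18) & 0x07))
		.chr(0x80 | (($n>>12) & 0x3F)).chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else {
		// beyond U+10FFFF; note that PHP concatenates with '.', not '+'
		$outputString .= 'Error: ' . $n . ' not recognised!';
		}
	return $outputString;
	}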
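
For checking the output of the PHP function, the same bit arithmetic can be sketched in JavaScript, the language used for the lists above. This is a port written for illustration, not part of the original code: the names cp2utf8Bytes and toHex are made up here, and unlike the PHP version it returns an array of byte values and throws on bad input instead of returning an error string.

```javascript
// Hypothetical port for cross-checking: takes a hex string such as '20AC'
// and returns the UTF-8 bytes for that code point as an array of numbers.
function cp2utf8Bytes(hexcp) {
  var n = parseInt(hexcp, 16);
  if (!(n >= 0 && n <= 0x10FFFF)) {
    throw new RangeError(hexcp + ' is not a valid code point');
  }
  if (n <= 0x7F) {
    return [n];
  }
  if (n <= 0x7FF) {
    return [0xC0 | (n >> 6), 0x80 | (n & 0x3F)];
  }
  if (n <= 0xFFFF) {
    return [0xE0 | (n >> 12), 0x80 | ((n >> 6) & 0x3F), 0x80 | (n & 0x3F)];
  }
  return [
    0xF0 | (n >> 18),
    0x80 | ((n >> 12) & 0x3F),
    0x80 | ((n >> 6) & 0x3F),
    0x80 | (n & 0x3F)
  ];
}

// Format the bytes as two-digit uppercase hex for easy comparison.
function toHex(bytes) {
  return bytes.map(function (b) {
    return b.toString(16).toUpperCase().padStart(2, '0');
  }).join(' ');
}

var eAcute = toHex(cp2utf8Bytes('E9'));    // "C3 A9"
var euro = toHex(cp2utf8Bytes('20AC'));    // "E2 82 AC"
var deseret = toHex(cp2utf8Bytes('10400')); // "F0 90 90 80"
```

Comparing these byte values with bin2hex(cp2utf8('20AC')) and so on in PHP is a quick way to confirm that each branch produces the expected sequence.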