
I was asked to make available the code for my normalization functions in JavaScript and PHP. The links are below. I’m making the code available under a Creative Commons Attribution-Noncommercial-Share Alike licence.

Disclaimers

Note that I make no claim to have produced polished, compact or well-optimised code! The code does what I need, and I’m happy with that. You are welcome to suggest improvements, and I’m sure there are many that could be made.

As they say, this code is made available in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

The code is a little more convoluted than it ought to be, to get around the fact that JavaScript doesn’t understand supplementary characters, and PHP just doesn’t naturally understand Unicode. (How I long for PHP6.)
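To illustrate the supplementary character problem: JavaScript strings are sequences of 16-bit code units, so a character above U+FFFF arrives as a surrogate pair that has to be recombined by hand. The following is only a minimal sketch of that idea, not the code in the linked routines; the function name toCodePoints is made up for this example.

```javascript
// Sketch: walk a string's UTF-16 code units and combine surrogate pairs
// into single code points. toCodePoints is a hypothetical name.
function toCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one supplementary character.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // skip the low surrogate we just consumed
        continue;
      }
    }
    result.push(unit); // BMP character (or an unpaired surrogate, passed through as-is)
  }
  return result;
}
```

For example, toCodePoints('a\uD800\uDC00') yields the two code points 0x61 and 0x10000, even though the string's length property reports 3.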

Update: I meant to mention that there is a way of doing normalization in PHP already. I made this code available just because I had it. I created it as a learning exercise. It may be useful, however, if you are unable to load the ICU and intl packages onto your server.

To use the code, simply call nfc('your-text-string') or nfd('your-text-string') from your code and capture the result.
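As a rough illustration of what to expect from such a call, here is a sketch using stand-in functions with the same calling convention. Note the assumption: these stand-ins simply delegate to the String.prototype.normalize method built into current JavaScript engines. They are not the routines linked from this post, which implement normalization themselves.

```javascript
// Stand-in wrappers with the same calling convention as nfc() and nfd() above.
// Assumption: they delegate to the built-in String.prototype.normalize;
// they are NOT the routines described in this post.
function nfc(text) { return text.normalize('NFC'); }
function nfd(text) { return text.normalize('NFD'); }

var composed = nfc('e\u0301');   // e + COMBINING ACUTE ACCENT
var decomposed = nfd('\u00E9');  // LATIN SMALL LETTER E WITH ACUTE
```

Here composed holds the single character U+00E9, and decomposed holds the two-character sequence U+0065 U+0301.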

For PHP you’ll need these routines and this data.

For JavaScript look at these routines and this data. There is also a lite version of the data file that doesn’t include Han characters. I use this sometimes for bandwidth savings (about 14K less).

Test files

I also created some test files for PHP and for JavaScript. Both of these expect to find a copy of http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt in the local directory. These files run 71,076 tests.

Cautions

Be careful about the editor you use for the data files. I spent several hours fruitlessly debugging the routines, only to find that Notepad++ was displaying certain supplementary characters OK, but corrupting them on save. I switched to Notepad and the problem evaporated. And I probably don’t need to add that editing the data files in something like DreamWeaver is a bad idea because it will probably normalize the data before saving.

Another point: you may see Unicode replacement characters at a couple of points in the PHP source. These represent the first and last characters in the high surrogate range.

Experimenting

If you want to play with something that uses this you could try my Tłįchǫ (Dogrib) character picker, or my Normalizer tool. I will slowly fit this to all the pickers and to UniView. I have a local version of UniView waiting in the wings that uses the PHP files via AJAX, to reduce download size. For that you need a file that returns the result as plain text across the wire, such as this.

Well, I hope that that may be of use to someone, somewhere. I hope I haven’t forgotten anything.

Here are some lists of characters that are useful for normalization. I’ll probably add some others later.

The lists apply to Unicode version 5.1.

The files below contain declarations for JavaScript sparse arrays. They are easy enough to convert to other formats using global search and replace. The verbose version provides character names and code points.

Combining characters with non-zero properties

Characters with non-zero combining properties are assigned to a sparse array indexed by codepoint. The value gives the combining property value.

http://rishida.net/code/normalization/nonzerocombiningchars.txt

http://rishida.net/code/normalization/nonzerocombiningchars-verbose.txt

There are 498 of these.
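To show how a list like this gets used, here is a minimal sketch of canonical reordering driven by a sparse array of combining classes. The array name ccc and the functions are hypothetical, and only three entries are filled in for illustration; the real files define all 498.

```javascript
// Hypothetical miniature of the sparse array: index = code point, value = combining class.
var ccc = [];
ccc[0x0301] = 230; // COMBINING ACUTE ACCENT
ccc[0x0323] = 220; // COMBINING DOT BELOW
ccc[0x0327] = 202; // COMBINING CEDILLA

// Anything not in the array has combining class 0 (a starter).
function combiningClass(cp) { return ccc[cp] || 0; }

// Canonical ordering: within each run of non-zero classes, sort by class.
// An insertion sort keeps equal classes in their original order and never
// moves a mark past a starter.
function reorder(cps) {
  var out = cps.slice();
  for (var i = 1; i < out.length; i++) {
    var cp = out[i];
    var c = combiningClass(cp);
    if (c === 0) continue;
    var j = i;
    while (j > 0 && combiningClass(out[j - 1]) > c) {
      out[j] = out[j - 1];
      j--;
    }
    out[j] = cp;
  }
  return out;
}
```

With this table, reorder([0x61, 0x0301, 0x0323]) moves the dot below (class 220) in front of the acute accent (class 230) and leaves the base letter where it is.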

Canonically decomposable characters for NFD

This list maps single characters to their decompositions. The single character is referenced by an index into the array, and the value for that index is the decomposed characters.

http://rishida.net/code/normalization/canonicaldecomposables.txt

http://rishida.net/code/normalization/canonicaldecomposables-verbose.txt

There are 2042 of these characters.
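A lookup against this kind of list might look like the sketch below. The array name decomp, the use of arrays of code points as values, and the assumption that mappings are stored one level deep (so they need to be applied recursively) are all choices made for this example; the actual files may store the values differently.

```javascript
// Hypothetical miniature of the decomposition list: index = code point,
// value = the code points it decomposes to (one level only, by assumption).
var decomp = [];
decomp[0x00E9] = [0x0065, 0x0301]; // é = e + acute
decomp[0x00EA] = [0x0065, 0x0302]; // ê = e + circumflex
decomp[0x1EBF] = [0x00EA, 0x0301]; // ế = ê + acute

// Apply the mappings recursively until nothing decomposes any further.
function decompose(cp) {
  var mapping = decomp[cp];
  if (!mapping) return [cp];
  var out = [];
  for (var i = 0; i < mapping.length; i++) {
    out = out.concat(decompose(mapping[i]));
  }
  return out;
}
```

So decompose(0x1EBF) first yields U+00EA U+0301, then expands U+00EA in turn, giving U+0065 U+0302 U+0301.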

The following code converts a hex codepoint to a sequence of bytes that represent the Unicode codepoint in UTF-8.

This is useful because PHP’s chr() function only works on ASCII :((.

function cp2utf8 ($hexcp) {
	$outputString = '';
	$n = hexdec($hexcp);
	if ($n <= 0x7F) {          // 1 byte: ASCII
		$outputString .= chr($n);
		}
	else if ($n <= 0x7FF) {    // 2 bytes
		$outputString .= chr(0xC0 | (($n>>6) & 0x1F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0xFFFF) {   // 3 bytes
		$outputString .= chr(0xE0 | (($n>>12) & 0x0F))
		.chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0x10FFFF) { // 4 bytes: supplementary characters
		$outputString .= chr(0xF0 | (($n>>18) & 0x07))
		.chr(0x80 | (($n>>12) & 0x3F)).chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else {
		// PHP concatenates strings with '.', not '+'
		$outputString .= 'Error: ' . $n . ' not recognised!';
		}
	return $outputString;
	}
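For a worked example of the bit arithmetic, here is the same conversion sketched in JavaScript, returning an array of byte values rather than a string. The function name cpToUtf8Bytes is made up for this illustration, and it takes a number rather than a hex string.

```javascript
// Sketch of the same UTF-8 encoding arithmetic. cpToUtf8Bytes is a hypothetical
// name; it takes a numeric code point and returns the UTF-8 bytes as an array.
function cpToUtf8Bytes(n) {
  if (n <= 0x7F) {
    return [n];
  }
  if (n <= 0x7FF) {
    return [0xC0 | ((n >> 6) & 0x1F), 0x80 | (n & 0x3F)];
  }
  if (n <= 0xFFFF) {
    return [0xE0 | ((n >> 12) & 0x0F), 0x80 | ((n >> 6) & 0x3F), 0x80 | (n & 0x3F)];
  }
  if (n <= 0x10FFFF) {
    return [0xF0 | ((n >> 18) & 0x07), 0x80 | ((n >> 12) & 0x3F),
            0x80 | ((n >> 6) & 0x3F), 0x80 | (n & 0x3F)];
  }
  throw new Error('Code point out of range: ' + n);
}
```

Taking U+20AC (the euro sign) as input: the top four bits give 0xE2, the middle six give 0x82, and the bottom six give 0xAC, so the result is the byte sequence E2 82 AC.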

Some code I put together to import some XML retrieved via AJAX into a document (stored here so I can find it again in the future).

IE won’t let you import a cloned nodeset into a document, so I wrote this for my UniView utility. The code starts with a node in the AJAX data and creates a copy of all elements and attributes in the current document.

function copyNodes (ajaxnode, copiednode) {
	for (var node=ajaxnode.firstChild; node != null; node = node.nextSibling) {
		if (node.nodeType == 3){ //text
			copiednode.appendChild(document.createTextNode(node.data));
			}
		if (node.nodeType == 1){ //element
			var subnode = document.createElement(node.nodeName);
			var attlist = node.attributes;
			if (attlist != null) {  
				for (var i=0; i<attlist.length; i++){
					subnode.setAttribute(attlist[i].name, attlist[i].value);
					}
				}
			copiednode.appendChild(subnode);
			copyNodes(node, subnode);
			}
		}
	}

It doesn’t expect processing instructions, comments etc. Just elements and attributes. (Though of course that can be added, if needed.)

I always forget how to get around the namespace issue when transforming XHTML files to XHTML using XSL, and it always takes ages for me to figure it out again. So I’m going to make a note here to remind me. This seems to work:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="2.0"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:html="http://www.w3.org/1999/xhtml" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/02/xpath-functions" xmlns:xdt="http://www.w3.org/2005/02/xpath-datatypes"
xmlns:saxon="http://icl.com/saxon"
exclude-result-prefixes="saxon fn xs xdt html">

<xsl:output method="xhtml" encoding="UTF-8"
doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" indent="no" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" />

Then you need to refer to elements in the source to be converted by using the html: namespace prefix, e.g. <xsl:template match="html:div">…</xsl:template>.

I always have to look up the template that copies everything not fiddled with in the other templates, too, so here it is, for good measure:

<xsl:template match="@*|node()">
	<xsl:copy>
		<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

Some applications insert a signature or Byte Order Mark (BOM) at the beginning of UTF-8 text. For example, Notepad always adds a BOM when saving as UTF-8.

Older text editors or browsers will display the BOM as a blank line on-screen, others will display unexpected characters, such as ï»¿. This may also occur in the latest browsers if a file that starts with a BOM is included into another file by PHP.

For more information, see the article Unexpected characters or blank lines and the test pages and results on the W3C site.

If you have problems that you think might be related to this, the following may help.

Checking for the BOM

I created a small utility that checks for a BOM at the beginning of a file. Just type in the URI for the file and it will take a look. (Note, if it’s a file included by PHP that you think is causing the problem, type in the URI of the included file.)
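The check itself is simple: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The sketch below shows the idea on an array of byte values; it is not the code behind the utility mentioned above, and the function names are made up for this example.

```javascript
// Sketch: detect and strip a UTF-8 BOM (bytes EF BB BF) at the start of some data.
// hasUtf8Bom and stripUtf8Bom are hypothetical names; `bytes` can be a plain
// array of numbers or a Uint8Array.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
         bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

function stripUtf8Bom(bytes) {
  // Return the data without the first three bytes if they are a BOM.
  return hasUtf8Bom(bytes) ? bytes.slice(3) : bytes;
}
```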

Removing the BOM

If there is a BOM, you will probably want to remove it. One way would be to save the file using a BOM-aware editor that allows you to specify that you don’t want a BOM at the start of the file. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text “Include Unicode Signature (BOM)”. Just uncheck the box and save.

Another way would be to run a script on your file. Here is some simple Perl scripting to check for a BOM and remove it if it exists (developed by Martin Dürst and tweaked a little by myself).

# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
    }

my @file;   # file content
my $lineno = 0;

my $filename = $ARGV[0];
if ($filename) {
    open( BOMFILE, $filename ) || die "Could not open source file for reading.";
    while (<BOMFILE>) {
        if ($lineno++ == 0) {
            # the substitution returns true only if a BOM was actually stripped
            if ( s/^\xEF\xBB\xBF// ) {
                print "BOM found and removed.\n";
                }
            else { print "No BOM found.\n"; }
            }
        push @file, $_ ;
        }
    close (BOMFILE)  || die "Can't close source file after reading.";

    open (NOBOMFILE, ">$filename") || die "Could not open source file for writing.";
    foreach $line (@file) {
        print NOBOMFILE $line;
        }
    close (NOBOMFILE)  || die "Can't close source file after writing.";
    }
else {  # STDIN -> STDOUT
    while (<>) {
    if (!$lineno++) {
        s/^\xEF\xBB\xBF//;
        }
    push @file, $_ ;
    }

    foreach $line (@file) {
        print $line;
        }
    }

This post was updated. See bottom of post for details.

I developed a set of JavaScript routines (W3C DOM standard) for hiding and revealing information on a page that you should be able to plug in to a wide range of content. Please feel free to use the code (though an acknowledgement would be nice.)

Files
JavaScript: expandcollapse.js

CSS: expandcollapsestyle.css
Original HTML: news.html
Resulting HTML: newsWithJS.html

The uncollapsed text.

We’ll illustrate how to apply this with an example. The picture shows what it looks like initially. (View the HTML.) We’ll collapse the additional news after the first item to just the headlines, but allow you to reveal the detail by clicking or tabbing and hitting return.

[I applied this to a hacked down version of someone else’s page, because I was short on time. This is good, in that it shows that it’s easy to apply this to existing pages. However, due to my hacking, the general markup of the page may look a little strange in parts. Please ignore that.]

Structuring the content

The markup of content you want to hide and reveal may be structured in a number of ways. This approach assumes that:

  1. you will click on a block element (which we will call the trigger) to cause some content below it to expand or contract
  2. the content revealed/hidden by clicking on the trigger can be in any number of block elements of any type. (You can also include other block elements above the trigger, if you like, though they won’t be hidden.)
  3. each trigger and its revealable content is bounded by a block element. (We will use a div, but it could be any block element.)
  4. all the expanding and collapsing content is surrounded by another element with an id. This allows you to work with expanding content in different areas on the same page separately. (Again we use a div, with the id otherNews, but the id could just as easily be on the body element, since we only have one area of affected content on this page.)

The diagram below shows the arrangement used in the example file. The trigger element is red. The content to be hidden/revealed is green. You don’t have to use an h3 as the trigger. You could even use an ordinary paragraph tag. If you do, however, you should use a class name on each trigger element, so that the trigger can be identified.

Note that the trigger element should not contain an <a> element, since the JavaScript will add an <a> element to create a clickable zone. (It doesn’t make sense, anyway.)

The structure of the content in the example.

Setting up the markup

Very little change is required to the markup.

What I did

Add this to the document head:
<script type="text/javascript" src="expandcollapse.js"></script>
<link rel="stylesheet" type="text/css" href="expandcollapsestyle.css"/>

Add this to the body element start tag:
onload="setCollapseExpand('otherNews', 'h3','');revealControl('On'); revealControl('Off');"

Add this next to the RSS icon, just above the expanding content:
<a id="On" name="On" onclick="openAll('otherNews', 'h3','');" href="#_" class="hideIfNoJS">Open All</a>
<a id="Off" name="Off" onclick="closeAll('otherNews', 'h3','');" href="#_" class="hideIfNoJS">Close All</a>

Notes:

  1. onload="setCollapseExpand('otherNews', 'h3','');"

    After the document has loaded, this collapses the content.

    The JavaScript will look through the div with id otherNews for all h3 elements. It then finds the parent of the h3 element, and adds a class name to all the remaining elements after the h3 within that parent (a div, in our case). The class is associated with styling that makes these elements disappear. It will also surround the contents of the h3 element with an a element. This allows keyboard users to access the functionality using tabs. Each a element is given an onclick function to enable it to toggle the hidden content on or off.

    If we had wanted to use an ordinary p tag with a class name of, say, trigger rather than the h3, the onload code would look like this:

    onload="setCollapseExpand('otherNews', 'p','trigger');"
  2. Optional. You may want to add some buttons to expand and collapse all text in one go. If so you’ll need to add these to the markup. In our example I added the following code alongside the RSS feed icon. I used an a element so that keyboard users can tab to it.

    <a id="On" name="On" onclick="openAll('otherNews', 'h3','');" 
       href="#_" class="hideIfNoJS">Open All</a>
    <a id="Off" name="Off" onclick="closeAll('otherNews', 'h3','');" 
       href="#_" class="hideIfNoJS">Close All</a>
    

    I added the class name hideIfNoJS to each a element. We can now use CSS to hide this text unless JavaScript is detected.

    We then need to add two more statements to the onload value on the body tag, one for each a element.

    revealControl('On'); revealControl('Off');

    After the document loads, the JavaScript will remove those class names, and the switches will become visible.

  3. <link rel="stylesheet" type="text/css" href="expandcollapsestyle.css"/>

    CSS will drive most of the behaviour. The JavaScript simply changes the class names associated with the markup. This references a stylesheet that will do all the heavy lifting.
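The class-swapping idea in the notes above can be sketched roughly as follows. This is a simplified illustration, not the code in expandcollapse.js: the function name toggleContent is made up, and for brevity it overwrites className outright, which would discard any other classes on those elements. The class names are the ones used in the stylesheet.

```javascript
// Simplified sketch, NOT the code in expandcollapse.js. toggleContent is a
// hypothetical name. It flips the trigger between triggerOpen and triggerClosed
// and swaps the class on every element that follows it inside the same parent.
// Assumption: overwriting className is acceptable (other classes are lost).
function toggleContent(trigger) {
  var closing = trigger.className === 'triggerOpen';
  trigger.className = closing ? 'triggerClosed' : 'triggerOpen';
  for (var node = trigger.nextSibling; node != null; node = node.nextSibling) {
    if (node.nodeType === 1) { // elements only; skip text nodes
      node.className = closing ? 'hiddenContent' : 'revealedContent';
    }
  }
  return !closing; // true if the content is now revealed
}
```

Because the function only touches class names, all the visible behaviour (hiding, padding, the open/closed graphic) stays in the CSS.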

A walk through the CSS

Let’s take a look at the CSS in the expandcollapsestyle.css file.

First, we add some styling to the new ‘Open All’ and ‘Close All’ text we added. This will make this text look like small graphical buttons, and change the cursor to a pointing hand as we mouse over them.

   a#On, a#Off {
      padding: 0.1em 0.5em 0.1em 0.5em;
      margin: 0 0.5em 0 0;
      text-decoration: none;
      background: #005a9c;
      color: #fc6;
      font-weight: bold;
      cursor: pointer;
      }

Next, we add a rule to remove the ‘Open All’ and ‘Close All’ buttons from view initially. The revealControl calls in the body onload attribute will remove this class if JavaScript is enabled.

   .hideIfNoJS {
      display: none;
      }

Now, we style the trigger text (in our case the h3 elements).

The first set of rules adds a background graphic to show whether the content is revealed or not.

   .triggerOpen {
	background:url(http://www.w3.org/International/icons/open-thin.gif)
              no-repeat left 2px #fffaf0;
	}

   .triggerClosed {
	background:url(http://www.w3.org/International/icons/close-thin.gif)
              no-repeat left 2px #fffaf0;
	}

You can, of course, change the styling to suit yourself. For example, you may want to use a different graphic.

We also fix the colour of the trigger text, pad the left side of the text so that you can see the graphic, turn the cursor into a pointer, and make the trigger change colour as you mouse over it. (Note that the JavaScript has introduced this a element.)

.triggerOpen a, .triggerClosed a {
	padding-left:14px;
	color:#000;
	text-decoration: none;
	cursor: pointer;
	}

.triggerOpen a:hover, .triggerClosed a:hover {
	color:#00f;
	}

Finally, we add the styling for the content that will be hidden/revealed. The .hiddenContent class will be attached to content by the JavaScript to hide it.

   .hiddenContent {
      display: none;
      }

When that content is not hidden, it gets the revealedContent class. We added some styling to pad the left side of the blocks by the same amount as the trigger text.

   #otherNews .revealedContent {
      padding-left: 14px;
      }
   #otherNews ul.revealedContent {
      padding-left: 30px;
      margin-left: 0;
      }

The end result

The collapsed text.

This picture shows what you will see when you open the page in a user agent that has JavaScript turned on. (See the HTML.) If JavaScript is turned off, you will see exactly what you saw before.


Updates to this post

2007-07-01: Moved cursor:pointer from rules for .triggerOpen and .triggerClosed to the rules for ‘.triggerOpen a, .triggerClosed a’. Stops the pointer appearing to the right of the trigger text. Also added note about <a> in trigger.

2007-06-05: Small change to CSS to ensure that the expand/collapse works when clicking on the + or – icon too. (Moved the padding.)

2007-04-18: Largely rewrote the text to make it more readable, and to take into account changes made to the JavaScript and CSS files (which incorporate the ideas from several comments below).

(I’m making notes here so I can find these techniques again later.)

I wanted to use JavaScript (W3C DOM compliant) to wrap the content of a heading with an a element, i.e.

<h3>This is <em>my</em> header</h3>

needed to become

<h3><a href="#mytarget">This is <em>my</em> header</a></h3>

Here’s what I came up with:

var h = document.getElementsByTagName('h3')[0]; // grab the heading (here, the first h3)
var a = document.createElement('a');       // create an a element
    a.setAttribute('href', '#mytarget');   // set the href
while (h.childNodes.length > 0) {          // while the h3 still has child nodes
    a.appendChild( h.firstChild );         // move the first one to the a element
    }
h.appendChild(a);                          // stick a under the now empty h3

It seems so simple now to look at. Took me ages to figure it out. 🙁