The HTML5 specification contains a number of new features to support bidirectional text in web pages. Text in languages written with right-to-left scripts, such as Arabic, Hebrew, Persian, Thaana, Urdu, etc., commonly mixes in words or phrases in English or some other language that uses a left-to-right script. The result is called bidirectional, or bidi, text.
HTML 4.01 coupled with the Unicode Bidirectional Algorithm already does a pretty good job of managing bidirectional text, but there are still some problems when dealing with embedded text from user input or from stored data.
Here’s an example where the names of restaurants are added to a page from a database. This is the code, with the Hebrew shown as uppercase ASCII:
<p>Aroma - 3 reviews</p>
<p>PURPLE PIZZA - 5 reviews</p>
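And here’s what you’d expect to see for the second line, and what you’d actually see. The Hebrew is again shown as uppercase ASCII, this time reversed to indicate that it reads from right to left:
Expected: AZZIP ELPRUP - 5 reviews
Actual: 5 - AZZIP ELPRUP reviews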
The problem arises because the browser thinks that the “ - 5” is part of the Hebrew text. This is what the Unicode Bidirectional Algorithm tells it to do, and usually it is correct. Not here, though.
So the question is: how do we fix it?
<bdi> to the rescue
The trick is to use the bdi element around the text to isolate it from its surrounding content. (bdi stands for ‘bidi-isolate’.)
<p><bdi>Aroma</bdi> - 3 reviews</p>
<p><bdi>PURPLE PIZZA</bdi> - 5 reviews</p>
The bidi algorithm now treats the Hebrew and “- 5” as separate chunks of content, and orders those chunks according to the direction of the overall context, i.e. from left to right here.
You’ll notice that the example above has bdi around the name Aroma too. Of course, you don’t actually need that, but it won’t do any harm. On the other hand, it means you can write a script in something like PHP that says:
foreach ($restaurants as $restaurant) {
  // isolate every name with bdi
  echo "<bdi>{$restaurant['name']}</bdi> - {$restaurant['reviews']} reviews";
}
This means you can handle any name that comes out of the database, whatever script it is in.
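For readers who generate pages outside PHP, here is a minimal TypeScript sketch of the same idea. The `Restaurant` shape and the `escapeHtml` and `renderRestaurant` helpers are hypothetical names made up for this illustration, and the escaping step is an addition that the snippet above leaves out. Treat it as one possible way to do it rather than a recommended implementation.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface Restaurant {
  name: string;
  reviews: number;
}

// Escape characters that are significant in HTML so a stored name cannot inject markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap every name in bdi, whatever its script, so the template needs no per-language logic.
function renderRestaurant(r: Restaurant): string {
  return `<p><bdi>${escapeHtml(r.name)}</bdi> - ${r.reviews} reviews</p>`;
}

const sampleRestaurants: Restaurant[] = [
  { name: "Aroma", reviews: 3 },
  { name: "\u05E4\u05D9\u05E6\u05D4", reviews: 5 }, // a Hebrew name
];

const restaurantMarkup: string = sampleRestaurants.map(renderRestaurant).join("\n");
console.log(restaurantMarkup);
```

The template is identical for both rows, which is the point: the code never has to inspect the name to decide whether it needs isolating.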
bdi isn’t supported fully by all browsers yet, but it’s coming.
Things to avoid
- Using the dir attribute on a span element
You may think that something like this would work:
<p><span dir=rtl>PURPLE PIZZA</span> - 5 reviews</p>
But actually that won’t make any difference, because it doesn’t isolate the content of the span from what surrounds it.
- Using Unicode control characters
You could actually produce the desired result in this case using U+200E LEFT-TO-RIGHT MARK just before the hyphen.
<p>PURPLE PIZZA &lrm;- 5 reviews</p>
For a number of reasons, however, it is better to use markup. Markup is part of the structure of the document, it avoids the need to add logic to the application to choose between LRM and RLM, and it doesn’t cause search failures like the Unicode characters sometimes do. Also, the markup can neatly deal with any unbalanced embedding controls inadvertently left in the embedded text.
- Using CSS
CSS has also been updated to allow you to isolate text, but you should always use dedicated markup for bidi rather than CSS. This means that the information about the directionality of the document is preserved even in situations where the CSS is not available.
- Using bdo
Although it sounds similar, and it’s used for bidi text too, the bdo element is very different. It overrides the bidi algorithm altogether for the text it contains, and doesn’t isolate its contents from the surrounding text.
Using the dir attribute with bdi
The dir attribute can be used on the bdi element to set the base direction. With simple strings of text like PURPLE PIZZA you don’t really need it. However, if your bdi element contains text that is itself bidirectional, you’ll want to indicate the base direction.
Until now, you could only set the dir attribute to ltr or rtl. The problem is that in a situation such as the one described above, where you are pulling strings from a database or user, you may not know which of these you need to use.
That’s why HTML5 has provided a new auto value for the dir attribute, and bdi comes with that set by default. The auto value tells the browser to look at the first strongly typed character in the element (the first character with a strong direction of its own, such as a letter, rather than a digit, space or punctuation mark) and work out from that what the base direction of the element should be. If it’s a Hebrew (or Arabic, etc.) character, the element will get a direction of rtl. If it’s, say, a Latin character, the direction will be ltr.
There are some rare corner cases where this may not give the desired outcome, but in the vast majority of cases it should produce the expected result.
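To make the first-strong heuristic concrete, here is a rough TypeScript approximation of what the auto value does. It is only a sketch: the browser uses the full Unicode bidi character classes, whereas the hypothetical helpers below (`isStrongRtl`, `isStrongLtr`, `guessBaseDirection`) only recognise ASCII letters as left-to-right and a few Hebrew and Arabic code ranges as right-to-left. Those ranges are simplifying assumptions, not the real definition.

```typescript
type BaseDirection = "ltr" | "rtl";

// Rough test for right-to-left characters. The ranges are a simplification:
// Hebrew through Arabic Extended-A, plus the Hebrew and Arabic presentation forms.
// Digits and combining marks inside these blocks are not filtered out.
function isStrongRtl(unit: number): boolean {
  return (
    (unit >= 0x0590 && unit <= 0x08ff) ||
    (unit >= 0xfb1d && unit <= 0xfdff) ||
    (unit >= 0xfe70 && unit <= 0xfefc)
  );
}

// Rough test for left-to-right characters: ASCII letters only (an assumption of this sketch).
function isStrongLtr(unit: number): boolean {
  return (unit >= 0x41 && unit <= 0x5a) || (unit >= 0x61 && unit <= 0x7a);
}

// Scan for the first character that has a direction of its own and use it as the base direction.
function guessBaseDirection(text: string): BaseDirection {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (isStrongRtl(unit)) return "rtl";
    if (isStrongLtr(unit)) return "ltr";
  }
  // Nothing but digits, spaces or punctuation: this sketch falls back to ltr.
  return "ltr";
}

console.log(guessBaseDirection("Aroma")); // ltr
console.log(guessBaseDirection("\u05E4\u05D9\u05E6\u05D4 5")); // rtl
console.log(guessBaseDirection("W3C \u05D1\u05E2\u05D1\u05E8\u05D9\u05EA")); // ltr
```

The last call shows the kind of corner case mentioned above: a string that is mostly Hebrew but happens to start with a Latin word is given a left-to-right base direction.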
Want another use case?
Here’s another situation where bdi can be useful. This time we are constructing multilingual breadcrumbs on the W3C i18n site. The page titles are generated by a script, and this page is in Hebrew, so the base direction is right-to-left.
Again, here’s what you’d expect to see, and what you’d actually see.
Whereas in the previous example we were dealing with a number that was confused about its directionality, here we are dealing with a list of items in the same script, sitting in a context whose base direction is the opposite of theirs.
If you wanted to generate markup that would produce the right ordering, whatever combination of titles was thrown at it, you could wrap each title in a bdi element.
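As a sketch of what such a generator might look like, the TypeScript below wraps every breadcrumb title in its own bdi. The function names, the paragraph container and the " &gt; " separator are all assumptions made for illustration. The actual W3C script is not shown here and may work quite differently.

```typescript
// Minimal HTML escaping so a title cannot inject markup.
function escapeTitle(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wrap each title in its own bdi so that neighbouring titles in the same
// script are not merged into a single directional run. The dir attribute
// on the container then decides the order of the items.
function renderBreadcrumbs(titles: string[], pageDir: "ltr" | "rtl"): string {
  const items = titles.map((title) => `<bdi>${escapeTitle(title)}</bdi>`);
  return `<p dir="${pageDir}">${items.join(" &gt; ")}</p>`;
}

console.log(renderBreadcrumbs(["Home", "Articles", "Inline markup"], "rtl"));
```

Because every title is isolated, the same function can be used for any mix of Hebrew, Arabic and Latin-script titles, on pages with either base direction.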
Want more information?
The inclusion of these features has been championed by Aharon Lanin of Google within the W3C Internationalization (i18n) Working Group. He is the editor of a W3C Working Draft, Additional Requirements for Bidi in HTML, that tracks a range of proposals made to the HTML5 Working Group, giving rationales and recording resolutions. (The bdi element started out as a suggestion to include a ubi attribute.)
If you’d like more information on handling bidi in HTML in general, try Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
And here’s the description of bdi in the HTML5 spec.