Thursday, May 7, 2015

Displaying Sinhala characters on the web

The Sinhala language is spoken almost exclusively by the Sinhalese people of the small island of Sri Lanka, who make up roughly three quarters of its population of about 20 million. On a world scale it is a small language, which used to make seeing Sinhala characters on the web a delightful experience for a Sinhalese. Nowadays, with the breakthrough of Unicode, it has become so commonplace that there is nothing special about it. There are many Sinhala websites, half of the posts on my Facebook wall are in Sinhala, and, though it is still somewhat dodgy, Google Translate includes Sinhala.

Before Unicode, ASCII and ASCII-based 8-bit encodings were popular. They are still around, but Unicode is the norm. ASCII itself defines 128 codes, which cover the English letters, digits and punctuation; 8-bit extensions bring the total to 256. As a standard, code 65 is used for capital A in English fonts. In fonts that have glyphs for languages other than English, it is assigned to some letter of that language's alphabet, and in Sinhala fonts with the Wijesekara layout it stands for "hal kireema". The problem with this approach is that the same sequence of codes can display different glyphs depending on the font used. Text written in Sinhala using a Sinhala ASCII font, if viewed in a different font, could display nothing but gibberish. Or worse, if two languages contain similar letters, it could read as something meaningful yet different. So it is obvious that ASCII texts are very difficult to use universally. Hence Unicode.
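The same effect can be reproduced with legacy 8-bit code pages, which behave much like the font-specific encodings described above. The sketch below is only an analogy: it uses three standard code pages that ship with Python rather than an actual Sinhala ASCII font, and shows one byte value turning into three different letters.

```python
# One byte value, three legacy 8-bit code pages, three different letters.
# This is an analogy for font-specific encodings, not a Sinhala font.
raw = bytes([0xE9])

western = raw.decode("latin-1")     # Western European code page
cyrillic = raw.decode("cp1251")     # Windows Cyrillic code page
greek = raw.decode("iso8859_7")     # Greek code page

for label, ch in [("latin-1", western), ("cp1251", cyrillic), ("iso8859_7", greek)]:
    # The byte is identical each time; only the interpretation changes.
    print(f"{label:10} 0xE9 -> {ch} (U+{ord(ch):04X})")
```

With Unicode, each of those three letters has its own code point, so the text no longer depends on the reader guessing the right interpretation.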

Unicode started out with room for about 65,000 characters (65,536 code points), and now has 17 times as many (1,114,112), which allows every letter in every language in the world to have its own Unicode code. There is still an excess of codes, some of which are taken up by glyphs like ♥, ♫, ☯, ☺. With Unicode, the font used should not decide which letter of which language is displayed. It should matter only for the visual properties of the glyphs. Font makers are told which Unicode code should display which character. The 128 codes from U+0D80 through U+0DFF are reserved for Sinhala characters.
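As a quick check of the numbers above, here is a minimal Python sketch. The helper names `is_sinhala` and `describe` are made up for this post; only `ord` and the standard `unicodedata` module are real library calls.

```python
import unicodedata

# Sinhala block boundaries as given in the text.
SINHALA_FIRST = 0x0D80
SINHALA_LAST = 0x0DFF

# The original 16-bit range, and the full range available today.
PLANE_SIZE = 0x10000               # 65,536 code points
TOTAL_CODE_POINTS = 0x10FFFF + 1   # 1,114,112 code points


def is_sinhala(ch: str) -> bool:
    """Return True if the character's code point lies in the Sinhala block."""
    return SINHALA_FIRST <= ord(ch) <= SINHALA_LAST


def describe(text: str):
    """List (code point, Unicode name, in Sinhala block?) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), is_sinhala(ch))
        for ch in text
    ]


print(TOTAL_CODE_POINTS // PLANE_SIZE)     # how many times larger than 16 bits
print(SINHALA_LAST - SINHALA_FIRST + 1)    # size of the Sinhala block
for row in describe("සිංහල"):
    print(row)
```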

Obviously a font cannot contain glyphs for every Unicode code. If the selected font does not contain glyphs for certain Unicode characters, those characters are displayed in a font that does. Applications, including web browsers, select the fallback font depending on the way the system is configured. On my Ubuntu 14.04 machine Sinhala characters are displayed in the LKLUG font. That can be changed through the fontconfig configuration.

Now to displaying characters on the web. Earlier, the content of websites was displayed entirely in fonts installed on the viewer's system. Websites could optionally specify a font family, a single font, or a chain of fallback fonts. Still, a webpage could end up looking entirely different from what the developer expected because a certain font was not installed. This is still the case with many websites, but the introduction of webfonts could change all that.

Developers can specify a font to use and the place to get that font from, using the @font-face notation, so that the clients (web browsers) do everything they can to display the text in that font. Usually they fall back only if they cannot download the webfont from the specified location.
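To make the notation concrete, here is a small Python sketch that just assembles such a rule as text. The function name `font_face_rule`, the family name and the font path are placeholders I made up; `font-family`, `src` and `unicode-range` are standard @font-face descriptors.

```python
def font_face_rule(family: str, url: str, fmt: str = "woff",
                   unicode_range: str = "U+0D80-0DFF") -> str:
    """Build an @font-face rule for a webfont served from the given URL.

    The default unicode_range is the Sinhala block, so a browser only
    needs to fetch the font when the page actually contains Sinhala text.
    """
    return (
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f'  src: url("{url}") format("{fmt}");\n'
        f"  unicode-range: {unicode_range};\n"
        "}\n"
    )


# Hypothetical family name and path, for illustration only.
css = font_face_rule("MySinhalaWebfont", "/fonts/my-sinhala-webfont.woff")
# Use the webfont first, then fall back to whatever the system provides.
css += 'body { font-family: "MySinhalaWebfont", sans-serif; }\n'
print(css)
```

The fallback at the end of the `font-family` list is what the browser uses if the download fails.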

Early Sinhala websites used ASCII text. As none of the Sinhala fonts they could use was web safe, they asked users to download and install whatever font they were using. Notices appeared saying "Do you see Sinhala characters? If not, download and install this font", while gibberish appeared in whatever English ASCII font the web browser decided to fall back to. Only once the font was installed would the text look meaningful.

When Unicode came along, many Sinhala websites switched from ASCII to Unicode. The upside is that most systems include a Unicode font that covers the Sinhala characters, which gets rid of the step of downloading and installing fonts. Unfortunately this is also the downside. Most systems, not all: some do not include a Sinhala Unicode font. Android devices running KitKat and earlier are an example, and without rooting it is very difficult to install a new font there. The standard font in Lollipop includes glyphs for Sinhala characters, but some manufacturers like Sony removed them for reasons known only to them. Maybe they thought a few extra kilobytes were not worth an entire nation reading and writing in its native tongue. Sinhala websites like bbc.lk and lankadeepa.lk contain Unicode text, so they are readable on most PCs but not on most Android handhelds.

Then webfonts came along, which allow developers to ship a Sinhala Unicode font with the rest of the website's content. The text is then readable in most browsers, including the ones on Android devices. gossip.hirufm.lk does this. Many other Sinhala websites do not seem to.

gossiplankanews.lk, another gossip site!, uses webfonts but sticks to ASCII. If you copy text from their site and paste it somewhere else, you can see the gibberish it truly is. But at least, since they use webfonts, the content should be readable on systems without Sinhala Unicode fonts, Android devices included.

If the developers of Sinhala websites use webfonts with Unicode content, they can increase their audience. fontsquirrel is a good place, among others, to generate a webfont kit. The hodipotha font from icta.lk is released under a Creative Commons license, so it can be used to generate the webfont kit.

In fontsquirrel it is important to choose the expert option and pick no subsetting. Otherwise it will generate webfonts that contain only characters in the Western range, omitting the Sinhala characters.
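Why subsetting matters can be sketched in a few lines of Python. The ranges and the helper `uncovered` are assumptions made for illustration: a subset is modelled here as a plain list of code point ranges, which is not how fontsquirrel represents it internally.

```python
# Code point ranges, as (first, last) pairs.
BASIC_LATIN = (0x0020, 0x007E)   # a typical "western only" subset
SINHALA = (0x0D80, 0x0DFF)       # the Sinhala block


def uncovered(text, ranges):
    """Return the characters of text that fall outside every given range."""
    return [
        ch for ch in text
        if not ch.isspace()
        and not any(first <= ord(ch) <= last for first, last in ranges)
    ]


sample = "සිංහල unicode"

# With a western-only subset, every Sinhala character is missing from the
# webfont, so the browser has to fall back to a system font for them.
print(uncovered(sample, [BASIC_LATIN]))

# With the Sinhala block kept in the font, nothing is left uncovered.
print(uncovered(sample, [BASIC_LATIN, SINHALA]))
```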

The following text uses webfonts (hodipotha) and hence should be visible in many browsers, including the ones on Android mobiles, in the (not very beautiful) glyphs of the hodipotha font.

සිංහල යුනිකෝඩ් (unicode) වෙබ් ෆොන්ට්ස් හරහා

The following does not, and hence shows up in whatever font your system is configured to use.

සිංහල යුනිකෝඩ් (unicode)

