Thursday, May 7, 2015

Displaying Sinhala characters on the web

Sinhala is spoken almost only by the Sinhalese people of the small island of Sri Lanka, who make up about three quarters of its population of roughly 20 million. So it is one of the less widely spoken languages in the world, which makes seeing Sinhala characters on the web a delightful experience for a Sinhalese. At least it used to be. Nowadays, with the breakthrough of Unicode, it has become so commonplace that there is nothing special about it. There are many Sinhala websites, half of the posts on my Facebook wall are in Sinhala and, though it is still somewhat dodgy, Google Translate includes Sinhala.

Before Unicode, ASCII was popular. It is still around, but Unicode is the norm. An 8-bit character code has 256 values, and the first 128 of them are standard ASCII, which covers the letters of the English language. As a standard, ASCII code 065 is used for capital A in English fonts. In fonts that have glyphs for languages other than English, it is used for some letter of that language's alphabet instead. In Sinhala fonts that follow the Wijesekara layout, it stands for "Hal kireema". The problem with this approach is that the same sequence of ASCII codes can display different glyphs depending on the font used. Text written in Sinhala using a Sinhala ASCII font, if viewed in a different font, could display nothing but gibberish. Or worse, if two languages contain similar letters, it could read as something meaningful yet different. So it is obvious that ASCII text is very difficult to use universally. Hence Unicode.
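
To make the font-dependence problem concrete, here is a minimal Python sketch. The two glyph tables are hypothetical and only illustrative: the Sinhala one borrows the example above (code 65 showing "Hal kireema") and is not the table of any real font.

```python
# Hypothetical glyph tables for two 8-bit fonts. The Sinhala entries are
# illustrative only; code 65 follows the example in the text above.
ENGLISH_FONT = {65: "A", 66: "B"}
LEGACY_SINHALA_FONT = {65: "\u0dca", 66: "\u0d89"}


def render(codes, font):
    """Look up every stored code in the glyph table of the given font."""
    return "".join(font.get(code, "?") for code in codes)


# The stored text is just a sequence of numbers; it does not say which
# language it is in.
stored = bytes([65, 66])

english = render(stored, ENGLISH_FONT)
sinhala = render(stored, LEGACY_SINHALA_FONT)

print(english)
print(sinhala)
```

The stored bytes are identical in both cases. Only the font decides what the reader sees, which is exactly why such text turns into gibberish when the expected font is missing.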

Unicode started with about 65,000 code points and now has 17 times as many (1,114,112), which allows every letter in every language in the world to have its own code. There is still an excess of codes, some of which are taken up by glyphs like ♥, ♫, ☯, ☺. With Unicode, the font used should not matter in deciding which letter of which language is displayed. It should matter only for the visual properties of the glyphs. Font makers are guided on which code should display which character. The 128 codes from U+0D80 through U+0DFF are reserved for Sinhala characters.
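
The numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name in_sinhala_block is my own and not part of any library.

```python
# Unicode is organised in 17 planes of 65,536 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16
TOTAL = PLANES * CODE_POINTS_PER_PLANE

# The Sinhala block, as given in the text above.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF
BLOCK_SIZE = SINHALA_END - SINHALA_START + 1


def in_sinhala_block(ch):
    """Return True if the character's code point lies in the Sinhala block."""
    return SINHALA_START <= ord(ch) <= SINHALA_END


print(TOTAL, BLOCK_SIZE)

word = "සිංහල"
for ch in word:
    # Print each code point in the usual U+XXXX notation.
    print(f"U+{ord(ch):04X}", in_sinhala_block(ch))
```

Whatever font ends up drawing the word, the code points stay the same. That is the whole point of Unicode.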

Obviously a font cannot contain glyphs for every Unicode code. If the selected font does not contain glyphs for certain characters, those characters are displayed in a font that does. Applications, including web browsers, select the fallback font depending on the way the system is configured. On my Ubuntu 14.04 machine Sinhala characters are displayed using the font LKLUG. This can be changed through the configuration of fontconfig.
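
Per-character fallback can be sketched roughly like this in Python. The coverage table is an assumption made up for the example (including the font name SomeLatinFont and the ranges listed for LKLUG); real systems such as fontconfig read coverage from the font files and apply far more elaborate rules.

```python
from itertools import groupby

# Assumed coverage: font name -> list of (first, last) code point ranges.
# Both the name "SomeLatinFont" and the ranges are hypothetical.
FONT_COVERAGE = {
    "SomeLatinFont": [(0x0020, 0x007E)],
    "LKLUG": [(0x0020, 0x007E), (0x0D80, 0x0DFF)],
}


def covers(font, ch):
    """Return True if the font has a glyph for the character."""
    return any(first <= ord(ch) <= last for first, last in FONT_COVERAGE[font])


def pick_font(ch, chain):
    """Return the first font in the chain that covers ch, or None."""
    for font in chain:
        if covers(font, ch):
            return font
    return None


def split_runs(text, chain):
    """Group consecutive characters that end up in the same font."""
    return [
        (font, "".join(chars))
        for font, chars in groupby(text, key=lambda ch: pick_font(ch, chain))
    ]


runs = split_runs("Sinhala සිංහල", ["SomeLatinFont", "LKLUG"])
for font, run in runs:
    print(font, repr(run))
```

The requested font draws everything it can, and only the characters it lacks are handed to the next font in the chain.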

Now to displaying characters on the web. Earlier, the content of websites was displayed entirely in fonts installed on the viewer's system. Websites could optionally specify a font family, a font, or a chain of fonts to fall back on. Even so, a webpage could end up looking entirely different from what the developer expected because a certain font was not installed. Though this is still the case with many websites, the introduction of webfonts could change all that.

Developers can specify a font to use, and the place to get that font from, using the @font-face rule, so the clients (web browsers) will do everything they can to display the text using that font. Usually they only fall back if they cannot download the webfont from the specified location.

Early Sinhala websites contained ASCII text. And as none of the Sinhala fonts they could use could be considered web safe, they asked users to download and install whatever font they were using. Notices appeared saying "Do you see Sinhala characters? If not, download and install this font" while gibberish appeared in whatever English ASCII font the web browser decided to fall back to. Only once the font was installed would the text look meaningful.

When Unicode came through, many Sinhala websites changed from ASCII to Unicode. The upside is that most systems include a Unicode font that covers the Sinhala characters. This got rid of the step of downloading and installing fonts. Unfortunately this is also the downside. Most systems... Some systems do not include a Sinhala Unicode font. For example, Android devices running KitKat and earlier versions. And without rooting, it is very difficult to install a new font there. The standard font in Lollipop includes glyphs for Sinhala characters. But some manufacturers like Sony removed them for reasons known only to them. Maybe they thought a few extra kilobytes were not worth an entire nation reading and writing in their native tongue. Sinhala websites like bbc.lk and lankadeepa.lk contain Unicode text. So they are readable from most PCs but not from most Android handhelds.

Then webfonts came up, which allow developers to ship a Sinhala Unicode font along with the rest of the content of the website. So the text is readable from most browsers, including the ones on Android devices. gossip.hirufm.lk does this. Many other Sinhala websites do not seem to do it.

gossiplankanews.lk, another gossip site!, uses webfonts but sticks to ASCII. If text from their site is copied and pasted somewhere else, you can see the gibberish it truly is. But at least since they use webfonts, the content should be readable on systems without Sinhala Unicode fonts, so on Android devices too.

If the developers of Sinhala websites use webfonts with Unicode content, they can increase their audience. Font Squirrel is a good place, among others, to generate a webfont kit. The hodipotha font from icta.lk is released under a Creative Commons license, so it can be used to generate the webfont kit.

In Font Squirrel it is important to choose the expert option and pick no subsetting. Otherwise it will generate webfonts that contain only the range of Western characters, omitting the Sinhala characters.

The following text uses webfonts (hodipotha) and hence should be visible in many browsers, including the ones on Android mobiles, in the (not very beautiful) glyphs of the hodipotha font.

සිංහල යුනිකෝඩ් (unicode) වෙබ් ෆොන්ට්ස් හරහා

The following does not, and hence will show up in whatever font your system decides on (is configured to use).

සිංහල යුනිකෝඩ් (unicode)


Monday, March 16, 2015

Pike, A Hidden Beauty

For the last few days I have been getting familiar with an interesting project named sTeam. One of the things that makes it interesting is that it is written in Pike. Pike is the programming language used to write it.

Don't worry if you haven't heard of Pike. Many people have not. It's a relatively undiscovered language. An average Google search along the lines of "Pike tutorials" would not get you much other than the official beginner tutorial. The Stack Overflow tag 'pike' has a mere 8 subscribers. There is only the single official interpreter, and not much IDE support apart from an Emacs mode and an Eclipse plugin. The text editor I use, gedit, does not know by default how to highlight the syntax of a .pike file.

But installing Pike and trying it out makes you wonder... why? Why don't we use this more?

Pike is very attractive, just like an unseen, hidden forest flower.

In my short career so far I have been lucky enough to use many languages. Most of my assignments at the university had to be done in C or Java. My final year project was done with Diaspora, which is Ruby. I coded Python for Melange in Google Summer of Code. During an internship with IroneOne I worked mostly on an iOS app, which is Objective-C. And my first task in my first job was building a Node.js application, which is JavaScript. While going through the tutorial, I felt Pike is made of the good parts of many of those languages.

Duck Typing


There are both pros and cons to duck typing. While writing Python, Ruby and JavaScript code I felt great, and privileged, to be able to use dynamic typing. But it was clear to me that some of the code that takes advantage of duck typing could lead to problems, especially if it is not documented or the documentation is not referred to. Sometimes it took quite some of my time to understand poorly documented JavaScript APIs.

On the other hand, in my early days as an undergraduate I sometimes felt frustrated at not being able to return a couple of integers and a string in a single array from a Java function.

Pike has a simple yet effective solution. In addition to its three basic types, int, float and string, it has another queer little type, mixed. You can declare a variable as mixed and store any value in it, irrespective of whether it is an int, a float, a string or even a complex type.

Pike v7.8 release 866 running Hilfe v3.5 (Incremental Pike Frontend)

> int a;
> a = 7;
(1) Result: 7
> a = "aruna"
>> ;
Compiler Error: 3: Bad type in assignment.
Compiler Error: 3: Expected: int.
Compiler Error: 3: Got     : string(0..255).

> mixed b;  // using mixed type
> b = 21;
(2) Result: 21
> b = "herath";
(3) Result: "herath"


Arrays declared to hold mixed types, or not declared to hold any type, can contain values of mixed types. A function declared to return mixed type can return anything.

Syntax


I feel the syntax of Pike manages, again, to get the best, most elegant bits out of the languages I am familiar with. It uses C-style braces to mark the scopes of functions, conditionals and loop statements. In my view that is much better than Python's awkward tabs/spaces.

Outshining C, it has a simple 'dictionary' data structure known as a mapping, with elegant syntax.

mapping(string:string) batsmen = (["Sangakkara": "Sri Lanka", "Maxwell": "Australia", "Kohli": "India"]);

Arrays,

array(string) cricketers = ({"McCullum", "De Villiars", "Malinga", "O'Brien"});

In my view pike syntax is pleasing to the eye...

Interpreter


You have probably guessed from what has been mentioned so far that Pike is interpreted. Yes, Pike comes with an official interactive interpreter named Hilfe.
An interactive interpreter is something I find very useful for learning a language. It is of great value when developing as well. You can just enter a few lines in the interpreter and see if you got the syntax or the logic correct. It's much easier and quicker than writing a program, compiling and testing it, or looking through documentation.

Further, Pike does not have many external libraries, but it is distributed with many modules, so you can get most of your work done with Pike itself.

In Top Gear consumer advice style, Pike would not treat you like a kid and hold you back from doing stuff freely, nor would it consider you a saint and overtrust you with everything.

It is probably not a good idea to use Pike in your next big project, given the very little support you'd get, but it is definitely worth knowing about and trying out in one of those pet projects.

Wednesday, January 21, 2015

GCI mentoring with FOSSASIA

For the last few weeks I have had the opportunity to be involved in the Google Code-In 2014 program as a mentor for FOSSASIA (Thanks Andun Sameera!). It was more challenging than I thought, especially while doing a full time job. But it was a great experience and I learned things myself along with the students.

The program is almost over, with only the results yet to be out.

FOSSASIA's co-admin Mario Behling initiated an interesting project at the start of the program to give students an opportunity to experience open source development culture. The project was to create a small website to hold the details of FOSSASIA's students and mentors. It turned out to be a great success, with a cute little website being created and, more importantly, a nice little community of students forming around it.



Usually there is a barrier you need to get past as a novice contributor to get your first commit merged into an open source project. The administrators would want you to follow annoying coding conventions, to "combine your 5 commits, solving a simple small bug, into one big commit" or to "rebase your pull request on top of master". Until you continue contributing for some time, realize the importance of those and start to appreciate them, they are just some annoyance that you have to deal with on the way to getting your work integrated.

For this project we initially made this barrier much less challenging. We would merge pull requests if they did the job. This was so that young student contributors would not feel discouraged, and only until they got themselves started.

But having been well mentored at Google Summer of Code 2013, I wanted some niceties in our git commits. So I made learning them into a task.

The task was to learn how to make your local commits look nice before you push them to the repo. To make it more organized, possible to evaluate and hopefully fun, I built up a small set of commits with an interesting bit of commit history; a story.

I added the set of commits to a GitHub repo. It included a badly written commit message and two commits that would look better squashed into one bigger commit. Students were asked to clone the repo and then, using git interactive rebase, make the commit history look better. The story of the commits and a set of instructions were given. Then they had to blog about their experience.

They came up with some great write-ups! Some focused on the technical aspects and were written from a tutorial point of view. Some explained the personal experience the writers themselves had and were in lighter, less technical language. However, all were great!

I think I got a few students to learn something that will be valuable in their future careers, and also one student to start blogging!

When I saw a set of commits that could be better organized in a pull request for any of FOSSASIA's repositories, from a student who had completed this task, I asked them to make it better. Thanks to the above task, they knew the terminology and communication was easier. When I said squash these commits and reword the commit message to something like this, they knew what I was saying and how to do that, and were happy to oblige.

We gradually made it harder and more challenging, bringing the barrier up to the usual level, for students who hung around to complete more tasks.

This hopefully resulted in not only the finished product, but also the path towards it, being in great shape.

Students managed to complete a lot more very valuable work for FOSSASIA.

It was fun working with them and I wish them an exciting and a fruitful future!