Tamil Unicode FAQ

This is an attempt to provide brief answers to frequently asked questions about Tamil Unicode.
Please feel free to send in your comments and additional queries that can be listed here.

1. What is Unicode Encoding?
Unicode is a universal character encoding scheme, designed to cover all the world's languages. It was originally designed as a 16-bit scheme, which gives 65,536 slots to assign to the various languages. Each language (except a few, such as Chinese) is given a 128-slot block.

2. What is meant by Character encoding?

Tamil is a language in which, in addition to the basic vowels (uyir) and consonants (mei), the compound (uyirmei) characters all have unique glyph forms. Popular Tamil font encoding schemes such as TAB, TSCII and TAM are glyph-based: many of the uyirmeis with distinct glyph forms are encoded directly in the scheme. Thus uyirmeis like ku and pU each have a slot of their own.

Unicode, by contrast, encodes only the basic uyir and mei characters, plus a set of modifiers that represent the cases where a uyir/mei pair appears as a combination (uyirmei). A Unicode file stores textual information solely at this “character” level and does not care about the actual form of the glyphs. Rendering the glyphs that correspond to the stored characters is left to software.
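
The character-level storage described above can be inspected directly. The following is a minimal Python sketch (the helper name `describe` is only illustrative) that lists the code points stored for the uyirmeis ku and pU. Each is held as a consonant followed by a vowel sign, even though it is displayed as a single glyph.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for every stored character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The uyirmei "ku" is stored as consonant KA followed by vowel sign U.
ku = "\u0b95\u0bc1"
# The uyirmei "pU" is stored as consonant PA followed by vowel sign UU.
puu = "\u0baa\u0bc2"

for label, text in (("ku", ku), ("pU", puu)):
    print(label, text, describe(text))
```

Whether the two stored characters actually appear on screen as one combined glyph depends on the font and the rendering software, not on the stored text.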

3. How is Tamil Language encoded in Unicode?

All Indic languages are allocated 128 slots each. The assignment of characters to specific slots within each block is based on the ISCII (Indian Standard Code for Information Interchange) scheme, which uses Devanagari as the basic reference script. Thus the vowels, consonants and modifiers of each Indic language appear at the same slot within their block. For example, the “ka” of Tamil and the “ka” of Telugu are exactly 128 slots apart, which greatly facilitates programming.
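
The fixed 128-slot spacing can be illustrated with a short Python sketch. The block start values are those published in the Unicode code charts; the function name `shift_script` is an illustrative choice, not part of any library.

```python
import unicodedata

# Start of each 128-slot block, as published in the Unicode code charts.
BLOCK_START = {
    "devanagari": 0x0900,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
}

def shift_script(ch, source, target):
    """Move a character to the same slot in another Indic block."""
    offset = ord(ch) - BLOCK_START[source]
    if not 0 <= offset < 128:
        raise ValueError(f"{ch!r} is not in the {source} block")
    return chr(BLOCK_START[target] + offset)

tamil_ka = "\u0b95"
telugu_ka = shift_script(tamil_ka, "tamil", "telugu")

print(unicodedata.name(tamil_ka), "->", unicodedata.name(telugu_ka))
print("distance in slots:", ord(telugu_ka) - ord(tamil_ka))
```

Note that not every slot is filled in every script. Tamil has fewer consonants than Telugu or Devanagari, so shifting a character into the Tamil block can land on an unassigned slot, and a real program would need to check for that.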

4. How do Unicode fonts work?

As stated in (2), a Unicode text file stores only characters. The unique glyph forms of the uyirmeis are stored separately, in the font, and are “rendered” on the screen by software when a Unicode-based text file is displayed.

The process of picking up these unique glyph forms of uyirmeis stored in the font and rendering them on the screen is called “glyph substitution (GSUB)”. A new font technology called OpenType (referred to in this FAQ as “OpenTrueType”, or OTT) has been developed for use with Unicode.

Different platforms and operating systems use different font-rendering engines to handle these Unicode OTT-type fonts (that is, to make use of their GSUB and GPOS tables).

To use Unicode Tamil text, you need a Unicode OTT-type font that includes the Tamil block (many Unicode fonts carry only a few languages) and also the font-rendering tool/engine (a DLL) of the respective platform.

5. What Operating System do I need to use Tamil Unicode?

On the Windows platform, only Windows 2000 and Windows XP come with the .dll file required to handle Tamil characters. Windows ME and 98, though “Unicode-intelligent”, do not have the specific .dll support required for Tamil. So Unicode Tamil texts are rendered there in a “linear” fashion, exactly as stored in the character-based scheme, without glyph substitution. Latha, Arial Unicode MS and Code2000 are some of the Unicode fonts that carry the Tamil block.

Apple uses a different font-rendering engine, called ATSUI, to handle the GSUB and GPOS tables of Unicode OTT fonts. Though Mac OS 9.x and Mac OS X fully support Devanagari, Gujarati and Gurmukhi, their ATSUI does not support Tamil.

The Tamil Linux group has developed the tools necessary to enable Unicode Tamil on the Linux platform.

6. What application software do I need for Tamil Unicode in Windows?

Even if you use Win 2K/XP, you need “compatible” application software to handle Tamil Unicode. MS Office 2000 appeared before the Windows 2000 release and hence displays Unicode Tamil text in linear fashion even when used on Windows 2000. So you need to use the more recent Office XP package with Windows 2000.

An alternative is to use a simple text editor like Notepad or WordPad.

7. What keyboard do I need to input Tamil Unicode in Windows?

Windows 2000 and Windows XP come with a special “on-screen keyboard” (available under “Accessories”) that allows Unicode Tamil text input. This keyboard is based on the “Inscript” keyboard layout, widely used in India with ISCII-based software. Thus the key for “ka” is the same whether you type Tamil, Hindi or Telugu Unicode text.

8. How do I know if a given Tamil font is a Unicode font and also includes the Tamil block?

On Windows 2000/XP, you can use the “Character Map” utility, available under /accessories/system tools/, to look at the contents of all fonts installed on your computer. Select the font you are interested in at the top, and select “unicode-Tamil” as the block you want to look at. If your font is based on Unicode and has Tamil support, you should see the set of basic characters defined in the Unicode Tamil block, and also the additional Tamil glyphs stored in the font.
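
The same check can be sketched in code. The Python example below assumes that the set of code points a font supports has already been read from the font with some font-parsing library (that step is not shown here). The function names and the two sample character maps are hypothetical, chosen only for illustration.

```python
# The Unicode Tamil block occupies U+0B80..U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0C00)

def tamil_coverage(font_codepoints):
    """Return the sorted Tamil-block code points present in a font's character map."""
    return sorted(cp for cp in font_codepoints if cp in TAMIL_BLOCK)

def has_tamil_block(font_codepoints, required=("\u0b85", "\u0b95", "\u0bcd")):
    """True if the font maps the sample characters: letter A, letter KA and the pulli."""
    present = set(font_codepoints)
    return all(ord(ch) in present for ch in required)

# Hypothetical character maps, for illustration only.
latin_only = set(range(0x20, 0x7F))
with_tamil = latin_only | {0x0B85, 0x0B95, 0x0BC1, 0x0BCD}

print(has_tamil_block(latin_only))
print(has_tamil_block(with_tamil))
print([f"U+{cp:04X}" for cp in tamil_coverage(with_tamil)])
```

Checking only three sample characters is a deliberate simplification. A stricter test would require every assigned character of the Tamil block, and correct display still depends on the font's additional uyirmei glyphs.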

9. If I prepare the text in TSCII or TAB, is there a text converter to convert it to Unicode format?

Yes. Murasu Anjal 2000SE comes with a text converter that lets you open text in TSCII, TAB or other popular text encodings (it even incorporates an automatic encoding detection tool) and generate the equivalent Unicode text.
Remember that, for use in Web pages, the Unicode-based text must be stored in UTF-8 format.
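
The general idea behind such a converter can be sketched in Python. This is not how Murasu Anjal works internally, and the byte values in the table are hypothetical: they are NOT the real TSCII or TAB layout. The sketch also assumes a glyph-order scheme in which a vowel sign drawn to the left of a consonant is stored before it, whereas Unicode stores the consonant first.

```python
# HYPOTHETICAL byte values: this toy table is NOT the real TSCII or TAB layout.
TOY_GLYPH_TABLE = {
    0xA1: "\u0b95",        # glyph "ka" -> KA
    0xA2: "\u0b95\u0bc1",  # glyph "ku" -> KA + VOWEL SIGN U
    0xA3: "\u0baa\u0bc2",  # glyph "pU" -> PA + VOWEL SIGN UU
    0xA4: "\u0bc6",        # glyph for the left-side VOWEL SIGN E
}
# Signs drawn to the left of the consonant (assumed to be stored first).
PREFIX_SIGNS = {"\u0bc6"}

def toy_to_unicode(data):
    """Convert bytes in the toy glyph encoding to a Unicode string."""
    out = []
    pending = ""  # left-side sign waiting for the consonant that follows it
    for byte in data:
        text = chr(byte) if byte < 0x80 else TOY_GLYPH_TABLE[byte]
        if text in PREFIX_SIGNS:
            pending = text
            continue
        # Unicode order: consonant first, then its vowel sign.
        out.append(text + pending)
        pending = ""
    return "".join(out) + pending

ku = toy_to_unicode(bytes([0xA2]))        # one glyph byte -> two characters
ke = toy_to_unicode(bytes([0xA4, 0xA1]))  # sign + consonant -> consonant + sign
print(ku, ke, ku.encode("utf-8"))
```

The final `encode("utf-8")` call produces the byte form needed for Web pages; each Tamil character occupies three bytes in UTF-8.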

10. How are unicode-based Tamil texts handled on the Web?

For use on the Web/Net, Unicode-based texts must be stored in the source/html files in UTF-8 format. (Notepad allows you to save Unicode texts in UTF-8 format. So if you know how to add html tags yourself, you can prepare a Tamil webpage using Notepad alone.)
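
As an alternative to preparing the page by hand, the following Python sketch builds a minimal page and encodes it as UTF-8. The function name `build_tamil_page` and the page contents are illustrative assumptions. The `meta` line declares the encoding so that a browser does not have to guess it.

```python
def build_tamil_page(title, body_text):
    """Return a minimal HTML page, as UTF-8 bytes, that declares its own encoding."""
    html = (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"<p>{body_text}</p>\n"
        "</body>\n</html>\n"
    )
    return html.encode("utf-8")

# The word "Tamil" written in Tamil script, given as five Unicode characters.
tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
page = build_tamil_page("Tamil Unicode test", tamil_word)

print(len(tamil_word), "characters ->", len(tamil_word.encode("utf-8")), "bytes")
```

Writing `page` to a file opened in binary mode, for example `open("tamil.html", "wb")`, gives a file that can be opened directly in a browser.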

11. What browsers do I need to view Unicode Tamil webpages?

As stated under (5) and (6), only a few operating systems and application software packages currently “fully” support Tamil Unicode.

Netscape browser 4.6 onwards and Internet Explorer 4 onwards are Unicode-intelligent. Hence, if they are used in conjunction with Win 2K/XP, they will display Tamil webpages correctly.

Because Unicode-based texts are stored in UTF-8 format, you need to set up the browser correspondingly before viewing Unicode Tamil webpages. You need to do two things: a) select a Unicode font that carries the Tamil block as the default font for the Unicode encoding/char-set, and b) set the browser to display the webpage in UTF-8 format. After changing these settings, reload the page if necessary.

For the reasons indicated under (5) and (6), the same Netscape or IE browsers used in Windows ME or 98 will display Unicode Tamil texts in “linear form” only. It should be possible to use “dynamic fonts technology” with eot-type fonts to render Unicode Tamil texts correctly on these platforms, but this is yet to be demonstrated.

12. What is the current support for Unicode Tamil texts in Adobe PDF?

Adobe Acrobat 4 allows you to prepare PDF files of Unicode Tamil texts without any problem. With the “font embedding” option, the PDF files are fully readable in Windows 2000/XP and also in Macintosh OS 9 and X (even though, in the latter case, the ATSUI engine does not yet support Tamil).

13. What tools do I need to prepare a Unicode font with support for Tamil block elements?

Unicode fonts are of a special kind, OTT (OpenTrueType), unlike the 8-bit bilingual fonts used for TAB and TSCII (TrueType). Preparation of an OTT font proceeds in two distinct stages:
stage i) Preparation of a TT font containing all the glyphs you want to include, using one of the font-editing software packages that support Unicode encoding. Currently these are Font Creator, FontEdit and Fontographer. With these you can give the glyphs Unicode-based names and numbers.
stage ii) Preparation of the glyph positioning (GPOS) and glyph substitution (GSUB) tables, and bundling of these with the glyph outlines to create an OTT font for use in Windows. The best software for this purpose is MS VOLT, distributed free by Microsoft to registered software professionals.

K.Kalyanasundaram, Ph.D.

Courtesy: மின்மஞ்சரி (Minmanjari)