Introduction to W3C I18n Best Practices

Presented by Gopal Venkatesan <[email protected]>

नमस्कार

வணக்கம்

ನಮಸ್ಕಾರ

నమస్కారం

ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ

നമസ്കാരം

ନମସ୍କାର

নমস্কার

السلام علیکم

નમસ્કાર

Training Outline

• Internationalisation Vocabulary

• Typical Problems
– Outline the common problems found across the web

• Java and Internationalisation
– The level of Internationalisation support available in Java

• Resource Bundles
– Formatting messages the correct way

• PHP and Internationalisation
– The level of Internationalisation support available in PHP

VOCABULARY

Unicode

• International standard for representing written language in computers

• Version 5.2, the latest at the time of this presentation, adds 6,648 new characters, including support for Vedic Sanskrit

• Maintained in sync with ISO/IEC 10646

• Three main encodings: UTF-8, UTF-16 and UTF-32

• Address space of 21 bits

Unicode (contd.)

• UTF-8 is a variable-length, multi-byte encoding that uses 8-bit code units

• An encoded character can take one, two, three or four bytes

• UTF-8 is backward compatible with US-ASCII

• Default encoding for PHP6?
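The one-to-four-byte rule can be checked directly. The following is a minimal Java sketch; the class name `Utf8Lengths` and the sample characters are illustrative choices, not part of the original slides.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Returns the number of bytes the given text occupies when encoded as UTF-8. */
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Escapes keep this source file pure ASCII, whatever encoding the compiler assumes.
        String[] samples = {
            "A",            // U+0041 LATIN CAPITAL LETTER A (US-ASCII range)
            "\u00E9",       // U+00E9 LATIN SMALL LETTER E WITH ACUTE
            "\u0928",       // U+0928 DEVANAGARI LETTER NA
            "\uD83D\uDE00"  // U+1F600 GRINNING FACE, written as a surrogate pair
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8Length(s));
        }
    }
}
```

The four samples encode to one, two, three and four bytes respectively; only the first is byte-for-byte identical to its US-ASCII form.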

Unicode (contd.)

• UTF-16 uses 16-bit code units

• A single code unit cannot address the complete character set, so code points above U+FFFF are encoded as surrogate pairs

• Default encoding for strings in Java and JavaScript
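A short Java sketch of what surrogates mean in practice: `String.length()` counts UTF-16 code units, so it differs from the code point count for characters above U+FFFF. The class and method names are illustrative.

```java
public class SurrogateDemo {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    public static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u0928";                                     // U+0928, fits in one code unit
        String supplementary = new String(Character.toChars(0x1F600)); // U+1F600, needs two

        System.out.println(codeUnits(bmp) + " unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supplementary) + " unit(s), "
                + codePoints(supplementary) + " code point(s)");

        // The two code units of the supplementary character are a high and a low surrogate.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));
    }
}
```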

Unicode (contd.)

• UTF-32 uses 32-bit code units

• Every Unicode character is addressed within a single code unit

Internationalisation

• Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language

• Abbreviated as I18n as there are eighteen characters between “I” and “n”

Localisation

• Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)

• Translation is one aspect of localisation

• Abbreviated as L10n as there are ten characters between “L” and “n”

TYPICAL PROBLEMS

Typical Problem

Typical Problem (Contd.)

The Solution

• Determine the user environment
– Format dates, times and currencies as per the locale

• Understand the Internationalisation support available in your implementation language

• Use ICU or other Internationalisation libraries rather than rolling your own functions
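As an illustration of leaning on the library rather than hand-written formatting, here is a minimal Java sketch using the standard `java.text.NumberFormat`. The class name, the sample value and the choice of locales are assumptions made for the example.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class LocaleFormatting {
    /** Formats a number using the grouping and decimal conventions of the given locale. */
    public static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        Locale[] locales = {
            Locale.US,
            Locale.GERMANY,
            Locale.forLanguageTag("hi-IN") // result depends on the locale data shipped with the JDK
        };
        for (Locale locale : locales) {
            System.out.println(locale.toLanguageTag() + ": " + formatNumber(value, locale));
        }
    }
}
```

The same value comes out with different grouping and decimal separators per locale, with no locale-specific logic in the application code.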

COMMON ENCODING PROBLEMS

Tofu characters – Black hollow boxes

• Shown as a black hollow box, typically one per character

• Indicates a font problem, i.e. the system doesn’t have the right fonts to display the glyph(s)

• Tofu isn’t always a software problem: often not a bug, but really annoying

Tofu characters – Black hollow boxes

Question Marks – Incorrect conversion

• “???” usually displayed when converting text from one encoding to another

• Means there is no equivalent in the target encoding for a character in the source

• Not always a bug, though it sometimes occurs when an incorrect encoding is specified
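The effect is easy to reproduce. In the Java sketch below (class and method names are illustrative), a Devanagari string is encoded to ISO-8859-1, which has no equivalent for any of its characters, so each one is replaced by the charset's substitute byte, "?".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarks {
    /** Encodes the text to the target charset and decodes it again with the same charset. */
    public static String roundTrip(String text, Charset target) {
        byte[] encoded = text.getBytes(target); // unmappable characters are replaced here
        return new String(encoded, target);
    }

    public static void main(String[] args) {
        // "namaste" in Devanagari, written with escapes to keep the source ASCII-only.
        String namaste = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // Latin-1 cannot represent Devanagari: every character is lost.
        System.out.println(roundTrip(namaste, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent every Unicode character: the round trip is lossless.
        System.out.println(roundTrip(namaste, StandardCharsets.UTF_8).equals(namaste));
    }
}
```

Once the question marks are written out, the original text cannot be recovered from them.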

Question Marks – Incorrect conversion

Mojibake – 文字化け

• Pronounced “Moh-jee-baa-kay”, a Japanese word meaning “garbled characters”

• Occurs when text in one encoding is interpreted as some other encoding

• Most often caused by interpreting Latin-1 as UTF-8, or vice versa
– UTF-8 is compatible only with US-ASCII
– Latin-1 characters outside the ASCII range are encoded differently in UTF-8 and cause Mojibake
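Both directions of the mix-up can be shown in a few lines of Java. This is a sketch with illustrative names; it prints code points rather than the strings themselves so that the console encoding does not add a second layer of garbling.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes as UTF-8 but decodes as Latin-1: each non-ASCII character becomes two or more. */
    public static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Encodes as Latin-1 but decodes as UTF-8: invalid byte sequences become U+FFFD. */
    public static String latin1ReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    /** Lists the code points of a string, e.g. "U+0063 U+0061". */
    public static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9"; // ends in U+00E9, outside the ASCII range

        System.out.println("original:             " + codePoints(cafe));
        System.out.println("UTF-8 read as Latin-1: " + codePoints(utf8ReadAsLatin1(cafe)));
        System.out.println("Latin-1 read as UTF-8: " + codePoints(latin1ReadAsUtf8(cafe)));
    }
}
```

The ASCII letters survive in both cases, which is why such bugs often go unnoticed until non-ASCII text shows up.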

Mojibake – 文字化け

JAVA™ AND UNICODE

Unicode support in Java™

• Java™ has always supported Unicode

• Java™ strings are UTF-16
– A “char” in Java™ is a UTF-16 code unit, not a code point

• By default the input and output streams use the OS native charset
– On Windows™ this is typically Windows-1252
– On most Unix and Unix-like systems this is typically UTF-8
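One way to avoid depending on the OS native charset is to name the encoding whenever characters are turned into bytes. The sketch below is an assumption about how that could look, not the code from the slides; it writes to an in-memory stream so it is self-contained, and uses ISO-8859-1 as a stand-in for a legacy single-byte charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    /** Writes the text through a Writer that uses the given charset, returning the raw bytes. */
    public static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // What the JVM would use if no charset were named explicitly.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        String namaste = "\u0928\u092E\u0938\u094D\u0924\u0947"; // six Devanagari characters

        // UTF-8 keeps all six characters, three bytes each.
        System.out.println("UTF-8 bytes:   " + write(namaste, StandardCharsets.UTF_8).length);

        // A single-byte charset cannot represent them; each becomes one substitute byte.
        System.out.println("Latin-1 bytes: " + write(namaste, StandardCharsets.ISO_8859_1).length);
    }
}
```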

A “Hello, world” example

A “Hello, world” example (contd.)

“Hello, world” on GNU/Linux

Garbage In, Garbage Out!

“Hello, world” Corrected!

EXTERNALISING STRINGS

Resource Bundles

The Need

• Allows a single code base to display strings in multiple languages

• No need to refactor code to support new languages

Beginning

Beginning (Sum.properties)

• SUM_OF = Sum of

• AND = and

• IS = is

That was broken!

• It’s generally a bad idea to concatenate strings
– Does not work for all languages since the grammar is different!

• Always use string substitution with positional parameters
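The difference between the two approaches can be sketched in Java with `java.text.MessageFormat`. The class name and the second, reordered pattern are illustrative stand-ins for a real translation.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class SumMessage {
    /** Fragile: the word order is fixed in code, so translators can only swap the fragments. */
    public static String concatenated(int a, int b) {
        return "Sum of " + a + " and " + b + " is " + (a + b);
    }

    /** Robust: the whole sentence is one pattern, and {0}, {1}, {2} may appear in any order. */
    public static String formatted(String pattern, int a, int b) {
        MessageFormat format = new MessageFormat(pattern, Locale.ENGLISH);
        return format.format(new Object[] { a, b, a + b });
    }

    public static void main(String[] args) {
        System.out.println(concatenated(2, 3));
        System.out.println(formatted("Sum of {0} and {1} is {2}", 2, 3));
        // A pattern that needs the result first, as some languages would.
        System.out.println(formatted("{2} is the sum of {0} and {1}", 2, 3));
    }
}
```

Only the pattern changes between the last two calls; the calling code and the argument order stay the same.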

Correct Way

Correct Way (contd.)

• SumI18n.properties
– SUM = Sum of {0} and {1} is {2}

• SumI18n_hi.properties
– SUM = {0} अतिरिक्त {1} {2} का बराबर है

• SumI18n_ta.properties
– SUM = {0} மற்றும் {1} கூட்டினால் {2}
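A self-contained Java sketch of the lookup: one key, one pattern per language, and a fallback to the base bundle. To stay runnable without files on the classpath it parses the properties text from strings and uses a made-up language code "xx" with a placeholder pattern; in an application, `ResourceBundle.getBundle("SumI18n", locale)` would locate the files listed above instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class SumBundles {
    // Language code -> bundle; the empty key plays the role of the base SumI18n.properties.
    private static final Map<String, ResourceBundle> BUNDLES = new HashMap<>();

    static {
        BUNDLES.put("", parse("SUM = Sum of {0} and {1} is {2}\n"));
        // Hypothetical language "xx": a stand-in for a translation with a different word order.
        BUNDLES.put("xx", parse("SUM = {2} = {0} + {1}\n"));
    }

    private static ResourceBundle parse(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the SUM pattern for the locale's language, falling back to the base bundle. */
    public static String sum(Locale locale, int a, int b) {
        ResourceBundle bundle = BUNDLES.getOrDefault(locale.getLanguage(), BUNDLES.get(""));
        MessageFormat format = new MessageFormat(bundle.getString("SUM"), locale);
        return format.format(new Object[] { a, b, a + b });
    }

    public static void main(String[] args) {
        System.out.println(sum(Locale.ENGLISH, 2, 3));              // no "en" bundle: base is used
        System.out.println(sum(Locale.forLanguageTag("xx"), 2, 3)); // language-specific pattern
    }
}
```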

Oops!

• In Java 1.5, property files are read as ISO-8859-1 (Latin-1)

• Use the “native2ascii” tool to convert Unicode files to escape sequences (\uXXXX)

• native2ascii -encoding UTF-8 SumI18n_hi.properties

• native2ascii -encoding UTF-8 SumI18n_ta.properties
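The sketch below shows, under the assumptions stated in its comments, why the conversion is needed. `Properties.load(InputStream)` decodes bytes as ISO-8859-1, so raw UTF-8 content is garbled while the escaped form is read correctly. It also shows `Properties.load(Reader)`, added in Java 6, which lets the caller name the encoding. The class name and the GREETING key are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncoding {
    /** Loads via an InputStream, which Properties always decodes as ISO-8859-1. */
    public static String loadFromStream(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    /** Loads via a Reader with an explicit UTF-8 decoder (available since Java 6). */
    public static String loadAsUtf8(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new InputStreamReader(
                    new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String namaste = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // A properties file saved as raw UTF-8, simulated in memory.
        byte[] rawUtf8 = ("GREETING = " + namaste + "\n").getBytes(StandardCharsets.UTF_8);

        // The same file after conversion to backslash-u escapes: pure ASCII bytes.
        byte[] escaped = "GREETING = \\u0928\\u092E\\u0938\\u094D\\u0924\\u0947\n"
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(loadFromStream(rawUtf8, "GREETING").equals(namaste)); // garbled
        System.out.println(loadFromStream(escaped, "GREETING").equals(namaste)); // intact
        System.out.println(loadAsUtf8(rawUtf8, "GREETING").equals(namaste));     // intact
    }
}
```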

It’s working!

INTERNATIONALISATION IN PHP

Challenges

• PHP 5 (and earlier) does not understand characters and encodings

• The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK)

• PHP has very limited functions for formatting date, time, currencies, etc.

• PHP doesn’t provide linguistic sorting!

The Good News – Intl extension

• Open source – http://pecl.php.net/intl

• Designed for PHP 5.x, part of PHP 5.3
– Configure using “--enable-intl”

• Leverages ICU and CLDR

• Available as OO and procedural APIs
– Collator::sort() vs. collator_sort()

• Yahoo! is a key contributor

The PHP Intl Library

• Collator

• NumberFormatter

• Locale

• Normalizer

• MessageFormatter

• IntlDateFormatter

• Grapheme

• ResourceBundle

• IDN

Corrected substring implementation

Formatting Numbers

Resource Bundles

• Externalise strings in your application

• Similar to how desktop applications are built
– One binary and additional language packs

• Similar to Windows™ resource files and Unix® message files
– Structure is different, see ICU resource bundles

• Key/value pairs
– Key is used by the application at run time to display the value

Additional Things

• Change the “default_charset” in php.ini to “utf-8”

• While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library
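The distinction behind the “grapheme_*” functions is between code units or code points and user-perceived characters. As a rough analogue in Java (the language used for the other examples in this deck, not PHP), the sketch below counts grapheme clusters with `java.text.BreakIterator`. The class name is illustrative, and segmentation of Indic conjuncts may vary between JDK versions, so the sample uses a simple combining accent.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCount {
    /** Counts user-perceived characters (grapheme clusters) rather than code units. */
    public static int graphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "cafe" followed by U+0301 COMBINING ACUTE ACCENT: looks like four characters.
        String cafe = "cafe\u0301";
        System.out.println("code units: " + cafe.length());
        System.out.println("graphemes:  " + graphemes(cafe));
    }
}
```

A substring or length function that works on code units could cut between the base letter and its accent; one that works on grapheme clusters cannot.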

• “echo” is encoding agnostic

Why Intl is better than mbstring?

Why Intl is better than mbstring? (contd.)

Resources

• http://www.w3.org/International/

• http://unicode.org/

• http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp

• http://pecl.php.net/intl

• http://php.net/manual/en/refs.international.php

