Introduction to W3C I18n Best Practices

Post on 24-May-2015

DESCRIPTION

A tutorial on Internationalisation, typical issues found across the web and how to go about solving them.

TRANSCRIPT

Introduction to W3C I18n Best Practices

Presented by Gopal Venkatesan <g13n@ymail.com>

नमस्कार

வணக்கம்

ನಮಸ್ಕಾರ

నమస్కారం

ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ

നമസ്കാരം

ନମସ୍କାର

নমস্কার

السلام علیکم

નમસ્કાર

Training Outline

• Internationalisation Vocabulary
• Typical Problems
  – Outline of the common problems found across the web
• Java and Internationalisation
  – The level of Internationalisation support available in Java
• Resource Bundles
  – Formatting messages the correct way
• PHP and Internationalisation
  – The level of Internationalisation support available in PHP

VOCABULARY

Unicode

• International standard for representing written language in computers
• Version 5.2, the latest at the time of this talk, adds 6,648 new characters, including support for Vedic Sanskrit
• Maintained in sync with ISO 10646
• Three main encodings: UTF-8, UTF-16 and UTF-32
• Address space of 21 bits

Unicode (contd.)

• UTF-8 is a variable-width encoding built from 8-bit code units (bytes)
• An encoded character can take one, two, three or four bytes
• UTF-8 is backward compatible with US-ASCII
• Default encoding for PHP6?

Unicode (contd.)

• UTF-16 uses 16-bit code units
• A single code unit cannot address the complete set, so characters outside the Basic Multilingual Plane use surrogate pairs
• Default encoding for strings in Java and JavaScript

Unicode (contd.)

• UTF-32 uses 32-bit code units
• Every Unicode character is addressed within a single code unit
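
The size differences between the three encodings can be checked directly. The following is a minimal Java sketch, not part of the original slides; the class name `EncodingSizes` and its helper methods are made up for illustration. It counts the bytes that a few sample characters occupy in each encoding. UTF-16BE is used so that no byte-order mark is added to the count, and the UTF-32 size is computed as four bytes per code point rather than by calling an encoder.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingSizes {
    /** Number of bytes the text occupies in UTF-8. */
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes in UTF-16 (big-endian, so no byte-order mark is added). */
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** UTF-32 always spends four bytes per code point. */
    static int utf32Bytes(String text) {
        return 4 * text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                    // U+0041, US-ASCII
            "\u00e9",                               // U+00E9, Latin small letter e with acute
            "\u0928",                               // U+0928, Devanagari letter NA
            new String(Character.toChars(0x1F600))  // U+1F600, outside the BMP
        };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s), utf32Bytes(s));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, while in UTF-16 only the last one needs two code units (a surrogate pair).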

Internationalisation

• Design and development of a product, application or document content that enables easy localisation for target audiences that vary in culture, region, or language

• Abbreviated as I18n as there are eighteen characters between “I” and “n”

Localisation

• Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)

• Translation is one aspect of localisation
• Abbreviated as L10n as there are ten characters between “L” and “n”

TYPICAL PROBLEMS

Typical Problem

The Solution

• Determine the user environment
  – Format dates, times and currencies as per the locale
• Understand the Internationalisation support available in your implementation language
• Use ICU or other Internationalisation libraries rather than rolling your own functions
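
As a rough illustration of locale-aware formatting with library support, here is a small Java sketch using the standard `java.text` formatters. The class name `LocaleFormatting` and the sample values are assumptions made for this example. The exact output for dates and for some currencies depends on the locale data shipped with the JDK, so only the number and US currency results are noted.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LocaleFormatting {
    /** Formats a number with the grouping and decimal separators of the locale. */
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Formats an amount using the currency conventions of the locale. */
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    /** Formats a date in the locale's long style; output varies with JDK locale data. */
    static String longDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        System.out.println(number(1234567.891, Locale.US));      // 1,234,567.891
        System.out.println(number(1234567.891, Locale.GERMANY)); // 1.234.567,891
        System.out.println(currency(1234.5, Locale.US));         // $1,234.50
        System.out.println(currency(1234.5, Locale.GERMANY));
        System.out.println(longDate(new Date(), Locale.US));
        System.out.println(longDate(new Date(), Locale.GERMANY));
    }
}
```

In a real application the locale would come from the user environment (for example a request header or a stored preference) instead of being hard-coded.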

COMMON ENCODING PROBLEMS

Tofu characters – Black hollow boxes

• Shown as a black hollow box, typically one per character
• Indicates a font problem, i.e. the system doesn’t have the right fonts to display the glyph(s)
• Tofu isn’t always a software problem: often it is not a bug, but it is really annoying

Question Marks – Incorrect conversion

• “???” is usually displayed when text is converted from one encoding to another
• It means the target encoding has no equivalent for a character in the source
• It is not always a bug, though it sometimes occurs when an incorrect encoding is specified
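
The question-mark effect is easy to reproduce. The sketch below is an illustration added to this transcript, and the class name `QuestionMarks` is hypothetical. It encodes a string into a target charset and decodes it again. `String.getBytes(Charset)` substitutes the charset's replacement byte for every character it cannot represent, which is `?` for ISO-8859-1 and US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class QuestionMarks {
    /** Encodes the text into the target charset and decodes it again. */
    static String roundTrip(String text, Charset target) {
        byte[] encoded = text.getBytes(target); // unmappable characters become '?'
        return new String(encoded, target);
    }

    public static void main(String[] args) {
        // Six Devanagari code points spelling a Hindi greeting.
        String hindi = "\u0928\u092e\u0938\u094d\u0924\u0947";

        // Latin-1 has no Devanagari, so every character is lost.
        System.out.println(roundTrip(hindi, StandardCharsets.ISO_8859_1)); // ??????

        // US-ASCII cannot represent the accented e at the end.
        System.out.println(roundTrip("caf\u00e9", StandardCharsets.US_ASCII)); // caf?

        // UTF-8 can represent everything, so nothing is lost.
        System.out.println(roundTrip(hindi, StandardCharsets.UTF_8).equals(hindi)); // true
    }
}
```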

Mojibake – 文字化け

• Pronounced “Moh-jee-baa-kay”, a Japanese word meaning “garbled characters”
• Occurs when text in one encoding is “interpreted” as some other encoding
• Most often caused by interpreting Latin-1 as UTF-8
  – UTF-8 is compatible only with US-ASCII
  – Characters outside the ASCII range are encoded differently in Latin-1 and UTF-8, and cause Mojibake
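
A small Java sketch can show both directions of a Latin-1/UTF-8 mix-up. It is an illustration added to this transcript, and the class name `MojibakeDemo` and its methods are hypothetical. Reading UTF-8 bytes as Latin-1 turns one accented letter into two unrelated characters, while reading Latin-1 bytes as UTF-8 yields the replacement character U+FFFD because the byte sequence is malformed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class MojibakeDemo {
    /** Encodes as UTF-8 but decodes as Latin-1: the classic garbling. */
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Encodes as Latin-1 but decodes as UTF-8: malformed bytes become U+FFFD. */
    static String latin1ReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    /** Undoes utf8ReadAsLatin1; possible because Latin-1 maps every byte value. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // ends with U+00E9

        String garbled = utf8ReadAsLatin1(original);
        // The two UTF-8 bytes C3 A9 show up as U+00C3 followed by U+00A9.
        System.out.println(garbled.equals("caf\u00c3\u00a9")); // true

        System.out.println(repair(garbled).equals(original)); // true

        // The lone byte E9 is not valid UTF-8, so it decodes to U+FFFD.
        System.out.println(latin1ReadAsUtf8(original).equals("caf\ufffd")); // true
    }
}
```

The repair step only works for the first direction. Once bytes have been replaced by U+FFFD, the original information is gone.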

JAVA™ AND UNICODE

Unicode support in Java™

• Java™ has always supported Unicode
• Java™ strings are UTF-16
  – A “char” in Java™ is a UTF-16 code unit, not a code point
• By default the input and output streams use the OS native charset
  – On Windows™ this is typically Windows-1252
  – On most Unices and Unix-like OS this is UTF-8
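
Both points can be illustrated with a short Java sketch. It is not the code from the original slides, which did not survive in this transcript, and the class name `CodeUnitsDemo` is hypothetical. The first two methods contrast code units with code points. The third writes through a `PrintStream` with an explicitly named charset, so the bytes produced do not depend on the OS default.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, for illustration only.
final class CodeUnitsDemo {
    /** String.length() counts UTF-16 code units, not characters. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Counts Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Prints the text through a stream that is told to use UTF-8 explicitly. */
    static byte[] printAsUtf8(String s) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, "UTF-8");
            out.print(s);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The letter a followed by U+1F600, which needs a surrogate pair.
        String text = "a" + new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(text));  // 3
        System.out.println(codePoints(text)); // 2
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true

        // U+0928 (Devanagari NA) takes three bytes in UTF-8.
        System.out.println(printAsUtf8("\u0928").length); // 3
    }
}
```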

A “Hello, world” example

“Hello, world” on GNU/Linux

Garbage In, Garbage Out!

“Hello, world” Corrected!

Oops!

“Hello, world” Corrected!

EXTERNALISING STRINGS

Resource Bundles

The Need

• Allows a single code base to display strings in multiple languages

• No need to refactor code to support new languages

Beginning

Beginning (Sum.properties)

• SUM_OF = Sum of
• AND = and
• IS = is

That was broken!

• It’s generally a bad idea to concatenate strings
  – Does not work for all languages since the grammar is different!
• Always use string substitution with positional parameters

Correct Way

Correct Way (contd.)

• SumI18n.properties
  – SUM = Sum of {0} and {1} is {2}
• SumI18n_hi.properties
  – SUM = {0} अतिरिक्त {1} {2} का बराबर है
• SumI18n_ta.properties
  – SUM = {0} மற்றும் {1} கூட்டினால் {2}
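
The positional-parameter approach can be sketched with `java.text.MessageFormat`. This is an illustration, not the code from the original slides. The class name `SumMessage` is hypothetical, and so is the second pattern, which was made up to show arguments appearing in a different order. To stay self-contained the sketch parses the bundle from an in-memory string. A real application would more likely call `ResourceBundle.getBundle("SumI18n", locale)` and let Java pick the matching properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical helper class, for illustration only.
final class SumMessage {
    /** Parses properties-file text into a bundle (stands in for a file on disk). */
    static ResourceBundle bundleFrom(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the SUM pattern and fills in the positional parameters. */
    static String sum(ResourceBundle bundle, Locale locale, int a, int b) {
        MessageFormat format = new MessageFormat(bundle.getString("SUM"), locale);
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        // Same key as SumI18n.properties in the slides.
        ResourceBundle english = bundleFrom("SUM = Sum of {0} and {1} is {2}\n");

        // Made-up pattern: the result comes first, as some grammars require.
        ResourceBundle reordered = bundleFrom("SUM = {2} is the sum of {0} and {1}\n");

        System.out.println(sum(english, Locale.ENGLISH, 2, 3));   // Sum of 2 and 3 is 5
        System.out.println(sum(reordered, Locale.ENGLISH, 2, 3)); // 5 is the sum of 2 and 3
    }
}
```

The calling code is identical for both bundles. Only the pattern decides where each argument lands, which is exactly what string concatenation cannot do.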

Oops!

• Java 1.5 property files are read as ISO-8859-1 (Latin-1)
• Use the “native2ascii” tool to convert non-ASCII characters in the files to Unicode escape sequences (\uXXXX)
• native2ascii -encoding UTF-8 SumI18n_hi.properties
• native2ascii -encoding UTF-8 SumI18n_ta.properties
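
To make the effect of the conversion concrete, here is a rough Java sketch of the same idea. It does not reproduce the native2ascii tool itself, and the class name `Native2AsciiSketch` and its methods are hypothetical. It replaces every non-ASCII UTF-16 code unit with a four-digit hexadecimal escape, then loads the escaped text through `Properties.load(InputStream)`, which reads Latin-1 bytes and decodes the escapes back to the original characters.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
final class Native2AsciiSketch {
    /** Replaces each non-ASCII code unit with a backslash-u hexadecimal escape. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c > 0x7f) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Loads "GREETING = value" the way a Latin-1 properties file would be read. */
    static String loadGreeting(String escapedValue) {
        try {
            byte[] fileBytes = ("GREETING = " + escapedValue + "\n")
                    .getBytes(StandardCharsets.ISO_8859_1);
            Properties properties = new Properties();
            properties.load(new ByteArrayInputStream(fileBytes));
            return properties.getProperty("GREETING");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Six Devanagari code points spelling a Hindi greeting.
        String hindi = "\u0928\u092e\u0938\u094d\u0924\u0947";

        String escaped = escape(hindi);
        System.out.println(escaped); // pure ASCII, safe in a Latin-1 file

        System.out.println(loadGreeting(escaped).equals(hindi)); // true
    }
}
```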

It’s working!

INTERNATIONALISATION IN PHP

Challenges

• PHP 5 (and earlier) treats strings as plain byte sequences and does not understand characters and encodings

• The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK)

• PHP has very limited functions for formatting date, time, currencies, etc.

• PHP doesn’t provide linguistic sorting!

The Good News – Intl extension

• Open source – http://pecl.php.net/intl
• Designed for PHP 5.x, part of PHP 5.3
  – Configure using “--enable-intl”
• Leverages ICU and CLDR
• Available as OO and procedural APIs
  – Collator::sort() vs. collator_sort()
• Yahoo! is a key contributor
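
The difference between plain sorting and linguistic sorting is the same in any language. The sketch below uses Java's `java.text.Collator` to stay consistent with the earlier examples. It is not PHP's Collator API, and the class name `LinguisticSort` is hypothetical. It sorts the same three words by raw UTF-16 code unit value and then with a locale-aware collator.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LinguisticSort {
    /** Sorts by UTF-16 code unit value, which is what a naive sort does. */
    static String[] byCodeUnit(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts with the collation rules of the given locale. */
    static String[] byCollator(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"Zebra", "apple", "\u00e9clair"};

        // Uppercase Z (0x5A) sorts before a (0x61), and the accented e (0xE9) sorts last.
        System.out.println(Arrays.toString(byCodeUnit(words)));

        // The collator compares base letters first: a, then e, then z.
        System.out.println(Arrays.toString(byCollator(Locale.ENGLISH, words)));
    }
}
```

The code-unit sort yields Zebra, apple, éclair, while the collator yields apple, éclair, Zebra, which is the order a reader would expect.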

The PHP Intl Library

• Collator
• NumberFormatter
• Locale
• Normalizer
• MessageFormatter
• IntlDateFormatter
• Grapheme
• ResourceBundle
• IDN

Corrected substring implementation

Formatting Numbers

Resource Bundles

• Externalize strings in your application
• Similar to how desktop applications are built
  – One binary and additional language packs
• Similar to Windows™ resource files and Unix® message files
  – Structure is different, see ICU resource bundles
• Key/value pairs
  – Key is used by the application at run time to display the value

Additional Things

• Change the “default_charset” in php.ini to “utf-8”

• While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library

• “echo” is encoding agnostic
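
Why grapheme-level functions are more precise can be shown with a small sketch. It is written in Java with `java.text.BreakIterator` to match the earlier examples, so it is an analogue of the idea and not the PHP `grapheme_*` API. The class name `Graphemes` is hypothetical. A user-perceived character may consist of several code points, and cutting between them damages the text.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class Graphemes {
    /** Counts user-perceived characters instead of code units. */
    static int count(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int graphemes = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            graphemes++;
        }
        return graphemes;
    }

    /** Returns the first n user-perceived characters without splitting any of them. */
    static String prefix(String text, int n) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int end = 0;
        for (int i = 0; i < n; i++) {
            int next = boundaries.next();
            if (next == BreakIterator.DONE) {
                break;
            }
            end = next;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // The last letter is e followed by U+0301, a combining acute accent.
        String text = "cafe\u0301";

        System.out.println(text.length()); // 5 code units
        System.out.println(count(text));   // 4 user-perceived characters

        // A naive substring of four code units silently drops the accent.
        System.out.println(text.substring(0, 4).equals("cafe")); // true

        // The grapheme-aware prefix keeps the accent attached to its base letter.
        System.out.println(prefix(text, 4).equals(text)); // true
    }
}
```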

Why is Intl better than mbstring?
