Introduction to W3C I18n Best Practices Presented by Gopal Venkatesan <[email protected]>

Post on 24-May-2015

DESCRIPTION

A tutorial on internationalisation: typical issues found across the web and how to go about solving them.

TRANSCRIPT

Page 1: Introduction to W3C I18N Best Practices

Introduction to W3C I18n Best Practices

Presented by Gopal Venkatesan <[email protected]>

Page 2: Introduction to W3C I18N Best Practices

नमस्कार

வணக்கம்

ನಮಸ್ಕಾರ

నమస్కారం

ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ

നമസ്കാരം

ନମସ୍କାର

নমস্কার

السلام علیکم

નમસ્કાર

Page 3: Introduction to W3C I18N Best Practices

Training Outline

• Internationalisation Vocabulary

• Typical Problems

– Outline the common problems found across the web

• Java and Internationalisation

– The level of Internationalisation support available in Java

• Resource Bundles

– Formatting messages the correct way

• PHP and Internationalisation

– The level of Internationalisation support available in PHP

Page 4: Introduction to W3C I18N Best Practices

VOCABULARY

Page 5: Introduction to W3C I18N Best Practices

Unicode

• International standard for representing written language in computers

• Latest version 5.2 adds 6648 new characters including support for Vedic Sanskrit

• Maintained in sync with ISO 10646

• Three main encodings: UTF-8, UTF-16 and UTF-32

• Address space of 21 bits

Page 6: Introduction to W3C I18N Best Practices

Unicode (contd.)

• UTF-8 is a multi-byte encoding that uses 8-bit code units

• An encoded character can take one, two, three or four bytes

• UTF-8 is backward compatible with US-ASCII

• Default encoding for PHP6?
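
The one-to-four-byte claim above can be checked with a small Java sketch. The class name Utf8Lengths and the sample characters are chosen here for illustration only; they are not from the slides.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, plain ASCII
        System.out.println(utf8Length("\u00e9"));       // U+00E9, Latin small e with acute
        System.out.println(utf8Length("\u0928"));       // U+0928, Devanagari letter NA
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, written as a surrogate pair
    }
}
```

The four lines printed are 1, 2, 3 and 4. The first case also illustrates the US-ASCII compatibility: an ASCII character is a single byte with the same value in both encodings.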

Page 7: Introduction to W3C I18N Best Practices

Unicode (contd.)

• UTF-16 uses 16-bit code units

• A single code unit cannot address the complete set, so characters outside the Basic Multilingual Plane use surrogate pairs

• Default encoding for strings in Java and JavaScript
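
A minimal Java sketch of what surrogates mean in practice: a character outside the Basic Multilingual Plane occupies two UTF-16 code units but is still one code point. The class and method names are illustrative assumptions.

```java
public class SurrogateDemo {
    // Length in UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Length in Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u0928";                                   // U+0928, inside the BMP
        String astral = new String(Character.toChars(0x1F600));  // U+1F600, outside the BMP

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));        // 1 1
        System.out.println(codeUnits(astral) + " " + codePoints(astral));  // 2 1
        System.out.println(Character.isHighSurrogate(astral.charAt(0)));   // true
    }
}
```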

Page 8: Introduction to W3C I18N Best Practices

Unicode (contd.)

• UTF-32 uses 32-bit code units

• Every Unicode character is addressed within a single code unit
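
As a rough illustration, a Java string can be viewed as a sequence of 32-bit values, one per code point, which is the shape UTF-32 gives to text. The helper name toUtf32 is hypothetical; it returns code point values rather than serialised UTF-32 bytes.

```java
import java.util.Arrays;

public class Utf32View {
    // One int per Unicode code point, i.e. one UTF-32 code unit each.
    static int[] toUtf32(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, which needs two code units in UTF-16.
        String text = "A" + new String(Character.toChars(0x1F600));
        System.out.println(text.length());                    // 3 UTF-16 code units
        System.out.println(Arrays.toString(toUtf32(text)));   // [65, 128512]
    }
}
```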

Page 9: Introduction to W3C I18N Best Practices

Internationalisation

• Design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language

• Abbreviated as I18n as there are eighteen characters between “I” and “n”

Page 10: Introduction to W3C I18N Best Practices

Localisation

• Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)

• Translation is one aspect of localisation

• Abbreviated as L10n as there are ten characters between “L” and “n”

Page 11: Introduction to W3C I18N Best Practices

TYPICAL PROBLEMS

Page 12: Introduction to W3C I18N Best Practices

Typical Problem

Page 13: Introduction to W3C I18N Best Practices

Typical Problem (Contd.)

Page 14: Introduction to W3C I18N Best Practices

Typical Problem (Contd.)

Page 15: Introduction to W3C I18N Best Practices

Typical Problem (Contd.)

Page 16: Introduction to W3C I18N Best Practices

Typical Problem (Contd.)

Page 17: Introduction to W3C I18N Best Practices

The Solution

• Determine the user environment

– Format dates, times, currencies as per the locale

• Understand the Internationalisation support available with your implementation language

• Use the ICU/Internationalisation libraries rather than rolling your own functions
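
A minimal sketch of locale-aware formatting with the JDK's own java.text classes, assuming the locale has already been determined (for example from the user's settings). The class name, method names and sample value are illustrative.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class LocaleFormatting {
    // Format a plain number using the conventions of the given locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Format a monetary amount; the symbol and layout depend on the locale data in the JDK.
    static String formatCurrency(double value, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        System.out.println(formatNumber(value, Locale.US));       // 1,234,567.891
        System.out.println(formatNumber(value, Locale.GERMANY));  // 1.234.567,891
        System.out.println(formatCurrency(value, Locale.US));
        System.out.println(formatCurrency(value, Locale.GERMANY));
    }
}
```

The same value comes out with different grouping and decimal separators without any hand-written formatting logic, which is the point of leaning on the library.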

Page 18: Introduction to W3C I18N Best Practices

COMMON ENCODING PROBLEMS

Page 19: Introduction to W3C I18N Best Practices

Tofu characters – Black hollow boxes

• Shown as a black hollow box, typically one per character

• Indicates a font problem, i.e. the system doesn’t have the right fonts to display the glyph(s)

• Tofu isn’t always a software problem – not a bug, but really annoying

Page 20: Introduction to W3C I18N Best Practices

Tofu characters – Black hollow boxes

Page 21: Introduction to W3C I18N Best Practices

Question Marks – Incorrect conversion

• “???” is usually displayed when converting text from one encoding to another

• Means there is no equivalent character in the target encoding for the source character

• Not always a bug, though it sometimes occurs when an incorrect encoding is specified
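
The question-mark effect is easy to reproduce in Java: encoding text into a charset that lacks the characters substitutes the charset's replacement byte, which is “?” for US-ASCII. The class and method names below are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarks {
    // Encode into the target charset and decode again with the same charset.
    static String roundTrip(String text, Charset target) {
        byte[] bytes = text.getBytes(target);  // unmappable characters are replaced
        return new String(bytes, target);
    }

    // True if every character of the text exists in the target charset.
    static boolean representable(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // A Hindi greeting (seven UTF-16 code units), written with escapes.
        String hindi = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";

        System.out.println(roundTrip(hindi, StandardCharsets.US_ASCII));           // ???????
        System.out.println(roundTrip(hindi, StandardCharsets.UTF_8).equals(hindi)); // true
        System.out.println(representable(hindi, StandardCharsets.ISO_8859_1));      // false
    }
}
```

Checking with canEncode before converting is one way to detect the loss instead of silently shipping question marks.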

Page 22: Introduction to W3C I18N Best Practices

Question Marks – Incorrect conversion

Page 23: Introduction to W3C I18N Best Practices

Mojibake – 文字化け

• Pronounced “Moh-jee-baa-kay”; a Japanese word meaning “garbled characters”

• Occurs when text in one encoding is interpreted as some other encoding

• Most often caused by interpreting Latin-1 as UTF-8 (or vice versa)

– UTF-8 is compatible only with US-ASCII

– Characters outside the ASCII range have different byte sequences in Latin-1 and UTF-8, so they turn into mojibake
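
Both directions of the Latin-1/UTF-8 mix-up can be sketched in a few lines of Java. The sample word and the names used here are illustrative, not taken from the slides.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Bytes written as UTF-8 but decoded as Latin-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Bytes written as Latin-1 but decoded as UTF-8.
    static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "caf\u00e9";  // "cafe" with an acute accent on the e

        // The two UTF-8 bytes C3 A9 become two Latin-1 characters, U+00C3 and U+00A9.
        System.out.println(utf8ReadAsLatin1(word));

        // The lone Latin-1 byte E9 is not valid UTF-8 and decodes to U+FFFD.
        System.out.println(latin1ReadAsUtf8(word));

        // Pure ASCII survives either way, which is why the bug often goes unnoticed.
        System.out.println(utf8ReadAsLatin1("cafe"));
    }
}
```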

Page 24: Introduction to W3C I18N Best Practices

Mojibake – 文字化け

Page 25: Introduction to W3C I18N Best Practices

JAVA™ AND UNICODE

Page 26: Introduction to W3C I18N Best Practices

Unicode support in Java™

• Java™ has always supported Unicode

• Java™ strings are UTF-16

– A “char” in Java™ is a UTF-16 code unit, not a code point

• By default the input and output streams use the OS native charset

– On Windows™ this is typically Windows-1252 (Western European locales)

– On most Unices and Unix-like OSes this is UTF-8
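
Because the default charset varies by platform, one common precaution is to name the charset explicitly whenever bytes are turned into characters or back. The following is a minimal sketch under that assumption; it is not the code from the “Hello, world” slides, and the class and helper names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Write text through a Writer that encodes with the given charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read one line through a Reader that decodes with the given charset.
    static String read(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Platform default: " + Charset.defaultCharset());

        String hindi = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930";
        byte[] utf8 = write(hindi, StandardCharsets.UTF_8);
        System.out.println(utf8.length);                                      // 21
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(hindi)); // true
    }
}
```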

Page 27: Introduction to W3C I18N Best Practices

A “Hello, world” example

Page 28: Introduction to W3C I18N Best Practices

A “Hello, world” example (contd.)

Page 29: Introduction to W3C I18N Best Practices

A “Hello, world” example (contd.)

Page 30: Introduction to W3C I18N Best Practices

“Hello, world” on GNU/Linux

Page 31: Introduction to W3C I18N Best Practices

Garbage In, Garbage Out!

Page 32: Introduction to W3C I18N Best Practices

“Hello, world” Corrected!

Page 33: Introduction to W3C I18N Best Practices

Oops!

Page 34: Introduction to W3C I18N Best Practices

“Hello, world” Corrected!

Page 35: Introduction to W3C I18N Best Practices

EXTERNALISING STRINGS

Resource Bundles

Page 36: Introduction to W3C I18N Best Practices

The Need

• Allows a single code base to display strings in multiple languages

• No need to refactor code to support new languages

Page 37: Introduction to W3C I18N Best Practices

Beginning

Page 38: Introduction to W3C I18N Best Practices

Beginning (Sum.properties)

• SUM_OF = Sum of

• AND = and

• IS = is

Page 39: Introduction to W3C I18N Best Practices

That was broken!

• It’s generally a bad idea to concatenate strings

– Does not work for all languages since the grammar is different!

• Always use string substitution with positional parameters
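
A small sketch of positional substitution with java.text.MessageFormat. The English pattern matches the one used later in the deck; the second, reordered pattern is a made-up stand-in for a language with different word order.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class PositionalMessage {
    // Fill the numbered placeholders of a pattern with the given arguments.
    static String format(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.ENGLISH).format(args);
    }

    public static void main(String[] args) {
        // The translator controls word order; the code only supplies the values.
        System.out.println(format("Sum of {0} and {1} is {2}", 2, 3, 5));  // Sum of 2 and 3 is 5

        // Hypothetical pattern where the result has to come first.
        System.out.println(format("{2} = {0} + {1}", 2, 3, 5));            // 5 = 2 + 3
    }
}
```

With concatenation, the second ordering would require a code change; with positional parameters it is only a change to the translated string.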

Page 40: Introduction to W3C I18N Best Practices

Correct Way

Page 41: Introduction to W3C I18N Best Practices

Correct Way (contd.)

• SumI18n.properties

– SUM = Sum of {0} and {1} is {2}

• SumI18n_hi.properties

– SUM = {0} अतिरिक्त {1} {2} का बराबर है

• SumI18n_ta.properties

– SUM = {0} மற்றும் {1} கூட்டினால் {2}
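
A rough sketch of how such a bundle entry is consumed. To stay self-contained it loads the properties text from a string with PropertyResourceBundle; an application would more typically call ResourceBundle.getBundle("SumI18n", locale) and let the JDK pick the matching file. The class and method names are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class SumBundleDemo {
    // Parse properties-file text into a resource bundle.
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Look up the SUM pattern and fill in the operands and their sum.
    static String sumMessage(ResourceBundle bundle, Locale locale, int a, int b) {
        MessageFormat format = new MessageFormat(bundle.getString("SUM"), locale);
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        // Same entry as the base SumI18n.properties file.
        ResourceBundle base = load("SUM = Sum of {0} and {1} is {2}\n");
        System.out.println(sumMessage(base, Locale.ENGLISH, 2, 3));  // Sum of 2 and 3 is 5
    }
}
```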

Page 42: Introduction to W3C I18N Best Practices

Oops!

• Java 1.5 property files are read as ISO-8859-1 (Latin-1)

• Use the “native2ascii” tool to convert Unicode files to escape sequences (\uXXXX)

• native2ascii -encoding UTF-8 SumI18n_hi.properties

• native2ascii -encoding UTF-8 SumI18n_ta.properties
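
To make the conversion concrete, here is a simplified Java sketch of the kind of escaping native2ascii performs: every UTF-16 code unit outside ASCII is written as a backslash, the letter u and four hexadecimal digits. It is an illustration of the idea only, not the tool's actual implementation, and the class and method names are made up.

```java
public class AsciiEscaper {
    // Replace each non-ASCII UTF-16 code unit with a six-character escape sequence.
    static String escapeNonAscii(String text) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                escaped.append(c);  // ASCII passes through unchanged
            } else {
                escaped.append(String.format("\\u%04x", (int) c));
            }
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // Key and separator stay readable; the two Devanagari code units are escaped.
        System.out.println(escapeNonAscii("SUM = \u0939\u0948"));
    }
}
```

The escaped output contains only ASCII, so it reads back identically whether the file is treated as ISO-8859-1 or UTF-8.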

Page 43: Introduction to W3C I18N Best Practices

It’s working!

Page 44: Introduction to W3C I18N Best Practices

INTERNATIONALISATION IN PHP

Page 45: Introduction to W3C I18N Best Practices

Challenges

• PHP 5 (and earlier) does not understand characters and encodings

• The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK)

• PHP has very limited functions for formatting date, time, currencies, etc.

• PHP doesn’t provide linguistic sorting!

Page 46: Introduction to W3C I18N Best Practices

The Good News – Intl extension

• Open source – http://pecl.php.net/intl

• Designed for PHP 5.x, part of PHP 5.3

– Configure using “--enable-intl”

• Leverages ICU and CLDR

• Available as OO and procedural APIs

– Collator::sort() vs. collator_sort()

• Yahoo! is a key contributor

Page 47: Introduction to W3C I18N Best Practices

The PHP Intl Library

Collator

Intl

NumberFormatter

Locale

Normalizer

MessageFormatter

IntlDateFormatter

Grapheme

ResourceBundle

IDN

Page 48: Introduction to W3C I18N Best Practices

Corrected substring implementation

Page 49: Introduction to W3C I18N Best Practices

Formatting Numbers

Page 50: Introduction to W3C I18N Best Practices

Resource Bundles

• Externalize strings in your application

• Similar to how desktop applications are built

– One binary and additional language packs

• Similar to Windows™ resource files and Unix® message files

– Structure is different, see ICU resource bundles

• Key/value pairs

– Key is used by the application at run time to display the value

Page 51: Introduction to W3C I18N Best Practices

Additional Things

• Change the “default_charset” in php.ini to “utf-8”

• While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library

• “echo” is encoding agnostic
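
The difference between counting code units and counting graphemes (user-perceived characters) is not specific to PHP. As a hedged analogue of what the “grapheme_*” functions address, here is a Java sketch using java.text.BreakIterator; it does not show the PHP API itself, and the class and method names are illustrative.

```java
import java.text.BreakIterator;

public class GraphemeCount {
    // Count user-perceived characters by walking character boundaries.
    static int graphemes(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Letter e followed by U+0301 (combining acute accent): one visible character.
        String accented = "e\u0301";
        System.out.println(accented.length());    // 2 code units
        System.out.println(graphemes(accented));  // 1 grapheme
    }
}
```

A substring or length operation that works on code units could cut this character in half, which is the class of bug grapheme-aware functions avoid.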

Page 52: Introduction to W3C I18N Best Practices

Why Intl is better than mbstring?

Page 53: Introduction to W3C I18N Best Practices

Why Intl is better than mbstring? (contd.)