Let's Explore Chinese i18n/L10n on GNU/Linux! Anthony Fok, ThizLinux Laboratory Ltd. HKLUG Linux Talk, 13 April 2002

Let's Explore Chinese Internationalization and Localization on GNU/Linux!

Anthony Fok (霍東靈), ThizLinux Laboratory Ltd. (即時系統科研有限公司)

HKLUG Linux Talk, 13 April 2002

Overview

● Introduction to Chinese charsets and encodings
  – GB 18030-2000 and HKSCS-2001
● Chinese i18n/L10n infrastructure on GNU/Linux
● Participating in Chinese i18n/L10n
● Todo list and future developments

Chinese character sets and encodings

In the beginning, there were only 0 and 1

● A computer sees all data as 0s and 1s
● Each “on-off switch” unit is a “bit” (位元、比特)
● 8 bits make up 1 “byte” or “octet” (位元組、字節)
● 0000 0000 to 1111 1111 (0x00 to 0xFF) make up 256 code points
● Initially, each character was stored in 1 byte
  – ASCII (ISO 646 IRV)
  – ISO 8859-1 to ISO 8859-16 (Latin1, Latin2, Greek, Hebrew, Thai, Cyrillic, etc.)
  – 256 code points are NOT enough for Chinese!

So many charsets and encodings!

● More than 100,000 Chinese (Han) characters have existed over time
  – Unicode 3.2 / ISO 10646 includes over 70,000
  – CCCII includes over 75,000
  – Invented in China; adopted by Japan, Korea, and Vietnam: “CJKV”
  – Sources include:
    ● 漢語大字典 (Hanyu Da Zidian)
    ● 康熙字典 (Kangxi Zidian)
    ● Regional standards (GB, CNS, HKSCS, JIS, KSC)

1 byte not enough? Let's use more!

● If all bits are available:
  – 1 byte, 8 bits, 2^8 = 256 (0x00..0xFF)
  – 2 bytes, 16 bits, 2^16 = 65,536 (0x0000..0xFFFF)
  – 3 bytes, 24 bits, 2^24 = 16,777,216 (0x000000..0xFFFFFF)
  – 4 bytes, 32 bits, 2^32 = 4,294,967,296 (0x00000000..0xFFFFFFFF)
● Most legacy encodings must stay ASCII-compatible, so they cannot use all of this space
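
The arithmetic on this slide is easy to check. The following is only an illustrative Python sketch, not part of the original talk; the helper name `codespace` is made up for the example. It also hints at why ASCII compatibility costs space: assuming a 2-byte scheme that reserves 0x00-0x7F for ASCII and sets the high bit on both bytes, only 128 x 128 pairs remain.

```python
def codespace(n_bytes: int) -> int:
    """Number of distinct values when every bit of n_bytes bytes is usable."""
    return 2 ** (8 * n_bytes)


for n in range(1, 5):
    print(f"{n} byte(s): {codespace(n):,} code points")

# Assumption for illustration: an ASCII-compatible 2-byte scheme that keeps
# 0x00-0x7F for ASCII and only uses bytes 0x80-0xFF in double-byte characters.
high_bit_bytes = len(range(0x80, 0x100))  # 128 byte values have the high bit set
ascii_safe_pairs = high_bit_bytes ** 2    # far fewer than the full 65,536
print(f"2 bytes, high bit set on both: {ascii_safe_pairs:,}")
```

Real legacy encodings carve out even smaller regions than this, as the GB 2312 byte ranges show.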

GB 2312-80

● GB 2312 is a national standard (Guobiao, 國標) of mainland China
  – 《信息技術 信息交換用漢字編碼字符集 基本集》 (Information Technology: Chinese Coded Character Set for Information Interchange, Basic Set), published in 1980
  – 2-byte, {0xA1-0xFE}{0xA1-0xFE}, or 94x94, for a total of 8836 possible 2-byte code points
  – 6,763 Han characters, for a total of 7,445 characters
    ● Sidenote: GB/T 12345 provides a Traditional Chinese charset encoded in the same space as GB 2312-80
● Called zh_CN.GB2312 or zh_CN.EUC-CN on GNU/Linux
  – Too few characters! (e.g. the name 朱鎔基, Zhu Rongji, ends up written as 朱容基)
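
The 94x94 grid and the missing-character problem can both be reproduced with Python's built-in codecs. This is a hedged sketch rather than anything shown in the talk: `can_encode` is a hypothetical helper, and the test string uses the Simplified form 镕 of the character in Zhu Rongji's name, which is the form a GB 2312 system would need.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# 94 rows x 94 columns: lead and trail bytes both run from 0xA1 to 0xFE.
rows = len(range(0xA1, 0xFF))
grid_size = rows * rows
print(grid_size)

name = "朱镕基"  # Zhu Rongji, written in Simplified Chinese
print(can_encode(name, "gb2312"))   # 镕 is missing from GB 2312-80
print(can_encode(name, "gbk"))      # GBK adds the missing character
print("中".encode("gb2312").hex())  # both bytes fall inside 0xA1..0xFE
```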

GBK Specification

● China actively participates in ISO 10646
● GB13000.1 = Unicode 2.1 (ISO 10646-1993)
● Too many legacy GB2312 applications
● Need a migration plan, an intermediate solution
● GBK is the first step in that direction (1995)
● Includes the repertoire of the CJK Unified Ideographs in GB13000.1 / Unicode 2.1
● U+4E00 to U+9FA5, over 20,000 Han ideographs
● Backward compatible with GB2312
● Implemented in Simplified Chinese Windows 95 as CP936
● {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}
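
As a rough illustration of the byte ranges above (a sketch only; `is_gbk_double_byte` is a hypothetical helper and checks the ranges quoted on the slide, not whether a character is actually assigned), the size of the GBK code space and its GB2312 compatibility can be checked in Python:

```python
def is_gbk_double_byte(lead: int, trail: int) -> bool:
    """Check a byte pair against the GBK ranges quoted on the slide."""
    lead_ok = 0x81 <= lead <= 0xFE
    trail_ok = 0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE
    return lead_ok and trail_ok


lead_count = 0xFE - 0x81 + 1                         # 126 lead bytes
trail_count = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)  # 63 + 127 trail bytes
pair_count = lead_count * trail_count
ideograph_count = 0x9FA5 - 0x4E00 + 1                # size of U+4E00..U+9FA5
print(pair_count, ideograph_count)

# Backward compatibility: a GB2312 character keeps the same bytes in GBK.
print("中".encode("gbk") == "中".encode("gb2312"))

# GBK also uses trail bytes below 0x80, which GB2312 never does; 0x7F is excluded.
print(is_gbk_double_byte(0x81, 0x40), is_gbk_double_byte(0x81, 0x7F))
```

The pair count is larger than the ideograph count, which is why the whole U+4E00 to U+9FA5 block fits in two bytes.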

Big-5 (「五大碼」)

● A “round-table” standard made up by the “Big-5” companies in Taiwan
● Implemented by all major Chinese OS's
  – 倚天 (ETen), 零一, 國喬, Traditional Chinese Windows, etc.
● Not very well designed; the character selection was not rigorous enough
  – Two characters are duplicated
  – Missing 「 」 and other chars used in HK
  – In Taiwan, attempts to fix/extend Big5 basically failed (CMEX's Big-5+, Big-5E...)

First steps beyond Big-5

– 倚天 ETen added some characters (Hiragana, Katakana, 「裏、銹」, etc.); some call it Big5-ETen. De facto Big5 standard on GNU/Linux
– Microsoft Code Page 950 includes 「裏、銹」 etc., but not all of ETen's extensions
● User-Defined Areas (UDA), Vendor-Defined Areas (VDA), EUDC (End-User Defined Characters), Private Use Areas (PUA)
  – Different people use EUDC differently... a messy situation
  – The demise of CMEX's Big-5+ standard

Unicode / ISO 10646

● Unicode Consortium (Industry)
● ISO/IEC 10646 (Academic/Int'l Standard)
● The two joined efforts to produce Unicode / UCS
  – Universal Multiple-Octet Coded Character Set
  – ISO: Design, adding characters to repertoire
  – Unicode Consortium: Technical implementation
● Code range: U+0000 to U+10FFFF
  – 1,114,112 possible code points

Unicode / ISO 10646

● Think “integers”: UCS-2, UCS-4
● Think “strings”
  – UTF-7
  – UTF-8
    ● Variable width, 1 to 4 bytes
  – UTF-16
    ● 16-bit code units: one unit per BMP character, or a surrogate pair (a high and a low surrogate from U+D800-U+DFFF) for code points up to U+10FFFF
  – UTF-32
    ● Fixed width 32-bit; limited to U+10FFFF (UCS-4 allows up to U+7FFFFFFF)
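
To make the difference between the UTF forms concrete, here is a small Python sketch (an illustration added for this write-up, with a hypothetical helper `surrogate_pair`). It prints the encoded size of an ASCII letter, a BMP ideograph, and a CJK Extension B code point, then derives the UTF-16 surrogates by hand.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


for ch in ("A", "中", chr(0x20000)):  # ASCII, BMP ideograph, CJK Extension B
    sizes = {enc: len(ch.encode(enc))
             for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(f"U+{ord(ch):04X}", sizes)

high, low = surrogate_pair(0x20000)
print(hex(high), hex(low))
```

The hand-computed pair matches what the `utf-16-be` codec emits for the same code point, which is a quick way to sanity-check the bit arithmetic.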

Unicode / ISO 10646

● ISO 10646-1:1993
● ISO 10646-1:2000
● ISO 10646-2:2001
● Unicode 3.2 just came out
● More world languages are being researched and added, a truly worldwide effort.

HKSCS-2001 (香港增補字符集-2001, Hong Kong Supplementary Character Set)

– A brief history
  ● GCCS (政府通用字庫, Government Common Character Set), 1995
  ● HKSCS-1999
    – Official encoding name: BIG5-HKSCS (IANA Registry)
  ● HKSCS-2001
– Actively promoted by ITSD
– ITSD (HKSARG) wishes HKSCS-2001 to be implemented on GNU/Linux too, and actively assists the community by providing guidance and advice
– Excellent official website, open standard (starts from http://www.digital21.gov.hk/eng/hkscs/)

Sample HKSCS Chinese Text

● 大家好!你同我一齊玩! (Hello everyone! Come and play with me!)
● 李、仔、魚涌、深水
● 大廈 /有啊!
● ( ……仲好似有五個粗口字 ) (...and there also seem to be five swear-word characters) Hehe...
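
Whether a character fits in plain Big5 or needs HKSCS can be probed with Python's built-in `big5` and `big5hkscs` codecs. This is a sketch under assumptions: `can_encode` is a hypothetical helper, and the Cantonese characters 嘅 and 咗 are examples chosen for this note rather than taken from the slide.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# 好 is an ordinary Big5 character; 嘅 and 咗 are common in written Cantonese.
for ch in "好嘅咗":
    print(ch, f"U+{ord(ch):04X}",
          "big5:", can_encode(ch, "big5"),
          "big5hkscs:", can_encode(ch, "big5hkscs"))
```

A text that round-trips through `big5hkscs` but not `big5` is exactly the kind of data that breaks on systems without HKSCS support.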

GB 18030-2000

● GB 18030-2000 Standard
● Rationale for a new standard: the 70,207+ unified Han ideographs in Unicode 3.1 won't all fit in the 2-byte codespace of the GBK specification
  – Full title: 《信息技術 信息交換用漢字編碼字符集 基本集的擴充》 (Information Technology: Chinese Ideograms Coded Character Set for Information Interchange, Extension for the Basic Set) (2000-03-17, 2000-11-30)
  – Further extends GBK to add a 4-byte codespace
    ● More than enough to cover U+0000 to U+10FFFF
    ● Compatible with all future versions of ISO 10646
    ● Backward compatible with GB2312 and GBK

GB 18030-2000

● Why is GB18030 significant?
  – It solves a pressing issue in China. Finally, all people's names, geographic names, and ancient text can be properly processed
  – It is mandatory: all operating systems sold after 2001-08-31 must support GB18030
  – Products must pass GB18030 certification to ensure proper input, editing, screen display, and printing of GB18030 text
  – Thiz Linux Desktop was awarded A+ Grade in GB18030 Certification Test!

GB 18030-2000

● 1-byte = ISO 646-IRV (US-ASCII)
  – {0x00-0x7F}
● 2-byte =~ GBK
  – {0x81-0xFE}{0x40-0x7E, 0x80-0xFE}
● 4-byte
  ● Mapped linearly with Unicode while skipping all existing mappings
  ● Can be calculated algorithmically
  ● {0x81-0xFE}{0x30-0x39}{0x81-0xFE}{0x30-0x39}
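
The “calculated algorithmically” point can be sketched for the supplementary planes, where the mapping is purely linear. The Python below is an illustration under stated assumptions, not the talk's own code: `gb18030_supplementary` is a hypothetical helper, and the starting point (U+10000 at bytes 0x90 0x30 0x81 0x30) comes from the GB 18030 mapping rather than from the slide. BMP characters additionally need the table of existing 2-byte mappings, which is not reproduced here.

```python
def gb18030_supplementary(code_point: int) -> bytes:
    """Compute the 4-byte GB18030 code for U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only handles the supplementary planes")
    # Linear index of 0x90 0x30 0x81 0x30, the code assigned to U+10000.
    index = (0x90 - 0x81) * (10 * 126 * 10) + (code_point - 0x10000)
    index, b4 = divmod(index, 10)   # 4th byte: 0x30..0x39
    index, b3 = divmod(index, 126)  # 3rd byte: 0x81..0xFE
    b1, b2 = divmod(index, 10)      # 2nd byte: 0x30..0x39, rest is 1st byte
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


four_byte_space = 126 * 10 * 126 * 10
unicode_space = 0x10FFFF + 1
print(f"{four_byte_space:,} four-byte codes vs {unicode_space:,} code points")

for ch in ("A", "中", chr(0x20000)):
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} byte(s))")

print(gb18030_supplementary(0x20000) == chr(0x20000).encode("gb18030"))
```

Comparing the hand-computed bytes with Python's `gb18030` codec gives a quick cross-check of the arithmetic.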

GB 18030-2000

● Official information hard to find
  – Hard to obtain the printed version of the GB18030 standard outside China
● Fortunately, many early implementers and charset experts have provided info:
  – Dirk Meyer (Adobe) translated the summary
  – Markus Scherer (IBM, Unicode Consortium) provides the gb-18030-2000.xml conversion table
  – Many efforts and interests from others, including ThizLinux Laboratory

UnicodeData.txt, Unihan.txt

● UnicodeData.txt
  – Important information on the character repertoires and control codes in Unicode
● Unihan.txt
  – Valuable information (attributes) on over 70,000 CJK Unified Ideographs
    ● Source
    ● Pronunciations in CJKV (+ Cantonese and Mandarin)
    ● Meaning
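
UnicodeData.txt is a plain semicolon-separated file, so it is straightforward to read. The sketch below is illustrative only: `parse_unicodedata_line` and the field labels in `FIELDS` are names invented for this example (the file itself has no header row), while the sample line is the entry for U+0041. Python's standard `unicodedata` module exposes the same database, including the algorithmically derived names of CJK ideographs.

```python
import unicodedata

# Hypothetical labels for the 15 semicolon-separated fields of one line.
FIELDS = ("code", "name", "category", "combining", "bidi", "decomposition",
          "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
          "uppercase", "lowercase", "titlecase")


def parse_unicodedata_line(line: str) -> dict:
    """Turn one UnicodeData.txt line into a field-name -> value mapping."""
    return dict(zip(FIELDS, line.rstrip("\n").split(";")))


record = parse_unicodedata_line(
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record["name"], record["category"], record["lowercase"])

# CJK ideographs are given as ranges in the file; names derive from the code.
print(unicodedata.name("中"), unicodedata.category("中"))
```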

Difficulties in implementing HKSCS and GB18030

● HKSCS-2001
  ● CJK Extension B etc. (U+20000 – U+2FFFF), but not all programs support code points beyond U+FFFF yet
  ● Lack of fonts
● GB18030
  ● Huge! 4-byte
  ● Certification
  ● Fonts are available but expensive (TrueType or bitmap)
– Both are Unicode-based solutions, so as Unicode support improves, so will support for HKSCS and GB18030

Other Chinese encoding standards

● CCCII (Chinese Character Code for Information Interchange)
  – http://public.ptl.edu.tw/publish/suyan/42/text_07.htm
● CNS 11643
● Big-5+, Big-5E
● Encoding based on Cangjie (倉頡)
● And many more

Chinese localization teams for GNU/Linux and *BSD

● CLE (Chinese GNU/Linux Extension)
  – A group of pioneering volunteers originally led by Platin (小虫)
● Debian Chinese Project
● FreeBSD Chinese localization team
● Translation teams in mainland China, Hong Kong, and Taiwan
● Many more CJKV and i18n/L10n teams worldwide, including Chinese and non-Chinese!

Major Chinese GNU/Linux Distributions

● Major Chinese GNU/Linux distributions
  – Thiz Linux Desktop 6.0 (即時 Linux 桌面環境 6.0)
  – Turbolinux 7.0, Chinese edition
  – Chinese 2000 (中文 2000)
  – Xteam (沖浪), Red Flag (紅旗), COSIX (中軟), Happy (幸福), Linpus (百資), XLinux (網虎)
● Well-known international distributions with Chinese support
  – Debian GNU/Linux, Red Hat Linux, Linux Mandrake, (SuSE, Slackware), FreeBSD

GNU C Library (GLIBC)

● Libc5
● Glibc 2.1
● Glibc 2.2
● Conversion tables
  – Big5 (CLE), GBK (Justin Yu, Sean Chen)
  – big5hkscs.c (Roger So, Ulrich Drepper, ThizLinux, James Su)
  – GB18030 (Wu Jian, Ulrich, ThizLinux, James Su, another version by Yu Shao)

XFree86 / X Window System

● XLFD, fontset
● Xrender / Xft (Keith Packard)
● X-TT, “freetype” module
● Addition of Big5-HKSCS encodings (Roger So)
● Addition of GB18030 encoding (James Su et al.)

GTK+ and GNOME

● GNOME 1.x
  – Charset handling based on Glibc and XFree86
  – Good, but not perfect
● GNOME 2.0 (in development)
  – Pango
  – Xft

Qt 3.0.4 and KDE 3.0.1

● Qt comes with its own “codecs” in order to be a multiplatform toolkit.
  – Somewhat tedious... the tables already created for Glibc must be re-created for Qt
    ● We cannot directly reuse Glibc's code because of licensing issues... No big deal, just extra effort.
  – Good Unicode support; handles everything with Unicode internally.
  – Currently only supports UCS-2, which poses challenges for HKSCS-2001

Chinese Input Method Servers

● XCIN
● Chinput
  – miniChinput
  – magicChinput
● 楊春白雪
● MyIM

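Chinese input methods

● Cangjie (倉頡)
● Array 30 (行列 30)
● Dayi (大易)
● Wubi (五筆字型)
● Intelligent ABC (智能 ABC), Intelligent Pinyin (智能拼音)
● Hybrid (混合)
● Many others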

Chinese fonts

● Arphic (文鼎)
  – AR PL Mingti2L Big5
  – AR PL SungtiL GB
  – AR PL KaitiM Big5
  – AR PL KaitiM GB
● DynaLab (華康)
● Founder (方正)
● Ten GNU GPL Chinese fonts by 王漢忠
  – ...unfortunately, the format is not very suitable

Web Browsers

● Netscape 4.79
● Mozilla 0.9.9
  – Dillo, Galeon, etc.
● Konqueror

CJK LaTeX and FreeType

● CJK LaTeX, written by Werner Lemberg from Germany
  – Yes, Werner can speak Chinese too! Amazing!
● FreeType 1.3.1 and FreeType 2.0.9:
  – TrueType (and Type1, BDF etc.) font library
  – Main authors: David Turner, Robert Wilhelm, Werner Lemberg

PostScript and PDF

● Ghostscript + CJK (GS-CJK)
● Adobe's CMaps (HKscs, GBK2K, etc.)
● Acrobat Reader 4.05 for Linux does not come with the CMaps (HKscs and GBK2K) that are already in Acrobat Reader 5.0
● Ghostscript and XPDF are constantly improving

Office Suites

– OpenOffice.org family (Thiz Office, Kai Office, Red Office)
  ● Chinese support improving, a joint effort
  ● Excellent i18n/L10n support for all languages
– HancomOffice
  ● Will be based on Qt 3
  ● qbig5hkscscodec.cpp for Qt2 provided by ThizLinux Laboratory; Hancom ported the code for Qt3
– Lightweight: AbiWord and Gnumeric
  ● Quite good too!

How to participate in Chinese i18n/L10n efforts on GNU/Linux

● Improve existing infrastructure
● Work on new areas
● Help with localization and translation efforts
● Join a project that you like, whether it is Chinese i18n/L10n related or not
● Help spread the word! :-)

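PO Translation

● GNOME 2.0
● KDE 3.0
● GNU Utilities
● Gettext tools
● PO / MO formats
● Usage, encoding issues
● Better to leave a string untranslated than to mistranslate it (寧可不譯,不可誤譯)
● Example of a mistranslation: “anti-aliased font” rendered as 「非化名的字型」 (“non-pseudonym font”); better: 平滑字型 (smooth font) or 反鋸齒字型 (anti-aliased font)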

Reference websites

– http://cle.linux.org.tw/
– http://xcin.linux.org.tw/
– http://www.debian.org.hk/intl/zh/
– http://linuxfab.cx/
– http://www.linuxforum.net/
– http://www.unicode.org/
– 朱邦復先生工作室 (Chu Bong-Foo's studio): http://www.cflabs.com/
– http://www.google.com/

TODO

● Some programs still need to be revised in order to conform to the i18n/L10n infrastructure
● Always room for improvement in terms of ease of use, completeness, and stability
● More participants are welcome

Future Developments and Opportunities

● Handwriting pad (手寫板)
● Voice recognition (語音識別)
● Smarter Cantonese input methods?
● IIIMF to replace XIM?
● OpenType to replace TrueType?
● More interesting Chinese language research based on GNU/Linux systems?

● All skills are useful, even if you are not in CS, CE or EE!
  ● Mathematics, Physics theory
    – e.g. the author of XCIN studied Physics...
  ● C, C++, Perl, Python, GTK, Qt
  ● Linguistics (語言學), Phonetics (語音學)
    – IPA, Jyutping, Japanese, Korean...
● What we can learn during the process
  – Skills development, learning English, learning other new languages, meeting friends, and many more!

Comments and Suggestions

All questions are welcome! :-)
