101035 中文信息处理 chinese nlp lecture 2 1. 字 —— 中文编码 chinese character...

101035 Chinese Information Processing (中文信息处理)

Chinese NLP

Lecture 2

Characters (字): Chinese Character Encoding

• Chinese character sets

• Chinese code sets

• Basic encoding methods

• Chinese encoding methods

• International encoding methods

Chinese Character Set (中文字符集)

• Character Set (字符集)

• A character set is a collection of characters.

• {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set, { 啊 , 阿 , 唉 , …, 作 , 坐 , 座 } is a Chinese character set.

• Each character set has a name, such as ASCII or Kangxi (康熙).

• There is more than one Chinese character set; they differ over time and across regions.

• Chinese Character Set

• GB

• The GB character sets were developed in Mainland China and are based on simplified Chinese characters.

• GB is short for Guojia Biaozhun (国家标准), which means National Standard.

• Mainland China and Singapore use this standard.

• Big5

• Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.

• Regions such as Taiwan and Hong Kong use this standard.

• ISO 10646-1 and Unicode

• ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world’s character sets into one large repertoire of characters.

• As a result, Simplified and Traditional Chinese, Korean and Japanese characters can be displayed on the same HTML page.

Chinese Code Set (中文编码集)

• Code Set (编码集)

• Code set means “coded character set”.

• Encoding a character set means representing its characters as bytes or bits.

• The complete set of numerical values is called code space (denoted by CODE).

• A value in code space is called a code (or a code point).

• Encoding is a mapping of a (unique) character (in a character set) to a (unique) code (in a code space).

• Chinese Code Set

• A coded character set, denoted by $CC$, is a set of tuples, $CC = \{(c_i, code_i) \mid c_i \in C,\ code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$.

• For example, given C = {中, 文, 计, 算}, C can be encoded with different code spaces.

• If CODE1={00, 01, 10, 11}, CC1={( 中 , 00), ( 文 , 01), ( 计 , 10), ( 算 , 11)}.

• If CODE2={0000, 0001, 0010, 0011}, CC2={( 中 , 0000), ( 文 , 0001), ( 计 , 0010), ( 算 , 0011)}.

• If CODE3 = {1000, 1001, 1010, 1011}, CC3 = {( 中 , 1000), ( 文 , 1001), ( 计 , 1010), ( 算 , 1011)}.
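The definition of a coded character set can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function names `make_coded_character_set` and `is_valid_encoding` are hypothetical. It pairs characters with codes in order and tests the condition that distinct characters receive distinct codes.

```python
def make_coded_character_set(chars, code_space):
    """Pair each character with a distinct code from the code space, in order."""
    if len(set(chars)) != len(chars):
        raise ValueError("characters must be unique")
    # dict.fromkeys keeps the first occurrence of each code and preserves order.
    distinct_codes = list(dict.fromkeys(code_space))
    if len(distinct_codes) < len(chars):
        raise ValueError("code space has too few distinct codes")
    return dict(zip(chars, distinct_codes))


def is_valid_encoding(cc):
    """True if no two characters share a code (code_i != code_j for c_i != c_j)."""
    return len(set(cc.values())) == len(cc)


C = ["中", "文", "计", "算"]
CC1 = make_coded_character_set(C, ["00", "01", "10", "11"])
CC2 = make_coded_character_set(C, ["0000", "0001", "0010", "0011"])
print(CC1)
print(is_valid_encoding(CC2))
# A mapping in which two characters share the code 1001 violates the definition.
print(is_valid_encoding({"文": "1001", "计": "1001"}))
```

A dictionary is a natural fit here because it already guarantees one code per character; only uniqueness in the other direction needs an explicit check.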

In-Class Exercise

• What binary values can be assigned to these 6 characters according to this code space? (Tip: first, how many bits do you need to encode 6 rows and 6 columns?)

[Figure: two-dimensional code space (6×6), with rows numbered 1-6 and columns numbered 1-6, containing the characters 啊, 阿, 唉, 作, 坐, 座.]

Basic Encoding Method (基本编码方式)

• The mapping of a character in a character set to a code point in a code space is called code point assignment.

• An encoding method explains how a character is mapped to a code point, and how assignments are made so that a mixture of different code sets can be identified.

[Figure: four coded character sets, CC1, CC2, CC3 and CC4.]

• ASCII

• A popular encoding scheme for English characters is ASCII (American Standard Code for Information Interchange).

• It defines 128 character code points (0x00 to 0x7F): the first 32 (0x00 to 0x1F) are non-printable control codes, and the other 96 (0x20 to 0x7F) are nominally graphic characters.

• In practice only 94 of them are printable (0x21-0x7E), because 0x20 is the space character and 0x7F is the delete character.

• Values are represented with only 7 bits (the eighth, most significant bit is 0).
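The ASCII ranges above can be verified with a short count. This is an illustrative Python sketch; the function name `classify_ascii` is hypothetical and the category labels are chosen only for this example.

```python
def classify_ascii(code: int) -> str:
    """Classify a code point according to the ASCII ranges described above."""
    if not 0 <= code <= 0x7F:
        return "not ASCII"      # the eighth bit is set, or the value is negative
    if code <= 0x1F:
        return "control"        # 0x00-0x1F
    if code == 0x20:
        return "space"
    if code == 0x7F:
        return "delete"
    return "printable"          # 0x21-0x7E


printable = [c for c in range(0x80) if classify_ascii(c) == "printable"]
print(len(printable), hex(printable[0]), hex(printable[-1]))
```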

• ASCII

[Table: ASCII code chart, indexed by the high-order bits (0000-0111) and the low-order bits (0000-1111).]

Chinese Encoding Method (中文编码方式)

• It takes 2 bytes, or 16 bits, to encode all the Chinese characters.

• However, not all of these 256×256 code points are used for representing displayable characters. Generally, a 94×94 matrix is used for Chinese character encoding.

• One-byte English characters vs. two-byte Chinese characters.

• High-Bit-On Scheme

• English: 0xxxxxxx. The code range is 33-126 (<128), or 0x21-0x7E.

• Chinese: 1xxxxxxx xxxxxxxx. The code range of the first byte is 161-254 (>128), or 0xA1-0xFE.

• Chinese characters or English characters? Byte stream (hex): AB AC 41 42 43 A4 40

• 0xAB = 171 > 128, so AB AC is a Chinese character.

• 0x41 = 65 < 128, so 41 is an English character.

• 0x42 = 66 < 128, so 42 is an English character.

• 0x43 = 67 < 128, so 43 is an English character.

• 0xA4 = 164 > 128, so A4 40 is a Chinese character.
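The high-bit-on test can be sketched as a small scanner. This is a minimal Python illustration under the assumption that any byte of 128 or more starts a two-byte character; the function name `split_mixed` is hypothetical and the sketch does not validate the second byte.

```python
def split_mixed(data: bytes):
    """Split a byte stream into one-byte and two-byte characters by the high bit."""
    chunks = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            # High bit on: this byte and the next one form a two-byte character.
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            chunks.append(("Chinese", data[i:i + 2]))
            i += 2
        else:
            # High bit off: a one-byte character.
            chunks.append(("English", data[i:i + 1]))
            i += 1
    return chunks


stream = bytes.fromhex("AB AC 41 42 43 A4 40")
for kind, raw in split_mixed(stream):
    print(kind, raw.hex().upper())
```

Note that the second byte of a two-byte character is consumed without being tested, which is why A4 40 is read as one character even though 0x40 is below 128.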

• Two encoding methods are common to many character sets used in China, Taiwan and elsewhere in Asia.

• ISO-2022 and EUC

• They are locale-independent encoding methods.

• However, the exact definitions of them depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g.

• ISO-2022-CN, ISO-2022-CN-EXT, …

• EUC-CN, EUC-TW, …

• Encoding method and character set

Encoding Method: Supported Character Sets
ASCII: ASCII, GB-Roman, CNS-Roman …
ISO-2022: ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …
EUC: ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …

ASCII Encoding Range (hex, decimal)
Control characters: 00-1F (0-31)
Space character: 20 (32)
Graphic characters: 21-7E (33-126), 94 printable characters
Delete character: 7F (127)

• ISO-2022

• ISO-2022 is a modal encoding: it uses escape sequences or other special characters to switch between modes (one-byte vs. two-byte).

• It is used primarily as an information interchange code for moving text between computer systems, for example in e-mail.

• It is also often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit set.

• ISO-2022-CN (-EXT)

• ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022.

• It is achieved through the use of designators and shifts.

• A designator specifies the character set associated with a particular shift.

• There are four shifts: SI, SO, SS2 and SS3. A shift specifies how to interpret the subsequent bytes.

• Each line starts in ASCII, and ends in ASCII.

• A shifting character, indicated by SO (0x0E) or SI (0x0F) switches between one-byte and two-byte modes.

• SO (Shift Out) invokes two-byte mode (for GB 2312-80 and CNS 11643-1992 Plane 1) for the following bytes until SI (Shift In) is encountered which invokes one-byte mode.

• There must be a shift back to ASCII (by SI) before the end of the line.

• ISO-2022-CN (-EXT)

• A single shift sequence, indicated by SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode only for the following two bytes, and is typically employed for rarely used character sets.

• A designator (escape) sequence indicates which character set should be invoked when in two-byte mode, e.g.,

• 0x1B 0x24 0x29 0x41 (<ESC> $ ) A in ASCII) indicates GB 2312-80.

Shifting Type: Invoked Character Sets
SO: GB 2312-80, CNS 11643-1992 Plane 1
SS2: CNS 11643-1992 Plane 2
SS3: CNS 11643-1992 Planes 3-7

• Designator and shift

Byte stream (hex): 1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F

• 1B 24 29 47: designator; designates CNS 11643 Plane 1.
• 31 30: ASCII codes (one-byte mode).
• 0E: SO; shift to two-byte mode.
• 45 4C: CNS Plane 1 code.
• 0F: SI; shift to one-byte mode.
• 31 38: ASCII codes.
• 0E: SO; shift to two-byte mode.
• 45 4A: CNS Plane 1 code.
• 0F: SI; shift to one-byte mode.
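The designator and shift walk-through can be reproduced with a small tokenizer. This is a simplified Python sketch, not a full ISO-2022-CN decoder: it assumes only SO, SI and four-byte designators of the form ESC $ ) F occur, ignores SS2 and SS3, and does not map two-byte codes to characters. The function name `tokenize_iso2022cn` is hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def tokenize_iso2022cn(data: bytes):
    """Split a byte stream into (kind, bytes) tokens, tracking the SO/SI mode."""
    tokens = []
    two_byte = False            # every line starts in one-byte (ASCII) mode
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC and data[i + 1:i + 3] == b"$)":
            # Designator: ESC $ ) followed by one final byte naming the set.
            tokens.append(("designator", data[i:i + 4]))
            i += 4
        elif b == SO:
            two_byte = True
            tokens.append(("SO", data[i:i + 1]))
            i += 1
        elif b == SI:
            two_byte = False
            tokens.append(("SI", data[i:i + 1]))
            i += 1
        elif two_byte:
            tokens.append(("two-byte", data[i:i + 2]))
            i += 2
        else:
            tokens.append(("ascii", data[i:i + 1]))
            i += 1
    return tokens


stream = bytes.fromhex("1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F")
for kind, raw in tokenize_iso2022cn(stream):
    print(f"{kind:10s} {raw.hex(' ').upper()}")
```

The single `two_byte` flag is what makes the encoding modal: the same byte value 0x45 would be read as the letter E in one-byte mode and as half of a character in two-byte mode.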


• EUC

• EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese.

• Although U represents Unix, this encoding is commonly used on other platforms, such as Windows and Mac OS.

• The full definition of EUC encoding consists of four code sets.

• Code set 0 is always set to the ASCII character set or a country’s own version thereof.

• The remaining code sets are defined as a set of variants from which each country can select.

• EUC-CN

• EUC-CN is a locale-specific implementation of EUC.

Code set 0: byte range 21-7E (33-126)
Code set 1: first byte range A1-FE (161-254); second byte range A1-FE (161-254)

• The code range of both the first byte and the second byte is 161-254 (>128), or 0xA1-0xFE.

• EUC-CN (GB): 1xxxxxxx 1xxxxxxx, giving a 94×94 matrix.
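The 94×94 matrix follows directly from the byte range A1-FE. The Python sketch below is an illustration only: the function name `euc_cn_to_row_cell` is hypothetical, and the row/cell numbering (subtracting 0xA0 from each byte) is an assumption about how the matrix is indexed that the slides do not state. It relies on Python's built-in `gb2312` codec.

```python
def euc_cn_to_row_cell(pair: bytes):
    """Map an EUC-CN code set 1 byte pair to 1-based (row, cell) matrix indices."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("not an EUC-CN code set 1 byte pair")
    return pair[0] - 0xA0, pair[1] - 0xA0


values_per_byte = 0xFE - 0xA1 + 1
print(values_per_byte, values_per_byte ** 2)   # size of one axis, size of the matrix

encoded = "汉".encode("gb2312")
print(encoded.hex().upper(), euc_cn_to_row_cell(encoded))
```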


• EUC-TW

• EUC-TW is by far the most complex implementation of EUC encoding in terms of how many characters it encodes, i.e. close to 50,000 characters.

• ISO-2022 vs EUC

• EUC encoding is closely related to ISO-2022.

• In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.

Locale (Character Set): text, ISO-2022-CN code, EUC-CN or EUC-TW code set 1 code
China (GB 2312-80): 汉字, 3A3A 5756, BABA D7D6
Taiwan (CNS 11643-1992): 漢字, 6947 4773, E9C7 C7F3
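The code values in the comparison are consistent with a simple relationship: each EUC byte equals the seven-bit ISO-2022 byte with its high bit set. The Python sketch below illustrates this for the two-byte codes only; the function names are hypothetical, and escape sequences, shifts and EUC-TW code set 2 are out of scope.

```python
def seven_bit_to_euc(code: bytes) -> bytes:
    """Set the eighth bit of every byte (ISO-2022 two-byte code -> EUC code set 1)."""
    return bytes(b | 0x80 for b in code)


def euc_to_seven_bit(code: bytes) -> bytes:
    """Clear the eighth bit of every byte (EUC code set 1 -> ISO-2022 two-byte code)."""
    return bytes(b & 0x7F for b in code)


gb = seven_bit_to_euc(bytes.fromhex("3A3A5756"))
cns = seven_bit_to_euc(bytes.fromhex("69474773"))
print(gb.hex().upper(), cns.hex().upper())

# Cross-check the GB 2312 case against Python's built-in codec.
print(gb == "汉字".encode("gb2312"))
```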

• GBK

• GBK encoding is implemented as the internal code for the Chinese (PRC) version of Microsoft’s Windows and IBM’s OS/2.

• The GBK character set contains 21,886 symbols and Chinese characters.

ASCII or GB-Roman: byte range 21-7E (33-126)
GBK: first byte range 81-FE (129-254); second byte ranges 40-7E, 80-FE (64-126, 128-254)
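The GBK byte ranges translate into a direct range check. This Python sketch is illustrative; the function name `is_gbk_two_byte` is hypothetical, and it tests only whether a byte pair falls inside the ranges listed above, not whether a character is actually assigned there.

```python
def is_gbk_two_byte(pair: bytes) -> bool:
    """True if the pair lies in the GBK ranges: 81-FE, then 40-7E or 80-FE."""
    if len(pair) != 2:
        return False
    first, second = pair
    return 0x81 <= first <= 0xFE and (
        0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE
    )


for text in ("汉", "字", "漢"):
    raw = text.encode("gbk")    # Python's built-in GBK codec
    print(text, raw.hex().upper(), is_gbk_two_byte(raw))

# 0x7F is excluded as a second byte.
print(is_gbk_two_byte(bytes.fromhex("817F")))
```

Unlike EUC-CN, the second byte may be below 0x80, so a scanner cannot recognize a GBK second byte by its high bit alone.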

• GBK

• One of the design principles of GBK is that it should be fully compatible with GB 2312 while extending it to cover the 20,902 Chinese characters included in the first version of Unicode.

• Big5

• The Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference is that there is an additional encoding block.

ASCII or CNS-Roman: byte range 21-7E (33-126)
Big5: first byte range A1-FE (161-254); second byte ranges 40-7E, A1-FE (64-126, 161-254)


International Encoding Method (国际编码方式)

• Unicode and ISO 10646-1.

• The goal is a multilingual character set that combines the majority of the world’s writing systems and character sets into a Universal Character Set (UCS), or Unicode.

Character set: Unicode and ISO 10646-1
Encoding methods: UCS-2, UCS-4, UTF-7, UTF-8, UTF-16


• BMP

• The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

• UCS-2

• A 16-bit representation can encode up to 65,536 (= 2^16) unique code points.

• It allocates the entire encoding space for characters (0x0000-0xFFFF).

• UCS-2 and Unicode encodings are identical for most Chinese characters.

UCS-2: first byte range 00-FF (0-255); second byte range 00-FF (0-255)

• UCS-4

• It is a four-byte (actually, 31-bit) representation which can encode up to 2,147,483,648 (= 2^31) code points.

• It allocates the entire encoding space for characters (0x00000000-0x7FFFFFFF).

UCS-4: first byte range 00-7F (0-127); second, third and fourth byte ranges 00-FF (0-255)

• UCS-2 vs UCS-4

• UCS-2 (16-bit): 0000 … FFFF, 65,536 code points. It can only encode the BMP.

• UCS-4 (31-bit): 00000000 … 0000FFFF covers the BMP and 00010000 … 7FFFFFFF lies beyond it, for 2,147,483,648 code points in total. It is sufficient to encode all 17 planes of the Unicode set.

• Unicode (17 planes): 17×256×256 = 1,114,112 characters.

• UTF-16

• In essence, UTF-16 encodes the BMP exactly as UCS-2 (16-bit) encoding does, so the two are compatible there.

• It also gives access to the next 16 planes, which are otherwise accessible only through UCS-4 (32-bit) encoding.

• The surrogates area is defined for UTF-16 to allow expansion beyond the 16-bit code space.

• UCS-2 vs UCS-4 vs UTF-16

• UCS-2: 0000 … FFFF, 65,536 code points. It can only encode the BMP (Plane 0).

• UCS-4: 00000000 … 0000FFFF covers the BMP and 00010000 … 7FFFFFFF lies beyond it, for 2,147,483,648 code points in total. It is sufficient to encode all 17 planes (Plane 0 to Plane 16) of the Unicode set.

• Unicode scalar values for Planes 1-16 are 00010000 … 0010FFFF, denoted U+10000 … U+10FFFF.

• UTF-16 encodes these with surrogate pairs, from D800 DC00 to DBFF DFFF.
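The two end points of the surrogate range can be derived from the scalar values. The Python sketch below uses the standard UTF-16 surrogate arithmetic, which the slides do not spell out; the function name `to_surrogates` is hypothetical.

```python
def to_surrogates(cp: int):
    """Convert a scalar value in U+10000..U+10FFFF to a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in planes 1-16")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against Python's built-in UTF-16 encoder.
print(chr(0x10000).encode("utf-16-be").hex().upper())
```

Two 10-bit halves give 2^20 = 1,048,576 code points, which is exactly the size of the 16 planes above the BMP.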

• Base64

• Base64 uses 64 characters: the upper-case and lower-case Roman letters (A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.


• Base64

• Step 1: Base64 takes every three bytes (each consisting of eight bits) and converts them into four six-bit groups.

• Step 2: Each six-bit group is then converted into a character in the Base64 character set.

• Step 3: If the size of the original data in bytes is not a multiple of three, we append enough bytes with a value of “0” to create a 3-byte group. Each six-bit group that consists entirely of appended bits is written as the Base64 padding character “=”.

Example: the four bytes 25 B4 69 41 (hex), padded with two zero bytes.
Bytes: 00100101 10110100 01101001 | 01000001 00000000 00000000
Six-bit groups: 001001 011011 010001 101001 | 010000 010000 000000 000000
Base64: JbRp QQ==
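The three steps can be written out by hand and compared with the standard library. This is a minimal Python sketch for illustration; the function name `base64_encode` is hypothetical, and in practice `base64.b64encode` would be used directly.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_encode(data: bytes) -> str:
    """Encode bytes as Base64 following the three steps described above."""
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        missing = 3 - len(group)
        # Step 3: pad an incomplete group with zero bytes.
        n = int.from_bytes(group + b"\x00" * missing, "big")
        # Step 1 and 2: cut 24 bits into four six-bit values and look each one up.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Groups made only of padding bits are written as "=".
        if missing:
            chars[-missing:] = "=" * missing
        out.append("".join(chars))
    return "".join(out)


data = bytes.fromhex("25B46941")
print(base64_encode(data))
print(base64.b64encode(data).decode("ascii"))
```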

In-Class Exercise

• What is the result of applying Base64 to the three characters with hex codes BEAE, CED3 and B7F5 (a Japanese name, 小林剑)? (Please first convert the hex to binary.)

• UTF-7

• UTF-7 uses the same character set as Base64.

• UTF-7 is different from Base64 in that:

• The “padding” character is not necessary.

• The Base64-like transformation is applied only to specific characters.

• Those characters that require Base64 transformation according to UTF-7 encoding begin with a “plus” character (+, 0x2B) and end with a “hyphen” character (-, 0x2D).

Character string: My 河豚
UCS-2 encoding: 004D 0079 0020 6CB3 8C5A
UTF-7 encoding: M (4D), y (79), space (20) as ASCII codes, followed by +bLOMWg-
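The shifted part of the example can be reproduced from the UCS-2 values. The Python sketch below is a simplification: the function name `utf7_shifted` is hypothetical, and it assumes the whole argument must be Base64-transformed, whereas a real UTF-7 encoder decides this character by character. The last line cross-checks the result with Python's built-in `utf-7` decoder.

```python
import base64


def utf7_shifted(text: str) -> bytes:
    """Wrap text in a UTF-7 shifted sequence: '+', unpadded Base64, '-'."""
    raw = text.encode("utf-16-be")                  # 16-bit units, big-endian
    body = base64.b64encode(raw).rstrip(b"=")       # UTF-7 drops the padding
    return b"+" + body + b"-"


encoded = b"My " + utf7_shifted("河豚")
print(encoded)
print(encoded.decode("utf-7"))
```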

• UTF-8

• UTF-8 encoding was developed as a way to represent Unicode text as a stream of eight-bit units (bytes), rather than as 16-bit units.

• It converts UCS-2 into a mixed one- through three-byte encoding.

• It converts UCS-4 into a mixed one- through six-byte encoding.

• It converts UTF-16 into a mixed one- through four-byte encoding.

• It is therefore an eight-bit, variable-length encoding.

• UTF-8 is the de facto standard encoding for interchange of Unicode text.

• UTF-8

• Encoding Templates

• For all but the ASCII-compatible range, the number of first-byte high-order bits set to “1” indicates the byte length.

• The templates are filled starting from the rightmost bits.

UCS-2 Range: UTF-8 Bit Arrays
0000-007F: 0xxxxxxx
0080-07FF: 110xxxxx 10xxxxxx
0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx

Unicode Range (+): UTF-8 Bit Arrays
0001 0000 – 001F FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

UCS-4 Range (+): UTF-8 Bit Arrays
0020 0000 – 03FF FFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 – 7FFF FFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

• UTF-8 Example: Convert the Unicode character “茶” into UTF-8 code.

• Step 1: The Unicode value of 茶 (tea) is 8336 (hex), which falls in the range 0800-FFFF, so we need 3 bytes.

• Step 2: The binary form of hexadecimal 8336 is 1000 001100 110110, grouped here as 4 + 6 + 6 bits.

• Step 3: Fill the empty slots of the three-byte template with the binary value:

Template: 1110xxxx 10xxxxxx 10xxxxxx
Filled: 11101000 10001100 10110110

• Step 4: The UTF-8 code value is thus E8 8C B6.
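The template-filling procedure can be sketched in a few lines and checked against a built-in encoder. This Python sketch covers only the one- through four-byte templates (up to U+10FFFF) and, as a simplification, does not reject surrogate values; the function name `utf8_encode` is hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the UTF-8 bit templates from the right."""
    if cp < 0x80:
        return bytes([cp])                              # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),                 # 110xxxxx
                      0x80 | (cp & 0x3F)])              # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),                # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),        # 10xxxxxx
                      0x80 | (cp & 0x3F)])              # 10xxxxxx
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),                # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


tea = utf8_encode(0x8336)
print(" ".join(f"{b:08b}" for b in tea))
print(tea.hex(" ").upper())
print(tea == "茶".encode("utf-8"))
```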

Wrap-Up

• Chinese character sets

• Chinese code sets

• Basic encoding methods: ASCII

• Chinese encoding methods: ISO-2022, EUC, GBK, Big5

• International encoding methods: Unicode, ISO 10646-1, UCS-2, UCS-4, UTF-16, Base64, UTF-7, UTF-8
