1
101035 Chinese Information Processing
Chinese NLP
Lecture 2
2
Characters: Chinese Character Encoding
• Chinese character set
• Chinese code set
• Basic encoding method
• Chinese encoding methods
• International encoding methods
3
Chinese Character Set
• Character Set
• A character set is a collection of characters.
• {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set; {啊, 阿, 唉, …, 作, 坐, 座} is a Chinese character set.
• Each character set has a name, such as ASCII or KANG XI (康熙).
• There is more than one Chinese character set; they differ over time and across regions.
4
• Chinese Character Set
• GB
• The GB character sets were developed in Mainland China and are based on simplified Chinese characters.
• GB is short for Guojia Biaozhun (国家标准), which means National Standard.
• This standard is used in Mainland China and Singapore.
• Big5
• Big5 is the most widely implemented character set standard in Taiwan and covers traditional Chinese characters.
• This standard is used in Taiwan and Hong Kong.
• ISO 10646-1 and Unicode
• ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world’s character sets into one large repertoire of characters.
• As a result, Simplified/Traditional Chinese, Korean and Japanese characters can be displayed on the same HTML page.
5
Chinese Code Set
• Code Set
• Code set means “coded character set”.
• Encoding a character set means representing its characters as bytes or bits.
• The complete set of numerical values is called code space (denoted by CODE).
• A value in code space is called a code (or a code point).
• Encoding is a mapping of a (unique) character (in a character set) to a (unique) code (in a code space).
6
• Chinese Code Set
• A coded character set, denoted by CC, is a set of tuples, $CC = \{(c_i, code_i) \mid c_i \in C,\ code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$.
• For example, let C = {中, 文, 计, 算}. C can be encoded with different code spaces.
• If CODE1 = {00, 01, 10, 11}, CC1 = {(中, 00), (文, 01), (计, 10), (算, 11)}.
• If CODE2 = {0000, 0001, 0010, 0011}, CC2 = {(中, 0000), (文, 0001), (计, 0010), (算, 0011)}.
• If CODE3 = {1000, 1001, 1010, 1011}, CC3 = {(中, 1000), (文, 1001), (计, 1010), (算, 1011)}.
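The definition of a coded character set can be sketched in a few lines of Python. This is only an illustration: the function name `make_coded_character_set` is hypothetical, and pairing characters with codes in list order is an assumption made for the example.

```python
def make_coded_character_set(chars, code_space):
    """Pair each character with one code, enforcing the one-to-one rule."""
    if len(set(chars)) != len(chars):
        raise ValueError("character set contains duplicates")
    if len(set(code_space)) != len(code_space):
        raise ValueError("code space contains duplicate codes")
    if len(code_space) < len(chars):
        raise ValueError("code space is too small for the character set")
    # Assumption for this sketch: codes are assigned in list order.
    return dict(zip(chars, code_space))


C = ["中", "文", "计", "算"]
CC1 = make_coded_character_set(C, ["00", "01", "10", "11"])
CC2 = make_coded_character_set(C, ["0000", "0001", "0010", "0011"])
print(CC1)
print(CC2)
```

A code space that repeats a value is rejected, because two different characters would otherwise share one code.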
7
In-Class Exercise
• What binary values can be assigned to these 6 characters according to this code space? (Tip: first, how many bits do you need to encode 6 rows and 6 columns?)
[Figure: two-dimensional code space (6×6) with rows 1-6 and columns 1-6, containing the characters 啊, 阿, 唉, 作, 坐, 座]
8
Basic Encoding Method
• The mapping of a character in a character set to a code point in code space is called code point assignment.
• An encoding method explains how a character is mapped to a code point and also how assignments are made so that a mixture of different code sets can be told apart.
[Figure: four code sets, CC1, CC2, CC3 and CC4]
9
• ASCII
• A popular encoding scheme for English characters is called ASCII (American Standard Code for Information Interchange).
• It defines 128 code points (0x00 to 0x7F): the first 32 (0x00 to 0x1F) are non-printable control codes, and the other 96 (0x20 to 0x7F) are nominally graphic characters.
• In practice only 94 of them are printable (0x21-0x7E), because 0x20 is the space character and 0x7F is the delete character.
• Values are represented with only 7 bits (the high-order bit of the byte is 0).
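The ASCII ranges above can be checked with a short Python sketch. The function name `classify_ascii` and the four category labels are chosen for this illustration only.

```python
def classify_ascii(code):
    """Return the ASCII category of a 7-bit code value."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F:
        return "control"
    if code == 0x20:
        return "space"
    if code == 0x7F:
        return "delete"
    return "graphic"  # 0x21-0x7E


counts = {}
for code in range(0x80):
    kind = classify_ascii(code)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)  # {'control': 32, 'space': 1, 'graphic': 94, 'delete': 1}
```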
10
• ASCII
[Figure: ASCII code table indexed by high bits (0000-0111) and low bits (0000-1111)]
11
Chinese Encoding Method
• One byte offers only 256 values, so Chinese characters are encoded with 2 bytes, or 16 bits.
• However, not all of the 256×256 code points are used for displayable characters. Generally, a 94×94 matrix (8,836 code points) is used as the Chinese character encoding matrix.
12
• One-byte English characters vs. two-byte Chinese characters.
• High-Bit-On Scheme
English: 0xxxxxxx. The code range is 33-126 (<128), or 0x21-0x7E.
Chinese: 1xxxxxxx xxxxxxxx. The code range of the first byte is 161-254 (>128), or 0xA1-0xFE.
13
• Chinese characters or English characters?
Byte stream: AB AC 41 42 43 A4 40
AB AC: 0xAB = 171 > 128, so ABAC is a Chinese character.
41: 0x41 = 65 < 128, so 41 is an English character.
42: 0x42 = 66 < 128, so 42 is an English character.
43: 0x43 = 67 < 128, so 43 is an English character.
A4 40: 0xA4 = 164 > 128, so A440 is a Chinese character.
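The high-bit test used in this example can be written as a small Python routine. It is a minimal sketch that assumes every byte with the high bit set starts a two-byte character; the name `split_mixed` is hypothetical.

```python
def split_mixed(data):
    """Split a byte stream into one-byte and two-byte characters."""
    units = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:  # high bit on: first byte of a two-byte character
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            units.append(("Chinese", data[i:i + 2]))
            i += 2
        else:  # high bit off: one-byte character
            units.append(("English", data[i:i + 1]))
            i += 1
    return units


for kind, unit in split_mixed(bytes.fromhex("ABAC414243A440")):
    print(unit.hex().upper(), kind)
```

Note that the second byte of a two-byte character is consumed without being tested, which is why A4 40 is read as one character even though 0x40 is below 128.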
14
• Two encoding methods are common to many character sets in China, Taiwan and other Asian regions.
• ISO-2022 and EUC
• They are locale-independent encoding methods.
• However, the exact definitions of them depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g.
• ISO-2022-CN, ISO-2022-CN-EXT, …
• EUC-CN, EUC-TW, …
15
• Encoding method and character set
Encoding Method    Supported Character Sets
ISO-2022           ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …
EUC                ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …
ASCII              ASCII, GB-Roman, CNS-Roman …

ASCII Encoding Range
Hex      Decimal    Meaning
00-1F    0-31       Control characters
20       32         Space character
21-7E    33-126     Graphic characters (94 printable characters)
7F       127        Delete character
16
• ISO-2022
• ISO-2022 is a modal encoding: it uses escape sequences or other special characters to switch between modes (one-byte vs. two-byte).
• It is used primarily as an information interchange code for moving text between computer systems, for example in email.
• It is also often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit set.
17
• ISO-2022-CN (-EXT)
• ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022.
• It is achieved through the use of designators and shifts.
• A designator specifies the character set associated with a particular shift.
• There are four shifts: SI, SO, SS2 and SS3. A shift specifies how to interpret the subsequent bytes.
• Each line starts in ASCII and ends in ASCII.
• A shifting character, SO (0x0E) or SI (0x0F), switches between one-byte and two-byte modes.
• SO (Shift Out) invokes two-byte mode (for GB 2312-80 and CNS 11643-1992 Plane 1) for the following bytes until SI (Shift In) is encountered, which invokes one-byte mode again.
• There must be a shift back to ASCII (by SI) before the end of the line.
18
• ISO-2022-CN (-EXT)
• A single shift sequence, indicated by SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode only for the following two bytes, and is typically employed for rarely-used character sets.
• A designator (escape) sequence indicates what character set should be invoked when in two-byte mode, e.g.,
• 0x1B 0x24 0x29 0x41 (<ESC> $ ) A in ASCII) indicates GB 2312-80.
Shifting Type    Invoked Character Sets
SO               GB 2312-80, CNS 11643-1992 Plane 1
SS2              CNS 11643-1992 Plane 2
SS3              CNS 11643-1992 Planes 3-7
19
• Designator and shift
Byte stream: 1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F
1B 24 29 47    Designate CNS 11643 Plane 1
31 30          ASCII code (one-byte mode)
0E             SO: shift to two-byte mode
45 4C          CNS Plane 1 code
0F             SI: shift to one-byte mode
31 38          ASCII code
0E             SO: shift to two-byte mode
45 4A          CNS Plane 1 code
0F             SI: shift to one-byte mode
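The byte stream in this example can be walked through with a small tokenizer. The sketch below is hypothetical and deliberately partial: it knows only the two SO designators that appear in these slides, ignores SS2 and SS3, and does not map the two-byte codes to actual characters.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Designators for the SO set, as given in the slides (ESC $ ) A and ESC $ ) G).
SO_DESIGNATORS = {
    b"\x1b$)A": "GB 2312-80",
    b"\x1b$)G": "CNS 11643-1992 Plane 1",
}


def tokenize(data):
    """Label each unit of an ISO-2022-CN byte stream (SO/SI subset only)."""
    tokens = []
    two_byte = False
    charset = None
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            seq = data[i:i + 4]
            if seq not in SO_DESIGNATORS:
                raise ValueError("unsupported escape sequence")
            charset = SO_DESIGNATORS[seq]
            tokens.append(("designate", charset))
            i += 4
        elif byte == SO:
            if charset is None:
                raise ValueError("SO before any designator")
            two_byte = True
            tokens.append(("SO", "two-byte mode"))
            i += 1
        elif byte == SI:
            two_byte = False
            tokens.append(("SI", "one-byte mode"))
            i += 1
        elif two_byte:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte code")
            tokens.append((charset, data[i:i + 2].hex().upper()))
            i += 2
        else:
            tokens.append(("ASCII", chr(byte)))
            i += 1
    return tokens


tokens = tokenize(bytes.fromhex("1B24294731300E454C0F31380E454A0F"))
for token in tokens:
    print(token)
```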
20
• EUC
• EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese.
• Although the U stands for Unix, this encoding is commonly used on other platforms as well, such as Windows and Mac OS.
• The full definition of EUC encoding consists of four code sets.
• Code set 0 is always set to the ASCII character set or a country’s own version thereof.
• The remaining code sets are defined as a set of variants from which each country can select.
21
• EUC-CN
• EUC-CN is a locale-specific implementation of EUC.
Code set      Byte           Hex range    Decimal range
Code set 0    single byte    21-7E        33-126
Code set 1    first byte     A1-FE        161-254
Code set 1    second byte    A1-FE        161-254
The code range of both the first byte and the second byte is 161-254 (>128), or 0xA1-0xFE.
EUC-CN (GB): 1xxxxxxx 1xxxxxxx, which gives a 94×94 matrix.
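As an illustration of the 94×94 layout, the sketch below assumes the matrix is indexed by a row and a cell number, each from 1 to 94, and that EUC-CN adds 0xA0 to each index. The slides give only the byte ranges, so this indexing and the helper names are assumptions. Python's built-in `gb2312` codec is used as a cross-check.

```python
def euc_cn_bytes(row, cell):
    """Map a (row, cell) position in the 94x94 matrix to two EUC-CN bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1-94")
    return bytes([0xA0 + row, 0xA0 + cell])


def euc_cn_position(code):
    """Inverse mapping: two EUC-CN bytes back to (row, cell)."""
    first, second = code
    if not (0xA1 <= first <= 0xFE and 0xA1 <= second <= 0xFE):
        raise ValueError("bytes outside the EUC-CN code set 1 range")
    return first - 0xA0, second - 0xA0


code = euc_cn_bytes(26, 26)
print(code.hex().upper(), code.decode("gb2312"))
print(euc_cn_position("啊".encode("gb2312")))
```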
22
• EUC-TW
• EUC-TW is by far the most complex implementation of EUC encoding in terms of how many characters it encodes, i.e. close to 50,000 characters.
23
• ISO-2022 vs EUC
• EUC encoding is closely related to ISO-2022.
• In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.
Locale (Character Set)        Text    ISO-2022-CN    EUC-CN or EUC-TW Code Set 1
China (GB 2312-80)            汉字    3A3A 5756      BABA D7D6
Taiwan (CNS 11643-1992)       漢字    6947 4773      E9C7 C7F3
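The table values suggest that, for two-byte code set 1, the EUC form is the ISO-2022 form with the high bit of every byte turned on. The sketch below checks that reading against the table; the function names are hypothetical, and the conversion is not meant to cover the other EUC code sets.

```python
def iso2022_to_euc(code):
    """Turn on the high bit of each byte (7-bit code to EUC code set 1)."""
    return bytes(b | 0x80 for b in code)


def euc_to_iso2022(code):
    """Turn off the high bit of each byte (EUC code set 1 to 7-bit code)."""
    return bytes(b & 0x7F for b in code)


gb = iso2022_to_euc(bytes.fromhex("3A3A5756"))
print(gb.hex().upper())     # BABAD7D6
print(gb.decode("gb2312"))  # 汉字

cns = iso2022_to_euc(bytes.fromhex("69474773"))
print(cns.hex().upper())    # E9C7C7F3
```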
24
• GBK
• GBK encoding is implemented as the internal code for the Chinese (PRC) version of Microsoft’s Windows and IBM’s OS/2.
• GBK character set contains 21,886 symbols and Chinese characters.
Code set              Byte           Hex range       Decimal range
ASCII or GB-Roman     single byte    21-7E           33-126
GBK                   first byte     81-FE           129-254
GBK                   second byte    40-7E, 80-FE    64-126, 128-254
25
• GBK
• One of the design principles of GBK is that it should be fully compatible with GB 2312 while extending coverage to the 20,902 Chinese characters in the first version of Unicode.
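Both GBK properties, the byte ranges and the compatibility with GB 2312, can be probed with Python's built-in `gbk` and `gb2312` codecs. This is a rough sketch: `is_valid_gbk_pair` is a hypothetical helper that tests only the byte ranges, not whether a code point is actually assigned.

```python
def is_valid_gbk_pair(first, second):
    """Check a two-byte code against the GBK byte ranges."""
    first_ok = 0x81 <= first <= 0xFE
    second_ok = 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE
    return first_ok and second_ok


# A simplified character keeps its GB 2312 code in GBK.
simplified = "汉"
print(simplified.encode("gbk") == simplified.encode("gb2312"))

# A traditional character is covered by GBK but not by GB 2312.
traditional = "漢"
encoded = traditional.encode("gbk")
print(encoded.hex().upper(), is_valid_gbk_pair(*encoded))
try:
    traditional.encode("gb2312")
except UnicodeEncodeError:
    print("not in GB 2312")
```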
26
• Big5
• Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference being that there is an additional encoding block.
Code set               Byte           Hex range       Decimal range
ASCII or CNS-Roman     single byte    21-7E           33-126
Big5                   first byte     A1-FE           161-254
Big5                   second byte    40-7E, A1-FE    64-126, 161-254
27
• Big5
• Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.
28
International Encoding Method
• Unicode and ISO 10646-1.
• We need to develop a multilingual character set combining the majority of the world’s writing systems and character sets into a Universal Character Set (UCS) or Unicode.
Character Set: Unicode and ISO 10646-1
Encoding Methods: UCS-2, UCS-4, UTF-7, UTF-8, UTF-16
29
• BMP
• The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.
30
• UCS-2
• A 16-bit representation can encode up to 65,536 (= $2^{16}$) unique code points.
• It allocates the entire encoding space for characters (0x0000-0xFFFF).
• UCS-2 and Unicode encodings are identical for most Chinese characters.
UCS-2 first byte range: 00-FF (0-255)
UCS-2 second byte range: 00-FF (0-255)
31
• UCS-4
• It is a four-byte (actually 31-bit) representation that can encode up to 2,147,483,648 (= $2^{31}$) code points (0x00000000-0x7FFFFFFF).
• It allocates the entire encoding space for characters (0x00000000-0x7FFFFFFF).
UCS-4 first byte range: 00-7F (0-127)
UCS-4 second, third and fourth byte ranges: 00-FF (0-255)
32
• UCS-2 vs UCS-4
UCS-2 (16-bit): 0000-FFFF, 65,536 code points. Can only encode the BMP.
UCS-4 (31-bit): 00000000-0000FFFF (the BMP) plus 00010000-7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
Unicode (17 planes): 17 × 256 × 256 = 1,114,112 characters.
33
• UTF-16
• In essence, UTF-16 encodes the BMP exactly as UCS-2 (16-bit) encoding does, so the two are compatible there.
• But it also gives access to the next 16 planes, which are otherwise only reachable through UCS-4 (32-bit) encoding.
• The surrogates area is defined in UTF-16 to allow expansion beyond the 16-bit code space.
34
• UCS-2 vs UCS-4 vs UTF-16
UCS-2 (Unicode): 0000-FFFF, 65,536 code points. Can only encode the BMP.
UCS-4: 00000000-0000FFFF plus 00010000-7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.
UTF-16: Plane 0 is encoded directly. Planes 1 to 16 (scalar values 00010000-0010FFFF, denoted U+10000 to U+10FFFF) are encoded with surrogates, from D800 DC00 to DBFF DFFF.
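The mapping from a scalar value above the BMP to a surrogate pair can be sketched as follows. The arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves) is the standard UTF-16 rule rather than something spelled out on the slide, and the function name is hypothetical. Python's `utf-16-be` codec serves as a cross-check.

```python
def to_surrogates(code_point):
    """Split a scalar value in U+10000..U+10FFFF into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in planes 1-16")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
    # Cross-check against Python's own UTF-16 encoder.
    print(chr(cp).encode("utf-16-be").hex().upper())
```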
35
• Base64
• 64 characters are used: the upper-case and lower-case Roman letters (A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.
36
• Base64
37
• Base64
• Step 1: Base64 takes every three bytes (eight bits each) and converts them into four six-bit segments.
• Step 2: Each six-bit segment is then converted into a character in the Base64 character set.
• Step 3: If the size of the original data in bytes is not a multiple of three, we append enough bytes with a value of “0” to create a 3-byte group. Each output position made up entirely of padding bits is written as the Base64 padding character “=”.
Example 1: 00100101 10110100 01101001 → 001001 011011 010001 101001 → JbRp
Example 2: 01000001 (padded with 00000000 00000000) → 010000 010000 000000 000000 → QQ==
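The three steps can be followed in a hand-written encoder, shown here as a sketch for study rather than a replacement for Python's `base64` module. The name `base64_encode` is hypothetical; the standard library function is used only to confirm the results.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def base64_encode(data):
    """Encode bytes with the three steps described above."""
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        missing = 3 - len(group)
        # Step 3: pad a short final group with zero bytes.
        number = int.from_bytes(group + b"\x00" * missing, "big")
        # Step 1: split the 24 bits into four six-bit segments.
        sextets = [(number >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        # Step 2: map each segment to a Base64 character.
        chars = [ALPHABET[s] for s in sextets]
        # Positions that hold only padding bits become "=".
        if missing:
            chars[4 - missing:] = ["="] * missing
        out.append("".join(chars))
    return "".join(out)


print(base64_encode(bytes([0x25, 0xB4, 0x69])))  # JbRp
print(base64_encode(b"A"))                       # QQ==
print(base64.b64encode(b"A").decode("ascii"))    # QQ==
```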
38
In-Class Exercise
• What is the result of applying Base64 to the three characters with hex codes BEAE, CED3 and B7F5 (a Japanese name, 小林剑)?
(First convert the hex values to binary.)
39
• UTF-7
• UTF-7 uses the same character set as Base64.
• UTF-7 differs from Base64 in that:
• The “padding” character is not necessary.
• The Base64-like transformation is applied only to specific characters.
• Characters that require Base64 transformation under UTF-7 are wrapped in a sequence that begins with a “plus” character (+, 0x2B) and ends with a “hyphen” character (-, 0x2D).
Character String    M         y         (space)    河      豚
UCS-2 Encoding      004D      0079      0020       6CB3    8C5A
UTF-7 Encoding      M (4D)    y (79)    (20)       +bLOMWg-
The first three characters are written as their ASCII codes.
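The shifted part of this example can be reproduced by Base64-encoding the 16-bit code units and dropping the padding. The sketch below assumes all characters are in the BMP, so that their UTF-16 and UCS-2 forms coincide; `utf7_shift_sequence` is a hypothetical helper, and Python's built-in `utf-7` codec is shown alongside for comparison.

```python
import base64


def utf7_shift_sequence(text):
    """Wrap the Base64 form of the 16-bit code units in '+' and '-'."""
    raw = text.encode("utf-16-be")  # same bytes as UCS-2 for BMP characters
    body = base64.b64encode(raw).decode("ascii").rstrip("=")  # no padding
    return "+" + body + "-"


print(utf7_shift_sequence("河豚"))          # +bLOMWg-
print(b"My +bLOMWg-".decode("utf-7"))      # My 河豚
print("My 河豚".encode("utf-7"))
```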
40
• UTF-8
• UTF-8 encoding was developed as a way to represent Unicode text as a stream of eight-bit units (bytes), rather than as 16-bit units.
• It converts UCS-2 into a mixed one- through three-byte encoding.
• It converts UCS-4 into a mixed one- through six-byte encoding.
• It converts UTF-16 into a mixed one- through four-byte encoding.
• It is therefore an eight-bit and variable-length encoding.
• UTF-8 is the de facto standard encoding for interchange of Unicode text.
41
• UTF-8
• Encoding Templates
• For all but the ASCII-compatible range, the number of first-byte high-order bits set to “1” indicates the byte length.
• Fill the templates starting from the rightmost bits.
Range                    UTF-8 Bit Arrays
UCS-2 range:
0000-007F                0xxxxxxx
0080-07FF                110xxxxx 10xxxxxx
0800-FFFF                1110xxxx 10xxxxxx 10xxxxxx
Unicode range (+):
0001 0000 - 001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
UCS-4 range (+):
0020 0000 - 03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
42
• UTF-8 Example: convert the Unicode character “茶” into UTF-8 code.
• Step 1: the Unicode value of 茶 (tea) is 8336 (hexadecimal), which falls in the range 0800-FFFF. So we need 3 bytes.
• Step 2: the binary form of hexadecimal 8336 is 1000 0011 0011 0110, regrouped for the template as 1000 001100 110110.
• Step 3: fill the empty slots of the three-byte template with the binary value:
1110xxxx 10xxxxxx 10xxxxxx
11101000 10001100 10110110
• Step 4: the UTF-8 code value is thus E8 8C B6.
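The template-filling procedure can be expressed directly with shifts and masks. The sketch below covers only the one- to four-byte templates, since Python's own code points stop at U+10FFFF, and it does not reject surrogate values; the function name `utf8_encode` is hypothetical.

```python
def utf8_encode(code_point):
    """Fill the UTF-8 templates from the rightmost bits of the code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range for this sketch")
    if code_point <= 0x7F:       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x8336)
print(encoded.hex(" ").upper())        # E8 8C B6
print(encoded == "茶".encode("utf-8"))  # True
```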
43
Wrap-Up
• Chinese character set
• Chinese code set
• Basic encoding method
• ASCII
• Chinese encoding methods
• ISO-2022
• EUC
• GBK
• Big5
• International encoding methods
• Unicode
• ISO 10646-1
• UCS-2
• UCS-4
• UTF-16
• Base64
• UTF-7
• UTF-8