101035 中文信息处理 chinese nlp lecture 2 1. 字 —— 中文编码 chinese character...

Upload: emerald-andrews

Post on 17-Dec-2015


Page 1: 101035 中文信息处理 Chinese NLP Lecture 2 1. 字 —— 中文编码 Chinese Character Encoding 中文字符集( Character Set ) 中文编码集( Code Set ) 基本编码方式


101035 Chinese Information Processing (中文信息处理)

Chinese NLP

Lecture 2


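Characters (字): Chinese Character Encoding (中文编码)

• Chinese character sets (中文字符集)

• Chinese code sets (中文编码集)

• Basic encoding methods (基本编码方式)

• Chinese encoding methods (中文编码方式)

• International encoding methods (国际编码方式)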


Chinese Character Set (中文字符集)

• Character Set (字符集)

• A character set is a collection of characters.

• {a, b, c, …, z, A, B, C, …, Z, 0, 1, 2, …, 9} is an English character set; {啊, 阿, 唉, …, 作, 坐, 座} is a Chinese character set.

• Each character set has a name, such as ASCII or KANG XI (康熙).

• There is more than one Chinese character set; they differ over time and across regions.


• Chinese Character Set

• GB

• The GB character sets were developed in mainland China and are based on simplified Chinese characters.

• GB is short for 国家标准 (Guojia Biaozhun), which means National Standard.

• This standard is used in places such as mainland China and Singapore.

• Big5

• Big5 is the most widely implemented character set standard in Taiwan and is used for traditional Chinese characters.

• This standard is used in places such as Taiwan and Hong Kong.

• ISO 10646-1 and Unicode

• ISO and the Unicode Consortium jointly develop a multilingual character set that combines the majority of the world's character sets into one large repertoire of characters.

• Simplified and traditional Chinese, Korean and Japanese characters can therefore be displayed on the same HTML page.


Chinese Code Set (中文编码集)

• Code Set (编码集)

• Code set means "coded character set".

• Encoding a character set means representing its characters as bits or bytes.

• The complete set of numerical values is called code space (denoted by CODE).

• A value in code space is called a code (or a code point).

• Encoding is a mapping of a (unique) character (in a character set) to a (unique) code (in a code space).


• Chinese Code Set

• A coded character set, denoted by CC, is a set of tuples, $CC = \{(c_i, code_i) \mid c_i \in C,\ code_i \in CODE\}$, where $code_i \neq code_j$ if $c_i \neq c_j$.

• For example, let C = {中, 文, 计, 算}. C can be encoded with different code spaces.

• If CODE1 = {00, 01, 10, 11}, CC1 = {(中, 00), (文, 01), (计, 10), (算, 11)}.

• If CODE2 = {0000, 0001, 0010, 0011}, CC2 = {(中, 0000), (文, 0001), (计, 0010), (算, 0011)}.

• If CODE3 = {1000, 1001, 1010, 1011}, CC3 = {(中, 1000), (文, 1001), (计, 1010), (算, 1011)}.
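The definition can be sketched in Python as a check that every character gets its own code. The function name `build_coded_set` is hypothetical, introduced only for this illustration; a code space that lists the same code twice cannot satisfy the definition and is rejected.

```python
def build_coded_set(chars, code_space):
    """Pair each character with one code; characters and codes must be unique."""
    if len(set(chars)) != len(chars):
        raise ValueError("characters must be unique")
    if len(set(code_space)) != len(code_space):
        raise ValueError("codes must be unique")
    if len(code_space) < len(chars):
        raise ValueError("code space is too small for the character set")
    # zip pairs the i-th character with the i-th code.
    return dict(zip(chars, code_space))


C = ["中", "文", "计", "算"]
CC1 = build_coded_set(C, ["00", "01", "10", "11"])
CC2 = build_coded_set(C, ["0000", "0001", "0010", "0011"])
print(CC1["计"], CC2["计"])  # 10 0010
```

The same four characters map to different codes in CC1 and CC2, which is the point of separating the character set from the code space.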


In-Class Exercise

• What binary values can be assigned to these 6 characters according to this code space?

(Tip: first, how many bits do you need to encode 6 rows and 6 columns?)

[Figure: Two-Dimensional Code Space (6×6), with rows 1-6 and columns 1-6, containing the characters 啊, 阿, 唉, 作, 坐, 座]


Basic Encoding Method (基本编码方式)

• The mapping of a character in a character set to a code point in code space is called code point assignment.

• An encoding method explains how a character is mapped to a code point, and also how assignments are made so that a mixture of different code sets can be told apart.

[Figure: coded character sets CC1, CC2, CC3 and CC4]


• ASCII

• A popular encoding scheme for English characters is called ASCII (American Standard Code for Information Interchange).

• It defines 128 character code points (0x00 to 0x7F): the first 32 (0x00 to 0x1F) are non-printable control codes, and the other 96 (0x20 to 0x7F) are nominally graphic characters.

• Of those 96, only 94 are actually printable (0x21-0x7E); 0x20 is the space character and 0x7F is the delete character.

• Values are represented with only 7 bits (when stored in a byte, the first bit is 0).
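The ranges above can be checked with a few lines of Python. This is a minimal sketch; the function name `ascii_class` and its labels are assumptions made for the illustration.

```python
def ascii_class(code: int) -> str:
    """Classify a 7-bit ASCII code using the ranges given above."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 0x1F:
        return "control"
    if code == 0x20:
        return "space"
    if code == 0x7F:
        return "delete"
    return "printable"  # 0x21-0x7E


printable = [c for c in range(0x80) if ascii_class(c) == "printable"]
print(len(printable))                         # 94
print(chr(printable[0]), chr(printable[-1]))  # ! ~
```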


• ASCII

[Table: ASCII code chart, indexed by high bits (0000-0111) and low bits (0000-1111)]


Chinese Encoding Method (中文编码方式)

• One byte gives only 256 code points, far too few for Chinese, so the Chinese encodings here use 2 bytes (16 bits) per character.

• However, not all of the 256×256 code points are used for displayable characters. Generally, a 94×94 matrix (8,836 code points) is used for Chinese character encoding.


• One-byte English characters vs. two-byte Chinese characters.

• High-Bit-On Scheme

English: 0xxxxxxx. Code range is 33-126 (<128), or 0x21-0x7E.

Chinese: 1xxxxxxx xxxxxxxx. Code range of the first byte is 161-254 (>128), or 0xA1-0xFE.


• Chinese characters or English characters?

Byte stream (hex): AB AC 41 42 43 A4 40

• 0xAB = 171 > 128, so AB AC is a Chinese character.

• 0x41 = 65 < 128, so 41 is an English character.

• 0x42 = 66 < 128, so 42 is an English character.

• 0x43 = 67 < 128, so 43 is an English character.

• 0xA4 = 164 > 128, so A4 40 is a Chinese character.
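The walk through the byte stream can be sketched as a small scanner that tests the high bit of each lead byte. The function name `split_mixed` and the labels it returns are hypothetical; the sketch assumes the input is well formed, so every high-bit lead byte is followed by a second byte.

```python
def split_mixed(data: bytes):
    """Split a mixed byte stream into one-byte and two-byte characters."""
    out = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            # High bit set: this byte and the next form one Chinese character.
            out.append(("Chinese", data[i:i + 2].hex().upper()))
            i += 2
        else:
            # High bit clear: a single-byte English (ASCII) character.
            out.append(("English", data[i:i + 1].hex().upper()))
            i += 1
    return out


stream = bytes.fromhex("ABAC414243A440")
for kind, code in split_mixed(stream):
    print(kind, code)
```

Note that the second byte (0x40 in A4 40) may be below 128; only the first byte decides how the pair is read.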


• Two encoding methods are common to many character sets used in mainland China, Taiwan and other parts of Asia.

• ISO-2022 and EUC

• In principle, they are locale-independent encoding methods.

• However, their exact definitions depend greatly on the locale. In other words, there are locale-specific instances of these encodings, e.g.

• ISO-2022-CN, ISO-2022-CN-EXT, …

• EUC-CN, EUC-TW, …


• Encoding method and character set

Encoding Method | Supported Character Sets
ASCII | ASCII, GB-Roman, CNS-Roman …
ISO-2022 | ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …
EUC | ASCII, GB-Roman, CNS-Roman, GB 2312-80, CNS 11643-1992 …

ASCII Encoding Range

Hex | Decimal | Meaning
00-1F | 0-31 | Control characters
20 | 32 | Space character
21-7E | 33-126 | Graphic characters (94 printable characters)
7F | 127 | Delete character


• ISO-2022

• ISO-2022 is a modal encoding, which uses escape sequences or other special characters to switch between different modes (one-byte vs. two-byte).

• It is used primarily as an information interchange code for moving text between computer systems, for example in email.

• It is also often referred to as a seven-bit encoding method, because none of the bytes used to represent characters has its eighth bit set.


• ISO-2022-CN (-EXT)

• ISO-2022-CN (-EXT) is a locale-specific implementation of ISO-2022.

• It is achieved through the use of designators and shifts.

• A designator specifies the character set associated with a particular shift.

• There are four shifts: SI, SO, SS2 and SS3. A shift specifies how to interpret the subsequent bytes.

• Each line starts in ASCII, and ends in ASCII.

• A shifting character, indicated by SO (0x0E) or SI (0x0F) switches between one-byte and two-byte modes.

• SO (Shift Out) invokes two-byte mode (for GB 2312-80 and CNS 11643-1992 Plane 1) for the following bytes until SI (Shift In) is encountered which invokes one-byte mode.

• There must be a shift back to ASCII (by SI) before the end of the line.


• ISO-2022-CN (-EXT)

• A single shift sequence, indicated by SS2 (0x1B 0x4E) or SS3 (0x1B 0x4F), invokes two-byte mode only for the following two bytes, and is typically employed for rarely-used character sets.

• A designator (escape) sequence indicates what character set should be invoked when in two-byte mode, e.g.,

• 0x1B 0x24 0x29 0x41 (<ESC> $ ) A in ASCII) indicates GB 2312-80.

Shifting Type | Invoked Character Sets
SO | GB 2312-80, CNS 11643-1992 Plane 1
SS2 | CNS 11643-1992 Plane 2
SS3 | CNS 11643-1992 Planes 3-7


• Designator and shift

Byte stream (hex): 1B 24 29 47 31 30 0E 45 4C 0F 31 38 0E 45 4A 0F

Bytes | Meaning
1B 24 29 47 | Designate CNS 11643 Plane 1
31 30 | ASCII codes (one-byte mode)
0E | SO: shift to two-byte mode
45 4C | CNS Plane 1 code
0F | SI: shift to one-byte mode
31 38 | ASCII codes
0E | SO: shift to two-byte mode
45 4A | CNS Plane 1 code
0F | SI: shift to one-byte mode
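The designator-and-shift reading of this byte stream can be sketched as a small tokenizer. This is an illustration only: the function name `tokenize_iso2022cn` and the token labels are hypothetical, only the two SO designators mentioned in these slides are recognized, and SS2/SS3 single shifts are not handled.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final byte of the designator sequence ESC $ ) <final>, as given in the slides.
SO_DESIGNATORS = {0x41: "GB 2312-80", 0x47: "CNS 11643-1992 Plane 1"}


def tokenize_iso2022cn(data: bytes):
    """Label each part of a stream as designator, shift, ASCII or two-byte code."""
    tokens = []
    two_byte = False   # every line starts in one-byte (ASCII) mode
    charset = None     # set chosen by the most recent designator
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC and data[i + 1:i + 3] == b"$)" and i + 3 < len(data):
            charset = SO_DESIGNATORS.get(data[i + 3], "unknown")
            tokens.append(("designate", charset))
            i += 4
        elif b == SO:
            two_byte = True
            tokens.append(("shift", "SO"))
            i += 1
        elif b == SI:
            two_byte = False
            tokens.append(("shift", "SI"))
            i += 1
        elif two_byte:
            tokens.append((charset, data[i:i + 2].hex().upper()))
            i += 2
        else:
            tokens.append(("ASCII", chr(b)))
            i += 1
    return tokens


sample = bytes.fromhex("1B2429473130" "0E454C0F" "3138" "0E454A0F")
for token in tokenize_iso2022cn(sample):
    print(token)
```

Every byte in the stream is below 0x80, which is why ISO-2022 is called a seven-bit encoding: the mode, not the high bit, tells the reader whether 45 4C is two ASCII letters or one two-byte character.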


• EUC

• EUC (Extended Unix Code) encoding is implemented as the internal code for most Unix software configured to support Japanese.

• Although the U stands for Unix, this encoding is also commonly used on other platforms, such as Windows and Mac OS.

• The full definition of EUC encoding consists of four code sets.

• Code set 0 is always set to the ASCII character set or a country’s own version thereof.

• The remaining code sets are defined as a set of variants from which each country can select.


• EUC-CN

• EUC-CN is a locale-specific implementation of EUC.

Code set | Byte | Hex range | Decimal range
Code set 0 | Byte range | 21-7E | 33-126
Code set 1 | First byte range | A1-FE | 161-254
Code set 1 | Second byte range | A1-FE | 161-254

Code range of both the first byte and the second byte is 161-254 (>128), or 0xA1-0xFE.

EUC-CN (GB): 1xxxxxxx 1xxxxxxx (a 94×94 matrix)
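One way to see why both bytes run from 0xA1 to 0xFE is to index the 94×94 matrix by row and cell (1 to 94) and add 0xA0 to each. The row and cell indexing and the function name `euc_cn_from_row_cell` are assumptions of this sketch rather than something stated in the slides; Python's built-in `gb2312` codec is used only to check the result.

```python
def euc_cn_from_row_cell(row: int, cell: int) -> bytes:
    """Map a position in the 94x94 matrix to its two EUC-CN bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # 0xA0 + 1 = 0xA1 and 0xA0 + 94 = 0xFE, matching the byte ranges above.
    return bytes([0xA0 + row, 0xA0 + cell])


code = euc_cn_from_row_cell(26, 26)
print(code.hex().upper())     # BABA
print(code.decode("gb2312"))  # 汉
```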


• EUC-TW

• EUC-TW is by far the most complex implementation of EUC encoding in terms of how many characters it encodes, i.e. close to 50,000 characters.


• ISO-2022 vs EUC

• EUC encoding is closely related to ISO-2022.

• In fact, every character that can be encoded by ISO-2022 can be converted to an EUC-encoded equivalent.

Locale (Character Set) | Characters | ISO-2022-CN | EUC-CN or EUC-TW Set 1
China (GB 2312-80) | 汉字 | 3A3A 5756 | BABA D7D6
Taiwan (CNS 11643-1992) | 漢字 | 6947 4773 | E9C7 C7F3
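For the two-byte codes in this table, the ISO-2022 and EUC forms differ only in the high bit of each byte. A minimal sketch, assuming the input holds only two-byte character codes (no escape sequences, shifts or ASCII text); the function names are hypothetical:

```python
def iso2022_to_euc(seven_bit: bytes) -> bytes:
    """Set the high bit of every byte (0x3A -> 0xBA)."""
    return bytes(b | 0x80 for b in seven_bit)


def euc_to_iso2022(eight_bit: bytes) -> bytes:
    """Clear the high bit of every byte (0xBA -> 0x3A)."""
    return bytes(b & 0x7F for b in eight_bit)


gb = iso2022_to_euc(bytes.fromhex("3A3A5756"))
print(gb.hex().upper())     # BABAD7D6
print(gb.decode("gb2312"))  # 汉字

cns = iso2022_to_euc(bytes.fromhex("69474773"))
print(cns.hex().upper())    # E9C7C7F3
```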


• GBK

• GBK encoding is implemented as the internal code for the Chinese (PRC) version of Microsoft’s Windows and IBM’s OS/2.

• The GBK character set contains 21,886 symbols and Chinese characters.

Encoding | Byte | Hex range | Decimal range
ASCII or GB-Roman | Byte range | 21-7E | 33-126
GBK | First byte range | 81-FE | 129-254
GBK | Second byte ranges | 40-7E, 80-FE | 64-126, 128-254
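The GBK byte ranges can be written as a range check. This is a sketch of the ranges listed above only: the function name `is_gbk_pair` is hypothetical, and a pair that passes the check is inside the GBK code space but is not guaranteed to have a character assigned to it.

```python
def is_gbk_pair(first: int, second: int) -> bool:
    """True if (first, second) lies inside the GBK two-byte ranges."""
    first_ok = 0x81 <= first <= 0xFE
    # The second byte skips 0x7F, so its range comes in two pieces.
    second_ok = 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE
    return first_ok and second_ok


first, second = "汉".encode("gbk")
print(hex(first), hex(second), is_gbk_pair(first, second))  # 0xba 0xba True
print(is_gbk_pair(0x81, 0x7F))                              # False
```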


• GBK

• One of the design principles of GBK is that it should be fully compatible with GB 2312 while extending it to cover the 20,902 Chinese characters included in the first version of Unicode.


• Big5

• Big5 encoding range has a lot in common with EUC-TW code sets 0 and 1; the main difference being that there is an additional encoding block.

Encoding | Byte | Hex range | Decimal range
ASCII or CNS-Roman | Byte range | 21-7E | 33-126
Big5 | First byte range | A1-FE | 161-254
Big5 | Second byte ranges | 40-7E, A1-FE | 64-126, 161-254


• Big5

• Big5 is the most widely implemented character set standard used in Taiwan and is used for traditional Chinese characters.


International Encoding Method (国际编码方式)

• Unicode and ISO 10646-1.

• We need to develop a multilingual character set combining the majority of the world’s writing systems and character sets into a Universal Character Set (UCS) or Unicode.

Character Set | Encoding Method
Unicode and ISO 10646-1 | UCS-2, UCS-4, UTF-7, UTF-8, UTF-16


• BMP

• The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.


• UCS-2

• A 16-bit representation can encode up to 65,536 (= 2^16) unique code points.

• It allocates the entire encoding space for characters (0x0000-0xFFFF).

• UCS-2 and Unicode encodings are identical for most Chinese characters.

Encoding | Byte | Hex range | Decimal range
UCS-2 | First byte range | 00-FF | 0-255
UCS-2 | Second byte range | 00-FF | 0-255


• UCS-4

• It is a four-byte (actually, 31-bit) representation which can encode up to 2,147,483,648 (= 2^31) code points (0x00000000-0x7FFFFFFF).

• It allocates the entire encoding space for characters (0x00000000-0x7FFFFFFF).

Encoding | Byte | Hex range | Decimal range
UCS-4 | First byte range | 00-7F | 0-127
UCS-4 | Second byte range | 00-FF | 0-255
UCS-4 | Third byte range | 00-FF | 0-255
UCS-4 | Fourth byte range | 00-FF | 0-255


• UCS-2 vs UCS-4

UCS-2 (16-bit): 0000 … FFFF, 65,536 code points. Can only encode the BMP.

Unicode (17 planes): 17 × 256 × 256 = 1,114,112 characters.

UCS-4 (31-bit): 00000000 … 0000FFFF and 00010000 … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.


• UTF-16

• In essence, UTF-16 encodes the BMP exactly as UCS-2 does (16 bits per character), so the two are compatible there.

• But it also gives access to the next 16 planes, which are otherwise only reachable through UCS-4 (32-bit) encoding.

• The surrogates area is defined with UTF-16 to allow for expansion beyond the 16-bit code space: a character outside the BMP is written as a pair of 16-bit surrogate values.
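The surrogate mechanism can be sketched as follows: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogates` is hypothetical; Python's built-in `utf-16-be` codec is used only as a cross-check.

```python
def to_surrogates(cp: int):
    """Return the (high, low) UTF-16 surrogate pair for a code point beyond the BMP."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in planes 1-16")
    v = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 + (v >> 10)      # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits -> DC00..DFFF
    return high, low


for cp in (0x10000, 0x10FFFF):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
# U+10000 -> D800 DC00
# U+10FFFF -> DBFF DFFF
```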


• UCS-2 vs UCS-4 vs UTF-16

UCS-2: 0000 … FFFF, 65,536 code points. Can only encode the BMP (Plane 0).

UCS-4: 00000000 … 0000FFFF and 00010000 … 7FFFFFFF, 2,147,483,648 code points. Sufficient to encode all 17 planes of the Unicode set.

Unicode: Plane 0, Plane 1, …, Plane 16. Scalar values are denoted by U+; planes 1 to 16 cover U+10000 … U+10FFFF (UCS-4 00010000 … 0010FFFF).

UTF-16 surrogates: D800 DC00 … DBFF DFFF.


• Base64

• Base64 uses 64 characters: the upper-case and lower-case Roman letters (A-Z, a-z), the numerals (0-9), and the "+" and "/" symbols.


• Base64

• Step 1: Base64 takes every three bytes (each consisting of eight bits) and converts them into four six-bit segments.

• Step 2: Each six-bit segment is then converted into a character in the Base64 character set.

• Step 3: If the size of the original data in bytes is not a multiple of three, we append enough bytes with a value of "0" to create a 3-byte group. Each appended byte is shown in the output as the Base64 padding character "=".

Example (input bytes 25 B4 69 41, padded with two zero bytes):

Bytes: 00100101 10110100 01101001 | 01000001 00000000 00000000

Six-bit segments: 001001 011011 010001 101001 | 010000 010000 000000 000000

Base64 output: JbRp QQ==
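The three steps can be sketched directly in Python. The function name `base64_encode` is hypothetical and written only to mirror the steps; the standard library's `base64.b64encode` is used as a cross-check.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_encode(data: bytes) -> str:
    """Encode bytes with Base64 by following steps 1 to 3."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)                               # step 3: missing bytes
        n = int.from_bytes(chunk + b"\x00" * pad, "big")   # one 24-bit group
        # Step 1: cut the 24 bits into four 6-bit segments.
        # Step 2: look each segment up in the Base64 alphabet.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            chars[-pad:] = "=" * pad                       # mark the padding
        out.append("".join(chars))
    return "".join(out)


data = bytes.fromhex("25B46941")
print(base64_encode(data))                  # JbRpQQ==
print(base64.b64encode(data).decode())      # JbRpQQ==
```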


In-Class Exercise

• What is the result of applying Base64 to the three characters with hex codes BEAE, CED3 and B7F5 (a Japanese name, 小林剑)?

(Please first convert the Hex to Bin.)


• UTF-7

• UTF-7 uses the same character set as Base64.

• UTF-7 is different from Base64 in that:

• The "padding" character is not necessary.

• The Base64-like transformation is applied only to specific characters.

• Those characters that require Base64 transformation according to UTF-7 encoding begin with a "plus" character (+, 0x2B) and end with a "hyphen" (-, 0x2D) character.

Character String | M | y | (space) | 河 豚
UCS-2 Encoding | 004D | 0079 | 0020 | 6CB3 8C5A
UTF-7 Encoding | M (4D) | y (79) | (20) | +bLOMWg-

The first three characters are written as their ASCII codes.
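The shifted part of the example can be reproduced in a few lines. This is a sketch under stated assumptions: the function name `utf7_shifted` is hypothetical, it handles only a run of characters that need the Base64 transformation, and it takes the 16-bit (UTF-16 big-endian) codes as input, which match UCS-2 for BMP characters.

```python
import base64


def utf7_shifted(text: str) -> str:
    """Encode a run of non-ASCII text as a UTF-7 shifted sequence."""
    raw = text.encode("utf-16-be")                           # 6C B3 8C 5A for 河豚
    b64 = base64.b64encode(raw).decode("ascii").rstrip("=")  # no padding in UTF-7
    return "+" + b64 + "-"                                   # "+" opens, "-" closes


encoded = "My " + utf7_shifted("河豚")
print(encoded)                                 # My +bLOMWg-
print(encoded.encode("ascii").decode("utf-7")) # My 河豚
```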


• UTF-8

• UTF-8 encoding was developed as a way to represent Unicode text as a stream of one or more eight-bit units (bytes), rather than as true 16-bit units.

• It converts UCS-2 into a mixed one- through three-byte encoding.

• It converts UCS-4 into a mixed one- through six-byte encoding.

• It converts UTF-16 into a mixed one- through four-byte encoding.

• It is therefore an eight-bit and variable-length encoding.

• UTF-8 is the de facto standard encoding for interchange of Unicode text.


• UTF-8

• Encoding Templates

• For all but the ASCII-compatible range, the number of first-byte high-order bits set to “1” indicates the byte length.

• Fill the templates starting from the rightmost bits.

UCS-2 Range | UTF-8 Bit Arrays
0000-007F | 0xxxxxxx
0080-07FF | 110xxxxx 10xxxxxx
0800-FFFF | 1110xxxx 10xxxxxx 10xxxxxx

Unicode Range (+) | UTF-8 Bit Arrays
0001 0000 - 001F FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

UCS-4 Range (+) | UTF-8 Bit Arrays
0020 0000 - 03FF FFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx


• UTF-8 Example: Convert the Unicode character "茶" into UTF-8 code.

• Step 1: the Unicode value of 茶 (tea) is hexadecimal 8336, which falls in the range 0800-FFFF. So we need 3 bytes.

• Step 2: the binary form of hexadecimal 8336 is 1000 001100 110110.

• Step 3: Fill the empty slots of the three-byte template with the binary value and get:

1110xxxx 10xxxxxx 10xxxxxx

11101000 10001100 10110110

• Step 4: UTF-8 code value is thus E8 8C B6.
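The template filling can be sketched for the one- to four-byte templates that cover Unicode up to U+10FFFF. The function name `utf8_encode` is hypothetical; the sketch does not reject surrogate values or implement the five- and six-byte UCS-4 templates, and Python's built-in encoder is used as a cross-check.

```python
def utf8_encode(cp: int) -> bytes:
    """Fill the UTF-8 template that matches the code point's range."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


tea = utf8_encode(0x8336)
print(tea.hex(" ").upper())             # E8 8C B6
print(tea == "茶".encode("utf-8"))      # True
```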


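Wrap-Up

• Chinese character sets (中文字符集)

• Chinese code sets (中文编码集)

• Basic encoding methods (基本编码方式): ASCII

• Chinese encoding methods (中文编码方式): ISO-2022, EUC, GBK, Big5

• International encoding methods (国际编码方式): Unicode, ISO 10646-1, UCS-2, UCS-4, UTF-16, Base64, UTF-7, UTF-8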