mlc-cdl

Upload: xdboy2006

Post on 03-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 MLC-CDL

    1/8

  • 8/12/2019 MLC-CDL

    2/8

    2

    Multi-Lingual Computing

    An A lphAbet of S troKe t ypeSSince the set of components is unlim-

    ited, the components themselves needto have descriptions, and we dont stopdescribing until weve described (eitherdirectly or by reference) the individualstrokes of each character.

    Fortunately, we nd that an alphabetof just 36 stroke types is enough for avery large class of CJK characters.

    Wenlin Institutes CDL speci cationde nes a set of stroke types, which hasbeen translated into the block of 36 CJKStrokes, shown in the CJK S troKeS tableabove. The strokes in the rst column arealready encoded in Unicode 5.0, whilethe rest will become a formal part of thestandard presumably in version 5.1.

    f rom p ICtureS t o S troKeS

    One of the unique features of CJK char-acters is their high degree of dependenceon norms determining stroke type,stroke order, and stroke count. Thesehandwriting norms began to take formin the earliest known periods of Chinesewriting more than 3,000 years ago, andunderwent various changes and re ne -ments in subsequent periods, as the re-sult of changes in writing implement,writing media, writing method (print-ing), and language changes (vocabularydevelopment).

    A major script reform took place inthe Qn Dynasty (c. 200 BCE). In the

    Eastern Hn period (c. 121 AD) an in-uential analysis codi ed the elementsof the script. There were 9,353 Chinesecharacters at that time, and 1,163 vari-ants, according to the most important Qng Dynasty recension of the

    Shu Wn Ji Z text (that of Dun Yci, d. 1815).

    As written texts grew in complexity, aslibraries collected more and more books,and as the inventories of Chinese charac-ters became larger and larger, it became

    urgent to de ne methods for organizingand indexing the elements of the com-plex character set.

    Through re nements climaxing in the Sng period (c. 1,000 AD), coinciding

    with the widespread use of woodblockprinting, the Han glyphs attained thestyle evident in Unicodes code charts.Glyphs in the Sng style are highly con-strained in many ways, but most impor-

    tantly in terms of the types of strokeswhich may occur.

    Below is an archaic pictograph for theword meaning chariot (such as is seenin oracle bone inscriptions [OBI = ]from more than 3,000 years ago):

    Compare the above with the EasternHnSmall Seal form ( , shown below),more than 1,000 years later, but havingthe relatively simple form attested muchearlier (in middle and late Zhu Dy-nasty inscriptions):

    In the Sng printed style (below) manyold curves are now sharp angles and thestrokes are very regular and stylized:

    TheSng-style chariot character abovehas exactly seven strokes (well countthem later); it is the standard form in Japan, Taiwan, and Korea. In PRC andSingapore, the standard simple form (be-low) has only four strokes:

    Like many simple forms, is based

    on cursive forms in use for many centu-ries. Nevertheless, its strokes are all Sng style. CDL is equally well suited to de-scribing the simpler and more complexCJK variants that are preferred in di er -ent locales.

    On the left below is an archaic pic-tograph for bird (again, an early OBIform), and on the right is the Eastern HnSeal form:

    Below are the modern Sng-stylede-scendants, full form (left, 11 strokes) andsimple form (right, 5 strokes):

    Han characters were not originally

    composed of a limited set of strokes, butonly gradually came to be written thatway. If the characters were still just pic-tures rather than being made up of stan-

    dard strokes, maybe the best that couldbe done would be to assign a numberto each picture. Fortunately, the set ofstroke types evident in a Sng-style type-face can be regarded in a very real senseas the alphabet of CJK writing.

    t he o rIgInS of CDlCDL was developed byWenlin Institute

    as part of its software package for learn-ing Chinese, to teach the fundamentals oftraditional Chinese writing. The softwareprovides animated stroke-by-stroke dis-play of more than 50,000 CJK characters.

    In the published versions of Wenlin,the CDL language itself is a hidden fea-ture of the software implementation.The special abilities to view, create andmodify CDL descriptions have so far onlybeen included in unpublished versionsof the software. The authors are work-ing towards making these features, andthe database of over 50,000 descriptions,available to a wider audience.

    StroKe AnD C omp (onent )The most important keywords in Char-

    acter Description Language arestroke

    andcomp

    (an abbreviation forcompo-

    nent ). Weve called strokes the basic CJKalphabet, and CDL can describe anyHan character purely in terms of strokes.Nevertheless, the majority of characterscan be described more brie y and moremeaningfully in terms of components.Components then have their own de-scriptions, sometimes in terms of sim-pler components, until nally the sim -

  • 8/12/2019 MLC-CDL

    3/8

    3

    Multi-Lingual Computing

    plest components are described purelyin terms of strokes.

    Since that is the usual situation, welldemonstrate the usage of comp rst, andstroke later.

    A B irD In the C HARIOT (IS worth ...)

    A CDL description for our ying chari-ot (or wheeled bird?) character mightlook almost like this:

    Just looking at the code, it isnt hard tosee that the character being described isbuilt from two components: chariot

    on the left, and bird on the right.However, we generally want more pre-

    cise control over the positions of com-ponents and (especially) strokes. (Oth-erwise, CDL would be similar to an oldertechnology known as Ideographic Descrip-tions Sequences, which is useful but lacksmany of the capabilities of CDL.)

    Most often when we create descrip-tions, we use a graphical interface to po-sition the elements visually, by draggingon control points with the mouse. Weusually dont need to read or type any

    numbers; the CDL description is auto-matically updated to contain the exactcoordinates.

    Therefore, the form of CDL that weactually use doesnt have a position attribute. Instead, it has a points attri-bute with numerical (x, y) coordinates,and looks like this:

    [Technical note: Any coordinate systemcould be used. In the current implementation,we have found it convenient and e cientto use coordinates from 0 to 128, where (0,0)is the (left, top) corner and (128,128) is the(right, bottom) corner. Decimal points areallowed for fractional coordinates (such as 3.14159), so the precision is practically un-limited and does not depend on the number

    (such as 128) that is chosen for the maximumcoordinate value.]

    t he w enlIn S troKIng b oxThe illustrations below are screenshots

    from the Stroking Box in an unpublishedCDL-development version of the Wenlin

    software. The interface looks like this:

    Its the same interface that the pub-lished educational version uses whendisplaying stroke-by-stroke animation,but with two additional capabilities: onefor displaying the CDL code, and onefor modifying the positions by pointingwith the mouse and dragging on controlpoints. The illustration below showsthe control points at the corners of thebounding boxes:

    Heres what it looks like if we dragsome points around to make the chariot smaller and the bird bigger:

    Its possible to convert automaticallyany description that uses comp ele-ments into one that uses only stroke elements (eighteen of them in this case).When this is done, all control points are

    at the stroke level, as in the following il-lustration:

    The all-strokes description allows us toadjust the position of any stroke, add orremove strokes, and so forth, perhaps todescribe a peculiar variant like this one:

    Seven -StroKe C HARIOT To demonstrate the stroke element,

    heres a description of :

    Thats pretty simple. In fact, its toosimple for some purposes, and well addsome important details shortly, but rstlets explain what we have so far.

    The character is described as a se-quence of four elements: a stroke, a com-ponent, and two more strokes. (This is

    the order in which the character is tra-ditionally written: basically from top tobottom, but with the long vertical strokelast.)

    The component needs to have itsown description, which well supply fur-ther below. As well see, itself has fourstrokes, so has a total of seven strokes.Now, what doh and s stand for?

  • 8/12/2019 MLC-CDL

    4/8

    4

    Multi-Lingual Computing

    h = h eng = h orIzontAl Each of the stroke elements has a

    stroke type attribute, specifying one ofthe 36 types, represented by the ASCIIabbreviation that is used in the UnicodeCJK Stroke name. starts with an h andends with an s .

    The rst stroke in , h , goes horizon-tally from left to right. Its name is anabbreviation for Mandarin hng hori-zontal, which is the traditional name forthis stroke type. (Its a coincidence thathng and horizontal both start with theletter h.)

    The last stroke in , s , starts at thetop center and goes straight down. Theletter s is an abbreviation for sh vertical stroke.

    (The 36 members of Unicodes new CJK

    Strokes block [see the table above] makeit possible to use the Unicode charactersor numbers as alternatives to the Manda-rin-ASCII stroke-name abbreviations.)

    h ere C omeS t he SunWe promised to show the description

    of , which is needed for a complete in-terpretation of the description we gavefor . The character by itself meanssun or day; however, as a componentin , its simply a reusable arrangementof four strokes.

    The only new stroke type here is hz ,which stands for Mandarin hng-zh

    horizontal-turning. This is a strokethat turns a corner; in this case, the top-right corner of .

    We could choose to describe direct-ly in terms of seven strokes, and avoidusing , but it would be a longer de-scription (with seven elements instead offour). The reuse of components makes acollection of descriptions not only moree cient in terms of space, but also moreinternally consistent and easier to devel-op and maintain.

    g ettIng t o t he p oIntS To enable the full power of CDL to de-

    scribe exactly how to display a character,each stroke element has a points at-

    tribute specifying placement of its con-trol points.

    Just as with components, a graphicalinterface enables us to move the controlpoints around visibly on the screen, bydragging on them with a mouse. There-fore, its often unnecessary for the per-son who creates, edits, or uses a CDL de-scription to be concerned with the actualnumerical coordinates.

    This is how the control points are dis-played in the Stroking Box:

    By dragging them apart, we can see thefour individual strokes:

    The number of control points dependson the stroke type. The simple h and stypes each require only a starting pointand an ending point. The more intri-cate hz uses three points: the start, thecorner, and the end. (The most intricatestroke type is named hzzzg and uses sixpoints. It occurs in the character . )

    For those of you who are interested,here is a description of with points included:

    InformAtIon o verloAD ?Before your eyes start to glaze over,

    we recommend that you skip this sec-tion and move on to the next one, unless youre really wanting more technical de-tails right now! This is the last chunk ofcode well show in this article.

    Here is our most complete descriptionof sun:

    The rst line opens the cdl element,and the last line ends it with .This time weve included four optionalattributes for the cdl element: char and uni , which simply say this is a de-scription of ( U+65e5 ); also vari-ant , which assigns the description toa particular variant identi er, in casewe have multiple descriptions for thesame character. (We discuss variationand uni cation further below.)

    The fourth optional attribute is thepoints for the character as a whole.(That is, the rst points attribute,before the two stroke elements.) Asweve used it in , it has the e ect ofmaking slightly narrower and shorterthan the regular square, and this im-proves its appearance, partly by avoidinghaving strokes running along the outeredges of the square. (By the way, thisrst points attribute only a ects thedisplay of as a standalone character.When is used as a comp in anothercharacter, such as , the points at-tribute of the comp tag in that characteroverrides this one.)

    Finally, the strokes have optional attri-butes such as tail=long , which af-fect subtle decorative features of the Sng style outlines, especially where strokesmeet. CDL descriptions can be displayed

  • 8/12/2019 MLC-CDL

    5/8

  • 8/12/2019 MLC-CDL

    6/8

    6

    Multi-Lingual Computing

    characters and represent, on a computer,the characters that were so freely andfrequently created in the past with hand-writing and woodblock printing.

    JAbberwoCKy u nenCoDeD ?If English writers were encumbered by

    the restrictions currently a icting CJKwriters, the situation would be obviouslyintolerable. For example, in productionof editions of Shakespeare, or Chaucer,or Lewis Carroll, it would be impossibleto use the original spellings, unless theyhappened to be in a dictionary used as asource for a computer encoding.

    When the famous linguist Y. R. Chao( ) translated Lewis Carrolls poem Jabberwocky into Chinese, he inventedsome new Han characters and words tomatch the invented English words. Somemodern avant-garde artists have pro-duced beautiful calligraphy using made-up or nonsense Han characters. Whensuch creative people start using CDL, forbetter or worse, theyre going to have alot of fun breaking the Unicode barrier.CJKFinnegans Wake here we come!

    A pplICAtIon 3: D IStInguIShIng e nCoDeD v ArIAntS

    By design (and bynecessity), Unicodeuni es (i.e. lumps together under a singlecode point) many variants of CJK charac-ters that di er in small (and presumably,mostly non-distinctive) ways, such as thepresence or absence of a dot, or whetherstroke segments are joined or separated.

    Such variations can be very importantfor some purposes. For example, di er -ent national standards sometimes assigndi erent o cial stroke counts to thesame character, and the stroke countsdetermine the organization of dictionar-ies, etc.

    For another example, historical Chi-nese texts can often be dated on the ba-sis of a single stroke being intentionallyomitted in taboo avoidance of the name

    of the reigning emperor.ShAKSpere , S hAKSper , S hAxper :u nIfIeD v ArIAntS ?

    The Bard himself reportedly spelled hisname variously in his own day as Shak-spere, Shaksper, Shaxper, and Shake-speare. If English words needed to beencoded like CJK characters, and the vari-ant spellings were uni ed, then wed beunable to write the preceding sentence

    (without resorting to some non-standardtext representation practice). The digiti-zation of original editions of importanthistorical texts would be hindered by thelack of prior lexicographic and encodingwork. This is the status quo for CJK texts,and it presents an unacceptable and un-

    necessary bottleneck.Given that a goal of Unicode is to pro-

    vide an international CJK character en-coding, transcending temporal and po-litical boundaries, Unicode doesnt havethe luxury of narrow scope. In fact theproblems faced in Unicodes CJK encod-ing are greatly compounded, requiringin e ect innovative and comprehensivelexicographic work to be the gating con-straint on the digitization of texts.

    Clearly, an adequate system needs tobe developed and promoted for wide

    adoption, to make possible the short-term progress and long-term success ofCJK digitization projects. CDL enablescomputerized encoding and structuredcombination of the minimally distinctivefeatures that are lost in Unicode uni ca -tion.

    Unicode has recognized the need foridentifying variants that share a codepoint, and for this purpose has estab-lished the variant selector mechanism.The need remains, however, to assign aprecise meaning to each combinationof code point and variant selector (see

    Hiura & Muller 2006:UTS #37 , ). CDLis ideal for this purpose. Each code pointcan have multiple CDL descriptions, onedescription for each variant, uniquelyidenti ed by a selector.

    A pplICAtIon 4: CDl f ontSA database of CDL descriptions serves

    as a font that can be used by a CDL-awareapplication to display any CJK text. A CDLfont has some valuable characteristics.

    A CDL font can be compressed to anextremely small size, much smaller than

    any other format with comparable qual-ity. This is because the vast majority ofcharacters can be described with just twocomp tags each, and for such a characterall that needs to be stored is two Unicodenumbers and a few coordinates. The re-use of component descriptions meansthat each component only needs to bedescribed once in the font. Wenlins CDLfont le of over CJK 50,000 characters oc-

    cupies less than one megabyte of storage(about 18 bytes per character).

    A CDL font displays very quickly, at aspeed comparable to conventional fontformats (TrueType, etc.).

    We have implemented utilities for

    converting CDL descriptions into otherfont/graphics formats including Post-script, Scalable Vector Graphics (SVG),and METAFONT.

    Since the descriptions are standards-based, and components are reused, theCDL font is inherently very consistent.Contrast the CJK fonts that have beenused, for example, to print editions ofThe Unicode Standard, which by necessityhave been combinations of fonts thatdont really go together and thereforeare inconsistent in the ways that par-ticular components and strokes are dis-

    played in di erent characters.Adding a new glyph to a CDL font is

    generally just a matter of combining afew elements and adjusting a few coordi-nates. Thats much easier than adding anew glyph to an ordinary font, especiallyif you are trying to match the style.

    A single CDL font can be displayed indi erent styles and weights. Thus it re -ally serves as a font family. (In this re-spect, CDL has something in commonwith the METAFONT language, and withMultiple Master fonts.) The CDL descrip-

    tions themselves determine the essentialshape or skeleton of a character in ageneral way. The CDL interpreter is re-sponsible for deciding stroke thickness,and whether to include optional decora-tions. Here are two examples of the sametext, displayed using the same CDL de-scriptions, but di erent choices of style:

    A pplICAtIon 5: h AnDwrItIng r eCognItIon

    Since CDL describes the way each CJKcharacter is actually written by hand,stroke by stroke, it naturally has applica-tions for handwriting recognition. Thiscapability of CDL is demonstrated by thebrush tool input method of the publishedWenlin software. When the user selects

  • 8/12/2019 MLC-CDL

    7/8

    7

    Multi-Lingual Computing

    the brush tool, a window is displayedcontaining a square region in which theuser can write a character by hand, us-ing a mouse or a pen input device. If thehandwritten character is similar enoughto any of the 50,000 characters in the CDLdatabase, then the character is inserted

    (as Unicode) into the text which the useris editing. The recognition accuracy rateis high, provided that characters arewritten with standard stroke count andstroke order. There is an option to ignorestroke order, but accuracy is higher ifthat option is turned o and the standardorder is used. There is another option forshowing a list of characters similar to thehandwritten character, from which theuser can choose.

    The recognition algorithm currentlyemployed is essentially simple. It countsthe strokes in the handwritten character,then compares it with all the charactersin the CDL database with the same num-ber of strokes, and nds the most similarcharacter(s), based on the distance (sumsquared di erence) between the cor -responding stroke coordinates. Perhapssurprisingly, this is not only very fast (afraction of a second), but also accurate.

    Currently only the beginning and end-ing coordinates of each stroke are used.Even higher accuracy could be achievedby taking into account intermediatepoints along the path of a stroke, rath-er than only the beginning and endingpoints. The algorithm could be extendedto handle cursive handwriting, in whichstrokes are not necessarily separated bylifting the pen. We have experimentedwith using the CDL database for train-ing a neural network to recognize stroketypes.

    Theoretically, it should be possible toimplement a recognizer which wouldcreate a complete CDL description basedon a handwritten character, even if therewere no matching description in the CDLdatabase. In that situation, the recog-

    nizer could insert the CDL descriptionitself (as XML), rather than a Unicodecharacter code, directly into the text.In this way one could input unencodedcharacters such as our rst example ofthe chariot plus bird combination.

    A pplICAtIon 6: I nDexIng by StroKeS AnD / or C omponentS

    Many CJK dictionaries and other refer-ence works include indexes for locating

    characters according to their strokesand/or components. (Pronunciation isanother popular key to organizing CJKdictionaries, but of course it requiresknowledge of the pronunciation, whichfor someone consulting a dictionary issometimes like putting the cart before

    the horse.)Components used for organizing dic-

    tionaries and indexes are known in Eng-lish as radicals. A conventional set of 214radicals has been used to organize manydictionaries, starting with Zihui in the15th century, and including the famousKangxi Dictionary published in 1716. Ourchariot plus bird character might belisted under either the chariot radical orthe bird radical. (You would have to guesswhich of the two, and generally try both;but don't bother since this characterprobably isn't in your dictionary!)

    Thousands of characters may share thesame radical, just as thousands of wordswritten alphabetically may start withthe same letter. For alphabetical index-ing, the order depends on the non-initialletters in a word. Early radical-based dic-tionaries listed same-radical charactersin essentially random order, but even-tually it became common to sub-groupusing residual stroke count, which is thenumber of strokes a character wouldhave if its radical were removed. Char-acters with the same radical and sameresidual stroke count were still listed inrandom order, and this practice survivesin the Unicode standard. Some 20th-century dictionaries, however, startedusing a third organizing principle to de-termine the order of characters with thesame radical and same residual strokecount. This principle takes into accountthe types of the residual strokes, usingclassi cation systems related to whatare now the CDL stroke typesh , s , etc.The system which has become standardin PRC distinguishes ve categories ofstroke types with a conventional order-ing. The CDL speci cation includes themapping of the 36 CDL stroke types tothe ve stroke categories.

    Owing to the di culties with radi -cals, many reference works employ analternative method instead of (or in ad-dition to) radicals. This method simplyorders characters by their total strokecounts, and orders characters with thesame stroke count by the type of therst stroke. If the stroke count and rst

    stroke type are the same, the type of thesecond stroke is taken into account, andso forth.

    The Wenlin software uses its CDL da-tabase to provide lists of characters or-dered by radical, residual stroke count,and residual stroke type; and also listsordered by total stroke count and stroketypes.

    We used CDL to prepare the radical andstroke indexes for the ABC Chinese-EnglishComprehensive Dictionary, edited by JohnDeFrancis, published in book form by theUniversity of Hawaii Press, 2003 (ISBN0-8248-2766-X). Wenlin Insitute also pro-vided the Unicode Consortium with CDL-based data to assist in the constructionof a radical index for a subset of the Hancharacters in the Unicode Standard 5.0.

    A pplICAtIon 7: m AIntAInIng AnD e xtenDIng u nICoDe

    Organizing the more than 70,000 CJKcharacters already in Unicode, and thethousands more destined for future en-coding, is an extremely challenging task.This is especially true since the mainmethod of organization used so far hasbeen the venerable Kangxi Dictionary sys-tem just described. Characters are rstgrouped by radicals, but di erent lexi -cographers classify the same characterunder di erent radicals (like chariotversus bird). Then characters are sub-

    grouped by residual stroke count, butdi erent lexicographers often countslightly di erent numbers of strokes inthe same character. Finally, the orderof characters with the same radical andsame residual stroke count is random.(Unicode follows the random order ofKangxi Dictionary for characters that areincluded in that dictionary.)

    Under these circumstances, it wouldbe unfair and pointless to blame anybodyfor the fact that some characters haveaccidentally been encoded twice. (Thatis, sometimes two di erent Unicode

    numbers have been assigned to the samecharacter.) Such mistakes were boundto occur when it is so easy not to nd acharacter which is in a di erent locationfrom where one expects to nd it. Whenan obscure character isn't found amongthe already encoded characters, aftersearching several times, the conclusionis naturally reached that the characterhasn't been encoded yet. This problemhas been greatly exacerbated by the fact

  • 8/12/2019 MLC-CDL

    8/8