

    University of Pennsylvania

    ScholarlyCommons

    Departmental Papers (CIS)

    October 1993

    Building a Large Annotated Corpus of English: The Penn Treebank

    Mitchell Marcus, University of Pennsylvania

    Beatrice Santorini, University of Pennsylvania

    Mary Ann Marcinkiewicz, University of Pennsylvania


    Building a Large Annotated Corpus of English: The Penn Treebank

    Abstract

    In this paper, we review our experience with constructing one such large annotated corpus, the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the project (1989-1992), this corpus has been annotated for part-of-speech (POS) information.


    Building A Large Annotated Corpus of English: The Penn Treebank

    MS-CIS-93-87
    LINC LAB 26

    Mitchell P. Marcus
    Beatrice Santorini
    Mary Ann Marcinkiewicz

    University of Pennsylvania
    School of Engineering and Applied Science
    Computer and Information Science Department
    Philadelphia, PA 19104-6389

    October 1993


    Building a large annotated corpus of English: the Penn Treebank

    Mitchell P. Marcus*  Beatrice Santorini†  Mary Ann Marcinkiewicz*

    *Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
    †Department of Linguistics, Northwestern University, Evanston, IL 60208

    1 Introduction

    There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding


    then corrected by human annotators. Section 3 briefly presents the results of a comparison between entirely manual and semi-automated tagging, with the latter being shown to be superior on three counts: speed, consistency, and accuracy. In section 4, we turn to the bracketing task. Just as with the tagging task, we have partially automated the bracketing task: the output of the POS tagging phase is automatically parsed and simplified to yield a skeletal syntactic representation, which is then corrected by human annotators. After presenting the set of syntactic tags that we use, we illustrate and discuss the bracketing process. In particular, we will outline various factors that affect the speed with which annotators are able to correct bracketed structures, a task which, not surprisingly, is considerably more difficult than correcting POS-tagged text. Finally, section 5 describes the composition and size of the current Treebank corpus, briefly reviews some of the research projects that have relied on it to date, and indicates the directions that the project is likely to take in the future.

    2 Part-of-speech tagging

    2.1 A simplified POS tagset for English

    The POS tagsets used to annotate large corpora in the past have traditionally been fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags3 ([Francis 1964], [Francis and Kučera 1982]) and allows the formation of compound tags; thus, the contraction I'm is tagged as PPSS+BEM (PPSS for non-3rd person nominative personal pronoun and BEM for am, 'm). Subsequent projects have tended to elaborate the Brown Corpus tagset.4 For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. The rationale behind developing such large, richly articulated tagsets is to approach the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour ([Garside et al. 1987, 167]).

    2.1.1 Recoverability

    Like the tagsets just mentioned, the Penn Treebank tagset is based on that of the Brown Corpus. However, the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown Corpus tagset by paring it down considerably. A key strategy in reducing the tagset was to eliminate redundancy by taking into account both lexical and syntactic information. Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical redundancy. For instance, the Brown Corpus distinguishes five different forms for main verbs: the base form is tagged VB, and forms with overt endings are indicated by appending D for past tense, G for present participle/gerund, N for past participle and Z for third person singular present. Exactly the same paradigm is recognized for have, but have (regardless of whether it is used as an auxiliary or a main verb) is assigned its own base tag HV. The Brown Corpus further distinguishes three forms

    3 Counting both simple and compound tags, the Brown Corpus tagset contains 187 tags.

    4 A useful overview of the relation of these and other tagsets to each other and to the Brown Corpus tagset is given in Appendix B of [Garside et al. 1987].


    of do (the base form (DO), the past tense (DOD), and the third person singular present (DOZ)) and eight forms of be: the five forms distinguished for regular verbs as well as the irregular forms am (BEM), are (BER) and was (BEDZ). By contrast, since the distinctions between the forms of VB on the one hand and the forms of BE, DO and HV on the other are lexically recoverable, they are eliminated in the Penn Treebank, as shown in Table 1.5

    Table 1: Elimination of lexically recoverable distinctions
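    To make the collapse concrete, here is a minimal sketch in Python; it is our own illustration of the prose above, not a reproduction of Table 1, with am/are mapped to VBP as footnote 6 describes.

    # A sketch of the collapse just described: Brown Corpus tags for the forms
    # of be, do and have map onto the same Penn Treebank tags as the
    # corresponding forms of regular verbs.
    BROWN_TO_PTB = {
        'BE': 'VB',  'BED': 'VBD', 'BEDZ': 'VBD', 'BEG': 'VBG',
        'BEM': 'VBP', 'BEN': 'VBN', 'BER': 'VBP', 'BEZ': 'VBZ',
        'DO': 'VB',  'DOD': 'VBD', 'DOZ': 'VBZ',
        'HV': 'VB',  'HVD': 'VBD', 'HVG': 'VBG', 'HVN': 'VBN', 'HVZ': 'VBZ',
    }

    def collapse(brown_tag):
        """Return the Penn Treebank tag for a Brown Corpus verb tag."""
        return BROWN_TO_PTB.get(brown_tag, brown_tag)  # others pass through

    assert collapse('BEDZ') == 'VBD'   # was -> the ordinary past tense tag

    The distinction is lexically recoverable in the other direction as well: given the word was and the tag VBD, the finer Brown tag BEDZ can be restored mechanically.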

    A second example of lexical recoverability concerns those words that can precede articles in noun phrases. The Brown Corpus assigns a separate tag to pre-qualifiers (quite, rather, such), pre-quantifiers (all, half, many, nary) and both. The Penn Treebank, on the other hand, assigns all of these words to a single category PDT (predeterminer). Further examples of lexically recoverable categories are the Brown Corpus categories PPL (singular reflexive pronoun) and PPLS (plural reflexive pronoun), which we collapse with PRP (personal pronoun), and the Brown Corpus category RN (nominal adverb), which we collapse with RB (adverb).

    Beyond reducing lexically recoverable distinctions, we also eliminated certain POS distinctions that are recoverable with reference to syntactic structure. For instance, the Penn Treebank tagset does not distinguish subject pronouns from object pronouns even in cases where the distinction is not recoverable from the pronoun's form, as with you, since the distinction is recoverable on the basis of the pronoun's position in the parse tree in the parsed version of the corpus. Similarly, the Penn Treebank tagset conflates subordinating conjunctions with prepositions, tagging both categories as IN. The distinction between the two categories is not lost, however, since subordinating conjunctions can be recovered as those instances of IN that precede clauses, whereas prepositions are those instances of IN that precede noun phrases or prepositional phrases. We would like to emphasize that the lexical and syntactic recoverability inherent in the POS-tagged version of the Penn Treebank corpus allows end users to employ a much richer tagset than the small one described in section 2.2 if the need arises.
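    As a concrete illustration of such recovery, the sketch below (ours, not part of the paper) subclassifies IN tokens by the category of their right sibling, using NLTK's bundled 10% sample of the parsed Treebank; the function name refine_in and the refined labels are our own inventions.

    from nltk.corpus import treebank   # NLTK's bundled Treebank sample

    def refine_in(tree):
        """Yield (word, refined tag) for IN preterminals: a following clause
        signals a subordinating conjunction, a following NP or PP a preposition."""
        for i, child in enumerate(tree):
            if isinstance(child, str):          # bare leaf, nothing to refine
                continue
            if child.label() == 'IN' and i + 1 < len(tree):
                sibling = tree[i + 1]
                if not isinstance(sibling, str):
                    if sibling.label().startswith('S'):
                        yield child[0], 'IN-subordinating-conjunction'
                    elif sibling.label().startswith(('NP', 'PP')):
                        yield child[0], 'IN-preposition'
            yield from refine_in(child)         # recurse into subconstituents

    for sent in treebank.parsed_sents()[:10]:
        print(list(refine_in(sent)))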

    2.1.2 Consistency

    As noted above, one reason for eliminating a POS tag such as RN (nominal adverb) is its lexical recoverability. Another important reason for doing so is consistency. For instance, in the Brown

    5 The gerund and the participle of do are tagged VBG and VBN in the Brown Corpus, respectively, presumably because they are never used as auxiliary verbs.

    6 The irregular present tense forms am and are are tagged as VBP in the Penn Treebank (see section 2.1.3), just like any other non-third person singular present tense form.


    Corpus, the deictic adverbs there and now are always tagged RB (adverb), whereas their counterparts here and then are inconsistently tagged as RB (adverb) or RN (nominal adverb), even in identical syntactic contexts, such as after a preposition. It is clear that reducing the size of the tagset reduces the chances of such tagging inconsistencies.

    2.1.3 Syntactic function

    A further difference between the Penn Treebank and the Brown Corpus concerns the significance accorded to syntactic context. In the Brown Corpus, words tend to be tagged independently of their syntactic function.7 For instance, in the phrase the one, one is always tagged as CD (cardinal number), whereas in the corresponding plural phrase the ones, ones is always tagged as NNS (plural common noun), despite the parallel function of one and ones as heads of their noun phrase. By contrast, since one of the main roles of the tagged version of the Penn Treebank corpus is to serve as the basis for a bracketed version of the corpus, we encode a word's syntactic function in its POS tag whenever possible. Thus, one is tagged as NN (singular common noun) rather than as CD (cardinal number) when it is the head of a noun phrase. Similarly, while the Brown Corpus tags both as ABX (pre-quantifier, double conjunction) regardless of whether it functions as a prenominal modifier (both the boys), a postnominal modifier (the boys both), the head of a noun phrase (both of the boys) or part of a complex coordinating conjunction (both boys and girls), the Penn Treebank tags both differently in each of these syntactic contexts: as PDT (predeterminer), RB (adverb), NNS (plural common noun) and coordinating conjunction (CC), respectively.

    There is one case in which our concern with tagging by syntactic function has led us to bifurcate Brown Corpus categories rather than to collapse them: namely, in the case of the uninflected form of verbs. Whereas the Brown Corpus tags the bare form of a verb as VB regardless of whether it occurs in a tensed clause, the Penn Treebank tagset distinguishes VB (infinitive or imperative) from VBP (non-third person singular present tense).

    2.1.4 Indeterminacy

    A final difference between the Penn Treebank tagset and all other tagsets we are aware of concerns the issue of indeterminacy: both POS ambiguity in the text and annotator uncertainty. In many cases, POS ambiguity can be resolved with reference to the linguistic context. So, for instance, in Katharine Hepburn's witty line Grant can be outspoken, but not by anyone I know, the presence of the by-phrase forces us to consider outspoken as the past participle of a transitive derivative of speak (outspeak) rather than as the adjective outspoken. However, even given explicit criteria for assigning POS tags to potentially ambiguous words, it is not always possible to assign a unique tag to a word with confidence. Since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag. Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one. In principle, annotators can

    7 An important exception is there, which the Brown Corpus tags as EX (existential there) when it is used as a formal subject and as RB (adverb) when it is used as a locative adverb. In the case of there, we did not pursue our strategy of tagset reduction to its logical conclusion, which would have implied tagging existential there as NN (common noun).


    tag a word with any number of tags, but in practice, multiple tags are restricted to a small number of recurring two-tag combinations: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle).
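    For users processing the tagged files, splitting such a token is trivial; the helper below is our own sketch and assumes the vertical bar shown above separates the alternative tags.

    def parse_token(token):
        """Split a tagged token: 'executive/JJ|NN' -> ('executive', ['JJ', 'NN'])."""
        word, _, tags = token.rpartition('/')
        return word, tags.split('|')

    assert parse_token('executive/JJ|NN') == ('executive', ['JJ', 'NN'])
    assert parse_token('the/DT') == ('the', ['DT'])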

    2.2 The POS tagset

    The Penn Treebank tagset is given in Table 2. It contains 36 POS tags and 12 other tags (for punctuation and currency symbols). A detailed description of the guidelines governing the use of the tagset is available in [Santorini 1990].8

    Table 2: The Penn Treebank POS tagset

     1. CC    Coordinating conjunction                25. TO    to
     2. CD    Cardinal number                         26. UH    Interjection
     3. DT    Determiner                              27. VB    Verb, base form
     4. EX    Existential there                       28. VBD   Verb, past tense
     5. FW    Foreign word                            29. VBG   Verb, gerund/present participle
     6. IN    Preposition/subord. conjunction         30. VBN   Verb, past participle
     7. JJ    Adjective                               31. VBP   Verb, non-3rd ps. sing. present
     8. JJR   Adjective, comparative                  32. VBZ   Verb, 3rd ps. sing. present
     9. JJS   Adjective, superlative                  33. WDT   wh-determiner
    10. LS    List item marker                        34. WP    wh-pronoun
    11. MD    Modal                                   35. WP$   Possessive wh-pronoun
    12. NN    Noun, singular or mass                  36. WRB   wh-adverb
    13. NNS   Noun, plural                            37. #     Pound sign
    14. NNP   Proper noun, singular                   38. $     Dollar sign
    15. NNPS  Proper noun, plural                     39. .     Sentence-final punctuation
    16. PDT   Predeterminer                           40. ,     Comma
    17. POS   Possessive ending                       41. :     Colon, semi-colon
    18. PRP   Personal pronoun                        42. (     Left bracket character
    19. PP$   Possessive pronoun                      43. )     Right bracket character
    20. RB    Adverb                                  44. "     Straight double quote
    21. RBR   Adverb, comparative                     45. `     Left open single quote
    22. RBS   Adverb, superlative                     46. ``    Left open double quote
    23. RP    Particle                                47. '     Right close single quote
    24. SYM   Symbol (mathematical or scientific)     48. ''    Right close double quote

    8 In versions of the tagged corpus distributed before November 1992, singular proper nouns, plural proper nouns and personal pronouns were tagged as NP, NPS and PP, respectively. The current tags NNP, NNPS and PRP were introduced in order to avoid confusion with the syntactic tags NP (noun phrase) and PP (prepositional phrase) (see Table 3).


    2.3 The POS tagging process

    The tagged version of the Penn Treebank corpus is produced in two stages, using a combination of automatic POS assignment and manual correction.

    2.3.1 Automated stage

    During the early stages of the Penn Treebank project, the initial automatic POS assignment was provided by PARTS ([Church 1988]), a stochastic algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The output of PARTS was automatically tokenized9 and the tags assigned by PARTS were automatically mapped onto the Penn Treebank tagset. This mapping introduces about 4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS tagset does not.10 A sample of the resulting tagged text, which has an error rate of 7-9%, is shown in Figure 1.

    Figure 1: Sample tagged text, before correction

    More recently, the automatic POS assignments are provided by a cascade of stochastic and rule-driven taggers developed on the basis of our early experience. Since these taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error rates of 2-6%.
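    The tokenization convention mentioned above (and spelled out in footnote 9: contractions and genitives are split into separately tagged morphemes) can be sketched as follows; the clitic list and regular expression are our own simplification, not the project's actual tokenizer.

    import re

    CLITICS = re.compile(r"(n't|'s|'re|'ve|'ll|'d|'m)$")

    def split_clitics(word):
        """Split a clitic off the end of a word, leaving the host intact."""
        m = CLITICS.search(word)
        if m and m.start() > 0:
            return [word[:m.start()], m.group(1)]
        return [word]

    assert split_clitics("children's") == ['children', "'s"]   # children/NNS 's/POS
    assert split_clitics("won't") == ['wo', "n't"]             # wo/MD n't/RB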

    2.3.2 Manual correction stage

    The result of the first, automated stage of POS tagging is given to annotators to correct. The annotators use a mouse-based package written in GNU Emacs Lisp, which is embedded within the

    9 In contrast to the Brown Corpus, we do not allow compound tags of the sort illustrated above for I'm. Rather, contractions and the Anglo-Saxon genitive of nouns are automatically split into their component morphemes, and each morpheme is tagged separately. Thus, children's is tagged children/NNS 's/POS and won't is tagged wo/MD n't/RB.

    10 The two largest sources of mapping error are that the PARTS tagset distinguishes neither infinitives from the non-third person singular present tense forms of verbs, nor prepositions from particles in cases like run up the hill and run up the bill.


    GNU Emacs editor ([Lewis et al. 1990]). The package allows annotators to correct POS assignment errors by positioning the cursor on an incorrectly tagged word and then entering the desired correct tag (or sequence of multiple tags). The annotators' input is automatically checked against the list of legal tags in Table 2 and, if valid, appended to the original word-tag pair, separated by an asterisk. Appending the new tag rather than replacing the old tag allows us to easily identify recurring errors at the automatic POS assignment stage. We believe that the confusion matrices that can be extracted from this information should also prove useful in designing better automatic taggers in the future. The result of this second stage of POS tagging is shown in Figure 2. Finally, in the distribution version of the tagged corpus, any incorrect tags assigned at the first, automatic stage are removed.

    Figure 2: Sample tagged text, after correction
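    The asterisk convention makes such error analysis mechanical. Assuming corrected tokens take the form word/AUTOTAG*CORRECTEDTAG (our reading of the description above), a confusion matrix over the automatic assignments falls out in a few lines:

    from collections import Counter

    def confusion_matrix(tokens):
        """Count (automatic tag, corrected tag) pairs over corrected tokens."""
        matrix = Counter()
        for token in tokens:
            word, _, tags = token.rpartition('/')
            if '*' in tags:
                auto, corrected = tags.split('*', 1)
                matrix[auto, corrected] += 1
        return matrix

    sample = ['the/DT', 'flies/VBZ*NNS', 'bite/VBP']
    print(confusion_matrix(sample))   # Counter({('VBZ', 'NNS'): 1})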

    The learning curve for the POS tagging task takes under a month (at 15 hours a week), and annotation speeds after a month exceed 3,000 words per hour.

    3 Two modes of annotation: an experiment

    To determine how to maximize the speed, inter-annotator consistency and accuracy of POS tagging, we performed an experiment at the very beginning of the project to compare two alternative modes of annotation. In the first annotation mode ("tagging"), annotators tagged unannotated text entirely by hand; in the second mode ("correcting"), they verified and corrected the output of PARTS, modified as described above. This experiment showed that manual tagging took about twice as long as correcting, with about twice the inter-annotator disagreement rate and an error rate that was about 50% higher.

    Four annotators, all with graduate training in linguistics, participated in the experiment. All completed a training sequence consisting of fifteen hours of correcting, followed by six hours of tagging. The training material was selected from a variety of nonfiction genres in the Brown Corpus. All the annotators were familiar with GNU Emacs at the outset of the experiment. Eight 2,000-word samples were selected from the Brown Corpus, two each from four different genres (two fiction, two nonfiction), none of which the annotators had encountered in training. The texts for the correction task were automatically tagged as described in section 2.3. Each annotator first


    manually tagged four texts and then corrected four automatically tagged texts. Each annotator completed the four genres in a different permutation.

    A repeated measures analysis of annotation speed with annotator identity, genre and annotation mode (tagging vs. correcting) as classification variables showed a significant annotation mode effect (p < .05). No other effects or interactions were significant. The average speed for correcting was more than twice as fast as the average speed for tagging: 20 minutes vs. 44 minutes per 1,000 words. (Median speeds per 1,000 words were 22 vs. 42 minutes.)

    A simple measure of tagging consistency is inter-annotator disagreement rate, the rate at which annotators disagree with one another over the tagging of lexical tokens, expressed as a percentage of the raw number of such disagreements over the number of words in a given text sample. For a given text and n annotators, there are n(n-1)/2 disagreement ratios (one for each possible pair of annotators). Mean inter-annotator disagreement was 7.2% for the tagging task and 4.1% for the correcting task (with medians 7.2% and 3.6%, respectively). Upon examination, a disproportionate amount of disagreement in the correcting case was found to be caused by one text that contained many instances of a cover symbol for chemical and other formulas. In the absence of an explicit guideline for tagging this case, the annotators had made arbitrary decisions.
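    The pairwise disagreement measure defined above is straightforward to compute; a minimal sketch over aligned tag sequences, one per annotator:

    from itertools import combinations

    def disagreement_rates(annotations):
        """Return the n(n-1)/2 pairwise disagreement ratios for n annotators,
        each given as a tag sequence over the same tokens."""
        rates = {}
        for (i, a), (j, b) in combinations(enumerate(annotations), 2):
            rates[i, j] = sum(x != y for x, y in zip(a, b)) / len(a)
        return rates

    tags = [['DT', 'NN', 'VBZ'],
            ['DT', 'NN', 'NNS'],
            ['DT', 'JJ', 'VBZ']]
    print(disagreement_rates(tags))   # {(0, 1): 0.33.., (0, 2): 0.33.., (1, 2): 0.66..}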


     

    4 Bracketing

    4.1 Basic methodology

    The methodology for bracketing the corpus is completely parallel to that for tagging: hand correction of the output of an errorful automatic process. Fidditch, a deterministic parser developed by Donald Hindle first at the University of Pennsylvania and subsequently at AT&T Bell Labs ([Hindle 1983], [Hindle 1989]), is used to provide an initial parse of the material. Annotators then hand correct the parser's output using a mouse-based interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it ideally suited to serve as a preprocessor to hand correction:

    • Fidditch always provides exactly one analysis for any given sentence, so that annotators need not search through multiple analyses.

    • Fidditch never attaches any constituent whose role in the larger structure it cannot determine with certainty. In cases of uncertainty, Fidditch chunks the input into a string of trees, providing only a partial structure for each sentence.

    • Fidditch has rather good grammatical coverage, so that the grammatical chunks that it does build are usually quite accurate.

    Because of these properties, annotators do not need to rebracket much of the parser's output, a relatively time-consuming task. Rather, the annotators' main task is to glue together the syntactic chunks produced by the parser. Using a mouse-based interface, annotators move each unattached chunk of structure under the node to which it should be attached. Notational devices allow annotators to indicate uncertainty concerning constituent labels, and to indicate multiple attachment sites for ambiguous modifiers. The bracketing process is described in more detail in section 4.3.

    4.2 The syntactic tagset

    Table 3 shows the set of syntactic tags and null elements that we use in our skeletal bracketing. More detailed information on the syntactic tagset and guidelines concerning its use are to be found in [Santorini and Marcinkiewicz 1991].


    Table 3: The Penn Treebank syntactic tagset

    Tags

    ADJP    Adjective phrase
    ADVP    Adverb phrase
    NP      Noun phrase
    PP      Prepositional phrase
    S       Simple declarative clause
    SBAR    Clause introduced by subordinating conjunction or 0 (see below)
    SBARQ   Direct question introduced by wh-word or wh-phrase
    SINV    Declarative sentence with subject-aux inversion
    SQ      Subconstituent of SBARQ excluding wh-word or wh-phrase
    VP      Verb phrase
    WHADVP  Wh-adverb phrase
    WHNP    Wh-noun phrase
    WHPP    Wh-prepositional phrase
    X       Constituent of unknown or uncertain category

    Null elements

    *       Understood subject of infinitive or imperative
    0       Zero variant of that in subordinate clauses
    T       Trace; marks position where moved wh-constituent is interpreted
    NIL     Marks position where preposition is interpreted in pied-piping contexts

    Although different in detail, our tagset is similar in delicacy to that used by the Lancaster Treebank Project, except that we allow null elements in the syntactic annotation. Because of the need to achieve a fairly high output per hour, it was decided not to require annotators to create distinctions beyond those provided by the parser. Our approach to developing the syntactic tagset was highly pragmatic and strongly influenced by the need to create a large body of annotated material given limited human resources. Despite the skeletal nature of the bracketing, however, it is possible to make quite delicate distinctions when using the corpus by searching for combinations of structures. For example, an SBAR containing the word to immediately before the VP will necessarily be infinitival, while an SBAR containing a verb or auxiliary with a tense feature will necessarily be tensed. To take another example, so-called that-clauses can be identified easily by searching for SBARs containing the word that or the null element 0 in initial position.
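    Such searches are easy to script. The sketch below (ours, not the paper's) finds that-clauses in NLTK's bundled Treebank sample; note that the sample's null-element notation differs slightly from the 0 used in this paper.

    from nltk.corpus import treebank

    def that_clauses(tree):
        """Yield SBAR subtrees whose initial word is 'that' (or the paper's
        null complementizer, written 0)."""
        for sub in tree.subtrees(lambda t: t.label() == 'SBAR'):
            leaves = sub.leaves()
            if leaves and leaves[0] in ('that', '0'):
                yield sub

    for sent in treebank.parsed_sents()[:100]:
        for clause in that_clauses(sent):
            print(' '.join(clause.leaves()))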

    As can be seen from Table 3, the syntactic tagset used by the Penn Treebank includes a variety of null elements, a subset of the null elements introduced by Fidditch. While it would be expensive to insert null elements entirely by hand, it has not proved overly onerous to maintain and correct those that are automatically provided. We have chosen to retain these null elements because we believe that they can be exploited in many cases to establish a sentence's predicate-argument structure;


    at least one recipient of the parsed corpus has used it to bootstrap the development of lexicons for particular NLP projects and has found the presence of null elements to be a considerable aid in determining verb transitivity (Robert Ingria, personal communication). While these null elements correspond more directly to entities in some grammatical theories than in others, it is not our intention to lean toward one or another theoretical view in producing our corpus. Rather, since the representational framework for grammatical structure in the Treebank is a relatively impoverished flat context-free notation, the easiest mechanism to include information about predicate-argument structure, although indirectly, is by allowing the parse tree to contain explicit null items.

    4.3 Sample bracketing output

    Below, we illustrate the bracketing process for the first sentence of our sample text. Figure 3 shows the output of Fidditch (modified slightly to include our POS tags).


    Figure 3: Sample bracketed text; full structure provided by Fidditch

    (S (NP (NBAR (ADJP (ADJ Battle-tested/JJ)
                       (ADJ industrial/JJ))
                 (NPL managers/NNS)))
       (? (ADV here/RB))
       (? (ADV always/RB))
       (AUX (TNS *))
       (VP (VPRES buck/VBP))
       (? (PP (PREP up/RP)
              (NP (NBAR (ADJ nervous/JJ)
                        (NPL newcomers/NNS)))))
       (? (PP (PREP with/IN)
              (NP (DART the/DT)
                  (NBAR (N tale/NN))
                  (PP (PREP of/IN)
                      (NP (DART the/DT)
                          (NBAR (ADJP (ADJ first/JJ))))))))
       (? (PP (PREP of/IN)
              (NP (PROS their/PP$)
                  (NBAR (NPL countrymen/NNS)))))
       (? (S (NP (PRO *))
             (AUX to/TNS)
             (VP (V visit/VB)
                 (NP (PNP Mexico/NNP)))))
       (? (MID ,/,))
       (? (NP (IART a/DT)
              (NBAR (N boatload/NN))
              (PP (PREP of/IN)
                  (NP (NBAR (NPL warriors/NNS))))
              (VP (VPPRT blown/VBN)
                  (? (ADV ashore/RB))
                  (NP (NBAR (CARD 375/CD)
                            (NPL years/NNS))))))
       (? (ADV ago/RB))
       (? (FIN ./.)))

    As Figure 3 shows, Fidditch leaves very many constituents unattached, labeling them as ?, and its output is perhaps better thought of as a string of tree fragments than as a single tree structure.


    Fidditch only builds structure when this is possible for a purely syntactic parser without access to semantic or pragmatic information, and it always errs on the side of caution. Since determining the correct attachment point of prepositional phrases, relative clauses, and adverbial modifiers almost always requires extrasyntactic information, Fidditch pursues the very conservative strategy of always leaving such constituents unattached, even if only one attachment point is syntactically possible. However, Fidditch does indicate its best guess concerning a fragment's attachment site by the fragment's depth of embedding. Moreover, it attaches prepositional phrases beginning with of if the preposition immediately follows a noun; thus, tale of and boatload of are parsed as single constituents, while first of is not. Since Fidditch lacks a large verb lexicon, it cannot decide whether some constituents serve as adjuncts or arguments and hence leaves subordinate clauses such as infinitives as separate fragments. Note further that Fidditch creates adjective phrases only when it determines that more than one lexical item belongs in the ADJP. Finally, as is well known, the scope of conjunctions and other coordinate structures can only be determined given the richest forms of contextual information; here again, Fidditch simply turns out a string of tree fragments around any conjunction. Because all decisions within Fidditch are made locally, all commas (which often signal conjunction) must disrupt the input into separate chunks.

    The original design of the Treebank called for a level of syntactic analysis comparable to the skeletal analysis used by the Lancaster Treebank, but a limited experiment was performed early in the project to investigate the feasibility of providing greater levels of structural detail. While the results were somewhat unclear, there was evidence that annotators could maintain a much faster rate of hand correction if the parser output was simplified in various ways, reducing the visual complexity of the tree representations and eliminating a range of minor decisions. The key results of this experiment were:

    • Annotators take substantially longer to learn the bracketing task than the POS tagging task, with substantial increases in speed occurring even after two months of training.

    • Annotators can correct the full structure provided by Fidditch at an average speed of approx. 375 words per hour after three weeks, and 475 words per hour after six weeks.

    • Reducing the output from the full structure shown in Figure 3 to a more skeletal representation similar to that used by the Lancaster UCREL Treebank Project increases annotator productivity by approx. 100-200 words per hour.

    • It proved to be very difficult for annotators to distinguish between a verb's arguments and adjuncts in all cases. Allowing annotators to ignore this distinction when it is unclear (attaching constituents high) increases productivity by approx. 150-200 words per hour. Informal examination of later annotation showed that forced distinctions cannot be made consistently.

    As a result of this experiment, the originally proposed skeletal representation was adopted, without a forced distinction between arguments and adjuncts. Even after extended training, performance varies markedly by annotator, with speeds on the task of correcting skeletal structure without requiring a distinction between arguments and adjuncts ranging from approx. 750 words per hour to well over 1,000 words per hour after three or four months experience. The fastest annotators work in bursts of well over 1,500 words per hour alternating with brief rests. At an


    average rate of 750 words per hour, a team of five part-time annotators annotating three hours a day should maintain an output of about 2.5 million words a year of treebanked sentences, with each sentence corrected once.
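    (The arithmetic, assuming roughly 225 working days a year: 750 words/hour x 3 hours/day x 5 annotators = 11,250 words a day, or about 2.5 million words a year.)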

    It is worth noting that experienced annotators can proofread previously corrected material at very high speeds. A parsed subcorpus of over one million words was recently proofread at an average speed of approx. 4,000 words per annotator per hour. At this rate of productivity, annotators are able to find and correct gross errors in parsing, but do not have time to check, for example, whether they agree with all prepositional phrase attachments.

    The process that creates the skeletal representations to be corrected by the annotators simplifies and flattens the structures shown in Figure 3 by removing POS tags, non-branching lexical nodes and certain phrasal nodes, notably NBAR. The output of the first automated stage of the bracketing task is shown in Figure 4.
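    A rough sketch of this flattening step (our approximation, not the actual Fidditch postprocessor): cut preterminals down to their bare words and splice out nodes such as NBAR.

    from nltk import Tree

    REMOVE = {'NBAR'}   # phrasal nodes to splice out; the real list is longer

    def simplify(tree):
        """Return the flattened children of `tree`: POS tags dropped,
        nodes in REMOVE spliced out."""
        children = []
        for child in tree:
            if not isinstance(child, Tree):
                children.append(child)
            elif child.height() == 2:                  # preterminal: bare word
                children.append(child[0].split('/')[0])
            elif child.label() in REMOVE:
                children.extend(simplify(child))       # splice the node out
            else:
                children.append(Tree(child.label(), simplify(child)))
        return children

    t = Tree.fromstring('(NP (NBAR (ADJ nervous/JJ) (NPL newcomers/NNS)))')
    print(Tree('NP', simplify(t)))   # (NP nervous newcomers)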


    Figure 4: Sample bracketed text; after simplification, before correction

    (S (NP (ADJP Battle-tested industrial)
           managers)
       (? here)
       (? always)
       (VP buck)
       (? (PP up
              (NP nervous newcomers)))
       (? (PP with
              (NP the tale
                  (PP of
                      (NP the
                          (ADJP first))))))
       (? (PP of
              (NP their countrymen)))
       (? (S (NP *)
             to
             (VP visit
                 (NP Mexico))))
       (? ,)
       (? (NP a boatload
              (PP of
                  (NP warriors))
              (VP blown
                  (? ashore)
                  (NP 375 years))))
       (? ago)
       (? .))

    Annotators correct this simplified structure using a mouse-based interface. Their primary job is to glue fragments together, but they must also correct incorrect parses and delete some structure. Single mouse clicks perform the following tasks, among others. The interface correctly reindents the structure whenever necessary.

    • Attach constituents labeled ?. This is done by pressing down the appropriate mouse button on or immediately after the ?, moving the mouse onto or immediately after the label of the intended parent and releasing the mouse. Attaching constituents automatically deletes their ? label.

    • Promote a constituent up one level of structure, making it a sibling of its current parent.


    • Delete a pair of constituent brackets.

    • Create a pair of brackets around a constituent. This is done by typing a constituent tag and then sweeping out the intended constituent with the mouse. The tag is checked to assure that it is a legal label.

    • Change the label of a constituent. The new tag is checked to assure that it is legal.

    The bracketed text after correction is shown in Figure 5. The fragments are now connected together into one rooted tree structure. The result is a skeletal analysis in that much syntactic detail is left unannotated. Most prominently, all internal structure of the NP up through the head and including any single-word post-head modifiers is left unannotated.

    Figure 5: Sample bracketed text; after correction

    (S (NP Battle-tested industrial managers
           here)
       always
       (VP buck
           up
           (NP nervous newcomers)
           (PP with
               (NP the tale
                   (PP of
                       (NP (NP the
                               (ADJP first)
                               (PP of
                                   (NP their countrymen))
                               (S (NP *)
                                  to
                                  (VP visit
                                      (NP Mexico))))
                           (NP (NP a boatload
                                   (PP of
                                       (NP (NP warriors)
                                           (VP-1 blown
                                                 ashore
                                                 (ADVP (NP 375 years)
                                                       ago)))))
                               (VP-1 *pseudo-attach*))))))))


    As noted above in connection with POS tagging, a major goal of the Treebank project is to allow annotators only to indicate structure of which they are certain. The Treebank provides two notational devices to ensure this goal: the X constituent label and so-called pseudo-attachment. The X constituent label is used if an annotator is sure that a sequence of words is a major constituent but is unsure of its syntactic category; in such cases, the annotator simply brackets the sequence and labels it X. The second notational device, pseudo-attachment, has two primary uses. On the one hand, it is used to annotate what Kay has called permanent predictable ambiguities, allowing an annotator to indicate that a structure is globally ambiguous even given the surrounding context (annotators always assign structure to a sentence on the basis of its context). An example of this use of pseudo-attachment is shown in Figure 5, where the participial phrase blown ashore 375 years ago modifies either warriors or boatload, but there is no way of settling the question: both attachments mean exactly the same thing. In the case at hand, the pseudo-attachment notation indicates that the annotator of the sentence thought that VP-1 is most likely a modifier of warriors, but that it is also possible that it is a modifier of boatload.12 A second use of pseudo-attachment is to allow annotators to represent the underlying position of extraposed elements: in addition to being attached in its superficial position in the tree, the extraposed constituent is pseudo-attached within the constituent to which it is semantically related. Note that except for the device of pseudo-attachment, the skeletal analysis of the Treebank is entirely restricted to simple context-free trees.
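    In bracketed files of this form, pseudo-attachments are easy to locate mechanically; a small sketch of ours over NLTK-style trees, following the index-sharing convention of Figure 5:

    from nltk import Tree

    def pseudo_attachments(tree):
        """Yield the labels (e.g. 'VP-1') of pseudo-attached constituents,
        i.e. nodes whose only leaf is *pseudo-attach*."""
        for sub in tree.subtrees():
            if sub.leaves() == ['*pseudo-attach*']:
                yield sub.label()

    t = Tree.fromstring('(NP (NP a boatload) (VP-1 *pseudo-attach*))')
    print(list(pseudo_attachments(t)))   # ['VP-1']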

    The reader may have noticed that the ADJP brackets in Figure 4 have vanished in Figure 5. For the sake of the overall efficiency of the annotation task, we leave all ADJP brackets in the simplified structure, with the annotators expected to remove many of them during annotation. The reason for this is somewhat complex, but provides a good example of the considerations that come into play in designing the details of annotation methods. The first relevant fact is that Fidditch only outputs ADJP brackets within NPs for adjective phrases containing more than one lexical item. To be consistent, the final structure must contain ADJP nodes for all adjective phrases within NPs or for none; we have chosen to delete all such nodes within NPs under normal circumstances. (This does not affect the use of the ADJP tag for predicative adjective phrases outside of NPs.) In a seemingly unrelated guideline, all coordinate structures are annotated in the Treebank; such coordinate structures are represented by Chomsky-adjunction when the two conjoined constituents bear the same label. This means that if an NP contains coordinated adjective phrases, then an ADJP tag will be used to tag that coordination even though simple ADJPs within NPs will not bear an ADJP tag. Experience has shown that annotators can delete pairs of brackets extremely quickly using the mouse-based tools, whereas creating brackets is a much slower operation. Because the coordination of adjectives is quite common, it is more efficient to leave in ADJP labels, and delete them if they are not part of a coordinate structure, than to reintroduce them if necessary.

    5 Progress to date

    5.1 Composition and size of corpus

    Table 4 shows the output of the Penn Treebank project at the end of its first phase.

    12 This use of pseudo-attachment is identical to its original use in Church's parser ([Church 1980]).


    Table 4: Penn Treebank (as of 11/92)

    Description                       Tagged for        Skeletal
                                      Part of Speech    Parsing
                                      (Tokens)          (Tokens)

    Dept. of Energy abstracts
    Dow Jones Newswire stories
    Dept. of Agriculture bulletins
    Library of America texts
    MUC-3 messages
    IBM Manual sentences
    WBUR radio transcripts
    ATIS sentences
    Brown Corpus, retagged

    Total:                            4,885,781         2,681,118

    All the materials listed above are available on CD-ROM to members of the Linguistic Data Consortium.13 About a million words of POS-tagged material and a small sampling of skeletally parsed text are available as part of the first Association for Computational Linguistics/Data Collection Initiative CD-ROM, and a somewhat larger subset of materials is available on cartridge tape directly from the Penn Treebank Project. For information, contact the first author of this paper or send email to treebank@unagi.cis.upenn.edu.

    • The Dept. of Energy abstracts are scientific abstracts from a variety of disciplines.

    • All of the skeletally parsed Dow Jones Newswire materials are also available as digitally recorded read speech as part of the DARPA WSJ-CSR corpus, available through the Linguistic Data Consortium.

    • The Dept. of Agriculture materials include short bulletins such as when to plant various flowers and how to can various vegetables and fruits.

    • The Library of America texts are 5,000-10,000 word passages, mainly book chapters, from a variety of American authors including Mark Twain, Henry Adams, Willa Cather, Herman Melville, W.E.B. Dubois, and Ralph Waldo Emerson.

    • The MUC-3 texts are all news stories from the Federal News Service about terrorist activities in South America. Some of these texts are translations of Spanish news stories or transcripts of radio broadcasts. They are taken from training materials for the 3rd Message Understanding Conference.

    13 Contact the Linguistic Data Consortium, 441 Williams Hall, University of Pennsylvania, Philadelphia, PA 19104-6305, or send email to ldc@unagi.cis.upenn.edu for more information.


    • The Brown Corpus materials were completely retagged in the Penn Treebank project, starting from the untagged version of the Brown Corpus ([Francis 1964]).

    • The IBM sentences are taken from IBM computer manuals; they are chosen to contain a vocabulary of 3,000 words, and are limited in length.

    • The ATIS sentences are transcribed versions of spontaneous sentences collected as training materials for the DARPA Air Travel Information System project.

    The entire corpus has been tagged for POS information, at an estimated error rate of approx. 3%. The POS-tagged version of the Library of America texts and the Department of Agriculture bulletins have been corrected twice (each by a different annotator), and the corrected files were then carefully adjudicated; we estimate the error rate of the adjudicated version at well under 1%. Using a version of PARTS retrained on the entire preliminary corpus and adjudicating between the output of the retrained version and the preliminary version of the corpus, we plan to reduce the error rate of the final version of the corpus to approx. 1%. All the skeletally parsed materials have been corrected once, except for the Brown materials, which have been quickly proofread.

    The skeletal representation currently employed has a variety of limitations. Many otherwise clear argument/adjunct relations in the corpus are not indicated due to the current Treebank's essentially context-free representation. For example, there is at present no satisfactory representation for sentences in which complement noun phrases or clauses occur after a sentential-level adverb. Either the adverb is trapped within the VP, so that the complement can occur within the VP where it belongs, or else the adverb is attached to the S, closing off the VP and forcing the complement to attach to the S. This "trapping" problem serves as a limitation for groups that currently use Treebank material to semiautomatically derive lexicons for particular applications. For most of these problems, however, solutions are possible on the basis of mechanisms already used by the Treebank Project. For example, the pseudo-attachment notation


    can be extended to indicate a variety of crossing dependencies. We have recently begun to use this mechanism to represent various kinds of dislocations, and the Treebank annotators themselves have developed a detailed proposal to extend pseudo-attachment to a wide range of similar phenomena.

    A variety of inconsistencies in the annotation scheme used within the Treebank have also become apparent with time. The annotation schemes for some syntactic categories should be unified to allow a consistent approach to determining predicate-argument structure. To take a very simple example, sentential adverbs attach under VP when they occur between auxiliaries and predicative ADJPs, but attach under S when they occur between auxiliaries and VPs. These structures need to be regularized.

    As the current Treebank has been exploited by a variety of users, a significant number have expressed a need for forms of annotation richer than provided by the project's first phase. Some users would like a less skeletal form of annotation of surface grammatical structure, expanding the essentially context-free analysis of the current Penn Treebank to indicate a wide variety of non-contiguous structures and dependencies. A wide range of Treebank users now strongly desire a level of annotation which makes explicit some form of predicate-argument structure. The desired level of representation would make explicit the logical subject and logical object of the verb, and would indicate, at least in clear cases, which subconstituents serve as arguments of the underlying predicates and which serve as modifiers.

    During the next phase of the Treebank project, we expect both to provide a richer analysis of the existing corpus and to provide a parallel corpus of predicate-argument structures. This will be done by first enriching the annotation of the current corpus, and then automatically extracting predicate-argument structure, at the level of distinguishing logical subjects and objects, and distinguishing arguments from adjuncts for clear cases. Enrichment will be achieved by automatically transforming the current Penn Treebank into a level of structure close to the intended target, and then completing the conversion by hand.
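    As a very rough indication of what such extraction might look like, the sketch below (entirely ours, not the project's planned algorithm) pulls a logical subject and object out of a simple declarative clause:

    from nltk import Tree

    def subject_object(s):
        """For a simple declarative clause, take the NP daughter of S as the
        logical subject and the first NP daughter of its VP as the object."""
        def first(tree, label):
            return next((c for c in tree
                         if isinstance(c, Tree) and c.label() == label), None)
        subj, vp = first(s, 'NP'), first(s, 'VP')
        obj = first(vp, 'NP') if vp is not None else None
        words = lambda t: ' '.join(t.leaves()) if t is not None else None
        return words(subj), words(obj)

    s = Tree.fromstring('(S (NP the annotators) (VP corrected (NP the parse)))')
    print(subject_object(s))   # ('the annotators', 'the parse')

    Real extraction would of course have to handle passives, extraposition and the null elements discussed in section 4.2, which is exactly why the project plans to enrich the annotation first.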

    References

    [Brill 1991]  Brill, Eric. 1991. Discovering the lexical features of a language. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.

    [Brill et al. 1990]  Brill, Eric; Magerman, David; Marcus, Mitchell P.; and Santorini, Beatrice. 1990. Deducing linguistic structure from the statistics of large corpora. In Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pages 275-282.

    [Church 1980]  Church, Kenneth W. 1980. On memory limitations in natural language processing. MIT LCS Technical Report 245. Master's thesis, Massachusetts Institute of Technology.

    [Church 1988]  Church, Kenneth W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of


    the Second Conference on Applied Natural Language Processing, 26th Annual Meeting of the Association for Computational Linguistics, pages 136-143.

    [Francis 1964]  Francis, W. Nelson. 1964. A standard sample of present-day English for use with digital computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007. Brown University, Providence.

    [Francis and Kučera 1982]  Francis, W. Nelson, and Kučera, Henry. 1982. Frequency analysis of English usage. Lexicon and grammar. Houghton Mifflin, Boston.

    [Garside et al. 1987]  Garside, Roger; Leech, Geoffrey; and Sampson, Geoffrey. 1987. The computational analysis of English. A corpus-based approach. Longman, London.

    [Hindle 1983]  Hindle, Donald. 1983. User manual for Fidditch, a deterministic parser. Naval Research Laboratory Technical Memorandum 7590-142.

    [Hindle 1989]  Hindle, Donald. 1989. Acquiring disambiguation rules from text. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

    [Lewis et al. 1990]  Lewis, Bil; LaLiberte, Dan; and the GNU Manual Group. 1990. The GNU Emacs Lisp reference manual. Free Software Foundation, Cambridge, MA.

    [Santorini 1990]  Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. Ms., Department of Computer and Information Science, University of Pennsylvania.


    [Santorini and Marcinkiewicz 1991]  Santorini, Beatrice, and Marcinkiewicz, Mary Ann. 1991. Bracketing guidelines for the Penn Treebank Project. Ms., Department of Computer and Information Science, University of Pennsylvania.

    [Veilleux and Ostendorf 1992]  Veilleux, N. M., and Ostendorf, Mari. 1992. Probabilistic parse scoring based on prosodic features. In Proceedings of the Fifth DARPA Speech and Natural Language Workshop, February 1992.

    [Weischedel et al. 1991]  Weischedel, Ralph; Ayuso, Damaris; Bobrow, R.; Boisen, Sean; Ingria, Robert; and Palmucci, Jeff. 1991. Partial parsing: a report of work in progress. In Proceedings of the Fourth DARPA Speech and Natural Language Workshop, February 1991. Draft version.