

    University of Pennsylvania

    ScholarlyCommons

    Departmental Papers (CIS)

    October 1993

    Building a Large Annotated Corpus of English: The Penn Treebank

    Mitchell Marcus, University of Pennsylvania

    Beatrice Santorini, University of Pennsylvania

    Mary Ann Marcinkiewicz, University of Pennsylvania


    Building a Large Annotated Corpus of English: The Penn Treebank

    Abstract

    In this paper, we review our experience with constructing one such large annotated corpus, the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the project (1989-1992), this corpus has been annotated for part-of-speech (POS) information.


    Building A Large Annotated Corpus of English: The Penn Treebank

    MS-CIS-93-87
    LINC LAB 26

    Mitchell P. Marcus
    Beatrice Santorini
    Mary Ann Marcinkiewicz

    University of Pennsylvania
    School of Engineering and Applied Science
    Computer and Information Science Department
    Philadelphia, PA 19104-6389

    October 1993


    Building a large annotated corpus of English: the Penn Treebank

    Mitchell P. Marcus*  Beatrice Santorini†  Mary Ann Marcinkiewicz*

    *Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
    †Department of Linguistics, Northwestern University, Evanston, IL 60208

    1 Introduction

    There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding


    then corrected by human annotators. Section 3 briefly presents the results of a comparison between entirely manual and semi-automated tagging, with the latter being shown to be superior on three counts: speed, consistency, and accuracy. In section 4, we turn to the bracketing task. Just as with the tagging task, we have partially automated the bracketing task: the output of the POS tagging phase is automatically parsed and simplified to yield a skeletal syntactic representation, which is then corrected by human annotators. After presenting the set of syntactic tags that we use, we illustrate and discuss the bracketing process. In particular, we will outline various factors that affect the speed with which annotators are able to correct bracketed structures, a task which, not surprisingly, is considerably more difficult than correcting POS-tagged text. Finally, section 5 describes the composition and size of the current Treebank corpus, briefly reviews some of the research projects that have relied on it to date, and indicates the directions that the project is likely to take in the future.

    2 Part-of-speech tagging

    2.1 A simplified POS tagset for English

    The POS tagsets used to annotate large corpora in the past have traditionally been fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags3 ([Francis 1964], [Francis and Kučera 1982]) and allows the formation of compound tags; thus, the contraction I'm is tagged as PPSS+BEM (PPSS for non-3rd person nominative personal pronoun and BEM for am, 'm). Subsequent projects have tended to elaborate the Brown Corpus tagset.4 For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. The rationale behind developing such large, richly articulated tagsets is to approach the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour ([Garside et al. 1987, 167]).

    2.1.1 Recoverability

    Like the tagsets just mentioned, the Penn Treebank tagset is based on that of the Brown Corpus. However, the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown Corpus tagset by paring it down considerably. A key strategy in reducing the tagset was to eliminate redundancy by taking into account both lexical and syntactic information. Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical redundancy. For instance, the Brown Corpus distinguishes five different forms for main verbs: the base form is tagged VB, and forms with overt endings are indicated by appending D for past tense, G for present participle/gerund, N for past participle and Z for third person singular present. Exactly the same paradigm is recognized for have, but have (regardless of whether it is used as an auxiliary or a main verb) is assigned its own base tag HV. The Brown Corpus further distinguishes three forms

    3 Counting both simple and compound tags, the Brown Corpus tagset contains 187 tags.

    4 A useful overview of the relation of these and other tagsets to each other and to the Brown Corpus tagset is given in Appendix B of [Garside et al. 1987].


    of do (the base form (DO), the past tense (DOD), and the third person singular present (DOZ)) and eight forms of be: the five forms distinguished for regular verbs as well as the irregular forms am (BEM), are (BER) and was (BEDZ). By contrast, since the distinctions between the forms of VB on the one hand and the forms of BE, DO and HV on the other are lexically recoverable, they are eliminated in the Penn Treebank, as shown in Table 1.5

    Table 1: Elimination of lexically recoverable distinctions
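    To make the collapse concrete, here is a minimal sketch in Python; it is our own illustration of the prose above, not a reproduction of Table 1, with am/are mapped to VBP as footnote 6 describes.

    # A sketch of the collapse just described: Brown Corpus tags for the forms
    # of be, do and have map onto the same Penn Treebank tags as the
    # corresponding forms of regular verbs.
    BROWN_TO_PTB = {
        'BE': 'VB',  'BED': 'VBD', 'BEDZ': 'VBD', 'BEG': 'VBG',
        'BEM': 'VBP', 'BEN': 'VBN', 'BER': 'VBP', 'BEZ': 'VBZ',
        'DO': 'VB',  'DOD': 'VBD', 'DOZ': 'VBZ',
        'HV': 'VB',  'HVD': 'VBD', 'HVG': 'VBG', 'HVN': 'VBN', 'HVZ': 'VBZ',
    }

    def collapse(brown_tag):
        """Return the Penn Treebank tag for a Brown Corpus verb tag."""
        return BROWN_TO_PTB.get(brown_tag, brown_tag)  # others pass through

    assert collapse('BEDZ') == 'VBD'   # was -> the ordinary past tense tag

    The distinction is lexically recoverable in the other direction as well: given the word was and the tag VBD, the finer Brown tag BEDZ can be restored mechanically.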

    A second example of lexical recoverability concerns those words that can precede articles in noun phrases. The Brown Corpus assigns a separate tag to pre-qualifiers (quite, rather, such), pre-quantifiers (all, half, many, nary) and both. The Penn Treebank, on the other hand, assigns all of these words to a single category PDT (predeterminer). Further examples of lexically recoverable categories are the Brown Corpus categories PPL (singular reflexive pronoun) and PPLS (plural reflexive pronoun), which we collapse with PRP (personal pronoun), and the Brown Corpus category RN (nominal adverb), which we collapse with RB (adverb).

    Beyond reducing lexically recoverable distinctions, we also eliminated certain POS distinctions that are recoverable with reference to syntactic structure. For instance, the Penn Treebank tagset does not distinguish subject pronouns from object pronouns even in cases where the distinction is not recoverable from the pronoun's form, as with you, since the distinction is recoverable on the basis of the pronoun's position in the parse tree in the parsed version of the corpus. Similarly, the Penn Treebank tagset conflates subordinating conjunctions with prepositions, tagging both categories as IN. The distinction between the two categories is not lost, however, since subordinating conjunctions can be recovered as those instances of IN that precede clauses, whereas prepositions are those instances of IN that precede noun phrases or prepositional phrases. We would like to emphasize that the lexical and syntactic recoverability inherent in the POS-tagged version of the Penn Treebank corpus allows end users to employ a much richer tagset than the small one described in section 2.2 if the need arises.
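    As a concrete illustration of such recovery, the sketch below (ours, not part of the paper) subclassifies IN tokens by the category of their right sibling, using NLTK's bundled 10% sample of the parsed Treebank; the function name refine_in and the refined labels are our own inventions.

    from nltk.corpus import treebank   # NLTK's bundled Treebank sample

    def refine_in(tree):
        """Yield (word, refined tag) for IN preterminals: a following clause
        signals a subordinating conjunction, a following NP or PP a preposition."""
        for i, child in enumerate(tree):
            if isinstance(child, str):          # bare leaf, nothing to refine
                continue
            if child.label() == 'IN' and i + 1 < len(tree):
                sibling = tree[i + 1]
                if not isinstance(sibling, str):
                    if sibling.label().startswith('S'):
                        yield child[0], 'IN-subordinating-conjunction'
                    elif sibling.label().startswith(('NP', 'PP')):
                        yield child[0], 'IN-preposition'
            yield from refine_in(child)         # recurse into subconstituents

    for sent in treebank.parsed_sents()[:10]:
        print(list(refine_in(sent)))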

    2.1.2 Consistency

    As noted above, one reason for eliminating a POS tag such as RN (nominal adverb) is its lexical recoverability. Another important reason for doing so is consistency. For instance, in the Brown

    5 The gerund and the participle of do are tagged VBG and VBN in the Brown Corpus, respectively, presumably because they are never used as auxiliary verbs.

    6 The irregular present tense forms am and are are tagged as VBP in the Penn Treebank (see section 2.1.3), just like any other non-third person singular present tense form.


    Corpus, the deictic adverbs there and now are always tagged RB (adverb), whereas their counterparts here and then are inconsistently tagged as RB (adverb) or RN (nominal adverb), even in identical syntactic contexts, such as after a preposition. It is clear that reducing the size of the tagset reduces the chances of such tagging inconsistencies.

    2.1.3 Syntactic function

    A further difference between the Penn Treebank and the Brown Corpus concerns the significance accorded to syntactic context. In the Brown Corpus, words tend to be tagged independently of their syntactic function.7 For instance, in the phrase the one, one is always tagged as CD (cardinal number), whereas in the corresponding plural phrase the ones, ones is always tagged as NNS (plural common noun), despite the parallel function of one and ones as heads of their noun phrase. By contrast, since one of the main roles of the tagged version of the Penn Treebank corpus is to serve as the basis for a bracketed version of the corpus, we encode a word's syntactic function in its POS tag whenever possible. Thus, one is tagged as NN (singular common noun) rather than as CD (cardinal number) when it is the head of a noun phrase. Similarly, while the Brown Corpus tags both as ABX (pre-quantifier, double conjunction) regardless of whether it functions as a prenominal modifier (both the boys), a postnominal modifier (the boys both), the head of a noun phrase (both of the boys) or part of a complex coordinating conjunction (both boys and girls), the Penn Treebank tags both differently in each of these syntactic contexts: as PDT (predeterminer), RB (adverb), NNS (plural common noun) and coordinating conjunction (CC), respectively.

    There is one case in which our concern with tagging by syntactic function has led us to bifurcate Brown Corpus categories rather than to collapse them: namely, in the case of the uninflected form of verbs. Whereas the Brown Corpus tags the bare form of a verb as VB regardless of whether it occurs in a tensed clause, the Penn Treebank tagset distinguishes VB (infinitive or imperative) from VBP (non-third person singular present tense).

    2.1.4 Indeterminacy

    A final difference between the Penn Treebank tagset and all other tagsets we are aware of concerns the issue of indeterminacy: both POS ambiguity in the text and annotator uncertainty. In many cases, POS ambiguity can be resolved with reference to the linguistic context. So, for instance, in Katharine Hepburn's witty line Grant can be outspoken, but not by anyone I know, the presence of the by-phrase forces us to consider outspoken as the past participle of a transitive derivative of speak (outspeak) rather than as the adjective outspoken. However, even given explicit criteria for assigning POS tags to potentially ambiguous words, it is not always possible to assign a unique tag to a word with confidence. Since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag. Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one. In principle, annotators can

    7 An important exception is there, which the Brown Corpus tags as EX (existential there) when it is used as a formal subject and as RB (adverb) when it is used as a locative adverb. In the case of there, we did not pursue our strategy of tagset reduction to its logical conclusion, which would have implied tagging existential there as NN (common noun).


    tag a word with any number of tags, but in practice, multiple tags are restricted to a small number of recurring two-tag combinations: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle).
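    For users processing the tagged files, splitting such a token is trivial; the helper below is our own sketch and assumes the vertical bar shown above separates the alternative tags.

    def parse_token(token):
        """Split a tagged token: 'executive/JJ|NN' -> ('executive', ['JJ', 'NN'])."""
        word, _, tags = token.rpartition('/')
        return word, tags.split('|')

    assert parse_token('executive/JJ|NN') == ('executive', ['JJ', 'NN'])
    assert parse_token('the/DT') == ('the', ['DT'])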

    2.2 The POS tagset

    The Penn Treebank tagset is given in Table 2. It contains 36 POS tags and 12 other tags (for punctuation and currency symbols). A detailed description of the guidelines governing the use of the tagset is available in [Santorini 1990].8

    Table 2: The Penn Treebank POS tagset

     1. CC    Coordinating conjunction                25. TO    to
     2. CD    Cardinal number                         26. UH    Interjection
     3. DT    Determiner                              27. VB    Verb, base form
     4. EX    Existential there                       28. VBD   Verb, past tense
     5. FW    Foreign word                            29. VBG   Verb, gerund/present participle
     6. IN    Preposition/subord. conjunction         30. VBN   Verb, past participle
     7. JJ    Adjective                               31. VBP   Verb, non-3rd ps. sing. present
     8. JJR   Adjective, comparative                  32. VBZ   Verb, 3rd ps. sing. present
     9. JJS   Adjective, superlative                  33. WDT   wh-determiner
    10. LS    List item marker                        34. WP    wh-pronoun
    11. MD    Modal                                   35. WP$   Possessive wh-pronoun
    12. NN    Noun, singular or mass                  36. WRB   wh-adverb
    13. NNS   Noun, plural                            37. #     Pound sign
    14. NNP   Proper noun, singular                   38. $     Dollar sign
    15. NNPS  Proper noun, plural                     39. .     Sentence-final punctuation
    16. PDT   Predeterminer                           40. ,     Comma
    17. POS   Possessive ending                       41. :     Colon, semi-colon
    18. PRP   Personal pronoun                        42. (     Left bracket character
    19. PP$   Possessive pronoun                      43. )     Right bracket character
    20. RB    Adverb                                  44. "     Straight double quote
    21. RBR   Adverb, comparative                     45. `     Left open single quote
    22. RBS   Adverb, superlative                     46. ``    Left open double quote
    23. RP    Particle                                47. '     Right close single quote
    24. SYM   Symbol (mathematical or scientific)     48. ''    Right close double quote

    8 In versions of the tagged corpus distributed before November 1992, singular proper nouns, plural proper nouns and personal pronouns were tagged as NP, NPS and PP, respectively. The current tags NNP, NNPS and PRP were introduced in order to avoid confusion with the syntactic tags NP (noun phrase) and PP (prepositional phrase) (see Table 3).


    2.3 The POS tagging process

    The tagged version of the Penn Treebank corpus is produced in two stages, using a combination of automatic POS assignment and manual correction.

    2.3.1 Automated stage

    During the early stages of the Penn Treebank project, the initial automatic POS assignment was provided by PARTS ([Church 1988]), a stochastic algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The output of PARTS was automatically tokenized9 and the tags assigned by PARTS were automatically mapped onto the Penn Treebank tagset. This mapping introduces about 4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS tagset does not.10 A sample of the resulting tagged text, which has an error rate of 7-9%, is shown in Figure 1.

    Figure 1: Sample tagged text, before correction

    More recently, the automatic POS assignments are provided by a cascade of stochastic and rule-driven taggers developed on the basis of our early experience. Since these taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error rates of 2-6%.
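    The tokenization convention mentioned above (and spelled out in footnote 9: contractions and genitives are split into separately tagged morphemes) can be sketched as follows; the clitic list and regular expression are our own simplification, not the project's actual tokenizer.

    import re

    CLITICS = re.compile(r"(n't|'s|'re|'ve|'ll|'d|'m)$")

    def split_clitics(word):
        """Split a clitic off the end of a word, leaving the host intact."""
        m = CLITICS.search(word)
        if m and m.start() > 0:
            return [word[:m.start()], m.group(1)]
        return [word]

    assert split_clitics("children's") == ['children', "'s"]   # children/NNS 's/POS
    assert split_clitics("won't") == ['wo', "n't"]             # wo/MD n't/RB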

    2.3.2 Manual correction stage

    The result of the first, automated stage of POS tagging is given to annotators to correct. The annotators use a mouse-based package written in GNU Emacs Lisp, which is embedded within the

    9 In contrast to the Brown Corpus, we do not allow compound tags of the sort illustrated above for I'm. Rather, contractions and the Anglo-Saxon genitive of nouns are automatically split into their component morphemes, and each morpheme is tagged separately. Thus, children's is tagged children/NNS 's/POS and won't is tagged wo/MD n't/RB.

    10 The two largest sources of mapping error are that the PARTS tagset distinguishes neither infinitives from the non-third person singular present tense forms of verbs, nor prepositions from particles in cases like run up the hill and run up the bill.


    GNU Emacs editor ([Lewis et al. 1990]). The package allows annotators to correct POS assignment errors by positioning the cursor on an incorrectly tagged word and then entering the desired correct tag (or sequence of multiple tags). The annotators' input is automatically checked against the list of legal tags in Table 2 and, if valid, appended to the original word-tag pair, separated by an asterisk. Appending the new tag rather than replacing the old tag allows us to easily identify recurring errors at the automatic POS assignment stage. We believe that the confusion matrices that can be extracted from this information should also prove useful in designing better automatic taggers in the future. The result of this second stage of POS tagging is shown in Figure 2. Finally, in the distribution version of the tagged corpus, any incorrect tags assigned at the first, automatic stage are removed.

    Figure 2: Sample tagged text, after correction
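    The asterisk convention makes such error analysis mechanical. Assuming corrected tokens take the form word/AUTOTAG*CORRECTEDTAG (our reading of the description above), a confusion matrix over the automatic assignments falls out in a few lines:

    from collections import Counter

    def confusion_matrix(tokens):
        """Count (automatic tag, corrected tag) pairs over corrected tokens."""
        matrix = Counter()
        for token in tokens:
            word, _, tags = token.rpartition('/')
            if '*' in tags:
                auto, corrected = tags.split('*', 1)
                matrix[auto, corrected] += 1
        return matrix

    sample = ['the/DT', 'flies/VBZ*NNS', 'bite/VBP']
    print(confusion_matrix(sample))   # Counter({('VBZ', 'NNS'): 1})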

    The learning curve for the POS tagging task takes under a month (at 15 hours a week), and annotation speeds after a month exceed 3,000 words per hour.

    3 Two modes of annotation: an experiment

    To determine how to maximize the speed, inter-annotator consistency and accuracy of POS tagging, we performed an experiment at the very beginning of the project to compare two alternative modes of annotation. In the first annotation mode ("tagging"), annotators tagged unannotated text entirely by hand; in the second mode ("correcting"), they verified and corrected the output of PARTS, modified as described above. This experiment showed that manual tagging took about twice as long as correcting, with about twice the inter-annotator disagreement rate and an error rate that was about 50% higher.

    Four annotators, all with graduate training in linguistics, participated in the experiment. All completed a training sequence consisting of fifteen hours of correcting, followed by six hours of tagging. The training material was selected from a variety of nonfiction genres in the Brown Corpus. All the annotators were familiar with GNU Emacs at the outset of the experiment. Eight 2,000-word samples were selected from the Brown Corpus, two each from four different genres (two fiction, two nonfiction), none of which the annotators had encountered in training. The texts for the correction task were automatically tagged as described in section 2.3. Each annotator first


    manually tagged four texts and then corrected four automatically tagged texts. Each annotator completed the four genres in a different permutation.

    A repeated measures analysis of annotation speed with annotator identity, genre and annotation mode (tagging vs. correcting) as classification variables showed a significant annotation mode effect (p < .05). No other effects or interactions were significant. The average speed for correcting was more than twice as fast as the average speed for tagging: 20 minutes vs. 44 minutes per 1,000 words. (Median speeds per 1,000 words were 22 vs. 42 minutes.)

    A simple measure of tagging consistency is inter-annotator disagreement rate, the rate at which annotators disagree with one another over the tagging of lexical tokens, expressed as a percentage of the raw number of such disagreements over the number of words in a given text sample. For a given text and n annotators, there are n(n-1)/2 disagreement ratios (one for each possible pair of annotators). Mean inter-annotator disagreement was 7.2% for the tagging task and 4.1% for the correcting task (with medians 7.2% and 3.6%, respectively). Upon examination, a disproportionate amount of disagreement in the correcting case was found to be caused by one text that contained many instances of a cover symbol for chemical and other formulas. In the absence of an explicit guideline for tagging this case, the annotators had made arbitrary decisions.
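    The pairwise disagreement measure defined above is straightforward to compute; a minimal sketch over aligned tag sequences, one per annotator:

    from itertools import combinations

    def disagreement_rates(annotations):
        """Return the n(n-1)/2 pairwise disagreement ratios for n annotators,
        each given as a tag sequence over the same tokens."""
        rates = {}
        for (i, a), (j, b) in combinations(enumerate(annotations), 2):
            rates[i, j] = sum(x != y for x, y in zip(a, b)) / len(a)
        return rates

    tags = [['DT', 'NN', 'VBZ'],
            ['DT', 'NN', 'NNS'],
            ['DT', 'JJ', 'VBZ']]
    print(disagreement_rates(tags))   # {(0, 1): 0.33.., (0, 2): 0.33.., (1, 2): 0.66..}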


     

    4 Bracketing

    4.1 Basic methodology

    The methodology for bracketing the corpus is completely parallel to that for tagging: hand correction of the output of an errorful automatic process. Fidditch, a deterministic parser developed by Donald Hindle first at the University of Pennsylvania and subsequently at AT&T Bell Labs ([Hindle 1983], [Hindle 1989]), is used to provide an initial parse of the material. Annotators then hand correct the parser's output using a mouse-based interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it ideally suited to serve as a preprocessor to hand correction:

    • Fidditch always provides exactly one analysis for any given sentence, so that annotators need not search through multiple analyses.

    • Fidditch never attaches any constituent whose role in the larger structure it cannot determine with certainty. In cases of uncertainty, Fidditch chunks the input into a string of trees, providing only a partial structure for each sentence.

    • Fidditch has rather good grammatical coverage, so that the grammatical chunks that it does build are usually quite accurate.

    Because of these properties, annotators do not need to rebracket much of the parser's output, a relatively time-consuming task. Rather, the annotators' main task is to glue together the syntactic chunks produced by the parser. Using a mouse-based interface, annotators move each unattached chunk of structure under the node to which it should be attached. Notational devices allow annotators to indicate uncertainty concerning constituent labels, and to indicate multiple attachment sites for ambiguous modifiers. The bracketing process is described in more detail in section 4.3.

    4.2 The syntactic tagset

    Table 3 shows the set of syntactic tags and null elements that we use in our skeletal bracketing. More detailed information on the syntactic tagset and guidelines concerning its use are to be found in [Santorini and Marcinkiewicz 1991].


    Table 3: The Penn Treebank syntactic tagset

    Tags

    ADJP    Adjective phrase
    ADVP    Adverb phrase
    NP      Noun phrase
    PP      Prepositional phrase
    S       Simple declarative clause
    SBAR    Clause introduced by subordinating conjunction or 0 (see below)
    SBARQ   Direct question introduced by wh-word or wh-phrase
    SINV    Declarative sentence with subject-aux inversion
    SQ      Subconstituent of SBARQ excluding wh-word or wh-phrase
    VP      Verb phrase
    WHADVP  Wh-adverb phrase
    WHNP    Wh-noun phrase
    WHPP    Wh-prepositional phrase
    X       Constituent of unknown or uncertain category

    Null elements

    *       Understood subject of infinitive or imperative
    0       Zero variant of that in subordinate clauses
    T       Trace; marks position where moved wh-constituent is interpreted
    NIL     Marks position where preposition is interpreted in pied-piping contexts

    Although different in detail, our tagset is similar in delicacy to that used by the Lancaster Treebank Project, except that we allow null elements in the syntactic annotation. Because of the need to achieve a fairly high output per hour, it was decided not to require annotators to create distinctions beyond those provided by the parser. Our approach to developing the syntactic tagset was highly pragmatic and strongly influenced by the need to create a large body of annotated material given limited human resources. Despite the skeletal nature of the bracketing, however, it is possible to make quite delicate distinctions when using the corpus by searching for combinations of structures. For example, an SBAR containing the word to immediately before the VP will necessarily be infinitival, while an SBAR containing a verb or auxiliary with a tense feature will necessarily be tensed. To take another example, so-called that-clauses can be identified easily by searching for SBARs containing the word that or the null element 0 in initial position.
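    Such searches are easy to script. The sketch below (ours, not the paper's) finds that-clauses in NLTK's bundled Treebank sample; note that the sample's null-element notation differs slightly from the 0 used in this paper.

    from nltk.corpus import treebank

    def that_clauses(tree):
        """Yield SBAR subtrees whose initial word is 'that' (or the paper's
        null complementizer, written 0)."""
        for sub in tree.subtrees(lambda t: t.label() == 'SBAR'):
            leaves = sub.leaves()
            if leaves and leaves[0] in ('that', '0'):
                yield sub

    for sent in treebank.parsed_sents()[:100]:
        for clause in that_clauses(sent):
            print(' '.join(clause.leaves()))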

    As can be seen from Table 3, the syntactic tagset used by the Penn Treebank includes a variety of null elements, a subset of the null elements introduced by Fidditch. While it would be expensive to insert null elements entirely by hand, it has not proved overly onerous to maintain and correct those that are automatically provided. We have chosen to retain these null elements because we believe that they can be exploited in many cases to establish a sentence's predicate-argument structure;


    at least one recipient of the parsed corpus has used it to bootstrap the development of lexicons for particular NLP projects and has found the presence of null elements to be a considerable aid in determining verb transitivity (Robert Ingria, personal communication). While these null elements correspond more directly to entities in some grammatical theories than in others, it is not our intention to lean toward one or another theoretical view in producing our corpus. Rather, since the representational framework for grammatical structure in the Treebank is a relatively impoverished flat context-free notation, the easiest mechanism to include information about predicate-argument structure, although indirectly, is by allowing the parse tree to contain explicit null items.

    4.3 Sample bracketing output

    Below, we illustrate the bracketing process for the first sentence of our sample text. Figure 3 shows the output of Fidditch (modified slightly to include our POS tags).


    Figure 3: Sample bracketed text; full structure provided by Fidditch

    (S (NP (NBAR (ADJP (ADJ Battle-tested/JJ)
                       (ADJ industrial/JJ))
                 (NPL managers/NNS)))
       (? (ADV here/RB))
       (? (ADV always/RB))
       (AUX (TNS *))
       (VP (VPRES buck/VBP))
       (? (PP (PREP up/RP)
              (NP (NBAR (ADJ nervous/JJ)
                        (NPL newcomers/NNS)))))
       (? (PP (PREP with/IN)
              (NP (DART the/DT)
                  (NBAR (N tale/NN))
                  (PP (PREP of/IN)
                      (NP (DART the/DT)
                          (NBAR (ADJP (ADJ first/JJ))))))))
       (? (PP (PREP of/IN)
              (NP (PROS their/PP$)
                  (NBAR (NPL countrymen/NNS)))))
       (? (S (NP (PRO *))
             (AUX to/TNS)
             (VP (V visit/VB)
                 (NP (PNP Mexico/NNP)))))
       (? (MID ,/,))
       (? (NP (IART a/DT)
              (NBAR (N boatload/NN))
              (PP (PREP of/IN)
                  (NP (NBAR (NPL warriors/NNS))))
              (VP (VPPRT blown/VBN)
                  (? (ADV ashore/RB))
                  (NP (NBAR (CARD 375/CD)
                            (NPL years/NNS))))))
       (? (ADV ago/RB))
       (? (FIN ./.)))

    As Figure 3 shows, Fidditch leaves very many constituents unattached, labeling them as ?, and its output is perhaps better thought of as a string of tree fragments than as a single tree structure.


    Fidditch only builds structure when this is possible for a purely syntactic parser without access to semantic or pragmatic information, and it always errs on the side of caution. Since determining the correct attachment point of prepositional phrases, relative clauses, and adverbial modifiers almost always requires extrasyntactic information, Fidditch pursues the very conservative strategy of always leaving such constituents unattached, even if only one attachment point is syntactically possible. However, Fidditch does indicate its best guess concerning a fragment's attachment site by the fragment's depth of embedding. Moreover, it attaches prepositional phrases beginning with of if the preposition immediately follows a noun; thus, tale of and boatload of are parsed as single constituents, while first of is not. Since Fidditch lacks a large verb lexicon, it cannot decide whether some constituents serve as adjuncts or arguments and hence leaves subordinate clauses such as infinitives as separate fragments. Note further that Fidditch creates adjective phrases only when it determines that more than one lexical item belongs in the ADJP. Finally, as is well known, the scope of conjunctions and other coordinate structures can only be determined given the richest forms of contextual information; here again, Fidditch simply turns out a string of tree fragments around any conjunction. Because all decisions within Fidditch are made locally, all commas (which often signal conjunction) must disrupt the input into separate chunks.

    The original design of the Treebank called for a level of syntactic analysis comparable to the skeletal analysis used by the Lancaster Treebank, but a limited experiment was performed early in the project to investigate the feasibility of providing greater levels of structural detail. While the results were somewhat unclear, there was evidence that annotators could maintain a much faster rate of hand correction if the parser output was simplified in various ways, reducing the visual complexity of the tree representations and eliminating a range of minor decisions. The key results of this experiment were:

    • Annotators take substantially longer to learn the bracketing task than the POS tagging task, with substantial increases in speed occurring even after two months of training.

    • Annotators can correct the full structure provided by Fidditch at an average speed of approx. 375 words per hour after three weeks, and 475 words per hour after six weeks.

    • Reducing the output from the full structure shown in Figure 3 to a more skeletal representation similar to that used by the Lancaster UCREL Treebank Project increases annotator productivity by approx. 100-200 words per hour.

    • It proved to be very difficult for annotators to distinguish between a verb's arguments and adjuncts in all cases. Allowing annotators to ignore this distinction when it is unclear (attaching constituents high) increases productivity by approx. 150-200 words per hour. Informal examination of later annotation showed that forced distinctions cannot be made consistently.

    As a result of this experiment, the originally proposed skeletal representation was adopted, without a forced distinction between arguments and adjuncts. Even after extended training, performance varies markedly by annotator, with speeds on the task of correcting skeletal structure without requiring a distinction between arguments and adjuncts ranging from approx. 750 words per hour to well over 1,000 words per hour after three or four months experience. The fastest annotators work in bursts of well over 1,500 words per hour alternating with brief rests. At an


    average rate of 750 words per hour, a team of five part-time annotators annotating three hours a day should maintain an output of about 2.5 million words a year of treebanked sentences, with each sentence corrected once.
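    (The arithmetic, assuming roughly 225 working days a year: 750 words/hour x 3 hours/day x 5 annotators = 11,250 words a day, or about 2.5 million words a year.)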

    It is worth noting that experienced annotators can proofread previously corrected material at very high speeds. A parsed subcorpus of over one million words was recently proofread at an average speed of approx. 4,000 words per annotator per hour. At this rate of productivity, annotators are able to find and correct gross errors in parsing, but do not have time to check, for example, whether they agree with all prepositional phrase attachments.

    The process that creates the skeletal representations to be corrected by the annotators simplifies and flattens the structures shown in Figure 3 by removing POS tags, non-branching lexical nodes and certain phrasal nodes, notably NBAR. The output of the first automated stage of the bracketing task is shown in Figure 4.
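    A rough sketch of this flattening step (our approximation, not the actual Fidditch postprocessor): cut preterminals down to their bare words and splice out nodes such as NBAR.

    from nltk import Tree

    REMOVE = {'NBAR'}   # phrasal nodes to splice out; the real list is longer

    def simplify(tree):
        """Return the flattened children of `tree`: POS tags dropped,
        nodes in REMOVE spliced out."""
        children = []
        for child in tree:
            if not isinstance(child, Tree):
                children.append(child)
            elif child.height() == 2:                  # preterminal: bare word
                children.append(child[0].split('/')[0])
            elif child.label() in REMOVE:
                children.extend(simplify(child))       # splice the node out
            else:
                children.append(Tree(child.label(), simplify(child)))
        return children

    t = Tree.fromstring('(NP (NBAR (ADJ nervous/JJ) (NPL newcomers/NNS)))')
    print(Tree('NP', simplify(t)))   # (NP nervous newcomers)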


    Figure 4: Sample bracketed text; after simplification, before correction

    (S (NP (ADJP Battle-tested industrial)
           managers)
       (? here)
       (? always)
       (VP buck)
       (? (PP up
              (NP nervous newcomers)))
       (? (PP with
              (NP the tale
                  (PP of
                      (NP the
                          (ADJP first))))))
       (? (PP of
              (NP their countrymen)))
       (? (S (NP *)
             to
             (VP visit
                 (NP Mexico))))
       (? ,)
       (? (NP a boatload
              (PP of
                  (NP warriors))
              (VP blown
                  (? ashore)
                  (NP 375 years))))
       (? ago)
       (? .))

    Annotators correct this simplified structure using a mouse-based interface. Their primary job is to glue fragments together, but they must also correct incorrect parses and delete some structure. Single mouse clicks perform the following tasks, among others. The interface correctly reindents the structure whenever necessary.

    • Attach constituents labeled ?. This is done by pressing down the appropriate mouse button on or immediately after the ?, moving the mouse onto or immediately after the label of the intended parent and releasing the mouse. Attaching constituents automatically deletes their ? label.

    • Promote a constituent up one level of structure, making it a sibling of its current parent.


    • Delete a pair of constituent brackets.

    • Create a pair of brackets around a constituent. This is done by typing a constituent tag and then sweeping out the intended constituent with the mouse. The tag is checked to assure that it is a legal label.

    • Change the label of a constituent. The new tag is checked to assure that it is legal.

    The bracketed text after correction is shown in Figure 5. The fragments are now connected together into one rooted tree structure. The result is a skeletal analysis in that much syntactic detail is left unannotated. Most prominently, all internal structure of the NP up through the head and including any single-word post-head modifiers is left unannotated.

    Figure 5: Sample bracketed text; after correction

    (S (NP Battle-tested industrial managers
           here)
       always
       (VP buck
           up
           (NP nervous newcomers)
           (PP with
               (NP the tale
                   (PP of
                       (NP (NP the
                               (ADJP first)
                               (PP of
                                   (NP their countrymen))
                               (S (NP *)
                                  to
                                  (VP visit
                                      (NP Mexico))))
                           (NP (NP a boatload
                                   (PP of
                                       (NP (NP warriors)
                                           (VP-1 blown
                                                 ashore
                                                 (ADVP (NP 375 years)
                                                       ago)))))
                               (VP-1 *pseudo-attach*))))))))


    As noted above in connection with POS tagging, a major goal of the Treebank project is to allow annotators only to indicate structure of which they are certain. The Treebank provides two notational devices to ensure this goal: the X constituent label and so-called pseudo-attachment. The X constituent label is used if an annotator is sure that a sequence of words is a major constituent but is unsure of its syntactic category; in such cases, the annotator simply brackets the sequence and labels it X. The second notational device, pseudo-attachment, has two primary uses. On the one hand, it is used to annotate what Kay has called permanent predictable ambiguities, allowing an annotator to indicate that a structure is globally ambiguous even given the surrounding context (annotators always assign structure to a sentence on the basis of its context). An example of this use of pseudo-attachment is shown in Figure 5, where the participial phrase blown ashore 375 years ago modifies either warriors or boatload, but there is no way of settling the question: both attachments mean exactly the same thing. In the case at hand, the pseudo-attachment notation indicates that the annotator of the sentence thought that VP-1 is most likely a modifier of warriors, but that it is also possible that it is a modifier of boatload.12 A second use of pseudo-attachment is to allow annotators to represent the underlying position of extraposed elements: in addition to being attached in its superficial position in the tree, the extraposed constituent is pseudo-attached within the constituent to which it is semantically related. Note that except for the device of pseudo-attachment, the skeletal analysis of the Treebank is entirely restricted to simple context-free trees.
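    In bracketed files of this form, pseudo-attachments are easy to locate mechanically; a small sketch of ours over NLTK-style trees, following the index-sharing convention of Figure 5:

    from nltk import Tree

    def pseudo_attachments(tree):
        """Yield the labels (e.g. 'VP-1') of pseudo-attached constituents,
        i.e. nodes whose only leaf is *pseudo-attach*."""
        for sub in tree.subtrees():
            if sub.leaves() == ['*pseudo-attach*']:
                yield sub.label()

    t = Tree.fromstring('(NP (NP a boatload) (VP-1 *pseudo-attach*))')
    print(list(pseudo_attachments(t)))   # ['VP-1']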

    The reader may have noticed that the ADJP brackets in Figure 4 have vanished in Figure 5. For the sake of the overall efficiency of the annotation task, we leave all ADJP brackets in the simplified structure, with the annotators expected to remove many of them during annotation. The reason for this is somewhat complex, but provides a good example of the considerations that come into play in designing the details of annotation methods. The first relevant fact is that Fidditch only outputs ADJP brackets within NPs for adjective phrases containing more than one lexical item. To be consistent, the final structure must contain ADJP nodes for all adjective phrases within NPs or for none; we have chosen to delete all such nodes within NPs under normal circumstances. (This does not affect the use of the ADJP tag for predicative adjective phrases outside of NPs.) In a seemingly unrelated guideline, all coordinate structures are annotated in the Treebank; such coordinate structures are represented by Chomsky-adjunction when the two conjoined constituents bear the same label. This means that if an NP contains coordinated adjective phrases, then an ADJP tag will be used to tag that coordination even though simple ADJPs within NPs will not bear an ADJP tag. Experience has shown that annotators can delete pairs of brackets extremely quickly using the mouse-based tools, whereas creating brackets is a much slower operation. Because the coordination of adjectives is quite common, it is more efficient to leave in ADJP labels, and delete them if they are not part of a coordinate structure, than to reintroduce them if necessary.

    5 Progress to date

    5.1 Composition and size of corpus

    Table 4 shows the output of the Penn Treebank project at the end of its first phase.

    12 This use of pseudo-attachment is identical to its original use in Church's parser ([Church 1980]).


    Table 4: Penn Treebank (as of 11/92)

    Description                       Tagged for        Skeletal
                                      Part of Speech    Parsing
                                      (Tokens)          (Tokens)

    Dept. of Energy abstracts
    Dow Jones Newswire stories
    Dept. of Agriculture bulletins
    Library of America texts
    MUC-3 messages
    IBM Manual sentences
    WBUR radio transcripts
    ATIS sentences
    Brown Corpus, retagged

    Total:                            4,885,781         2,681,118

    All the materials listed above are available on CD-ROM to members of the Linguistic Data Consortium.13 About a million words of POS-tagged material and a small sampling of skeletally parsed text are available as part of the first Association for Computational Linguistics/Data Collection Initiative CD-ROM, and a somewhat larger subset of materials is available on cartridge tape directly from the Penn Treebank Project. For information, contact the first author of this paper or send email to treebank@unagi.cis.upenn.edu.

    • The Dept. of Energy abstracts are scientific abstracts from a variety of disciplines.

    • All of the skeletally parsed Dow Jones Newswire materials are also available as digitally recorded read speech as part of the DARPA WSJ-CSR corpus, available through the Linguistic Data Consortium.

    • The Dept. of Agriculture materials include short bulletins such as when to plant various flowers and how to can various vegetables and fruits.

    • The Library of America texts are 5,000-10,000 word passages, mainly book chapters, from a variety of American authors including Mark Twain, Henry Adams, Willa Cather, Herman Melville, W.E.B. Dubois, and Ralph Waldo Emerson.

    • The MUC-3 texts are all news stories from the Federal News Service about terrorist activities in South America. Some of these texts are translations of Spanish news stories or transcripts of radio broadcasts. They are taken from training materials for the 3rd Message Understanding Conference.

    13 Contact the Linguistic Data Consortium, 441 Williams Hall, University of Pennsylvania, Philadelphia, PA 19104-6305, or send email to ldc@unagi.cis.upenn.edu for more information.


    • The Brown Corpus materials were completely retagged in the Penn Treebank project, starting from the untagged version of the Brown Corpus ([Francis 1964]).

    • The IBM sentences are taken from IBM computer manuals; they are chosen to contain a vocabulary of 3,000 words, and are limited in length.

    • The ATIS sentences are transcribed versions of spontaneous sentences collected as training materials for the DARPA Air Travel Information System project.

    The entire corpus has been tagged for POS information, at an estimated error rate of approx. 3%. The POS-tagged version of the Library of America texts and the Department of Agriculture bulletins have been corrected twice (each by a different annotator), and the corrected files were then carefully adjudicated; we estimate the error rate of the adjudicated version at well under 1%. Using a version of PARTS retrained on the entire preliminary corpus and adjudicating between the output of the retrained version and the preliminary version of the corpus, we plan to reduce the error rate of the final version of the corpus to approx. 1%. All the skeletally parsed materials have been corrected once, except for the Brown materials, which have been quickly proofread.

    The skeletal representation currently employed has a variety of limitations. Many otherwise clear argument/adjunct relations in the corpus are not indicated due to the current Treebank's essentially context-free representation. For example, there is at present no satisfactory representation for sentences in which complement noun phrases or clauses occur after a sentential-level adverb. Either the adverb is trapped within the VP, so that the complement can occur within the VP where it belongs, or else the adverb is attached to the S, closing off the VP and forcing the complement to attach to the S. This "trapping" problem serves as a limitation for groups that currently use Treebank material to semiautomatically derive lexicons for particular applications. For most of these problems, however, solutions are possible on the basis of mechanisms already used by the Treebank Project. For example, the pseudo-attachment notation


    can be extended to indicate a variety of crossing dependencies. We have recently begun to use this mechanism to represent various kinds of dislocations, and the Treebank annotators themselves have developed a detailed proposal to extend pseudo-attachment to a wide range of similar phenomena.

    A variety of inconsistencies in the annotation scheme used within the Treebank have also become apparent with time. The annotation schemes for some syntactic categories should be unified to allow a consistent approach to determining predicate-argument structure. To take a very simple example, sentential adverbs attach under VP when they occur between auxiliaries and predicative ADJPs, but attach under S when they occur between auxiliaries and VPs. These structures need to be regularized.

    As the current Treebank has been exploited by a variety of users, a significant number have expressed a need for forms of annotation richer than provided by the project's first phase. Some users would like a less skeletal form of annotation of surface grammatical structure, expanding the essentially context-free analysis of the current Penn Treebank to indicate a wide variety of non-contiguous structures and dependencies. A wide range of Treebank users now strongly desire a level of annotation which makes explicit some form of predicate-argument structure. The desired level of representation would make explicit the logical subject and logical object of the verb, and would indicate, at least in clear cases, which subconstituents serve as arguments of the underlying predicates and which serve as modifiers.

    During the next phase of the Treebank project, we expect both to provide a richer analysis of the existing corpus and to provide a parallel corpus of predicate-argument structures. This will be done by first enriching the annotation of the current corpus, and then automatically extracting predicate-argument structure, at the level of distinguishing logical subjects and objects, and distinguishing arguments from adjuncts for clear cases. Enrichment will be achieved by automatically transforming the current Penn Treebank into a level of structure close to the intended target, and then completing the conversion by hand.
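    As a very rough indication of what such extraction might look like, the sketch below (entirely ours, not the project's planned algorithm) pulls a logical subject and object out of a simple declarative clause:

    from nltk import Tree

    def subject_object(s):
        """For a simple declarative clause, take the NP daughter of S as the
        logical subject and the first NP daughter of its VP as the object."""
        def first(tree, label):
            return next((c for c in tree
                         if isinstance(c, Tree) and c.label() == label), None)
        subj, vp = first(s, 'NP'), first(s, 'VP')
        obj = first(vp, 'NP') if vp is not None else None
        words = lambda t: ' '.join(t.leaves()) if t is not None else None
        return words(subj), words(obj)

    s = Tree.fromstring('(S (NP the annotators) (VP corrected (NP the parse)))')
    print(subject_object(s))   # ('the annotators', 'the parse')

    Real extraction would of course have to handle passives, extraposition and the null elements discussed in section 4.2, which is exactly why the project plans to enrich the annotation first.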

    References

    [Brill 1991]  Brill, Eric. 1991. Discovering the lexical features of a language. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.

    [Brill et al. 1990]  Brill, Eric; Magerman, David; Marcus, Mitchell P.; and Santorini, Beatrice. 1990. Deducing linguistic structure from the statistics of large corpora. In Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pages 275-282.

    [Church 1980]  Church, Kenneth W. 1980. On memory limitations in natural language processing. MIT LCS Technical Report 245. Master's thesis, Massachusetts Institute of Technology.

    [Church 1988]  Church, Kenneth W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of


    the Second Conference on Applied Natural Language Processing, 26th Annual Meeting of the Association for Computational Linguistics, pages 136-143.

    [Francis 1964]  Francis, W. Nelson. 1964. A standard sample of present-day English for use with digital computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007. Brown University, Providence.

    [Francis and Kučera 1982]  Francis, W. Nelson, and Kučera, Henry. 1982. Frequency analysis of English usage. Lexicon and grammar. Houghton Mifflin, Boston.

    [Garside et al. 1987]  Garside, Roger; Leech, Geoffrey; and Sampson, Geoffrey. 1987. The computational analysis of English. A corpus-based approach. Longman, London.

    [Hindle 1983]  Hindle, Donald. 1983. User manual for Fidditch, a deterministic parser. Naval Research Laboratory Technical Memorandum 7590-142.

    [Hindle 1989]  Hindle, Donald. 1989. Acquiring disambiguation rules from text. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

    [Lewis et al. 1990]  Lewis, Bil; LaLiberte, Dan; and the GNU Manual Group. 1990. The GNU Emacs Lisp reference manual. Free Software Foundation, Cambridge, MA.

    [Santorini 1990]  Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. Ms., Department of Computer and Information Science, University of Pennsylvania.


    [Santorini and Marcinkiewicz 1991]  Santorini, Beatrice, and Marcinkiewicz, Mary Ann. 1991. Bracketing guidelines for the Penn Treebank Project. Ms., Department of Computer and Information Science, University of Pennsylvania.

    [Veilleux and Ostendorf 1992]  Veilleux, N. M., and Ostendorf, Mari. 1992. Probabilistic parse scoring based on prosodic features. In Proceedings of the Fifth DARPA Speech and Natural Language Workshop, February 1992.

    [Weischedel et al. 1991]  Weischedel, Ralph; Ayuso, Damaris; Bobrow, R.; Boisen, Sean; Ingria, Robert; and Palmucci, Jeff. 1991. Partial parsing: a report of work in progress. In Proceedings of the Fourth DARPA Speech and Natural Language Workshop, February 1991. Draft version.