ACL Reading Group 2014 @ PFI: "Less Grammar, More Features"


Page 1

Less Grammar, More Features

David Hall, Greg Durrett and Dan Klein @ Berkeley

Hiroshi Noji (@nozyh)

NII

Page 2

Claim of this paper

‣ For resolving ambiguity in low-level NLP tasks, features from word surface forms are sufficient

Sentiment analysis

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts

Stanford University, Stanford, CA 94305, [email protected],{aperelyg,jcchuang,ang}@cs.stanford.edu

{jeaneis,manning,cgpotts}@stanford.edu

Abstract

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.

1 Introduction

Semantic vector spaces for single words have been widely used as features (Turney and Pantel, 2010). Because they cannot capture the meaning of longer phrases properly, compositionality in semantic vector spaces has recently received a lot of attention (Mitchell and Lapata, 2010; Socher et al., 2010; Zanzotto et al., 2010; Yessenalina and Cardie, 2011; Socher et al., 2012; Grefenstette et al., 2013). However, progress is held back by the current lack of large and labeled compositionality resources and models to accurately capture the underlying phenomena presented in such data. To address this need, we introduce the Stanford Sentiment Treebank and a powerful Recursive Neural Tensor Network that can accurately predict the compositional semantic effects present in this new corpus.

[Figure 1: a parse tree of "This film does n't care about cleverness , wit or any other kind of intelligent humor ." with a predicted sentiment label at every node.]

Figure 1: Example of the Recursive Neural Tensor Network accurately predicting 5 sentiment classes, very negative to very positive (– –, –, 0, +, + +), at every node of a parse tree and capturing the negation and its scope in this sentence.

The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser (Klein and Manning, 2003) and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges. This new dataset allows us to analyze the intricacies of sentiment and to capture complex linguistic phenomena. Fig. 1 shows one of the many examples with clear compositional structure. The granularity and size of ...

Socher et al. '13

Outperforms deep learning

Syntactic parsing

Outperforms the Berkeley parser in many languages

Page 3

Constituency parsing

‣ Infer the tree structure behind a sentence
- A bottleneck for all higher-layer processing?
- The goal is ambiguity resolution

Page 4

The goal is ambiguity resolution

[Two parse trees for "He eats sushi with chopsticks": in one the PP attaches to the VP (VP → VP PP), in the other to the NP (NP → NP PP).]

Page 5

The goal is ambiguity resolution

‣ Both structures are grammatically valid
- The human interpretation is the left one (PP attached to the VP), so the goal is to predict that structure

[The same two parse trees for "He eats sushi with chopsticks".]

Page 6

[The two parse trees for "He eats sushi with chopsticks" again: PP attached to the VP vs. attached to the NP.]

A naive PCFG performs poorly

VP → V NP    0.2
NP → NP PP   0.15
VP → VP PP   0.1

VP attachment: 0.1 × 0.2 = 0.02    NP attachment: 0.2 × 0.15 = 0.03

Rule probabilities are estimated from a treebank. A PCFG is insufficient for ambiguity resolution (F1 score: 72.1).
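As a quick check of the arithmetic above, here is a minimal sketch (my own, using the slide's toy probabilities) showing that the PCFG prefers the humanly wrong NP attachment:

    # Toy PCFG rule probabilities from the slide.
    rules = {"VP -> V NP": 0.2, "NP -> NP PP": 0.15, "VP -> VP PP": 0.1}

    def derivation_prob(used_rules):
        p = 1.0
        for r in used_rules:
            p *= rules[r]
        return p

    # PP attached to the VP ("eats ... with chopsticks"):
    vp_attach = derivation_prob(["VP -> VP PP", "VP -> V NP"])   # 0.02
    # PP attached to the NP ("sushi with chopsticks"):
    np_attach = derivation_prob(["VP -> V NP", "NP -> NP PP"])   # 0.03
    assert np_attach > vp_attach  # the wrong reading wins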

Page 7

Head lexicalization (Eisner '96; Collins '97)

[Parse tree for "He eats sushi with chopsticks" in which each node carries its head word, e.g. VP[eat], PP[with], NP[sushi].]

• Propagates information from the leaf nodes
• Can capture the (eats, with) relation

Drawbacks:
• The number of rules becomes enormous
• Poor portability to other languages (depends on head information)

Page 8

Latent annotation (state splitting): Matsuzaki et al. '05; Petrov et al. '06

[Parse tree for "He eats sushi with chopsticks" with a split symbol at each node, e.g. V-1, VP-2, VP-3, PP-1, NP-4.]

• Infers a latent state at each node
• The implementation of the current Berkeley Parser; F1 score: 90.2

Page 9

Summary of the methods so far

‣ These methods have essentially extracted global information by enlarging the set of CFG rules

‣ Lexicalization: annotate subtrees with head information
- Shift-reduce methods also fall into this category (Zhang and Clark '09; Zhu et al. '13)

‣ Annotating nodes with coarse information
- Based on linguistic analysis: Klein and Manning '03 (Stanford parser)
- Inferred as latent variables with EM: Petrov et al. '06 (Berkeley parser)

Example rules under each scheme:
VP[eat] → VP[eat] PP[with]
VP^S → VP PP^VP
VP-3 → VP-2 PP-1

Page 10

Approach of this work

‣ Is it possible to raise parsing accuracy while keeping annotation to a minimum?
- When resolving ambiguity, is attaching information to the nodes really necessary?

‣ Motivation
- Lexicalized parsers need head information, but for some languages head information is unavailable (due to a lack of resources)
- The Berkeley parser makes little use of word surface information
- It is weak on morphologically rich languages (tuning is required)
- Experiments show that the proposed method is more effective for multilingual parsing

Page 11

Approach of this work

‣ For most ambiguity resolution, isn't it enough to look at the surface forms around the span a rule covers?

[The two candidate trees for "He eats sushi with chopsticks" again: NP attachment vs. VP attachment.]

[FIRSTWORD=eats × RULE=VP→V PP]
[SPANLENGTH=5 × RULE=VP→V PP]
[LASTWORD=chopsticks × RULE=VP→V PP]

[FIRSTWORD=eats × RULE=VP→V NP]
[SPANLENGTH=5 × RULE=VP→V NP]
[LASTWORD=chopsticks × RULE=VP→V NP]

We want negative weights to be learned (see the sketch below)
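To make the intended effect concrete, here is a small illustrative sketch (hypothetical weights and my own notation, not the paper's code) of how conjoined span × rule indicators push the scores of the two candidate rules apart; a learned negative weight on (LASTWORD=chopsticks × RULE=VP→V NP) penalizes the wrong analysis:

    # Hypothetical learned weights for the conjoined indicator features above.
    weights = {
        ("FIRSTWORD=eats", "RULE=VP->V PP"): 0.9,
        ("LASTWORD=chopsticks", "RULE=VP->V PP"): 0.7,
        ("LASTWORD=chopsticks", "RULE=VP->V NP"): -1.3,  # the hoped-for negative weight
    }

    def rule_score(rule, span_properties):
        return sum(weights.get((p, f"RULE={rule}"), 0.0) for p in span_properties)

    props = ["FIRSTWORD=eats", "SPANLENGTH=5", "LASTWORD=chopsticks"]
    print(rule_score("VP->V PP", props))  #  1.6: preferred
    print(rule_score("VP->V NP", props))  # -1.3: penalized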

Page 12

Result Overview

Annotation      Dev, len ≤ 40
v = 0, h = 0    90.1
v = 1, h = 0    90.5
v = 0, h = 1    90.2
v = 1, h = 1    90.9
Lexicalized     90.3

Table 2: Results for the Penn Treebank development set, sentences of length ≤ 40, for different annotation schemes implemented on top of the X-bar grammar.

Recall from Section 3 that every span feature is conjoined with indicators over rules and rule parents to produce features over anchored rule productions; when we consider adding an annotation layer to the grammar, what that does is refine the rule indicators that are conjoined with every span feature. While this is a powerful way of refining features, we show that common successful annotation schemes provide at best modest benefit on top of the base parser.

5.1 Structural Annotation

The most basic, well-understood kind of annotation on top of an X-bar grammar is structural annotation, which annotates each nonterminal with properties of its environment (Johnson, 1998; Klein and Manning, 2003). This includes vertical annotation (parent, grandparent, etc.) as well as horizontal annotation (only partially Markovizing rules as opposed to using an X-bar grammar).

Table 2 shows the performance of our feature set in grammars with several different levels of structural annotation.3 Klein and Manning (2003) find large gains (6% absolute improvement, 20% relative improvement) going from v = 0, h = 0 to v = 1, h = 1; however, we do not find the same level of benefit. To the extent that our parser needs to make use of extra information in order to apply a rule correctly, simply inspecting the input to determine this information appears to be almost as effective as relying on information threaded through the parser.

In Section 6 and Section 7, we use v = 1 and h = 0; we find that v = 1 provides a small, reliable improvement across a range of languages and tasks, whereas other annotations are less clearly beneficial.

3 We use v = 0 to indicate no annotation, diverging from the notation in Klein and Manning (2003).

            Test ≤ 40   Test all
Berkeley        90.6       90.1
This work       89.9       89.2

Table 3: Final Parseval results for the v = 1, h = 0 parser on Section 23 of the Penn Treebank.

5.2 Lexical Annotation

Another commonly-used kind of structural annotation is lexicalization (Eisner, 1996; Collins, 1997; Charniak, 1997). By annotating grammar nonterminals with their headwords, the idea is to better model phenomena that depend heavily on the semantics of the words involved, such as coordination and PP attachment.

Table 2 shows results from lexicalizing the X-bar grammar; it provides meager improvements. One probable reason for this is that our parser already includes monolexical features that inspect the first and last words of each span, which captures the syntactic or the semantic head in many cases or can otherwise provide information about what the constituent's type may be and how it is likely to combine. Lexicalization allows us to capture bilexical relationships along dependency arcs, but it has been previously shown that these add only marginal benefit to Collins's model anyway (Gildea, 2001).

5.3 English Evaluation

Finally, Table 3 shows our final evaluation on Section 23 of the Penn Treebank. We use the v = 1, h = 0 grammar. While we do not do as well as the Berkeley parser, we will see in Section 6 that our parser does a substantially better job of generalizing to other languages.

6 Other Languages

Historically, many annotation schemes for parsers have required language-specific engineering: for example, lexicalized parsers require a set of head rules and manually-annotated grammars require detailed analysis of the treebank itself (Klein and Manning, 2003). A key strength of a parser that does not rely heavily on an annotated grammar is that it may be more portable to other languages. We show that this is indeed the case: on nine languages, our system is competitive with or better than the Berkeley parser, which is the best single

                Arabic  Basque  French  German  Hebrew  Hungarian  Korean  Polish  Swedish  Avg
Dev, all lengths
Berkeley         78.24   69.17   79.74   81.74   87.83   83.90      70.97   84.11   74.50   78.91
Berkeley-Rep     78.70   84.33   79.68   82.74   89.55   89.08      82.84   87.12   75.52   83.28
Our work         78.89   83.74   79.40   83.28   88.06   87.44      81.85   91.10   75.95   83.30
Test, all lengths
Berkeley         79.19   70.50   80.38   78.30   86.96   81.62      71.42   79.23   79.18   78.53
Berkeley-Tags    78.66   74.74   79.76   78.28   85.42   85.22      78.56   86.75   80.64   80.89
Our work         78.75   83.39   79.70   78.43   87.18   88.25      80.18   90.66   82.00   83.17

Table 4: Results for the nine treebanks in the SPMRL 2013 Shared Task; all values are F-scores for sentences of all lengths using the version of evalb distributed with the shared task. Berkeley-Rep is the best single parser from (Bjorkelund et al., 2013); we only compare to this parser on the development set because neither the system nor test set values are publicly available. Berkeley-Tags is a version of the Berkeley parser run by the task organizers where tags are provided to the model, and is the best single parser submitted to the official task. In both cases, we match or outperform the baseline parsers in aggregate and on the majority of individual languages.

parser4 for the majority of cases we consider. We evaluate on the constituency treebanks from the Statistical Parsing of Morphologically Rich Languages Shared Task (Seddah et al., 2013). We compare to the Berkeley parser (Petrov and Klein, 2007) as well as two variants. First, we use the "Replaced" system of Bjorkelund et al. (2013) (Berkeley-Rep), which is their best single parser.5 The "Replaced" system modifies the Berkeley parser by replacing rare words with morphological descriptors of those words computed using language-specific modules, which have been hand-crafted for individual languages or are trained with additional annotation layers in the treebanks that we do not exploit. Unfortunately, Bjorkelund et al. (2013) only report results on the development set for the Berkeley-Rep model; however, the task organizers also use a version of the Berkeley parser provided with parts of speech from high-quality POS taggers for each language (Berkeley-Tags). These part-of-speech taggers often incorporate substantial knowledge of each language's morphology. Both Berkeley-Rep and Berkeley-Tags make up for some shortcomings of the Berkeley parser's unknown word model, which is tuned to English.

In Table 4, we see that our performance is overall substantially higher than that of the Berkeley parser. On the development set, we outperform the Berkeley parser and match the performance of the Berkeley-Rep parser. On the test set, we outperform both the Berkeley parser and the Berkeley-Tags parser on seven of nine languages, losing only on Arabic and French.

4 I.e. it does not use a reranking step or post-hoc combination of parser results.

5 Their best parser, and the best overall parser from the shared task, is a reranked product of "Replaced" Berkeley parsers.

These results suggest that the Berkeley parser may be heavily fit to English, particularly in its lexicon. However, even when language-specific unknown word handling is added to the parser, our model still outperforms the Berkeley parser overall, showing that our model generalizes even better across languages than a parser for which this is touted as a strength (Petrov and Klein, 2007). Our span features appear to work well on both head-initial and head-final languages (see Basque and Korean in the table), and the fact that our parser performs well on such morphologically-rich languages as Hungarian indicates that our suffix model is sufficient to capture most of the morphological effects relevant to parsing. Of course, a language that was heavily prefixing would likely require this feature to be modified. Likewise, our parser does not perform as well on Arabic and Hebrew. These closely related languages use templatic morphology, for which suffixing is not appropriate; however, using additional surface features based on the output of a morphological analyzer did not lead to increased performance.

Finally, our high performance on languages such as Polish and Swedish, whose training treebanks consist of 6578 and 5000 sentences, respectively, shows that our feature-rich model performs robustly even on treebanks much smaller than the Penn Treebank.6

6 The especially strong performance on Polish relative to other systems is partially a result of our model being able to produce unary chains of length two, which occur frequently in the Polish treebank (Bjorkelund et al., 2013).

Berkeley-Rep: the Berkeley parser with rare words replaced by feature representations tuned per language

Multilingual data: the SPMRL 2013 Shared Task

Page 13

Model: CRF Parsing (Finkel et al. '08)

we need a grammar at all. As a thought experiment, consider a parser with no grammar, which functions by independently classifying each span (i, j) of a sentence as an NP, VP, and so on, or null if that span is a non-constituent. For example, spans that begin with the might tend to be NPs, while spans that end with of might tend to be non-constituents. An independent classification approach is actually very viable for part-of-speech tagging (Toutanova et al., 2003), but is problematic for parsing – if nothing else, parsing comes with a structural requirement that the output be a well-formed, nested tree. Our parser uses a minimal PCFG backbone grammar to ensure a basic level of structural well-formedness, but relies mostly on features of surface spans to drive accuracy. Formally, our model is a CRF where the features factor over anchored rules of a small backbone grammar, as shown in Figure 1.

Some aspects of the parsing problem, such as the tree constraint, are clearly best captured by a PCFG. Others, such as heaviness effects, are naturally captured using surface information. The open question is whether surface features are adequate for key effects like subcategorization, which have deep definitions but regular surface reflexes (e.g. the preposition selected by a verb will often linearly follow it). Empirically, the answer seems to be yes, and our system produces strong results, e.g. up to 90.5 F1 on English parsing. Our parser is also able to generalize well across languages with little tuning: it achieves state-of-the-art results on multilingual parsing, scoring higher than the best single-parser system from the SPMRL 2013 Shared Task on a range of languages, as well as on the competition's average F1 metric.

One advantage of a system that relies on surface features and a simple grammar is that it is portable not only across languages but also across tasks to an extent. For example, Socher et al. (2013) demonstrates that sentiment analysis, which is usually approached as a flat classification task, can be viewed as tree-structured. In their work, they propagate real-valued vectors up a tree using neural tensor nets and see gains from their recursive approach. Our parser can be easily adapted to this task by replacing the X-bar grammar over treebank symbols with a grammar over the sentiment values to encode the output variables and then adding n-gram indicators to our feature set to capture the bulk of the lexical effects. When applied to this task, our system generally matches their accuracy overall and is able to outperform it on the overall sentence-level subtask.

2 Parsing Model

In order to exploit non-independent surface features of the input, we use a discriminative formulation. Our model is a conditional random field (Lafferty et al., 2001) over trees, in the same vein as Finkel et al. (2008) and Petrov and Klein (2008a). Formally, we define the probability of a tree T conditioned on a sentence w as

p(T \mid w) \propto \exp\left( \theta^\top \sum_{r \in T} f(r, w) \right) \qquad (1)

where the feature domains r range over the (anchored) rules used in the tree. An anchored rule r is the conjunction of an unanchored grammar rule rule(r) and the start, stop, and split indexes where that rule is anchored, which we refer to as span(r). It is important to note that the richness of the backbone grammar is reflected in the structure of the trees T, while the features that condition directly on the input enter the equation through the anchoring span(r). To optimize model parameters, we use the Adagrad algorithm of Duchi et al. (2010) with L2 regularization.
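As a concrete reading of Eq. (1), here is a minimal sketch (my own notation, not the epic parser's API) of the unnormalized log score: the weight vector dotted with the features summed over a tree's anchored rules. The normalizer Z(w) would come from the inside algorithm and is omitted here.

    def tree_log_score(anchored_rules, theta, feature_fn):
        """Unnormalized log p(T|w): theta . sum over anchored rules r of f(r, w).
        anchored_rules: list of (rule, start, split, end) tuples;
        feature_fn: maps an anchored rule to its sparse feature list."""
        return sum(theta.get(f, 0.0) for r in anchored_rules for f in feature_fn(r))

    # Toy demo with rule identity as the only feature (the bare X-bar baseline).
    rules = [("S -> NP VP", 0, 1, 5), ("VP -> VP PP", 1, 3, 5), ("PP -> P NP", 3, 4, 5)]
    theta = {"RULE=S -> NP VP": 0.3, "RULE=VP -> VP PP": 1.1}
    print(tree_log_score(rules, theta, lambda r: [f"RULE={r[0]}"]))  # 1.4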

We start with a simple X-bar grammar whose only symbols are NP, NP-bar, VP, and so on. Our base model has no surface features: formally, on each anchored rule r we have only an indicator of the (unanchored) rule identity, rule(r). Because the X-bar grammar is so minimal, this grammar does not parse very accurately, scoring just 73 F1 on the standard English Penn Treebank task.

In past work that has used tree-structured CRFs in this way, increased accuracy partially came from decorating trees T with additional annotations, giving a tree T′ over a more complex symbol set. These annotations introduce additional context into the model, usually capturing linguistic intuition about the factors that influence grammaticality. For instance, we might annotate every constituent X in the tree with its parent Y, giving a tree with symbols X^Y. Finkel et al. (2008) used parent annotation, head tag annotation, and horizontal sibling annotation together in a single large grammar. In Petrov and Klein (2008a) and Petrov and Klein (2008b), these annotations were latent; they were inferred automatically during training.

[Parse tree for "I eat sushi with chopsticks" (S → NP VP, VP → VP PP, PP → P NP), illustrating the anchored rules the CRF factors over.]

Marginal probabilities are computed with Inside-Outside; AdaGrad + L2 (online learning)
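A minimal sketch of the training step described here, assuming the CRF gradient (expected feature counts from inside-outside minus gold counts) has already been computed; the step size, regularization strength, and update form follow the usual AdaGrad-with-L2 recipe rather than the paper's exact settings:

    import math
    from collections import defaultdict

    theta = defaultdict(float)   # feature weights
    sum_g2 = defaultdict(float)  # per-feature accumulated squared gradients
    eta, lam = 0.1, 1e-6         # illustrative step size and L2 strength

    def adagrad_step(grad):
        """grad: dict feature -> (expected count under the model - gold count)."""
        for f, g in grad.items():
            g += lam * theta[f]                       # L2 regularization term
            sum_g2[f] += g * g
            theta[f] -= eta * g / (math.sqrt(sum_g2[f]) + 1e-8)

    # One online step for one sentence:
    adagrad_step({"FIRSTWORD=eats & RULE=VP -> V NP": 0.7 - 1.0})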

Page 14

Feature extraction

Hall and Klein (2012) employed both kinds of annotations, along with lexicalized head word annotation. All of these past CRF parsers do also exploit span features, as did the structured margin parser of Taskar et al. (2004); the current work primarily differs in shifting the work from the grammar to the surface features.

The problem with rich annotations is that they increase the state space of the grammar substantially. For example, adding parent annotation can square the number of symbols, and each subsequent annotation causes a multiplicative increase in the size of the state space. Hall and Klein (2012) attempted to reduce this state space by factoring these annotations into individual components. Their approach changed the multiplicative penalty of annotation into an additive penalty, but even so their individual grammar projections are much larger than the base X-bar grammar.

In this work, we want to see how much of the expressive capability of annotations can be captured using surface evidence, with little or no annotation of the underlying grammar. To that end, we avoid annotating our trees at all, opting instead to see how far simple surface features will go in achieving a high-performance parser. We will return to the question of annotation in Section 5.

3 Surface Feature Framework

To improve the performance of our X-bar grammar, we will add a number of surface feature templates derived only from the words in the sentence. We say that an indicator is a surface property if it can be extracted without reference to the parse tree. These features can be implemented without reference to structured linguistic notions like headedness; however, we will argue that they still capture a wide range of linguistic phenomena in a data-driven way.

Throughout this and the following section, we will draw on motivating examples from the English Penn Treebank, though similar examples could be equally argued for other languages. For performance on other languages, see Section 6.

Recall that our CRF factors over anchored rules r, where each r has identity rule(r) and anchoring span(r). The X-bar grammar has only indicators of rule(r), ignoring the anchoring. Let a surface property of r be an indicator function of span(r) and the sentence itself. For example, the first word in a constituent is a surface property, as is the word directly preceding the constituent.

[Figure 1 of the paper: the rule VP → VBD NP applied over the anchored span "averted financial disaster" (word indices 5–8). Span properties such as FIRSTWORD = averted, LASTWORD = disaster, and LENGTH = 3 are conjoined with rule backoffs (RULE = VP → VBD NP, PARENT = VP) to produce features like (FIRSTWORD = averted ∧ RULE = VP → VBD NP) and (LASTWORD = disaster ∧ PARENT = VP).]

Figure 1: Features computed over the application of the rule VP → VBD NP over the anchored span averted financial disaster with the shown indices. Span properties are generated as described throughout Section 4; they are then conjoined with the rule and just the parent nonterminal to give the features fired over the anchored production.

As illustrated in Figure 1, the actual features of the model are obtained by conjoining surface properties with various abstractions of the rule identity. For rule abstractions, we use two templates: the parent of the rule and the identity of the rule. The surface features are somewhat more involved, and so we introduce them incrementally.
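The conjunction scheme of Figure 1 is easy to state in code; this is a short sketch with assumed helper names and a made-up sentence, not the actual implementation:

    def span_properties(words, start, end):
        # Surface properties depend only on span(r) and the sentence itself.
        return [f"FIRSTWORD={words[start]}",
                f"LASTWORD={words[end - 1]}",
                f"LENGTH={end - start}"]

    def anchored_features(rule, parent, words, start, end):
        feats = [f"RULE={rule}"]                      # the grammar-only backoff
        for prop in span_properties(words, start, end):
            feats.append(f"{prop} & RULE={rule}")     # conjoin with rule identity
            feats.append(f"{prop} & PARENT={parent}") # conjoin with parent backoff
        return feats

    words = "shareholders averted financial disaster last year".split()
    print(anchored_features("VP -> VBD NP", "VP", words, 1, 4))
    # includes e.g. 'FIRSTWORD=averted & RULE=VP -> VBD NP'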

One immediate computational and statistical issue arises from the sheer number of possible surface features. There are a great number of spans in a typical treebank; extracting features for every possible combination of span and rule is prohibitive. One simple solution is to only extract features for rule/span pairs that are actually observed in gold annotated examples during training. Because these "positive" features correspond to observed constituents, they are far less numerous than the set of all possible features extracted from all spans. As far as we can tell, all past CRF parsers have used "positive" features only.

However, negative features – features that are not observed in any tree – are still powerful indicators of (un)grammaticality: if we have never seen a PRN that starts with "has," or a span that begins with a quotation mark and ends with a close bracket, then we would like the model to be able to place negative weights on these features. Thus, we use a simple feature hashing scheme where positive features are indexed individually, while negative features are bucketed together.

[Same excerpt of Sections 1–2 of the paper as reproduced on Page 13.]

[Illustration: features as a sparse binary vector (0, 0, 1, 0, …, 0, 1, 0, 1) dotted with a dense weight vector (10.3, −1.2, 3.2, 0.01, …).]

The score is computed as an inner product; it plays the role of a PCFG rule probability ⇒ it becomes the score in the CKY chart

Page 15

Which features are effective?

Features                                        Section   F1
RULE                                            4         73.0
+ SPAN FIRST WORD + SPAN LAST WORD + LENGTH     4.1       85.0
+ WORD BEFORE SPAN + WORD AFTER SPAN            4.2       89.0
+ WORD BEFORE SPLIT + WORD AFTER SPLIT          4.3       89.7
+ SPAN SHAPE                                    4.4       89.9

Table 1: Results for the Penn Treebank development set, reported in F1 on sentences of length ≤ 40 on Section 22, for a number of incrementally growing feature sets. We show that each feature type presented in Section 4 adds benefit over the previous, and in combination they produce a reasonably good yet simple parser.

During training there are no collisions between positive features, which generally receive positive weight, and negative features, which generally receive negative weight; only negative features can collide. Early experiments indicated that using a number of negative buckets equal to the number of positive features was effective.
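A sketch of how such a scheme might look (assumed names; the real implementation presumably differs): features seen on gold constituents get dedicated indices, and anything else hashes into a shared bank of buckets.

    NUM_NEG_BUCKETS = 1 << 20   # the paper sizes this near the positive-feature count

    positive_index = {}         # filled from gold rule/span pairs during training

    def feature_index(feat):
        idx = positive_index.get(feat)
        if idx is not None:
            return idx                                        # collision-free
        # "Negative" feature: hash into shared buckets (collisions allowed).
        return len(positive_index) + hash(feat) % NUM_NEG_BUCKETS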

4 Features

Our goal is to use surface features to replicate the functionality of other annotations, without increasing the state space of our grammar, meaning that the rules rule(r) remain simple, as does the state space used during inference.

Before we present our main features, we briefly discuss the issue of feature sparsity. While lexical features are a powerful driver of our parser, firing features on rare words would allow it to overfit the training data quite heavily. To that end, for the purposes of computing our features, a word is represented by its longest suffix that occurs 100 or more times in the training data (which will be the entire word, for common words).1
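For instance, a sketch of this word representation (assuming suffix_counts is a precomputed Counter of suffix frequencies from the training data):

    from collections import Counter

    def word_rep(word, suffix_counts, threshold=100):
        """Longest suffix of `word` occurring >= threshold times in training;
        for common words this is the entire word."""
        for i in range(len(word)):          # i = 0 tries the whole word first
            if suffix_counts[word[i:]] >= threshold:
                return word[i:]
        return "<unk>"  # hypothetical fallback; the paper does not spell this case out

    counts = Counter({"running": 3, "ing": 500, "g": 900})
    print(word_rep("running", counts))  # "ing"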

Table 1 shows the results of incrementally building up our feature set on the Penn Treebank development set. RULE specifies that we use only indicators on rule identity for binary production and nonterminal unaries. For this experiment and all others, we include a basic set of lexicon features, i.e. features on preterminal part-of-speech tags. A given preterminal unary at position i in the sentence includes features on the words (suffixes) at position i − 1, i, and i + 1. Because the lexicon is especially sensitive to morphological effects, we also fire features on all prefixes and suffixes of the current word up to length 5, regardless of frequency.

1 Experiments with the Brown clusters (Brown et al., 1992) provided by Turian et al. (2010) in lieu of suffixes were not promising. Moreover, lowering this threshold did not improve performance.

Subsequent lines in Table 1 indicate additional surface feature templates computed over the span, which are then conjoined with the rule identity as shown in Figure 1 to give additional features. In the rest of the section, we describe the features of this type that we use. Note that many of these features have been used before (Taskar et al., 2004; Finkel et al., 2008; Petrov and Klein, 2008b); our goal here is not to amass as many feature templates as possible, but rather to examine the extent to which a simple set of features can replace a complicated state space.

4.1 Basic Span Features

We start with some of the most obvious properties available to us, namely, the identity of the first and last words of a span. Because heads of constituents are often at the beginning or the end of a span, these feature templates can (noisily) capture monolexical properties of heads without having to incur the inferential cost of lexicalized annotations. For example, in English, the syntactic head of a verb phrase is typically at the beginning of the span, while the head of a simple noun phrase is the last word. Other languages, like Korean or Japanese, are more consistently head final.

Structural contexts like those captured by parent annotation (Johnson, 1998) are more subtle. Parent annotation can capture, for instance, the difference in distribution in NPs that have S as a parent (that is, subjects) and NPs under VPs (objects). We try to capture some of this same intuition by introducing a feature on the length of a span. For instance, VPs embedded in NPs tend to be short, usually as embedded gerund phrases. Because constituents in the treebank can be quite long, we bin our length features into 8 buckets, of lengths 1, 2, 3, 4, 5, 10, 20, and ≥ 21 words.
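A one-function sketch of this binning (my own thresholds, matching the listed bucket labels):

    def length_bucket(n):
        """Map a span length to one of the 8 buckets 1, 2, 3, 4, 5, 10, 20, >= 21."""
        for edge in (1, 2, 3, 4, 5, 10, 20):
            if n <= edge:
                return f"LEN<={edge}"
        return "LEN>=21"

    assert length_bucket(7) == "LEN<=10" and length_bucket(42) == "LEN>=21"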

Sentences of length ≤ 40, WSJ Sec. 22 (development)

The meaning of most features is intuitively clear; below, concrete examples show the kinds of sentences each one helps with

Page 16

Word before/after span

[Figure 2: "no read messages in his inbox", with "read messages" spanned as VP (VBP NNS); the feature conjoins the word before the span: VP → no [VBP NNS].]

Figure 2: An example showing the utility of span context. The ambiguity about whether read is an adjective or a verb is resolved when we construct a VP and notice that the word preceding it is unlikely.

[Figure 3: "has an impact on the market", with NP → NP PP splitting after "impact": NP → (NP … impact) (PP …).]

Figure 3: An example showing split point features disambiguating a PP attachment. Because impact is likely to take a PP, the monolexical indicator feature that conjoins impact with the appropriate rule will help us parse this example correctly.

Adding these simple features (first word, last word, and lengths) as span features of the X-bar grammar already gives us a substantial improvement over our baseline system, improving the parser's performance from 73.0 F1 to 85.0 F1 (see Table 1).

4.2 Span Context Features

Of course, there is no reason why we should confine ourselves to just the words within the span: words outside the span also provide a rich source of context. As an example, consider disambiguating the POS tag of the word read in Figure 2. A VP is most frequently preceded by a subject NP, whose rightmost word is often its head. Therefore, we fire features that (separately) look at the words immediately preceding and immediately following the span.

4.3 Split Point Features

Another important source of features are the words at and around the split point of a binary rule application. Figure 3 shows an example of one instance of this feature template. impact is a noun that is more likely to take a PP than other nouns, and so we expect this feature to have high weight and encourage the attachment; this feature proves generally useful in resolving such cases of right-attachments to noun phrases, since the last word of the noun phrase is often the head. As another example, coordination can be represented by an indicator of the conjunction, which comes immediately after the split point. Finally, control structures with infinitival complements can be captured with a rule S → NP VP with the word "to" at the split point.

[Figure 4: span shapes for "( CEO of Enron )" (PRN) → (XxX) and for "said , “ Too bad , ”" (VP) → x,“Xx,”.]

Figure 4: Computation of span shape features on two examples. Parentheticals, quotes, and other punctuation-heavy, short constituents benefit from being explicitly modeled by a descriptor like this.

4.4 Span Shape Features

We add one final feature characterizing the span, which we call span shape. Figure 4 shows how this feature is computed. For each word in the span,2 we indicate whether that word begins with a capital letter, lowercase letter, digit, or punctuation mark. If it begins with punctuation, we indicate the punctuation mark explicitly. Figure 4 shows that this is especially useful in characterizing constructions such as parentheticals and quoted expressions. Because this feature indicates capitalization, it can also capture properties of NP internal structure relevant to named entities, and its sensitivity to capitalization and punctuation makes it useful for recognizing appositive constructions.

5 Annotations

We have built up a strong set of features by this point, but have not yet answered the question of whether or not grammar annotation is useful on top of them. In this section, we examine two of the most commonly used types of additional annotation: structural annotation and lexical annotation.

2 For longer spans, we only use words sufficiently close to the span's beginning and end.

[The competing analysis: "read" as JJ, giving NP → JJ NNS over "read messages".]

Is the POS of "read" VBP or JJ?

When deciding which rule spans "read messages", the information that "no" rarely precedes a VP is the clue (we want a negative weight to be learned)

Page 17

Word before/after split

[Same excerpt of Sections 4.2–5 and Figures 2–4 as reproduced on Page 16.]

PP attachment

impact is a noun that readily takes modifiers ⇒ we want a large weight to be learned

Exploits the fact that the head of a phrase tends to appear at one of its two ends (this holds in many languages; the head of a Japanese bunsetsu is at its right edge)

Page 18

Span  shape

[Same excerpt of Sections 4.2–5 and Figures 2–4 as reproduced on Page 16.]

Extracts leading capitalization and brackets (for English); useful for identifying named entities, matching brackets, etc.
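A small sketch of computing this descriptor (my own code; the paper's footnote 2, which truncates long spans, is omitted):

    def span_shape(words):
        """One character per word, based on its first character (Figure 4)."""
        shape = []
        for w in words:
            c = w[0]
            if c.isupper():
                shape.append("X")
            elif c.islower():
                shape.append("x")
            elif c.isdigit():
                shape.append("d")
            else:
                shape.append(c)   # punctuation: keep the mark itself
        return "".join(shape)

    print(span_shape("( CEO of Enron )".split()))      # (XxX)
    print(span_shape('said , " Too bad , "'.split()))  # x,"Xx,"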

Page 19

On the meaning of "Less Grammar"

‣ It does not mean that linguistics can be thrown away and the problem solved by machine learning alone
- "Grammar" here means the size of the CFG rule set used
- The paper's claim is that a small grammar suffices if meaningful features are extracted from the surface
- The machine learning used is simple (CRF + SGD)

‣ Do existing methods actually require less linguistic knowledge to design?
- Berkeley parser: splitting via EM in a probabilistic model (fully automatic)
- Shift-reduce: throw in every feature you can think of

Page 20

An aside: the direction of this research

‣ It looks like the same direction as their EMNLP 2013 coreference paper
- Coreference resolution can reach top accuracy using a discriminative model based only on features extracted from the surface forms between mentions (external knowledge such as WordNet is not necessary)
- The Berkeley coreference tool is publicly available, with higher accuracy than Stanford's (supposedly)

Easy Victories and Uphill Battles in Coreference Resolution
Greg Durrett and Dan Klein (Berkeley)

[Example: "[Barack Obama]1 met with [David Cameron]2 . [He]1 said ..." with surface features such as [with X - . Y] and [with X - Y said], contrasted with Centering-style patterns "with [X] ." and ". [X] said".]

Many NLP analysis tasks can reach high accuracy if features are chosen well from word surface forms

Page 21

Sentiment analysis

‣ Used Mechanical Turk to attach 5-level sentiment labels to the nodes of tree structures

‣ Showed that a neural net beats existing methods (last year's EMNLP)

[Screenshot of Socher et al. '13 (title, abstract, introduction, and Figure 1), identical to the one reproduced on Page 2.]

Socher et al. '13

Page 22

The method of this work applies directly

‣ Given the tree structure, classify each span into 5 levels
- Run Inside-Outside / CKY with the structure held fixed (see the sketch below)
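Since the bracketing is fixed, decoding reduces to a tree-shaped Viterbi over the 5 labels; here is a sketch (my own formulation of what "CKY with the structure fixed" amounts to, with score standing in for the feature-based rule/span scorer):

    from itertools import product

    LABELS = range(5)  # very negative .. very positive

    def viterbi(node, score):
        """node = (start, end, children). Returns {label: (best log score, child labels)}.
        score(label, child_labels, start, end) is an assumed feature-based scorer
        (rule indicators conjoined with span features, as in the syntactic model)."""
        start, end, children = node
        if not children:  # leaf span: only the local score
            return {l: (score(l, (), start, end), ()) for l in LABELS}
        kid_tables = [viterbi(c, score) for c in children]
        table = {}
        for l in LABELS:
            table[l] = max(
                (score(l, combo, start, end)
                 + sum(t[c][0] for t, c in zip(kid_tables, combo)), combo)
                for combo in product(*kid_tables))
        return table

The root's argmax over the returned table gives the sentence-level label, and the stored child-label tuples act as backpointers recovering a label for every span.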

[Figure 5: "While " Gangs " is never lethargic , it is hindered by its plot ." The root label 2 is built by the rule 2 → 4 1: the left constituent ("While … lethargic") is labeled 4, the right one 1.]

Figure 5: An example of a sentence from the Stanford Sentiment Treebank which shows the utility of our span features for this task. The presence of "While" under this kind of rule tells us that the sentiment of the constituent to the right dominates the sentiment to the left.

7 Sentiment Analysis

Finally, because the system is, at its core, a classifier of spans, it can be used equally well for tasks that do not normally use parsing algorithms. One example is sentiment analysis. While approaches to sentiment analysis often simply classify the sentence monolithically, treating it as a bag of n-grams (Pang et al., 2002; Pang and Lee, 2005; Wang and Manning, 2012), the recent dataset of Socher et al. (2013) imposes a layer of structure on the problem that we can exploit. They annotate every constituent in a number of training trees with an integer sentiment value from 1 (very negative) to 5 (very positive), opening the door for models such as ours to learn how syntax can structurally affect sentiment.7

Figure 5 shows an example that requires some analysis of sentence structure to correctly understand. The first constituent conveys positive sentiment with never lethargic and the second conveys negative sentiment with hindered, but to determine the overall sentiment of the sentence, we need to exploit the fact that while signals a discounting of the information that follows it. The grammar rule 2 → 4 1 already encodes the notion of the sentiment of the right child being dominant, so when this is conjoined with our span feature on the first word (While), we end up with a feature that captures this effect. Our features can also lexicalize on other discourse connectives such as but or however, which often occur at the split point between two spans.

7 Note that the tree structure is assumed to be given; the problem is one of labeling a fixed parse backbone.

7.1 Adapting to Sentiment

Our parser is almost entirely unchanged from the parser that we used for syntactic analysis. Though the treebank grammar is substantially different, with the nonterminals consisting of five integers with very different semantics from syntactic nonterminals, we still find that parent annotation is effective and otherwise additional annotation layers are not useful.

One structural difference between sentiment analysis and syntactic parsing lies in where the relevant information is present in a span. Syntax is often driven by heads of constituents, which tend to be located at the beginning or the end, whereas sentiment is more likely to depend on modifiers such as adjectives, which are typically present in the middle of spans. Therefore, we augment our existing model with standard sentiment analysis features that look at unigrams and bigrams in the span (Wang and Manning, 2012). Moreover, the Stanford Sentiment Treebank is unique in that each constituent was annotated in isolation, meaning that context never affects sentiment and that every word always has the same tag. We exploit this by adding an additional feature template similar to our span shape feature from Section 4.4 which uses the (deterministic) tag for each word as its descriptor.

7.2 Results

We evaluated our model on the fine-grained sentiment analysis task presented in Socher et al. (2013) and compare to their released system. The task is to predict the root sentiment label of each parse tree; however, because the data is annotated with sentiment at each span of each parse tree, we can also evaluate how well our model does at these intermediate computations. Following their experimental conditions, we filter the test set so that it only contains trees with non-neutral sentiment labels at the root.

Table 5 shows that our model outperforms the model of Socher et al. (2013), both the published numbers and the latest released version, on the task of root classification, even though the system was not explicitly designed for this task. Their model has high capacity to model complex interactions of words through a combinatory tensor, but it appears that our simpler, feature-driven model is just as effective at capturing the key effects of compositionality for sentiment analysis.

The first word of a span often marks a logical relation between the spans: "but", etc.

Page 23

Higher performance than the neural net

                               Root   All Spans
Non-neutral Dev (872 trees)
Stanford CoreNLP current       50.7   80.8
This work                      53.1   80.5
Non-neutral Test (1821 trees)
Stanford CoreNLP current       49.1   80.2
Stanford EMNLP 2013            45.7   80.7
This work                      49.6   80.4

Table 5: Fine-grained sentiment analysis results on the Stanford Sentiment Treebank of Socher et al. (2013). We compare against the printed numbers in Socher et al. (2013) as well as the performance of the corresponding release, namely the sentiment component in the latest version of the Stanford CoreNLP at the time of this writing. Our model handily outperforms the results from Socher et al. (2013) at root classification and edges out the performance of the latest version of the Stanford system. On all spans of the tree, our model has comparable accuracy to the others.

8 Conclusion

To date, the most successful constituency parsers have largely been generative, and operate by refining the grammar either manually or automatically so that relevant information is available locally to each parsing decision. Our main contribution is to show that there is an alternative to such annotation schemes: namely, conditioning on the input and firing features based on anchored spans. We build up a small set of feature templates as part of a discriminative constituency parser and outperform the Berkeley parser on a wide range of languages. Moreover, we show that our parser is adaptable to other tree-structured tasks such as sentiment analysis; we outperform the recent system of Socher et al. (2013) and obtain state of the art performance on their dataset.

Our system is available as open-source at https://www.github.com/dlwh/epic.

Acknowledgments

This work was partially supported by BBN under DARPA contract HR0011-12-C-0014, by a Google PhD fellowship to the first author, and an NSF fellowship to the second.

References

Anders Bjorkelund, Ozlem Cetinoglu, Richard Farkas, Thomas Mueller, and Wolfgang Seeker. 2013. (Re)ranking Meets Morphosyntax: State-of-the-art Results from the SPMRL 2013 Shared Task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages.

Rens Bod. 1993. Using an Annotated Corpus As a Stochastic Grammar. In Proceedings of the Sixth Conference on European Chapter of the Association for Computational Linguistics.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine N-best Parsing and MaxEnt Discriminative Reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics.

Eugene Charniak. 1997. Statistical Techniques for Natural Language Parsing. AI Magazine, 18:33–44.

Michael Collins and Terry Koo. 2005. Discriminative Reranking for Natural Language Parsing. Computational Linguistics, 31(1):25–70, March.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL, pages 16–23.

John Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. COLT.

Jason Eisner. 1996. Three New Probabilistic Models for Dependency Parsing: An Exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96).

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In ACL 2008, pages 959–967.

Daniel Gildea. 2001. Corpus variation and parser performance. In Proceedings of Empirical Methods in Natural Language Processing.

David Hall and Dan Klein. 2012. Training factored PCFGs with expectation propagation. In EMNLP.

James Henderson. 2003. Inducing History Representations for Broad Coverage Statistical Parsing. In Proceedings of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proceedings of ACL-08: HLT, pages 586–594, Columbus, Ohio, June. Association for Computational Linguistics.

For reference, another paper at this year's ACL: Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom, "A Convolutional Neural Network for Modelling Sentences"

Classifies sentiment with a neural net without assuming a tree structure; 48.5 points on the test set (slightly below Stanford current)

Page 24

Summary

‣ It was believed that achieving parsing accuracy required annotating nodes with information and multiplying the rules

‣ Parsing with a minimal number of rules
- Little dependence on the language/grammar ⇒ highly portable to other languages
- Adapts to other tasks with small changes to the features (sentiment)

‣ The parser is publicly available (epic parser)

‣ Lessons to take away (?)
- Information from word surface forms is (still) extremely powerful
- If meaningful features can be extracted, simple models can match complex methods in accuracy

https://github.com/dlwh/epic