Retrofitting Word Vectors to Semantic Lexicons


Page 1: Retrofitting Word Vectors to Semantic Lexicons

Retrofitting Word Vectors to

Semantic Lexicons

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, Noah A. Smith

NAACL 2015

Presenter: Sho Takase, Knowledge Acquisition Study Group, 2015/4/21

Page 2: Retrofitting Word Vectors to Semantic Lexicons

Word vector representations

• Acquiring word meanings (vector representations) from a corpus is an important technique in NLP
– Methods: word-context co-occurrence matrices, dimensionality reduction of the co-occurrence matrix, neural language models, etc. (a minimal co-occurrence sketch follows the table below)

– Words with similar properties get similar vectors

• Also useful as features for downstream tasks


                masterpiece  writer  athletics  literary prize  speed (km/h)  muscle  rank  write
Franz Kafka              80      60          0              30             0       0     0     40
Kenzaburo Oe             70      60          0              50             0       0     0     60
Usain Bolt                0       0        100               0            30      40    80      0
Carl Lewis                0       0         90               0            40      50    70      0

Context vectors of words
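To make the first listed method concrete (as referenced above), here is a minimal sketch of building word-context count vectors like those in the table; the toy corpus, window size, and names are assumptions for illustration, not from the slides.

```python
from collections import Counter, defaultdict

def context_vectors(corpus, window=2):
    """For each word, count how often every other word appears within
    `window` positions of it; each Counter is that word's context vector."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

# Toy corpus (hypothetical): words used alike end up with similar count vectors.
corpus = [["kafka", "wrote", "novels"], ["oe", "wrote", "novels"],
          ["bolt", "ran", "fast"], ["lewis", "ran", "fast"]]
vectors = context_vectors(corpus)
print(vectors["kafka"])  # Counter({'wrote': 1, 'novels': 1})
print(vectors["bolt"])   # Counter({'ran': 1, 'fast': 1})
```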

Page 3: Retrofitting Word Vectors to Semantic Lexicons

Incorporating external knowledge into vector representations, and the problem with prior work

• Using external knowledge improves the quality of vector representations [Yu+ 14, Chang+ 13]
– External knowledge: WordNet, FrameNet, etc.

• Problem: tied to particular vector-construction methods
– The use of the corpus and of the external knowledge is fused into a single training procedure

– Vectors that are already given cannot be improved with external knowledge (new training methods cannot be accommodated)
• Example [Yu+ 14]: the objective function contains an external-knowledge term


(Slide figure, labels: context / external knowledge)

Page 4: Retrofitting Word Vectors to Semantic Lexicons

Contributions of this work

• Propose a method that injects external knowledge into vector representations as a post-processing step

– Can be combined with any vector-construction method

– The proposed method is fast

• About 5 seconds for 100,000 words with 300-dimensional vectors

• Demonstrate its usefulness through a range of experiments

– Comparisons across training methods, external knowledge sources, vector dimensionality, languages, and more


Page 5: Retrofitting Word Vectors to Semantic Lexicons

Proposed method

• Two goals
– Keep each vector similar to the input vector obtained from the corpus
– Make words that are related in the external knowledge have similar vectors

• "Related" here: synonyms, hypernym-hyponym pairs, paraphrases

• Objective function
– Minimize the Euclidean distance between the vectors we want to be similar

• First term: corpus information (keep close to the input vector)
• Second term: external knowledge (move closer to the related words in the lexicon)

– E: the set of edges between words that are related in the external knowledge
– α, β: hyperparameters (α = 1, β = 1 / node degree)

(A small numeric sketch of this objective follows the paper excerpt below.)


Figure 1: Word graph with edges between related words showing the observed (grey) and the inferred (white) word vector representations.

Experimentally, we show that our method works well with different state-of-the-art word vector models, using different kinds of semantic lexicons and gives substantial improvements on a variety of benchmarks, while beating the current state-of-the-art approaches for incorporating semantic information in vector training and trivially extends to multiple languages. We show that retrofitting gives consistent improvement in performance on evaluation benchmarks with different word vector lengths and show a qualitative visualization of the effect of retrofitting on word vector quality. The retrofitting tool is available at: https://github.com/mfaruqui/retrofitting.

2 Retrofitting with Semantic Lexicons

Let $V = \{w_1, \ldots, w_n\}$ be a vocabulary, i.e., the set of word types, and $\Omega$ be an ontology that encodes semantic relations between words in $V$. We represent $\Omega$ as an undirected graph $(V, E)$ with one vertex for each word type and edges $(w_i, w_j) \in E \subseteq V \times V$ indicating a semantic relationship of interest. These relations differ for different semantic lexicons and are described later (§4).

The matrix $\hat{Q}$ will be the collection of vector representations $\hat{q}_i \in \mathbb{R}^d$, for each $w_i \in V$, learned using a standard data-driven technique, where $d$ is the length of the word vectors. Our objective is to learn the matrix $Q = (q_1, \ldots, q_n)$ such that the columns are both close (under a distance metric) to their counterparts in $\hat{Q}$ and to adjacent vertices in $\Omega$. Figure 1 shows a small word graph with such edge connections; white nodes are labeled with the $Q$ vectors to be retrofitted (and correspond to $V_\Omega$); shaded nodes are labeled with the corresponding vectors in $\hat{Q}$, which are observed. The graph can be interpreted as a Markov random field (Kindermann and Snell, 1980).

The distance between a pair of vectors is defined to be the Euclidean distance. Since we want the inferred word vector to be close to the observed value $\hat{q}_i$ and close to its neighbors $q_j$, $\forall j$ such that $(i, j) \in E$, the objective to be minimized becomes:

$$\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \|q_i - \hat{q}_i\|^2 + \sum_{(i,j) \in E} \beta_{ij} \|q_i - q_j\|^2 \Big]$$

where $\alpha$ and $\beta$ values control the relative strengths of associations (more details in §6.1).

In this case, we first train the word vectors independent of the information in the semantic lexicons and then retrofit them. $\Psi$ is convex in $Q$ and its solution can be found by solving a system of linear equations. To do so, we use an efficient iterative updating method (Bengio et al., 2006; Subramanya et al., 2010; Das and Petrov, 2011; Das and Smith, 2011). The vectors in $Q$ are initialized to be equal to the vectors in $\hat{Q}$. We take the first derivative of $\Psi$ with respect to one $q_i$ vector, and by equating it to zero arrive at the following online update:

$$q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i} \qquad (1)$$

In practice, running this procedure for 10 iterations converges to changes in Euclidean distance of adjacent vertices of less than $10^{-2}$. The retrofitting approach described above is modular; it can be applied to word vector representations obtained from any model as the updates in Eq. 1 are agnostic to the original vector training model objective.

Semantic Lexicons during Learning. Our proposed approach is reminiscent of recent work on improving word vectors using lexical resources (Yu and Dredze, 2014; Bian et al., 2014; Xu et al., 2014) which alters the learning objective of the original vector training model with a prior (or a regularizer) that encourages semantically related vectors (in $\Omega$) to be close together, except that our technique is applied as a second stage of learning. We describe the […]

(Slide annotations on Figure 1: grey = vectors obtained from the corpus (input); white = retrofitted vectors)
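As noted above, here is a minimal numeric sketch of the objective Ψ(Q) from the excerpt, evaluated on a toy vocabulary; the vectors, edge set, and hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def objective(Q, Q_hat, edges, alpha, beta):
    """Psi(Q) = sum_i alpha_i * ||q_i - q_hat_i||^2
              + sum_{(i,j) in E} beta_ij * ||q_i - q_j||^2"""
    corpus_term = sum(alpha[i] * np.sum((Q[i] - Q_hat[i]) ** 2)
                      for i in range(len(Q)))
    lexicon_term = sum(beta[i, j] * np.sum((Q[i] - Q[j]) ** 2)
                       for (i, j) in edges)
    return corpus_term + lexicon_term

# Toy setup: three words; words 0 and 1 are related in the lexicon.
Q_hat = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
edges = [(0, 1)]
alpha = [1.0, 1.0, 1.0]      # alpha_i = 1, as on the slide
beta = {(0, 1): 1.0}         # beta_ij = 1 / degree(i); degree(0) = 1 here
print(objective(Q_hat.copy(), Q_hat, edges, alpha, beta))  # 2.0: only the edge term
```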

Page 6: Retrofitting Word Vectors to Semantic Lexicons

How the objective is solved

• Find the solution by iterative updates
– For each q_i, repeatedly update it to the value that minimizes the objective (see the sketch after the update rule below)

– Each q_i is initialized with the input vector

• Empirically, 10 iterations suffice: the change in Euclidean distance between adjacent vectors falls below 0.01



Update rule (Eq. 1):

$$q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i}$$
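A minimal sketch of iterating this update, with α_i = 1 and β_ij = 1/degree(i) as on page 5. This is an illustration under those assumptions, not the authors' released tool (which is at https://github.com/mfaruqui/retrofitting); the function and variable names are my own.

```python
import numpy as np

def retrofit(Q_hat, neighbors, n_iter=10):
    """Online updates of Eq. 1:
    q_i <- (sum_j beta_ij q_j + alpha_i q_hat_i) / (sum_j beta_ij + alpha_i),
    with alpha_i = 1 and beta_ij = 1 / degree(i)."""
    Q = Q_hat.copy()                 # initialize Q with the input vectors
    for _ in range(n_iter):          # 10 iterations suffice in practice (page 6)
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue             # no lexicon neighbors: q_i stays put
            beta = 1.0 / len(nbrs)
            numerator = sum(beta * Q[j] for j in nbrs) + Q_hat[i]  # alpha_i = 1
            denominator = beta * len(nbrs) + 1.0
            Q[i] = numerator / denominator
    return Q

# Toy example: word 2 is a lexicon neighbor of words 0 and 1.
Q_hat = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
neighbors = {0: [2], 1: [2], 2: [0, 1]}
print(retrofit(Q_hat, neighbors))
```

After retrofitting, words 0 and 1 move toward their shared neighbor while all three stay close to their input vectors.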

Page 7: Retrofitting Word Vectors to Semantic Lexicons

Experiments

• Input: various publicly released vectors
– Glove [Pennington+ 14]: models co-occurrence statistics with vectors
– SG [Mikolov+ 13]: trained to predict the surrounding words
– GC [Huang+ 12]: trained by combining local and document-level context
– Multi [Faruqui+ 14]: CCA applied to word vectors across languages

• External knowledge: various semantic lexicons
– PPDB: a paraphrase database, collected by treating words that translate to the same foreign word as paraphrases

– WordNet: a hand-built lexicon (synonyms only (syn), or synonyms plus hypernyms/hyponyms (all))
– FrameNet: a frame lexicon; edges connect words that share a frame

• Evaluate gains on a variety of tasks
– Word similarity
– TOEFL: choose, from candidate answers, the word with the same meaning as a given word
– Syntactic word analogy
– Sentiment analysis: a classifier built on the average of the word vectors in the sentence as features (a sketch follows this list)
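For the sentiment-analysis setup in the last bullet, a hedged sketch: the feature vector for a sentence is the mean of its word vectors, fed to a classifier. The toy data and the choice of scikit-learn's LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_features(sentences, word_vecs, dim):
    """Feature for a sentence = mean of its word vectors (as on the slide)."""
    feats = np.zeros((len(sentences), dim))
    for k, sent in enumerate(sentences):
        vecs = [word_vecs[w] for w in sent if w in word_vecs]
        if vecs:
            feats[k] = np.mean(vecs, axis=0)
    return feats

# Tiny made-up example; the real experiments would use retrofitted vectors.
word_vecs = {"good": np.array([1.0, 0.2]), "bad": np.array([-1.0, 0.1])}
X = sentence_features([["good"], ["bad"]], word_vecs, dim=2)
clf = LogisticRegression().fit(X, [1, 0])
print(clf.predict(X))  # expected: [1 0]
```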


Page 8: Retrofitting Word Vectors to Semantic Lexicons

Results (improvement on each task)


Gains on every task except SYN-REL (syntactic word analogy) → retrofitting adds semantic information to the word vectors and improves their quality

Page 9: Retrofitting Word Vectors to Semantic Lexicons

Measuring the effect of post-processing

• External knowledge can also be incorporated at training time

– Two ways of incorporating it are tested

• Consider a log-bilinear model

– Introduce the knowledge as a regularization term during training

• with lazy updates every 100,000 words (lazy)

– During stochastic gradient descent, re-apply the proposed retrofitting to the vectors every k examples (periodic; see the sketch below)
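A hedged sketch of the periodic variant just described: plain SGD interleaved with retrofitting every k examples. `sgd_step` is a hypothetical stand-in for the log-bilinear gradient update, and `retrofit` is the sketch from page 6; neither is from the original slides.

```python
def train_periodic(Q, neighbors, examples, sgd_step, k=100_000):
    """'periodic': train with plain SGD, but every k examples snap the
    current vectors back toward the lexicon graph by retrofitting them
    (instead of folding the lexicon into the training objective)."""
    for t, example in enumerate(examples, start=1):
        sgd_step(Q, example)            # hypothetical log-bilinear update
        if t % k == 0:
            Q = retrofit(Q, neighbors, n_iter=10)  # sketch from page 6
    return Q
```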


Page 10: Retrofitting Word Vectors to Semantic Lexicons

Results

• Even lazy yields gains

• periodic improves substantially over lazy

• retrofitting (the proposed method) is competitive with periodic and sometimes outperforms it


Page 11: Retrofitting Word Vectors to Semantic Lexicons

Comparison with prior work

• Outperforms [Yu+ 14] on every task

• Also outperforms [Xu+ 14] on almost every task


Page 12: Retrofitting Word Vectors to Semantic Lexicons

Summary

• Proposed a method for incorporating external knowledge into word vector representations

– Applied as a post-processing step, so it can be used with vectors built by any method

• Verified the gains from the proposed method experimentally

– Outperforms existing methods that use external knowledge
