abstractive text summarization @retrieva seminar

Abstractive Text Summarization

2017/05/17 Retrieva Seminar. Part-time staff: Tomonori Kodaira (小平知範)

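Self-introduction

• Tomonori Kodaira (小平知範, @kdaira_)

• Second-year master's student, Graduate School of Tokyo Metropolitan University (Komachi Lab)

• Research areas: summarization and text simplification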

Outline

• About abstractive text summarization

• Generation models using RNNs

• Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (Nallapati et al., CoNLL’16)

• Get To The Point: Summarization with Pointer-Generator Networks (See et al., ACL’17)

Abstractive Text Summarization

• Title

• Summary

• Article

http://www.dailymail.co.uk/news/article-4497890/Samurai-swords-axes-air-guns-brought-school.html

Sequence-to-Sequence Model

[Figure: baseline sequence-to-sequence model with attention. Source: Get To The Point: Summarization with Pointer-Generator Networks (See et al., ACL’17), Figure 2]

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (Nallapati et al., CoNLL’16)

• Task: abstractive text summarization

• Problems addressed: (1) the structure of the document is not captured; (2) handling of <UNK> tokens

• Approach: (1) a hierarchical encoder, with an attention model that takes the hierarchy into account; (2) the large vocabulary trick and a generator/pointer

Proposed models

1. Encoder-Decoder RNN with Attention and Large Vocabulary Trick

2. Capturing Keywords using Feature-rich Encoder

3. Modeling Rare/Unseen Words using Switching Generator-Pointer

4. Capturing Hierarchical Document Structure with Hierarchical Attention

1. Encoder-Decoder RNN with Attention and Large Vocabulary Trick

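• Base model: the NMT model of Bahdanau et al. (2014). The encoder is a bidirectional GRU-RNN, the decoder is a unidirectional GRU-RNN, and attention is used.

• Plus: the large vocabulary 'trick' (LVT) (Jean et al., 2014)

The decoder vocabulary for each mini-batch consists of the words in that mini-batch, plus high-frequency words added until the vocabulary reaches a predefined size.

(Unlike translation, this trick works here because the source and target are in the same language.)

[Figure: words in the mini-batch source documents + high-frequency words = N]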
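The vocabulary construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lvt_decoder_vocab`, the toy corpus, and the choice to keep all batch words even if they exceed the size limit are assumptions.

```python
from collections import Counter

def lvt_decoder_vocab(batch_sources, corpus_counts, max_size):
    """Build the decoder vocabulary for one mini-batch (large vocabulary trick)."""
    vocab, seen = [], set()
    # 1) Every word that appears in the source documents of this mini-batch.
    for doc in batch_sources:
        for word in doc:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    # 2) Top up with the most frequent corpus words until max_size is reached.
    #    (Assumption: batch words are never dropped, even if they exceed max_size.)
    for word, _ in corpus_counts.most_common():
        if len(vocab) >= max_size:
            break
        if word not in seen:
            seen.add(word)
            vocab.append(word)
    return vocab

# Toy data (hypothetical).
corpus_counts = Counter("the cat sat on the mat the dog sat".split())
batch = [["a", "samurai", "sword"], ["the", "sword"]]
vocab = lvt_decoder_vocab(batch, corpus_counts, max_size=6)
```

Because the softmax is computed only over this per-batch vocabulary, the output layer stays small while rare source words remain available to the decoder.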

2. Capturing Keywords using Feature-rich Encoder

• Base: word embeddings

• Plus: linguistic features (one-hot representations): POS tags, named-entity tags, and TF and IDF

• The additional features are used only on the encoder side; the decoder side uses word embeddings only
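As a rough sketch of the encoder input described above, the word embedding can be concatenated with one-hot feature vectors. The tag sets, the equal-width binning of TF and IDF, the bin count, and all names below are assumptions made for illustration only.

```python
def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def bin_index(value, num_bins, max_value):
    """Discretize a non-negative value into one of num_bins equal-width bins."""
    clipped = min(max(value, 0.0), max_value)
    return min(int(clipped / max_value * num_bins), num_bins - 1)

POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]  # hypothetical tag set
NER_TAGS = ["O", "PERSON", "LOCATION"]       # hypothetical tag set

def encoder_input(word_vec, pos, ner, tf, idf, num_bins=4):
    """Concatenate the word embedding with one-hot linguistic features."""
    return (list(word_vec)
            + one_hot(POS_TAGS.index(pos), len(POS_TAGS))
            + one_hot(NER_TAGS.index(ner), len(NER_TAGS))
            + one_hot(bin_index(tf, num_bins, max_value=10.0), num_bins)
            + one_hot(bin_index(idf, num_bins, max_value=10.0), num_bins))

# A 2-dimensional toy embedding plus 4 + 3 + 4 + 4 feature dimensions.
x = encoder_input([0.1, -0.3], pos="NOUN", ner="PERSON", tf=2.0, idf=7.5)
```

The decoder would receive only `word_vec`, matching the note that the extra features are encoder-only.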

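3. Modeling Rare/Unseen Words using Switching Generator-Pointer

• A switching generator-pointer for handling unknown words

• Switch probability (in Nallapati et al., $s_i = 1$ means the decoder generates a word from its vocabulary, and $s_i = 0$ means it points to a position in the source):

$P(s_i = 1) = \sigma(\mathbf{v}^s \cdot (\mathbf{W}^s_h \mathbf{h}_i + \mathbf{W}^s_e \mathbf{E}[o_{i-1}] + \mathbf{W}^s_c \mathbf{c}_i + \mathbf{b}^s))$

where $\mathbf{h}_i$ is the decoder hidden state, $\mathbf{E}[o_{i-1}]$ is the embedding of the word the decoder emitted at the previous step, and $\mathbf{c}_i$ is the attention-weighted context vector.

• The pointer selects a word on the source side:

$P^a_i(j) \propto \exp(\mathbf{v}^a \cdot (\mathbf{W}^a_h \mathbf{h}_{i-1} + \mathbf{W}^a_e \mathbf{E}[o_{i-1}] + \mathbf{W}^a_c \mathbf{h}^d_j + \mathbf{b}^a))$

$p_i = \arg\max_j P^a_i(j)$ for $j \in \{1, \ldots, N_d\}$

where $j$ is the position of a word in the document and $\mathbf{h}^d_j$ is the encoder hidden state at position $j$.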
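The two quantities above can be sketched in plain Python. This is an illustrative toy, assuming 2-dimensional states and identity weight matrices; the parameter dictionary and helper names are hypothetical, and the pointer scores are normalized with a softmax.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def add(*vecs):
    return [sum(xs) for xs in zip(*vecs)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def switch_prob(p, h_i, e_prev, c_i):
    """P(s_i = 1) = sigmoid(v . (W_h h_i + W_e E[o_{i-1}] + W_c c_i + b))."""
    inner = add(matvec(p["W_h"], h_i), matvec(p["W_e"], e_prev),
                matvec(p["W_c"], c_i), p["b"])
    return 1.0 / (1.0 + math.exp(-dot(p["v"], inner)))

def pointer_dist(p, h_prev, e_prev, enc_states):
    """P^a_i(j) over source positions j, from the same kind of score."""
    scores = []
    for h_d in enc_states:
        inner = add(matvec(p["W_h"], h_prev), matvec(p["W_e"], e_prev),
                    matvec(p["W_c"], h_d), p["b"])
        scores.append(dot(p["v"], inner))
    return softmax(scores)

# Toy parameters (assumption: identity matrices, zero bias).
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {"W_h": identity, "W_e": identity, "W_c": identity,
          "b": [0.0, 0.0], "v": [1.0, 1.0]}
zero = [0.0, 0.0]

p_switch = switch_prob(params, zero, zero, zero)
dist = pointer_dist(params, zero, zero, [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
pointed = max(range(len(dist)), key=lambda j: dist[j])  # arg max_j P^a_i(j)
```

In a trained model the switch and the pointer would each have their own learned parameters; a single toy dictionary is reused here only to keep the sketch short.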

[Figure: switching generator-pointer model, annotated with the switch probability $P(s_i = 1)$ and the pointer distribution $P^a_i(j)$, together with their inputs $\mathbf{h}^d_j$, $\mathbf{E}[o_{i-1}]$, and $\mathbf{h}_{i-1}$]

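4. Capturing Hierarchical Document Structure with Hierarchical Attention

• Bidirectional RNNs are run at both the sentence level and the word level. A feature indicating the position of the sentence in the document is added to the sentence-level LSTM.

• A softmax that takes both sentence-level attention and word-level attention into account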
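One way to combine the two attention levels, as far as I recall it from Nallapati et al., is to re-weight each word's attention by the attention of the sentence that contains it and then renormalize. The sketch below assumes that form; the function and variable names are hypothetical.

```python
def hierarchical_attention(word_attn, sent_attn, sent_of_word):
    """Re-scale word-level attention by sentence-level attention, then renormalize.

    word_attn[j]    : word-level attention at source position j
    sent_attn[s]    : sentence-level attention for sentence s
    sent_of_word[j] : index of the sentence that contains position j
    """
    rescaled = [word_attn[j] * sent_attn[sent_of_word[j]]
                for j in range(len(word_attn))]
    z = sum(rescaled)
    return [r / z for r in rescaled]

# Two sentences with two words each; the first sentence is judged more important.
word_attn = [0.25, 0.25, 0.25, 0.25]
sent_attn = [0.8, 0.2]
sent_of_word = [0, 0, 1, 1]
combined = hierarchical_attention(word_attn, sent_attn, sent_of_word)
```

With uniform word attention, the combined distribution simply follows the sentence weights, so words in the first sentence end up with four times the attention of words in the second.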

Result 1

• Results on the Gigaword corpus (the sentence summarization data used by Rush et al.)

Result 2

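• Results on the CNN/Daily Mail data

• The amount of data is small, and the task is more complex (the input is a full article and the output is a multi-sentence summary), so the proposed methods brought little improvement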

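Get To The Point: Summarization with Pointer-Generator Networks (See et al., ACL’17)

Overview

• 1. Hybrid pointer-generator network: a pointer that copies words from the source, and a generator that produces new words (words not in the source)

• 2. Coverage mechanism: keeps track of what has been output so far, which prevents the same words from being generated repeatedly

• State of the art on the CNN/Daily Mail data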

Baseline Sequence-to-Sequence Model

[Figure: baseline sequence-to-sequence model with attention]

Generator-Pointer

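[Figure: pointer-generator model, with the context vector $h_t^*$, decoder state $s_t$, decoder input $x_t$, and attention distribution $a^t$ labeled]

$w$ ranges over the union of the vocabulary and the source words (vocabulary ∪ source words).

If $w$ is OOV, $P_{vocab}(w) = 0$.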
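As a hedged sketch of how the two distributions are mixed: in See et al., as I understand it, a generation probability $p_{gen}$ (computed from $h_t^*$, $s_t$, and $x_t$) weights the vocabulary distribution, and $1 - p_{gen}$ weights the attention mass on the source positions holding each word. The code below takes $p_{gen}$ as given; all names and numbers are illustrative assumptions.

```python
def final_distribution(p_gen, p_vocab, attention, source_words):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of a_i over positions i with w_i == w.

    Words missing from p_vocab are OOV, so their P_vocab(w) is 0.
    """
    final = {w: p_gen * p for w, p in p_vocab.items()}
    for a, w in zip(attention, source_words):
        final[w] = final.get(w, 0.0) + (1.0 - p_gen) * a
    return final

# Toy example: "samurai" is OOV but can still be produced by copying.
p_vocab = {"the": 0.5, "sword": 0.5}
source = ["samurai", "sword", "samurai"]
attention = [0.5, 0.25, 0.25]
dist = final_distribution(0.5, p_vocab, attention, source)
```

Here "samurai" receives probability only through the copy term, which is how the model can emit source words that are outside its vocabulary.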

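Coverage Mechanism

• Adapts the coverage model of Tu et al. (2016).

• Coverage vector $c^t$ (unnormalized): the sum of the attention distributions from all previous decoder steps, which explicitly tells the model which parts of the source text it has attended to so far.

• Attention: the coverage vector is given to the attention mechanism as an additional input.

• Loss: the model is trained to attend away from the words it has already looked at (those with high coverage $c_i^t$).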
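The bookkeeping can be sketched as below. The per-step loss used here, $\sum_i \min(a_i^t, c_i^t)$, is the coverage loss as I recall it from See et al.; treat that form, and the function name `coverage_step`, as assumptions. The coverage-aware attention scores themselves are not modeled.

```python
def coverage_step(coverage, attention):
    """Return (coverage loss for this step, updated coverage vector).

    coverage  : c^t, the sum of all previous attention distributions
    attention : a^t, the attention distribution at the current step
    """
    # Penalize overlap between the current attention and what was already covered.
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    # c^{t+1} = c^t + a^t (no normalization).
    new_coverage = [c + a for c, a in zip(coverage, attention)]
    return loss, new_coverage

coverage = [0.0, 0.0, 0.0]
steps = [
    [1.0, 0.0, 0.0],  # first look at position 0: nothing covered yet
    [1.0, 0.0, 0.0],  # looking at position 0 again is penalized
    [0.0, 1.0, 0.0],  # moving to an uncovered position costs nothing
]
losses = []
for attn in steps:
    loss, coverage = coverage_step(coverage, attn)
    losses.append(loss)
```

The second step is the only one with a non-zero loss, which matches the intuition that the model is pushed to look away from positions it has already attended to.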

Experiments

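• Vocabulary: 50k words (for both source and target)

• Articles are limited to 400 tokens at both training and test time.

• Summary length is limited to 100 tokens during training and 120 at test time.

• The pointer-generator model was trained for about 230k iterations; the coverage mechanism was then added and the model was trained for a further 3k iterations.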

Result

• * (Nallapati et al. have not released their test set)

Discussion

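• The lead-3 baseline results indicate that, in this corpus, the important information is found in the first part of each article.

• In a preliminary experiment that compared training on the first 400 tokens and on the first 800 tokens of each article, the model trained on 400 tokens achieved the higher ROUGE score.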
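For reference, a lead-3 baseline and the article truncation can be sketched as follows. The regex-based sentence splitter is a naive assumption (real pipelines typically use a proper tokenizer), and the function names are hypothetical.

```python
import re

def lead_n(article, n=3):
    """Use the first n sentences of the article as the summary."""
    # Naive split after '.', '!' or '?' followed by whitespace (assumption).
    parts = re.split(r"(?<=[.!?])\s+", article.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    return " ".join(sentences[:n])

def truncate_tokens(tokens, max_len=400):
    """Keep only the first max_len tokens of the article."""
    return tokens[:max_len]

article = "First point. Second point! Third point? Fourth point."
summary = lead_n(article)
tokens = truncate_tokens(article.split(), max_len=4)
```

Both functions favor the start of the article, which is consistent with the observation that the key information in this corpus tends to appear early.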

How abstractive is the model?

Note: the baseline generates many novel words, but most of them are errors.
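One simple way to quantify abstractiveness is the fraction of summary n-grams that never occur in the article. The sketch below assumes whitespace tokenization and this particular definition; it is not taken from the slides.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_rate(summary_tokens, article_tokens, n):
    """Fraction of the summary's n-grams that do not appear in the article."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    article_set = set(ngrams(article_tokens, n))
    novel = sum(1 for g in summary_ngrams if g not in article_set)
    return novel / len(summary_ngrams)

article = "students brought samurai swords to school".split()
summary = "pupils brought samurai swords".split()
unigram_rate = novel_ngram_rate(summary, article, 1)
bigram_rate = novel_ngram_rate(summary, article, 2)
```

A high rate alone does not mean a better summary: as noted above, novel words produced by the baseline are often errors.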

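Repetition

Although the coverage phase accounts for only about 1% of the training time, the rate of repeated words becomes close to that of the reference summaries.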
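A repetition rate of this kind can be approximated by counting n-gram occurrences beyond the first within a single summary. The exact definition below (extra occurrences divided by the total number of n-grams) is an assumption for illustration.

```python
from collections import Counter

def duplicate_ngram_rate(tokens, n):
    """Fraction of n-gram occurrences that repeat an earlier n-gram in the same text."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Every occurrence after the first one counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(grams)

summary = "the sword the sword was found".split()
unigram_dup = duplicate_ngram_rate(summary, 1)
bigram_dup = duplicate_ngram_rate(summary, 2)
```

Computing the same rate on reference summaries gives the target level that a system summary can be compared against.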
