

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2273–2284, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


Multi-Style Generative Reading Comprehension

Kyosuke Nishida1, Itsumi Saito1, Kosuke Nishida1, Kazutoshi Shinoda2∗, Atsushi Otsuka1, Hisako Asano1, Junji Tomita1

1NTT Media Intelligence Laboratory, NTT Corporation   2The University of Tokyo
[email protected]

Abstract

This study tackles generative reading comprehension (RC), which consists of answering questions based on textual evidence and natural language generation (NLG). We propose a multi-style abstractive summarization model for question answering, called Masque. The proposed model has two key characteristics. First, unlike most studies on RC that have focused on extracting an answer span from the provided passages, our model instead focuses on generating a summary from the question and multiple passages. This serves to cover various answer styles required for real-world applications. Second, whereas previous studies built a specific model for each answer style because of the difficulty of acquiring one general model, our approach learns multi-style answers within a model to improve the NLG capability for all styles involved. This also enables our model to give an answer in the target style. Experiments show that our model achieves state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. We observe that the transfer of the style-independent NLG capability to the target style is the key to its success.

1 Introduction

Question answering has been a long-standing research problem. Recently, reading comprehension (RC), a challenge to answer a question given textual evidence provided in a document set, has received much attention. Current mainstream studies have treated RC as a process of extracting an answer span from one passage (Rajpurkar et al., 2016, 2018) or multiple passages (Joshi et al., 2017; Yang et al., 2018), which is usually done by predicting the start and end positions of the answer (Yu et al., 2018; Devlin et al., 2018).

∗Work done during an internship at NTT.

[Figure 1: Visualization of how our model generates an answer on MS MARCO for the question "how long to get nys tax refund". Given an answer style (top, NLG: "it takes 10 weeks to get new york state tax refund . </s>"; bottom, Q&A: "10 weeks </s>"), the model controls the mixture of three distributions for generating words from a vocabulary and copying words from the question and multiple passages at each decoding step.]

The demand for answering questions in natural language is increasing rapidly, and this has led to the development of smart devices such as Alexa. In comparison with answer span extraction, however, the natural language generation (NLG) capability for RC has been less studied. While datasets such as MS MARCO (Bajaj et al., 2018) and NarrativeQA (Kocisky et al., 2018) have been proposed for providing abstractive answers, the state-of-the-art methods for these datasets are based on answer span extraction (Wu et al., 2018; Hu et al., 2018). Generative models suffer from a dearth of training data to cover open-domain questions.

Moreover, to satisfy various information needs, intelligent agents should be capable of answering one question in multiple styles, such as well-formed sentences, which make sense even without the context of the question and passages, and concise phrases. These capabilities complement each other, but previous models could not use and control different styles within a single model.

In this study, we propose Masque, a generative model for multi-passage RC. It achieves state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. The main contributions of this study are as follows.

Multi-source abstractive summarization. We introduce the pointer-generator mechanism (See et al., 2017) for generating an abstractive answer from the question and multiple passages, which covers various answer styles. We extend the mechanism to a Transformer-based one (Vaswani et al., 2017) that allows words to be generated from a vocabulary and to be copied from the question and passages.

Multi-style learning for style control and transfer. We introduce multi-style learning that enables our model to control answer styles and improves RC for all styles involved. We also extend the pointer-generator to a conditional decoder by introducing an artificial token corresponding to each style, as in (Johnson et al., 2017). For each decoding step, it controls the mixture weights over three distributions with the given style (Figure 1).

2 Problem Formulation

This paper considers the following task:

PROBLEM 1. Given a question with $J$ words $x^q = \{x^q_1, \ldots, x^q_J\}$, a set of $K$ passages, where the $k$-th passage is composed of $L$ words $x^{p_k} = \{x^{p_k}_1, \ldots, x^{p_k}_L\}$, and an answer style label $s$, an RC model outputs an answer $y = \{y_1, \ldots, y_T\}$ conditioned on the style.

In short, given a 3-tuple $(x^q, \{x^{p_k}\}, s)$, the system predicts $P(y)$. The training data is a set of 6-tuples: $(x^q, \{x^{p_k}\}, s, y, a, \{r^{p_k}\})$, where $a$ and $\{r^{p_k}\}$ are optional. Here, $a$ is 1 if the question is answerable with the provided passages and 0 otherwise, and $r^{p_k}$ is 1 if the $k$-th passage is required to formulate the answer and 0 otherwise.
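To make the data format concrete, here is a minimal sketch of one training 6-tuple as a Python data structure; the class and field names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RCExample:
    """One 6-tuple training example; field names are illustrative."""
    question: List[str]              # x^q: J question words
    passages: List[List[str]]        # {x^{p_k}}: K passages of L words each
    style: str                       # s: answer style label, e.g. "NLG" or "Q&A"
    answer: Optional[List[str]]      # y: T answer words (absent if unanswerable)
    answerable: Optional[int]        # a: 1 if answerable, else 0 (optional)
    relevances: Optional[List[int]]  # {r^{p_k}}: 1 if passage k is required (optional)

example = RCExample(
    question="how long to get nys tax refund".split(),
    passages=[["it", "takes", "10", "weeks", "..."]],
    style="Q&A",
    answer=["10", "weeks"],
    answerable=1,
    relevances=[1],
)
```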

3 Proposed Model

We propose a Multi-style Abstractive Summarization model for QUEstion answering, called Masque. Masque directly models the conditional probability $p(y|x^q, \{x^{p_k}\}, s)$. As shown in Figure 2, it consists of the following modules.

1. The question-passages reader (§3.1) models interactions between the question and passages.

2. The passage ranker (§3.2) finds passages relevant to the question.

3. The answer possibility classifier (§3.3) identifies answerable questions.

4. The answer sentence decoder (§3.4) outputs an answer sentence conditioned on the target style.

[Figure 2: Masque model architecture. The reader stacks a GloVe+ELMo embedding layer with highway networks, shared encoder blocks, a dual attention layer, and modeling encoder blocks over the question and the passages; its outputs feed the passage ranker, the answer possibility classifier, and a Transformer decoder with a multi-source pointer-generator and combined attention.]

Our model is based on multi-source abstractive summarization: the answer that it generates can be viewed as a summary from the question and passages. The model also learns multi-style answers together. With these two characteristics, we aim to acquire the style-independent NLG ability and transfer it to the target style. In addition, to improve natural language understanding in the reader module, our model considers RC, passage ranking, and answer possibility classification together as multi-task learning.

3.1 Question-Passages Reader

The reader module is shared among multiple answer styles and the three task-specific modules.

3.1.1 Word Embedding Layer

Let $x^q$ and $x^{p_k}$ represent one-hot vectors (of size $V$) for words in the question and the $k$-th passage. First, this layer projects each of the vectors to a $d_{word}$-dimensional vector with a pre-trained weight matrix $W^e \in \mathbb{R}^{d_{word} \times V}$ such as GloVe (Pennington et al., 2014). Next, it uses contextualized word representations via ELMo (Peters et al., 2018), which allows our model to use morphological clues to form robust representations for out-of-vocabulary words unseen in training. Then, the concatenation of the word and contextualized vectors is passed to a two-layer highway network (Srivastava et al., 2015) to fuse the two types of embeddings, as in (Seo et al., 2017). The highway network is shared by the question and passages.
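As a rough illustration, the following PyTorch sketch shows a two-layer highway network fusing concatenated GloVe and ELMo vectors; the dimensions and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Two-layer highway network (Srivastava et al., 2015); a sketch, not the paper's code."""
    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))                         # transform gate
            x = g * torch.relu(transform(x)) + (1.0 - g) * x   # carry the rest through
        return x

# Fuse the two embedding types: concatenate GloVe and ELMo vectors, then apply
# the shared highway network (hypothetical dimensions for illustration).
glove = torch.randn(2, 7, 300)   # (batch, seq_len, d_glove)
elmo = torch.randn(2, 7, 1024)   # (batch, seq_len, d_elmo)
fused = Highway(dim=300 + 1024)(torch.cat([glove, elmo], dim=-1))
```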

3.1.2 Shared Encoder Layer

This layer uses a stack of Transformer blocks, which are shared by the question and passages, on top of the embeddings provided by the word embedding layer. The input of the first block is immediately mapped to a $d$-dimensional vector by a linear transformation. The outputs of this layer are $E^{p_k} \in \mathbb{R}^{d \times L}$ for each $k$-th passage, and $E^q \in \mathbb{R}^{d \times J}$ for the question.

Transformer encoder block. The block consists of two sub-layers: a self-attention layer and a position-wise feed-forward network. For the self-attention layer, we adopt the multi-head attention mechanism (Vaswani et al., 2017). Following GPT (Radford et al., 2018), the feed-forward network consists of two linear transformations with a GELU (Hendrycks and Gimpel, 2016) activation function in between. Each sub-layer is placed inside a residual block (He et al., 2016). For an input $x$ and a given sub-layer function $f$, the output is $\mathrm{LN}(f(x) + x)$, where $\mathrm{LN}$ indicates layer normalization (Ba et al., 2016). To facilitate these residual connections, all sub-layers produce a sequence of $d$-dimensional vectors. Note that our model does not use any position embeddings in this block because ELMo gives the positional information of the words in each sequence.
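A minimal PyTorch sketch of such a block, assuming illustrative hyperparameters, might look like this:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of the encoder block above: multi-head self-attention and a GELU
    feed-forward network, each wrapped as LN(f(x) + x). Sizes are illustrative."""
    def __init__(self, d: int = 256, heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual + post-layer-norm, matching LN(f(x) + x); no position
        # embeddings here, since ELMo already encodes word order.
        x = self.norm1(self.attn(x, x, x, need_weights=False)[0] + x)
        return self.norm2(self.ffn(x) + x)
```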

3.1.3 Dual Attention Layer

This layer uses a dual attention mechanism to fuse information from the question to the passages as well as from the passages to the question.

It first computes a similarity matrix $U^{p_k} \in \mathbb{R}^{L \times J}$ between the question and the $k$-th passage, as done in (Seo et al., 2017), where

$$U^{p_k}_{lj} = {w^a}^\top [E^{p_k}_l; E^q_j; E^{p_k}_l \odot E^q_j]$$

indicates the similarity between the $l$-th word of the $k$-th passage and the $j$-th question word. The $w^a \in \mathbb{R}^{3d}$ are learnable parameters. The $\odot$ operator denotes the Hadamard product, and the $[;]$ operator denotes vector concatenation across the rows. Next, the layer obtains the row and column normalized similarity matrices $A^{p_k} = \mathrm{softmax}_j({U^{p_k}}^\top)$ and $B^{p_k} = \mathrm{softmax}_l(U^{p_k})$. It then uses DCN (Xiong et al., 2017) to obtain dual attention representations, $G^{q \to p_k} \in \mathbb{R}^{5d \times L}$ and $G^{p \to q} \in \mathbb{R}^{5d \times J}$:

$$G^{q \to p_k} = [E^{p_k}; \bar{A}^{p_k}; \bar{\bar{A}}^{p_k}; E^{p_k} \odot \bar{A}^{p_k}; E^{p_k} \odot \bar{\bar{A}}^{p_k}]$$
$$G^{p \to q} = [E^q; \bar{B}; \bar{\bar{B}}; E^q \odot \bar{B}; E^q \odot \bar{\bar{B}}].$$

Here, $\bar{A}^{p_k} = E^q A^{p_k}$, $\bar{B}^{p_k} = E^{p_k} B^{p_k}$, $\bar{\bar{A}}^{p_k} = \bar{B}^{p_k} A^{p_k}$, $\bar{\bar{B}}^{p_k} = \bar{A}^{p_k} B^{p_k}$, $\bar{B} = \max_k(\bar{B}^{p_k})$, and $\bar{\bar{B}} = \max_k(\bar{\bar{B}}^{p_k})$.
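For concreteness, the following sketch computes the dual attention for a single passage in row-major form (sequence × features); for multiple passages, $\bar{B}^{p_k}$ and $\bar{\bar{B}}^{p_k}$ would additionally be max-pooled over $k$ before forming $G^{p \to q}$. Shapes and names are illustrative.

```python
import torch

def dual_attention(E_p: torch.Tensor, E_q: torch.Tensor, w_a: torch.Tensor):
    """Sketch of the dual attention for one passage, following the equations above.
    E_p: (L, d) passage encodings, E_q: (J, d) question encodings, w_a: (3d,)."""
    L, d = E_p.shape
    J = E_q.shape[0]
    # Similarity matrix U (L, J): w_a^T [E_p_l; E_q_j; E_p_l ⊙ E_q_j]
    p = E_p.unsqueeze(1).expand(L, J, d)
    q = E_q.unsqueeze(0).expand(L, J, d)
    U = torch.cat([p, q, p * q], dim=-1) @ w_a          # (L, J)
    A = torch.softmax(U, dim=1)                          # normalize over question words
    B = torch.softmax(U, dim=0)                          # normalize over passage words
    A_bar = A @ E_q                                      # (L, d): question-aware passage repr.
    B_bar = B.transpose(0, 1) @ E_p                      # (J, d): passage-aware question repr.
    A_dbar = A @ B_bar                                   # (L, d)
    B_dbar = B.transpose(0, 1) @ A_bar                   # (J, d)
    G_q2p = torch.cat([E_p, A_bar, A_dbar, E_p * A_bar, E_p * A_dbar], dim=-1)  # (L, 5d)
    G_p2q = torch.cat([E_q, B_bar, B_dbar, E_q * B_bar, E_q * B_dbar], dim=-1)  # (J, 5d)
    return G_q2p, G_p2q
```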

3.1.4 Modeling Encoder Layer

This layer uses a stack of the Transformer encoder blocks for question representations and obtains $M^q \in \mathbb{R}^{d \times J}$ from $G^{p \to q}$. It also uses another stack for passage representations and obtains $M^{p_k} \in \mathbb{R}^{d \times L}$ from $G^{q \to p_k}$ for each $k$-th passage. The outputs of this layer, $M^q$ and $\{M^{p_k}\}$, are passed on to the answer sentence decoder; the $\{M^{p_k}\}$ are also passed on to the passage ranker and the answer possibility classifier.

3.2 Passage Ranker

The ranker maps the output of the modeling layer, $\{M^{p_k}\}$, to the relevance score of each passage. It takes the output for the first word, $M^{p_k}_1$, which corresponds to the beginning-of-sentence token, to obtain the aggregate representation of each passage sequence. Given $w^r \in \mathbb{R}^d$ as learnable parameters, it calculates the relevance of each $k$-th passage to the question as

$$\beta^{p_k} = \mathrm{sigmoid}({w^r}^\top M^{p_k}_1).$$

3.3 Answer Possibility Classifier

The classifier maps the output of the modeling layer to a probability for the answer possibility. It also takes the output for the first word, $M^{p_k}_1$, for all passages and concatenates them. Given $w^c \in \mathbb{R}^{Kd}$ as learnable parameters, it calculates the answer possibility for the question as

$$P(a) = \mathrm{sigmoid}({w^c}^\top [M^{p_1}_1; \ldots; M^{p_K}_1]).$$
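Both heads are simple linear-plus-sigmoid layers over the first-token states; a sketch under assumed shapes:

```python
import torch
import torch.nn as nn

class RankerAndClassifier(nn.Module):
    """Sketch of the two task heads above; shapes and names are illustrative.
    Both read the modeling-layer output for each passage's first (BOS) token."""
    def __init__(self, d: int, K: int):
        super().__init__()
        self.w_r = nn.Linear(d, 1)      # relevance head, one score per passage
        self.w_c = nn.Linear(K * d, 1)  # answerability head over concatenated BOS states

    def forward(self, M_p: torch.Tensor):
        # M_p: (K, L, d) modeling-layer outputs for K passages of length L
        bos = M_p[:, 0, :]                                  # (K, d): first-token states
        beta = torch.sigmoid(self.w_r(bos)).squeeze(-1)     # (K,): relevances β^{p_k}
        p_a = torch.sigmoid(self.w_c(bos.reshape(1, -1)))   # (1, 1): answer possibility P(a)
        return beta, p_a.squeeze()
```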

3.4 Answer Sentence Decoder

Given the outputs provided by the reader module, the decoder generates a sequence of answer words one element at a time. It is auto-regressive (Graves, 2013), consuming the previously generated words as additional input at each decoding step.


3.4.1 Word Embedding Layer

Let $y$ represent one-hot vectors of the words in the answer. This layer has the same components as the word embedding layer of the reader module, except that it uses a unidirectional ELMo to ensure that the predictions for position $t$ depend only on the known outputs at positions previous to $t$.

Artificial tokens. To be able to use multiple answer styles within a single system, our model introduces an artificial token corresponding to the style at the beginning of the answer ($y_1$), as done in (Johnson et al., 2017; Takeno et al., 2017). At test time, the user can specify the first token to control the style. This modification does not require any changes to the model architecture. Note that introducing the token at the decoder prevents the reader module from depending on the answer style.
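The trick amounts to one line of preprocessing; a sketch with illustrative token strings:

```python
def add_style_token(answer_tokens: list, style: str) -> list:
    """Prepend an artificial style token (e.g. "[NLG]" or "[Q&A]") to the target
    sequence; a sketch of the conditioning trick described above. The token
    strings are illustrative."""
    return [f"[{style}]"] + answer_tokens

# Training target for the NLG style; at test time the user supplies the first
# token to choose the style of the generated answer.
target = add_style_token("it takes 10 weeks ...".split(), "NLG")
# -> ['[NLG]', 'it', 'takes', '10', 'weeks', '...']
```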

3.4.2 Attentional Decoder Layer

This layer uses a stack of Transformer decoder blocks on top of the embeddings provided by the word embedding layer. The input is immediately mapped to a $d$-dimensional vector by a linear transformation, and the output is a sequence of $d$-dimensional vectors: $\{s_1, \ldots, s_T\}$.

Transformer decoder block. In addition to the sub-layers of the encoder block, this block has second and third sub-layers after the self-attention block and before the feed-forward network, as shown in Figure 2. As in (Vaswani et al., 2017), the self-attention sub-layer uses a subsequent mask to prevent positions from attending to subsequent positions. The second and third sub-layers perform the multi-head attention over $M^q$ and $M^{p_{all}}$, respectively. The $M^{p_{all}}$ is the concatenated outputs of the encoder stack for the passages,

$$M^{p_{all}} = [M^{p_1}, \ldots, M^{p_K}] \in \mathbb{R}^{d \times KL}.$$

Here, the $[,]$ operator denotes vector concatenation across the columns. This attention for the concatenated passages produces attention weights that are comparable between passages.
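A sketch of the decoder block, with illustrative hyperparameters and PyTorch's built-in multi-head attention standing in for the paper's implementation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of the decoder block above: masked self-attention, cross-attention
    over the question states M_q, cross-attention over the concatenated passage
    states M_pall, then a GELU feed-forward network; each sub-layer is LN(f(x)+x)."""
    def __init__(self, d: int = 256, heads: int = 8, d_ff: int = 1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.q_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.p_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, s, M_q, M_pall):
        T = s.size(1)
        # Subsequent mask: True entries are positions a step may NOT attend to.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=s.device), diagonal=1)
        s = self.norms[0](self.self_attn(s, s, s, attn_mask=causal)[0] + s)
        s = self.norms[1](self.q_attn(s, M_q, M_q)[0] + s)        # attend to question
        s = self.norms[2](self.p_attn(s, M_pall, M_pall)[0] + s)  # attend to all passages
        return self.norms[3](self.ffn(s) + s)
```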

3.4.3 Multi-source Pointer-Generator

Our extended mechanism allows both words to be generated from a vocabulary and words to be copied from both the question and multiple passages (Figure 3). We expect that the capability of copying words will be shared among answer styles.

[Figure 3: Multi-source pointer-generator mechanism. For each decoding step $t$, mixture weights $\lambda^v, \lambda^q, \lambda^p$ for the probability of generating words from the vocabulary and copying words from the question and the passages are calculated. The three distributions are weighted and summed to obtain the final distribution.]

Extended vocabulary distribution. Let the extended vocabulary, $V_{ext}$, be the union of the common words (a small subset of the full vocabulary, $V$, defined by the input-side word embedding matrix) and all words appearing in the input question and passages. $P^v$ then denotes the probability distribution of the $t$-th answer word, $y_t$, over the extended vocabulary. It is defined as:

$$P^v(y_t) = \mathrm{softmax}({W^2}^\top (W^1 s_t + b^1)),$$

where the output embedding $W^2 \in \mathbb{R}^{d_{word} \times V_{ext}}$ is tied with the corresponding part of the input embedding (Inan et al., 2017), and $W^1 \in \mathbb{R}^{d_{word} \times d}$ and $b^1 \in \mathbb{R}^{d_{word}}$ are learnable parameters. $P^v(y_t)$ is zero if $y_t$ is an out-of-vocabulary word for $V$.
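A sketch of this computation, assuming illustrative dimensions; the zero padding realizes the convention that $P^v$ assigns no mass to words outside $V$:

```python
import torch
import torch.nn as nn

d, d_word, V = 256, 300, 5000
embed = nn.Embedding(V, d_word)   # input-side embedding, reused for the output
W1 = nn.Linear(d, d_word)         # projects decoder state to the word space

def vocab_distribution(s_t: torch.Tensor, n_ext: int) -> torch.Tensor:
    """Sketch of P^v: project the decoder state s_t (d,) and score it against
    the tied input embedding; pad with zeros for the n_ext extra words that only
    appear in this example's question/passages (OOV for the generator)."""
    logits = embed.weight @ W1(s_t)              # (V,), weight tying
    P_v = torch.softmax(logits, dim=-1)
    return torch.cat([P_v, torch.zeros(n_ext)])  # zero probability on extended slots
```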

Copy distributions. A recent Transformer-based pointer-generator randomly chooses one of the attention heads to form a copy distribution; that approach gave no significant improvements in text summarization (Gehrmann et al., 2018).

In contrast, our model uses an additional attention layer for each copy distribution on top of the decoder stack. For the passages, the layer takes $s_t$ as the query and outputs $\alpha^p_t \in \mathbb{R}^{KL}$ as the attention weights and $c^p_t \in \mathbb{R}^d$ as the context vectors:

$$e^{p_k}_l = {w^p}^\top \tanh(W^{pm} M^{p_k}_l + W^{ps} s_t + b^p),$$
$$\alpha^p_t = \mathrm{softmax}([e^{p_1}; \ldots; e^{p_K}]), \quad (1)$$
$$c^p_t = \sum_l \alpha^p_{tl} M^{p_{all}}_l,$$

where $w^p, b^p \in \mathbb{R}^d$ and $W^{pm}, W^{ps} \in \mathbb{R}^{d \times d}$ are learnable parameters. For the question, our model uses another identical layer and obtains $\alpha^q_t \in \mathbb{R}^J$ and $c^q_t \in \mathbb{R}^d$. As a result, $P^q$ and $P^p$ are the copy distributions over the extended vocabulary:

$$P^q(y_t) = \sum_{j: x^q_j = y_t} \alpha^q_{tj}, \qquad P^p(y_t) = \sum_{l: x^{p_{k(l)}}_l = y_t} \alpha^p_{tl},$$

where $k(l)$ means the passage index corresponding to the $l$-th word in the concatenated passages.

Final distribution. The final distribution of $y_t$ is defined as a mixture of the three distributions:

$$P(y_t) = \lambda^v P^v(y_t) + \lambda^q P^q(y_t) + \lambda^p P^p(y_t),$$
$$\lambda^v, \lambda^q, \lambda^p = \mathrm{softmax}(W^m [s_t; c^q_t; c^p_t] + b^m),$$

where $W^m \in \mathbb{R}^{3 \times 3d}$ and $b^m \in \mathbb{R}^3$ are learnable parameters.
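The mixing step itself is small; a sketch with toy shapes (in practice $P^q$ and $P^p$ are built by scattering the attention weights $\alpha^q_t$ and $\alpha^p_t$ onto the extended vocabulary):

```python
import torch

def final_distribution(s_t, c_q, c_p, P_v, P_q, P_p, W_m, b_m):
    """Sketch of the mixture above: compute λ^v, λ^q, λ^p from the decoder state
    and the two context vectors, then mix the three distributions, all of which
    are already defined over the same extended vocabulary."""
    lam = torch.softmax(W_m @ torch.cat([s_t, c_q, c_p]) + b_m, dim=-1)  # (3,)
    return lam[0] * P_v + lam[1] * P_q + lam[2] * P_p

# Illustrative shapes: d = 4, extended vocabulary of 10 words.
d, V_ext = 4, 10
W_m, b_m = torch.randn(3, 3 * d), torch.randn(3)
s_t, c_q, c_p = torch.randn(d), torch.randn(d), torch.randn(d)
P_v = torch.softmax(torch.randn(V_ext), -1)
P_q = torch.softmax(torch.randn(V_ext), -1)  # stands in for the scattered α^q
P_p = torch.softmax(torch.randn(V_ext), -1)  # stands in for the scattered α^p
P = final_distribution(s_t, c_q, c_p, P_v, P_q, P_p, W_m, b_m)  # sums to 1
```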

3.4.4 Combined Attention

In order not to attend to words in irrelevant passages, our model introduces a combined attention. While the original technique combined word- and sentence-level attentions (Hsu et al., 2018), our model combines the word- and passage-level attentions. The word attention, Eq. 1, is re-defined as

$$\alpha^p_{tl} = \frac{\alpha^p_{tl} \, \beta^{p_{k(l)}}}{\sum_{l'} \alpha^p_{tl'} \, \beta^{p_{k(l')}}}.$$
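A sketch of this re-normalization; the toy example shows attention mass shifting toward the more relevant passage:

```python
import torch

def combined_attention(alpha_p: torch.Tensor, beta: torch.Tensor, k_of_l: torch.Tensor):
    """Sketch of the re-normalization above: down-weight word-level attention
    alpha_p (KL,) by the relevance beta (K,) of the passage each word belongs
    to, indexed by k_of_l (KL,), then re-normalize."""
    weighted = alpha_p * beta[k_of_l]
    return weighted / weighted.sum()

# Two passages of three words each; passage 0 is judged twice as relevant.
alpha_p = torch.full((6,), 1 / 6)
beta = torch.tensor([0.8, 0.4])
k_of_l = torch.tensor([0, 0, 0, 1, 1, 1])
print(combined_attention(alpha_p, beta, k_of_l))  # mass shifts toward passage 0
```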

3.5 Loss Function

We define the training loss as the sum of losses via

$$L(\theta) = L_{dec} + \gamma_{rank} L_{rank} + \gamma_{cls} L_{cls},$$

where $\theta$ is the set of all learnable parameters, and $\gamma_{rank}$ and $\gamma_{cls}$ are balancing parameters.

The loss of the decoder, $L_{dec}$, is the negative log-likelihood of the whole target answer sentence averaged over the $N_{able}$ answerable examples:

$$L_{dec} = -\frac{1}{N_{able}} \sum_{(a,y) \in D} \frac{a}{T} \sum_t \log P(y_t),$$

where $D$ is the training dataset. The losses of the passage ranker, $L_{rank}$, and the answer possibility classifier, $L_{cls}$, are the binary cross entropy between the true and predicted values averaged over all $N$ examples:

$$L_{rank} = -\frac{1}{NK} \sum_k \sum_{r^{p_k} \in D} \left( r^{p_k} \log \beta^{p_k} + (1 - r^{p_k}) \log(1 - \beta^{p_k}) \right),$$

$$L_{cls} = -\frac{1}{N} \sum_{a \in D} \left( a \log P(a) + (1 - a) \log(1 - P(a)) \right).$$
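A sketch of the combined objective for one mini-batch; the $\gamma$ values here are placeholders, not the paper's tuned settings:

```python
import torch
import torch.nn.functional as F

def total_loss(log_p_answer, answerable, beta, relevance, p_a,
               gamma_rank: float = 0.5, gamma_cls: float = 0.1):
    """Sketch of L(θ) above for one mini-batch; γ values are illustrative.
    log_p_answer: (N, T) per-token log P(y_t); answerable: (N,) float labels a;
    beta/relevance: (N, K) passage scores and float labels; p_a: (N,) predicted P(a)."""
    # Decoder NLL, averaged per token and counted only for answerable examples.
    per_example = -log_p_answer.mean(dim=1) * answerable
    L_dec = per_example.sum() / answerable.sum().clamp(min=1)
    # Binary cross entropies for the ranker and the classifier over all examples.
    L_rank = F.binary_cross_entropy(beta, relevance)
    L_cls = F.binary_cross_entropy(p_a, answerable)
    return L_dec + gamma_rank * L_rank + gamma_cls * L_cls
```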

Dataset        Subset    Train     Dev.      Eval.
MS MARCO       ALL       808,731   101,093   101,092
               ANS       503,370   55,636    –
               NLG       153,725   12,467    –
NarrativeQA    Summary   32,747    3,461     10,557

Table 1: Numbers of questions used in the experiments.

4 Experiments on MS MARCO 2.1

We evaluated our model on MS MARCO 2.1 (Bajaj et al., 2018). It is the sole dataset providing abstractive answers with multiple styles and serves as a great test bed for building open-domain QA agents with the NLG capability that can be used in smart devices. The details of our setup and output examples are in the supplementary material.

4.1 Setup

Datasets. MS MARCO 2.1 provides two tasks for generative open-domain QA: the Q&A task and the Q&A + Natural Language Generation (NLG) task. Both tasks consist of questions submitted to Bing by real users, and each question refers to ten passages. The dataset also includes annotations on the relevant passages, which were selected by humans to form the final answers, and on whether there was no answer in the passages.

Answer styles. We associated the two tasks with two answer styles. The NLG task requires a well-formed answer that is an abstractive summary of the question and passages, averaging 16.6 words. The Q&A task also requires an abstractive answer but prefers it to be more concise than in the NLG task, averaging 13.1 words, and many of the answers do not contain the context of the question. For the question "tablespoon in cup", a reference answer in the Q&A task is "16," while that in the NLG task is "There are 16 tablespoons in a cup."

Subsets. In addition to the ALL dataset, we prepared two subsets for ablation tests as listed in Table 1. The ANS set consisted of answerable questions, and the NLG set consisted of the answerable questions and well-formed answers, so that NLG ⊂ ANS ⊂ ALL. We note that multi-style learning enables our model to learn from different answer styles of data (i.e., the ANS set), and multi-task learning with the answer possibility classifier enables our model to learn from both answerable and unanswerable data (i.e., the ALL set).

Training and Inference. We trained our model with mini-batches consisting of multi-style answers that were randomly sampled. We used a greedy decoding algorithm and did not use any beam search or random sampling, because they did not provide any improvements.

                          NLG             Q&A
Model                     R-L     B-1     R-L     B-1
BiDAF (a)                 16.91   9.30    23.96   10.64
Deep Cascade QA (b)       35.14   37.35   52.01   54.64
S-Net+CES2S (c)           45.04   40.62   44.96   46.36
BERT+Multi-PGNet (d)      47.37   45.09   48.14   52.03
Selector+CCG (e)          47.39   45.26   50.63   52.03
VNET (f)                  48.37   46.75   51.63   54.37
Masque (NLG; single)      49.19   49.63   48.42   48.68
Masque (NLG; ensemble)    49.61   50.13   48.92   48.75
Masque (Q&A; single)      25.66   36.62   50.93   42.37
Masque (Q&A; ensemble)    28.53   39.87   52.20   43.77
Human Performance         63.21   53.03   53.87   48.50

Table 2: Performance of our and competing models on the MS MARCO V2 leaderboard (4 March 2019). (a) Seo et al. (2017); (b) Yan et al. (2019); (c) Shao (unpublished), a variant of Tan et al. (2018); (d) Li (unpublished), a model using Devlin et al. (2018) and See et al. (2017); (e) Qian (unpublished); (f) Wu et al. (2018). Whether the competing models are ensemble models or not is unreported.

Evaluation metrics and baselines. ROUGE-L and BLEU-1 were used to evaluate the models' RC performance, where ROUGE-L is the main metric on the official leaderboard. We used the scores of extractive (Seo et al., 2017; Yan et al., 2019; Wu et al., 2018), generative (Tan et al., 2018), and unpublished RC models reported at the submission time.

In addition, to evaluate the individual contributions of our modules, we used MAP and MRR for the ranker and F1 for the classifier, where the positive class was the answerable questions.

4.2 Results

Does our model achieve state-of-the-art on the two tasks with different styles? Table 2 shows the performance of our model and competing models on the leaderboard. Our ensemble model of six training runs, where each model was trained with the two answer styles, achieved state-of-the-art performance on both tasks in terms of ROUGE-L. In particular, for the NLG task, our single model outperformed competing models in terms of both ROUGE-L and BLEU-1.

Does multi-style learning improve the NLG performance? Table 3 lists the results of an ablation test for our single model (controlled with the NLG style) on the NLG dev. set¹. Our model trained with both styles outperformed the model trained with the single NLG style. Multi-style learning enabled our model to improve its NLG performance by also using non-sentence answers.

Model                                  Train   R-L     B-1
Masque (NLG style; single)             ALL     69.77   65.56
 w/o multi-style learning (§3.4.2)     NLG     68.20   63.95
  ↪ w/o Transformer (§3.1.2, §3.4.2)   NLG     67.13   62.96
 w/o passage ranker (§3.2)             NLG     68.05   63.82
 w/o possibility classifier (§3.3)     ANS     69.64   65.41
Masque w/ gold passage ranker          ALL     78.70   78.14

Table 3: Ablation test results on the NLG dev. set. The models were trained with the subset listed in "Train".

Model                                  Train   MAP     MRR
Bing (initial ranking)                 –       34.62   35.00
Masque (single)                        ALL     69.51   69.96
 w/o answer decoder (§3.4)             ALL     67.03   67.49
 w/o multi-style learning (§3.4.2)     NLG     65.51   65.59
 w/o possibility classifier (§3.3)     ANS     69.08   69.54

Table 4: Passage ranking results on the ANS dev. set.

Does the Transformer-based pointer-generator improve the NLG performance? Table 3 shows that our model also outperformed the model that used RNNs and self-attentions instead of Transformer blocks, as in MCAN (McCann et al., 2018). Our deep decoder captured the multi-hop interaction among the question, the passages, and the answer better than a single-layer LSTM decoder could.

Does joint learning with the ranker and classifier improve NLG performance? Furthermore, Table 3 shows that our model (jointly trained with the passage ranker and answer possibility classifier) outperformed the model that did not use the ranker and classifier. Joint learning thus had a regularization effect on the question-passages reader.

We also confirmed that the gold passage ranker, which can perfectly predict the relevance of passages, significantly improved the RC performance. Passage ranking will be a key to developing a system that can outperform humans.

Does joint learning improve the passage ranking performance? Table 4 lists the passage ranking performance on the ANS dev. set². The ranker shares the question-passages reader with the answer decoder, and this sharing contributed to improvements over the ranker trained without the answer decoder. Also, our ranker outperformed the initial ranking provided by Bing by a significant margin.

¹We confirmed with the organizer that the dev. results were much better than the test results, but there was no problem.

²This evaluation requires our ranker to re-rank 10 passages. It is not the same as the Passage Re-ranking task.

[Figure 4: Precision-recall curve for answer possibility classification on the ALL dev. set.]

[Figure 5: Lengths of answers generated by Masque broken down by the answer style and query type on the NLG dev. set. The error bars indicate standard errors.]

Does our model accurately identify answerable questions? Figure 4 shows the precision-recall curve for answer possibility classification on the ALL dev. set. Our model identified the answerable questions well. The maximum F1 score was 0.7893, where the threshold of answer possibility was 0.4411. This is the first report on answer possibility classification with MS MARCO 2.1.

Does our model control answer lengths with different styles? Figure 5 shows the lengths of the answers generated by our model broken down by the answer style and query type. The generated answers were relatively shorter than the reference answers, especially for the Q&A task, but well controlled with the target style for every query type. The short answers degraded our model's BLEU scores in the Q&A task (Table 2) because of BLEU's brevity penalty (Papineni et al., 2002).
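For reference, BLEU's brevity penalty down-weights candidates shorter than the reference; a small sketch shows the size of the effect:

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU's brevity penalty (Papineni et al., 2002): candidates shorter than
    the reference are penalized exponentially; a sketch to illustrate why short
    Q&A answers lower BLEU even when their content overlaps the reference."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

# A 10-word answer against a 13-word reference keeps only ~74% of its score.
print(brevity_penalty(10, 13))  # ≈ 0.741
```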

5 Experiments on NarrativeQA

Next, we evaluated our model on NarrativeQA (Kocisky et al., 2018). It requires understanding the underlying narrative rather than relying on shallow pattern matching. Our detailed setup and output examples are in the supplementary material.

5.1 Setup

We only describe the settings specific to this experiment.

Datasets. Following previous studies, we used the summary setting for the comparisons with the reported baselines, where each question refers to one summary (averaging 659 words), and there are no unanswerable questions. Our model therefore did not use the passage ranker and answer possibility classifier.

Answer styles. The NarrativeQA dataset does not explicitly provide multiple answer styles. In order to evaluate the effectiveness of multi-style learning, we used the NLG subset of MS MARCO as additional training data. We associated the NarrativeQA and NLG datasets with two answer styles. The answer style of NarrativeQA (NQA) differs from that of MS MARCO (NLG) in that the answers are short (averaging 4.73 words) and frequently contain pronouns. For instance, for the question "Who is Mark Hunter?", a reference is "He is a high school student in Phoenix."

Evaluation metrics and baselines. BLEU-1, BLEU-4, METEOR, and ROUGE-L were used in accordance with the evaluation in the dataset paper (Kocisky et al., 2018). We used the reports of top-performing extractive (Seo et al., 2017; Tay et al., 2018; Hu et al., 2018) and generative (Bauer et al., 2018; Indurthi et al., 2018) models.

5.2 Results

Does our model achieve state-of-the-art performance? Table 5 shows that our single model, trained with two styles and controlled with the NQA style, pushed forward the state-of-the-art by a significant margin. The evaluation scores of the model controlled with the NLG style were low because the two styles are different. Also, our model without multi-style learning (trained with only the NQA style) outperformed the baselines in terms of ROUGE-L. This indicates that our model architecture itself is powerful for natural language understanding in RC.

Model                        B-1     B-4     M       R-L
BiDAF (a)                    33.72   15.53   15.38   36.30
DECAPROP (b)                 42.00   23.42   23.42   40.07
MHPGM+NOIC (c)               43.63   21.07   19.03   44.16
ConZNet (d)                  42.76   22.49   19.24   46.67
RMR+A2D (e)                  50.4    26.5    N/A     53.3
Masque (NQA)                 54.11   30.43   26.13   59.87
 w/o multi-style learning    48.70   20.98   21.95   54.74
Masque (NLG)                 39.14   18.11   24.62   50.09
Masque (NQA; valid.) (f)     52.78   28.72   25.38   58.94

Table 5: Performance of our and competing models on the NarrativeQA test set. (a) Seo et al. (2017); (b) Tay et al. (2018); (c) Bauer et al. (2018); (d) Indurthi et al. (2018); (e) Hu et al. (2018). (f) Results on the NarrativeQA validation set.

6 Related Work and Discussion

Transfer and multi-task learning in RC. Recent breakthroughs in transfer learning demonstrate that pre-trained language models perform well on RC with minimal modifications (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018, 2019). Our model also uses ELMo (Peters et al., 2018) for contextualized embeddings.

Multi-task learning is a transfer mechanism to improve generalization performance (Caruana, 1997), and it is generally applied by sharing the hidden layers between all tasks, while keeping task-specific layers. Wang et al. (2018) and Nishida et al. (2018) reported that sharing the hidden layers between the multi-passage RC and passage ranking tasks was effective. Our results also showed the effectiveness of sharing the question-passages reader module among the RC, passage ranking, and answer possibility classification tasks.

In multi-task learning without task-specific layers, Devlin et al. (2018) and Chen et al. (2017) improved RC performance by learning multiple datasets from the same extractive RC setting. McCann et al. (2018) and Yogatama et al. (2019) investigated multi-task and curriculum learning on many different NLP tasks; their results were below task-specific RC models. Our multi-style learning does not use style-specific layers; instead, it uses a style-conditional decoder.

Generative RC. S-Net (Tan et al., 2018) used an extraction-then-synthesis mechanism for multi-passage RC. The models proposed by McCann et al. (2018), Bauer et al. (2018), and Indurthi et al. (2018) used an RNN-based pointer-generator mechanism for single-passage RC. Although these mechanisms can alleviate the lack of training data, large amounts of data are still required. Our multi-style learning will be a key technique enabling learning from many RC datasets with different styles.

In addition to MS MARCO and NarrativeQA, there are other datasets that provide abstractive answers. DuReader (He et al., 2018), a Chinese multi-document RC dataset, provides longer documents and answers than those of MS MARCO. DuoRC (Saha et al., 2018) and CoQA (Reddy et al., 2018) contain abstractive answers; most of the answers are short phrases.

Controllable text generation. Many studies have been carried out in the framework of style transfer, which is the task of rephrasing a text so that it contains specific styles such as sentiment. Recent studies have used artificial tokens (Sennrich et al., 2016; Johnson et al., 2017), variational auto-encoders (Hu et al., 2017), or adversarial training (Fu et al., 2018; Tsvetkov et al., 2018) to separate the content and style on the encoder side. On the decoder side, conditional language modeling has been used to generate output sentences with the target style. In addition, output length control with conditional language modeling has been well studied (Kikuchi et al., 2016; Takeno et al., 2017; Fan et al., 2018). Our style-controllable RC relies on conditional language modeling in the decoder.

Multi-passage RC. The simplest approach is to concatenate the passages and find the answer from the concatenation, as in (Wang et al., 2017). Earlier pipelined models found a small number of relevant passages with a TF-IDF based ranker and passed them to a neural reader (Chen et al., 2017; Clark and Gardner, 2018), while more recent models have used a neural re-ranker to more accurately select the relevant passages (Wang et al., 2018; Nishida et al., 2018). Also, non-pipelined models (including ours) consider all the provided passages and find the answer by comparing scores between passages (Tan et al., 2018; Wu et al., 2018). The most recent models make a proper trade-off between efficiency and accuracy (Yan et al., 2019; Min et al., 2018).


RC with unanswerable question identification. Previous work (Levy et al., 2017; Clark and Gardner, 2018) output a no-answer score depending on the probability of all answer spans. Hu et al. (2019) proposed an answer verifier to compare an answer with the question. Sun et al. (2018) jointly learned an RC model and an answer verifier. Our model introduces a classifier on top of the question-passages reader, which is not dependent on the generated answer.

Abstractive summarization. Current state-of-the-art models use the pointer-generator mechanism (See et al., 2017). In particular, content selection approaches, which decide what to summarize, have recently been used with abstractive models. Most methods select content at the sentence level (Hsu et al., 2018; Chen and Bansal, 2018) or the word level (Pasunuru and Bansal, 2018; Li et al., 2018; Gehrmann et al., 2018). Our model incorporates content selection at the passage level in the combined attention.

Query-based summarization has rarely been studied because of a lack of datasets. Nema et al. (2017) proposed an attentional encoder-decoder model; however, Saha et al. (2018) reported that it performed worse than BiDAF on DuoRC. Hasselqvist et al. (2017) proposed a pointer-generator based model; however, it does not consider copying words from the question.

7 Conclusion

This study sheds light on multi-style generative RC. Our proposed model, Masque, is based on multi-source abstractive summarization and learns multi-style answers together. It achieved state-of-the-art performance on the Q&A task and the Q&A + NLG task of MS MARCO 2.1 and the summary task of NarrativeQA. The key to its success is transferring the style-independent NLG capability to the target style by use of the question-passages reader and the conditional pointer-generator decoder. In particular, the capability of copying words from the question and passages can be shared among the styles, while the capability of controlling the mixture weights for the generative and copy distributions can be acquired for each style. Our future work will involve exploring the potential of our multi-style learning towards natural language understanding.

References

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. Computing Research Repository (CoRR), arXiv:1607.06450. Version 1.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A human generated machine reading comprehension dataset. Computing Research Repository (CoRR), arXiv:1611.09268. Version 3.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Empirical Methods in Natural Language Processing (EMNLP), pages 4220–4230.

Richard Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL), pages 1870–1879.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Association for Computational Linguistics (ACL), pages 675–686.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Association for Computational Linguistics (ACL), pages 845–855.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. Computing Research Repository (CoRR), arXiv:1810.04805. Version 1.

Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Workshop on Neural Machine Translation and Generation (NMT@ACL), pages 45–54.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Association for the Advancement of Artificial Intelligence (AAAI), pages 663–670.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Empirical Methods in Natural Language Processing (EMNLP), pages 4098–4109.

Alex Graves. 2013. Generating sequences with recurrent neural networks. Computing Research Repository (CoRR), arXiv:1308.0850. Version 5.

Johan Hasselqvist, Niklas Helmertz, and Mikael Kageback. 2017. Query-based abstractive summarization using neural networks. Computing Research Repository (CoRR), arXiv:1712.06100.


Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a Chinese machine reading comprehension dataset from real-world applications. In Workshop on Machine Reading for Question Answering (MRQA@ACL), pages 37–46.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. Computing Research Repository (CoRR), arXiv:1606.08415. Version 2.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Association for Computational Linguistics (ACL), pages 132–141.

Minghao Hu, Yuxing Peng, Furu Wei, Zhen Huang, Dongsheng Li, Nan Yang, and Ming Zhou. 2018. Attention-guided answer distillation for machine reading comprehension. In Empirical Methods in Natural Language Processing (EMNLP), pages 2077–2086.

Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, and Ming Zhou. 2019. Read + Verify: Machine reading comprehension with unanswerable questions. In Association for the Advancement of Artificial Intelligence (AAAI).

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning (ICML), pages 1587–1596.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations (ICLR).

Sathish Reddy Indurthi, Seunghak Yu, Seohyun Back, and Heriberto Cuayahuitl. 2018. Cut to the chase: A context zoom-in network for reading comprehension. In Empirical Methods in Natural Language Processing (EMNLP), pages 570–575.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics (TACL), 5:339–351.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), pages 1601–1611.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Empirical Methods in Natural Language Processing (EMNLP), pages 1328–1338.

Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics (TACL), 6:317–328.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Computational Natural Language Learning (CoNLL), pages 333–342.

Chenliang Li, Weiran Xu, Si Li, and Sheng Gao. 2018. Guiding generation for abstractive text summarization based on key information guide network. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 55–60.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. Computing Research Repository (CoRR), arXiv:1806.08730. Version 1.

Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Association for Computational Linguistics (ACL), pages 1725–1735.

Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Association for Computational Linguistics (ACL), pages 1063–1072.

Kyosuke Nishida, Itsumi Saito, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2018. Retrieve-and-read: Multi-task learning of information retrieval and reading comprehension. In Conference on Information and Knowledge Management (CIKM), pages 647–656.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL), pages 311–318.

Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 646–653.


Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. Computing Research Repository (CoRR), arXiv:1808.07042. Version 1.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Association for Computational Linguistics (ACL), pages 1683–1693.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics (ACL), pages 1073–1083.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 35–40.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations (ICLR).

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Highway networks. Computing Research Repository (CoRR), arXiv:1505.00387. Version 2.

Fu Sun, Linyang Li, Xipeng Qiu, and Yang Liu. 2018. U-Net: Machine reading comprehension with unanswerable questions. Computing Research Repository (CoRR), arXiv:1810.06638. Version 1.

Shunsuke Takeno, Masaaki Nagata, and Kazuhide Yamamoto. 2017. Controlling target features in neural machine translation via prefix constraints. In Workshop on Asian Translation (WAT@IJCNLP), pages 55–63.

Chuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. 2018. S-Net: From answer extraction to answer synthesis for machine reading comprehension. In Association for the Advancement of Artificial Intelligence (AAAI), pages 5940–5947.

Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018. Densely connected attention propagation for reading comprehension. In Advances in Neural Information Processing Systems (NeurIPS), pages 4911–4922.

Yulia Tsvetkov, Alan W. Black, Ruslan Salakhutdinov, and Shrimai Prabhumoye. 2018. Style transfer through back-translation. In Association for Computational Linguistics (ACL), pages 866–876.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced reader-ranker for open-domain question answering. In Association for the Advancement of Artificial Intelligence (AAAI), pages 5981–5988.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Association for Computational Linguistics (ACL), pages 189–198.

Hua Wu, Haifeng Wang, Sujian Li, Wei He, Yizhong Wang, Jing Liu, Kai Liu, and Yajuan Lyu. 2018. Multi-passage machine reading comprehension with cross-passage answer verification. In Association for Computational Linguistics (ACL), pages 1918–1927.

Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In International Conference on Learning Representations (ICLR).

Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, and Haiqing Chen. 2019. A deep cascade model for multi-document reading comprehension. In Association for the Advancement of Artificial Intelligence (AAAI).


Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. Computing Research Repository (CoRR), arXiv:1901.11373. Version 1.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations (ICLR).