
Learning Neural Knowledge Representations

Thesis Proposal

Bhuwan Dhingra

March 26, 2019

School of Computer Science
Carnegie Mellon University

Pittsburgh, PA 15213

Thesis Committee:
William W. Cohen (co-chair)

Ruslan Salakhutdinov (co-chair)
Graham Neubig

Michael Collins (Google NYC)

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy.

Copyright © 2019 Bhuwan Dhingra


Abstract

Much of the collective human knowledge resides on the internet today. Half the world's population has access to the internet, and consequently to this knowledge, but no one can navigate this wealth of information without the help of technology. Knowledge representation refers to organizing this information in a form such that any piece of it can be easily retrieved when a user asks for it. This involves processing extremely large-scale data and, at the same time, resolving fine-grained ambiguities inherent in natural language. Further difficulties are presented by the heterogeneous mix of structured and unstructured data typically available on the web, and the high cost of annotating such representations.

This thesis aims to develop efficient, scalable and flexible knowledge representations by leveraging recent successes in deep learning. We train neural networks to represent diverse sources of knowledge, including unstructured text, linguistic annotations, and curated databases, by answering queries posed over them. To increase the efficiency of learning, we discuss inductive biases for adapting recurrent neural networks to represent text, and graph convolution networks to represent structured data. We also present a semi-supervised technique which exploits unlabeled text documents in addition to labeled question and answer pairs for learning.

In the last part of the thesis we propose a distributed text knowledge base for representing very large text corpora, such as the entire Wikipedia. Towards this end, we present preliminary results investigating the applicability of contextual word representation models for indexing large corpora, as well as fine-tuning approaches for improving their factual information content.


Contents

1 Introduction
  1.1 Learning & Evaluating Knowledge Representations
  1.2 Overview of Completed Work
  1.3 Overview of Proposed Work
  1.4 Proposed Timeline

2 Text Representations (Completed Work)
  2.1 Introduction
  2.2 Related Work
  2.3 Gated-Attention Reader
  2.4 Extending with Coreference
  2.5 Experiments and Results
    2.5.1 Cloze-style QA
    2.5.2 bAbI AI Tasks
    2.5.3 Wikihop Dataset
    2.5.4 LAMBADA Dataset
  2.6 Conclusion

3 Text with KB Facts (Completed Work)
  3.1 Introduction
  3.2 Related Work
  3.3 Open-Domain QA
  3.4 GRAFT-Nets
  3.5 Experiments & Results
    3.5.1 Datasets
    3.5.2 Compared Models
    3.5.3 Main Results
    3.5.4 Comparison to Specialized Methods
  3.6 Conclusion

4 Multi-turn Dialogue for Knowledge Retrieval (Completed Work)
  4.1 Introduction
  4.2 Related Work
  4.3 Probabilistic KB Lookup
  4.4 End-to-End KB-InfoBot
  4.5 Training
  4.6 Experiments & Results
    4.6.1 Models & Data
    4.6.2 Simulated User Evaluation
    4.6.3 Human Evaluation
  4.7 Conclusions

5 Semi-Supervised QA (Completed Work)
  5.1 Introduction
  5.2 Related Work
  5.3 Methodology
  5.4 Experiments & Results
    5.4.1 Datasets
    5.4.2 Main Results
    5.4.3 Analysis
  5.5 Conclusion

6 Scaling up (Proposed Work)
  6.1 Towards a Distributed Text KB
  6.2 Related Work
  6.3 Contextual Representations
    6.3.1 Probing Tasks Setup
    6.3.2 Pretrained Language Models
    6.3.3 Hard Negative Mining
  6.4 Exploiting Redundancies
  6.5 Multi-Hop Queries

Bibliography


Chapter 1

Introduction

The driver of the power of intelligent systems is the knowledge the systems have about their universe of discourse, not the sophistication of the reasoning process the systems employ.

– Edward Feigenbaum

The goal of artificial intelligence (AI) is to build systems which can reason about their surroundings. But reasoning is only possible if the system has a representation of the surroundings. A representation, in this sense, is a substitute for the real thing; an abstract entity which enables a person or a program to determine consequences internally by thinking, rather than externally by acting [29]. The question of how to represent the natural world in AI programs has plagued researchers since the earliest conception of computers. A Knowledge Representation is a format for storing real-world information in an unambiguous manner such that computers can utilize it. Starting with the expert systems in the 1970s [113], to the frame semantics of Fillmore [36], to the recent Semantic Web effort of adding a "meaning layer" on top of the world wide web [9], several formalisms have been proposed and developed, across different domains and for various tasks.

In this thesis we are interested in representing concise factual knowledge, the kind which might be found in an encyclopedia. Perhaps the most widely used format for such knowledge today is the Knowledge Graph¹ [10, 115]. The nodes of this graph are real-world entities, and edges represent relationships between those entities. An information-seeking query is answered against the graph by identifying the entities it mentions (entity linking) [34, 98], followed by the relations it asks for and any other constraints it specifies (semantic parsing) [8, 77, 163].

¹ In the literature this is often referred to as a Knowledge Base (KB).


The entities, relations and constraints together specify the program for extracting the answer from the KG. This approach has some limitations, however. (1) Not all information can be conveniently stored in a graphical form (e.g. steps for cooking a lobster tail). (2) To ensure high quality, information extraction pipelines which populate KGs typically favor precision. Consequently, most existing KGs are highly incomplete [83]. (3) The world changes with time, and so do its entities and their relationships. Detecting these changes for updating a KG involves a delay.

To deal with these limitations, we explore distributed knowledge representations. A distributed representation assigns real-valued feature vectors to objects, which can include entities [158], text units such as words, sentences or documents [81], images [66], and sensor measurements [53]. Importantly, the feature vectors are learned as part of an optimization problem defined with respect to an end task, rather than being hand-crafted. Here, we will consider end tasks which involve retrieving answers to information-seeking queries. The tremendous success of deep learning can be attributed, among other things, to the effectiveness of distributed representations. In this thesis we argue that they provide an efficient, scalable and flexible framework for representing and accessing knowledge.

In this chapter we start by introducing a high-level framework which we will use to both learn and evaluate knowledge representations. Then we give a brief overview of our contributions so far within this framework in §1.2. Lastly, we discuss the future research we want to pursue in §1.3.

1.1 Learning & Evaluating Knowledge Representations

Distributed representations are typically learned as the intermediate states of a neural network optimized to solve an end task [7]. In our case we will rely on factoid question answering (QA) as the end task, which can be considered as a benchmark for evaluating the quality of a knowledge representation. Figure 1.1 shows a high-level overview of the QA pipeline considered in this thesis. Below we discuss the components of this pipeline in more detail.

Query and Answer. The input and output to the QA system. We consider the following types of queries in this thesis:

1. Well-formed questions: These are grammatical, natural language questions, typically starting with a Wh-word, and which most people colloquially identify as questions [33].


Figure 1.1: Question Answering system components. [Figure: a query and a context retrieved from the knowledge source (text, KG, and NLP tools) are encoded into question and context representations, which an answer selector combines to extract the answer; the components are trained on a dataset of question-answer pairs such as "Who voiced Meg in Family Guy? Mila Kunis", "Who won Super Bowl L? Denver Broncos", and "Where is CMU located? Pittsburgh", with optional feedback.]

2. Semi-structured queries: These are loosely structured queries which usually ask for properties of an entity, such as (Carnegie Mellon, located-in, ?). Such queries may be generated as an intermediate step in a text processing pipeline, such as at the output of a language understanding unit [167].

3. Cloze questions: These are sentences with missing tokens / phrases, which need to be filled in. Cloze questions are often used to test people's reading comprehension skills, and here we use them as pseudo-questions to test knowledge representations.

We will restrict our attention to factoid questions which ask for information about concise facts (e.g. Who first set foot on the Moon?). Answers to such questions are typically unambiguous, and we will make the assumption that they are short spans of text.

Knowledge Source. The information resource(s) available to the QA system. This could be the entire world wide web, or some domain-specific corpora and databases. We also consider auxiliary models, such as low-level NLP tools, here. Most commonly, we will focus on Wikipedia as our knowledge source.


Context. Working with web-scale corpora and databases is beyond the memory restrictions of modern GPUs. When developing neural models, we will often make the assumption that we are given a much smaller relevant portion, which we refer to as the context. The assumption here is that for a given query the context would be retrieved from the knowledge source using information retrieval and entity linking tools [122, 143]. We will consider the following types of context in this thesis:

1. Text: Documents or passages which contain the answer as a span.

2. KG facts: A collection of knowledge graph triples, each specifying the relation between two entities.

3. Coreference annotations: The output of a coreference resolver on a piece of text. Coreference resolvers identify the mentions in text which refer to the same real-world entity.

Question / Context Representations. These components are responsible for converting the raw question and context into a format amenable to extracting the answer. In this thesis we are interested in vector representations which are output from a learned neural network model. An ideal representation is one which projects semantically similar questions and contexts to nearby vectors in terms of a distance metric. Often, these components will be closely tied to the Answer Solver below.

Answer Solver. The answer solver accepts the question and context representations and extracts the specific answer from the context. In the case of textual context, this module would select a span within the text. In the case of KG context, this module would select an entity node as the answer. Often, the answer solver will have its own trainable parameters.

Training Dataset. A labeled dataset of question and answer pairs, which will be used to optimize the parameters of the components above. In some cases, we will make the assumption that the location of the answer in the context is labeled. In other cases, we will only assume that the answer is known, but its location in the context is unknown. In the latter case we will rely on distant supervision [84] to heuristically annotate the location of the answer.
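To make the distant supervision step concrete, the sketch below labels answer locations by exact (case-insensitive) token matching; this is a simplification for illustration, and the matching heuristic used for any particular dataset may differ.

```python
from typing import List, Tuple

def distant_supervision_spans(context_tokens: List[str],
                              answer_tokens: List[str]) -> List[Tuple[int, int]]:
    """Heuristically label answer locations: return all (start, end) token
    spans in the context that exactly match the answer string."""
    spans = []
    n, m = len(context_tokens), len(answer_tokens)
    for i in range(n - m + 1):
        if [t.lower() for t in context_tokens[i:i + m]] == \
           [t.lower() for t in answer_tokens]:
            spans.append((i, i + m))  # end index is exclusive
    return spans

# Example: the first matching span is typically taken as the (noisy) label.
ctx = "CMU is located in Pittsburgh , Pennsylvania .".split()
print(distant_supervision_spans(ctx, "Pittsburgh".split()))  # [(4, 5)]
```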

1.2 Overview of Completed Work

This thesis presents a collection of works, each of which tackles a subset of the pipeline presented above. Here we give a brief overview of the main contributions so far.


Text Representations (Chapter 2) [130, 132]. We start with the reading comprehension task of answering queries given a short text document, which is a benchmark for testing the ability of a model at extracting information. We present a recurrent neural network (RNN) model which iteratively refines the document representations by modeling their interactions with the query. Answering questions often involves aggregating information from multiple mentions of the same entity in text. These mentions may be far apart in the text, and RNNs, which process text sequentially, suffer in these cases. Hence, we propose to use coreference annotations, extracted from an external tool, to bridge the gap between these far-apart mentions. We show that these models outperform several strong baselines on multiple datasets.

Text with KB Facts (Chapter 3) [122]. While text is by far the most common format in which information is available, curated knowledge graphs (also called knowledge bases (KBs)) hold millions of facts about the real world. The unambiguity and high precision of these facts make them a valuable resource for question answering. In this chapter, we present a model based on graph convolution networks to answer open-domain questions using a combination of text and KBs. We show this model is effective across a wide range of settings of KB availability.

Multi-turn Dialogue for Knowledge Retrieval (Chapter 4) [129]. When interacting with search engines, users favor short and simple queries over long and compositional ones. However, such queries may not specify the answer completely. We explore a multi-turn dialogue agent which can ask users follow-up questions before retrieving the answer from a KB. A critical challenge with such agents is that intermediate interactions with the KB break the differentiability of the system and prohibit end-to-end training. We propose a novel probabilistic framework for KB retrieval which solves this issue, and evaluate it on both real and simulated users.

Semi-Supervised QA (Chapter 5) [133]. Collecting QA pairs where the location of answers is labeled in a source text is an expensive annotation task, but necessary for training knowledge representations. In this chapter we focus on semi-supervised QA, where in addition to a small set of labeled questions and answers, we have access to a large unlabeled in-domain text corpus. We discuss a simple technique for automatically constructing cloze-style questions from the unlabeled corpus. We show that pre-training a neural model on these cloze questions can dramatically reduce the requirement for labeled data.
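To make the idea concrete, the sketch below builds a cloze question by blanking out a capitalized span; this crude heuristic is only illustrative, and the actual construction procedure is described in Chapter 5.

```python
import re
from typing import Optional, Tuple

def make_cloze(sentence: str, blank: str = "@placeholder") -> Optional[Tuple[str, str]]:
    """Turn a sentence into a (cloze question, answer) pair by blanking out a
    capitalized span -- a crude stand-in for an entity or noun-phrase detector."""
    match = re.search(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)", sentence)
    if match is None:
        return None
    answer = match.group(1)
    question = sentence[:match.start()] + blank + sentence[match.end():]
    return question, answer

sent = "Neil Armstrong first set foot on the Moon in 1969."
print(make_cloze(sent))
# ('@placeholder first set foot on the Moon in 1969.', 'Neil Armstrong')
```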


1.3 Overview of Proposed Work

Our contributions so far have focused on the case of answering queries from a short and relevant context. We assumed that the context consisted of at most a few sentences, possibly with annotations, and a few KB facts, and that the answer was present somewhere in this context. This assumption allows us to fit the context in GPU memory and quickly train representation learners over it.

In a more realistic setting, instead of a small context we have large corpora and KBs, with billions of sentences and facts, to extract the answer from. One way of scaling up to sources of this size is to employ an information retrieval (IR) or entity linking (EL) system before the representation learner in order to retrieve the relevant context. IR and EL systems are, however, imperfect, and any loss in recall at this step cannot be recovered in the subsequent steps.

In the last part of this thesis, we focus on representing entire corpora for answering queries. In contrast to traditional KBs, which hold information in terms of a symbolic graph structure, we propose to build a soft KB composed of text mentions and their distributed representations. Building upon the phrase-indexed QA (PIQA) setup of Seo et al. [111], we answer queries against this KB using a maximum inner product search (MIPS) between the query representation and the text representations. Then, leveraging fast algorithms for extremely large-scale MIPS [59, 114], we can answer queries in real time.
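As a minimal sketch of the retrieval step, assuming mention encodings and a query encoding have already been produced by some learned model, an exact MIPS lookup amounts to the following; a production system would use an approximate MIPS index rather than this brute-force matrix product, and all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                       # dimensionality of the representations
num_mentions = 100_000        # encoded text mentions in the "soft KB"

# Stand-ins for learned encoders: mention_matrix[i] encodes mention i,
# query_vec encodes the information-seeking query.
mention_matrix = rng.standard_normal((num_mentions, d)).astype(np.float32)
query_vec = rng.standard_normal(d).astype(np.float32)

# Exact MIPS: score every mention with an inner product, then take the top k.
scores = mention_matrix @ query_vec            # (num_mentions,)
k = 5
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]      # sort the k best by score
print(top_k, scores[top_k])
```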

The construction of such a KB involves several new research challenges. We propose to tackle the following directions in the remainder of this thesis.

Contextual Representations (§6.3). The distributed text KB relies on a model for encoding text units into representations which hold information from their context. We start by designing a suite of probing tasks which allow us to characterize the relational information encoded by a given representation learner. These tasks will tell us what kind of queries our KB can support, similar to the schema of a traditional KB. We apply these probing tasks to the recently popularized pre-trained language models (LMs), such as BERT [30], which produce contextual word representations, and show that these lack the kind of information our task requires. We then propose a strategy for fine-tuning these to improve their relational information content.

Exploiting Redundancies (§6.4). Text corpora are inherently redundant. The same information may be expressed in different forms and at different places. Traditional KB population methods exploit these redundancies to extract unique high-precision facts. We would like to develop similar aggregation strategies across text mentions for our distributed KB as well. In particular, we will study clustering methods over the contextual representations of mentions which share the same surface forms. In addition to increasing the recall of facts which are mentioned several times, we hope this will reduce the size of the KB to make inference more efficient.

Multi-Hop Queries (§6.5). The biggest advantage of symbolic KBs is that they provide a natural mechanism for reasoning over their information. These mechanisms involve following paths of relations, and operating over sets of entities. Along similar lines, we propose to study how our soft KB might support multi-hop queries. In particular, we will develop models which answer conjunctive queries, i.e. those which look for the intersection of two sets, and compositional queries, i.e. those which follow a path of relations.

1.4 Proposed Timeline

Apr – Jul 2019: Finish work on contextual representations
Jul – Nov 2019: Work on exploiting redundancies
Nov 2019 – Feb 2020: Job search
Nov 2019 – Apr 2020: Work on multi-hop queries
Apr – May 2020: Thesis writing & defense


Chapter 2

Text Representations (Completed Work)

By far the most common source of information is natural language text. Answering questions over text involves locating where the information requested by the question is expressed in the text. Hence, in this chapter we focus on machine reading, the automatic understanding of text, by leveraging large-scale QA data. We first present a model for building query-dependent text representations. Then, we discuss how to use external NLP models, in particular coreference annotators, to extend our model to tasks which require reasoning over multiple mentions of entities.

2.1 Introduction

One option to measure progress towards machine reading is to test a system's ability to answer questions about a document it has to comprehend. Towards this end, several large-scale datasets of cloze-style questions over a context document have been introduced, which allow the training of supervised machine learning systems [51, 52, 89]. Such datasets can be easily constructed automatically, and the unambiguous nature of their queries provides an objective benchmark to measure a system's performance at text comprehension.

Deep learning models have been shown to outperform traditional shallow approaches on text comprehension tasks [51]. The success of many of these models can be attributed primarily to two factors: (1) multi-hop architectures [112, 117, 147], which allow a model to scan the document and the question iteratively over multiple passes; and (2) attention mechanisms [12, 51], borrowed from the machine translation literature [4], which allow the model to focus on appropriate subparts of the context document. Intuitively, the multi-hop architecture allows the reader to incrementally refine token representations, and the attention mechanism re-weights different parts of the document according to their relevance to the query.

Context: […] mary got the football there […] mary went to the bedroom […] mary travelled to the hallway […]
Question: where was the football before the hallway?

Context: Louis-Philippe Fiset […] was a local physician and politician in the Mauricie area […] is located in the Mauricie region of Quebec, Canada […]
Question: country of citizenship – louis-philippe fiset?

Figure 2.1: Example questions which require coreference-based reasoning, from the bAbI dataset (top) and the Wikihop dataset (bottom). Coreferences are in bold, and the correct answers are underlined.

We start by focusing on combining these two aspects in a complementary manner, by designing a novel attention mechanism which gates the evolving token representations across hops. More specifically, unlike existing models where the query attention is applied either token-wise [12, 51, 52, 62] or sentence-wise [121, 147] to allow weighted aggregation, the Gated-Attention (GA) module proposed here allows the query to directly interact with each dimension of the token embeddings at the semantic level, and is applied layer-wise as an information filter during the multi-hop representation learning process. Such fine-grained attention enables our model to learn conditional token representations w.r.t. the given question, leading to accurate answer selections.

Next, we switch our focus to problems which require reasoning about the information present in text. One important form of reasoning for Question Answering (QA) models is the ability to aggregate information from multiple mentions of entities. We call this coreference-based reasoning, since multiple pieces of information, which may lie across sentence, paragraph or document boundaries, are tied together with the help of referring expressions which denote the same real-world entity. Figure 2.1 shows examples.

Reading comprehension models, including the GA Reader we will present below, typically consist of RNN layers. RNN layers have a bias towards sequential recency [32], i.e. a tendency to favor short-term dependencies. Attention mechanisms alleviate part of the issue, but empirical studies suggest RNNs with attention also have difficulty modeling long-term dependencies [24]. We conjecture that when training data is scarce, and inductive biases play an important role, RNN-based models would have trouble with coreference-based reasoning.

At the same time, systems for coreference resolution have seen a gradual increase in accuracy over the years [31, 69, 153]. Hence, we propose to use the annotations produced by such systems to adapt a standard RNN layer by introducing a bias towards coreferent recency. Specifically, given an input sequence and coreference clusters extracted from an external system, we introduce a term in the update equations for Gated Recurrent Units (GRU) [16] which depends on the hidden state of the coreferent antecedent of the current token (if it exists). This way hidden states are propagated along coreference chains and the original sequence in parallel.

2.2 Related Work

LSTMs with Attention. Several architectures introduced in Hermann et al. [51] employ LSTM units to compute a combined document-query representation g(d, q), which is used to rank the candidate answers. These include the Deep LSTM Reader, which performs a single forward pass through the concatenated (document, query) pair to obtain g(d, q); the Attentive Reader, which first computes a document vector d(q) by a weighted aggregation of words according to attentions based on q, and then combines d(q) and q to obtain their joint representation g(d(q), q); and the Impatient Reader, where the document representation is built incrementally. The architecture of the Attentive Reader has been simplified recently in the Stanford Attentive Reader, where shallower recurrent units were used with a bilinear form for the query-document attention [12].

Attention Sum. The Attention-Sum (AS) Reader [62] uses two bi-directional GRU networks [16] to encode both d and q into vectors. A probability distribution over the entities in d is obtained by computing dot products between q and the entity embeddings and taking a softmax. Then, an aggregation scheme named pointer-sum attention is further applied to sum the probabilities of the same entity, so that frequent entities in the document will be favored compared to rare ones. Building on the AS Reader, the Attention-over-Attention (AoA) Reader [23] introduces a two-way attention mechanism where the query and the document are mutually attentive to each other.

Multi-hop Architectures. Memory Networks (MemNets) were proposed in Weston et al. [147], where each sentence in the document is encoded to a memory by aggregating nearby words. Attention over the memory slots given the query is used to compute an overall memory and to renew the query representation over multiple iterations, allowing certain types of reasoning over the salient facts in the memory and the query. Neural Semantic Encoders (NSE) [88] extended MemNets by introducing a write operation which can evolve the memory over time during the course of reading. Iterative reasoning has been found effective in several more recent models, including the Iterative Attentive Reader [117] and ReasoNet [112]. The latter allows dynamic reasoning steps and is trained with reinforcement learning.

Entity-based models. Ji et al. [57] presented a generative model for jointly predicting the next word in the text and its gold-standard coreference annotation. The difference in our work is that we look at the task of reading comprehension, and also work in the more practical setting of system-extracted coreferences. EntNets [48] also maintain dynamic memory slots for entities, but do not use coreference signals and instead update all memories after reading each sentence, which leads to poor performance in the low-data regime (cf. Table 2.3). Yang et al. [161] model references in text as explicit latent variables, but limit their work to text generation. Kobayashi et al. [65] used a pooling operation to aggregate entity information across multiple mentions. Wang et al. [141] also noted the importance of reference resolution for reading comprehension, and we compare our model to their one-hot pointer reader.

Syntactic Recency. Recent work has used syntax, in the form of dependency trees, to replace the sequential recency bias in RNNs with a syntactic recency bias [14, 95, 123, 124]. However, syntax only looks at dependencies within sentence boundaries, whereas our focus here is on longer ranges. Our resulting layer is structurally similar to Graph LSTMs [91], with an additional attention mechanism over the graph edges. However, while Peng et al. [91] found that using coreference did not lead to any gains for the task of relation extraction, here we show that it has a positive impact on the reading comprehension task.

Self-Attention. Models which compute pair-wise interactions between all pairs of tokens in the input text [137] are becoming popular for modeling long-term dependencies, and may also benefit from coreference information to bias the learning of those dependencies [?]. Here we focus on recurrent layers and leave such an analysis to future work.

Other related works include the Dynamic Entity Representation network (DER) [65], which builds dynamic representations of the candidate answers while reading the document, and accumulates the information about an entity by max-pooling; the EpiReader [136], which consists of two networks, where one proposes a small set of candidate answers, and the other reranks the proposed candidates conditioned on the query and the context; the Bi-Directional Attention Flow network (BiDAF) [109], which adopts a multi-stage hierarchical architecture along with a flow-based attention mechanism; and Bajgar et al. [5], who showed a 10% improvement on the CBT corpus [52] by training the AS Reader on an augmented training set of about 14 million examples, making a case for the community to exploit data abundance.

2.3 Gated-Attention Reader

Our proposed GA readers perform multiple hops over the document (context), similar to the Memory Networks architecture [121]. Multi-hop architectures mimic the multi-step comprehension process of human readers, and have shown promising results in several recent models for text comprehension [67, 112, 117]. The contextual representations in GA readers, namely the embeddings of words in the document, are iteratively refined across hops until reaching a final attention-sum module [62] which maps the contextual representations in the last hop to a probability distribution over candidate answers.

The attention mechanism has been introduced recently to model human focus, leading to significant improvements in machine translation and image captioning [4, 87]. In reading comprehension tasks, ideally, the semantic meanings carried by the contextual embeddings should be aware of the query across hops. As an example, human readers are able to keep the question in mind during multiple passes of reading, to successively mask away information irrelevant to the query. However, existing neural network readers are restricted to attending to either tokens [12, 51] or entire sentences [147], with the assumption that certain sub-parts of the document are more important than others. In contrast, we propose a finer-grained model which attends to components of the semantic representation being built up by the GRU. The new attention mechanism, called gated-attention, is implemented via multiplicative interactions between the query and the contextual embeddings, and is applied per hop to act as a fine-grained information filter during the multi-step reasoning. The filters weigh individual components of the vector representation of each token in the document separately.

The design of gated-attention layers is motivated by the effectiveness of multiplicative interactions among vector-space representations, e.g., in various types of recurrent units [55, 156] and in relational learning [64, 157]. While other types of compositional operators are possible, such as concatenation or addition [86], we find that multiplication has strong empirical performance, where query representations naturally serve as information filters across hops.


Preliminaries

All tasks we look at involve tuples of the form $(d, q, a, C)$, where the goal is to find the answer $a$ from candidates $C$ to question $q$ with passage $d$ as context. In this work we consider datasets where each candidate $c \in C$ has at least one token which also appears in the document.

Several components of the model use a Gated Recurrent Unit (GRU) [16], which maps an input sequence $X = [x_1, x_2, \ldots, x_T]$ to an output sequence $H = [h_1, h_2, \ldots, h_T]$ as follows:

$$r_t = \sigma(W^r x_t + U^r h_{t-1} + b^r),$$
$$z_t = \sigma(W^z x_t + U^z h_{t-1} + b^z),$$
$$\tilde{h}_t = \tanh(W^h x_t + U^h (r_t \odot h_{t-1}) + b^h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. \quad (2.1)$$

Here $\odot$ denotes the Hadamard product, i.e. element-wise multiplication. $r_t$ and $z_t$ are called the reset and update gates respectively, and $\tilde{h}_t$ the candidate output. A Bi-directional GRU (Bi-GRU) processes the sequence in both forward and backward directions to produce two sequences $[h_1^f, h_2^f, \ldots, h_T^f]$ and $[h_1^b, h_2^b, \ldots, h_T^b]$, which are concatenated at the output:

$$\overleftrightarrow{\mathrm{GRU}}(X) = [h_1^f \| h_T^b, \ldots, h_T^f \| h_1^b], \quad (2.2)$$

where $\overleftrightarrow{\mathrm{GRU}}(X)$ denotes the full output of the Bi-GRU, obtained by concatenating each forward state $h_i^f$ and backward state $h_{T-i+1}^b$ at step $i$ given the input $X$. Note that $\overleftrightarrow{\mathrm{GRU}}(X)$ is a matrix in $\mathbb{R}^{2 n_h \times T}$, where $n_h$ is the number of hidden units in the GRU.

Let $X^{(0)} = [x_1^{(0)}, x_2^{(0)}, \ldots, x_{|D|}^{(0)}]$ denote the token embeddings of the document, which are also the inputs at layer 1 for the document reader below, and let $Y = [y_1, y_2, \ldots, y_{|Q|}]$ denote the token embeddings of the query. Here $|D|$ and $|Q|$ denote the document and query lengths respectively.
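As a minimal sketch (with arbitrary placeholder dimensions), the full Bi-GRU output of Eq. 2.2 for a batch of token embeddings can be computed in PyTorch as follows; torch.nn.GRU already returns the forward and backward hidden states concatenated at each step.

```python
import torch
import torch.nn as nn

batch, T, emb_dim, n_h = 2, 30, 100, 64

# Token embeddings, e.g. X^(k-1) for the document: (batch, T, emb_dim).
X = torch.randn(batch, T, emb_dim)

# A bidirectional GRU returns, at each step, the forward and backward hidden
# states concatenated, giving a (batch, T, 2 * n_h) tensor analogous to Eq. 2.2.
bi_gru = nn.GRU(input_size=emb_dim, hidden_size=n_h,
                batch_first=True, bidirectional=True)
D, _ = bi_gru(X)
print(D.shape)  # torch.Size([2, 30, 128])
```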

Multi-Hop Architecture

Fig. 2.2 illustrates the Gated-Attention (GA) reader. The model reads the document and the query over $K$ horizontal layers, where layer $k$ receives the contextual embeddings $X^{(k-1)}$ of the document from the previous layer. The document embeddings are transformed by taking the full output of a document Bi-GRU (indicated in blue in Fig. 2.2):

$$D^{(k)} = \overleftrightarrow{\mathrm{GRU}}_D^{(k)}(X^{(k-1)}). \quad (2.3)$$


Figure 2.2: Gated-Attention Reader. Dashed lines represent dropout connections.

At the same time, a layer-specific query representation is computed as the full output of a separate query Bi-GRU (indicated in green in Figure 2.2):

$$Q^{(k)} = \overleftrightarrow{\mathrm{GRU}}_Q^{(k)}(Y). \quad (2.4)$$

Next, Gated-Attention is applied to $D^{(k)}$ and $Q^{(k)}$ to compute the inputs for the next layer, $X^{(k)}$:

$$X^{(k)} = \mathrm{GA}(D^{(k)}, Q^{(k)}), \quad (2.5)$$

where GA is defined in the following subsection.

Gated-Attention Module

For brevity, let us drop the superscript $k$ in this subsection as we are focusing on a particular layer. For each token $d_i$ in $D$, the GA module forms a token-specific representation of the query $q_i$ using soft attention, and then multiplies the query representation element-wise with the document token representation. Specifically, for $i = 1, \ldots, |D|$:

$$\alpha_i = \mathrm{softmax}(Q^\top d_i), \quad (2.6)$$
$$q_i = Q \alpha_i,$$
$$x_i = d_i \odot q_i. \quad (2.7)$$
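A minimal PyTorch sketch of the GA module in Eqs. 2.6–2.7, operating on a batch of document and query encodings (the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def gated_attention(D: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Gated-Attention module.
    D: document encodings, (batch, |D|, 2 * n_h)
    Q: query encodings,    (batch, |Q|, 2 * n_h)
    Returns X: inputs to the next layer, (batch, |D|, 2 * n_h)."""
    # alpha_i = softmax(Q^T d_i) for every document token i (Eq. 2.6).
    alpha = F.softmax(torch.bmm(D, Q.transpose(1, 2)), dim=-1)   # (batch, |D|, |Q|)
    # q_i = Q alpha_i: a token-specific summary of the query.
    q = torch.bmm(alpha, Q)                                      # (batch, |D|, 2 * n_h)
    # x_i = d_i (elementwise) q_i: the query acts as an information filter (Eq. 2.7).
    return D * q

D = torch.randn(2, 30, 128)   # document Bi-GRU output D^(k)
Q = torch.randn(2, 10, 128)   # query Bi-GRU output Q^(k)
print(gated_attention(D, Q).shape)  # torch.Size([2, 30, 128])
```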


Answer Prediction

Let $q_\ell^{(K)} = q_\ell^f \| q_{T-\ell+1}^b$ be an intermediate output of the final-layer query Bi-GRU at the location $\ell$ of the cloze token in the query, and let $D^{(K)} = \overleftrightarrow{\mathrm{GRU}}_D^{(K)}(X^{(K-1)})$ be the full output of the final-layer document Bi-GRU. To obtain the probability that a particular token in the document answers the query, we take an inner product between these two and pass it through a softmax layer:

$$s = \mathrm{softmax}\left((q_\ell^{(K)})^\top D^{(K)}\right), \quad (2.8)$$

where the vector $s$ defines a probability distribution over the $|D|$ tokens in the document. The probability of a particular candidate $c \in C$ being the answer is then computed by aggregating the probabilities of all document tokens which appear in $c$ and renormalizing over the candidates:

$$\Pr(c \mid d, q) \propto \sum_{i \in I(c, d)} s_i, \quad (2.9)$$

where $I(c, d)$ is the set of positions where a token in $c$ appears in the document $d$. This aggregation operation is the same as the pointer-sum attention applied in the AS Reader [62].

Finally, the candidate with maximum probability is selected as the predicted answer:

$$a^* = \operatorname{argmax}_{c \in C} \Pr(c \mid d, q). \quad (2.10)$$

During the training phase, the model parameters of GA are updated w.r.t. a cross-entropy loss between the predicted probabilities and the true answers.
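For concreteness, the scoring and pointer-sum aggregation of Eqs. 2.8–2.10 can be sketched as follows; the batch dimension is dropped, candidate token positions are assumed to be precomputed, and all names are illustrative.

```python
import torch
import torch.nn.functional as F
from typing import Dict, List

def predict_answer(q_cloze: torch.Tensor,           # (2 * n_h,) query state at the cloze token
                   D_final: torch.Tensor,           # (|D|, 2 * n_h) final-layer document states
                   candidates: Dict[str, List[int]]) -> str:
    """Pointer-sum attention: score each document token against the cloze
    representation, then sum token probabilities over each candidate's positions."""
    s = F.softmax(D_final @ q_cloze, dim=0)          # Eq. 2.8: distribution over |D| tokens
    scores = {c: s[idx].sum().item() for c, idx in candidates.items()}  # Eq. 2.9 (unnormalized)
    return max(scores, key=scores.get)               # Eq. 2.10

D_final = torch.randn(50, 128)
q_cloze = torch.randn(128)
candidates = {"denver broncos": [3, 17, 41], "carolina panthers": [8, 22]}
print(predict_answer(q_cloze, D_final, candidates))
```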

Further Enhancements

Character-level Embeddings: Given a token $w$ from the document or query, its vector-space representation is computed as $x = L(w) \| C(w)$. $L(w)$ retrieves the word embedding for $w$ from a lookup table $L \in \mathbb{R}^{|V| \times n_l}$, whose rows hold a vector for each unique token in the vocabulary. We also utilize a character composition model $C(w)$ which generates an orthographic embedding of the token. Such embeddings have been previously shown to be helpful for tasks like Named Entity Recognition [158] and for dealing with OOV tokens at test time [128]. The embedding $C(w)$ is generated by taking the final outputs $z_{n_c}^f$ and $z_{n_c}^b$ of a Bi-GRU applied to embeddings from a lookup table of characters in the token, and applying a linear transformation:

$$z = z_{n_c}^f \| z_{n_c}^b,$$
$$C(w) = W z + b.$$


Figure 2.3: Forward (left) and Backward (right) Coref-GRU layers. Mary and she are coreferent.

Question Evidence Common Word Feature (qe-comm): Li et al. [72] proposed a simple token-level indicator feature which significantly boosts reading comprehension performance in some cases. For each token in the document we construct a one-hot vector $f_i \in \{0, 1\}^2$ indicating its presence in the query. It can be incorporated into the GA reader by assigning a feature lookup table $F \in \mathbb{R}^{n_F \times 2}$ (we use $n_F = 2$), taking the feature embedding $e_i = f_i^\top F$, and appending it to the inputs of the last-layer document Bi-GRU as $x_i^{(K)} \| e_i$ for all $i$. We conduct our experiments both with and without this feature and report the results. Henceforth, we refer to this feature as the qe-comm feature, or just feature.
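A small sketch of how this feature might be constructed; the tokens, query, and table sizes below are illustrative only.

```python
import torch
import torch.nn as nn

doc_tokens = ["mary", "got", "the", "football", "there"]
query_tokens = {"where", "was", "the", "football"}

# f_i in {0, 1}^2: one-hot indicator of whether document token i appears in the query.
f = torch.tensor([[1.0, 0.0] if tok in query_tokens else [0.0, 1.0]
                  for tok in doc_tokens])        # (|D|, 2)

# Feature lookup table F (n_F = 2); e_i = f_i^T F selects a learned embedding per value.
F_table = nn.Parameter(torch.randn(2, 2))
e = f @ F_table                                  # (|D|, n_F), appended to the last-layer inputs
print(e.shape)  # torch.Size([5, 2])
```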

2.4 Extending with Coreference

Next, we discuss how to extend the above model with coreference information to improve its modeling of long-term dependencies. We will focus on the recurrent layer (Eq. 2.1), and keep the rest of the model unchanged.

Coref-RNN Layers. Suppose we are given an input sequence $w_1, w_2, \ldots, w_T$ along with their word vectors $x_1, \ldots, x_T$ and annotations for the most recent coreferent antecedent of each token, $y_1, \ldots, y_T$, where $y_t \in \{0, \ldots, t-1\}$ and $y_t = 0$ denotes the null antecedent (for tokens not belonging to any cluster). We assume all tokens belonging to a mention in a cluster belong to that cluster, and that there are $C$ clusters in total. The update equations in any RNN all take the same basic form:

$$f(W x_t + U h_{t-1} + b).$$


The bias for sequential recency comes from the second term, $U h_{t-1}$. In this work we add another term to introduce a bias towards coreferent recency instead:

$$f(W x_t + \alpha_t U \phi_s(h_{t-1}) + (1 - \alpha_t) U' \phi_c(h_{y_t}) + b),$$

where $h_{y_t}$ is the hidden state of the coreferent antecedent of $w_t$ (with $h_0 = 0$), $\phi_s$ and $\phi_c$ are non-linear functions applied to the hidden states coming from the sequential antecedent and the coreferent antecedent, respectively, and $\alpha_t$ is a scalar weight which decides the relative importance of the two terms based on the current input (so that, for example, pronouns may assign a higher weight to the coreference state). When $y_t = 0$, $\alpha_t$ is set to 1; otherwise it is computed using a key-based addressing scheme [82], as $\alpha_t = \mathrm{softmax}(x_t^\top k)$, where $k$ is a trainable key vector. In this work we use simple slicing functions $\phi_s(x) = x[1 : d/2]$ and $\phi_c(x) = x[d/2 : d]$, which decompose the hidden states into a sequential and a coreferent component, respectively. Figure 2.3 (left) shows an illustration of the layer.

Coref-GRU (C-GRU). The above extension to RNNs can be applied to any recurrent layer; here we will focus on GRU cells (Eq. 2.1). For simplicity, we introduce the variable $m_t$ which concatenates ($\|$) the sequential and coreferent hidden states:

$$m_t = \alpha_t \phi_s(h_{t-1}) \,\|\, (1 - \alpha_t) \phi_c(h_{y_t}).$$

Then the update equations are given by:

$$r_t = \sigma(W^r x_t + U^r m_t + b^r),$$
$$z_t = \sigma(W^z x_t + U^z m_t + b^z),$$
$$\tilde{h}_t = \tanh(W^h x_t + r_t \odot U^h m_t + b^h),$$
$$h_t = (1 - z_t) \odot m_t + z_t \odot \tilde{h}_t.$$

The attention parameter $\alpha_t$ is given by:

$$\alpha_t = \frac{\exp(x_t^\top k_1)}{\exp(x_t^\top k_1) + \exp(x_t^\top k_2)},$$

where $k_1$ and $k_2$ are trainable key vectors.
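The following PyTorch sketch implements a single C-GRU step following the equations above; it is an illustrative re-implementation rather than the exact experimental code, and the coreferent antecedent state is assumed to be looked up by the caller.

```python
import torch
import torch.nn as nn

class CorefGRUCell(nn.Module):
    """One step of the Coref-GRU update: the recurrent input m_t mixes the
    sequential state h_{t-1} and the coreferent antecedent state h_{y_t}."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        assert hidden_dim % 2 == 0
        self.d = hidden_dim
        self.gru = nn.GRUCell(input_dim, hidden_dim)    # provides the gate parameters of Eq. 2.1
        self.k1 = nn.Parameter(torch.randn(input_dim))  # key for the sequential term
        self.k2 = nn.Parameter(torch.randn(input_dim))  # key for the coreferent term

    def forward(self, x_t, h_prev, h_coref, has_antecedent):
        # alpha_t: softmax over the two key scores; forced to 1 when y_t = 0.
        logits = torch.stack([x_t @ self.k1, x_t @ self.k2], dim=-1)
        alpha = torch.softmax(logits, dim=-1)[..., 0]
        alpha = torch.where(has_antecedent, alpha, torch.ones_like(alpha))
        # m_t = alpha_t * phi_s(h_{t-1}) || (1 - alpha_t) * phi_c(h_{y_t}),
        # with phi_s / phi_c slicing the first / second half of the hidden state.
        half = self.d // 2
        m_t = torch.cat([alpha.unsqueeze(-1) * h_prev[:, :half],
                         (1 - alpha).unsqueeze(-1) * h_coref[:, half:]], dim=-1)
        return self.gru(x_t, m_t)                       # GRU update with m_t in place of h_{t-1}

cell = CorefGRUCell(input_dim=100, hidden_dim=128)
x_t = torch.randn(4, 100)
h_prev, h_coref = torch.zeros(4, 128), torch.zeros(4, 128)
has_ante = torch.tensor([True, False, True, False])
print(cell(x_t, h_prev, h_coref, has_ante).shape)  # torch.Size([4, 128])
```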

Connection to Memory Networks. We can also view the model as a memory network [121] with a memory state $M_t$ at each time step, which is a $C \times d$ matrix. The rows of this memory matrix correspond to the state of each coreference cluster at time step $t$. The main difference between Coref-GRUs and a typical memory network such as EntNets [48] lies in the fact that we use coreference annotations to read and write from the memory, rather than letting the model learn how to access the memory. With Coref-GRUs, only the content of the memories needs to be learned. As we shall see in Section 5.4, this turns out to be a useful bias in the low-data regime.

Bidirectional C-GRU. To extend to the bidirectional case, a second layer is fed the same sequence in the reverse direction, $x_T, \ldots, x_1$, and $y_t \in \{0, t+1, \ldots, T\}$ now denotes the immediately descendant coreferent token of $w_t$. Outputs from the two layers are then concatenated to form the bi-directional output (see Figure 2.3).

Complexity. The resulting layer has the same time complexity as that of a regular GRU layer. The memory complexity increases, since we have to keep track of the hidden states for each coreference cluster in the input. If there are $C$ clusters and $B$ is the batch size, the resulting complexity is $O(BTCd)$. This scales linearly with the input size $T$; however, we leave the exploration of more efficient architectures to future work.

2.5 Experiments and Results

2.5.1 Cloze-style QA

Datasets

We evaluate the GA reader on five large-scale datasets recently proposed in the literature. The first two, CNN and Daily Mail news stories¹, consist of articles from the popular CNN and Daily Mail websites [51]. A query over each article is formed by removing an entity from the short summary which follows the article. Further, entities within each article were anonymized to make the task purely a comprehension one. N-gram statistics, for instance, computed over the entire corpus are no longer useful in such an anonymized corpus.

The next two datasets are formed from two different subsets of the Children's Book Test (CBT)² [52]. Documents consist of 20 contiguous sentences from the body of a popular children's book, and queries are formed by deleting a token from the 21st sentence. We only focus on subsets where the deleted token is either a common noun (CN) or named entity (NE), since simple language models already give human-level performance on the other types (cf. [52]).

¹ https://github.com/deepmind/rc-data
² http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz


The final dataset is Who Did What³ (WDW) [89], constructed from the LDC English Gigaword newswire corpus. First, article pairs which appeared around the same time and with overlapping entities are chosen, and then one article forms the document and a cloze query is constructed from the other. Missing tokens are always person named entities. Questions which are easily answered by simple baselines are filtered out, to make the task more challenging. There are two versions of the training set: a small but focused "Strict" version and a large but noisy "Relaxed" version. We report results on both settings, which share the same validation and test sets.

Performance Comparison

Tables 2.2 and 2.1 show a comparison of the performance of the GA Reader with previously published results on the WDW and the CNN, Daily Mail, and CBT datasets respectively. The numbers reported for the GA Reader are for single best models, though we compare to both ensembles and single models from prior work. We present 4 variants of the latest GA Reader, using combinations of whether the qe-comm feature is used (+feature) or not, and whether the word lookup table L(w) is updated during training or fixed to its initial value.

Interestingly, we observe that feature engineering leads to significant improvements for the WDW and CBT datasets, but not for the CNN and Daily Mail datasets. We note that anonymization of the latter datasets means that there is already some feature engineering (it adds hints about whether a token is an entity), and these are much larger than the other four. In machine learning it is common to see the effect of feature engineering diminish with increasing data size. Similarly, fixing the word embeddings provides an improvement for WDW and CBT, but not for CNN and Daily Mail. This is not surprising given that the latter datasets are larger and less prone to overfitting.

Comparing with prior work, on the WDW dataset the basic version of the GA Reader outperforms all previously published models when trained on the Strict setting. By adding the qe-comm feature the performance increases by 3.2% and 3.5% on the Strict and Relaxed settings respectively, setting a new state of the art⁴ on this dataset. On the CNN and Daily Mail datasets the GA Reader leads to an improvement of 3.2% and 4.3% respectively over the best previous single models. It also outperforms previous ensemble models, setting a new state of the art for both datasets. For CBT-NE, the GA Reader with the qe-comm feature outperforms all previous single and ensemble models except the AS Reader trained on the much larger BookTest Corpus [5].

³ https://tticnlp.github.io/who_did_what/
⁴ As of March 2017.


Model | CNN (Val / Test) | Daily Mail (Val / Test) | CBT-NE (Val / Test) | CBT-CN (Val / Test)
Humans (query) † | – / – | – / – | – / 52.0 | – / 64.4
Humans (context + query) † | – / – | – / – | – / 81.6 | – / 81.6
LSTMs (context + query) † | – / – | – / – | 51.2 / 41.8 | 62.6 / 56.0
Deep LSTM Reader † | 55.0 / 57.0 | 63.3 / 62.2 | – / – | – / –
Attentive Reader † | 61.6 / 63.0 | 70.5 / 69.0 | – / – | – / –
Impatient Reader † | 61.8 / 63.8 | 69.0 / 68.0 | – / – | – / –
MemNets † | 63.4 / 66.8 | – / – | 70.4 / 66.6 | 64.2 / 63.0
AS Reader † | 68.6 / 69.5 | 75.0 / 73.9 | 73.8 / 68.6 | 68.8 / 63.4
DER Network † | 71.3 / 72.9 | – / – | – / – | – / –
Stanford AR (relabeling) † | 73.8 / 73.6 | 77.6 / 76.6 | – / – | – / –
Iterative Attentive Reader † | 72.6 / 73.3 | – / – | 75.2 / 68.6 | 72.1 / 69.2
EpiReader † | 73.4 / 74.0 | – / – | 75.3 / 69.7 | 71.5 / 67.4
AoA Reader † | 73.1 / 74.4 | – / – | 77.8 / 72.0 | 72.2 / 69.4
ReasoNet † | 72.9 / 74.7 | 77.6 / 76.6 | – / – | – / –
NSE † | – / – | – / – | 78.2 / 73.2 | 74.3 / 71.9
BiDAF † | 76.3 / 76.9 | 80.3 / 79.6 | – / – | – / –
MemNets (ensemble) † | 66.2 / 69.4 | – / – | – / – | – / –
AS Reader (ensemble) † | 73.9 / 75.4 | 78.7 / 77.7 | 76.2 / 71.0 | 71.1 / 68.9
Stanford AR (relabeling, ensemble) † | 77.2 / 77.6 | 80.2 / 79.2 | – / – | – / –
Iterative Attentive Reader (ensemble) † | 75.2 / 76.1 | – / – | 76.9 / 72.0 | 74.1 / 71.0
EpiReader (ensemble) † | – / – | – / – | 76.6 / 71.8 | 73.6 / 70.6
AS Reader (+BookTest) † ‡ | – / – | – / – | 80.5 / 76.2 | 83.2 / 80.8
AS Reader (+BookTest, ensemble) † ‡ | – / – | – / – | 82.3 / 78.4 | 85.7 / 83.7
GA (update L(w)) | 77.9 / 77.9 | 81.5 / 80.9 | 76.7 / 70.1 | 69.8 / 67.3
GA (fix L(w)) | 77.9 / 77.8 | 80.4 / 79.6 | 77.2 / 71.4 | 71.6 / 68.0
GA (+feature, update L(w)) | 77.3 / 76.9 | 80.7 / 80.0 | 77.2 / 73.3 | 73.0 / 69.8
GA (+feature, fix L(w)) | 76.7 / 77.4 | 80.0 / 79.3 | 78.5 / 74.9 | 74.4 / 70.7

Table 2.1: Validation / Test accuracy (%) on CNN, Daily Mail and CBT. Results marked with "†" are cf. previously published works. Results marked with "‡" were obtained by training on a larger training set. Best performance on standard training sets is in bold, and on larger training sets in italics.


Model | Strict (Val / Test) | Relaxed (Val / Test)
Human † | – / 84 | – / –
Attentive Reader † | – / 53 | – / 55
AS Reader † | – / 57 | – / 59
Stanford AR † | – / 64 | – / 65
NSE † | 66.5 / 66.2 | 67.0 / 66.7
GA (update L(w)) | 67.8 / 67.0 | 67.0 / 66.6
GA (fix L(w)) | 68.3 / 68.0 | 69.6 / 69.1
GA (+feature, update L(w)) | 70.1 / 69.5 | 70.9 / 71.0
GA (+feature, fix L(w)) | 71.6 / 71.2 | 72.6 / 72.6

Table 2.2: Validation / Test accuracy (%) on the WDW dataset for both "Strict" and "Relaxed" settings. Results with "†" are cf. previously published works.

Lastly, on CBT-CN the GA Reader with the qe-comm feature outperforms all previously published single models except the NSE and the AS Reader trained on a larger corpus. For each of the 4 datasets on which GA achieves the top performance, we conducted one-sample proportion tests to test whether GA is significantly better than the second-best baseline. The p-values are 0.319 for CNN, <0.00001 for Daily Mail, 0.028 for CBT-NE, and <0.00001 for WDW. In other words, GA statistically significantly outperforms all other baselines on 3 out of those 4 datasets at the 5% significance level. The results could be even more significant under paired tests; however, we did not have access to the predictions from the baselines.

Analysis

To gain insight into the reading process employed by the model, we analyzed the attention distributions at intermediate layers of the reader. Figure 2.4 shows an example from the validation set of the WDW dataset. In each figure, the left and middle plots visualize attention over the query (Equation 2.6) for candidates in the document after layers 1 & 2 respectively. The right plot shows attention over candidates in the document for the cloze placeholder (XXX) in the query at the final layer. The full document, query and correct answer are shown at the bottom.

A generic pa�ern observed in such examples is that in intermediate layers, candidates in


Figure 2.4: Layer-wise attention visualization of GA Reader trained on WDW-Strict.

In Figure 2.4 there is a high attention of the correct answer on financial regulatory standards in the first layer, and on us president in the second layer. The incorrect answer, in contrast, only attends to one of these aspects, and hence receives a lower score in the final layer despite the n-gram overlap it has with the cloze token.

Now, we switch our focus to evaluating the Coref-GRU layer. We present results on three datasets which explicitly require reasoning over long-term dependencies, discussed below.

2.5.2 bAbi AI Tasks

Our first set of experiments is on the 1K training version of the synthetic bAbi AI tasks [146]. The passages and questions in this dataset are generated using templates, removing many complexities inherent in natural language, but it still provides a useful testbed for us since some tasks are specifically constructed to test the coreference-based reasoning we tackle here. Experiments on more natural data are described below.

Table 2.3 shows a comparison of EntNets [48], QRNs [110] (the best published results on bAbi-1K), and our models. We also include the results for a single layer version of GA Reader (which we denote simply as Bi-GRU, or Bi-C-GRU when using coreference) to enable fair comparison with EntNets.


Method                Avg      Max      # failed

EntNets [48]          –        0.704    15
QRN [110]             –        0.901    7

Bi-GRU                0.727    0.767    13
Bi-C-GRU              0.790    0.831    12
GA w/ GRU             0.764    0.810    10
GA w/ GRU + 1-hot     0.766    0.808    9
GA w/ C-GRU           0.870    0.886    5

Table 2.3: Accuracy on bAbi-1K, averaged across all 20 tasks. Following previous work we run each task for 10 random seeds, and report the Avg and Max (based on dev set) performance. A task is considered failed if its Max performance is < 0.95.

In each case we see clear improvements from using C-GRU layers over GRU layers. Interestingly, EntNets, which have >99% performance when trained with 10K examples, only reach 70% performance with 1K training examples. The Bi-C-GRU model significantly improves on this baseline, which shows that, with less data, coreference annotations can provide a useful bias for a memory network on how to read and write memories.

A break-down of task-wise performance is given in Table 2.4. Comparing C-GRU to the GRU-based method, we find that the main gains are on tasks 2 (two supporting facts), 3 (three supporting facts) and 16 (basic induction). All these tasks require aggregation of information across sentences to derive the answer. Comparing to the QRN baseline, we found that C-GRU was significantly worse on task 15 (basic deduction). On closer examination we found that this was because our simplistic coreference module, which matches tokens exactly, was not able to resolve "mice" to "mouses" and "cats" to "cat". On the other hand, C-GRU was significantly better than QRN on task 16 (basic induction).

We also include a baseline which uses coreference features as 1-hot vectors appended to the input word vectors (GA w/ GRU + 1-hot). This provides the model with information about the coreference clusters, but does not improve performance, suggesting that the regular GRU is unable to track the given coreference information across long distances to solve the task. On the other hand, in Figure 2.5 (left) we show how the performance of GA w/ C-GRU varies as we remove gold-standard mentions from coreference clusters, or if we replace them with random mentions (GA w/ random-GRU). In both cases there is a sharp drop in performance, showing that specifically using coreference for connecting mentions is important.


Task                            QRN      GA w/ GRU    GA w/ C-GRU

1: Single Supporting Fact       1.000    0.997        1.000
2: Two Supporting Facts         0.993    0.345        0.990
3: Three Supporting Facts       0.943    0.558        0.982
4: Two Argument Relations       1.000    1.000        1.000
5: Three Argument Relations     0.989    0.989        0.993
6: Yes/No Questions             0.991    0.962        0.976
7: Counting                     0.904    0.946        0.976
8: Lists / Sets                 0.944    0.947        0.964
9: Simple Negation              1.000    0.991        0.990
10: Indefinite Knowledge        1.000    0.992        0.986
11: Basic Coreference           1.000    0.995        0.996
12: Conjunction                 1.000    1.000        0.996
13: Compound Coreference        1.000    0.998        0.993
14: Time Reasoning              0.992    0.895        0.849
15: Basic Deduction             1.000    0.521        0.470
16: Basic Induction             0.470    0.488        0.999
17: Positional Reasoning        0.656    0.580        0.574
18: Size Reasoning              0.921    0.908        0.896
19: Path Finding                0.213    0.095        0.099
20: Agent's Motivation          0.998    0.998        1.000

Average                         0.901    0.810        0.886

Table 2.4: Breakdown of task-wise performance on the bAbi dataset. Tasks where C-GRU is significantly better / worse than either GRU or QRNs are highlighted.

2.5.3 Wikihop Dataset

Next we apply our model to the Wikihop dataset [144], which is specifically constructed to test multi-hop reading comprehension across documents. Each instance in this dataset consists of a collection of passages (p1, . . . , pN), and a query of the form (h, r), where h is an entity and r is a relation. The task is to find the tail entity t from a set of provided candidates C. As preprocessing we concatenate all documents in a random order, and extract coreference annotations from the Berkeley Entity Resolution system [31], which gets about 62% F1 score on the CoNLL 2011 test set. We only keep the coreference clusters which contain at least one candidate from C or an entity which co-occurs with the head entity h. We report results in Table 2.5 when using the full training set, as well as when using reduced training sets of size 1K and 5K, to test the model under a low-data regime.



Figure 2.5: Left: Accuracy of GA w/ C-GRU as coreference annotations are removed for bAbi task 3. Right: Expected probability of correct answer (exp(−loss)) on the Validation set as training progresses on the Wikihop dataset for the 1K, 5K and full training sets.

In Figure 2.5 we also show the training curves of exp(−loss) on the validation set.

We see higher performance for the C-GRU model in the low-data regime, and better generalization throughout the training curve for all three settings. This supports our conjecture that the GRU layer has difficulty learning the kind of coreference-based reasoning required in this dataset, and that the bias towards coreferent recency helps with that. However, perhaps surprisingly, given enough data both models perform comparably. This could either indicate that the baseline learns the required reasoning patterns when given enough data, or that the bias towards coreference-based reasoning hurts performance for some other types of questions. Indeed, 9% of questions are answered correctly by the baseline but not by C-GRU; however, we did not find any consistent patterns among these in our analyses.

2.5.4 LAMBADA Dataset

Our last set of experiments is on the broad-context language modeling task of the LAMBADA dataset [90]. This dataset consists of passages 4-5 sentences long, where the last word needs to be predicted. Interestingly, though, the passages are filtered such that human volunteers were able to predict the missing token given the full passage, but not given only the last sentence. Hence, predicting these tokens involves a broader understanding of the whole passage. Analysis of the questions [17] suggests that around 20% of the questions need coreference understanding to answer correctly. Hence, we apply our model, which uses coreference information, to this task.


                         Follow    Follow+single    Follow+multiple    Overall
      Method             Dev       Dev              Dev                Dev       Test

1K    GA w/ GRU          0.307     0.332            0.287              0.263     –
      GA w/ C-GRU        0.355     0.370            0.354              0.330     –

5K    GA w/ GRU          0.382     0.385            0.390              0.336     –
      GA w/ C-GRU        0.452     0.454            0.460              0.401     –

full  BiDAF              –         –                –                  –         0.429
      GA w/ GRU          0.606     0.615            0.604              0.549     –
      GA w/ C-GRU        0.614     0.616            0.614              0.560†    0.593

Table 2.5: Accuracy on Wikihop. Follow: annotated as answer follows from the given passages. Follow+multiple: annotated as requiring multiple passages for answering. Follow+single: annotated as requiring one passage for answering. †p = 0.057 using McNemar's test compared to GA w/ GRU.

We use the same setup as Chu et al. [17], which formulated the problem as a reading comprehension one by treating the last sentence as the query, and the remaining passage as the context to extract the answer from. In this manner only 80% of the questions are answerable, but the performance increases substantially compared to pure language modeling based approaches. For this dataset we used Stanford CoreNLP to extract coreferences [20], which achieved 0.63 F1 on the CoNLL test set. Table 2.6 shows a comparison of the GA w/ GRU baseline and GA w/ C-GRU models. We see a significant gain in performance when using the layer with coreference bias. Furthermore, the 1-hot baseline, which uses the same coreference information but with a sequential recency bias, fails to improve over the regular GRU layer. While the improvement for C-GRU is small, it is significant, and we note that questions in this dataset involve several different types of reasoning, out of which we only tackle one specific kind.


Method                 overall    context

Chu et al. [17]        0.4900     –
GA w/ GRU              0.5398     0.6677
GA w/ GRU + 1-hot      0.5338     0.6603
GA w/ C-GRU            0.5569     0.6888†

Table 2.6: Accuracy on the LAMBADA test set, averaged across two runs with random initializations. context: passages for which the answer is in context. overall: full test set for comparison to prior work. †p < 0.0001 using McNemar's test compared to GA w/ GRU.

2.6 Conclusion

We presented the Gated-Attention reader for answering cloze-style questions over documents. The GA reader features a novel multiplicative gating mechanism, combined with a multi-hop architecture. The model shows strong performance on several large-scale benchmark datasets, with more than 4% improvements over competitive baselines. While we have focused on text comprehension, we believe that the Gated-Attention mechanism may benefit other tasks as well, where multiple sources of information interact.

We have also presented a recurrent layer with a bias towards coreferent recency, with the goal of tackling reading comprehension problems which require aggregating information from multiple mentions of the same entity. Our experiments show that, when combined with the GA Reader, the layer provides a useful inductive bias for solving problems of this kind. Noise in the coreference annotations has a detrimental effect on the performance (Figure 2.5); hence, an important direction for future work is exploring joint models which learn to do coreference resolution and reading together.


Chapter 3

Text with KB Facts (Completed Work)

Large-scale knowledge bases (KBs), such as Freebase [10], YAGO [120], and DBPedia [3], hold valuable information about real-world entities in a graphical form, making them a useful resource to answer open-domain questions [8]. Hence, in this chapter, we look to augment the text-based QA setup discussed so far with facts from such KBs.

3.1 Introduction

Open domain Question Answering (QA) is the task of finding answers to questions posed in natural language. Historically, this required a specialized pipeline consisting of multiple machine-learned and hand-crafted modules [35]. Recently, the paradigm has shifted towards training end-to-end deep neural network models for the task [13, 56, 77, 96, 125]. Most existing models, however, answer questions using a single information source, usually either text from an encyclopedia, or a single knowledge base (KB).

Intuitively, the suitability of an information source for QA depends on both its coverage

and the difficulty of extracting answers from it. A large text corpus has high coverage, but the information is expressed using many different text patterns. As a result, models which operate on these patterns (e.g. BiDAF [109]) do not generalize beyond their training domains [133, 148] or to novel types of reasoning [125, 144]. KBs, on the other hand, suffer from low coverage due to their inevitable incompleteness and restricted schema [83], but are easier to extract answers from, since they are constructed precisely for the purpose of being queried.

In practice, some questions are best answered using text, while others are best answered using KBs. A natural question, then, is how to effectively combine both types of information. Surprisingly little prior work has looked at this problem. In this chapter we focus on a scenario in which a large-scale KB [3, 10] and a text corpus are available, but neither is sufficient alone for answering all questions.


Figure 3.1: Left: To answer a question posed in natural language, GRAFT-Net considers a heterogeneous graph constructed from text and KB facts, and thus can leverage the rich relational structure between the two information sources. Right: Embeddings are propagated in the graph for a fixed number of layers (L) and the final node representations are used to classify answers.

A naïve option, in such a setting, is to take state-of-the-art QA systems developed for each source, and aggregate their predictions using some heuristic [6, 35]. We call this approach late fusion, and show that it can be sub-optimal, as models have limited ability to aggregate evidence across the different sources. Instead, we focus on an early fusion strategy, where a single model is trained to extract answers from a question subgraph (see Fig 3.1, left) containing relevant KB facts as well as text sentences. Early fusion allows more flexibility in combining information from multiple sources.

To enable early fusion, we propose a novel graph convolution based neural network, called GRAFT-Net (Graphs of Relations Among Facts and Text Networks), specifically designed to operate over heterogeneous graphs of KB facts and text sentences. We build upon recent work on graph representation learning [63, 107], but propose two key modifications to adopt them for the task of QA. First, we propose heterogeneous update rules that handle KB nodes differently from the text nodes: for instance, LSTM-based updates are used to propagate information into and out of text nodes. Second, we introduce a directed propagation method, inspired by personalized PageRank in IR [47], which constrains the propagation of embeddings in the graph to follow paths starting from seed nodes linked to the question. Empirically, we show that both these extensions are crucial for the task of QA. An overview of the model is shown in Figure 3.1.


We evaluate these methods on a new suite of benchmark tasks for testing QA models when both KB and text are present. Using WikiMovies [82] and WebQuestionsSP [164], we construct datasets with a varying amount of training supervision and KB completeness, and with a varying degree of question complexity. We report baselines for future comparison, including Key-Value Memory Networks [27, 82], and show that our proposed GRAFT-Nets have superior performance across a wide range of conditions. We also show that GRAFT-Nets are competitive with state-of-the-art methods developed specifically for text-only QA, and with state-of-the-art methods developed for KB-only QA.

3.2 Related Work

The work of Das et al. [27] attempts an early fusion strategy for QA over KB facts and text. Their approach is based on Key-Value Memory Networks (KV-MemNNs) [82] coupled with a universal schema [100] to populate a memory module with representations of KB triples and text snippets independently. The key limitation of this model is that it ignores the rich relational structure between the facts and text snippets.

Non-deep learning approaches have also been attempted for QA over both text assertions and KB facts. Gardner and Krishnamurthy [37] use traditional feature extraction methods of open-vocabulary semantic parsing for the task. Ryu et al. [101] use a pipelined system aggregating evidence from both unstructured and semi-structured sources for open-domain QA.

Another line of work has looked at learning combined representations of KBs and text for relation extraction and Knowledge Base Completion (KBC) [26, 46, 68, 100, 135, 138]. The key difference in QA compared to KBC is that in QA the inference process on the knowledge source has to be conditioned on the question, so different questions induce different representations of the KB and warrant a different inference process. Furthermore, KBC operates under the fixed schema defined by the KB beforehand, whereas natural language questions might not adhere to this schema.

The GRAFT-Net model itself is motivated by the large body of work on graph representation learning [2, 63, 76, 103, 107]. Like most other graph-based models, GRAFT-Nets can also be viewed as an instantiation of the Message Passing Neural Network (MPNN) framework of Gilmer et al. [40]. GRAFT-Nets are also inductive representation learners like GraphSAGE [45], but operate on a heterogeneous mixture of nodes and use retrieval for getting a subgraph instead of random sampling. The recently proposed Walk-Steered Convolution model uses random walks for learning graph representations [58]. Our personalization technique also borrows


from such random walk literature, but uses it to localize propagation of embeddings.

3.3 Open-Domain QA

Notations

A knowledge base is denoted as K = (V, E, R), where V is the set of entities in the KB, and the edges E are triplets (s, r, o) which denote that relation r ∈ R holds between the subject s ∈ V and object o ∈ V. A text corpus D is a set of documents {d1, . . . , d|D|}, where each document is a sequence of words di = (w1, . . . , w|di|). We further assume that an (imperfect) entity linking system has been run on the collection of documents, whose output is a set L of links (v, dp) connecting an entity v ∈ V with a word at position p in document d, and we denote with Ld the set of all entity links in document d. For entity mentions spanning multiple words in d, we include links to all the words in the mention in L.

The task is, given a natural language question q = (w1, . . . , w|q|), extract its answers {a}q from G = (K, D, L). There may be multiple correct answers for a question. Here we will assume that the answers are entities from either the documents or the KB. We are interested in a wide range of settings, where the KB K varies from highly incomplete to complete for answering the questions, and we will introduce datasets for testing our models under these settings.

To solve this task we proceed in two steps. First, we extract a subgraph Gq ⊂ G which contains the answer to the question with high probability. The goal for this step is to ensure high recall for answers while producing a graph small enough to fit into GPU memory for gradient-based learning. Next, we use our proposed model GRAFT-Net to learn node representations in Gq, conditioned on q, which are used to classify each node as being an answer or not. Training data for the second step is generated using distant supervision. The entire process mimics the search-and-read paradigm for text-based QA [131].

Question Subgraph Retrieval

We retrieve the subgraph Gq using two parallel pipelines – one over the KB K, which returns a set of entities, and the other over the corpus D, which returns a set of documents. The retrieved entities and documents are then combined with entity links to produce a fully-connected graph.

KB Retrieval. To retrieve relevant entities from the KB we first perform entity linking on the question q, producing a set of seed entities, denoted Sq. Next we run the Personalized PageRank


(PPR) method [47] around these seeds to identify other entities which might be an answer to the question. The edge-weights around Sq are distributed equally among all edges of the same type, and they are weighted such that edges relevant to the question receive a higher weight than those which are not. Specifically, we average word vectors to compute a relation vector v(r) from the surface form of the relation, and a question vector v(q) from the words in the question, and use the cosine similarity between these as the edge weights. After running PPR we retain the top E entities v1, . . . , vE by PPR score, along with any edges between them, and add them to Gq.
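A minimal sketch of the question-weighted edge scoring and Personalized PageRank retrieval described above is given below. The graph representation, the word-embedding lookup, the restart probability, and the iteration count are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np

def avg_vec(words, emb, dim):
    """Average pre-trained word vectors; emb maps word -> np.ndarray of size dim."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def retrieve_kb_entities(seeds, edges, question, emb, dim, top_e=500, lam=0.8, iters=20):
    """edges: list of (subject, relation_surface_words, object) triples.
    Returns the top_e entities around the seeds by Personalized PageRank score."""
    q_vec = avg_vec(question, emb, dim)
    adj = {}  # node -> list of (neighbor, edge weight = cos(v(r), v(q)))
    for s, r_words, o in edges:
        w = max(cosine(avg_vec(r_words, emb, dim), q_vec), 0.0)
        adj.setdefault(s, []).append((o, w))
        adj.setdefault(o, []).append((s, w))
    restart = {e: 1.0 / len(seeds) for e in seeds}   # teleport back to seed entities
    scores = dict(restart)
    for _ in range(iters):
        new = {e: (1 - lam) * p for e, p in restart.items()}
        for v, p in scores.items():
            nbrs = adj.get(v, [])
            total = sum(w for _, w in nbrs) or 1.0
            for u, w in nbrs:
                new[u] = new.get(u, 0.0) + lam * p * w / total
        scores = new
    return sorted(scores, key=scores.get, reverse=True)[:top_e]
```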

Text Retrieval. We use Wikipedia as the corpus and retrieve text at the sentence level, i.e. documents in D are defined along sentence boundaries1. We perform text retrieval in two steps: first we retrieve the top 5 most relevant Wikipedia articles, using the weighted bag-of-words model from DrQA [13]; then we populate a Lucene2 index with sentences from these articles, and retrieve the top ranking ones d1, . . . , dD, based on the words in the question. For the sentence-retrieval step, we found it beneficial to include the title of the article as an additional field in the Lucene index. As most sentences in an article talk about the title entity, this helps in retrieving relevant sentences that do not explicitly mention the entity in the question. We add the retrieved documents, along with any entities linked to them, to the subgraph Gq.
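A rough sketch of this two-stage retrieval is shown below, with scikit-learn TF-IDF standing in for both the DrQA article ranker and the Lucene sentence index; prepending the article title to each candidate sentence mimics the extra title field mentioned above. This is purely illustrative and not the pipeline actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_sentences(question, articles, top_articles=5, top_sentences=50):
    """articles: list of (title, list_of_sentences). Returns top sentences for the question."""
    # Stage 1: rank whole articles (stand-in for the DrQA bag-of-words ranker).
    titles, bodies = zip(*[(t, " ".join(s)) for t, s in articles])
    art_vec = TfidfVectorizer().fit(list(bodies) + [question])
    art_scores = cosine_similarity(art_vec.transform([question]), art_vec.transform(bodies))[0]
    best = art_scores.argsort()[::-1][:top_articles]
    # Stage 2: rank sentences from the selected articles (stand-in for the Lucene index);
    # the article title is prepended so title words also contribute to the match.
    cands = [f"{articles[i][0]} . {sent}" for i in best for sent in articles[i][1]]
    sent_vec = TfidfVectorizer().fit(cands + [question])
    sent_scores = cosine_similarity(sent_vec.transform([question]), sent_vec.transform(cands))[0]
    return [cands[j] for j in sent_scores.argsort()[::-1][:top_sentences]]
```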

The final question subgraph is Gq = (Vq, Eq, R+), where the vertices Vq consist of all the retrieved entities and documents, i.e. Vq = {v1, . . . , vE} ∪ {d1, . . . , dD}. The edges are all relations from K among these entities, plus the entity-links between documents and entities, i.e.

Eq = {(s, o, r) ∈ E : s, o ∈ Vq, r ∈ R} ∪ {(v, dp, rL) : (v, dp) ∈ Ld, d ∈ Vq},

where rL denotes a special "linking" relation. R+ = R ∪ {rL} is the set of all edge types in the subgraph.

3.4 GRAFT-Nets

The question q and its answers {a}q induce a labeling of the nodes in Vq: we let yv = 1 if v ∈ {a}q and yv = 0 otherwise, for all v ∈ Vq. The task of QA then reduces to performing binary classification over the nodes of the graph Gq. Several graph-propagation based models have been proposed in the literature which learn node representations and then perform classification of the nodes [63, 107]. Such models follow the standard gather-apply-scatter paradigm to learn the node representation with homogeneous updates, i.e. treating all neighbors equally.

1 The term document will always refer to a sentence in this chapter.
2 https://lucene.apache.org/


The basic recipe for these models is as follows:

1. Initialize node representations $h_v^{(0)}$.
2. For $l = 1, \ldots, L$ update node representations
$$h_v^{(l)} = \phi\left(h_v^{(l-1)}, \sum_{v' \in N_r(v)} h_{v'}^{(l-1)}\right),$$
where $N_r(v)$ denotes the neighbours of $v$ along incoming edges of type $r$, and $\phi$ is a neural network layer.

Here $L$ is the number of layers in the model and corresponds to the maximum length of the paths along which information should be propagated in the graph. Once the propagation is complete, the final layer representations $h_v^{(L)}$ are used to perform the desired task, for example link prediction in knowledge bases [107].
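As a deliberately simplified illustration of this recipe, the numpy sketch below runs L layers of the homogeneous update, with a single sum over all in-neighbors standing in for the per-relation sums and a fixed tanh layer standing in for φ; it is not the GRAFT-Net update, only the generic pattern.

```python
import numpy as np

def propagate(h0, in_neighbors, W_self, W_agg, num_layers):
    """h0: (num_nodes, n) initial node states; in_neighbors[v]: list of node ids with edges into v.
    Applies h_v <- tanh(W_self h_v + W_agg * sum of in-neighbor states) for num_layers steps."""
    h = h0
    for _ in range(num_layers):
        agg = np.stack([h[nbrs].sum(axis=0) for nbrs in in_neighbors])  # gather + sum
        h = np.tanh(h @ W_self.T + agg @ W_agg.T)                       # apply
    return h
```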

However, there are two differences in our setting from previously studied graph-based classification tasks. The first difference is that, in our case, the graph Gq consists of heterogeneous nodes. Some nodes in the graph correspond to KB entities which represent symbolic objects, whereas other nodes represent textual documents which are variable length sequences of words. The second difference is that we want to condition the representation of nodes in the graph on the natural language question q. In §3.4 we introduce heterogeneous updates to address the first difference, and in §3.4 we introduce mechanisms for conditioning on the question (and its entities) for the second.

Node Initialization

Nodes corresponding to entities are initialized using fixed-size vectors $h_v^{(0)} = x_v \in \mathbb{R}^n$, where $x_v$ can be pre-trained KB embeddings or random, and $n$ is the embedding size. Document nodes in the graph describe a variable length sequence of text. Since multiple entities might link to different positions in the document, we maintain a variable length representation of the document in each layer. This is denoted by $H_d^{(l)} \in \mathbb{R}^{|d| \times n}$. Given the words in the document $(w_1, \ldots, w_{|d|})$, we initialize its hidden representation as:

$$H_d^{(0)} = \mathrm{LSTM}(w_1, w_2, \ldots),$$

where LSTM refers to a long short-term memory unit. We denote the p-th row of $H_d^{(l)}$, corresponding to the embedding of the p-th word in the document $d$ at layer $l$, as $H_{d,p}^{(l)}$.



Figure 3.2: Illustration of the heterogeneous update rules for entities (left) and text documents (right).

Heterogeneous Updates

Figure 3.2 shows the update rules for entities and documents, which we describe in detail here.

Entities. Let $M(v) = \{(d, p)\}$ be the set of positions $p$ in documents $d$ which correspond to a mention of entity $v$. The update for entity nodes involves a single-layer feed-forward network (FFN) over the concatenation of four states:

$$h_v^{(l)} = \mathrm{FFN}\left(\begin{bmatrix} h_v^{(l-1)} \\ h_q^{(l-1)} \\ \sum_r \sum_{v' \in N_r(v)} \alpha_r^{v'}\, \psi_r(h_{v'}^{(l-1)}) \\ \sum_{(d,p) \in M(v)} H_{d,p}^{(l-1)} \end{bmatrix}\right). \tag{3.1}$$

The first two terms correspond to the entity representation and question representation (details below), respectively, from the previous layer.

The third term aggregates the states from the entity neighbours of the current node, $N_r(v)$, after scaling with an attention weight $\alpha_r^{v'}$ (described in the next section), and applying relation specific transformations $\psi_r$. Previous work on Relational Graph Convolution Networks [107] used a linear projection for $\psi_r$. For a batched implementation, this results in matrices of size $O(B|R_q||E_q|n)$, where $B$ is the batch size, which can be prohibitively large for large subgraphs3.


Hence in this work we use relation vectors $x_r$ for $r \in R_q$ instead of matrices, and compute the update along an edge as:

$$\psi_r(h_{v'}^{(l-1)}) = pr_{v'}^{(l-1)}\, \mathrm{FFN}\left(x_r, h_{v'}^{(l-1)}\right). \tag{3.2}$$

Here $pr_{v'}^{(l-1)}$ is a PageRank score used to control the propagation of embeddings along paths starting from the seed nodes, which we describe in detail in the next section. The memory complexity of the above is $O(B(|F_q| + |E_q|)n)$, where $|F_q|$ is the number of facts in the subgraph Gq.

The last term aggregates the states of all tokens that correspond to mentions of the entity v among the documents in the subgraph. Note that the update depends on the positions of entities in their containing document.

Documents. Let L(d, p) be the set of all entities linked to the word at position p in document d. The document update proceeds in two steps. First we aggregate over the entity states coming in at each position separately:

$$\tilde{H}_{d,p}^{(l)} = \mathrm{FFN}\left(H_{d,p}^{(l-1)}, \sum_{v \in L(d,p)} h_v^{(l-1)}\right). \tag{3.3a}$$

Here $h_v^{(l-1)}$ are normalized by the number of outgoing edges at $v$. Next we aggregate states within the document using an LSTM:

$$H_d^{(l)} = \mathrm{LSTM}(\tilde{H}_d^{(l)}). \tag{3.3b}$$
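One possible PyTorch rendering of the heterogeneous updates in Eqs. (3.1)-(3.3) is sketched below; dimensions, batching, the choice of nonlinearity, and the assumption that the attention- and PageRank-weighted neighbor messages have been precomputed are all simplifications for illustration, not the thesis implementation.

```python
import torch
import torch.nn as nn

class HeteroLayer(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.entity_ffn = nn.Linear(4 * n, n)   # concat of the four terms in Eq. (3.1)
        self.psi = nn.Linear(2 * n, n)          # FFN over (x_r, h_v') in Eq. (3.2)
        self.doc_ffn = nn.Linear(2 * n, n)      # Eq. (3.3a)
        self.doc_lstm = nn.LSTM(n, n, batch_first=True)  # Eq. (3.3b)

    def edge_message(self, x_r, h_nbr, attn, pagerank):
        # One weighted message psi_r(h_v') scaled by attention and PageRank (Eq. 3.2).
        return attn * pagerank * self.psi(torch.cat([x_r, h_nbr], dim=-1))

    def entity_update(self, h_v, h_q, nbr_msgs, mention_states):
        # nbr_msgs: sum of edge_message(...) over neighbors; mention_states: sum of H_{d,p}.
        return torch.relu(self.entity_ffn(torch.cat([h_v, h_q, nbr_msgs, mention_states], dim=-1)))

    def document_update(self, H_d, entity_sums):
        # entity_sums[p] = sum of h_v over entities linked at position p (Eq. 3.3a).
        H_tilde = torch.relu(self.doc_ffn(torch.cat([H_d, entity_sums], dim=-1)))
        H_next, _ = self.doc_lstm(H_tilde.unsqueeze(0))   # Eq. (3.3b), one document at a time
        return H_next.squeeze(0)
```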

Conditioning on the Question

For the parts described thus far, the graph learner is largely agnostic of the question. We introduce dependence on the question in two ways: by attention over relations, and by personalized propagation.

To represent $q$, let $w_1^q, \ldots, w_{|q|}^q$ be the words in the question. The initial representation is computed as:

$$h_q^{(0)} = \mathrm{LSTM}(w_1^q, \ldots, w_{|q|}^q)_{|q|} \in \mathbb{R}^n, \tag{3.4}$$

3 This is because we have to use adjacency matrices of size $|R_q| \times |E_q| \times |E_q|$ to aggregate embeddings from neighbours of all nodes simultaneously.


Figure 3.3: Directed propagation of embeddings in GRAFT-Net. A scalar PageRank score $pr_v^{(l)}$ is maintained for each node $v$ across layers, which spreads out from the seed node. Embeddings are only propagated from nodes with $pr_v^{(l)} > 0$.

where we extract the final state from the output of the LSTM. In subsequent layers the question representation is updated as $h_q^{(l)} = \mathrm{FFN}\left(\sum_{v \in S_q} h_v^{(l)}\right)$, where $S_q$ denotes the seed entities mentioned in the question.

Attention over Relations. The attention weight in the third term of Eq. (3.1) is computed using the question and relation embeddings:

$$\alpha_r^{v'} = \mathrm{softmax}(x_r^T h_q^{(l-1)}),$$

where the softmax normalization is over all outgoing edges from $v'$, and $x_r$ is the relation vector for relation $r$. This ensures that embeddings are propagated more along edges relevant to the question.

Directed Propagation. Many questions require multi-hop reasoning, which follows a path from a seed node mentioned in the question to the target answer node. To encourage such a behaviour when propagating embeddings, we develop a technique inspired by personalized PageRank in IR [47]. The propagation starts at the seed entities $S_q$ mentioned in the question. In addition to the vector embeddings $h_v^{(l)}$ at the nodes, we also maintain scalar "PageRank" scores $pr_v^{(l)}$ which measure the total weight of paths from a seed entity to the current node, as follows:


$$pr_v^{(0)} = \begin{cases} \frac{1}{|S_q|} & \text{if } v \in S_q \\ 0 & \text{o.w.} \end{cases},$$

$$pr_v^{(l)} = (1 - \lambda)\, pr_v^{(l-1)} + \lambda \sum_r \sum_{v' \in N_r(v)} \alpha_r^{v'}\, pr_{v'}^{(l-1)}.$$

Notice that we reuse the attention weights $\alpha_r^{v'}$ when propagating PageRank, to ensure that nodes along paths relevant to the question receive a high weight. The PageRank score is used as a scaling factor when propagating embeddings along the edges in Eq. (3.2). For $l = 1$, the PageRank score will be 0 for all entities except the seed entities, and hence propagation will only happen outward from these nodes. For $l = 2$, it will be non-zero for the seed entities and their 1-hop neighbors, and propagation will only happen along these edges. Figure 3.3 illustrates this process.
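A small numpy sketch of this PageRank-style propagation is given below; the attention weights are taken as given per edge, and the graph is represented as a simple edge list. This is an illustrative rendering of the update equations, not the actual code.

```python
import numpy as np

def propagate_pagerank(seed_ids, edges, alpha, num_nodes, num_layers, lam=0.5):
    """edges: list of (v_src, v_dst, edge_id); alpha[edge_id] is the attention weight of
    that edge (normalized over edges leaving v_src). Returns the pr scores per layer."""
    pr = np.zeros(num_nodes)
    pr[seed_ids] = 1.0 / len(seed_ids)           # pr^(0): uniform over seed entities
    history = [pr.copy()]
    for _ in range(num_layers):
        spread = np.zeros(num_nodes)
        for v_src, v_dst, e in edges:
            spread[v_dst] += alpha[e] * pr[v_src]
        pr = (1 - lam) * pr + lam * spread        # pr^(l) update
        history.append(pr.copy())
    return history
```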

Training & Inference

The final representations $h_v^{(L)} \in \mathbb{R}^n$ are used for binary classification to select the answers:

$$\Pr\left(v \in \{a\}_q \mid G_q, q\right) = \sigma(w^T h_v^{(L)} + b), \tag{3.5}$$

where $\sigma$ is the sigmoid function. Training uses binary cross-entropy loss over these probabilities. To encourage the model to learn a robust classifier, which exploits all available sources of information, we randomly drop edges from the graph during training with probability $p_0$. We call this fact-dropout. It is usually easier to extract answers from the KB than from the documents, so the model tends to rely on the former, especially when the KB is complete. This method is similar to DropConnect [140].
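The sketch below shows one way fact-dropout and the binary cross-entropy training step described above could look in PyTorch; `model`, the dictionary layout of the subgraph, and the default `p0` are placeholders for illustration rather than the actual training code.

```python
import random
import torch.nn.functional as F

def fact_dropout(kb_edges, p0):
    """Randomly drop KB edges from the question subgraph during training."""
    return [e for e in kb_edges if random.random() > p0]

def training_step(model, optimizer, subgraph, question, answer_labels, p0=0.2):
    # answer_labels: float tensor of 0/1 node labels y_v over the subgraph nodes.
    subgraph = dict(subgraph)
    subgraph["kb_edges"] = fact_dropout(subgraph["kb_edges"], p0)
    logits = model(subgraph, question)                    # one logit per node (Eq. 3.5)
    loss = F.binary_cross_entropy_with_logits(logits, answer_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```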

3.5 Experiments & Results

3.5.1 Datasets

WikiMovies-10K consists of 10K randomly sampled training questions from the WikiMovies dataset [82], along with the original test and validation sets. We sample the training questions to create a more difficult setting, since the original dataset has 100K questions over only 8 different relation types, which is unrealistic in our opinion. In § 3.5.4 we also compare to the existing state-of-the-art using the full training set.


We use the KB and text corpus constructed from Wikipedia released by Miller et al. [82]. For entity linking we use simple surface level matches, and retrieve the top 50 entities around the seeds to create the question subgraph. We further add the top 50 sentences (along with their article titles) to the subgraph using Lucene search over the text corpus. The overall answer recall in our constructed subgraphs is 99.6%.

WebQuestionsSP [164] consists of 4737 natural language questions posed over Freebase entities, split up into 3098 training and 1639 test questions. We reserve 250 training questions for model development and early stopping. We use the entity linking outputs from S-MART4 and retrieve 500 entities from the neighbourhood around the question seeds in Freebase to populate the question subgraphs5. We further retrieve the top 50 sentences from Wikipedia with the two-stage process described in §3.3. The overall recall of answers among the subgraphs is 94.0%.

Dataset            # train/dev/test    # entities    # relations    # documents    # question words

WikiMovies-10K     10K/10K/10K         43,233        9              79,728         1759
WebQuestionsSP     2848/250/1639       528,617       513            235,567        3781

Table 3.1: Statistics of all the retrieved subgraphs ∪qGq for WikiMovies-10K and WebQuestionsSP.

Table 3.1 shows the combined statistics of all the retrieved subgraphs for the questions in each dataset. These two datasets present varying levels of difficulty. While all questions in WikiMovies correspond to a single KB relation, for WebQuestionsSP the model needs to aggregate over two KB facts for ∼30% of the questions, and also requires reasoning over constraints for ∼7% of the questions [77]. For maximum portability, QA systems need to be robust across several degrees of KB availability, since different domains might contain different amounts of structured data, and KB completeness may also vary over time. Hence, we construct an additional 3 datasets each from the above two, with the number of KB facts downsampled to 10%, 30% and 50% of the original to simulate settings where the KB is incomplete. We repeat the retrieval process for each sampled KB.

4 https://github.com/scottyih/STAGG
5 A total of 13 questions had no detected entities. These were ignored during training and considered as incorrect during evaluation.


                                KB + Text
Model        Text Only     10%            30%            50%            100%

WikiMovies-10K
KV-KB        –             15.8 /  9.8    44.7 / 30.4    63.8 / 46.4    94.3 / 76.1
KV-EF        50.4 / 40.9   53.6 / 44.0    60.6 / 48.1    75.3 / 59.1    93.8 / 81.4
GN-KB        –             19.7 / 17.3    48.4 / 37.1    67.7 / 58.1    97.0 / 97.6
GN-LF        73.2 / 64.0   74.5 / 65.4    78.7 / 68.5    83.3 / 74.2    96.5 / 92.0
GN-EF                      75.4 / 66.3    82.6 / 71.3    87.6 / 76.2    96.9 / 94.1
GN-EF+LF                   79.0 / 66.7    84.6 / 74.2    88.4 / 78.6    96.8 / 97.3

WebQuestionsSP
KV-KB        –             12.5 /  4.3    25.8 / 13.8    33.3 / 21.3    46.7 / 38.6
KV-EF        23.2 / 13.0   24.6 / 14.4    27.0 / 17.7    32.5 / 23.6    40.5 / 30.9
GN-KB        –             15.5 /  6.5    34.9 / 20.4    47.7 / 34.3    66.7 / 62.4
GN-LF        25.3 / 15.3   29.8 / 17.0    39.1 / 25.9    46.2 / 35.6    65.4 / 56.8
GN-EF                      31.5 / 17.7    40.7 / 25.2    49.9 / 34.7    67.8 / 60.4
GN-EF+LF                   33.3 / 19.3    42.5 / 26.7    52.3 / 37.4    68.7 / 62.3

Table 3.2: Hits@1 / F1 scores of GRAFT-Nets (GN) compared to KV-MemNN (KV) in KB only (-KB), early fusion (-EF), and late fusion (-LF) settings.

3.5.2 Compared Models

KV-KB is the Key Value Memory Networks model from Das et al. [27] and Miller et al. [82], but using only the KB and ignoring the text. KV-EF (early fusion) is the same model with access to both KB and text as memories. For text we use a BiLSTM over the entire sentence as keys, and entity mentions as values. This re-implementation shows better performance on the text-only and KB-only WikiMovies tasks than the results reported previously6 (see Table 3.3). GN-KB is the GRAFT-Net model ignoring the text. GN-LF is a late fusion version of the GRAFT-Net model: we train two separate models, one using text only and the other using KB only, and then ensemble the two7. GN-EF is our main GRAFT-Net model with early fusion. GN-EF+LF is an ensemble over the GN-EF and GN-LF models, with the same ensembling method as GN-LF.

6 For all KV models we tuned the number of layers {1, 2, 3}, batch size {10, 30, 50}, and model dimension {50, 80}. We also use fact-dropout regularization in the KB+Text setting, tuned between {0, 0.2, 0.4}.

7 For ensembles we take a weighted combination of the answer probabilities produced by the models, with the weights tuned on the dev set. For answers only in text or only in KB, we use the probability as is.


We report Hits@1, which is the accuracy of the top-predicted answer from the model, and the F1 score. To compute the F1 score we tune a threshold on the development set to select answers based on the binary probabilities for each node in the subgraph.

3.5.3 Main Results

Table 3.2 presents a comparison of the above models across all datasets. GRAFT-Nets (GN) show consistent improvement over KV-MemNNs on both datasets in all settings, including KB only (-KB), text only (-EF, Text Only column), and early fusion (-EF). Interestingly, we observe a larger relative gap between the Hits and F1 scores for the KV models than we do for our GN models. We believe this is because the attention for KV is normalized over the memories, which are KB facts (or text sentences): hence the model is unable to assign high probabilities to multiple facts at the same time. On the other hand, in GN, we normalize the attention over types of relations outgoing from a node, and hence can assign high weights to all the correct answers.

We also see a consistent improvement of early fusion over late fusion (-LF), and by ensembling them together we see the best performance across all the models. In Table 3.2 (right), we further show the improvement for KV-EF over KV-KB, and GN-LF and GN-EF over GN-KB, as the amount of KB is increased. This measures how effective these approaches are in utilizing text plus a KB. For KV-EF we see improvements when the KB is highly incomplete, but in the full KB setting, the performance of the fused approach is worse. A similar trend holds for GN-LF. On the other hand, GN-EF with text improves over the KB-only approach in all settings. As we would expect, though, the benefit of adding text decreases as the KB becomes more and more complete.

3.5.4 Comparison to Specialized Methods

In Table 3.3 we compare GRAFT-Nets to state-of-the-art models that are specifically designed and tuned for QA using either only KB or only text. For this experiment we use the full WikiMovies dataset to enable direct comparison to previously reported numbers. For DrQA [13], following the original paper, we restrict answer spans for WebQuestionsSP to match an entity in Freebase. In each case we also train GRAFT-Nets using only KB facts or only text sentences. In three out of the four cases, we find that GRAFT-Nets either match or outperform the existing state-of-the-art models. We emphasize that the latter have no mechanism for dealing with the fused setting.


            WikiMovies (full)              WebQuestionsSP
Method      kb             doc             kb             doc

MINERVA     97.0 / –       –               –              –
R2-AsV      –              85.8 / –        –              –
NSM         –              –               – / 69.0       –
DrQA*       –              –               –              21.5 / –
R-GCN#      96.5 / 97.4    –               37.2 / 30.5    –
KV          93.9 / –       76.2 / –        – / –          – / –
KV#         95.6 / 88.0    80.3 / 72.1     46.7 / 38.6    23.2 / 13.0
GN          96.8 / 97.2    86.6 / 80.8     67.8 / 62.8    25.3 / 15.3

Table 3.3: Hits@1 / F1 scores compared to SOTA models using only KB or text: MINERVA [25], R2-AsV [143], Neural Symbolic Machines (NSM) [77], DrQA [13], R-GCN [107] and KV-MemNN [82]. *DrQA is pretrained on SQuAD. #Re-implemented.

Question                                             Correct Answers                                Predicted Answers

what language do most people speak in afghanistan    Pashto language, Farsi (Eastern Language)      Pashto language
what college did john stockton go to                 Gonzaga University                             Gonzaga University, Gonzaga Preparatory School

Table 3.4: Examples from the WebQuestionsSP dataset. Top: The model misses a correct answer. Bottom: The model predicts an extra incorrect answer.

The one exception is the KB-only case for WebQuestionsSP, where GRAFT-Net does 6.2% F1 points worse than Neural Symbolic Machines [77]. Analysis suggested three explanations: (1) In the KB-only setting, the recall of subgraph retrieval is only 90.2%, which limits overall performance. In an oracle setting where we ensure the answers are part of the subgraph, the F1 score increases by 4.8%. (2) We use the same probability threshold for all questions, even though the number of answers may vary significantly. Models which parse the query into a symbolic form do not suffer from this problem, since answers are retrieved in a deterministic fashion. If we tune separate thresholds for each question, the F1 score improves by 7.6%. (3) GRAFT-Nets perform poorly in the few cases where there is a constraint involved in picking out the


answer (for example, "who first voiced Meg in Family Guy"). If we ignore such constraints, and consider all entities with the same sequence of relations to the seed as correct, the performance improves by 3.8% F1. Heuristics such as those used by Yu et al. [168] can be used to improve these cases. Table 3.4 shows examples where GRAFT-Net fails to predict the correct answer set exactly.

3.6 Conclusion

In this chapter we investigated QA using text combined with an incomplete KB, a task which has received limited attention in the past. We introduced several benchmark problems for this task by modifying existing question-answering datasets, and discuss two broad approaches to solving this problem—"late fusion" and "early fusion". We show that early fusion approaches perform better.

We also introduce a novel early-fusion model, called GRAFT-Net, for classifying nodes in a subgraph consisting of both KB entities and text documents. GRAFT-Net builds on recent advances in graph representation learning but includes several innovations which improve performance on this task. GRAFT-Nets are a single model which achieves performance competitive with state-of-the-art methods in both text-only and KB-only settings, and outperforms baseline models when using text combined with an incomplete KB. Current directions for future work include: (1) extending GRAFT-Nets to pick spans of text as answers, rather than only entities, and (2) improving the subgraph retrieval process.


Chapter 4

Multi-turn Dialogue for Knowledge Retrieval (Completed Work)

The previous chapters have discussed automated agents which learn to extract answers from a knowledge source for a single fixed query from the user. Implicitly, this assumes that the user will provide a coherent, well-formed and complete query to the system in one pass. In practice, users prefer short and simple queries, which may be incomplete in terms of the information need requested. Hence, in this chapter we focus on a multi-turn setting, where the user starts with an under-specified query, and the agent responds with follow-up questions to precisely locate the information requested.

4.1 Introduction

Dialogue agents are programs which interact with users in natural language. ReinforcementLearning (RL) has been recently explored to leverage user interactions to adapt various dialogueagents designed, respectively, for task completion [38], information access [145], and chitchat[71].

We focus on KB-InfoBots, a particular type of dialogue agent that helps users navigate a Knowledge Base (KB) in search of an entity, as illustrated by the example in Figure 4.1. Such agents must necessarily query databases in order to retrieve the requested information. One approach for doing so is to perform semantic parsing on the input to construct a symbolic query representing the beliefs of the agent about the user goal, as in the work of Wen et al. [145], Williams and Zweig [151], and Li et al. [75]. We call such an operation a Hard-KB lookup. While natural, this approach has two drawbacks: (1) the retrieved results do not carry any information about uncertainty in semantic parsing, and (2) the retrieval operation is non-differentiable, and hence the parser and dialog policy are trained separately. This makes online end-to-end learning from user feedback difficult once the system is deployed.


[Figure 4.1 content] Example dialogue:
User: Find me the Bill Murray's movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
(Agent beliefs: Movie = ?, Actor = Bill Murray, Release Year = 1993)

Entity-Centric Knowledge Base:
Movie                 Actor            Release Year
Groundhog Day         Bill Murray      1993
Australia             Nicole Kidman    X
Mad Max: Fury Road    X                2015

Figure 4.1: An interaction between a user looking for a movie and the KB-InfoBot. An entity-centric knowledge base is shown above the KB-InfoBot (missing values denoted by X).

In this work, we propose a probabilistic framework for computing the posterior distribution of the user target over a knowledge base, which we term a Soft-KB lookup. This distribution is constructed from the agent's belief about the attributes of the entity being searched for. The dialogue policy network, which decides the next system action, receives as input this full distribution instead of a handful of retrieved results. We show in our experiments that this framework allows the agent to achieve a higher task success rate in fewer dialogue turns. Further, the retrieval process is differentiable, allowing us to construct an end-to-end trainable KB-InfoBot, all of whose components are updated online using RL.

Reinforcement learners typically require an environment to interact with, and hence static dialogue corpora cannot be used for their training. Running experiments on human subjects, on the other hand, is unfortunately too expensive. A common workaround in the dialogue community [105, 106, 167] is to instead use user simulators which mimic the behavior of real users in a consistent manner. For training KB-InfoBot, we adapt the publicly available1 simulator described in Li et al. [74]. We evaluate several versions of KB-InfoBot with the simulator and on real users, and show that the proposed Soft-KB lookup helps the reinforcement learner discover better dialogue policies. Initial experiments on the end-to-end agent also demonstrate its strong learning capability.

1 https://github.com/MiuLab/TC-Bot


4.2 Related Work

Our work is motivated by the neural GenQA [165] and neural enquirer [166] models for querying KBs via natural language in a fully "neuralized" way. However, the key difference is that these systems assume that users can compose a complicated, compositional natural language query that can uniquely identify the element/answer in the KB. The research task is to "parse" the query, i.e., turning the natural language query into a sequence of SQL-like operations. Instead we focus on how to query a KB interactively without composing such complicated queries in the first place. Our work is motivated by the observations that (1) users are more used to issuing simple queries of length less than 5 words [118]; (2) in many cases, it is unreasonable to assume that users can construct compositional queries without prior knowledge of the structure of the KB to be queried.

Also related is the growing body of literature focused on building end-to-end dialogue systems, which combine feature extraction and policy optimization using deep neural networks. Wen et al. [145] introduced a modular neural dialogue agent, which uses a Hard-KB lookup, thus breaking the differentiability of the whole system. As a result, training of the various components of the dialogue system is performed separately. The intent network and belief trackers are trained using supervised labels specifically collected for them, while the policy network and generation network are trained separately on the system utterances.

Dialogue agents can also interface with the database by augmenting their output action space with predefined API calls [11, 75, 151, 169]. The API calls modify a query hypothesis maintained outside the end-to-end system, which is used to retrieve results from this KB. This framework does not deal with uncertainty in language understanding, since the query hypothesis can only hold one slot-value at a time.

Wu et al. [154] presented an entropy minimization dialogue management strategy for InfoBots. The agent always asks for the value of the slot with maximum entropy over the remaining entries in the database, which is optimal in the absence of language understanding errors, and serves as a baseline against our approach.

4.3 Probabilistic KB Lookup

This section describes a probabilistic framework for querying a KB given the agent's beliefs over the fields in the KB.


Entity-Centric Knowledge Base (EC-KB)

A Knowledge Base consists of triples of the form (h, r, t), which denote that relation r holds between the head h and tail t. We assume that the KB-InfoBot has access to a domain-specific entity-centric knowledge base (EC-KB) [170] where all head entities are of a particular type (such as movies or persons), and the relations correspond to attributes of these head entities. Such a KB can be converted to a table format whose rows correspond to the unique head entities, columns correspond to the unique relation types (slots henceforth), and some entries may be missing. An example is shown in Figure 4.1.

Notations and Assumptions

Let T denote the KB table described above and Ti,j denote the jth slot-value of the ith entity, where 1 ≤ i ≤ N and 1 ≤ j ≤ M. We let V j denote the vocabulary of each slot, i.e. the set of all distinct values in the j-th column. We denote missing values from the table with a special token and write Ti,j = Ψ. Mj = {i : Ti,j = Ψ} denotes the set of entities for which the value of slot j is missing. Note that the user may still know the actual value of Ti,j, and we assume this lies in V j. We do not deal with new entities or relations at test time.

We assume a uniform prior G ∼ U[{1, . . . , N}] over the rows in the table T, and let binary random variables Φj ∈ {0, 1} indicate whether the user knows the value of slot j or not. The agent maintains M multinomial distributions ptj(v) for v ∈ V j, denoting the probability at turn t that the user constraint for slot j is v, given their utterances $U_1^t$ till that turn. The agent also maintains M binomials qtj = Pr(Φj = 1), which denote the probability that the user knows the value of slot j.

We assume that column values are distributed independently of each other. This is a strong assumption, but it allows us to model the user goal for each slot independently, as opposed to modeling the user goal over KB entities directly. Typically maxj |V j| < N, and hence this assumption reduces the number of parameters in the model.

Soft-KB Lookup

Let $p_T^t(i) = \Pr(G = i \mid U_1^t)$ be the posterior probability that the user is interested in row $i$ of the table, given the utterances up to turn $t$. We assume all probabilities are conditioned on the user inputs $U_1^t$ and drop it from the notation below. From our assumption of independence of slot values, we can write $p_T^t(i) \propto \prod_{j=1}^{M} \Pr(G_j = i)$, where $\Pr(G_j = i)$ denotes the posterior


probability of the user goal for slot j pointing to Ti,j. Marginalizing this over Φj gives:

$$\Pr(G_j = i) = \sum_{\phi=0}^{1} \Pr(G_j = i, \Phi_j = \phi) \tag{4.1}$$
$$\qquad\qquad = q_j^t \Pr(G_j = i \mid \Phi_j = 1) + (1 - q_j^t) \Pr(G_j = i \mid \Phi_j = 0).$$

For $\Phi_j = 0$, the user does not know the value of the slot, and from the prior:

$$\Pr(G_j = i \mid \Phi_j = 0) = \frac{1}{N}, \quad 1 \le i \le N \tag{4.2}$$

For $\Phi_j = 1$, the user knows the value of slot $j$, but this may be missing from $T$, and we again have two cases:

$$\Pr(G_j = i \mid \Phi_j = 1) = \begin{cases} \frac{1}{N}, & i \in M_j \\ \frac{p_j^t(v)}{N_j(v)} \left(1 - \frac{|M_j|}{N}\right), & i \notin M_j \end{cases} \tag{4.3}$$

Here, Nj(v) is the count of value v in slot j. Combining (4.1), (4.2), and (4.3) gives us theprocedure for computing the posterior over KB entities.
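A direct numpy transcription of this procedure (Eqs. 4.1-4.3 combined over slots) is sketched below, assuming the belief state is already given; the variable names and the table/dict layout are chosen to mirror the notation and are not taken from the original implementation.

```python
import numpy as np

def softkb_posterior(T_col, p_j, q_j, missing_token="PSI"):
    """Posterior Pr(G_j = i) for one slot j (Eqs. 4.1-4.3).
    T_col: list of slot-j values for the N rows; p_j: dict value -> Pr(user goal = value);
    q_j: Pr(user knows slot j)."""
    N = len(T_col)
    counts = {}
    for v in T_col:
        if v != missing_token:
            counts[v] = counts.get(v, 0) + 1
    n_missing = sum(1 for v in T_col if v == missing_token)
    post = np.zeros(N)
    for i, v in enumerate(T_col):
        if v == missing_token:                                 # i in M_j (Eq. 4.3, first case)
            known = 1.0 / N
        else:                                                  # i not in M_j (Eq. 4.3, second case)
            known = p_j.get(v, 0.0) / counts[v] * (1.0 - n_missing / N)
        post[i] = q_j * known + (1.0 - q_j) * (1.0 / N)        # Eq. (4.1) using Eq. (4.2)
    return post

def softkb_lookup(table, beliefs):
    """Combine slots: p_T(i) is proportional to the product over j of Pr(G_j = i).
    beliefs: list of (p_j, q_j), one pair per slot; table: list of rows."""
    score = np.ones(len(table))
    for j, (p_j, q_j) in enumerate(beliefs):
        score *= softkb_posterior([row[j] for row in table], p_j, q_j)
    return score / score.sum()
```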

4.4 End-to-End KB-InfoBot

We claim that the Soft-KB lookup method has two benefits over the Hard-KB method – (1) it helps the agent discover better dialogue policies by providing it more information from the language understanding unit, and (2) it allows end-to-end training of both the dialogue policy and language understanding in an online setting. In this section we describe several agents to test these claims.

Overview

Figure 4.2 shows an overview of the components of the KB-InfoBot. At each turn, the agent receives a natural language utterance $u^t$ as input, and selects an action $a^t$ as output. The action space, denoted by A, consists of M + 1 actions — request(slot=i) for 1 ≤ i ≤ M will ask the user for the value of slot i, and inform(I) will inform the user with an ordered list of results I from the KB. The dialogue ends once the agent chooses inform.

We adopt a modular approach, typical of goal-oriented dialogue systems [145], consisting of: a belief tracker module for identifying user intents, extracting associated slots, and tracking the dialogue state [15, 44, 49, 50, 162]; an interface with the database to query for relevant results (Soft-KB lookup); a summary module to summarize the state into a vector; a dialogue


Figure 4.2: High-level overview of the end-to-end KB-InfoBot. Components with trainable parameters are highlighted in gray.

We assume the agent only responds with dialogue acts. A template-based Natural Language Generator (NLG) can easily be constructed for converting dialogue acts into natural language.

Belief Trackers

The InfoBot consists of $M$ belief trackers, one for each slot, which take the user input and produce two outputs, $p_j^t$ and $q_j^t$, which we shall collectively call the belief state: $p_j^t$ is a multinomial distribution over the slot values $v$, and $q_j^t$ is a scalar probability of the user knowing the value of slot $j$. We describe two versions of the belief tracker.

Hand-Crafted Tracker: We first identify mentions of slot-names (such as "actor") or slot-values (such as "Bill Murray") from the user input $u^t$, using token-level keyword search. Let $\{w \in x\}$ denote the set of tokens in a string $x$,2 then for each slot $1 \le j \le M$ and each value $v \in V^j$, we compute its matching score as follows:

s_j^t[v] = \frac{|\{w \in u^t\} \cap \{w \in v\}|}{|\{w \in v\}|}    (4.4)

A similar score $b_j^t$ is computed for the slot-names. A one-hot vector $\text{req}^t \in \{0, 1\}^M$ denotes the slot previously requested by the agent, if any. $q_j^t$ is set to 0 if $\text{req}^t[j]$ is 1 but $s_j^t[v] = 0 \; \forall v \in V^j$, i.e. the agent requested a slot but did not receive a valid value in return; otherwise it is set to 1.

Starting from a prior distribution $p_j^0$ (based on the counts of the values in the KB), $p_j^t[v]$ is updated as:

p_j^t[v] \propto p_j^{t-1}[v] + C\left(s_j^t[v] + b_j^t + \mathbb{1}(\text{req}^t[j] = 1)\right)    (4.5)

2We use the NLTK tokenizer available at http://www.nltk.org/api/nltk.tokenize.html


Here $C$ is a tuning parameter, and the distribution is normalized by setting the sum over $v$ to 1.
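A small sketch of this hand-crafted update, under the simplifying assumption that utterances and values are given as token lists, is shown below; the helper names are hypothetical.

```python
def match_score(utterance_tokens, value_tokens):
    """Eq. 4.4: fraction of a value's tokens that also appear in the utterance."""
    overlap = set(utterance_tokens) & set(value_tokens)
    return len(overlap) / len(set(value_tokens))

def update_slot_belief(p_prev, scores, b_j, requested, C=1.0):
    """Eq. 4.5: unnormalized update of p_j^t[v], followed by renormalization.

    p_prev:    dict mapping value v -> p_j^{t-1}[v]
    scores:    dict mapping value v -> s_j^t[v] from match_score
    b_j:       slot-name match score b_j^t
    requested: True if the agent requested slot j at the previous turn
    """
    p_new = {v: p_prev[v] + C * (scores.get(v, 0.0) + b_j + float(requested))
             for v in p_prev}
    Z = sum(p_new.values())
    return {v: p / Z for v, p in p_new.items()}
```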

Neural Belief Tracker: For the neural tracker the user input $u^t$ is converted to a vector representation $x^t$ using a bag-of-n-grams (with $n = 2$) representation. Each element of $x^t$ is an integer indicating the count of a particular n-gram in $u^t$. We let $V^n$ denote the number of unique n-grams, hence $x^t \in \mathbb{N}_0^{V^n}$.

Recurrent neural networks have been used for belief tracking [50, 145] since the output distribution at turn $t$ depends on all user inputs up to that turn. We use a Gated Recurrent Unit (GRU) [16] (Eq. 2.1) for each tracker, which, starting from $h_j^0 = 0$, computes $h_j^t = \text{GRU}(x^1, \ldots, x^t)$. $h_j^t \in \mathbb{R}^d$ can be interpreted as a summary of what the user has said about slot $j$ up to turn $t$. The belief states are computed from this vector as follows:

p_j^t = \text{softmax}(W_j^p h_j^t + b_j^p)    (4.6)
q_j^t = \sigma(W_j^{\Phi} h_j^t + b_j^{\Phi})    (4.7)

Here $W_j^p \in \mathbb{R}^{|V^j| \times d}$, $b_j^p \in \mathbb{R}^{|V^j|}$, $W_j^{\Phi} \in \mathbb{R}^d$ and $b_j^{\Phi} \in \mathbb{R}$ are trainable parameters.
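A minimal PyTorch sketch of one such tracker is given below; the module name and dimensions are placeholders, and the bag-of-n-grams featurization is assumed to be computed beforehand.

```python
import torch
import torch.nn as nn

class NeuralBeliefTracker(nn.Module):
    """GRU-based tracker for a single slot (Eqs. 4.6-4.7); sizes are illustrative."""
    def __init__(self, n_grams, hidden, n_values):
        super().__init__()
        self.gru = nn.GRU(n_grams, hidden, batch_first=True)
        self.value_head = nn.Linear(hidden, n_values)   # W_j^p, b_j^p
        self.know_head = nn.Linear(hidden, 1)           # W_j^Phi, b_j^Phi

    def forward(self, x):
        # x: (batch, turns, n_grams) bag-of-n-gram counts x^1 ... x^t
        h, _ = self.gru(x)
        h_t = h[:, -1]                                        # summary of slot j at turn t
        p_t = torch.softmax(self.value_head(h_t), dim=-1)     # p_j^t over V^j
        q_t = torch.sigmoid(self.know_head(h_t)).squeeze(-1)  # q_j^t
        return p_t, q_t
```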

Soft-KB Lookup + Summary

This module uses the Soft-KB lookup described in Section 4.3 to compute the posterior $p_T^t \in \mathbb{R}^N$ over the EC-KB from the belief states ($p_j^t$, $q_j^t$). Collectively, the outputs of the belief trackers and the Soft-KB lookup can be viewed as the current dialogue state internal to the KB-InfoBot. Let $s^t = [p_1^t, p_2^t, \ldots, p_M^t, q_1^t, q_2^t, \ldots, q_M^t, p_T^t]$ be the vector of size $\sum_j |V^j| + M + N$ denoting this state. It is possible for the agent to directly use this state vector to select its next action $a^t$. However, the large size of the state vector would lead to a large number of parameters in the policy network. To improve efficiency we extract summary statistics from the belief states, similar to [150].

Each slot is summarized into an entropy statistic over a distribution $w_j^t$ computed from elements of the KB posterior $p_T^t$ as follows:

w_j^t(v) \propto \sum_{i : T_{i,j} = v} p_T^t(i) + p_j^0(v) \sum_{i : T_{i,j} = \Psi} p_T^t(i).    (4.8)

Here, $p_j^0$ is a prior distribution over the values of slot $j$, estimated using counts of each value in the KB. The probability mass of $v$ in this distribution is the agent's confidence that the user goal has value $v$ in slot $j$. The two terms in (4.8) correspond to rows in the KB which have value $v$, and rows whose value is unknown (weighted by the prior probability that an unknown value might be $v$). The summary statistic for slot $j$ is then the entropy $H(w_j^t)$. The KB posterior $p_T^t$ is also summarized into an entropy statistic $H(p_T^t)$.

The scalar probabilities $q_j^t$ are passed as-is to the dialogue policy, and the final summary vector is $s^t = [H(p_1^t), \ldots, H(p_M^t), q_1^t, \ldots, q_M^t, H(p_T^t)]$. Note that this vector has size $2M + 1$.
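The sketch below shows how the per-slot summaries could be assembled from the KB posterior and slot priors (Eq. 4.8); variable names and the use of None for missing values are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

def summarize_state(table, post, p0, q_know):
    """Build [H(w_1^t), ..., H(w_M^t), q_1^t, ..., q_M^t, H(p_T^t)] from Eq. 4.8.

    table:  N x M table with None for missing values
    post:   KB posterior p_T^t over the N rows
    p0:     list of M dicts, prior p_j^0 over each slot's values
    q_know: list of M scalars q_j^t
    """
    N, M = len(table), len(table[0])
    summary = []
    for j in range(M):
        mass_missing = sum(post[i] for i in range(N) if table[i][j] is None)
        w = {}
        for i in range(N):                    # first term of Eq. 4.8
            v = table[i][j]
            if v is not None:
                w[v] = w.get(v, 0.0) + post[i]
        for v in w:                           # second term of Eq. 4.8
            w[v] += p0[j].get(v, 0.0) * mass_missing
        vals = np.array(list(w.values()))
        summary.append(entropy(vals / vals.sum()))
    return np.array(summary + list(q_know) + [entropy(post)])
```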

Dialogue Policy

The dialogue policy's job is to select the next action based on the current summary state $s^t$ and the dialogue history. We present a hand-crafted baseline and a neural policy network.

Hand-Crafted Policy: The rule-based policy is adapted from [154]. It asks for the slot $j = \arg\min_j H(p_j^t)$ with the minimum entropy, except if: (i) the KB posterior entropy $H(p_T^t) < \alpha_R$, (ii) $H(p_j^t) < \min(\alpha_T, \beta H(p_j^0))$, or (iii) slot $j$ has already been requested $Q$ times. $\alpha_R$, $\alpha_T$, $\beta$, and $Q$ are tuned to maximize reward against the simulator.

Neural Policy Network: For the neural approach, similar to [151, 169], we use an RNN to allow the network to maintain an internal state of the dialogue history. Specifically, we use a GRU unit followed by a fully-connected layer and softmax nonlinearity to model the policy $\pi$ over actions in $A$ ($W^{\pi} \in \mathbb{R}^{|A| \times d}$, $b^{\pi} \in \mathbb{R}^{|A|}$):

h_{\pi}^t = \text{GRU}(s^1, \ldots, s^t)    (4.9)
\pi = \text{softmax}(W^{\pi} h_{\pi}^t + b^{\pi}).    (4.10)

During training, the agent samples its actions from the policy to encourage exploration. If this action is inform(), it must also provide an ordered set of entities, indexed by $I = (i_1, i_2, \ldots, i_R)$, from the KB to the user. This is done by sampling $R$ items from the KB posterior $p_T^t$. This mimics a search-engine setting, where $R$ may be the number of results on the first page.
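A minimal PyTorch sketch of the policy network and the inform-time sampling is shown below; dimensions and names are placeholders. Sampling without replacement in proportion to the remaining mass corresponds to the sequential scheme later written in Eq. 4.15.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """GRU policy over summary states (Eqs. 4.9-4.10); sizes are illustrative."""
    def __init__(self, state_dim, hidden, n_actions):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, states):
        # states: (batch, turns, state_dim) summary vectors s^1 ... s^t
        h, _ = self.gru(states)
        return torch.softmax(self.head(h[:, -1]), dim=-1)   # pi over actions

def sample_results(post, R):
    """Sample R distinct KB rows in proportion to the posterior, without replacement."""
    post = torch.as_tensor(post, dtype=torch.float)
    return torch.multinomial(post, R, replacement=False).tolist()
```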

4.5 Training

Parameters of the neural components (denoted by $\theta$) are trained using the REINFORCE algorithm [152]. We assume that the learner has access to a reward signal $r^t$ throughout the course of the dialogue, details of which are in the next section. We can write the expected discounted return of the agent under policy $\pi$ as follows:

J(\theta) = \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r^t\right]    (4.11)


Here, the expectation is over all possible trajectories $\tau$ of the dialogue, $\theta$ denotes the trainable parameters of the learner, $H$ is the maximum length of an episode, and $\gamma$ is the discounting factor. We also use a baseline reward signal $b$, which is the average of all rewards in a batch, to reduce the variance in the updates [43]. We can use the likelihood ratio trick [41] to write the gradient of the objective as follows:

\nabla_{\theta} J(\theta) = \mathbb{E}\left[\nabla_{\theta} \log p_{\theta}(\tau) \sum_{t=0}^{H} \gamma^t (r^t - b)\right],    (4.12)

where $p_{\theta}(\tau)$ is the probability of observing a particular trajectory under the current policy. With a Markovian assumption, we can write

p_{\theta}(\tau) = p(s_0) \prod_{k=0}^{H} p(s_{k+1} \mid s_k, a_k) \, \pi_{\theta}(a_k \mid s_k),    (4.13)

where $\theta$ denotes dependence on the neural network parameters. From (4.12) and (4.13) we obtain

\nabla_{\theta} J(\theta) = \mathbb{E}_{a \sim \pi}\left[\sum_{k=0}^{H} \nabla_{\theta} \log \pi_{\theta}(a_k) \sum_{t=0}^{H} \gamma^t (r^t - b)\right].    (4.14)

If we need to train both the policy network and the belief trackers using the reinforcement signal, we can view the KB posterior $p_T^t$ as another policy. During training, to encourage exploration, when the agent selects the inform action we sample $R$ results from the following distribution to return to the user:

\mu(I) = p_T^t(i_1) \times \frac{p_T^t(i_2)}{1 - p_T^t(i_1)} \times \cdots    (4.15)

This formulation also leads to a modified version of the episodic REINFORCE update rule [152]. Specifically, Eq. (4.13) now becomes

p_{\theta}(\tau) = \left[p(s_0) \prod_{k=0}^{H} p(s_{k+1} \mid s_k, a_k) \, \pi_{\theta}(a_k \mid s_k)\right] \mu_{\theta}(I).    (4.16)

Notice the last term $\mu_{\theta}$ above, which is the posterior over a set of results from the KB. From (4.12) and (4.16) we obtain

\nabla_{\theta} J(\theta) = \mathbb{E}_{a \sim \pi, I \sim \mu}\left[\left(\nabla_{\theta} \log \mu_{\theta}(I) + \sum_{k=0}^{H} \nabla_{\theta} \log \pi_{\theta}(a_k)\right) \sum_{t=0}^{H} \gamma^t (r^t - b)\right].    (4.17)

In the case of end-to-end learning, we found that for a moderately sized KB the agent almost always fails if starting from random initialization. In this case, credit assignment is difficult for the agent, since it does not know whether the failure is due to an incorrect sequence of actions or an incorrect set of results from the KB. Hence, at the beginning of training we have an Imitation Learning (IL) phase where the belief trackers and policy network are trained to mimic the hand-crafted agents. Assume that $\hat{p}_j^t$ and $\hat{q}_j^t$ are the belief states from a rule-based agent, and $\hat{a}^t$ its action at turn $t$. Then the loss function for imitation learning is:

\mathcal{L}(\theta) = \mathbb{E}\left[D(\hat{p}_j^t \| p_j^t(\theta)) + H(\hat{q}_j^t, q_j^t(\theta)) - \log \pi_{\theta}(\hat{a}^t)\right]

$D(p \| q)$ and $H(p, q)$ denote the KL divergence and cross-entropy between $p$ and $q$ respectively. The expectations are estimated using a mini-batch of dialogues of size $B$. For RL we use RMSProp [54] and for IL we use vanilla SGD updates to train the parameters $\theta$.
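For concreteness, the following sketch gives a surrogate loss whose gradient matches the REINFORCE updates above for a single dialogue; it assumes the per-turn log-probabilities were collected during the rollout and the batch-average baseline is supplied externally. Names and defaults are illustrative.

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=0.0, gamma=0.99):
    """Surrogate loss whose gradient matches Eqs. 4.14 / 4.17 for one dialogue.

    log_probs: list of log pi_theta(a_k) tensors for each turn
               (plus log mu_theta(I) for the end-to-end agent, per Eq. 4.17)
    rewards:   list of per-turn rewards r^t
    baseline:  batch-average reward b, as described in the text
    """
    ret = sum((gamma ** t) * (r - baseline) for t, r in enumerate(rewards))
    # Minimizing this loss ascends the expected discounted return.
    return -ret * torch.stack(log_probs).sum()
```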

4.6 Experiments & Results

Previous work in KB-based QA has focused on single-turn interactions and is not directly comparable to the present study. Instead we compare different versions of the KB-InfoBot described above to test our claims.

4.6.1 Models & Data

KB-InfoBot versions

We have described two belief trackers, (A) Hand-Crafted and (B) Neural, and two dialogue policies, (C) Hand-Crafted and (D) Neural.

Rule agents use the hand-crafted belief trackers and hand-crafted policy (A+C). RL agents use the hand-crafted belief trackers and the neural policy (A+D). We compare three variants of both sets of agents, which differ only in the inputs to the dialogue policy. The No-KB version only takes the entropy $H(p_j^t)$ of each of the slot distributions. The Hard-KB version performs a hard-KB lookup and selects the next action based on the entropy of the slots over retrieved results. This is the same approach as in Wen et al. [145], except that we take the entropy instead of summing probabilities. The Soft-KB version takes the summary statistics of the slots and KB posterior described in Section 4.4. At the end of the dialogue, all versions inform the user with the top results from the KB posterior $p_T^t$; hence the difference lies only in the policy for action selection. Lastly, the E2E agent uses the neural belief tracker and the neural policy (B+D), with a Soft-KB lookup. For the RL agents, we also append $q_j^t$ and a one-hot encoding of the previous agent action to the policy network input.


KB-split   N     M   max_j |V^j|   |M_j|
Small      277   6   17            20%
Medium     428   6   68            20%
Large      857   6   101           20%
X-Large    3523  6   251           20%

Table 4.1: Movies-KB statistics for four splits. Refer to Section 4.3 for a description of the columns.

User Simulator

Training reinforcement learners is challenging because they need an environment to operate in. In the dialogue community it is common to use simulated users for this purpose [1, 22, 104, 105]. In this work we adapt the publicly available user simulator presented in Li et al. [74] to follow a simple agenda while interacting with the KB-InfoBot, as well as to produce natural language utterances. During training, the simulated user also provides a reward signal at the end of each dialogue. The dialogue is a success if the user target is in the top $R = 5$ results returned by the agent, and the reward is computed as $\max(0, 2(1 - (r - 1)/R))$, where $r$ is the actual rank of the target. For a failed dialogue the agent receives a reward of $-1$, and at each turn it receives a reward of $-0.1$ to encourage short sessions.3 The maximum length of a dialogue is 10 turns, beyond which it is deemed a failure.
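A small sketch of this reward scheme, folding the per-turn penalty into a single dialogue-level return, is shown below purely for illustration; the function name and signature are hypothetical.

```python
def dialogue_return(success, target_rank, n_turns, R=5, turn_penalty=-0.1):
    """Total (undiscounted) return for one dialogue under the simulator's reward."""
    final = max(0.0, 2 * (1 - (target_rank - 1) / R)) if success else -1.0
    return final + turn_penalty * n_turns
```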

Movies-KB

We use a movie-centric KB constructed using the IMDBPy4 package. We constructed four different splits of the dataset, with an increasing number of entities, whose statistics are given in Table 4.1. The original KB was modified to reduce the number of actors and directors in order to make the task more challenging.5 We randomly remove 20% of the values from the agent's copy of the KB to simulate a scenario where the KB may be incomplete. The user, however, may still know these values.

3A turn consists of one user action and one agent action.
4http://imdbpy.sourceforge.net/
5We restricted the vocabulary to the first few unique values of these slots and replaced all other values with a random value from this set.


                   Small KB              Medium KB             Large KB              X-Large KB
Agent              T     S    R          T     S    R          T     S    R          T     S    R

No KB    Rule      5.04  .64  .26±.02    5.05  .77  .74±.02    4.93  .78  .82±.02    4.84  .66  .43±.02
         RL        2.65  .56  .24±.02    3.32  .76  .87±.02    3.71  .79  .94±.02    3.64  .64  .50±.02
Hard KB  Rule      5.04  .64  .25±.02    3.66  .73  .75±.02    4.27  .75  .78±.02    4.84  .65  .42±.02
         RL        3.36  .62  .35±.02    3.07  .75  .86±.02    3.53  .79  .98±.02    2.88  .62  .53±.02
Soft KB  Rule      2.12  .57  .32±.02    3.94  .76  .83±.02    3.74  .78  .93±.02    4.51  .66  .51±.02
         RL        2.93  .63  .43±.02    3.37  .80  .98±.02    3.79  .83  1.05±.02   3.65  .68  .62±.02
         E2E       3.13  .66  .48±.02    3.27  .83  1.10±.02   3.51  .83  1.10±.02   3.98  .65  .50±.02
Max                3.44  1.0  1.64       2.96  1.0  1.78       3.26  1.0  1.73       3.97  1.0  1.37

Table 4.2: Performance comparison. Average (±std error) for 5000 runs after choosing the best model during training. T: Average number of turns. S: Success rate. R: Average reward.

4.6.2 Simulated User Evaluation

We compare each of the discussed versions along three metrics: the average reward obtained (R), the success rate (S) (where success is defined as providing the user target among the top R results), and the average number of turns per dialogue (T). For the RL and E2E agents, during training we freeze the model every 100 updates and run 2000 simulations with greedy action selection to evaluate its performance. After training we select the model with the highest average reward, run a further 5000 simulations, and report the performance in Table 4.2. For reference we also show the performance of an agent which receives perfect information about the user target without any errors, and selects actions based on the entropy of the slots (Max). This can be considered an upper bound on the performance of any agent [154].

In each case the Soft-KB versions achieve the highest average reward, which is the metric all agents optimize. In general, the trade-off between minimizing average turns and maximizing success rate can be controlled by changing the reward signal. Note that, except for the E2E version, all versions share the same belief trackers, but by re-asking values of some slots they can arrive at different posteriors $p_T^t$ to inform the results. This shows that having full information about the current state of beliefs over the KB helps the Soft-KB agent discover better policies. Further, reinforcement learning helps discover better policies than the hand-crafted rule-based agents, and we see a higher reward for RL agents compared to Rule ones. This is due to the noisy natural language inputs; with perfect information the rule-based strategy is optimal. Interestingly, the RL-Hard agent has the minimum number of turns in 2 out of the 4 settings, at the cost of a lower success rate and average reward.


Figure 4.3: Performance of KB-InfoBot versions when tested against real users. Left: Success rate, with the number of test dialogues indicated on each bar, and the p-values from a two-sided permutation test. Right: Distribution of the number of turns in each dialogue (differences in mean are significant with p < 0.01).

This agent does not receive any information about the uncertainty in semantic parsing, and it tends to inform as soon as the number of retrieved results becomes small, even if they are incorrect.

Among the Soft-KB agents, we see that E2E > RL > Rule, except for the X-Large KB. For E2E, the action space grows exponentially with the size of the KB, and hence credit assignment gets more difficult. Future work should focus on improving the E2E agent in this setting. The difficulty of a KB split depends on the number of entities it has, as well as the number of unique values for each slot (more unique values make the problem easier). Hence we see that both the "Small" and "X-Large" settings lead to lower reward for the agents, since $\max_j |V^j| / N$ is small for them.

4.6.3 Human Evaluation

We further evaluate the KB-InfoBot versions trained using the simulator against real subjects, recruited from the authors' affiliations. In each session, in a typed interaction, the subject was first presented with a target movie from the "Medium" KB split along with a subset of its associated slot-values from the KB. To simulate the scenario where end-users may not know slot values correctly, the subjects in our evaluation were presented with multiple values for the slots, from which they could choose any one while interacting with the agent. Subjects were asked to initiate the conversation by specifying some of these values, and to respond to the agent's subsequent requests, all in natural language. We test RL-Hard and the three Soft-KB agents in this study, and in each session one of the agents was picked at random for testing. In total, we collected 433 dialogues, around 20 per subject. Figure 4.3 shows a comparison of these agents in terms of success rate and number of turns.

In comparing Hard-KB versus Soft-KB lookup methods, we see that both the Rule-Soft and RL-Soft agents achieve a higher success rate than RL-Hard, while E2E-Soft does comparably. They do so with an increased number of average turns, but achieve a higher average reward as well. Between the RL-Soft and Rule-Soft agents, the success rate is similar; however, the RL agent achieves that rate in a lower number of turns on average. RL-Soft achieves a success rate of 74% in the human evaluation and 80% against the simulated user, indicating minimal overfitting. However, all agents take a higher number of turns against real users as compared to the simulator, due to the noisier inputs.

The E2E agent achieves the highest success rate against the simulator; however, when tested against real users it performs poorly, with a lower success rate and a higher number of turns. Since it has more trainable components, this agent is also the most prone to overfitting. In particular, the vocabulary of the simulator it is trained against is quite limited ($V^n = 3078$), and hence when real users provided inputs outside this vocabulary, it performed poorly.

4.7 Conclusions

This chapter discussed end-to-end trainable dialogue agents for information access. We introduced a differentiable probabilistic framework for querying a database given the agent's beliefs over its fields (or slots). We showed that such a framework allows the downstream reinforcement learner to discover better dialogue policies by providing it more information. We also presented an E2E agent for the task, which demonstrates a strong learning capacity in simulations but suffers from overfitting when tested on real users.

Given these results, we propose the following deployment strategy that allows a dialogue system to be tailored to specific users via learning from agent-user interactions. The system could start off with an RL-Soft agent (which gives good performance out-of-the-box). As the user interacts with this agent, the collected data can be used to train the E2E agent, which has a strong learning capability. Gradually, as more experience is collected, the system can switch from RL-Soft to the personalized E2E agent.


Chapter 5

Semi-Supervised QA (Completed Work)

The methods we have presented so far, and deep learning models in general, hinge on the availability of large annotated datasets. However, large domain-specific annotated datasets are limited and expensive to construct. In this chapter, we envision a system where the end user specifies a set of base documents and only a few labelled examples. We introduce a technique which exploits the structure of these base documents to do semi-supervised training of machine reading models.

5.1 Introduction

Practitioners looking to build QA systems for specific applications may not have the resources to collect tens of thousands of questions on corpora of their choice. At the same time, state-of-the-art machine reading systems do not lend themselves well to low-resource QA settings where the number of labeled question-answer pairs is limited (c.f. Table 5.2). Semi-supervised QA methods like [159] aim to improve this performance by leveraging unlabeled data, which is easier to collect.

Here we present a semi-supervised QA system which requires the end user to specify a set of base documents and only a small set of question-answer pairs over a subset of these documents. Our proposed system consists of three stages. First, we construct cloze-style questions (predicting missing spans of text) from the unlabeled corpus; next, we use the generated clozes to pre-train a powerful neural network model for extractive QA [19]; and finally, we fine-tune the model on the small set of provided QA pairs.

Our cloze construction process builds on a typical writing phenomenon and document structure: an introduction precedes and summarizes the main body of the article. Many large corpora follow such a structure, including Wikipedia, academic papers, and news articles. We hypothesize that we can benefit from the un-annotated corpora to better answer various questions, at least ones that are lexically similar to the content in the base documents and directly require factual information.

5.2 Related Work

Semi-supervised learning augments the labeled dataset L with a potentially larger unlabeled dataset U. Yang et al. [159] presented a model, GDAN, which trained an auxiliary neural network to generate questions from passages by reinforcement learning, and augmented the labeled dataset with the generated questions to train the QA model. Here we use a much simpler heuristic to generate the auxiliary questions, which also turns out to be more effective, as we show superior performance compared to GDAN. Several approaches have been suggested for generating natural questions [116, 119, 126]; however, none of them show a significant improvement from using the generated questions in a semi-supervised setting. Recent papers also use unlabeled data for QA by training large language models and extracting contextual word vectors from them as input to the QA model [79, 93, 102]. The applicability of this method in the low-resource setting is unclear, as the extra inputs increase the number of parameters in the QA model; however, our pretraining can be easily applied to these models as well.

Domain adaptation (and transfer learning) leverages existing large-scale datasets from a source domain (or task) to improve performance on a target domain (or task). For deep learning and QA, a common approach is to pretrain on the source dataset and then fine-tune on the target dataset [18, 42]. Wiese et al. [149] used SQuAD as a source for the target BioASQ dataset, and Kadlec et al. [61] used BookTest [5] as a source for the target SQuAD dataset. Mihaylov et al. [80] transferred learned model layers from the tasks of sequence labeling, text classification and relation classification to show small improvements on SQuAD. All these works use manually curated source datasets, which are themselves expensive to collect. Instead, we show that it is possible to automatically construct the source dataset from the same domain as the target, which turns out to be more beneficial in terms of performance as well (c.f. Section 5.4). Several cloze datasets have been proposed in the literature which use heuristics for construction [51, 52, 89]. We further demonstrate the usability of such a dataset in a semi-supervised setting.


5.3 Methodology

Cloze generation

Most documents typically follow a template: they begin with an introduction that provides an overview and a brief summary of what is to follow. We assume such a structure while constructing our cloze-style questions. When there is no clear demarcation, we treat the first K% (a hyperparameter, in our case 20%) of the document as the introduction. While noisy, this heuristic generates a large number of clozes given any corpus, which we found to be beneficial for semi-supervised learning despite the noise.

We use a standard NLP pipeline based on Stanford CoreNLP1 (for SQuAD, TriviaQA and PubMed) and the BANNER Named Entity Recognizer2 (only for PubMed articles) to identify entities and phrases. Assume that a document comprises introduction sentences $\{q_1, q_2, \ldots, q_n\}$ and the remaining passages $\{p_1, p_2, \ldots, p_m\}$. Additionally, let each sentence $q_i$ in the introduction be composed of words $\{w_1, w_2, \ldots, w_{l_{q_i}}\}$, where $l_{q_i}$ is the length of $q_i$. We consider a match$(q_i, p_j)$ if there is an exact string match of a sequence of words $\{w_k, w_{k+1}, \ldots, w_{l_{q_i}}\}$ between the sentence $q_i$ and passage $p_j$. If this sequence is either a noun phrase, verb phrase, adjective phrase or a named entity in $p_j$, as recognized by CoreNLP or BANNER, we select it as an answer span $A$. Additionally, we use $p_j$ as the passage $P$ and form a cloze question $Q$ from the answer-bearing sentence $q_i$ by replacing $A$ with a placeholder. As a result, we obtain passage-question-answer $(P, Q, A)$ triples (Table 5.1 shows an example). As a post-processing step, we prune out $(P, Q, A)$ triples where the word overlap between the question $Q$ and passage $P$ is less than 2 words (after excluding stop words).
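A simplified sketch of this construction is shown below; it keeps only the longest matching span and omits the CoreNLP/BANNER phrase and named-entity filter, so it approximates rather than reproduces the full pipeline, and the function names are illustrative.

```python
def longest_shared_span(sent_tokens, passage_text):
    """Longest contiguous token sequence from sent_tokens appearing verbatim in passage_text."""
    n = len(sent_tokens)
    for length in range(n, 0, -1):
        for start in range(0, n - length + 1):
            span = " ".join(sent_tokens[start:start + length])
            if f" {span} " in f" {passage_text} ":
                return span
    return None

def generate_clozes(intro_sentences, passages, min_overlap=2, blank="@placeholder"):
    """Blank out the longest span an introduction sentence shares with a body passage
    and emit (P, Q, A) triples; low-overlap pairs are pruned as in the post-processing step."""
    triples = []
    for sent in intro_sentences:
        for passage in passages:
            answer = longest_shared_span(sent.split(), passage)
            if answer is None:
                continue
            question = sent.replace(answer, blank, 1)
            overlap = set(question.split()) & set(passage.split())
            if len(overlap) >= min_overlap:
                triples.append((passage, question, answer))
    return triples
```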

The process relies on the fact that answer candidates from the introduction are likely to be discussed in detail in the remainder of the article. In effect, the cloze question from the introduction and the matching paragraph in the body form a question and context passage pair. We create two cloze datasets, one each from the Wikipedia corpus (for SQuAD and TriviaQA) and PubMed academic papers (for the BioASQ challenge), consisting of 2.2M and 1M clozes respectively. From analyzing the cloze data manually, we were able to answer the cloze 76% of the time for the Wikipedia set and 80% of the time for the PubMed set using the information in the passage. In most cases the cloze paraphrased the information in the passage, which we hypothesized to be a useful signal for the downstream QA task.

1https://stanfordnlp.github.io/CoreNLP/
2http://banner.sourceforge.net


Passage (P): Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and ...restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress.

Question (Q): People with autism tend to be a little aloof with little to no ___.

Answer (A): social interaction

Table 5.1: An example constructed cloze.

We also investigate the utility of forming subsets of the large cloze corpus, where we select the top passage-question-answer triples based on different criteria, such as: i) the Jaccard similarity of the answer-bearing sentence in the introduction and the passage, ii) the tf-idf scores of answer candidates, and iii) the length of answer candidates. However, we empirically find that we were better off using the entire set rather than these subsets.

Pre-training

We make use of the generated cloze dataset to pre-train an expressive neural network designed for the task of reading comprehension. We work with two neural network models: the GA Reader introduced in Chapter 2 and the BiDAF + Self-Attention (SA) model from Clark and Gardner [19] (which is among the best performing models on SQuAD and TriviaQA). After pretraining, the performance of BiDAF+SA on a dev set of the (Wikipedia) cloze questions is 0.58 F1 score and 0.55 Exact Match (EM) score. This implies that the cloze corpus is neither too easy nor too difficult to answer.

Fine Tuning

We fine-tune the pre-trained model from the previous step over a small set of labelled question-answer pairs. As we shall later see, this step is crucial, and it only requires a handful of labelled questions to achieve a significant proportion of the performance typically attained by training on tens of thousands of questions.


5.4 Experiments & Results

5.4.1 Datasets

We apply our system to three datasets from different domains. SQuAD [97] consists of questions whose answers are free-form spans of text from passages in Wikipedia articles. We follow the same setting as in [159], splitting off 10% of the training questions as the test set, and report performance when training on subsets of the remaining data ranging from 1% to 90% of the full set. We also report the performance on the dev set when trained on the full training set (1* in Table 5.2). We use the same hyperparameter settings as in prior work. We compare and study four different settings: 1) the Supervised Learning (SL) setting, which is only trained on the supervised data, 2) the best performing GDAN model from Yang et al. [159], 3) pretraining on a Language Modeling (LM) objective and fine-tuning on the supervised data, and 4) pretraining on the Cloze dataset and fine-tuning on the supervised data. The LM and Cloze methods use exactly the same data for pretraining, but differ in the loss functions used. We report F1 and EM scores on our test set using the official evaluation scripts provided by the authors of the dataset.

TriviaQA [60] comprises over 95K web question-answer-evidence triples. As in SQuAD, the answers are spans of text. Similar to the setting for SQuAD, we create multiple smaller subsets of the entire set. For our semi-supervised QA system, we use the BiDAF+SA model [19], the highest performing publicly available system for TriviaQA. Here again, we compare the supervised learning (SL) setting against pretraining on the Cloze set and fine-tuning on the supervised set. We report F1 and EM scores on the dev set.3

We also test on the BioASQ 5b dataset, which consists of question-answer pairs from PubMed abstracts. We use the publicly available system4 from Wiese et al. [149], and follow exactly the same setup as theirs, focusing only on factoid and list questions. For this setting, there are only 899 questions for training. Since this is already a low-resource problem, we only report results using 5-fold cross-validation on all the available data. We report Mean Reciprocal Rank (MRR) on the factoid questions, and F1 score for the list questions.

3We use a sample of dev questions, which is the default setting for the code by Clark and Gardner [19]. Since our goal is only to compare the models, this is not problematic.
4https://github.com/georgwiese/biomedical-qa


                       0               0.01            0.05            0.1             0.2             0.5             0.9             1
Model      Method      F1     EM       F1     EM       F1     EM       F1     EM       F1     EM       F1     EM       F1     EM       F1     EM

SQuAD
GA         SL           -      -       0.0882 0.0359   0.3517 0.2275   0.4116 0.2752   0.4797 0.3393   0.5705 0.4224   0.6125 0.4684    -      -
GA         GDAN         -      -        -      -        -      -       0.4840 0.3270   0.5394 0.3781   0.5831 0.4267   0.6102 0.4531    -      -
GA         LM           -      -       0.0957 0.0394   0.3141 0.1856   0.3725 0.2365   0.4406 0.2983   0.5111 0.3589   0.5520 0.3964    -      -
GA         Cloze        -      -       0.3090 0.1964   0.4688 0.3385   0.4937 0.3588   0.5575 0.4126   0.6086 0.4679   0.6302 0.4894    -      -
BiDAF+SA   SL           -      -       0.1926 0.1018   0.4764 0.3388   0.5639 0.4258   0.6484 0.5031   0.7044 0.5615   0.7287 0.5874   0.8069 0.7154
BiDAF+SA   Cloze       0.0682 0.032    0.5042 0.3751   0.6324 0.4862   0.6431 0.4995   0.6839 0.5413   0.7151 0.5767   0.7369 0.6005   0.8080 0.7186

TriviaQA
BiDAF+SA   SL           -      -       0.2533 0.1898   0.4215 0.3566   0.4971 0.4318   0.5624 0.5077   0.6867 0.6239   0.7131 0.6617   0.7291 0.6786
BiDAF+SA   Cloze       0.1182 0.0729   0.5521 0.4807   0.6245 0.5614   0.6506 0.5893   0.6849 0.6281   0.7196 0.6607   0.7381 0.6823   0.7461 0.6903

Table 5.2: A holistic view of the performance of our system compared against baseline systems on SQuAD and TriviaQA. Column groups represent different fractions of the training set used for training.

5.4.2 Main Results

Table 5.2 shows a comparison of the discussed settings on both SQuAD and TriviaQA. Without any fine-tuning (column 0) the performance is low, probably because the model never saw a real question, but we see significant gains with Cloze pretraining even with very little labeled data. The BiDAF+SA model exceeds an F1 score of 50% with only 1% of the training data (454 questions for SQuAD, and 746 questions for TriviaQA), and approaches 90% of the best performance with only 10% labeled data. The gains over the SL setting, however, diminish as the size of the labeled set increases, and are small when the full dataset is available.

Method                 Factoid MRR    List F1

SL*                    0.242          0.211
SQuAD pretraining      0.262          0.211
Cloze pretraining      0.328          0.230

Table 5.3: 5-fold cross-validation results on BioASQ Task 5b. *Our SL experiments showed better performance than what was reported in Wiese et al. [149].

Cloze pretraining outperforms the GDAN baseline from Yang et al. [159] using the same SQuAD dataset splits. Additionally, we show improvements in the 90% data case, unlike GDAN. Our approach is also applicable in the extremely low-resource setting of 1% data, which we suspect GDAN might have trouble with, since it uses the labeled data to do reinforcement learning.


Figure 5.1: Left: Regression coefficients, along with std-errors, when predicting the F1 score of the cloze model, the sl model, or the difference of the two, from features computed from SQuAD dev set questions. Right: Descriptions of the features: AL: Answer Length; ALP: Answer Location in Passage; ALSP: Answer Location in Sentence; ARC: Answer Rareness w.r.t. Cloze corpus; ARS: Answer Rareness w.r.t. Squad corpus; ASL: Answer Sentence Length; DL: Document Length; FA: Frequency of Answer in Passage; LOQP: Lexical Overlap Question and Passage; LOQS: Lexical Overlap Question and Answer Sentence; LSQP: Lexical Similarity Question and Passage; LSQS: Lexical Similarity Question and Answer Sentence; PRC: Passage Rareness w.r.t. Cloze corpus; PRS: Passage Rareness w.r.t. Squad corpus; QL: Question Length; QRC: Question Rareness w.r.t. Cloze corpus; QRS: Question Rareness w.r.t. Squad corpus.

Furthermore, we are able to use the same cloze dataset to improve performance on both the SQuAD and TriviaQA datasets. When we use the same unlabeled data to pre-train with a language modeling objective, the performance is worse,5 showing that the bias we introduce by constructing clozes is important.

On the BioASQ dataset (Table 5.3) we again see a significant improvement when pretraining with the cloze questions over the supervised baseline. The improvement is smaller than what we observe with the SQuAD and TriviaQA datasets; we believe this is because questions are generally more difficult in BioASQ. Wiese et al. [149] showed that pretraining on the SQuAD dataset improves the downstream performance on BioASQ. Here, we show a much larger improvement by pretraining on cloze questions constructed in an unsupervised manner from the same domain.

5.4.3 Analysis

Regression Analysis

To understand which types of questions benefit from pre-training, we pre-specified certain features (see Figure 5.1, right) for each of the dev set questions in SQuAD, and then performed linear regression to predict the F1 score for each question from these features.

5Since the GA Reader uses bidirectional RNN layers, when pretraining the LM we had to partially mask the inputs to the intermediate layers to avoid the model being exposed to the labels it is predicting. This results in only a subset of the parameters being pretrained, which is why we believe this baseline performs poorly.


Figure 5.2: Performance gain with pretraining for different subsets of question types. Panels show the gain (y_cloze - y_sl) conditioned on question classes (ABBR, HUM, LOC, ENTY, NUM, DESC) and on "WH" question types (who, what, where, how, when, which, why).

We predict the F1 scores from the cloze pretrained model ($y_{cloze}$), the supervised model ($y_{sl}$), and the difference of the two ($y_{cloze} - y_{sl}$), when using 10% of the labeled data. The coefficients of the fitted model are shown in Figure 5.1 (left) along with their std errors. Positive coefficients indicate that a high value of that feature is predictive of a high F1 score, and a negative coefficient indicates that a small value of that feature is predictive of a high F1 score (or a high difference of F1 scores from the two models in the case of $y_{cloze} - y_{sl}$).

The two strongest effects we observe are that a high lexical overlap between the question and the sentence containing the answer is indicative of a high boost with pretraining, and that a high lexical overlap between the question and the whole passage is indicative of the opposite. This is hardly surprising, since our cloze construction process is biased towards questions which have a similar phrasing to the answer sentences in context. Hence, test questions with a similar property are answered correctly after pretraining, whereas those with a high overlap with the whole passage tend to have lower performance. The pretraining also favors questions with short answers, because the cloze construction process produces short answer spans. Also, passages and questions which consist of tokens infrequent in the SQuAD training corpus receive a large boost after pretraining, since the unlabeled data covers a larger domain.

Performance on question types

Figure 5.2 shows the average gain in F1 score for different types of questions when we pretrain on the clozes, compared to the supervised case. This analysis is done on the 10% split of the SQuAD training set. We consider two classifications of each question: one determined by the first word (usually a wh-word) of the question (Figure 5.2, bottom), and one based on the output of a separate question type classifier6 adapted from [73]. We use the coarse-grained labels, namely Abbreviation (ABBR), Entity (ENTY), Description (DESC), Human (HUM), Location (LOC) and Numeric (NUM), from a logistic regression classification system.

6https://github.com/brmson/question-classification


While there is an improvement across the board, we find that abbreviation questions in particular receive a large boost. Also, "why" questions show the least improvement, which is in line with our expectation, since these usually require reasoning or world knowledge, which cloze questions rarely require.

5.5 Conclusion

We show that pre-training QA models with automatically constructed cloze questions improves the performance of the models significantly, especially when there are few labeled examples. The performance of the model trained only on the cloze questions is poor, validating the need for fine-tuning. Through regression analysis, we find that pretraining helps with questions which ask for factual information located in a specific part of the context. One interesting direction for future work is to explore an active learning setup for this task: specifically, which passages and/or types of questions should we select to annotate, such that there is a maximum performance gain from fine-tuning.


Chapter 6

Scaling up (Proposed Work)

The knowledge representations we have discussed so far have been limited to small contexts, such as a single text document or a small collection of facts from a knowledge graph. This is largely due to the memory restrictions imposed by GPUs, the preferred hardware for training deep neural networks. Real-world applications must instead deal with massive web-scale knowledge sources to extract information. In this chapter, we propose a framework for building and using large-scale distributed KBs from text corpora. In contrast to traditional KBs, we argue that our proposed knowledge representations will provide more coverage, without loss of scalability.

6.1 Towards a Distributed Text KB

We define a distributed text KB as a collection of text units extracted from a corpus, each of which is associated with a dense vector capturing the semantics of both the content of the unit and its context. The text units themselves can range from a single token, to phrases, to sentences and whole documents. To answer queries against such a KB, we propose to embed the query in the same space as the dense text unit vectors, optionally after decomposing it into a sequence of simpler queries, followed by a maximum inner product search (MIPS) to retrieve the answers. Advances in approximate inner product search [59, 114] mean that, even for billion-scale collections, a MIPS operation can be performed in milliseconds.
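As an illustration, the sketch below indexes a set of (randomly generated) text-unit vectors and retrieves the top-scoring units for a query vector using the FAISS library; FAISS is one possible MIPS implementation and is used here only as an example, with an exact inner-product index standing in for the approximate index a billion-scale KB would require.

```python
import numpy as np
import faiss  # library for (approximate) nearest-neighbor / inner-product search

d = 128                                                      # embedding dimension (illustrative)
span_vecs = np.random.rand(100_000, d).astype("float32")     # stand-in for g(s, d_i) over all text units
index = faiss.IndexFlatIP(d)                                 # exact maximum inner product search
index.add(span_vecs)

query_vec = np.random.rand(1, d).astype("float32")           # stand-in for f(q) of one query
scores, ids = index.search(query_vec, 10)                    # top-10 text units by inner product
```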

We contrast our setup with information retrieval in continuous space [39], where the assumption is that the query fully specifies the text unit to be retrieved. In that case the text representations need only encode the content and not the context. In contrast, our goal is to support underspecified information-seeking queries against the KB, which requires encoding both the content and the context of the text units.


The distributed text KB differs from a traditional knowledge graph representation in the following ways. (1) The basic units are text mentions instead of entities. This means that the same entity (e.g. "Barack Obama") may be represented multiple times in the KB. While this prevents conflating different aspects of the same unit, it also adds the challenge of dealing with redundancy of information (§6.4). (2) Information is expressed using contextual representations rather than typed edges representing relations. We believe the former is more expressive in terms of the range of information it can support. (3) MIPS operations are used instead of dedicated query languages, such as SPARQL [108].

An alternative line of work has focused on combining shallow information retrieval approaches as a pre-processing step to scale up deep learning approaches for answering queries [13, 28, 142, 143]. However, a limiting factor of these approaches is the loss of recall when retrieving the context from a corpus given a query [131]. We showed a similar result in Chapter 3, where errors in constructing the question sub-graphs propagated through the pipeline.

Effective implementation of a distributed text KB poses several research challenges. In the remainder of this thesis, we will focus on the following, which are further detailed in this chapter:

1. How can we learn effective contextual representations of the text units? What types of queries can these contextual representations support?

2. How can we exploit redundancies inherent in a text corpus to increase the quality of these contextual representations and reduce the size of the KB?

3. How can we support reasoning mechanisms, such as those required for answering multi-hop questions, against this KB?

6.2 Related Work

Our proposal builds upon the recent work of Seo et al. [111] on phrase-indexed question answering (PIQA). State-of-the-art reading comprehension models rely on sophisticated interactions between the query and the context when extracting answers. For example, in Chapter 2 we showed how the gated attention mechanism for iteratively refining the context representations based on the query led to significant improvements over a model which did not consider the query. These sophisticated interactions, however, mean that during inference the entire model needs to be re-run for every retrieved context document.


PIQA works around this by restricting the interaction between the query and documents to a Maximum Inner Product Search (MIPS) between their representations. This proceeds in two steps: (1) phrase representations for all possible answer spans are computed offline once and stored; (2) the query is embedded into the same space and inference is done using approximate MIPS [59, 114].

Our distributed text KB shares the same pipeline, but is broader in scope since it is intended to replace a traditional KB. First, we would like to identify a "schema" of the types of information the contextual representations encode. We will do so by defining probe tasks over a set of relations. Second, instead of embedding all spans of text, we are interested in identifying the subset which will support all information-seeking queries. So, for example, text units which express the same information should be represented only once in the KB. Third, a key aspect of traditional KBs is that they support rudimentary forms of reasoning over paths of relations. Hence, we are also interested in building similar capabilities for the text KB.

PIQA details

Suppose we are given a query $q = (q^1, \ldots, q^Q)$ and a corpus of documents $C = (d_1, \ldots, d_{|C|})$. Each document is a sequence of tokens $d_i = (d_i^1, \ldots, d_i^{T_i})$. Answers can be arbitrary spans in the text, so let $S(d_i)$ denote the collection of spans we want to consider as possible answers from $d_i$. In the general case, these would be all spans up to a specified length $M$ in $d_i$ (leading to approximately $M T_i$ spans in total). Further, let $S^*$ denote all possible answer spans from all the documents in the corpus. To answer questions, we train two separate representation learners, $f$ and $g$, for encoding the question and each span, respectively, to a fixed-size vector in $\mathbb{R}^p$. Then the answer is given by:

a = \arg\max_{d_i \in C, \, s \in S(d_i)} f(q) \cdot g(s, d_i).    (6.1)

Note that the span encoder $g$ depends on both the span $s$ and the document $d_i$ in which the span occurs. This is because, for QA, the span representation needs to encode information from the context in which the span occurs.

During training, the probability of a span being the answer is:

\Pr(s \mid q, d_i, C) \propto \exp\left(f(q) \cdot g(s, d_i)\right),    (6.2)

where, ideally, the normalization should be over all spans in the corpus. This is not possible for a large corpus, however, and needs to be approximated. Seo et al. [111], for example, assumed there is only one document in the corpus for a given question, and trained $f$ and $g$ to minimize the cross-entropy loss for predicting the correct answer.


In our work, we will relax this assumption. We will still normalize over each document separately, but adopt the approach of Clark and Gardner [19] and use a No-Answer option to train on multiple documents from the corpus. Specifically, we will add a special span $\phi$ to $S(d_i)$ for each document. For documents which do not contain the correct answer, we can minimize the cross-entropy loss of predicting the No-Answer span $\phi$. The No-Answer spans are only used during training and are ignored at test time when searching for the answer using Eq. (6.1).
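A sketch of this per-document training loss, with the No-Answer span appended to the candidate set, is given below; the function signature and the use of a learned No-Answer vector are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def piqa_doc_loss(query_vec, span_vecs, gold, no_answer_vec):
    """Per-document cross-entropy loss for Eq. 6.2 with a No-Answer span phi.

    query_vec:     f(q), shape (p,)
    span_vecs:     g(s, d_i) for all candidate spans in the document, shape (S, p)
    gold:          index of the correct span, or -1 if the document has no answer
    no_answer_vec: learned vector for the phi span, shape (p,)
    """
    cands = torch.cat([span_vecs, no_answer_vec.unsqueeze(0)], dim=0)
    logits = cands @ query_vec                      # inner products f(q) . g(s, d_i)
    target = torch.tensor(gold if gold >= 0 else cands.size(0) - 1)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
```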

6.3 Contextual Representations

The quality of the text KB depends on the quality of the contextual representations of the text units. We start by looking at models which produce contextual representations of short spans of text, similar to PIQA. We introduce an evaluation framework for characterizing the relational information encoded in a given set of representations. We define relational information as that pertaining to the relationships between entities, such as (CMU, located in, Pittsburgh).

6.3.1 Probing Tasks Setup

We assume that each instance in the probing tasks involves a tuple $(d, q, a)$ of document, question and answer respectively. The document and question are sequences of tokens $d^1, \ldots, d^T$ and $q^1, \ldots, q^Q$, respectively. The answer $a$ is a span $(i, j)$ denoting the start and end positions in the document $d$.

We assume access to data where facts are aligned with sentences expressing them. Each fact is a triple of the form $(e_1, R, e_2)$ stating that the relation $R$ holds between the entities $e_1$ and $e_2$. The queries $q$ simply concatenate the subject entity $e_1$ and the relation $R$, separated by a period, into a string. So <CMU, located in, Pittsburgh> becomes the query "CMU . located in", whose answer is "Pittsburgh".

We consider 3 different probing task setups, to check for different kinds of information present in or absent from $g(s, d)$. Each setting consists of positive examples, where queries about a fact must be answered from the sentence mentioning that fact. They differ only in terms of negative examples, where the queries are paired with sentences not mentioning the fact, and the correct answer is No-Answer.

1. Type Identification (TI): Negatives for this probing task pair facts with sentences which mention the same subject entity but relations other than the one in the query. Here, we want to check if the representations include fine-grained type information about the object entities for the relations.

2. Entity Association (EA): Negatives pair facts with sentences which mention the same relation but different entities. Here, we want to check if the representations include information about which entities a mention is collocated with.

3. Relation Extraction (RE): Negatives include both of the above types: facts paired with sentences mentioning the same relation but with different entities, and facts paired with sentences mentioning the same entities but different relations.

Our hypothesis is that type identification and entity association are sub-tasks which a model needs to perform for the overall task of relation extraction. Hence, we would like to see how these vary for different relations.

Data

We leverage the relation extraction data collected by Levy et al. [70]. It consists of WikiData [139] facts aligned using distant supervision to sentences which express those facts in natural language. The sentences are extracted from the Wikipedia article of $e_1$, and constrained to mention both $e_1$ and $e_2$. The dataset includes approximately 30 million such instances across 120 relation types.

We filter out relations for which $e_2$ takes fewer than 15 distinct values, such as gender. We also filter out relations for which the distribution of values taken by $e_2$ is highly skewed and the entropy is less than 0.5 times its maximum possible value. We also ensure that for each relation we have at least 500 instances and at most 1000 instances. In total, this gives a dataset of 54000 instances across 92 relations. We split this along the subject entities into training and dev sets.

6.3.2 Pretrained Language Models

Pre-trained language models, such as ELMo [92] and BERT [30], have recently emerged as highly effective tools for learning contextualized word representations. Analysis by Peters et al. [94] and Tenney et al. [127] suggests that these representations encode rich syntactic information, such as part of speech, about the role of a word in its context. Here, we use the probing setup above to characterize the relational information encoded in these representations about the facts that the words participate in with their context.

So far we have conducted experiments on BERT, which shows state-of-the-art performance on multiple NLP benchmarks. The document and query are pre-processed using WordPiece tokenization [155], with special tokens '[CLS]' and '[SEP]' at their beginning and end, respectively. We allow the answer to be missing from the document, in which case we let the span be (0, 0), which corresponds to the '[CLS]' token.

BERT encodes documents and queries through $l = 1, \ldots, 12$ transformer layers, each of which produces a representation for each input token $d^t$ or $q^t$, which we denote as $h_l^t$ and $o_l^t$, respectively. We combine these using learned weights which sum to 1. For example,

h^t = \sum_{l=1}^{12} \alpha_l h_l^t    (6.3)

The $\alpha_l$s are tuned along with the other parameters of the probing task. Similarly, separate weights are learned to aggregate the query representations $o^t$.

We further pass the query representations through two separate 4-layer transformer networks [137] and keep their outputs from the first position in the sequence (which corresponds to the '[CLS]' token). These two outputs are denoted the query-start representation $o^{st}$ and the query-end representation $o^{en}$, respectively, since they are used to extract the start and end positions of the answer from the document. The parameters of the 4-layer transformer networks are trained on the probing task, whereas the BERT layers are kept fixed.

The probabilities for the start and end positions of the answer are modeled separately, similar to the BERT model for SQuAD [30]:

P(i \mid d, q) \propto \exp\left(h^i \cdot o^{st}\right), \quad P(j \mid d, q) \propto \exp\left(h^j \cdot o^{en}\right)    (6.4)

During training, both of these likelihoods are maximized separately using the cross-entropy loss.
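A minimal sketch of the probing head, combining the learned layer mixing of Eq. 6.3 with the inner-product scoring of Eq. 6.4, is shown below; the class name and interface are illustrative, and the frozen BERT hidden states are assumed to be precomputed.

```python
import torch
import torch.nn as nn

class LayerMixProbe(nn.Module):
    """Learned scalar mix over BERT layers (Eq. 6.3) plus inner-product
    start/end scoring (Eq. 6.4); the BERT encoder itself stays frozen."""
    def __init__(self, n_layers=12):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))

    def mix(self, layer_states):
        # layer_states: (n_layers, seq_len, dim) hidden states h_l^t for one document
        alpha = torch.softmax(self.layer_logits, dim=0)       # weights summing to 1
        return (alpha[:, None, None] * layer_states).sum(0)   # mixed h^t

    def span_scores(self, doc_states, q_start, q_end):
        # doc_states: (seq_len, dim); q_start / q_end: query outputs o^st, o^en of shape (dim,)
        start_logits = doc_states @ q_start                   # Eq. 6.4, start position
        end_logits = doc_states @ q_end                       # Eq. 6.4, end position
        return start_logits, end_logits
```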

Preliminary Results

Table 6.1 shows the micro-averaged F1 scores on the dev set for each of the probing tasks, both when BERT is kept fixed to its pre-trained version and when it is fine-tuned on the training set of the probing task. When fine-tuning BERT, we simply take the output of the last layer as the contextual representation of tokens. For each case, we train the parameters of the query-start and query-end representation layers on the training set of each probing task, and report these results in separate columns.

Probing   Pre-trained BERT    Fine-tuned BERT
task      EA    TI    RE      EA    TI    RE
EA        71    61    70      88    63    88
TI        73    85    77      70    88    86
RE        66    59    69      69    62    85

Table 6.1: F1 score on each probing task from §6.3.1 (rows) for several models trained differently (columns). Columns 2-4 show results when the contextual representations are fixed to the pre-trained BERT. Columns 5-7 show results when BERT is fine-tuned on the probing task data.

Clearly, the entity association sub-task is harder than the type identification sub-task. This supports earlier observations from Peters et al. [94] and Tenney et al. [127] that pre-trained LM representations encode rich local syntactic information, but not much long-distance semantic information. The F1-score on the full task of relation extraction, which is what we ultimately care about, is only 69% for the pre-trained BERT model, but increases to 85% when it is fine-tuned. This suggests that while the BERT architecture is expressive enough to produce rich relational representations, the pre-training objective does not achieve this by default.

The choice of negatives during training (columns 5-7) plays an important role in the final performance on the probing tasks. The natural categorization of the queries by entities and relations in these experiments gave us full control over the negatives used in training and testing. In general, however, this will not be true for more natural datasets of question and answer pairs. For such cases we would still like to train contextual representation models, and again the choice of negatives will be an important one. We propose a method for doing so in the next section.

Table 6.2 shows a relation-wise breakdown of the performance of the pre-trained BERT on the RE task. Interestingly, relations which only permit a fine-grained entity type as their objects, such as vessel class, astronomical bodies, sports and languages, have the highest performance. At the other end, relations which require person or company names as their objects show the worst performance, which suggests that entity types which appear in diverse contexts have poor contextual representations for relation extraction. We plan to investigate this further in the future.


Relation (# instances)  P (%)  R (%)  F (%)  |  Relation (# instances)  P (%)  R (%)  F (%)
chromosome (44)  44  24  31  |  replaced by (45)  71  67  69
based on (94)  47  36  41  |  architect (136)  69  71  70
cast member (251)  42  40  41  |  creator (218)  79  64  71
parent company (153)  52  37  43  |  inception (287)  70  72  71
child (238)  49  43  46  |  place of death (278)  68  74  71
brother (256)  51  43  47  |  designer (98)  74  69  71
production company (225)  56  43  48  |  discoverer or inventor (145)  64  81  72
screenwriter (282)  63  44  52  |  drafted by (97)  58  94  72
programming language (134)  52  54  53  |  illustrator (72)  68  76  72
film editor (71)  62  52  56  |  location of formation (84)  81  65  72
developer (262)  64  52  58  |  service entry (55)  69  77  73
located in the territory (243)  57  58  58  |  league (147)  71  74  73
headquarters location (277)  55  61  58  |  material used (166)  71  75  73
licensed to broadcast to (183)  50  68  58  |  instrument (224)  67  80  73
record label (269)  62  54  58  |  position held (307)  71  75  73
occupant (146)  62  54  58  |  crosses (110)  74  73  73
occupation (301)  51  71  59  |  cause of death (259)  67  82  74
head of government (45)  56  64  60  |  present in work (203)  73  76  75
residence (171)  58  63  60  |  date of birth (283)  72  77  75
original network (186)  61  59  60  |  award received (305)  71  78  75
mouth of the watercourse (190)  56  65  60  |  country of origin (251)  71  80  75
educated at (259)  63  58  60  |  date of death (308)  75  76  75
distributor (201)  64  57  61  |  medical condition (39)  64  93  76
spouse (297)  65  57  61  |  manufacturer (198)  76  78  77
narrative location (207)  61  61  61  |  time of discovery (33)  67  93  78
home venue (38)  61  61  61  |  site of astronomical discovery (46)  72  84  78
employer (286)  64  59  61  |  product (33)  67  94  78
performer (248)  62  61  61  |  noble family (86)  76  80  78
founder (278)  61  62  62  |  time of spacecraft launch (88)  72  86  78
member of sports team (262)  58  66  62  |  from fictional universe (125)  78  78  78
lyrics by (218)  64  62  63  |  member of political party (262)  73  85  78
located next to body of water (85)  61  67  64  |  language of work or name (198)  77  81  79
place of burial (149)  63  66  64  |  participant of (307)  74  85  79
author (279)  68  61  65  |  instrumentation (37)  70  91  79
named after (271)  66  64  65  |  date of official opening (10)  80  80  80
father (262)  67  64  66  |  conflict (248)  83  82  82
mother (289)  65  67  66  |  point in time (180)  78  89  83
dissolved or abolished (226)  57  78  66  |  country of citizenship (291)  81  86  84
place of birth (295)  62  70  66  |  end time (187)  78  91  84
series (237)  61  74  67  |  constellation (192)  81  91  86
connecting line (201)  73  62  67  |  country (237)  84  89  86
publication date (269)  64  71  67  |  sport (250)  82  92  87
airline hub (112)  62  74  68  |  languages spoken or written (301)  86  90  88
publisher (290)  68  68  68  |  start time (123)  84  93  88
parent taxon (258)  67  68  68  |  located on astronomical body (106)  86  94  90
director (303)  70  67  68  |  vessel class (157)  90  98  94

Table 6.2: Relation-wise breakdown of precision (P), recall (R), and F-score (F) of the pre-trained BERT model on the RE probing task. Rows are sorted based on F-score.


6.3.3 Hard Negative Mining

Next we look at the task of learning the contextual representations of our text KB from a set of labeled question and answer pairs. While the positive instances can be obtained using distant supervision, finding hard negatives to train the model is challenging when no natural grouping among questions exists. Random sampling, or using information retrieval, is not guaranteed to produce such hard negatives. Further, any set of negatives we include may have implicit biases which separate them from the positive instances; the model might learn to exploit these biases to distinguish the two, even though they may not hold generally throughout the corpus.

Hence, we propose to use an iterative procedure to sample hard negatives and fine-tune the contextual representation learner in stages.

Algorithm

Algorithm 1 shows our proposed procedure for mining hard negatives. The FindPositive function uses distant supervision to link each (q, a) pair to a document containing the answer. The FindNegative function fetches the initial negative instance, i.e. a document not containing the answer, for each question. This can be a random sample from the corpus, optionally with heuristics such as high overlap with the question text. After training an initial model on these positive and negative instances, in each iteration new negatives are fetched by looking for questions which were answered incorrectly. This is done by the Answer function, which implements Eq. 6.1. Importantly, answer selection is done using MIPS, which allows this function to scale efficiently to large corpora and consequently makes it easy to search for hard negatives.
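As an illustration of the MIPS step inside the Answer function, here is a minimal sketch using a flat inner-product index from FAISS [59]; the vectors, sizes, and the way negatives are read off the results are assumptions for the example, not the actual system.

```python
import numpy as np
import faiss  # library for (approximate) maximum inner product search, see [59]

# Hypothetical inputs: contextual representations of every candidate answer span in
# the corpus, and query-start representations for a batch of questions.
span_vectors = np.random.rand(100_000, 768).astype("float32")
query_vectors = np.random.rand(32, 768).astype("float32")

index = faiss.IndexFlatIP(span_vectors.shape[1])   # exact inner-product search
index.add(span_vectors)

scores, span_ids = index.search(query_vectors, 5)  # top-5 spans per question
# A question whose top-ranked span does not match the gold answer contributes the
# document containing that span as a hard negative in the next training round.
```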

Preliminary Results

We reuse the data from the probing tasks in the previous section, but consider a different test setting. For each query at test time, the answer must now be extracted from all sentences in the corresponding Wikipedia article of the subject entity, plus all sentences from b randomly chosen distractor Wikipedia articles. We report results for each of the fine-tuned BERT versions from Table 6.1, plus a model which is further fine-tuned by mining negatives (for k = 1 iteration) starting from the TI model. We chose the TI model for this experiment since it only requires random negatives from the same article as the subject entity, which mimics the kind of negatives we would expect from a retrieval pipeline. We denote this model as TI-1−.


Algorithm 1: Iterative procedure for mining hard negatives.

Input : QA pairs T = {(q_n, a_n)}_{n=1}^{N}, corpus C, hyperparameter K
Output: Trained model M_K

// fetch positive instances
T_pos ← ∅
for n ← 1 to N do
    d ← FindPositive(q_n, a_n)
    T_pos ← T_pos ∪ {(q_n, a_n, d)}
end

// fetch first round of negative instances
T_neg^0 ← ∅
for n ← 1 to N do
    d ← FindNegative(q_n, a_n)
    T_neg^0 ← T_neg^0 ∪ {(q_n, No-Answer, d)}
end

// train initial model
M_0 ← Train(T_pos ∪ T_neg^0)

// iteratively fetch harder negatives
for k ← 1 to K do
    T_neg^k ← ∅
    for n ← 1 to N do
        // fetch document and answer from current model
        a, d ← Answer(q_n, M_{k-1}, C)
        if a ≠ a_n then
            T_neg^k ← T_neg^k ∪ {(q_n, No-Answer, d)}
        end
    end
    // train new model
    M_k ← Train(T_pos ∪ T_neg^k)
end
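For readers who prefer code, the following is a compact Python rendering of Algorithm 1; the four callables stand for the components described above (distant supervision, initial negatives, answering via MIPS as in Eq. 6.1, and model training), so their implementations are assumed.

```python
def mine_hard_negatives(qa_pairs, corpus, K, find_positive, find_negative, answer, train):
    """Iterative hard negative mining, following Algorithm 1."""
    NO_ANSWER = None
    positives = [(q, a, find_positive(q, a)) for q, a in qa_pairs]
    negatives = [(q, NO_ANSWER, find_negative(q, a)) for q, a in qa_pairs]
    model = train(positives + negatives)                # initial model M_0
    for _ in range(K):
        negatives = []
        for q, a in qa_pairs:
            pred, doc = answer(q, model, corpus)        # MIPS over the whole corpus
            if pred != a:                               # incorrectly answered question
                negatives.append((q, NO_ANSWER, doc))   # its document becomes a hard negative
        model = train(positives + negatives)            # model M_k
    return model
```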


b    EA      RE      TI      TI-1−
0    0.796   0.834   0.822   0.831
1    0.792   0.831   0.730   0.826
2    0.791   0.830   0.658   0.824
3    0.790   0.826   0.605   0.822
4    0.790   0.825   0.561   0.819
5    0.789   0.824   0.523   0.816

Table 6.3: Accuracy of extracting the tail entity for a query of the form (e1, R, ?) from all sentences of the Wikipedia article of e1, plus sentences from b distractor articles. EA, RE, TI are the fine-tuned BERT models from Table 6.1. TI-1− is the TI model further fine-tuned with k = 1 round of hard negatives from Algorithm 1.

The EA and RE models show little degradation up to b = 5. These are trained on carefully constructed negatives which include sentences mentioning the same relation but different entities. Among these, RE performs better than EA since it is also trained on negatives which include sentences mentioning the same entities but different relations. The TI model, on the other hand, is only trained on negatives of the second kind, and its performance degrades sharply as more distractor articles are added. However, mining negatives using this trained model (TI-1−) significantly improves the performance.

Given these promising initial results, we want to use the same procedure to build an end-to-end system for slot-filling. We also want to try our negative-mining scheme for answering natural language open-domain questions against a corpus.

6.4 Exploiting Redundancies

Consider the following two sentences on Wikipedia:

The inauguration of Barack Obama took place on January 20, 2009.
Obama was inaugurated on January 20, 2009.

The span “January 20, 2009” has the same information in its context in both cases – that it was the date when Obama was inaugurated. Similar redundancies exist all across Wikipedia; the same information is typically mentioned multiple times within an article and across articles. Methods which extract facts from text to populate knowledge bases [21] exploit such redundancies to increase the precision of extracted information. A naive implementation of the distributed text KB would store these spans separately, but hopefully with very similar representations. To increase the confidence of facts which are mentioned multiple times in the corpus, we would like to explore aggregation schemes over their representations. This will also reduce the size of the KB, which is important for reducing the time required to answer queries using MIPS.

Specifically, suppose we are given a text unit s which is mentioned M times in the corpus. Each of these mentions will have a different contextual representation h1, ..., hM. When M is very large, we conjecture that we can group the mentions into k clusters (k < M), with representations v1, ..., vk, which can support most of the queries whose answer is s. For example, when s is a date, the number of clusters might correspond to the number of notable events that took place on that date.

A fundamental challenge with unsupervised clustering, however, is deciding the right number of clusters k. In this setting, k will further depend on the text unit s, since the corpus may contain more or less information about each s. We plan to explore methods which use the distribution of the original representations to determine the number of clusters, such as hierarchical clustering or techniques for choosing k in k-means clustering [78, 134]. Another possibility is to use a labeled dataset of question and answer pairs to determine the number of clusters needed for different types of answers, such as locations and dates.
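A minimal sketch of one such option, choosing k for k-means by the silhouette criterion over the mention representations, is shown below; the use of scikit-learn and the function names are illustrative assumptions, not part of the proposal.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def aggregate_mentions(mention_vectors, max_k=10):
    """Cluster the M contextual representations of a text unit and return one
    aggregated vector per cluster, picking k by the silhouette criterion."""
    best_k, best_score = 1, -1.0
    best_labels = np.zeros(len(mention_vectors), dtype=int)
    for k in range(2, min(max_k, len(mention_vectors) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(mention_vectors)
        score = silhouette_score(mention_vectors, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # one representative vector v_1, ..., v_k: the mean of each cluster's mentions
    return np.stack([mention_vectors[best_labels == c].mean(axis=0) for c in range(best_k)])

# Hypothetical usage: vectors for all mentions of the span "January 20, 2009"
# mentions = np.random.rand(500, 768); reps = aggregate_mentions(mentions)
```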

6.5 Multi-Hop Queries

A key strength of traditional KBs is that they provide a natural interface for answering multi-hop queries, which involve operating over paths of relations and sets of entities. A rich line of work has focused on semantic parsing [99] for such KBs, which decomposes complex questions into a program of simple operations whose execution against the KB produces the answer.

Any proposal for a new type of KB would be incomplete without considering how it can support complex questions. Hence, in the last part of this thesis, we will develop models for answering such questions against the distributed text KB. In particular, we will focus on two types of complex questions. Conjunctive questions look for the intersection of two sets of answers, for example, “Which movie stars Lady Gaga and was directed by Bradley Cooper?”. Here, we need to find the answer common to two sub-questions. Compositional questions look for the answer at the end of a path of relations, for example, “Which ocean does the river that flows through Pittsburgh flow into?”. Here the answer to the first sub-question is an input to the second sub-question.

For answering these questions we propose to explicitly generate two question representations q1 and q2. We will first retrieve possible answers for q1 using MIPS over the distributed KB. For conjunctive queries, we will then run another inner product search over only these possible answers using q2. For compositional queries, in a manner similar to beam search, each of the possible answers for q1 will combine with q2 to form a new query whose answers are retrieved using MIPS. For training we will use question and answer pairs and, if available, answers to the intermediate queries q1 and q2. We will focus on datasets which explicitly require multi-hop reasoning, such as ComplexWebQuestions [125] and HotpotQA [160].
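A minimal numpy sketch of this two-stage retrieval, shown for a compositional query, is given below; the encoder producing q1 and q2, the span index, and in particular the way a candidate answer is combined with q2 (a simple sum here) are placeholders, not the proposed model.

```python
import numpy as np

def mips(query, keys, top_k=5):
    """Exact maximum inner product search over a small matrix of span vectors."""
    scores = keys @ query
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

# Hypothetical inputs: contextual representations of candidate answer spans, and the
# two generated question representations.
span_vectors = np.random.rand(10_000, 768)
q1, q2 = np.random.rand(768), np.random.rand(768)

# Hop 1: retrieve candidate answers for q1 (e.g. "the river that flows through Pittsburgh").
hop1_ids, hop1_scores = mips(q1, span_vectors)

# Hop 2 (compositional): combine each hop-1 candidate with q2 to form a new query,
# in a manner similar to beam search; the combination below is only a placeholder.
beam = [(i, mips(q2 + span_vectors[i], span_vectors)) for i in hop1_ids]

# Conjunctive queries would instead re-score only the hop-1 candidates with q2:
conj_scores = span_vectors[hop1_ids] @ q2
```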

Our proposed model resembles the setting of semantic parsing from denotations [85]. However, instead of learning a symbolic representation of the intermediate queries, we will learn to produce distributed vector representations which match the contextual representations of the answers in the KB. Our hope is that we will be able to support a much broader range of queries in this manner, due to the larger range of information that the distributed KB can hold.


Bibliography

[1] Layla El Asri, Jing He, and Kaheer Suleman. A sequence-to-sequence model for user simulation in spoken dialogue systems. arXiv preprint arXiv:1607.00070, 2016. 4.6.1

[2] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016. 3.2

[3] Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In Karl Aberer, Key-Sun Choi, Natasha Noy, Dean Allemang, Kyung-Il Lee, Lyndon Nixon, Jennifer Golbeck, Peter Mika, Diana Maynard, Riichiro Mizoguchi, Guus Schreiber, and Philippe Cudre-Mauroux, editors, The Semantic Web, pages 722–735, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-76298-0. 3, 3.1

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. 2.1, 2.3

[5] Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. Embracing data abundance: Booktest dataset for reading comprehension. arXiv preprint arXiv:1610.00956, 2016. 2.2, 2.5.1, 5.2

[6] Petr Baudis. Yodaqa: a modular question answering system pipeline. In POSTER 2015 - 19th International Student Conference on Electrical Engineering, pages 1156–1165, 2015. 3.1

[7] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. 1.1

[8] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, 2013. 1, 3

[9] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The semantic web. Scientific American, 284(5):28–37, 2001. 1

[10] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008. 1, 3, 3.1

[11] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683, 2016. 4.2

[12] Danqi Chen, Jason Bolton, and Christopher D Manning. A thorough examination of the cnn/daily mail reading comprehension task. ACL, 2016. 2.1, 2.2, 2.3

[13] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL), 2017. 3.1, 3.3, 3.5.4, 3.3, 6.1

[14] Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. Improved neural machine translation with a syntax-aware encoder and decoder. ACL, 2017. 2.2

[15] Yun-Nung Chen, Dilek Hakkani-Tur, Gokhan Tur, Jianfeng Gao, and Li Deng. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association, 2016. 4.4

[16] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. ACL, 2015. 2.1, 2.2, 2.3, 4.4

[17] Zewei Chu, Hai Wang, Kevin Gimpel, and David McAllester. Broad context language modeling as reading comprehension. EACL, 2017. 2.5.4, 2.5.4

[18] Yu-An Chung, Hung-Yi Lee, and James Glass. Supervised and unsupervised transfer learning for question answering. arXiv preprint arXiv:1711.05345, 2017. 5.2

[19] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, 2018. 5.1, 5.3, 5.4.1, 3, 6.2

[20] Kevin Clark and Christopher D Manning. Entity-centric coreference resolution with model stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1405–1415, 2015. 2.5.4

[21] Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, and Sean Slattery. Learning to construct knowledge bases from the world wide web. Artificial intelligence, 118(1-2):69–113, 2000. 6.4

[22] Heriberto Cuayahuitl, Steve Renals, Oliver Lemon, and Hiroshi Shimodaira. Human-computer dialogue simulation using hidden markov models. In Automatic Speech Recognition and Understanding, 2005 IEEE Workshop on, pages 290–295. IEEE, 2005. 4.6.1

[23] Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. ACL, 2017. 2.2

[24] Michał Daniluk, Tim Rocktaschel, Johannes Welbl, and Sebastian Riedel. Frustratingly short attention spans in neural language modeling. ICLR, 2017. 2.1

[25] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv preprint arXiv:1711.05851, 2017. 3.3

[26] Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over entities, relations, and text using recurrent neural networks. EACL, 2017. 3.2

[27] Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. Question answering on knowledge bases and text using universal schema and memory networks. ACL, 2017. 3.1, 3.2, 3.5.2

[28] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. Multi-step retriever-reader interaction for scalable open-domain question answering. ICLR, 2019. 6.1

[29] Randall Davis, Howard Shrobe, and Peter Szolovits. What is a knowledge representation? AI magazine, 14(1):17, 1993. 1

[30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 1.3, 6.3.2, 6.3.2

[31] Greg Durrett and Dan Klein. Easy victories and uphill battles in coreference resolution. In EMNLP, pages 1971–1982, 2013. 2.1, 2.5.3

[32] Chris Dyer. Should neural network architecture reflect linguistic structure? CoNLL Keynote, 2017. URL http://www.conll.org/keynotes-2017. 2.1

[33] Manaal Faruqui and Dipanjan Das. Identifying well-formed natural language questions. In Proc. of EMNLP, 2018. 1

[34] Paolo Ferragina and Ugo Scaiella. Fast and accurate annotation of short texts with wikipedia pages. IEEE software, 29(1):70–75, 2012. 1

[35] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79, 2010. 3.1, 3.1

[36] Charles J Fillmore. Frame semantics and the nature of language. Annals of the New York Academy of Sciences, 280(1):20–32, 1976. 1

[37] Matt Gardner and Jayant Krishnamurthy. Open-vocabulary semantic parsing with both distributional statistics and formal knowledge. In AAAI, pages 3195–3201, 2017. 3.2

[38] M Gasic, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. On-line policy optimisation of bayesian spoken dialogue systems via human interaction. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8367–8371. IEEE, 2013. 4.1

[39] Daniel Gillick, Alessandro Presta, and Gaurav Singh Tomar. End-to-end retrieval in continuous space. arXiv preprint arXiv:1811.08008, 2018. 6.1

[40] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. ICML, 2017. 3.2

[41] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990. 4.5

[42] David Golub, Po-Sen Huang, Xiaodong He, and Li Deng. Two-stage synthesis networks for transfer learning in machine comprehension. arXiv preprint arXiv:1706.09789, 2017. 5.2

[43] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004. 4.5

[44] Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association, 2016. 4.4

[45] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. CoRR, abs/1706.02216, 2017. URL http://arxiv.org/abs/1706.02216. 3.2

[46] Xu Han, Zhiyuan Liu, and Maosong Sun. Joint representation learning of text and knowledge for knowledge graph completion. arXiv preprint arXiv:1611.04125, 2016. 3.2

[47] Taher H Haveliwala. Topic-sensitive pagerank. In Proceedings of the 11th international conference on World Wide Web, pages 517–526. ACM, 2002. 3.1, 3.3, 3.4

[48] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. arXiv preprint arXiv:1612.03969, 2016. 2.2, 2.4, 2.5.2, 2.5.2

[49] Matthew Henderson. Machine learning for dialog state tracking: A review. Machine Learning in Spoken Language Processing Workshop, 2015. 4.4

[50] Matthew Henderson, Blaise Thomson, and Steve Young. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292–299, 2014. 4.4, 4.4

[51] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692, 2015. 2.1, 2.2, 2.3, 2.5.1, 5.2

[52] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children's books with explicit memory representations. ICLR, 2016. 2.1, 2.2, 2.5.1, 5.2

[53] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012. 1

[54] Geoffrey Hinton, N Srivastava, and Kevin Swersky. Lecture 6a: overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, [Online], 2012. 4.5

[55] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. 2.3

[56] Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1821–1831, 2017. 3.1

[57] Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A Smith. Dynamic entity representations in neural language models. EMNLP, 2017. 2.2

[58] Jiatao Jiang, Zhen Cui, Chunyan Xu, Chengzheng Li, and Jian Yang. Walk-steered convolution for graph classification. arXiv preprint arXiv:1804.05837, 2018. 3.2

[59] Jeff Johnson, Matthijs Douze, and Herve Jegou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017. 1.3, 6.1, 6.2

[60] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association for Computational Linguistics. 5.4.1

[61] Rudolf Kadlec, Ondrej Bajgar, Peter Hrincar, and Jan Kleindienst. Finding a jack-of-all-trades: An examination of semi-supervised learning in reading comprehension. 2016. 5.2

[62] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. ACL, 2016. 2.1, 2.2, 2.3, 2.3

[63] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 3.1, 3.2, 3.4

[64] Ryan Kiros, Richard Zemel, and Ruslan R Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. In Advances in Neural Information Processing Systems, pages 2348–2356, 2014. 2.3

[65] Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. Dynamic entity representations with max-pooling improves machine reading. In NAACL-HLT, 2016. 2.2, 2.2

[66] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 1

[67] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. ICML, 2016. 2.3

[68] Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W Cohen. Reading the web with learned syntactic-semantic inference rules. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1017–1026. Association for Computational Linguistics, 2012. 3.2

[69] Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference resolution. EMNLP, 2017. 2.1

[70] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, 2017. 6.3.1

[71] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. EMNLP, 2016. 4.1

[72] Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv preprint arXiv:1607.06275, 2016. 2.3

[73] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics - Volume 1, pages 1–7. Association for Computational Linguistics, 2002. 5.4.3

[74] Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688, 2016. 4.1, 4.6.1

[75] Xuijun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008, 2017. 4.1, 4.2

[76] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016. 3.2

[77] Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. ACL, 2017. 1, 3.1, 3.5.1, 3.3, 3.5.4

[78] Yufeng Liu, David Neil Hayes, Andrew Nobel, and James Stephen Marron. Statistical significance of clustering for high-dimension, low-sample size data. Journal of the American Statistical Association, 103(483):1281–1293, 2008. 6.4

[79] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. arXiv preprint arXiv:1708.00107, 2017. 5.2

[80] Todor Mihaylov, Zornitsa Kozareva, and Anette Frank. Neural skill transfer from supervised language tasks to reading comprehension. arXiv preprint arXiv:1711.03754, 2017. 5.2

[81] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013. 1

[82] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016. 2.4, 3.1, 3.2, 3.5.1, 3.5.2, 3.3

[83] Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 777–782, 2013. 1, 3.1

[84] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009. 1.1

[85] Dipendra Misra, Ming-Wei Chang, Xiaodong He, and Wen-tau Yih. Policy shaping and generalized update equations for semantic parsing from denotations. 2018. 6.5

[86] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In ACL, pages 236–244, 2008. 2.3

[87] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014. 2.3

[88] Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. EACL, 2017. 2.2

[89] Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who did what: A large-scale person-centered cloze dataset. EMNLP, 2016. 2.1, 2.5.1, 5.2

[90] Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The lambada dataset: Word prediction requiring a broad discourse context. ACL, 2016. 2.5.4

[91] Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115, 2017. 2.2

[92] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, 2018. 6.3.2

[93] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018. 5.2

[94] Matthew E. Peters, Mark Neumann, Luke S. Zettlemoyer, and Wen-tau Yih. Dissecting contextual word embeddings: Architecture and representation. In EMNLP, 2018. 6.3.2, 6.3.2

[95] Feng Qian, Lei Sha, Baobao Chang, Lu-chen Liu, and Ming Zhang. Syntax aware lstm model for chinese semantic role labeling. arXiv preprint arXiv:1704.00405, 2017. 2.2

[96] Martin Raison, Pierre-Emmanuel Mazare, Rajarshi Das, and Antoine Bordes. Weaver: Deep co-encoding of questions and documents for machine reading. arXiv preprint arXiv:1804.10490, 2018. 3.1

[97] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. EMNLP, 2016. 5.4.1

[98] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for disambiguation to wikipedia. In ACL, 2011. URL http://cogcomp.org/papers/RRDA11.pdf. 1

[99] Siva Reddy, Oscar Tackstrom, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics, 4:127–140, 2016. 6.5

[100] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84, 2013. 3.2

[101] Pum-Mo Ryu, Myung-Gil Jang, and Hyun-Ki Kim. Open domain question answering using wikipedia-based knowledge model. Information Processing and Management, 50(5):683–692, 2014. ISSN 0306-4573. doi: https://doi.org/10.1016/j.ipm.2014.04.007. URL http://www.sciencedirect.com/science/article/pii/S0306457314000351. 3.2

[102] Shimi Salant and Jonathan Berant. Contextualized word representations for reading comprehension. arXiv preprint arXiv:1712.03609, 2017. 5.2

[103] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. 3.2

[104] Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. Agenda-based user simulation for bootstrapping a pomdp dialogue system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 149–152. Association for Computational Linguistics, 2007. 4.6.1

[105] Jost Schatzmann, Blaise Thomson, and Steve Young. Statistical user simulation with a hidden agenda. Proc SIGDial, Antwerp, 273282(9), 2007. 4.1, 4.6.1

[106] Konrad Scheffler and Steve Young. Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning. In Proceedings of the second international conference on Human Language Technology Research, pages 12–19. Morgan Kaufmann Publishers Inc., 2002. 4.1

[107] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017. 3.1, 3.2, 3.4, 3.4, 3.4, 3.3

[108] Toby Segaran, Colin Evans, and Jamie Taylor. Programming the Semantic Web: Build Flexible Applications with Graph Data. O'Reilly Media, Inc., 2009. 6.1

[109] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. ICLR, 2017. 2.2, 3.1

[110] Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction networks for question answering. ICLR, 2017. 2.5.2, 2.5.2

[111] Minjoon Seo, Tom Kwiatkowski, Ankur P Parikh, Ali Farhadi, and Hannaneh Hajishirzi. Phrase-indexed question answering: A new challenge for scalable document comprehension. In EMNLP, 2018. 1.3, 6.2, 6.2

[112] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284, 2016. 2.1, 2.2, 2.3

[113] Edward H Shortliffe and Bruce G Buchanan. A model of inexact reasoning in medicine. Mathematical biosciences, 23(3-4):351–379, 1975. 1

[114] Anshumali Shrivastava and Ping Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pages 2321–2329, 2014. 1.3, 6.1, 6.2

[115] Amit Singhal. Introducing the knowledge graph: things, not strings, May 2012. URL https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html. 1

[116] Linfeng Song, Zhiguo Wang, and Wael Hamza. A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058, 2017. 5.2

[117] Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. Iterative alternating neural attention for machine reading. arXiv preprint arXiv:1606.02245, 2016. 2.1, 2.2, 2.3

[118] Amanda Spink, Dietmar Wolfram, Major BJ Jansen, and Tefko Saracevic. Searching the web: The public and their queries. Journal of the Association for Information Science and Technology, 52(3):226–234, 2001. 4.2

[119] Sandeep Subramanian, Tong Wang, Xingdi Yuan, and Adam Trischler. Neural models for key phrase detection and question generation. arXiv preprint arXiv:1706.04560, 2017. 5.2

[120] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007. 3

[121] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439, 2015. 2.1, 2.3, 2.4

[122] Haitian Sun*, Bhuwan Dhingra*, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William W. Cohen. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4231–4242, 2018. URL https://aclanthology.info/papers/D18-1455/d18-1455. 1.1, 1.2

[123] Swabha Swayamdipta. Learning Algorithms for Broad-Coverage Semantic Parsing. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2017. 2.2

[124] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015. 2.2

[125] A. Talmor and J. Berant. The web as a knowledge-base for answering complex questions. In North American Association for Computational Linguistics (NAACL), 2018. 3.1, 3.1, 6.5

[126] Duyu Tang, Nan Duan, Tao Qin, and Ming Zhou. Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027, 2017. 5.2

[127] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. What do you learn from context? probing for sentence structure in contextualized word representations. ICLR, 2019. 6.3.2, 6.3.2

[128] Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. Tweet2vec: Character-based distributed representations for social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers, 2016. URL http://aclweb.org/anthology/P/P16/P16-2044.pdf. 2.3

[129] Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 484–495, 2017. doi: 10.18653/v1/P17-1045. URL https://doi.org/10.18653/v1/P17-1045. 1.2

[130] Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1832–1846, 2017. doi: 10.18653/v1/P17-1168. URL https://doi.org/10.18653/v1/P17-1168. 1.2

[131] Bhuwan Dhingra, Kathryn Mazaitis, and William W. Cohen. Quasar: Datasets for question answering by search and reading. CoRR, abs/1707.03904, 2017. URL http://arxiv.org/abs/1707.03904. 3.3, 6.1

[132] Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Neural models for reasoning over multiple mentions using coreference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 42–48, 2018. URL https://aclanthology.info/papers/N18-2007/n18-2007. 1.2

[133] Bhuwan Dhingra, Danish Pruthi, and Dheeraj Rajagopal. Simple and effective semi-supervised question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 582–587, 2018. URL https://aclanthology.info/papers/N18-2092/n18-2092. 1.2, 3.1

[134] Ryan Tibshirani and Larry Wasserman. Lecture notes on clustering, February 2017. URL http://www.stat.cmu.edu/~ryantibs/statml/lectures/clustering.pdf. 6.4

[135] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, 2015. 3.2

[136] Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. Natural language comprehension with the epireader. EMNLP, 2016. 2.2

[137] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. 2.2, 6.3.2

[138] Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. Multilingual relation extraction using compositional universal schema. NAACL, 2016. 3.2

[139] Denny Vrandecic and Markus Krotzsch. Wikidata: a free collaborative knowledge base. 2014. 6.3.1

[140] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013. 3.4

[141] Hai Wang, Takeshi Onishi, Kevin Gimpel, and David McAllester. Emergent logical structure in vector representations of neural readers. 2nd Workshop on Representation Learning for NLP, ACL, 2017. 2.2

[142] Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced reader-ranker for open-domain question answering. 2018. 6.1

[143] Yusuke Watanabe, Bhuwan Dhingra, and Ruslan Salakhutdinov. Question answering from unstructured text by retrieval and comprehension. arXiv preprint arXiv:1703.08885, 2017. 1.1, 3.3, 6.1

[144] Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481, 2017. 2.5.3, 3.1

[145] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562, 2016. 4.1, 4.2, 4.4, 4.4, 4.6.1

[146] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. 2.5.2

[147] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. ICLR, 2015. 2.1, 2.2, 2.3

[148] Georg Wiese, Dirk Weissenborn, and Mariana Neves. Neural domain adaptation for biomedical question answering. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 281–289, Vancouver, Canada, August 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/K17-1029. 3.1

[149] Georg Wiese, Dirk Weissenborn, and Mariana L. Neves. Neural question answering at bioasq 5b. In BioNLP 2017, Vancouver, Canada, August 4, 2017, pages 76–79, 2017. doi: 10.18653/v1/W17-2309. URL https://doi.org/10.18653/v1/W17-2309. 5.2, 5.4.1, 5.3, 5.4.2

[150] Jason D Williams and Steve Young. Scaling up POMDPs for dialog management: The “Summary POMDP” method. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pages 177–182. IEEE, 2005. 4.4

[151] Jason D Williams and Geoffrey Zweig. End-to-end lstm-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269, 2016. 4.1, 4.2, 4.4

[152] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. 4.5, 4.5

[153] Sam Wiseman, Alexander M Rush, and Stuart M Shieber. Learning global features for coreference resolution. NAACL, 2016. 2.1

[154] Ji Wu, Miao Li, and Chin-Hui Lee. A probabilistic framework for representing dialog systems and entropy-based dialog management through dynamic stochastic state evolution. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11):2026–2035, 2015. 4.2, 4.4, 4.6.2

[155] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. 6.3.2

[156] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. Advances in Neural Information Processing Systems, 2016. 2.3

[157] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Learning multi-relational semantics using neural-embedding models. NIPS Workshop on Learning Semantics, 2014. 2.3

[158] Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270, 2016. 1, 2.3

[159] Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William W Cohen. Semi-supervised qa with generative domain-adaptive nets. ACL, 2017. 5.1, 5.2, 5.4.1, 5.4.2

[160] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. 6.5

[161] Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. Reference-aware language models. EMNLP, 2017. 2.2

[162] Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 189–194. IEEE, 2014. 4.4

[163] Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing, China, July 2015. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P15-1128. 1

[164] Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 201–206, 2016. 3.1, 3.5.1

[165] Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. Neural generative question answering. International Joint Conference on Artificial Intelligence, 2016. 4.2

[166] Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. Neural enquirer: Learning to query tables. International Joint Conference on Artificial Intelligence, 2016. 4.2

[167] Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013. 2, 4.1, 4.4

[168] Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. Improved neural relation detection for knowledge base question answering. ACL, 2017. 3.5.4

[169] Tiancheng Zhao and Maxine Eskenazi. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint arXiv:1606.02560, 2016. 4.2, 4.4

[170] Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer. Do we need entity-centric knowledge bases for entity disambiguation? In Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies, page 4. ACM, 2013. 4.3