=Paper= {{Paper |id=Vol-1773/CoCoNIPS_2016_paper2 |storemode=property |title=ReasoNet: Learning to Stop Reading in Machine Comprehension |pdfUrl=https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper2.pdf |volume=Vol-1773 |authors=Yelong Shen,Po-Sen Huang,Jianfeng Gao,Weizhu Chen |dblpUrl=https://dblp.org/rec/conf/nips/ShenHGC16 }} ==ReasoNet: Learning to Stop Reading in Machine Comprehension== https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper2.pdf

ReasoNet: Learning to Stop Reading in Machine
Comprehension

Yelong Shen, Po-Sen Huang, Jianfeng Gao, Weizhu Chen
Microsoft Research, Redmond, WA, USA
{yeshen,pshuang,jfgao,wzchen}@microsoft.com

Abstract
Teaching a computer to read a document and answer general questions pertaining
to the document is a challenging yet unsolved problem. In this paper, we describe
a novel neural network architecture called the Reasoning Network (ReasoNet) for
machine comprehension tasks. ReasoNets make use of multiple turns to effectively
exploit and then reason over the relation among queries, documents, and answers.
Different from previous approaches using a fixed number of turns during inference,
ReasoNets introduce a termination state to relax this constraint on the reasoning
depth. With the use of reinforcement learning, ReasoNets can dynamically deter-
mine whether to continue the comprehension process after digesting intermediate
results, or to terminate reading when it concludes that existing information is ade-
quate to produce an answer. ReasoNets have achieved state-of-the-art performance
in machine comprehension datasets, including unstructured CNN and Daily Mail
datasets, and a structured Graph Reachability dataset.

1 Introduction
Teaching machines to read, process, and comprehend natural language documents is a coveted goal
for artificial intelligence [2, 17, 6]. Genuine reading comprehension is extremely challenging, since
effective comprehension involves thorough understanding of documents and performing sophisticated
inference. Toward solving this machine reading comprehension problem, in recent years, several
work has collected various datasets, in the form of question, passage, and answer, to test machine
on answering a question based on the provided passage [17, 6, 7, 16]. Some large-scale cloze-style
datasets [6, 7] have gained significant attention along with powerful deep learning models.
Recent approaches on cloze-style datasets can be separated into two categories: single-turn and multi-
turn reasoning. Single turn reasoning models utilize attention mechanisms [1] with deep learning
models to emphasize specific parts of the document which are relevant to the query. These attention
models subsequently calculate the relevance between a query and the corresponding weighted
representations of document subunits (e.g. sentences or words) to score target candidates [7, 6, 8].
However, considering the sophistication of the problem, after a single-turn comprehension, readers
often revisit some specific passage or the question to grasp a better understanding of the problem.
With this motivation, recent advances in reading comprehension have made use of multiple turns to
infer the relation between query, document and answer [7, 5, 22, 18]. By repeatedly processing the
document and question after digesting intermediate information, multi-turn reasoning can generally
produce a better answer and all existing work has demonstrated its superior performance consistently.
Existing multi-turn models have a fixed number of hops or iterations in their inference, i.e., with pre-
determined reasoning depth, without regard to the complexity of each individual query or document.
However, when a human reads a document with a question in mind, we often decide whether we want
to stop reading if we believe the observed information is adequate already to answer the question,
or continue reading after digesting intermediate information until we can answer the question with
confidence. This behavior generally varies from document to document, or question to question

Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes.
because it is related to the sophistication of the document or the difficulty of the question. Meanwhile,
the analysis in [3] also illustrates the huge variations in the difficulty level with respect to questions
in the CNN/Daily Mail datasets [6]. For a significant part of the datasets, this analysis shows that the
problem cannot be solved without appropriate reasoning on both its query and document.
With this motivation, we propose a novel neural network architecture called Reasoning Network
(ReasoNet). ReasoNets try to mimic the inference process of human readers. With a question in mind,
ReasoNets read a document repeatedly, each time focusing on different parts of the document until a
satisfying answer is found or formed. This reminds us of a Chinese proverb: “The meaning of a book
will become clear if you read it hundreds of times.”. Moreover, unlike previous approaches using
fixed number of hops or iterations, ReasoNets introduce a termination state in the inference. This state
can decide whether to continue the inference to next turn after digesting intermediate information, or
to terminate the whole inference when it concludes that existing information is sufficient to yield an
answer. This number of turns in the inference is dynamically modeled by both the document and the
query, and can be learned automatically according to the difficulty of the problem.
One of the significant challenges ReasoNets face is how to design an efficient training method,
since the termination state is discrete and not connected to the final output. This prohibits canonical
back-propagation method being directly applied to train ReasoNets. Inspired by [24, 13], we tackle
this challenge by proposing a novel deep reinforcement learning method called Contrastive Reward
(CR) to successfully train ReasoNets. Unlike traditional reinforcement learning optimization methods
using a global variable to capture rewards, CR utilizes an instance-based reward baseline assignment.
Experiments show the superiority of CR in both training speed and accuracy. Finally, by accounting
for a dynamic termination state during inference and applying proposed deep reinforcement learning
optimization method, ReasoNets achieve the state-of-the-art results in machine comprehension
datasets when the paper is first publicly available in arXiv1 , including unstructured CNN and Daily
Mail datasets, and a proposed structured Graph Reachability dataset.
This paper is organized as follows. In Section 2, we review and compare recent work on machine
reading comprehension tasks. In Section 3, we introduce our proposed ReasoNet model architecture
and training objectives. Section 4 presents the experimental setting and results on unstructured and
structured machine reading comprehension tasks .

2 Related Work

Recently, with large-scale datasets available and the impressive advance of various statistical models,
machine reading comprehension tasks have attracted much attention. Here we mainly focus on the
related work in cloze-style datasets [6, 7]. Based on how they perform the inference, we can classify
their models into two categories: single-turn and multi-turn reasoning.
Single-turn reasoning Single turn reasoning models utilize an attention mechanism to emphasis
some sections of a document which are relevant to a query. This can be thought of as treating some
parts unimportant while focusing on other important ones to find the most probable answer. [6]
propose the attentive reader and the impatient reader models using neural networks with an attention
over passages to predict candidates. [7] use attention over window-based memory, which encodes
a window of words around entity candidates, by leveraging an end-to-end memory network [19].
Meanwhile, given the same entity candidate can appear multiple times in a passage, [8] propose the
attention-sum reader to sum up all the attention scores for the same entity. This score captures the
relevance between a query and a candidate. [3] propose using a bilinear term similarity function
to calculate attention scores with pretrained word embedding. [22] propose the EpiReader which
uses two neural network structures: one extracts candidates using the attention-sum reader; the
other reranks candidates based on a bilinear term similarity score calculated from query and passage
representations.
Multi-turn reasoning For complex passages and complex queries, human readers often revisit the
given document in order to perform deeper inference after reading a document. Several recent
studies try to simulate this revisit by combining the information in the query with the new information
digested from previous iterations [7, 5, 18, 23, 12]. [7] use multiple hops memory network to augment
the query with new information from the previous hop. Gated Attention reader [5] is an extension of
1
https://arxiv.org/abs/1609.05284

2
Algorithm 1: Stochastic Inference in a ReasoNet
Input :Memory M ; Initial state s1 ; Step t = 1; Maximum Step Tmax
Output :Termination Step T , Answer aT
1 Sample tt from the distribution p(·|ftg (st ; θtg ));
2 if tt is false, go to Step 3; otherwise Step 6;
3 Generate attention vector xt = fatt (st , M ; θx );
4 Update internal state st+1 = RNN(st , xt ; θs );
5 Set t = t + 1; if t < Tmax go to Step 1; otherwise Step 6;
6 Generate answer at ∼ p(·|fa (st ; θa ));
7 Return T = t and aT = at ;

the attention-sum reader with multiple iterations by pushing the query encoding into an attention-
based gate in each iteration. Iterative Alternative (IA) reader [18] produces a new query glimpse and
document glimpse in each iteration and utilizes them alternatively in the next iteration. [4] further
propose to extend the query-specific attention to both query-to-document attention and document-to-
query attention, which is built from the intermediate results in the query-specific attention. By reading
documents and enriching the query in an iterative fashion, multi-turn reasoning has demonstrated
their superior performance consistently.
Our proposed approach explores the idea of using both attention-sum to aggregate candidate attention
scores and multiple turns to attain a better reasoning capability. Unlike previous approaches using
fixed number of hops or iterations, motivated by [14, 13], we propose a termination module in the
inference. The termination module can decide whether to continue to infer the next turn after digesting
intermediate information, or to terminate the whole inference process when it concludes existing
information is sufficient to yield an answer. The number of turns in the inference is dynamically
modeled by both a document and a query, and is generally related to the complexity of the document
and the query.

3 Reasoning Networks
ReasoNets are devised to mimic the inference process of human readers. ReasoNets read a document
repeatedly, with attention on different parts each time until a satisfying answer is found. As shown in
Figure 1, a ReasoNet is composed of the following components:
Internal State: The internal state is denoted as S which is a vector representation of the question
state. Typically, the initial state s1 is the last-word vector representation of query by an RNN. The
t-th time step of the internal state is represented by st . The sequence of internal state is modeled by
an RNN: st+1 = RNN(st , xt ; θs );
Memory: The external memory is denoted as M . It is a list of word vectors, M = {mi }i=1..D , where
mi is a fixed dimensional vector. In machine comprehensive tasks, mi is the vector representation of
each word in the doc by a bidirectional-RNN.
Attention: Attention vector xt is generated based on the current internal state st and the external
memory M : xt = fatt (st , M ; θx );
Termination Gate: Termination gate generates a stochastic random variable according to the current
internal state; tt ∼ p(·|ftg (st ; θtg ))). tt is a binary random variable. If tt is true, the ReasoNet stops,
and the answer module executes at time step t; otherwise the ReasoNet generates an attention vector
xt+1 , and feed into the state network to update the next internal state st+1 .
Answer: The action of answer module is triggered when the termination gate variable is true:
at ∼ p(·|fa (st ; θa )).
In Algorithm 1, we describe the stochastic inference process of a ReasoNet. The process can be
considered as a Partially Observable Markov Decision Process (POMDP) [9] in the reinforcement
learning (RL) literature. The state sequence s1:T is hidden and dynamic, controlled by an RNN
sequence model. The ReasoNet performs an answer action aT at the T -th step, which implies that
the termination gate variables t1:T = (t1 = 0, t2 = 0, ..., tT −1 = 0, tT = 1). The ReasoNet learns a
stochastic policy π((tt , at )|st ; θ) with parameters θ to get a distribution over termination actions, to

3
Memory M

Attention Attention
Query

fatt(θx) Xt fatt(θx) Xt+1

RNN states

S1 St St+1 St+2

False False
ftg(θtg) ftg(θtg)

Tt Tt+1
True True

Termination Termination
fa(θa) fa(θa)

Answer at Answer at+1

Figure 1: A ReasoNet Architecture.

continue reading or to stop, and over answer actions if the model decides to stop at the current step.
The termination step T varies from instance to instance.
The parameters θ of the ReasoNet are given by the parameters of the embedding matrices W ,
attention network θx , the state RNN network θs , the answer action network θa , and the termination
gate network θtg . The parameters θ = {W, θx , θs , θa , θtg } are trained by maximizing the total expect
reward. The expected reward for an instance is defined as:
" T #
X
J(θ) = Eπ(t1:T ,aT ;θ) rt
t=1

The reward can only be received at the final termination step when an answer action aT is performed.
We define rT = 1 if tT = 1 and the answer is correct, and rT = 0 otherwise. The rewards on
intermediate steps are zeros, {rt = 0}t=1...T −1 . J can be maximized by directly applying gradient
based optimization methods. The gradient of J is given by:
∇θ J(θ) = Eπ(t1:T ,aT ;θ) [∇θ logπ(t1:T , aT ; θ)rT ]

We apply the REINFORCE algorithm [24] to compute ∇θ J(θ):

X
Eπ(t1:T ,aT ;θ) [∇θ logπ(t1:T , aT ; θ)rT ] = π(t1:T , aT ; θ) [∇θ logπ(t1:T , aT ; θ)(rT − bT )]
(t1:T ,aT )∈A†

where A† is all the possible episodes, T, t1:T , aT and rT are the termination step, termination action,
answer action, and reward, respectively, for the (t1:T , aT ) episode. bT is called the reward baseline
in the RL literature to lower variance [21]. It is common to select bT = Eπ [rT ] [20], and can be
updated via an online moving average approach : bT = λbT + (1 − λ)rT .
However, we empirically find that above approach leads to slow convergence in training ReasoNets.
Intuitively, the average baselines {bT ; T = 1..Tmax } are global variables independent of instances.
It is hard for these baselines to capture the dynamic termination behavior of ReasoNets. In other
words, ReasoNets may stop at different time steps for different instances. The adoption of a global
variable without considering the dynamic variance in each instance is inappropriate. To resolve this
weakness in traditional methods and account for the dynamic characteristic of ReasoNets, we propose
an instance-based baseline method called “Contrastive Reward” (CR) to calculate ∇θ J(θ). The basic
idea of CR is to utilize an instance-based baseline assignment. We will elaborate its implementation
details in Section 3.1. Empirical results show that the proposed reward schema has produced better
results compared to the baseline approach.

4
3.1 Training Details

In the machine reading comprehension tasks, a training dataset can be simplified as a collection of
triplets of query q, passage p, and answer a. Say hqn , pn , an i is the n-th training instance.
The first step is to extract memory M from pn by mapping each symbolic in the passage to a
contextual representation given by the concatenation of forward and backward RNN hidden states,
i.e., mk = [−p→ k ←−|pn |−k+1 ], and extract initial state s from q by assigning s = [→
n , pn 1 n 1
−
qn |qn | , ←
q− 1
n ].
†
Given M and s1 for the n-th training instance, a ReasoNet executes |A | episodes, where all possible
episodes A† can be enumerated by setting a maximum step. Each episode generates actions and a
reward from the last step: h(t1:T , aT ), rT i(t1:T ,aT )∈A† .
Therefore, the gradient of J can be rewritten as:
X
∇θ J(θ) = π(t1:T , aT ; θ) [∇θ logπ(t1:T , aT ; θ)(rT − b)]
(t1:T ,aT )∈A†

where the baseline b = (t1:T ,aT )∈A† π(t1:T , aT ; θ)rT is the average reward on the |A† | episodes
P

for the n-th training instance. It allows different baselines for different training instances. This can
be beneficial since the complexity of trainingP instances varies significantly. Since the sum of the
proposed rewards over |A† | episodes is zero, (t1:T ,aT )∈A† π(t1:T , aT ; θ)(rT − b) = 0, we call it
Contrastive Reward in this work. In experiments, we empirically find using ( rbT − 1) in replace of
(rT − b) can lead to a faster convergence. Therefore, we adopt this approach to train ReasoNets in
the experiments.

4 Experiments
4.1 CNN and Daily Mail Datasets

We evaluate the performance of ReasoNets on CNN and Daily Mail datasets.2 The detailed settings
of the ReasoNet model are as follows.
Vocab Size: For training our ReasoNet, we keep the most frequent |V | = 101k words (not including
584 entities and 1 placeholder marker) in the CNN dataset, and |V | = 151k words (not including
530 entities and 1 placeholder marker) in the Daily Mail dataset.
Embedding Layer: We choose word embedding size d = 300, and use the 300 dimensional
pretrained Glove word embeddings [15] for initialization. We also apply dropout with probability 0.2
to the embedding layer.
Bi-GRU Encoder: We apply bi-directional GRU for encoding query and passage into vector repre-
sentations. We set the number of hidden units to be 256 and 384 for the CNN and Daily Mail datasets,
respectively. The recurrent weights of GRUs are initialized with random orthogonal matrices. The
other weights in GRU cell are initialized from a uniform distribution between −0.01 and 0.01. We
use a shared GRU model for both query and passage.
Memory and Attention: The memory of the ReasoNet on CNN and Daily Mail dataset is
composed of query memory and passage memory. M = (M query , M doc ), where M query and
M doc are extracted from query bidirectional-GRU encoder and passage bidirectional-GRU en-
coder respectively. We choose projected cosine similarity function as the attention module.
The attention score adoc t,i on memory mi
doc
given the state st is computed as follows: adoc t,i =
softmaxi=1,...,|M doc | γ cos(W1doc mdoc
i , W doc
2 st ), where γ is set to 10. W doc
1 and W doc
2 are weight
vectors associated with mdoc i and st , respectively, and are joint trained in the ReasoNet. Thus,
P|M |
attention vector on passage is given by xdoc t = i at,i mdoc i . The final attention vector is the
concatenation of the query attention vector and the passage attention vector xt = (xquery
t , xdoc
t ). The
query query doc doc
attention module is parameterized by θx = (W1 , W2 , W1 , W2 );
Internal State Controller: We choose GRU model as the internal state controller. The number of
hidden units in the GRU state controller is 256 for CNN and 384 for Daily Mail. The initial state
2
The CNN and Daily Mail datasets are available at https://github.com/deepmind/rc-data

5
Query: passenger @placeholder 1, 36 , died at the scene
11
Passage: ( @entity0 ) what was supposed to be a fantasy sports car ride at
@entity3 turned deadly when a @entity4 3crashed into a guardrail 1. the crash
took place sunday at the @entity8 , which bills itself as a chance to drive your
dream car on a racetrack . the @entity4 's passenger , 36 - year - old @entity14 1
23 3 1
of @entity15 , @entity16 , died at the scene , @entity13 said . the driver of the Termination Attention
@entity4 , 24 - year - old @entity182of @entity19 , @entity16 , lost control of Step
2 Probability Sum
the vehicle , the @entity13 said . he was hospitalized with minor injuries .
@entity24 , which operates the @entity8 at @entity3 , released a statement 1 0.0011 0.4916
sunday night about the crash . " on behalf of everyone in the organization , it is 2 0.5747 0.5486
with a very heavy heart that we extend our deepest sympathies to those
involved in today 's tragic accident in @entity36 , " the company said . @entity24 3 0.9178 0.5577
also operates the @entity3 -- a chance to drive or ride in @entity39 race cars
named for the winningest driver in the sport 's history . @entity0 's @entity43
and @entity44 contributed to this report .

Answer: @entity14 Step 1 Step 2 Step 3

Figure 2: Results of a test example 69e1f777e41bf67d5a22b7c69ae76f0ae873cf43.story from the CNN dataset.
The numbers next to the underline bars indicate the rank of the attention scores. The corresponding termination
probability and the sum of attention scores for the answer entity are shown in the table on the right.

of the GRU controller is set to be the last-word of the query representation by a bidirectional-GRU
encoder.
Termination Module: We adopt a logistical regression to model the termination variable at each
time step : ftg (st ; θtg ) = sigmoid(Wtg st + btg ); θtg = (Wtg , btg )
Answer Module: We apply a linear projection from GRU outputs and make predictions on the entity
candidates. Following the settings in AS Reader [8], we sum up scores from the same candidate and
make a prediction. Thus, AS Reader can be viewed as a special case of ReasoNets with Tmax = 1.
Other Details: The maximum reasoning step, Tmax is set to 5 in experiments on both CNN and Daily
Mail datasets. We use ADAM optimizer [10] for parameter optimization with an initial learning rate
of 0.0005, β1 = 0.9 and β2 = 0.999; The absolute value of gradient on each parameter is clipped
within 0.001. The batch size is 64 for both CNN and Daily Mail datasets. For each batch of the CNN
and Daily Mail datasets we randomly reshuffle the assignment of named entities [6]. This forces
the model to treat the named entities as semantically meaningless labels. In the prediction of test
cases, we randomly reshuffle named entities up to 4 times, and report the averaged answer. Models
are trained on GTX TitanX 12GB. It takes 7 hours per epoch to train on the Daily Mail dataset and 3
hours per epoch to train on the CNN dataset. The models are usually converged within 6 epochs on
both CNN and Daily Mail datasets.
Table 1 shows the performance of all the existing single model baselines and our proposed ReasoNet.
By capturing multi-turn reasoning and learning to stop reading a paragraph, we have achieved the
state-of-the-art results in both CNN and Daily Mail datasets. To further understand the inference
process of the ReasoNet, Figure 2 shows a test example of the CNN dataset. The model initially
focuses on wrong entities with low termination probability. In the second and third steps, the model
focuses on the right clue with higher termination probability. Interestingly, we also find that query
attention focuses on the placeholder token throughout all the steps.

4.2 Graph Reachability Task

Recent analysis and results [3] on the cloze-style machine comprehension tasks have suggested some
simple models without multi-turn reasoning can achieve reasonable performance. Based on these
results, we construct a synthetic structured Graph Reachability dataset3 to evaluate longer range
machine inference and reasoning capability, since we expect ReasoNets have the capability to handle
long range relationships.
We generate two synthetic datasets: a small graph dataset and a large graph dataset. In the small
graph dataset, it contains 500K small graphs, where each graph contains 9 nodes, and 16 direct edges
3
The dataset is available at https://github.com/MSRDL/graph_reachability_dataset

6
Table 1: The performance of Reasoning Network on CNN and Daily Mail dataset.
CNN Daily Mail
valid test valid test
Deep LSTM Reader [6] 55.0 57.0 63.3 62.2
Attentive Reader [6] 61.6 63.0 70.5 69.0
MemNets [7] 63.4 66.8 - -
AS Reader [8] 68.6 69.5 75.0 73.9
Stanford AR [3] 72.2 72.4 76.9 75.8
DER Network [11] 71.3 72.9 - -
Iterative Attention Reader [18] 72.6 73.3 - -
EpiReader [22] 73.4 74.0 - -
GA Reader [5] 73.0 73.8 76.7 75.7
AoA Reader [4] 73.1 74.4 - -
ReasoNet 72.9 74.7 77.6 76.6

Table 2: Reachability statistics of the Graph Reachability dataset.
Small Graph Large Graph
Reachable Step No Reach 1–3 4–6 7–9 No Reach 1–3 4–6 7–13
Train (%) 44.16 42.06 13.51 0.27 49.02 25.57 21.92 3.49
Test (%) 45.00 41.35 13.44 0.21 49.27 25.46 21.74 3.53

to randomly connect pairs of nodes. The large graph dataset contains 500K graphs, where each graph
contains 18 nodes, and 32 random direct edges. Duplicated edges are removed. Table 2 shows the
graph reachability statistics on the two datasets.
In Table 3, we show examples of a small graph and a large graph in the synthetic dataset. Both graph
and query are represented by a sequence of symbols. In the experiment, we use a 100-dimensional
embedding vector for each symbol, and bidirectional-LSTM with 128 and 256 cells for query and
graph embedding in the small and the large graph datasets, respectively. The last states of bidirectional-
LSTM on query are concatenated to be the initial internal state s1 = [→ −q |q| , ←
−
q 1 ] in the ReasoNet.
Another bidirectional-LSTM on graph description maps each symbol g i to a contextual representation
given by the concatenation of forward and backward LSTM hidden states mi = [→ −g i, ←−
g |g|−i+1 ]. The
final answer is either “Yes” or “No” and hence logistical regression is used as the answer module:
at = σ(Wa st + ba ); θa = (Wa , ba ). We apply another logistical regression as the termination gate
module: tt = σ(Wtg st + btg ). The maximum reasoning step Tmax is set to 15 and 25 for the small
graph and large graph dataset, respectively.
We study the effectiveness of the termination gate in ReasoNets. We denote “ReasoNet” as a standard
ReasoNet with termination gate, as described in Section 3.1. If we remove the termination gate, and
just simply use the last state answer action as the final answer, say â = aTmax (Tmax is the maximum
reasoning step), denoted as “ReasoNet-Last”. To study the effectiveness of multi-turn reasoning,
we choose “ReasoNet-Tmax = 2”, which only has single-turn reasoning, as a baseline.
In Table 4, we report the performance of ReasoNet, ReasoNet-Last and ReasoNet-Tmax = 2
models on the Graph Reachability dataset. The ReasoNet-Last model performs well on the small
graph dataset, and it obtains 100% accuracy. However, the ReasoNet-Last model fails to learn on
the large graph dataset, as the task becomes much more challenging. Meanwhile, the ReasoNet
model converges faster than the ReasoNet-Last model. The ReasoNet model converges in 20
epochs in the small graph dataset, and 40 epochs in the large graph dataset, while the ReasoNet-Last
model converges around 40 epochs in the small graph dataset, and 70 epochs in the large graph dataset.
The results suggest that the termination gate variable in the ReasoNet is helpful when training with
sophisticated examples, and makes models converge faster. Both the ReasoNet and ReasoNet-Last
models perform better than the ReasoNet-Tmax = 2 model, which demonstrates the importance of
multi-turn reasoning. To further understand the inference process in ReasoNets, we present two
examples of the graph reachability results in appendix A.

7
Table 3: Small and large random graph in the Graph Reachability dataset. Note that “A –> B”
represents an edge connected from A to B and the # symbol is used as a delimiter between different
edges.
Small Graph Large Graph
Graph Description 0 –> 0 # 0 –> 2 # 1 –> 2 # 2 –> 1 # 0 –> 17 # 1 –> 3 # 1 –> 14 # 1 –> 6 #
3 –> 2 # 3 –> 3 # 3 –> 6 # 3 –> 7 # 2 –> 11 # 2 –> 13 # 2 –> 15 # 3 –> 7#
4 –> 0 # 4 –> 1 # 4 –> 4 # 5 –> 7 # 5 –> 0 # 5 –> 7 # 6 –> 10 # 6 –> 5#
6 –> 0 # 6 –> 1 # 7 –> 0 # 7 –> 15 # 7 –> 7 # 8 –> 11 # 8 –> 7 #
10 –> 9 # 10 –> 6 # 10 –> 7 # 12 –> 1 #
12 –> 12 # 12 –> 6 # 13 –> 11 # 14 –> 17 #
14 –> 14 # 15 –> 10 # 16 –> 2 # 17 –> 4 #
17 -> 7 #
Query 7 –> 4 10 –> 17
Answer No Yes

Table 4: The performance of Reasoning Network on the Graph Reachability dataset.
Small Graph Large Graph
ROC-AUC PR-AUC Accuracy ROC-AUC PR-AUC Accuracy
ReasoNet-Tmax = 2 0.9638 0.9677 0.8961 0.8477 0.8388 0.7607
ReasoNet-Last 1 1 1 0.8836 0.8742 0.7895
ReasoNet 1 1 1 0.9988 0.9989 0.9821

5 Conclusion
In this paper, we propose ReasoNets that dynamically decide whether to continue or to terminate the
inference process in machine comprehension tasks. Using reinforcement learning with the proposed
contractive reward, our proposed model achieves the start-of-the-art results in machine comprehension
datasets, including unstructured CNN and Daily Mail datasets, and a proposed structured Graph
Reachability dataset. For future work, ReasoNets can be generalized to other tasks that requires
reasoning capability, such as question answering and knowledge graph inference.
Acknowledgments
We thank Ming-Wei Chang, Li Deng, Lihong Li, and Xiaodong Liu for their thoughtful feedback and
discussions.

References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning
to align and translate. In Proceedings of the International Conference on Learning Representations, 2015.

[2] Léon Bottou. From machine learning to machine reasoning. Machine Learning, 94(2):133–149, 2014.

[3] Danqi Chen, Jason Bolton, and Christopher D Manning. A thorough examination of the CNN / Daily Mail
reading comprehension task. In ACL, 2016.

[4] Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention
neural networks for reading comprehension. CoRR, abs/1607.04423, 2016.

[5] Bhuwan Dhingra, Hanxiao Liu, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for
text comprehension. CoRR, abs/1606.01549, 2016.

[6] Karm Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information
Processing Systems, pp. 1693–1701, 2015.

[7] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children’s
books with explicit memory representations. In Proceedings of the International Conference on Learning
Representations, 2016.

8
[8] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention
sum reader network. arXiv:1603.01547v1 [cs.CL], 2016.

[9] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially
observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the
International Conference on Learning Representations, 2015.

[11] Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. Dynamic entity representation with
max-pooling improves machine reading. In Proceedings of the North American Chapter of the Association
for Computational Linguistics and Human Language Technologies (NAACL-HLT), 2016.

[12] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong,
Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language
processing. In Proceedings of the International Conference on Machine Learning, 2016.

[13] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in
Neural Information Processing Systems, pp. 2204–2212, 2014.
[14] Rodrigo Nogueira and Kyunghyun Cho. Webnav: A new large-scale task for natural language based
sequential decision making. In Advances in Neural Information Processing Systems, 2016.

[15] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word
representation. In EMNLP, 2014.

[16] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100, 000+ questions for
machine comprehension of text. In EMNLP, 2016.
[17] Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset for the
open-domain machine comprehension of text. In EMNLP, 2013.

[18] Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. Iterative alternating neural attention for
machine reading. CoRR, abs/1606.02245, 2016.

[19] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in
neural information processing systems, pp. 2440–2448, 2015.

[20] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods
for reinforcement learning with function approximation. In Advances in Neural Information Processing
Systems, 1999.

[21] Richard Stuart Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, 1984.
[22] Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. Natural language comprehension with the
EpiReader. In EMNLP, 2016.

[23] Dirk Weissenborn. Separating answers from queries for neural reading comprehension. CoRR,
abs/1607.03316, 2016.

[24] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Machine Learning, 8(3-4):229–256, 1992.

A Examples of Graph Reachability Results in ReasoNets
Figures 3 and 4 show test examples of the large graph dataset. In Figure 3, we observe that the
model does not make a firm prediction till step 9. The highest attention word at each step shows
the reasoning process of the model. Interestingly, the model starts from the end node (17), traverses
backward till finding the starting node (10) in step 9, and makes a firm termination prediction. On the
other hand, in Figure 4, the model learns to stop in step 2. In step 1, the model looks for neighbor
nodes (12, 6, 16) to 4 and 9. Then, the model gives up in step 2 and predict “No". All of these
demonstrate the dynamic termination characteristic and potential reasoning capability of ReasoNets.

9
Step
Step 2 3

Step
Step 1 2 Step
Step 3 4 Step Termination Probability Prediction
1 1.00E-06 0.172
Step
Step 10 2 1.00E-06 0.625
3 1.00E-06 0.752
4 1.00E-06 0.202
5 1.00E-06 0.065
6 1.00E-06 0.041
7 2.30E-06 0.137
8 0.0017 0.136
Step 9 9 0.49 0.761
Step 10
10 0.99 0.927

Step 6, 7, 9
Steps
8 Step 4,
Steps 5, 6, 8
5, 7

Figure 3: An example of graph reachability result, given a query “10 –> 17” (Answer: Yes). The red
circles highlight the nodes/edges which have the highest attention in each step. The corresponding termination
probability and prediction results are shown in the table. The model terminates at step 10.

1 -> 16 #3 1 -> 12 # 1 -> 14 # 1 -> 7 # 2 -
> 17 # 3 -> 1 #2 4 -> 0 # 4 -> 1 #1 4 -> 12 1
Step 2 # 4 -> 6 #2 6 -> 0 # 6 -> 3 # 6 -> 7 # 8 ->
2 # 8 -> 4 # 8 -> 13 # 8 -> 14 # 9 -> 16
Step 1 # 10 -> 0 # 10 -> 6 # 11 -> 10 # 11 -> 2
Step 1
# 12 -> 2 # 13 -> 2 # 13 -> 6 #3 14 -> 2 #
14 -> 7 # 16 -> 13 # 16 -> 14 # 17 -> 0
# 17 -> 13 #
Step 1 Step 2

Step Termination Probability Prediction
Step 1 1 1.40E-05 4.49E-04
Step 2
2 0.999 1.40E-05

Figure 4: An example of graph reachability result, given a query “4 –> 9” (Answer: No). The numbers next
to the underline bars indicate the rank of the attention scores. The corresponding termination probability and
prediction results are shown in the table.