<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Long-Term Memory Networks for Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fenglong Ma</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radha Chitta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saurabh Kataria</string-name>
          <email>saurabh.cse05@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Palghat Ramesh</string-name>
          <email>palghat.ramesh@parc.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tong Sun</string-name>
          <email>sunt@utrc.utc.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Gao</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Conduent Labs US</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>United Technologies Research Center</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>25</lpage>
      <abstract>
<p>Question answering is an important and difficult task in the natural language processing domain, because many basic natural language processing tasks can be cast into a question answering task. Several deep neural network architectures have been developed recently, which employ memory and inference components to memorize and reason over text information, and generate answers to questions. However, a major drawback of many such models is that they are capable of only generating single-word answers. In addition, they require a large amount of training data to generate accurate answers. In this paper, we introduce the Long-Term Memory Network (LTMN), which incorporates both an external memory module and a Long Short-Term Memory (LSTM) module to comprehend the input data and generate multi-word answers. The LTMN model can be trained end-to-end using back-propagation and requires minimal supervision. We test our model on two synthetic data sets (based on Facebook's bAbI data set) and the real-world Stanford question answering data set, and show that it can achieve state-of-the-art performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Question answering, the task of automatically answering questions posed about given unstructured text, is one of the core tasks in natural language understanding and processing. Many problems in natural language processing, such as reading comprehension, machine translation, entity recognition, sentiment analysis, and dialogue generation, can be cast as question answering problems.</p>
      <p>Traditional question answering approaches can
be categorized as: (i) IR-based question answering
[Pas03] where the question is formulated as a search
query, and a short text segment is found on the Web
or similar corpus for the answer; (ii) Knowledge-based
question answering [GJWCL61, BCFL13], which aims
to answer a natural language question by mapping it
to a semantic query over a database.</p>
      <p>The traditional approaches are simple query-based techniques. Using these traditional question-answering systems, it is difficult to establish the relationships between the sentences in the input text and to derive a meaningful representation of the information within the text.</p>
      <p>Figure 1 shows an example of a question answering task. The sentences in black are facts that may be relevant to the questions, questions are in blue, and the correct answers are in red. In order to correctly answer the question "What did Steve Jobs offer Xerox to visit and see their latest technology?", the model should have the ability to recognize that the sentence "After hearing of the pioneering GUI technology being developed at Xerox PARC, Jobs had negotiated a visit to see the Xerox Alto computer and its Smalltalk development tools in exchange for Apple stock options." is a supporting fact, and to extract the relevant portion of the supporting fact to form the answer. In addition, the model should have the ability to memorize all the facts that have been presented to it until the current time, and deduce the answer.</p>
      <p>Figure 1: An example of the question answering task. 1: Burrell's innovative design, which combined the low production cost of an Apple II with the computing power of Lisa's CPU, the Motorola 68K, received the attention of Steve Jobs, co-founder of Apple. 2: Realizing that the Macintosh was more marketable than the Lisa, he began to focus his attention on the project. 3: Raskin left the team in 1981 over a personality conflict with Jobs. 4: Why did Raskin leave the Apple team in 1981? over a personality conflict with Jobs. 5: Team member Andy Hertzfeld said that the final Macintosh design is closer to Jobs' ideas than Raskin's. 6: According to Andy Hertzfeld, whose idea is the final Mac design closer to? Jobs. 7: After hearing of the pioneering GUI technology being developed at Xerox PARC, Jobs had negotiated a visit to see the Xerox Alto computer and its Smalltalk development tools in exchange for Apple stock options. 8: What did Steve Jobs offer Xerox to visit and see their latest technology? Apple stock options.</p>
      <p>The authors of [WCB15] proposed a new class of learning models named Memory Networks (MemNN), which use a long-term memory component to store information and an inference component for reasoning. [KIO+16] proposed the Dynamic Memory Network (DMN) for general question answering tasks, which processes input sentences and questions, forms episodic memories, and generates answers. These two approaches are strongly supervised, i.e., only the supporting facts (factoids) are fed to the model as inputs when training the model for each type of question. For example, when training the model with the question in the fourth line of Figure 1, strongly supervised methods use only the sentence in line 3 as input. Thus, these methods require a large amount of training data.</p>
      <p>To tackle this issue, [SWF+15] introduced a weakly
supervised approach called End-to-End Memory
Network (MemN2N), which uses all the sentences that
have appeared before this question. For the above
example, the inputs are the sentences from line 1 to line
3 when training for the question in the fourth line.
MemN2N is trained end-to-end and uses an attention
mechanism to calculate the matching probabilities
between the input sentences and questions. The
sentences which match the question with high probability
are used as the factoids for answering the question.</p>
      <p>However, this model is capable of generating only single-word answers. For example, the answer to the question "According to Andy Hertzfeld, whose idea is the final Mac design closer to?" in Figure 1 is the single word "Jobs". Since the answers to many questions contain multiple words (for instance, the question labeled 4 in Figure 1), this model cannot be directly applied to general question answering tasks.</p>
      <p>Recurrent neural networks comprising Long Short-Term Memory units have been employed to generate multi-word text in the literature [Gra13, SVL14]. However, simple LSTM-based recurrent neural networks do not perform well on the question answering task due to the lack of an external memory component which can memorize and contextualize the facts. We present a more sophisticated recurrent neural network architecture, named the Long-Term Memory Network (LTMN), which combines the best aspects of end-to-end memory networks and LSTM-based recurrent neural networks to address the challenges faced by the currently available neural network architectures for question answering. Specifically, it first embeds the input sentences (initially encoded using a distributed representation learning mechanism such as paragraph vectors [LM14]) in a continuous space, and stores them in memory. It then matches the sentences with the questions, also embedded into the same space, by performing multiple passes through the memory, to obtain the factoids which are relevant to each question. These factoids are then employed to generate the first word of the answer, which is then input to an LSTM unit. The LSTM unit is used to generate the subsequent words in the answer. The proposed LTMN model can be trained end-to-end, requires minimal supervision during training (i.e., it is weakly supervised), and generates multi-word answers. Experimental results on two synthetic datasets and one real-world dataset show that the proposed model outperforms the state-of-the-art approaches.</p>
      <p>In summary, the contributions of this paper are as
follows:</p>
      <p>We propose an effective neural network architecture for general question answering, i.e., for generating multi-word answers to questions. Our architecture combines the best aspects of MemN2N and LSTM and can be trained end-to-end.</p>
      <p>The proposed architecture employs distributed
representation learning techniques (e.g.
paragraph2vec) to learn vector representations for
sentences or factoids, questions and words, as well as
their relationships. The learned embeddings
contribute to the accuracy of the answers generated
by the proposed architecture.</p>
      <p>We generate a new synthetic dataset with multiple
word answers based on Facebook's bAbI dataset
[WBC+16]. We call this the multi-word answer
bAbI dataset.</p>
      <p>We test the proposed architecture on two
synthetic datasets (the single-word answer bAbI
dataset and the multi-word answer bAbI dataset),
and the real-world Stanford question answering
dataset [RZLL16]. The results clearly
demonstrate the advantages of the proposed architecture
for question answering.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In this section, we review literature closely related to
question answering, particularly focusing on models
using memory networks to generate answers.
</p>
      <sec id="sec-2-1">
        <title>Question Answering</title>
        <p>Traditional question answering approaches mainly fall into two categories: IR-based [Pas03] and knowledge-based question answering [GJWCL61, BCFL13]. IR-based question answering systems use information retrieval techniques to extract information (i.e., answers) from documents. These methods first process questions, i.e., detect named entities in questions, and then predict answer types, such as city names or person names. After recognizing answer types, these approaches generate queries, and extract answers from the web using the generated queries. These approaches are simple, but they ignore the semantic relationships between questions and answers.</p>
        <p>Knowledge-based question answering systems [ZC05, BL14, ZHLZ16] consider the semantics and use existing knowledge bases, such as Freebase [BEP+08] and DBpedia [BLK+09]. They cast the question answering task as that of finding one of the missing arguments in a triple. Most knowledge-based question answering approaches use neural networks, dependency trees and knowledge bases [BGWB12], or sentences [IBGC+14].</p>
        <p>Using traditional question answering approaches, it is difficult to establish the relationships between sentences in the input text, and thereby identify the relevance of the different sentences to the question. Of late, several neural network architectures with memories have been proposed to solve this challenging problem.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Memory Networks</title>
        <p>Several deep neural network models use memory
architectures [SWF+15, KIO+16, WCB15, GWD14, JM15,
MD93] and attention mechanisms for image captioning
[YJW+16], machine comprehension [WGL+16] and
healthcare data mining [MCZ+17, SMC+17]. We
focus on the models using memory networks for natural
language question answering.</p>
        <p>Memory networks (MemNN), proposed in
[WCB15], first introduced the concept of an
external memory component for natural language
question answering. They are strongly supervised,
i.e., they are trained with only the supporting facts
for each question. The supporting input sentences are
embedded in memory, and the response is generated
from these facts by scoring all the words in the
vocabulary in correlation with the facts. This scoring
function is learnt during the training process and
employed during the testing phase. MemNN are
capable of producing only single-word answers, due
to this response generation mechanism. In addition,
MemNN cannot be trained end-to-end.</p>
        <p>The authors of [KIO+16] improve over MemNN
by introducing an end-to-end trainable network called
Dynamic Memory Networks (DMN). DMN have four
modules: input module, question module, episodic
memory module and answer module. The input
module encodes raw text inputs into distributed vector
representations using a gated recurrent unit (GRU) network
[CVMBB14]. The question module similarly encodes
the question using a recurrent neural network. The
sentences and question representations are fed to the
episodic memory module, which chooses the sentences
to focus on using the attention mechanism. It
iteratively produces a memory vector, representing all the
relevant information, which is then used by the answer
module to generate the answer using a GRU.
However, DMN are also strongly supervised like MemNN,
thereby requiring a large amount of training data.</p>
        <p>End-to-End Memory Networks (MemN2N) [SWF+15] first encode sentences into continuous vector representations, then use a soft attention mechanism to calculate matching probabilities between sentences and questions and find the most relevant facts, and finally generate responses using the vocabulary from these facts. Unlike the MemNN and DMN architectures, MemN2N can be trained end-to-end and are weakly supervised. However, the drawback of MemN2N is that it generates only single-word answers. The proposed LTMN architecture improves over the existing network architectures because (i) it can be trained end-to-end, (ii) it is weakly supervised, and (iii) it can generate answers with multiple words.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Long-Term</title>
    </sec>
    <sec id="sec-4">
      <title>Memory Networks</title>
      <p>In this section, we describe the proposed Long-Term Memory Network, shown in Figure 2. It includes four modules: the input module, the question module, the memory module, and the answer module.</p>
      <p>Figure 2: Architecture of the proposed LTMN model. The input module and question module encode the input sentences (e.g., "Raskin left the team in 1981 over a personality conflict with Jobs.") and the question (e.g., "Why did Raskin leave the Apple team in 1981?") into vector representations; the memory module computes a matching probability vector and produces an output vector (the output of MemN2N); and the answer module uses LSTM units over word embeddings to generate the multi-word answer (e.g., "over a personality conflict with Jobs"), terminated by an &lt;EOS&gt; token.</p>
      <sec id="sec-4-1">
        <title>Input Module and Question Module</title>
        <p>Let $\{x_i\}_{i=1}^{n}$ represent the set of input sentences. Each sentence $x_i \in \mathbb{R}^{|V|}$ contains words belonging to a dictionary $V$, and ends with an end-of-sentence token &lt;EOS&gt;. The goal of the input module is to encode the sentences into vector representations. The question module, like the input module, aims to encode each question $q \in \mathbb{R}^{|V|}$ into a vector representation. Specifically, we use a matrix $A \in \mathbb{R}^{d \times |V|}$ to embed sentences and a matrix $B \in \mathbb{R}^{d \times |V|}$ for questions.</p>
        <p>Several methods have been proposed to encode the input sentences or questions. In [SWF+15], an embedding matrix is employed to embed the sentences in a continuous space and obtain the vector representations. [KIO+16, Elm91] use a recurrent neural network to encode the input sentences into vector representations. Our objective is to learn the co-occurrence and sequence relationships between words in the text in order to generate a coherent sequence of words as answers. Thus, we employ a distributed representation learning technique, such as the paragraph vectors (paragraph2vec) model [LM14], to pre-train $A$ and $B$ (with $A = B$) for the real-world SQuAD dataset; this takes into account the order of and semantics among words when encoding the input sentences and questions. (We use paragraph2vec in our implementation; other representation learning mechanisms may be employed in the proposed LTMN model.) For the synthetic datasets, which are based on a small vocabulary, we instead learn the embedding matrices $A$ and $B$ directly using back-propagation, as described in Section 4.</p>
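        <p>To make this pre-training step concrete, the following is a minimal sketch using gensim's Doc2Vec implementation of paragraph vectors. The toy corpus and training settings shown here are illustrative assumptions, not the exact configuration used in our experiments.</p>
        <preformat># Sketch: pre-training sentence/question embeddings with paragraph vectors
# (gensim's Doc2Vec). Corpus and hyperparameters are illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    "raskin left the team in 1981 over a personality conflict with jobs",
    "why did raskin leave the apple team in 1981",
]
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]

# 100-dimensional vectors, matching the dimensionality reported in Section 4.
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

# Embed a new question into the same continuous space as the sentences.
u = model.infer_vector("why did raskin leave the apple team in 1981".split())
print(u.shape)  # (100,)
</preformat>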
      </sec>
      <sec id="sec-4-2">
        <title>Memory Module</title>
        <p>The input sentences $\{x_i\}_{i=1}^{n}$ are embedded using the matrix $A$ as $m_i = A x_i$, $i = 1, 2, \ldots, n$, $m_i \in \mathbb{R}^{d}$, and stored in memory. Note that we use all the sentences appearing before the question as input, which implies that the proposed model is weakly supervised. The question $q$ is also embedded, using the matrix $B$, as $u = B q$, $u \in \mathbb{R}^{d}$. The memory module then calculates the matching probabilities between the sentences and the question by computing the inner product followed by a softmax function:
$$p_i = \mathrm{softmax}(u^{\top} m_i), \qquad (1)$$
where $\mathrm{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. The probability $p_i$ is expected to be high for all the sentences $x_i$ that are related to the question $q$.</p>
        <p>The output of the memory module is a vector $o \in \mathbb{R}^{d}$, which is the sum over the input sentence representations, weighted by the matching probability vector:
$$o = \sum_i p_i m_i. \qquad (2)$$</p>
        <p>This approach, known as the soft attention mechanism, has been used by [SWF+15, BCB15]. The benefit of this approach is that it is easy to compute gradients and back-propagate through this function.</p>
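        <p>The following is a minimal NumPy sketch of the computation in Equations (1) and (2); the bag-of-words sentence encoding, random weights, and dimensions are simplifying assumptions for illustration only.</p>
        <preformat>import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

d, V, n = 100, 5000, 3          # embedding size, vocabulary size, number of sentences
rng = np.random.default_rng(0)

A = rng.normal(0.0, 0.1, (d, V))              # sentence embedding matrix
B = rng.normal(0.0, 0.1, (d, V))              # question embedding matrix
X = rng.integers(0, 2, (n, V)).astype(float)  # bag-of-words sentence vectors x_i
q = rng.integers(0, 2, V).astype(float)       # bag-of-words question vector

M = X @ A.T         # memory slots m_i = A x_i, shape (n, d)
u = B @ q           # question embedding u = B q, shape (d,)

p = softmax(M @ u)  # Eq. (1): matching probabilities p_i = softmax(u^T m_i)
o = p @ M           # Eq. (2): memory output o = sum_i p_i m_i, shape (d,)
</preformat>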
      </sec>
      <sec id="sec-4-2">
        <title>Answer Module</title>
        <p>Based on the output vector $o$ from the memory module and the word representations from the input module, the answer module generates answers to the questions. As our objective is to generate multi-word answers, we employ a Long Short-Term Memory (LSTM) network [HS97] to generate the answers.</p>
        <p>The core of the LSTM network is a memory unit whose behavior is controlled by a set of three gates: the input, output, and forget gates. The memory unit accumulates knowledge from the input data at each time step, based on the values of the gates, and stores this knowledge in its internal state. The initial input to the LSTM is the embedding of the begin-of-answer (&lt;BOA&gt;) token, together with the initial state $s_0$. We use the output of the memory module $o$, the question representation $u$, a weight matrix $W^{(o)}$, and a bias $b_o$ to generate the embedding $a_0$ of &lt;BOA&gt; as follows:
$$a_0 = \mathrm{softmax}(W^{(o)}(o + u) + b_o). \qquad (3)$$
Using $a_0$ and the initial state $s_0$, the LSTM generates the first word $w_1$ and its corresponding predicted output $y_1$ and state $s_1$. At each time step $t$, the LSTM takes the embedding of the word $w_{t-1}$ and the last hidden state $s_{t-1}$ as input to generate the new word $w_t$:
$$v_t = [w_{t-1}] \qquad (4)$$
$$i_t = \sigma(W_{iv} v_t + W_{im} y_{t-1} + b_i) \qquad (5)$$
$$f_t = \sigma(W_{fv} v_t + W_{fm} y_{t-1} + b_f) \qquad (6)$$
$$o_t = \sigma(W_{ov} v_t + W_{om} y_{t-1} + b_o) \qquad (7)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_{sv} v_t + W_{sm} y_{t-1}) \qquad (8)$$
$$y_t = o_t \odot s_t \qquad (9)$$
$$w_t = \arg\max\left[\mathrm{softmax}(W^{(t)} y_t + b_t)\right] \qquad (10)$$
where $[w_t]$ is the embedding of the word $w_t$ learnt from the input module, $\sigma$ and $\odot$ denote the sigmoid function and the Hadamard product respectively, $W^{(t)}$ is a weight matrix, and $b_t$ is a bias vector.</p>
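        <p>The following is a minimal NumPy sketch of the answer-generation recurrence in Equations (3)-(10). The random weight initialization, toy dimensions, and the word-embedding lookup table used in place of the input module are illustrative assumptions only.</p>
        <preformat>import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, V = 100, 50                       # hidden size and toy vocabulary size
rng = np.random.default_rng(0)
W = lambda *shape: rng.normal(0.0, 0.1, shape)

E = W(V, d)                          # toy word-embedding table, [w] = E[w]
W_o, b_o = W(d, d), np.zeros(d)      # weights for Eq. (3)
W_iv, W_im, b_i = W(d, d), W(d, d), np.zeros(d)
W_fv, W_fm, b_f = W(d, d), W(d, d), np.zeros(d)
W_ov, W_om, b_og = W(d, d), W(d, d), np.zeros(d)   # output-gate bias renamed b_og
W_sv, W_sm = W(d, d), W(d, d)
W_t, b_t = W(V, d), np.zeros(V)      # output projection for Eq. (10)

o = rng.normal(size=d)               # memory module output, Eq. (2)
u = rng.normal(size=d)               # question representation
a0 = softmax(W_o @ (o + u) + b_o)    # Eq. (3): embedding of the begin-of-answer token

v, y, s = a0, np.zeros(d), np.zeros(d)
answer = []
for _ in range(5):                                # generate up to five words
    i = sigmoid(W_iv @ v + W_im @ y + b_i)        # Eq. (5): input gate
    f = sigmoid(W_fv @ v + W_fm @ y + b_f)        # Eq. (6): forget gate
    og = sigmoid(W_ov @ v + W_om @ y + b_og)      # Eq. (7): output gate
    s = f * s + i * np.tanh(W_sv @ v + W_sm @ y)  # Eq. (8): internal state
    y = og * s                                    # Eq. (9): predicted output
    w = int(np.argmax(softmax(W_t @ y + b_t)))    # Eq. (10): next word index
    answer.append(w)
    v = E[w]                                      # Eq. (4): v_t = [w_{t-1}]
print(answer)
</preformat>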
        <p>The model is trained end-to-end with the loss defined by the cross-entropy between the true answer and the predicted output $w_t$, represented using one-hot encoding. The predicted answer is generated by concatenating all the words generated by the model.</p>
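        <p>As an illustration, the per-word cross-entropy between a one-hot true word and the predicted distribution can be computed as in the short sketch below; the probability values are toy numbers.</p>
        <preformat>import numpy as np

probs = np.array([0.05, 0.80, 0.10, 0.05])   # predicted softmax(W^(t) y_t + b_t)
target = np.array([0.0, 1.0, 0.0, 0.0])      # one-hot encoding of the true word

loss = -np.sum(target * np.log(probs + 1e-12))  # cross-entropy for this time step
print(loss)  # ~0.22; the answer loss accumulates this over all generated words
</preformat>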
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this section, we compare the performance of the proposed LTMN model with the current state-of-the-art models for question answering.</p>
      <sec id="sec-5-1">
        <title>Datasets</title>
        <p>We use three datasets: the real-world Stanford
question answering dataset (SQuAD) [RZLL16], the
synthetic single-word answer bAbI dataset [WBC+16],
and the synthetic multi-word answer bAbI dataset,
generated by performing vocabulary replacements in
the single-word answer bAbI dataset.</p>
        <p>Stanford Question Answering Dataset (SQuAD) [RZLL16] contains 100,000+ questions labeled by crowd workers on a set of Wikipedia articles. The answer for each question is a segment of text from the corresponding paragraph. In order to convert the format of the data to the input format of our model (shown in Figure 1), we use NLTK to detect the boundaries of sentences and assign an index to each sentence and question, in accordance with the starting index of the answer provided by the crowd workers. The dataset is thus transformed to a question answering dataset containing 18,893 stories and 69,523 questions (the transformed dataset can be downloaded from http://www.acsu.buffalo.edu/~fenglong/). For our experiments, we randomly selected 1,248 questions for training and 1,248 questions for testing. Each answer contains at most five words.</p>
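        <p>A minimal sketch of this conversion step is shown below, using NLTK's sentence-boundary detection. The paragraph text, the answer offset, and the alignment rule are toy values; the exact alignment logic in our pipeline is more involved.</p>
        <preformat># Sketch: indexing the sentences of a SQuAD paragraph and locating the sentence
# containing the labeled answer offset. Toy example using NLTK.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

paragraph = ("Raskin left the team in 1981 over a personality conflict with Jobs. "
             "Team member Andy Hertzfeld said that the final Macintosh design is "
             "closer to Jobs' ideas than Raskin's.")
answer_start = 29   # toy character offset of "over a personality conflict ..."

offset = 0
for idx, sent in enumerate(sent_tokenize(paragraph), start=1):
    if answer_start in range(offset, offset + len(sent)):
        print(idx, sent)        # index of the sentence aligned with the answer
    offset += len(sent) + 1     # +1 for the space separating sentences
</preformat>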
        <p>The single-word answer bAbI dataset [WBC+16] is a synthetic dataset created to benchmark question answering models. It contains 20 types of question answering tasks, and each task comprises a set of statements followed by a question with a single-word answer. For each question, only some of the statements contain the relevant information. The training and test data contain 1,000 examples for each task.</p>
        <p>The multi-word answer bAbI dataset. As the goal of the proposed model is to generate multi-word answers, we manually generated a new dataset from the Facebook bAbI dataset by replacing a few words, such as "bedroom" and "bathroom", with "guest room" and "shower room", respectively. The replacements are listed in Table 1.</p>
        <p>We use 10% of the training data for model validation to choose the best parameters. The best performance was obtained when the learning rate was set to 0.002, the batch size was set to 32, and the weights were initialized randomly from a Gaussian distribution with zero mean and 0.1 variance. The model was trained for 200 epochs. The paragraph2vec model was set to generate 100-dimensional representations for the input sentences and the questions.</p>
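        <p>For reference, the training configuration described above can be summarized as follows (a sketch; the key names are illustrative, not taken from our code):</p>
        <preformat># Training configuration reported above (key names are illustrative).
config = {
    "learning_rate": 0.002,
    "batch_size": 32,
    "epochs": 200,
    "embedding_dim": 100,          # paragraph2vec vector size
    "weight_init": {"distribution": "gaussian", "mean": 0.0, "variance": 0.1},
    "validation_split": 0.10,      # 10% of the training data for validation
}
</preformat>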
        <p>We first compare the performance of the proposed LTMN model with a simple Long Short-Term Memory (LSTM) model, as implemented in [SVL14] to predict sequences. The LSTM model works by reading the story until it comes across a question, and then outputs an answer using the information obtained from the sentences read so far. Unlike the LTMN model, it does not have an external memory component.</p>
        <p>On the single-word answer bAbI dataset, we also compare our results with those of the attention-based LSTM model (LSTM + Attention) [HKG+15], which propagates dependencies between input sentences using an attention mechanism, MemNN [WCB15], DMN [KIO+16], and MemN2N [SWF+15]. These models cannot be applied as-is to the SQuAD and multi-word answer bAbI datasets because they are only capable of generating single-word answers.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation Measures</title>
        <p>In order to evaluate the performance of all the
methods, the following measurements are used:</p>
        <p>Exact Match Accuracy (EMA) represents the
ratio of predicted answers which exactly match the
true answers.</p>
        <p>Partial Match Accuracy (PMA) is the ratio of
generated answers that partially match the correct
answers.</p>
        <p>BLEU score [CC14], widely used to evaluate machine translation models, measures the quality of the generated answers.</p>
        <p>The performance of the LTMN model on the SQuAD, single-word answer bAbI, and multi-word answer bAbI datasets is shown in Tables 2, 3, and 4, respectively.</p>
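        <p>A minimal sketch of these three measures is given below, assuming NLTK's smoothed sentence-level BLEU [CC14]; the partial-match criterion shown (sharing at least one word with the true answer) is a simplifying assumption for illustration.</p>
        <preformat>from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(pred, truth):
    # EMA: 1 if the predicted answer matches the true answer exactly.
    return float(pred == truth)

def partial_match(pred, truth):
    # PMA (illustrative criterion): 1 if the answers share at least one word.
    return float(bool(set(pred).intersection(truth)))

def bleu(pred, truth):
    # Smoothed sentence-level BLEU, following [CC14].
    return sentence_bleu([truth], pred,
                         smoothing_function=SmoothingFunction().method1)

pred = "over a personality conflict".split()
truth = "over a personality conflict with jobs".split()
print(exact_match(pred, truth), partial_match(pred, truth), bleu(pred, truth))
</preformat>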
        <p>We observe that LTMN performs better than LSTM in terms of all three evaluation measures, on all the datasets. On the SQuAD dataset, as the vocabulary is large (8,969 words), the LSTM model cannot learn the embedding matrices accurately, leading to its poor performance. However, as the LTMN model employs paragraph2vec, it learns richer vector representations of the sentences and questions. In addition, it can memorize and reason over the facts better than the simple LSTM model. On the multi-word answer bAbI dataset, the LTMN model is significantly better than the LSTM model, especially on tasks 1, 4, 12, 15, 19, and 20. The average EMA, BLEU, and PMA scores of LTMN are about 30% higher than those of the LSTM model. The single-word answer bAbI dataset's vocabulary is small (about 20 words), so we learn the embedding matrices A and B using back-propagation, instead of using paragraph2vec to obtain the vector representations. In Table 3, we observe that the LTMN model achieves accuracy close to the strongly supervised MemNN and DMN models on 4 out of the 20 bAbI tasks, despite being weakly supervised, and achieves better accuracy than the weakly supervised LSTM+Attention and MemN2N on 7 tasks. The proposed LTMN model also offers the additional capability of generating multi-word answers, unlike these baseline models.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Question answering is an important and challenging
task in natural language processing. Traditional
question answering approaches are simple query-based
approaches, which cannot memorize and reason over the
input text. Deep neural networks with memory have
been employed to alleviate this challenge in the
literature.</p>
      <p>In this paper, we proposed the Long-Term Memory Network, a novel recurrent neural network, which can encode raw text information (the input sentences and questions) into vector representations, form memories, find relevant information in the input sentences to answer the questions, and finally generate multi-word answers using a Long Short-Term Memory network. The proposed architecture is a weakly supervised model and can be trained end-to-end. Experiments on both synthetic and real-world datasets demonstrate the remarkable performance of the proposed architecture.</p>
      <p>In our experiments on the bAbI question answering tasks, we found that the proposed model fails to perform as well as the strongly supervised memory networks on certain tasks. In addition, the model performs poorly when the input sentences are very long and the vocabulary is large, as it cannot identify the supporting facts efficiently. In the future, we plan to expand the model to handle long input sentences, and improve the performance of the proposed network.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[BCB15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.</p>
      <p>[BCFL13] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 2013.</p>
      <p>[BEP+08] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.</p>
      <p>[BGWB12] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, 2012.</p>
      <p>[BL14] Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In ACL, 2014.</p>
      <p>[BLK+09] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Soren Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - a crystallization point for the web of data. Web Semantics, 2009.</p>
      <p>[CC14] Boxing Chen and Colin Cherry. A systematic comparison of smoothing techniques for sentence-level BLEU. In SMT, 2014.</p>
      <p>[CVMBB14] Kyunghyun Cho, Bart Van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.</p>
      <p>[Elm91] Jeffrey L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 1991.</p>
      <p>[Gra13] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.</p>
      <p>[GWD14] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.</p>
      <p>[HKG+15] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.</p>
      <p>[HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 1997.</p>
      <p>[IBGC+14] Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daume III. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.</p>
      <p>[JM15] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.</p>
      <p>[KIO+16] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.</p>
      <p>[LM14] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.</p>
      <p>[MCZ+17] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In KDD, 2017.</p>
      <p>[MD93] Michael C. Mozer and Sreerupa Das. A connectionist symbol manipulator that discovers the structure of context-free languages. In NIPS, 1993.</p>
      <p>[Pas03] Marius Pasca. Open-domain question answering from large text collections. Computational Linguistics, 2003.</p>
      <p>[RZLL16] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.</p>
      <p>[SMC+17] Qiuling Suo, Fenglong Ma, Giovanni Canino, Jing Gao, Aidong Zhang, Pierangelo Veltri, and Agostino Gnasso. A multi-task framework for monitoring health conditions via attention-based recurrent neural networks. In AMIA, 2017.</p>
      <p>[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.</p>
      <p>[SWF+15] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In NIPS, 2015.</p>
      <p>[WBC+16] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.</p>
      <p>[WCB15] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.</p>
      <p>[WGL+16] Bingning Wang, Shangmin Guo, Kang Liu, Shizhu He, and Jun Zhao. Employing external rich knowledge for machine comprehension. In IJCAI, 2016.</p>
      <p>[YJW+16] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016.</p>
      <p>[ZC05] Luke S. Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, 2005.</p>
      <p>[ZHLZ16] Yuanzhe Zhang, Shizhu He, Kang Liu, and Jun Zhao. A joint model for question answering over multiple knowledge bases. In AAAI, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>