<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Long-Term Memory Networks for Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fenglong Ma</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radha Chitta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saurabh Kataria</string-name>
          <email>saurabh.cse05@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Palghat Ramesh</string-name>
          <email>palghat.ramesh@parc.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tong Sun</string-name>
          <email>sunt@utrc.utc.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Gao</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Conduent Labs US</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>United Technologies Research Center</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>25</lpage>
      <abstract>
<p>Question answering is an important and difficult task in the natural language processing domain, because many basic natural language processing tasks can be cast into a question answering task. Several deep neural network architectures have been developed recently, which employ memory and inference components to memorize and reason over text information, and generate answers to questions. However, a major drawback of many such models is that they are capable of only generating single-word answers. In addition, they require a large amount of training data to generate accurate answers. In this paper, we introduce the Long-Term Memory Network (LTMN), which incorporates both an external memory module and a Long Short-Term Memory (LSTM) module to comprehend the input data and generate multi-word answers. The LTMN model can be trained end-to-end using back-propagation and requires minimal supervision. We test our model on two synthetic data sets (based on Facebook's bAbI data set) and the real-world Stanford question answering data set, and show that it can achieve state-of-the-art performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Question answering, the task of automatically answering questions posed about given unstructured text, is one of the core tasks in natural language understanding and processing. Many problems in natural language processing, such as reading comprehension, machine translation, entity recognition, sentiment analysis, and dialogue generation, can be cast as question answering problems.</p>
      <p>Traditional question answering approaches can
be categorized as: (i) IR-based question answering
[Pas03] where the question is formulated as a search
query, and a short text segment is found on the Web
or similar corpus for the answer; (ii) Knowledge-based
question answering [GJWCL61, BCFL13], which aims
to answer a natural language question by mapping it
to a semantic query over a database.</p>
      <p>The traditional approaches are simple query-based techniques. Using these traditional question-answering systems, it is difficult to establish the relationships between the sentences in the input text and to derive a meaningful representation of the information within the text.</p>
      <p>Figure 1 shows an example of a question answering task. The sentences in black are facts that may be relevant to the questions, questions are in blue, and the correct answers are in red. In order to correctly answer the question "What did Steve Jobs offer Xerox to visit and see their latest technology?", the model should have the ability to recognize that the sentence "After hearing of the pioneering GUI technology being developed at Xerox PARC, Jobs had negotiated a visit to see the Xerox Alto computer and its Smalltalk development tools in exchange for Apple stock options." is a supporting fact, and to extract the relevant portion of the supporting fact to form the answer. In addition, the model should have the ability to memorize all the facts that have been presented to it until the current time, and deduce the answer.</p>
      <p>Figure 1: An example of the question answering task. 1: Burrell's innovative design, which combined the low production cost of an Apple II with the computing power of Lisa's CPU, the Motorola 68K, received the attention of Steve Jobs, co-founder of Apple. 2: Realizing that the Macintosh was more marketable than the Lisa, he began to focus his attention on the project. 3: Raskin left the team in 1981 over a personality conflict with Jobs. 4: Why did Raskin leave the Apple team in 1981? over a personality conflict with Jobs. 5: Team member Andy Hertzfeld said that the final Macintosh design is closer to Jobs' ideas than Raskin's. 6: According to Andy Hertzfeld, whose idea is the final Mac design closer to? Jobs. 7: After hearing of the pioneering GUI technology being developed at Xerox PARC, Jobs had negotiated a visit to see the Xerox Alto computer and its Smalltalk development tools in exchange for Apple stock options. 8: What did Steve Jobs offer Xerox to visit and see their latest technology? Apple stock options.</p>
      <p>The authors of [WCB15] proposed a new class of learning models named Memory Networks (MemNN), which use a long-term memory component to store information and an inference component for reasoning. [KIO+16] proposed the Dynamic Memory Network (DMN) for general question answering tasks, which processes input sentences and questions, forms episodic memories, and generates answers. These two approaches are strongly supervised, i.e., only the supporting facts (factoids) are fed to the model as inputs when training the model for each type of question. For example, when training the model with the question in the fourth line of Figure 1, strongly supervised methods use only the sentence in line 3 as input. Thus, these methods require a large amount of training data.</p>
      <p>To tackle this issue, [SWF+15] introduced a weakly
supervised approach called End-to-End Memory
Network (MemN2N), which uses all the sentences that
have appeared before this question. For the above
example, the inputs are the sentences from line 1 to line
3 when training for the question in the fourth line.
MemN2N is trained end-to-end and uses an attention
mechanism to calculate the matching probabilities
between the input sentences and questions. The
sentences which match the question with high probability
are used as the factoids for answering the question.</p>
      <p>However, this model is capable of generating only single-word answers. For example, the answer to the question "According to Andy Hertzfeld, whose idea is the final Mac design closer to?" in Figure 1 is the single word "Jobs". Since the answers to many questions contain multiple words (for instance, the question labeled 4 in Figure 1), this model cannot be directly applied to general question answering tasks.</p>
      <p>Recurrent neural networks comprising Long Short-Term Memory units have been employed to generate multi-word text in the literature [Gra13, SVL14]. However, simple LSTM-based recurrent neural networks do not perform well on the question answering task due to the lack of an external memory component which can memorize and contextualize the facts. We present a more sophisticated recurrent neural network architecture, named the Long-Term Memory Network (LTMN), which combines the best aspects of end-to-end memory networks and LSTM-based recurrent neural networks to address the challenges faced by the currently available neural network architectures for question answering. Specifically, it first embeds the input sentences (initially encoded using a distributed representation learning mechanism such as paragraph vectors [LM14]) in a continuous space, and stores them in memory. It then matches the sentences with the questions, also embedded into the same space, by performing multiple passes through the memory, to obtain the factoids which are relevant to each question. These factoids are then employed to generate the first word of the answer, which is then input to an LSTM unit. The LSTM unit is used to generate the subsequent words in the answer. The proposed LTMN model can be trained end-to-end, requires minimal supervision during training (i.e., it is weakly supervised), and generates multi-word answers. Experimental results on two synthetic datasets and one real-world dataset show that the proposed model outperforms the state-of-the-art approaches.</p>
      <p>In summary, the contributions of this paper are as
follows:</p>
      <p>We propose an effective neural network architecture for general question answering, i.e., for generating multi-word answers to questions. Our architecture combines the best aspects of MemN2N and LSTM and can be trained end-to-end.</p>
      <p>The proposed architecture employs distributed
representation learning techniques (e.g.
paragraph2vec) to learn vector representations for
sentences or factoids, questions and words, as well as
their relationships. The learned embeddings
contribute to the accuracy of the answers generated
by the proposed architecture.</p>
      <p>We generate a new synthetic dataset with multiple
word answers based on Facebook's bAbI dataset
[WBC+16]. We call this the multi-word answer
bAbI dataset.</p>
      <p>We test the proposed architecture on two
synthetic datasets (the single-word answer bAbI
dataset and the multi-word answer bAbI dataset),
and the real-world Stanford question answering
dataset [RZLL16]. The results clearly
demonstrate the advantages of the proposed architecture
for question answering.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In this section, we review literature closely related to
question answering, particularly focusing on models
using memory networks to generate answers.
</p>
      <sec id="sec-2-1">
        <title>Question Answering</title>
        <p>Traditional question answering approaches mainly fall into two categories: IR-based [Pas03] and knowledge-based question answering [GJWCL61, BCFL13]. IR-based question answering systems use information retrieval techniques to extract information (i.e., answers) from documents. These methods first process questions, i.e., detect named entities in questions, and then predict answer types, such as city names or person names. After recognizing answer types, these approaches generate queries, and extract answers from the web using the generated queries. These approaches are simple, but they ignore the semantic relationships between questions and answers.</p>
        <p>Knowledge-based question answering systems [ZC05, BL14, ZHLZ16] consider the semantics and use existing knowledge bases, such as Freebase [BEP+08] and DBpedia [BLK+09]. They cast the question answering task as that of finding one of the missing arguments in a triple. Most knowledge-based question answering approaches use neural networks, dependency trees and knowledge bases [BGWB12], or sentences [IBGC+14].</p>
        <p>Using traditional question answering approaches, it is difficult to establish the relationships between sentences in the input text, and thereby identify the relevance of the different sentences to the question. Of late, several neural network architectures with memories have been proposed to solve this challenging problem.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Memory Networks</title>
        <p>Several deep neural network models use memory
architectures [SWF+15, KIO+16, WCB15, GWD14, JM15,
MD93] and attention mechanisms for image captioning
[YJW+16], machine comprehension [WGL+16] and
healthcare data mining [MCZ+17, SMC+17]. We
focus on the models using memory networks for natural
language question answering.</p>
        <p>Memory networks (MemNN), proposed in
[WCB15], first introduced the concept of an
external memory component for natural language
question answering. They are strongly supervised,
i.e., they are trained with only the supporting facts
for each question. The supporting input sentences are
embedded in memory, and the response is generated
from these facts by scoring all the words in the
vocabulary in correlation with the facts. This scoring
function is learnt during the training process and
employed during the testing phase. MemNN are
capable of producing only single-word answers, due
to this response generation mechanism. In addition,
MemNN cannot be trained end-to-end.</p>
        <p>The authors of [KIO+16] improve over MemNN
by introducing an end-to-end trainable network called
Dynamic Memory Networks (DMN). DMN have four
modules: input module, question module, episodic
memory module and answer module. The input
module encodes raw text inputs into distributed vector
representations using a gated recurrent unit (GRU) network
[CVMBB14]. The question module similarly encodes
the question using a recurrent neural network. The
sentences and question representations are fed to the
episodic memory module, which chooses the sentences
to focus on using the attention mechanism. It
iteratively produces a memory vector, representing all the
relevant information, which is then used by the answer
module to generate the answer using a GRU.
However, DMN are also strongly supervised like MemNN,
thereby requiring a large amount of training data.</p>
        <p>End-to-End Memory Networks (MemN2N) [SWF+15] first encode sentences into continuous vector representations, then use a soft attention mechanism to calculate matching probabilities between sentences and questions and find the most relevant facts, and finally generate responses using the vocabulary from these facts. Unlike the MemNN and DMN architectures, MemN2N can be trained end-to-end and are weakly supervised. However, the drawback of MemN2N is that it generates only single-word answers. The proposed LTMN architecture improves over the existing network architectures because (i) it can be trained end-to-end, (ii) it is weakly supervised, and (iii) it can generate answers with multiple words.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Long-Term</title>
    </sec>
    <sec id="sec-4">
      <title>Memory Networks</title>
      <p>In this section, we describe the proposed Long-Term Memory Network, shown in Figure 2. It includes four modules: the input module, the question module, the memory module, and the answer module.</p>
      <p>Figure 2: Architecture of the proposed LTMN model. The input module and question module encode the input sentences (e.g., "Raskin left the team in 1981 over a personality conflict with Jobs.") and the question (e.g., "Why did Raskin leave the Apple team in 1981?") into vector representations; the memory module computes a matching probability vector and produces an output vector (the output of MemN2N); and the answer module uses LSTM units over word embeddings to generate the multi-word answer (e.g., "over a personality conflict with Jobs"), terminated by an &lt;EOS&gt; token.</p>
      <sec id="sec-4-1">
        <title>Input Module and Question Module</title>
        <p>Let $\{x_i\}_{i=1}^{n}$ represent the set of input sentences. Each sentence $x_i \in \mathbb{R}^{|V|}$ contains words belonging to a dictionary $V$, and ends with an end-of-sentence token &lt;EOS&gt;. The goal of the input module is to encode the sentences into vector representations. The question module, like the input module, aims to encode each question $q \in \mathbb{R}^{|V|}$ into a vector representation. Specifically, we use a matrix $A \in \mathbb{R}^{d \times |V|}$ to embed sentences and a matrix $B \in \mathbb{R}^{d \times |V|}$ for questions.</p>
        <p>Several methods have been proposed to encode the input sentences or questions. In [SWF+15], an embedding matrix is employed to embed the sentences in a continuous space and obtain the vector representations. [KIO+16, Elm91] use a recurrent neural network to encode the input sentences into vector representations. Our objective is to learn the co-occurrence and sequence relationships between words in the text in order to generate a coherent sequence of words as answers. Thus, we employ a distributed representation learning technique, such as the paragraph vectors (paragraph2vec) model [LM14], to pre-train $A$ and $B$ (with $A = B$) for the real-world SQuAD dataset; this takes into account the order of and semantics among words when encoding the input sentences and questions. (We use paragraph2vec in our implementation; other representation learning mechanisms may be employed in the proposed LTMN model.) For the synthetic datasets, which are based on a small vocabulary, we instead learn the embedding matrices $A$ and $B$ directly using back-propagation, as described in Section 4.</p>
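        <p>To make this pre-training step concrete, the following is a minimal sketch using gensim's Doc2Vec implementation of paragraph vectors. The toy corpus and training settings shown here are illustrative assumptions, not the exact configuration used in our experiments.</p>
        <preformat># Sketch: pre-training sentence/question embeddings with paragraph vectors
# (gensim's Doc2Vec). Corpus and hyperparameters are illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    "raskin left the team in 1981 over a personality conflict with jobs",
    "why did raskin leave the apple team in 1981",
]
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]

# 100-dimensional vectors, matching the dimensionality reported in Section 4.
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=50)

# Embed a new question into the same continuous space as the sentences.
u = model.infer_vector("why did raskin leave the apple team in 1981".split())
print(u.shape)  # (100,)
</preformat>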
      </sec>
      <sec id="sec-4-2">
        <title>Memory Module</title>
        <p>The input sentences $\{x_i\}_{i=1}^{n}$ are embedded using the matrix $A$ as $m_i = A x_i$, $i = 1, 2, \ldots, n$, $m_i \in \mathbb{R}^{d}$, and stored in memory. Note that we use all the sentences appearing before the question as input, which implies that the proposed model is weakly supervised. The question $q$ is also embedded, using the matrix $B$, as $u = B q$, $u \in \mathbb{R}^{d}$. The memory module then calculates the matching probabilities between the sentences and the question by computing the inner product followed by a softmax function:
$$p_i = \mathrm{softmax}(u^{\top} m_i), \qquad (1)$$
where $\mathrm{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$. The probability $p_i$ is expected to be high for all the sentences $x_i$ that are related to the question $q$.</p>
        <p>The output of the memory module is a vector $o \in \mathbb{R}^{d}$, which is the sum over the input sentence representations, weighted by the matching probability vector:
$$o = \sum_i p_i m_i. \qquad (2)$$</p>
        <p>This approach, known as the soft attention mechanism, has been used by [SWF+15, BCB15]. The benefit of this approach is that it is easy to compute gradients and back-propagate through this function.</p>
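        <p>The following is a minimal NumPy sketch of the computation in Equations (1) and (2); the bag-of-words sentence encoding, random weights, and dimensions are simplifying assumptions for illustration only.</p>
        <preformat>import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

d, V, n = 100, 5000, 3          # embedding size, vocabulary size, number of sentences
rng = np.random.default_rng(0)

A = rng.normal(0.0, 0.1, (d, V))              # sentence embedding matrix
B = rng.normal(0.0, 0.1, (d, V))              # question embedding matrix
X = rng.integers(0, 2, (n, V)).astype(float)  # bag-of-words sentence vectors x_i
q = rng.integers(0, 2, V).astype(float)       # bag-of-words question vector

M = X @ A.T         # memory slots m_i = A x_i, shape (n, d)
u = B @ q           # question embedding u = B q, shape (d,)

p = softmax(M @ u)  # Eq. (1): matching probabilities p_i = softmax(u^T m_i)
o = p @ M           # Eq. (2): memory output o = sum_i p_i m_i, shape (d,)
</preformat>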
      </sec>
      <sec id="sec-4-2">
        <title>Answer Module</title>
        <p>Based on the output vector $o$ from the memory module and the word representations from the input module, the answer module generates answers to the questions. As our objective is to generate multi-word answers, we employ a Long Short-Term Memory (LSTM) network [HS97] to generate the answers.</p>
        <p>The core of the LSTM network is a memory unit whose behavior is controlled by a set of three gates: the input, output, and forget gates. The memory unit accumulates knowledge from the input data at each time step, based on the values of the gates, and stores this knowledge in its internal state. The initial input to the LSTM is the embedding of the begin-of-answer (&lt;BOA&gt;) token, together with the initial state $s_0$. We use the output of the memory module $o$, the question representation $u$, a weight matrix $W^{(o)}$, and a bias $b_o$ to generate the embedding $a_0$ of &lt;BOA&gt; as follows:
$$a_0 = \mathrm{softmax}(W^{(o)}(o + u) + b_o). \qquad (3)$$
Using $a_0$ and the initial state $s_0$, the LSTM generates the first word $w_1$ and its corresponding predicted output $y_1$ and state $s_1$. At each time step $t$, the LSTM takes the embedding of the word $w_{t-1}$ and the last hidden state $s_{t-1}$ as input to generate the new word $w_t$:
$$v_t = [w_{t-1}] \qquad (4)$$
$$i_t = \sigma(W_{iv} v_t + W_{im} y_{t-1} + b_i) \qquad (5)$$
$$f_t = \sigma(W_{fv} v_t + W_{fm} y_{t-1} + b_f) \qquad (6)$$
$$o_t = \sigma(W_{ov} v_t + W_{om} y_{t-1} + b_o) \qquad (7)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_{sv} v_t + W_{sm} y_{t-1}) \qquad (8)$$
$$y_t = o_t \odot s_t \qquad (9)$$
$$w_t = \arg\max\left[\mathrm{softmax}(W^{(t)} y_t + b_t)\right] \qquad (10)$$
where $[w_t]$ is the embedding of the word $w_t$ learnt from the input module, $\sigma$ and $\odot$ denote the sigmoid function and the Hadamard product respectively, $W^{(t)}$ is a weight matrix, and $b_t$ is a bias vector.</p>
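        <p>The following is a minimal NumPy sketch of the answer-generation recurrence in Equations (3)-(10). The random weight initialization, toy dimensions, and the word-embedding lookup table used in place of the input module are illustrative assumptions only.</p>
        <preformat>import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, V = 100, 50                       # hidden size and toy vocabulary size
rng = np.random.default_rng(0)
W = lambda *shape: rng.normal(0.0, 0.1, shape)

E = W(V, d)                          # toy word-embedding table, [w] = E[w]
W_o, b_o = W(d, d), np.zeros(d)      # weights for Eq. (3)
W_iv, W_im, b_i = W(d, d), W(d, d), np.zeros(d)
W_fv, W_fm, b_f = W(d, d), W(d, d), np.zeros(d)
W_ov, W_om, b_og = W(d, d), W(d, d), np.zeros(d)   # output-gate bias renamed b_og
W_sv, W_sm = W(d, d), W(d, d)
W_t, b_t = W(V, d), np.zeros(V)      # output projection for Eq. (10)

o = rng.normal(size=d)               # memory module output, Eq. (2)
u = rng.normal(size=d)               # question representation
a0 = softmax(W_o @ (o + u) + b_o)    # Eq. (3): embedding of the begin-of-answer token

v, y, s = a0, np.zeros(d), np.zeros(d)
answer = []
for _ in range(5):                                # generate up to five words
    i = sigmoid(W_iv @ v + W_im @ y + b_i)        # Eq. (5): input gate
    f = sigmoid(W_fv @ v + W_fm @ y + b_f)        # Eq. (6): forget gate
    og = sigmoid(W_ov @ v + W_om @ y + b_og)      # Eq. (7): output gate
    s = f * s + i * np.tanh(W_sv @ v + W_sm @ y)  # Eq. (8): internal state
    y = og * s                                    # Eq. (9): predicted output
    w = int(np.argmax(softmax(W_t @ y + b_t)))    # Eq. (10): next word index
    answer.append(w)
    v = E[w]                                      # Eq. (4): v_t = [w_{t-1}]
print(answer)
</preformat>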
        <p>The model is trained end-to-end with the loss defined by the cross-entropy between the true answer and the predicted output $w_t$, represented using one-hot encoding. The predicted answer is generated by concatenating all the words generated by the model.</p>
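        <p>As an illustration, the per-word cross-entropy between a one-hot true word and the predicted distribution can be computed as in the short sketch below; the probability values are toy numbers.</p>
        <preformat>import numpy as np

probs = np.array([0.05, 0.80, 0.10, 0.05])   # predicted softmax(W^(t) y_t + b_t)
target = np.array([0.0, 1.0, 0.0, 0.0])      # one-hot encoding of the true word

loss = -np.sum(target * np.log(probs + 1e-12))  # cross-entropy for this time step
print(loss)  # ~0.22; the answer loss accumulates this over all generated words
</preformat>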
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>In this section, we compare the performance of the proposed LTMN model with the current state-of-the-art models for question answering.</p>
      <sec id="sec-5-1">
        <title>Datasets</title>
        <p>We use three datasets: the real-world Stanford
question answering dataset (SQuAD) [RZLL16], the
synthetic single-word answer bAbI dataset [WBC+16],
and the synthetic multi-word answer bAbI dataset,
generated by performing vocabulary replacements in
the single-word answer bAbI dataset.</p>
        <p>Stanford Question Answering Dataset (SQuAD) [RZLL16] contains 100,000+ questions labeled by crowd workers on a set of Wikipedia articles. The answer for each question is a segment of text from the corresponding paragraph. In order to convert the format of the data to the input format of our model (shown in Figure 1), we use NLTK to detect the boundaries of sentences and assign an index to each sentence and question, in accordance with the starting index of the answer provided by the crowd workers. The dataset is thus transformed to a question answering dataset containing 18,893 stories and 69,523 questions (the transformed dataset can be downloaded from http://www.acsu.buffalo.edu/~fenglong/). For our experiments, we randomly selected 1,248 questions for training and 1,248 questions for testing. Each answer contains at most five words.</p>
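        <p>A minimal sketch of this conversion step is shown below, using NLTK's sentence-boundary detection. The paragraph text, the answer offset, and the alignment rule are toy values; the exact alignment logic in our pipeline is more involved.</p>
        <preformat># Sketch: indexing the sentences of a SQuAD paragraph and locating the sentence
# containing the labeled answer offset. Toy example using NLTK.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

paragraph = ("Raskin left the team in 1981 over a personality conflict with Jobs. "
             "Team member Andy Hertzfeld said that the final Macintosh design is "
             "closer to Jobs' ideas than Raskin's.")
answer_start = 29   # toy character offset of "over a personality conflict ..."

offset = 0
for idx, sent in enumerate(sent_tokenize(paragraph), start=1):
    if answer_start in range(offset, offset + len(sent)):
        print(idx, sent)        # index of the sentence aligned with the answer
    offset += len(sent) + 1     # +1 for the space separating sentences
</preformat>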
        <p>The single-word answer bAbI dataset [WBC+16] is a synthetic dataset created to benchmark question answering models. It contains 20 types of question answering tasks, and each task comprises a set of statements followed by a question with a single-word answer. For each question, only some of the statements contain the relevant information. The training and test data contain 1,000 examples for each task.</p>
        <p>The multi-word answer bAbI dataset. As the goal of the proposed model is to generate multi-word answers, we manually generated a new dataset from the Facebook bAbI dataset by replacing a few words, such as "bedroom" and "bathroom", with "guest room" and "shower room", respectively. The replacements are listed in Table 1.</p>
        <p>We use 10% of the training data for model validation to choose the best parameters. The best performance was obtained when the learning rate was set to 0.002, the batch size was set to 32, and the weights were initialized randomly from a Gaussian distribution with zero mean and 0.1 variance. The model was trained for 200 epochs. The paragraph2vec model was set to generate 100-dimensional representations for the input sentences and the questions.</p>
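        <p>For reference, the training configuration described above can be summarized as follows (a sketch; the key names are illustrative, not taken from our code):</p>
        <preformat># Training configuration reported above (key names are illustrative).
config = {
    "learning_rate": 0.002,
    "batch_size": 32,
    "epochs": 200,
    "embedding_dim": 100,          # paragraph2vec vector size
    "weight_init": {"distribution": "gaussian", "mean": 0.0, "variance": 0.1},
    "validation_split": 0.10,      # 10% of the training data for validation
}
</preformat>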
        <p>We first compare the performance of the proposed LTMN model with a simple Long Short-Term Memory (LSTM) model, as implemented in [SVL14] to predict sequences. The LSTM model works by reading the story until it comes across a question, and then outputs an answer using the information obtained from the sentences read so far. Unlike the LTMN model, it does not have an external memory component.</p>
        <p>On the single-word answer bAbI dataset, we also compare our results with those of the attention-based LSTM model (LSTM + Attention) [HKG+15], which propagates dependencies between input sentences using an attention mechanism, MemNN [WCB15], DMN [KIO+16], and MemN2N [SWF+15]. These models cannot be applied as-is to the SQuAD and multi-word answer bAbI datasets because they are only capable of generating single-word answers.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation Measures</title>
        <p>In order to evaluate the performance of all the
methods, the following measurements are used:</p>
        <p>Exact Match Accuracy (EMA) represents the
ratio of predicted answers which exactly match the
true answers.</p>
        <p>Partial Match Accuracy (PMA) is the ratio of
generated answers that partially match the correct
answers.</p>
        <p>BLEU score [CC14], widely used to evaluate machine translation models, measures the quality of the generated answers.</p>
        <p>The performance of the LTMN model on the SQuAD, single-word answer bAbI, and multi-word answer bAbI datasets is shown in Tables 2, 3, and 4, respectively.</p>
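        <p>A minimal sketch of these three measures is given below, assuming NLTK's smoothed sentence-level BLEU [CC14]; the partial-match criterion shown (sharing at least one word with the true answer) is a simplifying assumption for illustration.</p>
        <preformat>from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(pred, truth):
    # EMA: 1 if the predicted answer matches the true answer exactly.
    return float(pred == truth)

def partial_match(pred, truth):
    # PMA (illustrative criterion): 1 if the answers share at least one word.
    return float(bool(set(pred).intersection(truth)))

def bleu(pred, truth):
    # Smoothed sentence-level BLEU, following [CC14].
    return sentence_bleu([truth], pred,
                         smoothing_function=SmoothingFunction().method1)

pred = "over a personality conflict".split()
truth = "over a personality conflict with jobs".split()
print(exact_match(pred, truth), partial_match(pred, truth), bleu(pred, truth))
</preformat>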
        <p>We observe that LTMN performs better than LSTM in terms of all three evaluation measures, on all the datasets. On the SQuAD dataset, as the vocabulary is large (8,969 words), the LSTM model cannot learn the embedding matrices accurately, leading to its poor performance. However, as the LTMN model employs paragraph2vec, it learns richer vector representations of the sentences and questions. In addition, it can memorize and reason over the facts better than the simple LSTM model. On the multi-word answer bAbI dataset, the LTMN model is significantly better than the LSTM model, especially on tasks 1, 4, 12, 15, 19, and 20. The average EMA, BLEU, and PMA scores of LTMN are about 30% higher than those of the LSTM model. The single-word answer bAbI dataset's vocabulary is small (about 20 words), so we learn the embedding matrices A and B using back-propagation, instead of using paragraph2vec to obtain the vector representations. In Table 3, we observe that the LTMN model achieves accuracy close to the strongly supervised MemNN and DMN models on 4 out of the 20 bAbI tasks, despite being weakly supervised, and achieves better accuracy than the weakly supervised LSTM+Attention and MemN2N on 7 tasks. The proposed LTMN model also offers the additional capability of generating multi-word answers, unlike these baseline models.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Question answering is an important and challenging
task in natural language processing. Traditional
question answering approaches are simple query-based
approaches, which cannot memorize and reason over the
input text. Deep neural networks with memory have
been employed to alleviate this challenge in the
literature.</p>
      <p>In this paper, we proposed the Long-Term Memory Network, a novel recurrent neural network, which can encode raw text information (the input sentences and questions) into vector representations, form memories, find relevant information in the input sentences to answer the questions, and finally generate multi-word answers using a Long Short-Term Memory network. The proposed architecture is a weakly supervised model and can be trained end-to-end. Experiments on both synthetic and real-world datasets demonstrate the remarkable performance of the proposed architecture.</p>
      <p>In our experiments on the bAbI question answering tasks, we found that the proposed model fails to perform as well as the strongly supervised memory networks on certain tasks. In addition, the model performs poorly when the input sentences are very long and the vocabulary is large, as it cannot identify the supporting facts efficiently. In the future, we plan to expand the model to handle long input sentences, and improve the performance of the proposed network.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[BCB15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.</p>
      <p>[BCFL13] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In EMNLP, 2013.</p>
      <p>[BEP+08] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.</p>
      <p>[BGWB12] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, 2012.</p>
      <p>[BL14] Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In ACL, 2014.</p>
      <p>[BLK+09] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Soren Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - a crystallization point for the web of data. Web Semantics, 2009.</p>
      <p>[CC14] Boxing Chen and Colin Cherry. A systematic comparison of smoothing techniques for sentence-level BLEU. In SMT, 2014.</p>
      <p>[CVMBB14] Kyunghyun Cho, Bart Van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.</p>
      <p>[Elm91] Jeffrey L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 1991.</p>
      <p>[Gra13] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.</p>
      <p>[GWD14] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.</p>
      <p>[HKG+15] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, 2015.</p>
      <p>[HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 1997.</p>
      <p>[IBGC+14] Mohit Iyyer, Jordan L. Boyd-Graber, Leonardo Max Batista Claudino, Richard Socher, and Hal Daume III. A neural network for factoid question answering over paragraphs. In EMNLP, 2014.</p>
      <p>[JM15] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.</p>
      <p>[KIO+16] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.</p>
      <p>[LM14] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, 2014.</p>
      <p>[MCZ+17] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In KDD, 2017.</p>
      <p>[MD93] Michael C. Mozer and Sreerupa Das. A connectionist symbol manipulator that discovers the structure of context-free languages. In NIPS, 1993.</p>
      <p>[Pas03] Marius Pasca. Open-domain question answering from large text collections. Computational Linguistics, 2003.</p>
      <p>[RZLL16] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.</p>
      <p>[SMC+17] Qiuling Suo, Fenglong Ma, Giovanni Canino, Jing Gao, Aidong Zhang, Pierangelo Veltri, and Agostino Gnasso. A multi-task framework for monitoring health conditions via attention-based recurrent neural networks. In AMIA, 2017.</p>
      <p>[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.</p>
      <p>[SWF+15] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In NIPS, 2015.</p>
      <p>[WBC+16] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.</p>
      <p>[WCB15] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In ICLR, 2015.</p>
      <p>[WGL+16] Bingning Wang, Shangmin Guo, Kang Liu, Shizhu He, and Jun Zhao. Employing external rich knowledge for machine comprehension. In IJCAI, 2016.</p>
      <p>[YJW+16] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016.</p>
      <p>[ZC05] Luke S. Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI, 2005.</p>
      <p>[ZHLZ16] Yuanzhe Zhang, Shizhu He, Kang Liu, and Jun Zhao. A joint model for question answering over multiple knowledge bases. In AAAI, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>