<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Iterative Multi-document Neural Attention for Multiple Answer Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Greco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Suglia</string-name>
          <email>alessandro.suglia@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaetano Rossiello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via E. Orabona 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>People have information needs of varying complexity, which can be satisfied by an intelligent agent able to answer questions formulated in a proper way, possibly considering user context and preferences. In a scenario in which the user profile can be considered as a question, intelligent agents able to answer questions can be used to find the most relevant answers for a given user. In this work we propose a novel model based on Artificial Neural Networks to answer questions with multiple answers by exploiting multiple facts retrieved from a knowledge base. The model is evaluated on the factoid Question Answering and top-n recommendation tasks of the bAbI Movie Dialog dataset. After assessing the performance of the model on both tasks, we try to define the long-term goal of a conversational recommender system able to interact using natural language and to support users in their information-seeking processes in a personalized way.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We are surrounded by a huge variety of technological artifacts which “live” with
us today. These artifacts can help us in several ways because they have the
power to accomplish complex and time-consuming tasks. Unfortunately,
common software systems can do for us only specific types of tasks, in a strictly
algorithmic way which is pre-defined by the software designer. Machine
Learning (ML), a branch of Artificial Intelligence (AI), gives machines the ability to
learn to complete tasks without being explicitly programmed.</p>
      <p>People have information needs of varying complexity, ranging from simple
questions about common facts which can be found in encyclopedias, to more
sophisticated cases in which they need to know what movie to watch during
a romantic evening. These tasks can be solved by an intelligent agent able to
answer questions formulated in a proper way, possibly considering user context
and preferences.</p>
      <p>
        Question Answering (QA) emerged in the last decade as one of the most
promising fields in AI, since it makes it possible to design intelligent systems
able to give correct answers to user questions expressed in natural language.
Recommender systems, on the other hand, produce individualized recommendations
and guide the user in a personalized way to interesting or useful objects in a
large space of possible options. In a scenario in which the user profile (the set
of user preferences) can be represented by a question, intelligent agents able to
answer questions can be used to find the most appealing items for a given user,
which is the classical task that recommender systems solve. Despite their
efficacy, classical recommender systems are generally unable to hold a
conversation with the user, so they miss the opportunity to understand the user's
contextual information, emotions and feedback, to refine the user profile and to
provide enhanced suggestions. Conversational recommender systems assist
online users in their information-seeking and decision-making tasks
by supporting an interactive process [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which can be goal-oriented, starting general and, through a series of
interaction cycles, narrowing down the user's interests until the desired item
is obtained [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>In this work we propose a novel model based on Artificial Neural Networks to
answer questions exploiting multiple facts retrieved from a knowledge base and
evaluate it on a QA task. Moreover, the effectiveness of the model is evaluated on
the top-n recommendation task, where the aim of the system is to produce a list
of suggestions ranked according to the user preferences. After having assessed
the performance of the model on both tasks, we try to define the long-term goal
of a conversational recommender system able to interact with the user using
natural language and supporting him in the information seeking process in a
personalized way.</p>
      <p>
        In order to fulfill our long-term goal of building a conversational recommender
system, we need to assess the performance of our model on specific tasks involved
in this scenario. A recent work which goes in this direction is reported in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
which presents the bAbI Movie Dialog dataset, composed of different tasks such
as factoid QA, top-n recommendation and two more complex tasks, one which
mixes QA and recommendation and one which contains turns of dialogs taken
from Reddit. Having the more specific tasks of QA and recommendation, plus a
more complex one which mixes both, gives us the possibility to evaluate
our model at different levels of granularity. Moreover, the subdivision into turns
of the more complex task provides a proper benchmark of the model's capability
to handle an effective dialog with the user.
      </p>
      <p>
        For the QA task, many datasets have been released in order to
assess machine reading and comprehension capabilities, and many
neural network-based models have been proposed. Our model takes inspiration
from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which is able to answer Cloze-style [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] questions by repeating an
attention mechanism over the query and the documents multiple times. Despite
its effectiveness on the Cloze-style task, the original model does not consider
multiple documents as a source of information to answer questions, which is
fundamental in order to extract the answer from different relevant facts. The
restrictive assumption that the answer is contained in the given document does
not allow the model to provide an answer which does not belong to the
document. Moreover, this kind of task does not expect multiple answers for a given
question, which is important for the complex information needs of a
conversational recommender system.
      </p>
      <p>
        According to our vision, the main outcomes of our work can be considered as
building blocks for a conversational recommender system and can be summarized
as follows:
1. we extend the model reported in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to let the inference process exploit
evidence observed in multiple documents coming from an external knowledge
base represented as a collection of textual documents;
2. we design a model able to leverage the attention weights generated by the
inference process to provide multiple answers, which do not necessarily
belong to the documents, through a multi-layer neural network which may
uncover possible relationships between the most relevant evidence;
3. we assess the efficacy of our model through an experimental evaluation on
the factoid QA and top-n recommendation tasks, supporting our hypothesis that
a QA model can be used to solve top-n recommendation, too.
      </p>
      <p>The paper is organized as follows: Section 2 describes our model, while
Section 3 summarizes the evaluation of the model on the two above-mentioned tasks
and the comparison with state-of-the-art approaches. Section 4 gives
an overview of the literature on both QA and recommender systems, while final
remarks and our long-term vision are reported in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>Given a query q, let σ : Q → D be an operator that produces the set of documents
relevant for q, where Q is the set of all queries and D is the set of all
documents. Our model defines a workflow in which a sequence of inference steps is
performed in order to extract relevant information from σ(q) and generate the
answers for q.</p>
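      <p>The σ operator can be sketched as a simple retrieval function. The following is a hypothetical, minimal Python sketch that ranks documents by term overlap with the query; the actual implementation described in Section 3 uses the Elasticsearch engine, and all names here are illustrative assumptions.</p>

```python
def sigma(query_tokens, corpus, k=30):
    """Toy stand-in for the sigma operator: rank documents by how many
    query terms they contain and return the ids of the top-k matches.

    `corpus` maps document ids to token lists; the real system delegates
    this step to Elasticsearch."""
    def overlap(doc_tokens):
        return len(set(query_tokens) & set(doc_tokens))

    ranked = sorted(corpus.items(), key=lambda item: overlap(item[1]),
                    reverse=True)
    return [doc_id for doc_id, tokens in ranked[:k] if overlap(tokens) > 0]
```

      <p>The cap of 30 documents mirrors the setting reported in the experimental evaluation.</p>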
      <p>
        Following [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], our workflow consists of three steps: (1) the encoding phase,
which generates meaningful representations for the query and the documents; (2) the
inference phase, which extracts relevant semantic relationships between the query
and the documents by using an iterative attention mechanism; and (3) the
prediction phase, which generates a score for each candidate answer.
      </p>
      <sec id="sec-2-1">
        <title>Encoding phase</title>
        <p>The input of the encoding phase is given by a query q and a set of documents
σ(q) = {d₁, d₂, …, d_{|Dq|}}, denoted by Dq. Both queries and documents are represented by
a sequence of words X = (x₁, x₂, …, x_{|X|}) drawn from a vocabulary V. Each
word is represented by a continuous d-dimensional word embedding x ∈ ℝ^d
stored in a word embedding matrix X ∈ ℝ^{|V|×d}.</p>
        <p>
          The sequences of dense representations for q and dj are encoded using a
bidirectional recurrent neural network encoder with Gated Recurrent Units (GRU)
as in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which represents each word xₖ ∈ X as the concatenation of a forward
encoding h⃗ₖ ∈ ℝ^h and a backward encoding h⃖ₖ ∈ ℝ^h. From now on, we denote
the contextual representation for the word qᵢ by q̃ᵢ ∈ ℝ^{2h} and the contextual
representation for the word d_{j,i} in the document dj by d̃_{j,i} ∈ ℝ^{2h}. Differently
from [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], we build a unique representation for the whole set of documents Dq
related to the query q by stacking each contextual representation d̃_{j,i}, obtaining
a matrix D̃q ∈ ℝ^{l×2h}, where l = |d₁| + |d₂| + … + |d_{|Dq|}|.
        </p>
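        <p>The two structural operations above (token-wise concatenation of forward and backward encodings, and stacking of per-document representations into D̃q) can be sketched as follows; this is an illustrative sketch on plain Python lists, with the actual GRU outputs left as inputs.</p>

```python
def concat_bidirectional(forward, backward):
    """Token-wise concatenation of forward and backward encodings:
    each resulting contextual representation lives in R^{2h}."""
    return [f + b for f, b in zip(forward, backward)]

def stack_documents(doc_reprs):
    """Stack the contextual representations of every retrieved document
    into a single matrix D~q of shape l x 2h, where l is the total
    number of tokens across all documents."""
    stacked = []
    for doc in doc_reprs:
        stacked.extend(doc)
    return stacked
```
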
      </sec>
      <sec id="sec-2-2">
        <title>Inference phase</title>
        <p>This phase uncovers a possible inference chain which models meaningful
relationships between the query and the set of related documents. The inference chain
is obtained by performing, for each inference step t = 1; 2; : : : ; T , the attention
mechanisms given by the query attentive read and the document attentive read
keeping a state of the inference process given by an additional recurrent neural
network with GRU units. In this way, the network is able to progressively refine
the attention weights focusing on the most relevant tokens of the query and the
documents which are exploited by the prediction neural network to select the
correct answers among the candidate ones.</p>
        <p>Query attentive read. Given the contextual representations for the query
words (q̃₁, q̃₂, …, q̃_{|q|}) and the inference GRU state s_{t−1} ∈ ℝ^s, we obtain a
refined query representation q_t (query glimpse) by performing an attention
mechanism over the query at inference step t:</p>
        <p>q̂_{i,t} = softmax_{i=1,…,|q|} q̃ᵢ⊤ (A_q s_{t−1} + a_q),   q_t = Σᵢ q̂_{i,t} q̃ᵢ</p>
        <p>where q̂_{i,t} are the attention weights associated to the query words, and A_q ∈ ℝ^{2h×s}
and a_q ∈ ℝ^{2h} are respectively a weight matrix and a bias vector used
to perform the bilinear product with the query token representations q̃ᵢ. The
attention weights can be interpreted as relevance scores for each word of the
query, dependent on the inference state s_{t−1} at the current inference step t.</p>
        <p>Document attentive read. Given the query glimpse q_t and the inference GRU
state s_{t−1} ∈ ℝ^s, we perform an attention mechanism over the contextual
representations of the words of the stacked documents D̃q:</p>
        <p>d̂_{i,t} = softmax_{i=1,…,l} D̃_{q,i}⊤ (A_d [s_{t−1}; q_t] + a_d),   d_t = Σᵢ d̂_{i,t} D̃_{q,i}</p>
        <p>where D̃_{q,i} is the i-th row of D̃q, d̂_{i,t} are the attention weights associated to
the document words, and A_d ∈ ℝ^{2h×(s+2h)} and a_d ∈ ℝ^{2h} are respectively a weight
matrix and a bias vector used to perform the bilinear product with the
document token representations D̃_{q,i}. The attention weights can be interpreted
as relevance scores for each word of the documents, conditioned on both
the query glimpse and the inference state s_{t−1} at the current inference step t.
By combining the set of relevant documents in D̃q, we obtain the probability
distribution (d̂_{1,t}, d̂_{2,t}, …, d̂_{l,t}) over all the relevant document tokens using the
above-mentioned attention mechanism.</p>
        <p>
          Gating search results. The inference GRU state at inference step t is
updated according to s_t = GRU([r_q ⊙ q_t; r_d ⊙ d_t], s_{t−1}), where r_q and r_d are the
results of a gating mechanism obtained by evaluating g([s_{t−1}; q_t; d_t; q_t ⊙ d_t]) for
the query and the documents, respectively. The gating function g : ℝ^{s+6h} → ℝ^{2h}
is defined as a 2-layer feed-forward neural network with a Rectified Linear Unit
(ReLU) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] activation function in the hidden layer and a sigmoid activation
function in the output layer. The purpose of the gating mechanism is to retain
the information about the query and the documents which is useful for the
inference process and to forget the useless information.
        </p>
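        <p>Both attentive reads follow the same pattern: score each contextual vector against a projected key, normalize the scores with a softmax, and form a weighted-sum glimpse. A minimal pure-Python sketch, which folds the bilinear projection (A·[…] + a) into a precomputed key vector as an illustrative simplification:</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_read(reprs, key):
    """Score each contextual representation against `key` (a stand-in for
    A_q s_{t-1} + a_q, or A_d [s_{t-1}; q_t] + a_d for documents), then
    return the attention weights and the weighted-sum glimpse."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    weights = softmax([dot(r, key) for r in reprs])
    dim = len(reprs[0])
    glimpse = [sum(w * r[j] for w, r in zip(weights, reprs))
               for j in range(dim)]
    return weights, glimpse
```

      <p>Calling this once over the query representations and once over the stacked document representations yields q_t and d_t for a single inference step.</p>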
      </sec>
      <sec id="sec-2-3">
        <title>Prediction phase</title>
        <p>
          The prediction phase, which is completely different from the pointer-sum loss
reported in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], generates, given the query q, a relevance score for each
candidate answer a ∈ A by using the document attention weights d̂_{i,T} computed
in the last inference step T. The relevance score of each word w is obtained by
summing the attention weights of w in each document related to q. Formally, the
relevance score for a given word w is defined as:
        </p>
        <p>score(w) = (1/φ(w)) Σ_{i=1}^{l} δ(i, w)</p>
        <p>where δ(i, w) returns d̂_{i,T} if π(i) = w and 0 otherwise, π(i) returns the word in
position i of the stacked documents matrix D̃q, and φ(w) returns the frequency of
the word w in the documents Dq related to the query q. The relevance score takes
into account the importance of token occurrences in the considered documents,
given by the computed attention weights. Moreover, the normalization term 1/φ(w)
is applied to the relevance score in order to mitigate the weight associated to
highly frequent tokens.</p>
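        <p>The frequency-normalized relevance score can be computed directly from the stacked document tokens and the final-step attention weights; a minimal sketch, assuming flat Python lists for the token sequence and the weights:</p>

```python
from collections import Counter

def relevance_scores(tokens, attention):
    """score(w) = (1/phi(w)) * sum_i delta(i, w): sum the final-step
    attention weight of every occurrence of w in the stacked documents,
    then divide by the frequency of w to damp highly frequent tokens."""
    freq = Counter(tokens)          # phi(w)
    summed = Counter()
    for token, weight in zip(tokens, attention):
        summed[token] += weight     # accumulates delta(i, w) over positions
    return {w: summed[w] / freq[w] for w in freq}
```
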
        <p>The evaluated relevance scores are concatenated in a single vector
representation z = [score(w1); score(w2); : : : ; score(wjV j)] which is given in input to the
answer prediction neural network defined as:</p>
        <p>y = sigmoid(W_ho relu(W_ih z + b_ih) + b_ho)</p>
        <p>where u is the hidden layer size, W_ih ∈ ℝ^{u×|V|} and W_ho ∈ ℝ^{|A|×u} are weight
matrices, b_ih ∈ ℝ^u and b_ho ∈ ℝ^{|A|} are bias vectors, sigmoid(x) = 1/(1 + e^{−x}) is the
sigmoid function and relu(x) = max(0, x) is the ReLU activation function, both
applied pointwise to the given input vector.</p>
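        <p>A pure-Python sketch of the two-layer prediction network follows; the weights are passed in explicitly here, whereas in the model they are learned parameters.</p>

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def predict(z, W_ih, b_ih, W_ho, b_ho):
    """y = sigmoid(W_ho relu(W_ih z + b_ih) + b_ho): one independent
    probability per candidate answer (multi-label output)."""
    hidden = [relu(a + b) for a, b in zip(matvec(W_ih, z), b_ih)]
    out = [a + b for a, b in zip(matvec(W_ho, hidden), b_ho)]
    return [sigmoid(x) for x in out]
```
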
        <p>
          The neural network weights are supposed to learn latent features which
encode relationships between the most relevant words for the given query, in order
to predict the correct answers. The outer sigmoid activation function is used to treat the
problem as a multi-label classification problem, so that each candidate answer is
independent and not mutually exclusive. In this way the neural network
generates a score which represents the probability that the candidate answer is correct.
Moreover, differently from [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], the candidate answers in A can be any word, even
those which do not belong to the documents related to the query.
        </p>
        <p>The model is trained by minimizing the binary cross-entropy loss function,
comparing the neural network output y with the target answers for the given
query q represented as a binary vector which contains a 1 in the position
corresponding to each correct answer and 0 elsewhere.</p>
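        <p>The training loss can be sketched as follows, building the multi-hot target vector from the set of correct answers; this is a hypothetical helper, not the authors' code.</p>

```python
import math

def binary_cross_entropy(y_pred, answers, correct):
    """Mean binary cross-entropy between the sigmoid outputs y_pred and a
    multi-hot target containing 1 for each correct answer, 0 elsewhere."""
    target = [1.0 if a in correct else 0.0 for a in answers]
    eps = 1e-12  # guard against log(0)
    total = -sum(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
                 for t, p in zip(target, y_pred))
    return total / len(answers)
```
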
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental evaluation</title>
      <p>The model performance is evaluated on the QA and Recs tasks of the bAbI Movie
Dialog dataset using the HITS@k evaluation metric, which is equal to the number
of correct answers in the top-k results. In particular, the performance on the
QA task is evaluated according to HITS@1, while the performance on the Recs
task is evaluated according to HITS@100.</p>
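      <p>HITS@k can be computed directly from a ranked list of predicted answers; a minimal sketch:</p>

```python
def hits_at_k(ranked_answers, correct_answers, k):
    """HITS@k: number of correct answers among the top-k predictions."""
    return len(set(ranked_answers[:k]) & set(correct_answers))
```
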
      <p>
        Differently from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the relevant knowledge base facts, taken from the
knowledge base in triple form distributed with the dataset, are retrieved by σ,
implemented by exploiting the Elasticsearch engine, and not according to a hash
lookup operator which applies a strict filtering procedure based on word
frequency. In our work, σ returns at most the top 30 relevant facts for q. Each
entity in questions and documents is recognized using the list of entities
provided with the dataset and is considered as a single word of the dictionary V.
      </p>
      <p>
        Questions, answers and documents given in input to the model are
preprocessed using the NLTK toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], performing only word tokenization. The
question given in input to the σ operator is preprocessed by performing word
tokenization and stopword removal.
      </p>
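      <p>The entity-recognition step (treating each, possibly multi-word, entity name as a single vocabulary word) can be sketched as a greedy longest-match pass over the token stream; the underscore-joining convention is an assumption of this sketch, not a detail stated in the paper.</p>

```python
def tokenize_with_entities(text, entities):
    """Split `text` on whitespace, then greedily merge spans that match an
    entity from the dataset's entity list into a single token."""
    words = text.split()
    # Longer entity names first, so "Larenz Tate" wins over a shorter match.
    entity_words = sorted((e.split() for e in entities), key=len, reverse=True)
    merged, i = [], 0
    while i < len(words):
        for ent in entity_words:
            if words[i:i + len(ent)] == ent:
                merged.append("_".join(ent))
                i += len(ent)
                break
        else:
            merged.append(words[i])
            i += 1
    return merged
```
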
      <p>
        The optimization method and tricks are adopted from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The model is
trained using the ADAM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] optimizer (learning rate = 0.001) with a batch size of
128 for at most 100 epochs, keeping the best model until the HITS@k on the
validation set decreases for 5 consecutive times. Dropout [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is applied on r_q
and on r_d with a rate of 0.2 and on the prediction neural network hidden layer
with a rate of 0.5. L2 regularization is applied to the embedding matrix X with
a coefficient equal to 0.0001. We clip the gradients if their norm is greater
than 5 to stabilize learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The embedding size d is fixed to 50. All GRU output
sizes are fixed to 128. The number of inference steps T is set to 3. The size of
the prediction neural network hidden layer u is fixed to 4096. The biases b_ih and
b_ho are initialized to zero vectors. All weight matrices are initialized by sampling
from the normal distribution N(0, 0.05). The ReLU activation function in the
prediction neural network has been chosen experimentally, comparing different
activation functions such as sigmoid and tanh and taking the one which leads
to the best performance. The model, available on GitHub¹, is implemented in
TensorFlow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and executed on an NVIDIA TITAN X GPU.
      </p>
      <p>
        Following the experimental design, the results in Table 1 are promising:
our model outperforms all the other systems on both tasks except for the QA
SYSTEM on the QA task. Although the QA SYSTEM has an advantage there, it is a
system carefully designed to handle knowledge base data in the form of triples,
whereas our model can leverage data in the form of documents, without making any
assumption about the form of the input data, and can be applied to different kinds
of tasks. Additionally, the model MEMN2N is a neural network whose weights
are pre-trained on the same dataset without using the long-term memory, and
the models JOINT SUPERVISED EMBEDDINGS and JOINT MEMN2N are
trained across all the tasks of the dataset in order to boost performance.
Despite that, our model outperforms the three above-mentioned ones without
using any supplementary trick. Even though our model's performance is higher
than all the others on the Recs task, we believe that the obtained result may be
improved, and so we plan a further investigation. Moreover, the need for further
investigation is justified by the work reported in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], which describes some
issues regarding the Recs task.
      </p>
      <p>Figure 1 shows the attention weights computed in the last inference step
of the iterative attention mechanism used by the model to answer a given
question. Attention weights, represented as red boxes with variable color shades
around the tokens, can be used to interpret the reasoning mechanism applied by
the model, because higher shades of red are associated to the more relevant tokens
on which the model focuses its attention. It is worth noticing that the attention
weights associated to each token are the result of the inference mechanism
uncovered by the model, which progressively tries to focus on the relevant aspects
of the query and the documents which are exploited to generate the answers.</p>
      <p>¹ https://github.com/nlp-deepqa/imnamap</p>
      <p>Question:
what does Larenz Tate act in ?
Ground truth answers:
The Postman, A Man Apart, Dead Presidents, Love Jones, Why Do Fools Fall in Love, The Inkwell
Most relevant sentences:
– The Inkwell starred actors Joe Morton , Larenz Tate , Suzzanne Douglas , Glynn Turman
– Love Jones starred actors Nia Long , Larenz Tate , Isaiah Washington , Lisa Nicole Carson
– Why Do Fools Fall in Love starred actors Halle Berry , Vivica A. Fox , Larenz Tate , Lela Rochon
– The Postman starred actors Kevin Costner , Olivia Williams , Will Patton , Larenz Tate
– Dead Presidents starred actors Keith David , Chris Tucker , Larenz Tate
– A Man Apart starred actors Vin Diesel , Larenz Tate</p>
      <p>Given the question “what does Larenz Tate act in?” shown in the
above-mentioned figure, the model is able to understand that “Larenz Tate” is the
subject of the question and “act in” represents the intent of the question. Reading
the related documents, the model associates higher attention weights to the most
relevant tokens needed to answer the question, such as “The Postman”, “A Man
Apart” and so on.</p>
    </sec>
    <sec id="sec-4">
      <title>Related work</title>
      <p>
        We think that it is necessary to consider models and techniques coming from
research in both QA and recommender systems in order to pursue our goal
of building an intelligent agent able to assist the user in decision-making tasks.
We cannot fill the gap between the above-mentioned research areas unless we
consider the proposed models in a synergic way, by virtue of the proposed
analogy between the user profile (the set of user preferences) and the items to be
recommended on one side, and the question and the correct answers on the other.
The first work which goes in this direction is reported in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which exploits movie descriptions to suggest
appealing movies for a given user using an architecture typically used for QA
tasks. In fact, most of the research in the recommender systems field presents
ad-hoc systems which exploit neighbourhood information, as in Collaborative
Filtering techniques [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or item descriptions and metadata, as in Content-based
systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Recently presented neural network models [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] are able
to learn latent representations in the network weights by leveraging information
coming from user preferences and item information.
      </p>
      <p>
        Recently, a lot of effort has been devoted to creating benchmarks for artificial
agents to assess their ability to comprehend natural language and to reason over
facts. One of the first attempts is the bAbI [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] dataset, a synthetic
dataset containing elementary tasks such as selecting an answer among one or
more candidate facts, answering yes/no questions, counting operations over lists
and sets, and basic induction and deduction tasks. Another relevant benchmark
is the one described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which provides the CNN/Daily Mail datasets consisting
of document-query-answer triples where an entity in the query is replaced by
a placeholder and the system should identify the correct entity by reading and
comprehending the given document. MCTest [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] requires machines to answer
multiple-choice reading comprehension questions about fictional stories, directly
tackling the high-level goal of open-domain machine comprehension. Finally,
SQuAD [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] consists of a set of Wikipedia articles, where the answer to each
question is a segment of text from the corresponding reading passage.
      </p>
      <p>
        According to the experimental evaluations conducted on the above-mentioned
datasets, high-level performance can be obtained by exploiting complex attention
mechanisms which are able to focus on relevant evidence in the processed
content. One of the earlier approaches used to solve these tasks is the
general Memory Network [
        <xref ref-type="bibr" rid="ref21 ref24">21, 24</xref>
        ] framework, one of the first neural
network models able to access external memories to extract relevant information
through an attention mechanism and to use it to provide the correct answer.
A deep Recurrent Neural Network with Long Short-Term Memory units is
presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which solves the CNN/Daily Mail datasets by designing two different
attention mechanisms called Impatient Reader and Attentive Reader. Another
way to incorporate attention in neural network models is proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which
defines a pointer-sum loss whose aim is to maximize the attention weights which
lead to the correct answer.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>
        In this work we propose a novel model based on Artificial Neural Networks to
answer questions with multiple answers by exploiting multiple facts retrieved from
a knowledge base. The proposed model can be considered a relevant building
block of a conversational recommender system. Differently from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], our model
can consider multiple documents as a source of information in order to
generate multiple answers which may not belong to the documents. As presented in
this work, common tasks such as QA and top-n recommendation can be solved
effectively by our model.
      </p>
      <p>In a common recommendation system scenario, when a user enters a search
query, it is assumed that their preferences are already known. This is a stringent
requirement, because users may not have a clear idea of their preferences at that
point. Conversational recommender systems support users in fulfilling their
information needs through an interactive process. In this way, the system can provide a
personalized experience, dynamically adapting the user model with the possibility
of enhancing the generated predictions. Moreover, the system capability can be
further enhanced by giving the user explanations of the given suggestions.</p>
      <p>To reach our goal, we should improve our model by designing a σ operator
able to return relevant facts by recognizing the most relevant information in the
query, by exploiting user preferences and contextual information to learn the user
model, and by providing a mechanism which leverages attention weights to give
explanations. In order to effectively train our model, we plan to collect real dialog
data containing contextual information associated to each user and, for each
dialog, feedback which indicates whether the user is satisfied with the conversation.
Given these enhancements, we should be able to design a system which holds an
effective dialog with the user, recognizing his intent and providing him with the
most suitable content.</p>
      <p>With this work we try to show the effectiveness of our architecture on tasks
which range from pure question answering to top-n recommendation, through an
experimental evaluation without any assumption on the task to be solved. To
do that, we do not use any hand-crafted linguistic features, but we let the
system learn and leverage them in the inference process which leads to the
answers through multiple reasoning steps. During these steps, the system
understands relevant relationships between the question and the documents without relying
on canonical matching, but by repeating an attention mechanism able to uncover
related aspects in distributed representations, conditioned on an encoding of the
inference process given by another neural network. By equipping agents with a
reasoning mechanism like the one described in this work and exploiting the ability
of neural network models to learn from data, we may be able to create truly
intelligent agents.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the IBM Faculty Award "Deep Learning to boost
Cognitive Question Answering". The Titan X GPU used for this research was
donated by the NVIDIA Corporation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Józefowicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kudlur</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mané</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monga</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steiner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwar</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tucker</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viégas</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warden</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wicke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>TensorFlow: Large-scale machine learning on heterogeneous distributed systems</article-title>
          .
          <source>CoRR abs/1603.04467</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>In: Proceedings of the COLING/ACL on Interactive presentation sessions</source>
          . pp.
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          . Association for Computational Linguistics (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koc</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harmsen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaked</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandra</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aradhye</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ispir</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anil</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haque</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
          </string-name>
          , H.:
          <article-title>Wide &amp; deep learning for recommender systems</article-title>
          .
          <source>CoRR abs/1606.07792</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Covington</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sargin</surname>
          </string-name>
          , E.:
          <article-title>Deep neural networks for YouTube recommendations</article-title>
          .
          <source>In: Proceedings of the 10th ACM Conference on Recommender Systems</source>
          . New York, NY, USA (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dodge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gane</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
          </string-name>
          , J.:
          <article-title>Evaluating prerequisite qualities for learning end-to-end dialog systems</article-title>
          .
          <source>arXiv preprint arXiv:1511.06931</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>de Gemmis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lops</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narducci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Semantics-aware content-based recommender systems</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>159</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hermann</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kociský</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Espeholt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suleyman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blunsom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Teaching machines to read and comprehend</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>1693</fpage>
          -
          <lpage>1701</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kadlec</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bajgar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleindienst</surname>
          </string-name>
          , J.:
          <article-title>Text understanding with the attention sum reader network</article-title>
          .
          <source>arXiv preprint arXiv:1603.01547</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mahmood</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Improving recommender systems with adaptive conversational strategies</article-title>
          .
          <source>In: Proceedings of the 20th ACM conference on Hypertext and hypermedia</source>
          . pp.
          <fpage>73</fpage>
          -
          <lpage>82</lpage>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Musto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suglia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
          </string-name>
          , G.:
          <article-title>Ask me any rating: A content-based recommender system based on recurrent neural networks</article-title>
          .
          <source>In: Proceedings of the 7th Italian Information Retrieval Workshop</source>
          , Venezia, Italy, May 30-31 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          . pp.
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ning</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Desrosiers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
          </string-name>
          , G.:
          <article-title>A comprehensive survey of neighborhood-based recommendation methods</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>76</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On the difficulty of training recurrent neural networks</article-title>
          .
          <source>ICML (3) 28</source>
          ,
          <fpage>1310</fpage>
          -
          <lpage>1318</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>CoRR abs/1606.05250</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renshaw</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>MCTest: A challenge dataset for the open-domain machine comprehension of text</article-title>
          . In: EMNLP (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rubens</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Active learning in recommender systems</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          , pp.
          <fpage>809</fpage>
          -
          <lpage>846</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Searle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bingham-Walker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Why “blow out”? a structural analysis of the movie dialog dataset</article-title>
          .
          <source>ACL 2016</source>
          , p.
          <fpage>215</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sordoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bachman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Iterative alternating neural attention for machine reading</article-title>
          .
          <source>arXiv preprint arXiv:1606.02245</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sukhbaatar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
          </string-name>
          , R.:
          <article-title>End-to-end memory networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>2440</fpage>
          -
          <lpage>2448</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , W.L.:
          <article-title>Cloze procedure: a new tool for measuring readability</article-title>
          .
          <source>Journalism and Mass Communication Quarterly</source>
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>415</fpage>
          (
          <year>1953</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Towards AI-complete question answering: A set of prerequisite toy tasks</article-title>
          .
          <source>CoRR abs/1502.05698</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Memory networks</article-title>
          .
          <source>arXiv preprint arXiv:1410.3916</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>