=Paper=
{{Paper
|id=Vol-1802/paper3
|storemode=property
|title=Iterative Multi-document Neural Attention for Multiple Answer Prediction
|pdfUrl=https://ceur-ws.org/Vol-1802/paper3.pdf
|volume=Vol-1802
|authors=Claudio Greco,Alessandro Suglia,Pierpaolo Basile,Gaetano Rossiello,Giovanni Semeraro
|dblpUrl=https://dblp.org/rec/conf/aiia/0002SBRS16
}}
==Iterative Multi-document Neural Attention for Multiple Answer Prediction==
<pdf width="1500px">https://ceur-ws.org/Vol-1802/paper3.pdf</pdf>
<pre>
    Iterative Multi-document Neural Attention for
              Multiple Answer Prediction

    Claudio Greco, Alessandro Suglia, Pierpaolo Basile, Gaetano Rossiello, and
                               Giovanni Semeraro

           Department of Computer Science, University of Bari Aldo Moro,
                        Via E. Orabona 4, 70125 Bari, Italy
          claudiogaetanogreco@gmail.com,alessandro.suglia@gmail.com,
                         {firstname.lastname}@uniba.it


        Abstract. People have information needs of varying complexity, which
        can be solved by an intelligent agent able to answer questions formulated
        in a proper way, possibly considering user context and preferences. In
        a scenario in which the user profile can be considered as a question, in-
        telligent agents able to answer questions can be used to find the most
        relevant answers for a given user. In this work we propose a novel model
        based on Artificial Neural Networks to answer questions with multiple
        answers by exploiting multiple facts retrieved from a knowledge base.
        The model is evaluated on the factoid Question Answering and top-n
        recommendation tasks of the bAbI Movie Dialog dataset. After assess-
        ing the performance of the model on both tasks, we try to define the
        long-term goal of a conversational recommender system able to interact
        using natural language and supporting users in their information seeking
        processes in a personalized way.


1     Motivation and Background
We are surrounded by a huge variety of technological artifacts which “live” with
us today. These artifacts can help us in several ways because they have the
power to accomplish complex and time-consuming tasks. Unfortunately, com-
mon software systems can do for us only specific types of tasks, in a strictly
algorithmic way which is pre-defined by the software designer. Machine Learn-
ing (ML), a branch of Artificial Intelligence (AI), gives machines the ability to
learn to complete tasks without being explicitly programmed.
    People have information needs of varying complexity, ranging from simple
questions about common facts which can be found in encyclopedias, to more
sophisticated cases in which they need to know what movie to watch during
a romantic evening. These tasks can be solved by an intelligent agent able to
answer questions formulated in a proper way, possibly considering user context
and preferences.
    Question Answering (QA) emerged in the last decade as one of the most
promising fields in AI, since it allows the design of intelligent systems which are
able to give correct answers to user questions expressed in natural language.
Recommender systems, in contrast, produce individualized recommendations as
output and have the effect of guiding the user in a personalized way to interesting
or useful objects in a large space of possible options. In a scenario in which the
user profile (the set of user preferences) can be represented by a question, intel-
ligent agents able to answer questions can be used to find the most appealing
items for a given user, which is the classical task that recommender systems can
solve. Despite their efficacy, classical recommender systems are generally not
able to handle a conversation with the user, so they miss the opportunity to
understand the user's contextual information, emotions and feedback in order to
refine the user profile and provide enhanced suggestions. Conversational
recommender systems assist online users in their information-seeking and
decision-making tasks
by supporting an interactive process [10] which could be goal oriented with the
task of starting general and, through a series of interaction cycles, narrowing
down the user interests until the desired item is obtained [17].
    In this work we propose a novel model based on Artificial Neural Networks to
answer questions exploiting multiple facts retrieved from a knowledge base and
evaluate it on a QA task. Moreover, the effectiveness of the model is evaluated on
the top-n recommendation task, where the aim of the system is to produce a list
of suggestions ranked according to the user preferences. After having assessed
the performance of the model on both tasks, we try to define the long-term goal
of a conversational recommender system able to interact with the user using
natural language and supporting him in the information seeking process in a
personalized way.
    In order to fulfill our long-term goal of building a conversational recommender
system we need to assess the performance of our model on specific tasks involved
in this scenario. A recent work which goes in this direction is reported in [5],
which presents the bAbI Movie Dialog dataset, composed of different tasks such
as factoid QA, top-n recommendation and two more complex tasks, one which
mixes QA and recommendation and one which contains turns of dialogs taken
from Reddit. Having more specific tasks like QA and recommendation, and a
more complex one which mixes both tasks gives us the possibility to evaluate
our model on different levels of granularity. Moreover, the subdivision in turns
of the more complex task provides a proper benchmark of the model capability
to handle an effective dialog with the user.
    For the task related to QA, a lot of datasets have been released in order to
assess the machine reading and comprehension capabilities and a lot of neu-
ral network-based models have been proposed. Our model takes inspiration
from [19], which is able to answer Cloze-style [22] questions repeating an at-
tention mechanism over the query and the documents multiple times. Despite
the effectiveness on the Cloze-style task, the original model does not consider
multiple documents as a source of information to answer questions, which is
fundamental in order to extract the answer from different relevant facts. The
restricted assumption that the answer is contained in the given document does
not allow the model to provide an answer which does not belong to the docu-
ment. Moreover, this kind of task does not expect multiple answers for a given
question, which is important for the complex information needs required for a
conversational recommender system.
    According to our vision, the main outcomes of our work can be considered as
building blocks for a conversational recommender system and can be summarized
as follows:
 1. we extend the model reported in [19] to let the inference process exploit
    evidence observed in multiple documents coming from an external knowledge
    base represented as a collection of textual documents;
 2. we design a model able to leverage the attention weights generated by the
    inference process to provide multiple answers which do not necessarily
    belong to the documents through a multi-layer neural network which may
    uncover possible relationships between the most relevant pieces of evidence;
 3. we assess the efficacy of our model through an experimental evaluation on
    factoid QA and top-n recommendation tasks supporting our hypothesis that
    a QA model can be used to solve top-n recommendation, too.
    The paper is organized as follows: Section 2 describes our model, while Sec-
tion 3 summarizes the evaluation of the model on the two above-mentioned tasks
and the comparison with respect to state-of-the-art approaches. Section 4 gives
an overview of the literature of both QA and recommender systems, while final
remarks and our long-term vision are reported in Section 5.


2     Methodology
Given a query q, let ψ : Q → D be an operator that produces the set of documents
relevant for q, where Q is the set of all queries and D is the set of all
documents. Our model defines a workflow in which a sequence of inference steps is
performed in order to extract relevant information from ψ(q) to generate the
answers for q.
    Following [19], our workflow consists of three steps: (1) the encoding phase,
which generates meaningful representations for query and documents; (2) the in-
ference phase, which extracts relevant semantic relationships between the query
and the documents by using an iterative attention mechanism and finally (3) the
prediction phase, which generates a score for each candidate answer.

2.1   Encoding phase
The input of the encoding phase is given by a query $q$ and a set of documents
$\psi(q) = \{d_1, d_2, \ldots, d_{|D_q|}\} \equiv D_q$. Both queries and documents are
represented by a sequence of words $X = (x_1, x_2, \ldots, x_{|X|})$, drawn from a
vocabulary $V$. Each word is represented by a continuous $d$-dimensional word
embedding $x \in \mathbb{R}^d$ stored in a word embedding matrix $X \in \mathbb{R}^{|V| \times d}$.
    The sequences of dense representations for $q$ and $d_j$ are encoded using a
bidirectional recurrent neural network encoder with Gated Recurrent Units (GRU)
as in [19], which represents each word $x_i \in X$ as the concatenation of a forward
encoding $\overrightarrow{h}_k \in \mathbb{R}^h$ and a backward encoding $\overleftarrow{h}_k \in \mathbb{R}^h$.
From now on, we denote the contextual representation for the word $q_i$ by
$\tilde{q}_i \in \mathbb{R}^{2h}$ and the contextual representation for the word $d_{j,i}$ in the
document $d_j$ by $\tilde{d}_{j,i} \in \mathbb{R}^{2h}$. Differently from [19], we build a unique
representation for the whole set of documents $D_q$ related to the query $q$ by
stacking each contextual representation $\tilde{d}_{j,i}$, obtaining a matrix
$\tilde{D}_q \in \mathbb{R}^{l \times 2h}$, where $l = |d_1| + |d_2| + \ldots + |d_{|D_q|}|$.
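
The stacking step can be illustrated with a minimal NumPy sketch. It assumes
the bidirectional GRU outputs are already available as one matrix per document;
the random matrices, the size h = 4 and the helper name stack_documents are
hypothetical stand-ins and are not taken from the authors' implementation.

```python
import numpy as np

def stack_documents(encoded_docs):
    """Stack per-document token encodings (each |d_j| x 2h) into one l x 2h matrix."""
    return np.concatenate(encoded_docs, axis=0)

h = 4  # hypothetical encoder size, chosen small for readability
rng = np.random.default_rng(0)
# Stand-ins for the bidirectional GRU outputs of three documents
# containing 5, 3 and 7 tokens respectively.
encoded_docs = [rng.normal(size=(n_tokens, 2 * h)) for n_tokens in (5, 3, 7)]

D_q = stack_documents(encoded_docs)
print(D_q.shape)  # l = 5 + 3 + 7 rows, 2h columns
```

Stacking along the token axis lets a single softmax range over the tokens of
all retrieved documents at once.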

2.2   Inference phase
This phase uncovers a possible inference chain which models meaningful relation-
ships between the query and the set of related documents. The inference chain
is obtained by performing, for each inference step t = 1, 2, . . . , T , the attention
mechanisms given by the query attentive read and the document attentive read
keeping a state of the inference process given by an additional recurrent neural
network with GRU units. In this way, the network is able to progressively refine
the attention weights focusing on the most relevant tokens of the query and the
documents which are exploited by the prediction neural network to select the
correct answers among the candidate ones.

Query attentive read Given the contextual representations for the query
words $(\tilde{q}_1, \tilde{q}_2, \ldots, \tilde{q}_{|q|})$ and the inference GRU state
$s_{t-1} \in \mathbb{R}^s$, we obtain a refined query representation $q_t$ (query glimpse)
by performing an attention mechanism over the query at inference step $t$:

$$\hat{q}_{i,t} = \operatorname*{softmax}_{i=1,\ldots,|q|} \tilde{q}_i^\top (A_q s_{t-1} + a_q),$$
$$q_t = \sum_i \hat{q}_{i,t} \tilde{q}_i$$

where $\hat{q}_{i,t}$ are the attention weights associated to the query words,
$A_q \in \mathbb{R}^{2h \times s}$ and $a_q \in \mathbb{R}^{2h}$ are respectively a weight matrix and a
bias vector which are used to perform the bilinear product with the query token
representations $\tilde{q}_i$. The attention weights can be interpreted as the
relevance scores for each word of the query dependent on the inference state
$s_{t-1}$ at the current inference step $t$.
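
As a rough illustration of the query attentive read, the following NumPy
sketch computes the attention weights and the query glimpse for one inference
step. The sizes, the random parameters and the function names are hypothetical;
this is a sketch of the formulas above, not the authors' TensorFlow code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def query_attentive_read(Q_tilde, s_prev, A_q, a_q):
    """Q_tilde: |q| x 2h, s_prev: s, A_q: 2h x s, a_q: 2h."""
    scores = Q_tilde @ (A_q @ s_prev + a_q)  # one bilinear score per query token
    weights = softmax(scores)                # attention weights over query tokens
    glimpse = weights @ Q_tilde              # weighted sum of token representations
    return weights, glimpse

h, s, n_query_tokens = 4, 6, 5  # hypothetical sizes
rng = np.random.default_rng(0)
Q_tilde = rng.normal(size=(n_query_tokens, 2 * h))
s_prev = rng.normal(size=s)
A_q = rng.normal(size=(2 * h, s))
a_q = np.zeros(2 * h)

q_weights, q_t = query_attentive_read(Q_tilde, s_prev, A_q, a_q)
print(q_weights.sum(), q_t.shape)
```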

Document attentive read Given the query glimpse $q_t$ and the inference GRU
state $s_{t-1} \in \mathbb{R}^s$, we perform an attention mechanism over the contextual
representations for the words of the stacked documents $\tilde{D}_q$:

$$\hat{d}_{i,t} = \operatorname*{softmax}_{i=1,\ldots,l} \tilde{D}_{q_i}^\top (A_d [s_{t-1}, q_t] + a_d),$$
$$d_t = \sum_i \hat{d}_{i,t} \tilde{D}_{q_i}$$

where $\tilde{D}_{q_i}$ is the $i$-th row of $\tilde{D}_q$, $\hat{d}_{i,t}$ are the attention weights
associated to the document words, $A_d \in \mathbb{R}^{2h \times (s+2h)}$ and $a_d \in \mathbb{R}^{2h}$ are
respectively a weight matrix and a bias vector which are used to perform the
bilinear product with the document token representations $\tilde{D}_{q_i}$. The
attention weights can be interpreted as the relevance scores for each word of
the documents conditioned on both the query glimpse and the inference state
$s_{t-1}$ at the current inference step $t$. By combining the set of relevant
documents in $\tilde{D}_q$, we obtain the probability distribution
$(\hat{d}_{1,t}, \hat{d}_{2,t}, \ldots, \hat{d}_{l,t})$ over all the relevant document tokens using the
above-mentioned attention mechanism.
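
The document attentive read can be sketched in the same way. In this
hypothetical NumPy example the weight matrix A_d is given the shape
2h x (s + 2h) so that it can multiply the concatenation of the inference state
and the query glimpse; that shape is an assumption derived from the sizes of
the concatenated vectors, and all values are random stand-ins.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def document_attentive_read(D_tilde, s_prev, q_t, A_d, a_d):
    """D_tilde: l x 2h, s_prev: s, q_t: 2h, A_d: 2h x (s + 2h), a_d: 2h."""
    context = np.concatenate([s_prev, q_t])   # inference state followed by query glimpse
    scores = D_tilde @ (A_d @ context + a_d)  # one bilinear score per document token
    weights = softmax(scores)                 # distribution over all l stacked tokens
    d_t = weights @ D_tilde                   # document glimpse
    return weights, d_t

h, s, l = 4, 6, 15  # hypothetical sizes
rng = np.random.default_rng(1)
D_tilde = rng.normal(size=(l, 2 * h))
s_prev = rng.normal(size=s)
q_t = rng.normal(size=2 * h)
A_d = rng.normal(size=(2 * h, s + 2 * h))
a_d = np.zeros(2 * h)

d_weights, d_t = document_attentive_read(D_tilde, s_prev, q_t, A_d, a_d)
print(d_weights.shape, d_t.shape)
```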


Gating search results The inference GRU state at the inference step $t$ is
updated according to $s_t = \mathrm{GRU}([r_q \cdot q_t, r_d \cdot d_t], s_{t-1})$, where $r_q$ and
$r_d$ are the results of a gating mechanism obtained by evaluating
$g([s_{t-1}, q_t, d_t, q_t \cdot d_t])$ for the query and the documents, respectively. The
gating function $g : \mathbb{R}^{s+6h} \to \mathbb{R}^{2h}$ is defined as a 2-layer feed-forward neural
network with a Rectified Linear Unit (ReLU) [12] activation function in the
hidden layer and a sigmoid activation function in the output layer. The purpose
of the gating mechanism is to retain the information about query and documents
that is useful for the inference process and to discard the rest.
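
A minimal NumPy sketch of the gating step is given below. It assumes two
gating networks with separate parameters, one producing r_q and one producing
r_d, and a hypothetical hidden size; neither detail is stated in the text. The
GRU update itself is omitted and only its gated input is built.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(s_prev, q_t, d_t, W1, b1, W2, b2):
    """2-layer feed-forward gate mapping a vector of size s + 6h to (0, 1)^{2h}."""
    x = np.concatenate([s_prev, q_t, d_t, q_t * d_t])  # size s + 2h + 2h + 2h
    return sigmoid(W2 @ relu(W1 @ x + b1) + b2)

def init_gate(rng, in_size, hidden, out_size):
    """Random parameters for one gate (hypothetical initialization)."""
    return (rng.normal(scale=0.05, size=(hidden, in_size)), np.zeros(hidden),
            rng.normal(scale=0.05, size=(out_size, hidden)), np.zeros(out_size))

h, s, hidden = 4, 6, 10  # hypothetical sizes
rng = np.random.default_rng(2)
s_prev = rng.normal(size=s)
q_t = rng.normal(size=2 * h)
d_t = rng.normal(size=2 * h)

# Assumption: the query gate and the document gate do not share parameters.
r_q = gate(s_prev, q_t, d_t, *init_gate(rng, s + 6 * h, hidden, 2 * h))
r_d = gate(s_prev, q_t, d_t, *init_gate(rng, s + 6 * h, hidden, 2 * h))

# Gated input that would be fed to the inference GRU together with s_prev.
gru_input = np.concatenate([r_q * q_t, r_d * d_t])
print(gru_input.shape)
```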


2.3   Prediction phase

The prediction phase, which is completely different from the pointer-sum loss
reported in [19], is able to generate, given the query $q$, a relevance score for
each candidate answer $a \in A$ by using the document attention weights $\hat{d}_{i,T}$
computed in the last inference step $T$. The relevance score of each word $w$ is
obtained by summing the attention weights of $w$ in each document related to $q$.
Formally the relevance score for a given word $w$ is defined as:

$$\mathrm{score}(w) = \frac{1}{\pi(w)} \sum_{i=1}^{l} \phi(i, w)$$

where $\phi(i, w)$ returns 0 if $\sigma(i) \neq w$, $\hat{d}_{i,T}$ otherwise; $\sigma(i)$ returns the
word in position $i$ of the stacked documents matrix $\tilde{D}_q$ and $\pi(w)$ returns the
frequency of the word $w$ in the documents $D_q$ related to the query $q$. The
relevance score takes into account the importance of token occurrences in the
considered documents given by the computed attention weights. Moreover, the
normalization term $\frac{1}{\pi(w)}$ is applied to the relevance score in order to
mitigate the weight associated to highly frequent tokens.
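
The relevance score can be illustrated with a short worked example in plain
Python. The token list and the attention weights below are made-up values
chosen only to show the frequency normalization; the function name
relevance_scores is hypothetical.

```python
from collections import defaultdict

def relevance_scores(tokens, attention):
    """Sum the attention weight of each word over all positions of the
    stacked documents, then divide by the word's frequency."""
    total = defaultdict(float)
    count = defaultdict(int)
    for word, weight in zip(tokens, attention):
        total[word] += weight
        count[word] += 1
    return {word: total[word] / count[word] for word in total}

# Hypothetical stacked documents (entities treated as single tokens) and
# hypothetical attention weights from the last inference step.
tokens = ["love_jones", "starred", "larenz_tate",
          "the_inkwell", "starred", "larenz_tate"]
attention = [0.40, 0.05, 0.10, 0.30, 0.05, 0.10]

scores = relevance_scores(tokens, attention)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```

Here "starred" and "larenz_tate" occur twice, so their summed weights are
halved, while the two movie titles keep their full attention weight.
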
    The evaluated relevance scores are concatenated in a single vector
representation $z = [\mathrm{score}(w_1), \mathrm{score}(w_2), \ldots, \mathrm{score}(w_{|V|})]$ which is given
in input to the answer prediction neural network defined as:

$$y = \mathrm{sigmoid}(W_{ho}\, \mathrm{relu}(W_{ih} z + b_{ih}) + b_{ho})$$

where $u$ is the hidden layer size, $W_{ih} \in \mathbb{R}^{u \times |V|}$ and $W_{ho} \in \mathbb{R}^{|A| \times u}$ are
weight matrices, $b_{ih} \in \mathbb{R}^u$, $b_{ho} \in \mathbb{R}^{|A|}$ are bias vectors,
$\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function and $\mathrm{relu}(x) = \max(0, x)$ is the
ReLU activation function, which are applied pointwise to the given input vector.
   The neural network weights are supposed to learn latent features which en-
code relationships between the most relevant words for the given query to predict
the correct answers. The outer sigmoid activation function is used to treat the
problem as a multi-label classification problem, so that each candidate answer is
independent and not mutually exclusive. In this way the neural network gener-
ates a score which represents the probability that the candidate answer is correct.
Moreover, differently from [19], the candidate answers in A can be any words, even
those which do not belong to the documents related to the query.
   The model is trained by minimizing the binary cross-entropy loss function
comparing the neural network output y with the target answers for the given
query q represented as a binary vector, in which there is a 1 in the corresponding
position of the correct answer, 0 otherwise.
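
The prediction network and its loss can be sketched in NumPy as follows. The
sizes, the random inputs and the averaging of the loss over candidates are
assumptions made for illustration; the text does not say whether the loss is
summed or averaged.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(z, W_ih, b_ih, W_ho, b_ho):
    """One independent probability per candidate answer (multi-label output)."""
    return sigmoid(W_ho @ relu(W_ih @ z + b_ih) + b_ho)

def binary_cross_entropy(y, target, eps=1e-7):
    """Binary cross-entropy averaged over candidates (averaging is an assumption)."""
    y = np.clip(y, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(y) + (1.0 - target) * np.log(1.0 - y)))

vocab_size, hidden_size, n_candidates = 10, 16, 5  # hypothetical sizes
rng = np.random.default_rng(3)
z = rng.random(vocab_size)  # stand-in for the relevance score vector
W_ih = rng.normal(scale=0.05, size=(hidden_size, vocab_size))
W_ho = rng.normal(scale=0.05, size=(n_candidates, hidden_size))
b_ih = np.zeros(hidden_size)
b_ho = np.zeros(n_candidates)

y = predict(z, W_ih, b_ih, W_ho, b_ho)
target = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # two correct answers
loss = binary_cross_entropy(y, target)
print(y.shape, loss)
```

Because each output passes through its own sigmoid, several candidates can
receive a high probability at the same time, which is what allows multiple
answers for one query.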


3   Experimental evaluation

The model performance is evaluated on the QA and Recs tasks of the bAbI Movie
Dialog dataset using the HITS@k evaluation metric, which is equal to the number
of correct answers in the top-k results. In particular, the performance for the
QA task is evaluated according to HITS@1, while the performance for the Recs
task is evaluated according to HITS@100.
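
A small Python sketch of the metric, following the definition given above
(number of correct answers among the top-k results), is shown below. The
candidate scores are invented for the example, "Heat" is a hypothetical
incorrect candidate, and the function name hits_at_k is not taken from the
dataset tooling.

```python
def hits_at_k(scores, correct, k):
    """Count the correct answers among the k highest-scoring candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for candidate in ranked if candidate in correct)

# Hypothetical model scores for four candidate answers.
scores = {"The Postman": 0.9, "Love Jones": 0.8, "Heat": 0.7, "The Inkwell": 0.2}
correct = {"The Postman", "Love Jones", "The Inkwell"}

print(hits_at_k(scores, correct, k=1))
print(hits_at_k(scores, correct, k=3))
```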
    Differently from [5], the relevant knowledge base facts, taken from the
knowledge base in triple form distributed with the dataset, are retrieved by ψ,
which is implemented using the Elasticsearch engine rather than a hash
lookup operator which applies a strict filtering procedure based on word
frequency. In our work, ψ returns at most the top 30 relevant facts for q. Each
entity in questions and documents is recognized using the list of entities pro-
vided with the dataset and considered as a single word of the dictionary V .
    Questions, answers and documents given in input to the model are prepro-
cessed using the NLTK toolkit [2] performing only word tokenization. The ques-
tion given in input to the ψ operator is preprocessed performing word tokeniza-
tion and stopword removal.
    The optimization method and tricks are adopted from [19]. The model is
trained using the ADAM [9] optimizer (learning rate = 0.001) with a batch size of
128 for at most 100 epochs, keeping the best model and stopping when the HITS@k on
the validation set decreases 5 consecutive times. Dropout [20] is applied on rq
and on rd with a rate of 0.2 and on the prediction neural network hidden layer
with a rate of 0.5. L2 regularization is applied to the embedding matrix X with
a coefficient equal to 0.0001. We clipped the gradients if their norm is greater
than 5 to stabilize learning [14]. Embedding size d is fixed to 50. All GRU output
sizes are fixed to 128. The number of inference steps T is set to 3. The size of
the prediction neural network hidden layer u is fixed to 4096. Biases bih and
bho are initialized to zero vectors. All weight matrices are initialized sampling
from the normal distribution N (0, 0.05). The ReLU activation function in the
prediction neural network has been experimentally chosen comparing different
activation functions such as sigmoid and tanh and taking the one which leads
to the best performance. The model, available on GitHub 1 , is implemented in
TensorFlow [1] and executed on an NVIDIA TITAN X GPU.
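
Two of the training details above, gradient clipping and early stopping, can
be sketched in plain NumPy. The sketch assumes that clipping rescales a
gradient whose L2 norm exceeds 5 and that a "decrease" is measured against the
previous epoch; both readings are assumptions, and the validation scores are
invented for the example.

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

def best_epoch_with_early_stopping(validation_hits, patience=5, max_epochs=100):
    """Return the index of the best epoch seen before training stops.

    Training stops once the validation score has decreased for `patience`
    consecutive epochs, or after `max_epochs` epochs.
    """
    best_epoch, best_hits = 0, float("-inf")
    previous, decreases = float("-inf"), 0
    for epoch, hits in enumerate(validation_hits[:max_epochs]):
        if hits > best_hits:
            best_epoch, best_hits = epoch, hits
        decreases = decreases + 1 if hits < previous else 0
        previous = hits
        if decreases >= patience:
            break
    return best_epoch

clipped = clip_by_norm(np.array([6.0, 8.0]))  # norm 10 is rescaled to norm 5
history = [0.50, 0.60, 0.70, 0.69, 0.68, 0.67, 0.66, 0.65, 0.90]
print(clipped, best_epoch_with_early_stopping(history))
```

In the example the score decreases five times in a row after epoch 2, so
training stops before the final value in the list is ever reached.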


         Methods                       QA task (HITS@1)   Recs task (HITS@100)
         QA SYSTEM                     90.7               N/A
         SVD                           N/A                19.2
         LSTM                          6.5                27.1
         SUPERVISED EMBEDDINGS         50.9               29.2
         MEMN2N                        79.3               28.6
         JOINT SUPERVISED EMBEDDINGS   43.6               28.1
         JOINT MEMN2N                  83.5               26.5
         OUR MODEL                     86.8               30.0
Table 1: Comparison between our model and baselines from [5] on the QA and
Recs tasks evaluated according to HITS@1 and HITS@100, respectively.


    Following the experimental design, the results in Table 1 are promising
because our model outperforms all other systems on both tasks except for the QA
SYSTEM on the QA task. Although the QA SYSTEM has an advantage there, it is a
system carefully designed to handle knowledge base data in the form of triples,
whereas our model can leverage data in the form of documents, without making any
assumption about the form of the input data, and can be applied to different
kinds of tasks. Additionally, the model MEMN2N is a neural network whose weights
are pre-trained on the same dataset without using the long-term memory and
the models JOINT SUPERVISED EMBEDDINGS and JOINT MEMN2N are
models trained across all the tasks of the dataset in order to boost performance.
Despite that, our model outperforms the three above-mentioned ones without
using any supplementary trick. Even though our model performance is higher
than all the others on the Recs task, we believe that the obtained result may be
improved and so we plan a further investigation. Moreover, the need for further
investigation can be justified by the work reported in [18] which describes some
issues regarding the Recs task.
    Figure 1 shows the attention weights computed in the last inference step
of the iterative attention mechanism used by the model to answer a given
question. Attention weights, represented as red boxes with variable color shades
around the tokens, can be used to interpret the reasoning mechanism applied by
the model because more intense shades of red are associated to more relevant
tokens on which the model focuses its attention. It is worth noting that the
attention weights associated to each token are the result of the inference
mechanism uncovered by the model, which progressively tries to focus on the
relevant aspects of the query and the documents which are exploited to generate
the answers.
1 https://github.com/nlp-deepqa/imnamap
Question:
what does Larenz Tate act in ?

Ground truth answers:
The Postman, A Man Apart, Dead Presidents, Love Jones, Why Do Fools Fall in Love, The Inkwell

Most relevant sentences:

 – The Inkwell starred actors Joe Morton , Larenz Tate , Suzzanne Douglas , Glynn Turman
 – Love Jones starred actors Nia Long , Larenz Tate , Isaiah Washington , Lisa Nicole Carson
 – Why Do Fools Fall in Love starred actors Halle Berry , Vivica A. Fox , Larenz Tate , Lela Rochon
 – The Postman starred actors Kevin Costner , Olivia Williams , Will Patton , Larenz Tate
 – Dead Presidents starred actors Keith David , Chris Tucker , Larenz Tate
 – A Man Apart starred actors Vin Diesel , Larenz Tate


Fig. 1: Attention weights over the query tokens $\tilde{q}_i$ and the document tokens
$\tilde{D}_{q_i}$ computed by the neural network attention mechanisms at the last
inference step $T$. More intense shades correspond to higher relevance scores for
the related tokens.


    Given the question “what does Larenz Tate act in?” shown in the above-
mentioned figure, the model is able to understand that “Larenz Tate” is the
subject of the question and “act in” represents the intent of the question. Reading
the related documents, the model associates higher attention weights to the most
relevant tokens needed to answer the question, such as “The Postman”, “A Man
Apart” and so on.


4     Related work

We think that it is necessary to consider models and techniques coming from
research both in QA and recommender systems in order to pursue our desire
to build an intelligent agent able to assist the user in decision-making tasks.
We cannot fill the gap between the above-mentioned research areas if we do
not consider the proposed models in a synergic way by virtue of the proposed
analogy between the user profile (the set of user preferences) and the items to be
recommended, as the question and the correct answers. The first work which goes
in this direction is reported in [11], which exploits movie descriptions to suggest
appealing movies for a given user using an architecture typically used for QA
tasks. In fact, most of the research in the recommender systems field presents
ad-hoc systems which exploit neighbourhood information like in Collaborative
Filtering techniques [13], item descriptions and metadata like in Content-based
systems [6]. Recently presented neural network models [3, 4] are able
to learn latent representations in the network weights leveraging information
coming from user preferences and item information.
    Recently, a lot of effort has been devoted to creating benchmarks for artificial
agents to assess their ability to comprehend natural language and to reason over
facts. One of the first attempts is the bAbI [23] dataset, which is a synthetic
dataset containing elementary tasks such as selecting an answer between one or
more candidate facts, answering yes/no questions, counting operations over lists
and sets and basic induction and deduction tasks. Another relevant benchmark
is the one described in [7], which provides CNN/Daily Mail datasets consisting
of document-query-answer triples where an entity in the query is replaced by
a placeholder and the system should identify the correct entity by reading and
comprehending the given document. MCTest [16] requires machines to answer
multiple-choice reading comprehension questions about fictional stories, directly
tackling the high-level goal of open-domain machine comprehension. Finally,
SQuAD [15] consists of a set of Wikipedia articles, where the answer to each
question is a segment of text from the corresponding reading passage.
    According to the experimental evaluations conducted on the above-mentioned
datasets, high-level performance can be obtained exploiting complex attention
mechanisms which are able to focus on relevant evidence in the processed
content. One of the earlier approaches used to solve these tasks is given by the
general Memory Network [21, 24] framework which is one of the first neural net-
work models able to access external memories to extract relevant information
through an attention mechanism and to use them to provide the correct answer.
A deep Recurrent Neural Network with Long Short-Term Memory units is pre-
sented in [7], which solves CNN/Daily Mail datasets by designing two different
attention mechanisms called Impatient Reader and Attentive Reader. Another
way to incorporate attention in neural network models is proposed in [8] which
defines a pointer-sum loss whose aim is to maximize the attention weights which
lead to the correct answer.


5   Conclusions and Future Work

In this work we propose a novel model based on Artificial Neural Networks to an-
swer questions with multiple answers by exploiting multiple facts retrieved from
a knowledge base. The proposed model can be considered a relevant building
block of a conversational recommender system. Differently from [19], our model
can consider multiple documents as a source of information in order to gener-
ate multiple answers which may not belong to the documents. As presented in
this work, common tasks such as QA and top-n recommendation can be solved
effectively by our model.
    In a common recommendation system scenario, when a user enters a search
query, it is assumed that their preferences are known. This is a stringent
requirement because users may not have a clear idea of their preferences at that
point. Conversational recommender systems help users fulfill their information
needs through an interactive process. In this way, the system can provide a per-
sonalized experience dynamically adapting the user model with the possibility
to enhance the generated predictions. Moreover, the system capability can be
further enhanced giving explanations to the user about the given suggestions.
    To reach our goal, we should improve our model by designing a ψ operator
able to return relevant facts recognizing the most relevant information in the
query, by exploiting user preferences and contextual information to learn the user
model and by providing a mechanism which leverages attention weights to give
explanations. In order to effectively train our model, we plan to collect real
dialog data containing contextual information associated to each user and
feedback for each dialog which indicates whether the user is satisfied with the
conversation. Given these enhancements, we should be able to design a system
that can effectively hold a dialog with the user, recognizing their intent and
providing them with the most suitable contents.
    With this work we try to show the effectiveness of our architecture for tasks
which go from pure question answering to top-n recommendation through an
experimental evaluation without any assumption on the task to be solved. To
do that, we do not use any hand-crafted linguistic features but we let the sys-
tem learn and leverage them in the inference process which leads to the an-
swers through multiple reasoning steps. During these steps, the system under-
stands relevant relationships between question and documents without relying
on canonical matching, but repeating an attention mechanism able to uncover
related aspects in distributed representations, conditioned on an encoding of the
inference process given by another neural network. Equipping agents with a rea-
soning mechanism like the one described in this work and exploiting the ability
of neural network models to learn from data, we may be able to create truly
intelligent agents.


6    Acknowledgments
This work is supported by the IBM Faculty Award "Deep Learning to boost
Cognitive Question Answering". The Titan X GPU used for this research was
donated by the NVIDIA Corporation.


References
 1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
    G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I.J., Harp, A.,
    Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg,
    J., Mané, D., Monga, R., Moore, S., Murray, D.G., Olah, C., Schuster, M., Shlens,
    J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.A., Vanhoucke, V., Vasudevan,
    V., Viégas, F.B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y.,
    Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous distributed
    systems. CoRR abs/1603.04467 (2016)
 2. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL
    on Interactive presentation sessions. pp. 69–72. Association for Computational Lin-
    guistics (2006)
 3. Cheng, H., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson,
    G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X.,
    Shah, H.: Wide & deep learning for recommender systems. CoRR abs/1606.07792
    (2016)
 4. Covington, P., Adams, J., Sargin, E.: Deep neural networks for YouTube
    recommendations. In: Proceedings of the 10th ACM Conference on Recommender Systems.
    New York, NY, USA (2016)
 5. Dodge, J., Gane, A., Zhang, X., Bordes, A., Chopra, S., Miller, A., Szlam, A.,
    Weston, J.: Evaluating prerequisite qualities for learning end-to-end dialog systems.
    arXiv preprint arXiv:1511.06931 (2015)
 6. de Gemmis, M., Lops, P., Musto, C., Narducci, F., Semeraro, G.: Semantics-aware
    content-based recommender systems. In: Recommender Systems Handbook, pp.
    119–159. Springer (2015)
 7. Hermann, K.M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman,
    M., Blunsom, P.: Teaching machines to read and comprehend. In: Advances in
    Neural Information Processing Systems. pp. 1693–1701 (2015)
 8. Kadlec, R., Schmid, M., Bajgar, O., Kleindienst, J.: Text understanding with the
    attention sum reader network. arXiv preprint arXiv:1603.01547 (2016)
 9. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
    arXiv:1412.6980 (2014)
10. Mahmood, T., Ricci, F.: Improving recommender systems with adaptive conversa-
    tional strategies. In: Proceedings of the 20th ACM conference on Hypertext and
    hypermedia. pp. 73–82. ACM (2009)
11. Musto, C., Greco, C., Suglia, A., Semeraro, G.: Ask me any rating: A content-based
    recommender system based on recurrent neural networks. In: Proceedings of the
    7th Italian Information Retrieval Workshop, Venezia, Italy, May 30-31, 2016.
12. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann
    machines. In: Proceedings of the 27th International Conference on Machine Learning
    (ICML-10). pp. 807–814 (2010)
13. Ning, X., Desrosiers, C., Karypis, G.: A comprehensive survey of neighborhood-
    based recommendation methods. In: Recommender Systems Handbook, pp. 37–76.
    Springer (2015)
14. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural
    networks. ICML (3) 28, 1310–1318 (2013)
15. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for
    machine comprehension of text. CoRR abs/1606.05250 (2016)
16. Richardson, M., Burges, C.J.C., Renshaw, E.: MCTest: A challenge dataset for the
    open-domain machine comprehension of text. In: EMNLP (2013)
17. Rubens, N., Kaplan, D., Sugiyama, M.: Active learning in recommender systems.
    In: Recommender Systems Handbook, pp. 809–846. Springer (2015)
18. Searle, R., Bingham-Walker, M.: Why “blow out”? a structural analysis of the movie
    dialog dataset. ACL 2016 p. 215 (2016)
19. Sordoni, A., Bachman, P., Bengio, Y.: Iterative alternating neural attention for
    machine reading. arXiv preprint arXiv:1606.02245 (2016)
20. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
    Dropout: a simple way to prevent neural networks from overfitting. Journal of
    Machine Learning Research 15(1), 1929–1958 (2014)
21. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-end memory networks.
    In: Advances in neural information processing systems. pp. 2440–2448 (2015)
22. Taylor, W.L.: Cloze procedure: a new tool for measuring readability. Journalism
    and Mass Communication Quarterly 30(4), 415 (1953)
23. Weston, J., Bordes, A., Chopra, S., Mikolov, T.: Towards AI-complete question
    answering: A set of prerequisite toy tasks. CoRR abs/1502.05698 (2015)
24. Weston, J., Chopra, S., Bordes, A.: Memory networks. arXiv preprint
    arXiv:1410.3916 (2014)

</pre>