<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Iterative Multi-document Neural Attention for Multiple Answer Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Greco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Suglia</string-name>
          <email>alessandro.suglia@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaetano Rossiello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via E. Orabona 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>People have information needs of varying complexity, which can be satisfied by an intelligent agent able to answer questions formulated in a proper way, possibly considering user context and preferences. In a scenario in which the user profile can be considered as a question, intelligent agents able to answer questions can be used to find the most relevant answers for a given user. In this work we propose a novel model based on Artificial Neural Networks to answer questions with multiple answers by exploiting multiple facts retrieved from a knowledge base. The model is evaluated on the factoid Question Answering and top-n recommendation tasks of the bAbI Movie Dialog dataset. After assessing the performance of the model on both tasks, we try to define the long-term goal of a conversational recommender system able to interact using natural language and to support users in their information-seeking processes in a personalized way.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We are surrounded by a huge variety of technological artifacts which “live” with
us today. These artifacts can help us in several ways because they have the
power to accomplish complex and time-consuming tasks. Unfortunately,
common software systems can do for us only specific types of tasks, in a strictly
algorithmic way which is pre-defined by the software designer. Machine
Learning (ML), a branch of Artificial Intelligence (AI), gives machines the ability to
learn to complete tasks without being explicitly programmed.</p>
      <p>People have information needs of varying complexity, ranging from simple
questions about common facts which can be found in encyclopedias, to more
sophisticated cases in which they need to know what movie to watch during
a romantic evening. These tasks can be solved by an intelligent agent able to
answer questions formulated in a proper way, possibly considering user context
and preferences.</p>
      <p>
        Question Answering (QA) emerged in the last decade as one of the most
promising fields in AI, since it makes it possible to design intelligent systems
able to give correct answers to user questions expressed in natural language.
Recommender systems, on the other hand, produce individualized recommendations
and guide the user in a personalized way to interesting or useful objects in a
large space of possible options. In a scenario in which the user profile (the set
of user preferences) can be represented by a question, intelligent agents able to
answer questions can be used to find the most appealing items for a given user,
which is the classical task that recommender systems solve. Despite their
efficacy, classical recommender systems are generally unable to hold a
conversation with the user, so they miss the opportunity to understand the user's
contextual information, emotions and feedback, to refine the user profile and to
provide enhanced suggestions. Conversational recommender systems assist
online users in their information-seeking and decision-making tasks
by supporting an interactive process [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which can be goal-oriented, starting general and, through a series of
interaction cycles, narrowing down the user's interests until the desired item
is obtained [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>In this work we propose a novel model based on Artificial Neural Networks to
answer questions exploiting multiple facts retrieved from a knowledge base and
evaluate it on a QA task. Moreover, the effectiveness of the model is evaluated on
the top-n recommendation task, where the aim of the system is to produce a list
of suggestions ranked according to the user preferences. After having assessed
the performance of the model on both tasks, we try to define the long-term goal
of a conversational recommender system able to interact with the user using
natural language and supporting him in the information seeking process in a
personalized way.</p>
      <p>
        In order to fulfill our long-term goal of building a conversational recommender
system, we need to assess the performance of our model on specific tasks involved
in this scenario. A recent work which goes in this direction is reported in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
which presents the bAbI Movie Dialog dataset, composed of different tasks such
as factoid QA, top-n recommendation and two more complex tasks, one which
mixes QA and recommendation and one which contains turns of dialogs taken
from Reddit. Having the more specific tasks of QA and recommendation, plus a
more complex one which mixes both, gives us the possibility to evaluate
our model at different levels of granularity. Moreover, the subdivision into turns
of the more complex task provides a proper benchmark of the model's capability
to handle an effective dialog with the user.
      </p>
      <p>
        For the QA task, many datasets have been released in order to
assess machine reading and comprehension capabilities, and many
neural network-based models have been proposed. Our model takes inspiration
from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which is able to answer Cloze-style [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] questions by repeating an
attention mechanism over the query and the documents multiple times. Despite
its effectiveness on the Cloze-style task, the original model does not consider
multiple documents as a source of information to answer questions, which is
fundamental in order to extract the answer from different relevant facts. The
restrictive assumption that the answer is contained in the given document does
not allow the model to provide an answer which does not belong to the
document. Moreover, this kind of task does not expect multiple answers for a given
question, which is important for the complex information needs of a
conversational recommender system.
      </p>
      <p>
        According to our vision, the main outcomes of our work can be considered as
building blocks for a conversational recommender system and can be summarized
as follows:
1. we extend the model reported in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to let the inference process exploit
evidence observed in multiple documents coming from an external knowledge
base represented as a collection of textual documents;
2. we design a model able to leverage the attention weights generated by the
inference process to provide multiple answers, which do not necessarily
belong to the documents, through a multi-layer neural network which may
uncover possible relationships between the most relevant evidence;
3. we assess the efficacy of our model through an experimental evaluation on
the factoid QA and top-n recommendation tasks, supporting our hypothesis that
a QA model can be used to solve top-n recommendation, too.
      </p>
      <p>The paper is organized as follows: Section 2 describes our model, while
Section 3 summarizes the evaluation of the model on the two above-mentioned tasks
and the comparison with state-of-the-art approaches. Section 4 gives
an overview of the literature on both QA and recommender systems, while final
remarks and our long-term vision are reported in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>Given a query q, let σ : Q → D be an operator that produces the set of documents
relevant for q, where Q is the set of all queries and D is the set of all
documents. Our model defines a workflow in which a sequence of inference steps is
performed in order to extract relevant information from σ(q) and generate the
answers for q.</p>
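      <p>The σ operator can be sketched as a simple retrieval function. The following is a hypothetical, minimal Python sketch that ranks documents by term overlap with the query; the actual implementation described in Section 3 uses the Elasticsearch engine, and all names here are illustrative assumptions.</p>

```python
def sigma(query_tokens, corpus, k=30):
    """Toy stand-in for the sigma operator: rank documents by how many
    query terms they contain and return the ids of the top-k matches.

    `corpus` maps document ids to token lists; the real system delegates
    this step to Elasticsearch."""
    def overlap(doc_tokens):
        return len(set(query_tokens) & set(doc_tokens))

    ranked = sorted(corpus.items(), key=lambda item: overlap(item[1]),
                    reverse=True)
    return [doc_id for doc_id, tokens in ranked[:k] if overlap(tokens) > 0]
```

      <p>The cap of 30 documents mirrors the setting reported in the experimental evaluation.</p>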
      <p>
        Following [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], our workflow consists of three steps: (1) the encoding phase,
which generates meaningful representations for the query and the documents; (2) the
inference phase, which extracts relevant semantic relationships between the query
and the documents by using an iterative attention mechanism; and (3) the
prediction phase, which generates a score for each candidate answer.
      </p>
      <sec id="sec-2-1">
        <title>Encoding phase</title>
        <p>The input of the encoding phase is given by a query q and a set of documents
σ(q) = {d₁, d₂, …, d_{|Dq|}}, denoted by Dq. Both queries and documents are represented by
a sequence of words X = (x₁, x₂, …, x_{|X|}) drawn from a vocabulary V. Each
word is represented by a continuous d-dimensional word embedding x ∈ ℝ^d
stored in a word embedding matrix X ∈ ℝ^{|V|×d}.</p>
        <p>
          The sequences of dense representations for q and dj are encoded using a
bidirectional recurrent neural network encoder with Gated Recurrent Units (GRU)
as in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which represents each word xₖ ∈ X as the concatenation of a forward
encoding h⃗ₖ ∈ ℝ^h and a backward encoding h⃖ₖ ∈ ℝ^h. From now on, we denote
the contextual representation for the word qᵢ by q̃ᵢ ∈ ℝ^{2h} and the contextual
representation for the word d_{j,i} in the document dj by d̃_{j,i} ∈ ℝ^{2h}. Differently
from [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], we build a unique representation for the whole set of documents Dq
related to the query q by stacking each contextual representation d̃_{j,i}, obtaining
a matrix D̃q ∈ ℝ^{l×2h}, where l = |d₁| + |d₂| + … + |d_{|Dq|}|.
        </p>
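        <p>The two structural operations above (token-wise concatenation of forward and backward encodings, and stacking of per-document representations into D̃q) can be sketched as follows; this is an illustrative sketch on plain Python lists, with the actual GRU outputs left as inputs.</p>

```python
def concat_bidirectional(forward, backward):
    """Token-wise concatenation of forward and backward encodings:
    each resulting contextual representation lives in R^{2h}."""
    return [f + b for f, b in zip(forward, backward)]

def stack_documents(doc_reprs):
    """Stack the contextual representations of every retrieved document
    into a single matrix D~q of shape l x 2h, where l is the total
    number of tokens across all documents."""
    stacked = []
    for doc in doc_reprs:
        stacked.extend(doc)
    return stacked
```
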
      </sec>
      <sec id="sec-2-2">
        <title>Inference phase</title>
        <p>This phase uncovers a possible inference chain which models meaningful
relationships between the query and the set of related documents. The inference chain
is obtained by performing, for each inference step t = 1; 2; : : : ; T , the attention
mechanisms given by the query attentive read and the document attentive read
keeping a state of the inference process given by an additional recurrent neural
network with GRU units. In this way, the network is able to progressively refine
the attention weights focusing on the most relevant tokens of the query and the
documents which are exploited by the prediction neural network to select the
correct answers among the candidate ones.</p>
        <p>Query attentive read. Given the contextual representations for the query
words (q̃₁, q̃₂, …, q̃_{|q|}) and the inference GRU state s_{t−1} ∈ ℝ^s, we obtain a
refined query representation q_t (query glimpse) by performing an attention
mechanism over the query at inference step t:</p>
        <p>q̂_{i,t} = softmax_{i=1,…,|q|} q̃ᵢ⊤ (A_q s_{t−1} + a_q),   q_t = Σᵢ q̂_{i,t} q̃ᵢ</p>
        <p>where q̂_{i,t} are the attention weights associated to the query words, and A_q ∈ ℝ^{2h×s}
and a_q ∈ ℝ^{2h} are respectively a weight matrix and a bias vector used
to perform the bilinear product with the query token representations q̃ᵢ. The
attention weights can be interpreted as relevance scores for each word of the
query, dependent on the inference state s_{t−1} at the current inference step t.</p>
        <p>Document attentive read. Given the query glimpse q_t and the inference GRU
state s_{t−1} ∈ ℝ^s, we perform an attention mechanism over the contextual
representations of the words of the stacked documents D̃q:</p>
        <p>d̂_{i,t} = softmax_{i=1,…,l} D̃_{q,i}⊤ (A_d [s_{t−1}; q_t] + a_d),   d_t = Σᵢ d̂_{i,t} D̃_{q,i}</p>
        <p>where D̃_{q,i} is the i-th row of D̃q, d̂_{i,t} are the attention weights associated to
the document words, and A_d ∈ ℝ^{2h×(s+2h)} and a_d ∈ ℝ^{2h} are respectively a weight
matrix and a bias vector used to perform the bilinear product with the
document token representations D̃_{q,i}. The attention weights can be interpreted
as relevance scores for each word of the documents, conditioned on both
the query glimpse and the inference state s_{t−1} at the current inference step t.
By combining the set of relevant documents in D̃q, we obtain the probability
distribution (d̂_{1,t}, d̂_{2,t}, …, d̂_{l,t}) over all the relevant document tokens using the
above-mentioned attention mechanism.</p>
        <p>
          Gating search results. The inference GRU state at inference step t is
updated according to s_t = GRU([r_q ⊙ q_t; r_d ⊙ d_t], s_{t−1}), where r_q and r_d are the
results of a gating mechanism obtained by evaluating g([s_{t−1}; q_t; d_t; q_t ⊙ d_t]) for
the query and the documents, respectively. The gating function g : ℝ^{s+6h} → ℝ^{2h}
is defined as a 2-layer feed-forward neural network with a Rectified Linear Unit
(ReLU) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] activation function in the hidden layer and a sigmoid activation
function in the output layer. The purpose of the gating mechanism is to retain
the information about the query and the documents which is useful for the
inference process and to forget the useless information.
        </p>
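        <p>Both attentive reads follow the same pattern: score each contextual vector against a projected key, normalize the scores with a softmax, and form a weighted-sum glimpse. A minimal pure-Python sketch, which folds the bilinear projection (A·[…] + a) into a precomputed key vector as an illustrative simplification:</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_read(reprs, key):
    """Score each contextual representation against `key` (a stand-in for
    A_q s_{t-1} + a_q, or A_d [s_{t-1}; q_t] + a_d for documents), then
    return the attention weights and the weighted-sum glimpse."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    weights = softmax([dot(r, key) for r in reprs])
    dim = len(reprs[0])
    glimpse = [sum(w * r[j] for w, r in zip(weights, reprs))
               for j in range(dim)]
    return weights, glimpse
```

      <p>Calling this once over the query representations and once over the stacked document representations yields q_t and d_t for a single inference step.</p>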
      </sec>
      <sec id="sec-2-3">
        <title>Prediction phase</title>
        <p>
          The prediction phase, which is completely different from the pointer-sum loss
reported in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], generates, given the query q, a relevance score for each
candidate answer a ∈ A by using the document attention weights d̂_{i,T} computed
in the last inference step T. The relevance score of each word w is obtained by
summing the attention weights of w in each document related to q. Formally, the
relevance score for a given word w is defined as:
        </p>
        <p>score(w) = (1/φ(w)) Σ_{i=1}^{l} δ(i, w)</p>
        <p>where δ(i, w) returns d̂_{i,T} if π(i) = w and 0 otherwise, π(i) returns the word in
position i of the stacked documents matrix D̃q, and φ(w) returns the frequency of
the word w in the documents Dq related to the query q. The relevance score takes
into account the importance of token occurrences in the considered documents,
given by the computed attention weights. Moreover, the normalization term 1/φ(w)
is applied to the relevance score in order to mitigate the weight associated to
highly frequent tokens.</p>
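        <p>The frequency-normalized relevance score can be computed directly from the stacked document tokens and the final-step attention weights; a minimal sketch, assuming flat Python lists for the token sequence and the weights:</p>

```python
from collections import Counter

def relevance_scores(tokens, attention):
    """score(w) = (1/phi(w)) * sum_i delta(i, w): sum the final-step
    attention weight of every occurrence of w in the stacked documents,
    then divide by the frequency of w to damp highly frequent tokens."""
    freq = Counter(tokens)          # phi(w)
    summed = Counter()
    for token, weight in zip(tokens, attention):
        summed[token] += weight     # accumulates delta(i, w) over positions
    return {w: summed[w] / freq[w] for w in freq}
```
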
        <p>The evaluated relevance scores are concatenated in a single vector
representation z = [score(w1); score(w2); : : : ; score(wjV j)] which is given in input to the
answer prediction neural network defined as:</p>
        <p>y = sigmoid(W_ho relu(W_ih z + b_ih) + b_ho)</p>
        <p>where u is the hidden layer size, W_ih ∈ ℝ^{u×|V|} and W_ho ∈ ℝ^{|A|×u} are weight
matrices, b_ih ∈ ℝ^u and b_ho ∈ ℝ^{|A|} are bias vectors, sigmoid(x) = 1/(1 + e^{−x}) is the
sigmoid function and relu(x) = max(0, x) is the ReLU activation function, both
applied pointwise to the given input vector.</p>
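        <p>A pure-Python sketch of the two-layer prediction network follows; the weights are passed in explicitly here, whereas in the model they are learned parameters.</p>

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def predict(z, W_ih, b_ih, W_ho, b_ho):
    """y = sigmoid(W_ho relu(W_ih z + b_ih) + b_ho): one independent
    probability per candidate answer (multi-label output)."""
    hidden = [relu(a + b) for a, b in zip(matvec(W_ih, z), b_ih)]
    out = [a + b for a, b in zip(matvec(W_ho, hidden), b_ho)]
    return [sigmoid(x) for x in out]
```
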
        <p>
          The neural network weights are supposed to learn latent features which
encode relationships between the most relevant words for the given query, in order
to predict the correct answers. The outer sigmoid activation function is used to treat the
problem as a multi-label classification problem, so that each candidate answer is
independent and not mutually exclusive. In this way the neural network
generates a score which represents the probability that the candidate answer is correct.
Moreover, differently from [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], the candidate answers in A can be any word, even
those which do not belong to the documents related to the query.
        </p>
        <p>The model is trained by minimizing the binary cross-entropy loss function,
comparing the neural network output y with the target answers for the given
query q represented as a binary vector which contains a 1 in the position
corresponding to each correct answer and 0 elsewhere.</p>
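        <p>The training loss can be sketched as follows, building the multi-hot target vector from the set of correct answers; this is a hypothetical helper, not the authors' code.</p>

```python
import math

def binary_cross_entropy(y_pred, answers, correct):
    """Mean binary cross-entropy between the sigmoid outputs y_pred and a
    multi-hot target containing 1 for each correct answer, 0 elsewhere."""
    target = [1.0 if a in correct else 0.0 for a in answers]
    eps = 1e-12  # guard against log(0)
    total = -sum(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
                 for t, p in zip(target, y_pred))
    return total / len(answers)
```
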
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental evaluation</title>
      <p>The model performance is evaluated on the QA and Recs tasks of the bAbI Movie
Dialog dataset using the HITS@k evaluation metric, which is equal to the number
of correct answers in the top-k results. In particular, the performance on the
QA task is evaluated according to HITS@1, while the performance on the Recs
task is evaluated according to HITS@100.</p>
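      <p>HITS@k can be computed directly from a ranked list of predicted answers; a minimal sketch:</p>

```python
def hits_at_k(ranked_answers, correct_answers, k):
    """HITS@k: number of correct answers among the top-k predictions."""
    return len(set(ranked_answers[:k]) & set(correct_answers))
```
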
      <p>
        Differently from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the relevant knowledge base facts, taken from the
knowledge base in triple form distributed with the dataset, are retrieved by σ,
implemented by exploiting the Elasticsearch engine, and not according to a hash
lookup operator which applies a strict filtering procedure based on word
frequency. In our work, σ returns at most the top 30 relevant facts for q. Each
entity in questions and documents is recognized using the list of entities
provided with the dataset and is considered as a single word of the dictionary V.
      </p>
      <p>
        Questions, answers and documents given in input to the model are
preprocessed using the NLTK toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], performing only word tokenization. The
question given in input to the σ operator is preprocessed by performing word
tokenization and stopword removal.
      </p>
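      <p>The entity-recognition step (treating each, possibly multi-word, entity name as a single vocabulary word) can be sketched as a greedy longest-match pass over the token stream; the underscore-joining convention is an assumption of this sketch, not a detail stated in the paper.</p>

```python
def tokenize_with_entities(text, entities):
    """Split `text` on whitespace, then greedily merge spans that match an
    entity from the dataset's entity list into a single token."""
    words = text.split()
    # Longer entity names first, so "Larenz Tate" wins over a shorter match.
    entity_words = sorted((e.split() for e in entities), key=len, reverse=True)
    merged, i = [], 0
    while i < len(words):
        for ent in entity_words:
            if words[i:i + len(ent)] == ent:
                merged.append("_".join(ent))
                i += len(ent)
                break
        else:
            merged.append(words[i])
            i += 1
    return merged
```
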
      <p>
        The optimization method and tricks are adopted from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The model is
trained using the ADAM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] optimizer (learning rate = 0.001) with a batch size of
128 for at most 100 epochs, keeping the best model until the HITS@k on the
validation set decreases for 5 consecutive times. Dropout [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is applied on r_q
and on r_d with a rate of 0.2 and on the prediction neural network hidden layer
with a rate of 0.5. L2 regularization is applied to the embedding matrix X with
a coefficient equal to 0.0001. We clip the gradients if their norm is greater
than 5 to stabilize learning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The embedding size d is fixed to 50. All GRU output
sizes are fixed to 128. The number of inference steps T is set to 3. The size of
the prediction neural network hidden layer u is fixed to 4096. The biases b_ih and
b_ho are initialized to zero vectors. All weight matrices are initialized by sampling
from the normal distribution N(0, 0.05). The ReLU activation function in the
prediction neural network has been chosen experimentally, comparing different
activation functions such as sigmoid and tanh and taking the one which leads
to the best performance. The model, available on GitHub¹, is implemented in
TensorFlow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and executed on an NVIDIA TITAN X GPU.
      </p>
      <p>
        Following the experimental design, the results in Table 1 are promising:
our model outperforms all the other systems on both tasks except for the QA
SYSTEM on the QA task. Although the QA SYSTEM has an advantage there, it is a
system carefully designed to handle knowledge base data in the form of triples,
whereas our model can leverage data in the form of documents, without making any
assumption about the form of the input data, and can be applied to different kinds
of tasks. Additionally, the model MEMN2N is a neural network whose weights
are pre-trained on the same dataset without using the long-term memory, and
the models JOINT SUPERVISED EMBEDDINGS and JOINT MEMN2N are
trained across all the tasks of the dataset in order to boost performance.
Despite that, our model outperforms the three above-mentioned ones without
using any supplementary trick. Even though our model's performance is higher
than all the others on the Recs task, we believe that the obtained result may be
improved, and so we plan a further investigation. Moreover, the need for further
investigation is justified by the work reported in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], which describes some
issues regarding the Recs task.
      </p>
      <p>Figure 1 shows the attention weights computed in the last inference step
of the iterative attention mechanism used by the model to answer a given
question. Attention weights, represented as red boxes with variable color shades
around the tokens, can be used to interpret the reasoning mechanism applied by
the model, because higher shades of red are associated to the more relevant tokens
on which the model focuses its attention. It is worth noticing that the attention
weights associated to each token are the result of the inference mechanism
uncovered by the model, which progressively tries to focus on the relevant aspects
of the query and the documents which are exploited to generate the answers.</p>
      <p>¹ https://github.com/nlp-deepqa/imnamap</p>
      <p>Question:
what does Larenz Tate act in ?
Ground truth answers:
The Postman, A Man Apart, Dead Presidents, Love Jones, Why Do Fools Fall in Love, The Inkwell
Most relevant sentences:
– The Inkwell starred actors Joe Morton , Larenz Tate , Suzzanne Douglas , Glynn Turman
– Love Jones starred actors Nia Long , Larenz Tate , Isaiah Washington , Lisa Nicole Carson
– Why Do Fools Fall in Love starred actors Halle Berry , Vivica A. Fox , Larenz Tate , Lela Rochon
– The Postman starred actors Kevin Costner , Olivia Williams , Will Patton , Larenz Tate
– Dead Presidents starred actors Keith David , Chris Tucker , Larenz Tate
– A Man Apart starred actors Vin Diesel , Larenz Tate</p>
      <p>Given the question “what does Larenz Tate act in?” shown in the
above-mentioned figure, the model is able to understand that “Larenz Tate” is the
subject of the question and “act in” represents the intent of the question. Reading
the related documents, the model associates higher attention weights to the most
relevant tokens needed to answer the question, such as “The Postman”, “A Man
Apart” and so on.</p>
    </sec>
    <sec id="sec-4">
      <title>Related work</title>
      <p>
        We think that it is necessary to consider models and techniques coming from
research in both QA and recommender systems in order to pursue our goal
of building an intelligent agent able to assist the user in decision-making tasks.
We cannot fill the gap between the above-mentioned research areas unless we
consider the proposed models in a synergic way, by virtue of the proposed
analogy between the user profile (the set of user preferences) and the items to be
recommended on one side, and the question and the correct answers on the other.
The first work which goes in this direction is reported in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which exploits movie descriptions to suggest
appealing movies for a given user using an architecture typically used for QA
tasks. In fact, most of the research in the recommender systems field presents
ad-hoc systems which exploit neighbourhood information, as in Collaborative
Filtering techniques [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or item descriptions and metadata, as in Content-based
systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Recently presented neural network models [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] are able
to learn latent representations in the network weights by leveraging information
coming from user preferences and item information.
      </p>
      <p>
        Recently, a lot of effort has been devoted to creating benchmarks for artificial
agents to assess their ability to comprehend natural language and to reason over
facts. One of the first attempts is the bAbI [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] dataset, a synthetic
dataset containing elementary tasks such as selecting an answer among one or
more candidate facts, answering yes/no questions, counting operations over lists
and sets, and basic induction and deduction tasks. Another relevant benchmark
is the one described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which provides the CNN/Daily Mail datasets consisting
of document-query-answer triples where an entity in the query is replaced by
a placeholder and the system should identify the correct entity by reading and
comprehending the given document. MCTest [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] requires machines to answer
multiple-choice reading comprehension questions about fictional stories, directly
tackling the high-level goal of open-domain machine comprehension. Finally,
SQuAD [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] consists of a set of Wikipedia articles, where the answer to each
question is a segment of text from the corresponding reading passage.
      </p>
      <p>
        According to the experimental evaluations conducted on the above-mentioned
datasets, high-level performance can be obtained by exploiting complex attention
mechanisms which are able to focus on relevant evidence in the processed
content. One of the earlier approaches used to solve these tasks is the
general Memory Network [
        <xref ref-type="bibr" rid="ref21 ref24">21, 24</xref>
        ] framework, one of the first neural
network models able to access external memories to extract relevant information
through an attention mechanism and to use it to provide the correct answer.
A deep Recurrent Neural Network with Long Short-Term Memory units is
presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which solves the CNN/Daily Mail datasets by designing two different
attention mechanisms called Impatient Reader and Attentive Reader. Another
way to incorporate attention in neural network models is proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which
defines a pointer-sum loss whose aim is to maximize the attention weights which
lead to the correct answer.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>
        In this work we propose a novel model based on Artificial Neural Networks to
answer questions with multiple answers by exploiting multiple facts retrieved from
a knowledge base. The proposed model can be considered a relevant building
block of a conversational recommender system. Differently from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], our model
can consider multiple documents as a source of information in order to
generate multiple answers which may not belong to the documents. As presented in
this work, common tasks such as QA and top-n recommendation can be solved
effectively by our model.
      </p>
      <p>In a common recommendation system scenario, when a user enters a search
query, it is assumed that their preferences are already known. This is a stringent
requirement, because users may not have a clear idea of their preferences at that
point. Conversational recommender systems support users in fulfilling their
information needs through an interactive process. In this way, the system can provide a
personalized experience, dynamically adapting the user model with the possibility
of enhancing the generated predictions. Moreover, the system capability can be
further enhanced by giving the user explanations of the given suggestions.</p>
      <p>To reach our goal, we should improve our model by designing a σ operator
able to return relevant facts by recognizing the most relevant information in the
query, by exploiting user preferences and contextual information to learn the user
model, and by providing a mechanism which leverages attention weights to give
explanations. In order to effectively train our model, we plan to collect real dialog
data containing contextual information associated to each user and, for each
dialog, feedback which indicates whether the user is satisfied with the conversation.
Given these enhancements, we should be able to design a system which holds an
effective dialog with the user, recognizing his intent and providing him with the
most suitable content.</p>
      <p>With this work we try to show the effectiveness of our architecture on tasks
which range from pure question answering to top-n recommendation, through an
experimental evaluation without any assumption on the task to be solved. To
do that, we do not use any hand-crafted linguistic features, but we let the
system learn and leverage them in the inference process which leads to the
answers through multiple reasoning steps. During these steps, the system
understands relevant relationships between the question and the documents without relying
on canonical matching, but by repeating an attention mechanism able to uncover
related aspects in distributed representations, conditioned on an encoding of the
inference process given by another neural network. By equipping agents with a
reasoning mechanism like the one described in this work and exploiting the ability
of neural network models to learn from data, we may be able to create truly
intelligent agents.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the IBM Faculty Award "Deep Learning to boost
Cognitive Question Answering". The Titan X GPU used for this research was
donated by the NVIDIA Corporation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Józefowicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kudlur</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mané</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monga</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steiner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwar</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tucker</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viégas</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Warden</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wicke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>TensorFlow: Large-scale machine learning on heterogeneous distributed systems</article-title>
          .
          <source>CoRR abs/1603.04467</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>In: Proceedings of the COLING/ACL on Interactive presentation sessions</source>
          . pp.
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          . Association for Computational Linguistics (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koc</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harmsen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaked</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandra</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aradhye</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ispir</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anil</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haque</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
          </string-name>
          , H.:
          <article-title>Wide &amp; deep learning for recommender systems</article-title>
          .
          <source>CoRR abs/1606.07792</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Covington</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sargin</surname>
          </string-name>
          , E.:
          <article-title>Deep neural networks for YouTube recommendations</article-title>
          .
          <source>In: Proceedings of the 10th ACM Conference on Recommender Systems</source>
          . New York, NY, USA (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dodge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gane</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
          </string-name>
          , J.:
          <article-title>Evaluating prerequisite qualities for learning end-to-end dialog systems</article-title>
          .
          <source>arXiv preprint arXiv:1511.06931</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>de Gemmis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lops</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narducci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Semantics-aware content-based recommender systems</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>159</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hermann</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kociský</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Espeholt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suleyman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blunsom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Teaching machines to read and comprehend</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>1693</fpage>
          -
          <lpage>1701</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kadlec</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bajgar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleindienst</surname>
          </string-name>
          , J.:
          <article-title>Text understanding with the attention sum reader network</article-title>
          .
          <source>arXiv preprint arXiv:1603.01547</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mahmood</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Improving recommender systems with adaptive conversational strategies</article-title>
          .
          <source>In: Proceedings of the 20th ACM conference on Hypertext and hypermedia</source>
          . pp.
          <fpage>73</fpage>
          -
          <lpage>82</lpage>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Musto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suglia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
          </string-name>
          , G.:
          <article-title>Ask me any rating: A content-based recommender system based on recurrent neural networks</article-title>
          .
          <source>In: Proceedings of the 7th Italian Information Retrieval Workshop</source>
          , Venezia, Italy, May 30-31 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          . pp.
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ning</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Desrosiers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
          </string-name>
          , G.:
          <article-title>A comprehensive survey of neighborhood-based recommendation methods</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>76</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On the difficulty of training recurrent neural networks</article-title>
          .
          <source>ICML (3) 28</source>
          ,
          <fpage>1310</fpage>
          -
          <lpage>1318</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>CoRR abs/1606.05250</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Renshaw</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>MCTest: A challenge dataset for the open-domain machine comprehension of text</article-title>
          . In: EMNLP (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rubens</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Active learning in recommender systems</article-title>
          .
          <source>In: Recommender Systems Handbook</source>
          , pp.
          <fpage>809</fpage>
          -
          <lpage>846</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Searle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bingham-Walker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Why “blow out”? a structural analysis of the movie dialog dataset</article-title>
          .
          <source>ACL 2016</source>
          , p.
          <fpage>215</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sordoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bachman</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Iterative alternating neural attention for machine reading</article-title>
          .
          <source>arXiv preprint arXiv:1606.02245</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sukhbaatar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szlam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergus</surname>
          </string-name>
          , R.:
          <article-title>End-to-end memory networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>2440</fpage>
          -
          <lpage>2448</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , W.L.:
          <article-title>Cloze procedure: a new tool for measuring readability</article-title>
          .
          <source>Journalism and Mass Communication Quarterly</source>
          <volume>30</volume>
          (
          <issue>4</issue>
          ),
          <fpage>415</fpage>
          (
          <year>1953</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Towards AI-complete question answering: A set of prerequisite toy tasks</article-title>
          .
          <source>CoRR abs/1502.05698</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Memory networks</article-title>
          .
          <source>arXiv preprint arXiv:1410.3916</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>