<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Latent Question Interpretation Through Parameter Adaptation Using Stochastic Neuron</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetiana Parshakova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dae-Shik Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electrical Engineering</institution>
          ,
          <addr-line>KAIST</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>46</fpage>
      <lpage>55</lpage>
      <abstract>
<p>Many neural network-based question-answering models rely on complex attention mechanisms, but they are limited in their ability to capture the variability of natural language and to generate diverse yet reasonable answers. To address this limitation, we propose a module that learns the diversity of possible interpretations for a given question. In order to identify the possible span of the respective answers, the parameters of our question-answering model are adapted using the value of a discrete "interpretation neuron". Additionally, we formulate a semi-supervised variational inference framework and fine-tune the final policy using rewards from the answer accuracy with policy gradient optimization. We demonstrate sample answers with induced latent interpretations, suggesting that our model has successfully discovered multiple ways of understanding a given question. When tested on the Stanford Question Answering Dataset (SQuAD), our model outperformed the baseline, suggesting the validity of the proposed approach. We open-source our implementation in PyTorch.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>The task of machine reading comprehension can be defined
through paragraph understanding and answering questions
that are related to it. It is a crucial task in Natural Language
Processing that led to the development of diverse deep
learning models. A wide range of these models use the
encoder-decoder structure to map a sequence (e.g., a paragraph and
a question) to a sequence (an answer), by encoding the input with a long
short-term memory (LSTM) [Hochreiter and Schmidhuber,
1997] into a fixed dimensional vector representation, and
then decode the output from that vector with another LSTM
[Sutskever et al., 2014]. Variations of this framework have
been extensively exploited in the conversation modeling task,
where neural networks (NNs) learn the mapping between
queries and responses [Vinyals and Le, 2015], as well as in
machine translation [Bahdanau et al., 2014; Cho et al., 2014],
text summarization [Paulus et al., 2017], image captioning
[Vinyals et al., 2015b], and more. Our implementation is
available at https://github.com/parshakova/apsn.</p>
<p>SQuAD [Rajpurkar et al., 2016] is a benchmark dataset
composed of 100,000+ questions posed by crowd-workers
on a set of Wikipedia articles. The answer to each question
is a span within a document, and the objective is to predict
the starting and ending indices of the answer, as and ae. Hence
most models generate two probability distributions over the
document in such a way that P = {P(as), P(ae|as)}.
Existing state-of-the-art models attempt to capture the most
relevant information for answering the question using
complex attention mechanisms. In particular, the key idea lies in
multi-layered attention that fuses semantic information from
the question into the document. It is achieved by coattention
encoders that build richer question-document representation
as well as various self-matching structures. These models
learn to output distributions over the span indices, and
during training are equally penalized for producing answers in
positions distinct from the ground truth, even if the meaning
is similar. Thus, they cannot take the basic actions needed
to generate natural answers. For example, consider the
following triplet (document, question, answer):</p>
      <p>D: Newcastle International Airport is located
approximately 6 miles (9.7 km) from the city centre on the
northern outskirts of the city near Ponteland and is the
larger of the two main airports serving the North East.
It is connected to the city via the Metro Light Rail
system and a journey into Newcastle city centre takes
approximately 20 minutes by train.</p>
      <p>Q: How far is Newcastle ’s airport from the center of
town?</p>
<p>A: 6 miles</p>
      <p>The span "20 minutes by train" is also a correct answer if
the question is interpreted from the perspective of time (which
can sometimes be more practical), but since it differs from the
ground-truth span, the cross-entropy loss will discourage this
answer. As a consequence, these attention-based
discriminative models are limited in their ability to exhibit the stochasticity
and variability of natural language and to generate diverse yet
reasonable answers.</p>
      <p>
        To address this problem, we propose integrating a module
that Adapts Parameters through Stochastic Neuron (APSN)
with a basic question-answering model
        <xref ref-type="bibr" rid="ref18 ref19 ref2 ref3 ref31 ref32 ref34 ref7">(in our case DrQA,
[Chen et al., 2017])</xref>
        and a training framework for learning a
complex distribution of the latent query interpretations during
the question answering. The discrete stochastic neuron
here represents the interpretation of a question and can be
considered as different personas of the answering agent. This
stochastic neuron is inferred from the question, and based
on its value the central document encoding parameters get
adapted to produce an answer for a particular interpretation.
The APSN framework employs a discrete latent variable [Mnih
and Gregor, 2014], because a continuous latent space is harder
to interpret and to apply in a semi-supervised learning
environment [Kingma et al., 2014]. The objective is to perform
Bayesian inference for the posterior distribution of latent
interpretations conditioned on the questions and document
sub-spans.
      </p>
      <p>In the framework of variational auto-encoder (VAE), we
construct an inference network as the variational approximation
of the posterior, and by sampling the interpretation for each
question-answer the model is able to learn the interpretation
distribution on the SQuAD by optimizing the variational
lower bound [Mnih and Gregor, 2014; Miao and Blunsom,
2016; Wen et al., 2017]. To reduce the variance further, we
develop the semi-supervised framework by jointly training on
the labelled and unlabelled latent interpretations [Kingma et
al., 2014].</p>
<p>In order to prevent mode collapse in selecting only
a single interpretation, we introduce a new objective that
discourages cosine similarity and penalizes the feature-correlation
proximity between the original document encoding and the
document encodings under different latent interpretations. The
latter is computed as a mean squared error between Gram
matrices [Gatys et al., 2015; Gatys et al., 2016]. In addition,
after training the model in the semi-supervised variational
inference framework, we fine-tune it with a mixed objective
that combines traditional cross entropy loss over position of
a span with a policy gradient (PG) reinforcement learning
[Xiong et al., 2017; Paulus et al., 2017; Li et al., 2016].
In the mixed objective scenario, the latent interpretation is
sampled from the prior distribution, and the span distribution
is considered to be a policy for PG optimization. We
compared the performance of the model with two different
scores for obtaining rewards: the F1 score and the exact
match (EM).</p>
      <p>In summary, our results suggest that the neural variational
inference framework is able to detect discrete latent
interpretations of a question. Finding various reasonable answers
within the same document is important, because it provides a
stepping stone towards building large-scale open-domain
QA, where one must first retrieve the few relevant articles
and then scan them to identify an answer. By allowing
multiple question interpretations, the agent may discover new
connections in the knowledge and arrive at more interesting
responses. The experimental results also indicate that by
introducing a module APSN and training framework to the
baseline DrQA, the accuracy of answers on the SQuAD
improves. Lastly, the quality of sample answers with induced
latent interpretations indicates that the model has successfully
discovered multiple ways of understanding the question.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Among the state-of-the-art end-to-end machine
comprehension models on the SQuAD dataset, attention mechanisms
play a crucial role.</p>
      <p>
        Bidirectional Attention Flow for Machine Comprehension
        <xref ref-type="bibr" rid="ref15 ref21 ref23 ref33 ref6 ref8">(BiDAF, [Seo et al., 2016])</xref>
        was built upon the hierarchical
multi-stage architecture. It filters the document using the
question. Additionally, BiDAF symmetrically filters the question
using the document, to extract the relevant parts of the question.
The Dynamic Coattention Network
        <xref ref-type="bibr" rid="ref15 ref21 ref23 ref33 ref6 ref8">(DCN, [Xiong et al.,
2016])</xref>
        uses coattention encoders to fuse the question and
paragraph into one representation. It also employs a dynamic
decoder that iteratively estimates the start and end indexes
using LSTM and a Highway Maxout Network. The extension
of DCN, DCN+ [Xiong et al., 2017], introduces the mixed
objective of cross entropy loss over span position and
self-critical policy learning [Paulus et al., 2017].
      </p>
      <p>The R-Net [Wang et al., 2017] is based on match-LSTMs
[Wang and Jiang, 2016] that first incorporate question
information into passage representation and then use it for a
recurrent self-matching attention. Start and end indices are
predicted with the use of pointer networks [Vinyals et al.,
2015a].</p>
<p>These models are equipped with a large number of
parameters, owing to the complexity of their
attention mechanisms, which are loaded with various information
pathways and tangled connections between layers. In
contrast, DrQA [Chen et al., 2017] is a fairly small and simple
model, but powerful enough to achieve high accuracy
on the SQuAD. We therefore chose it as the baseline
model, as it is more amenable to fast learning and
modification.</p>
<p>Gradient-based learning has been key to most neural
network-based algorithms. Backpropagation [Rumelhart et
al., 1986] computes exact gradients when the relationship
between the training objective and parameters is continuous and
generally smooth. However in many cases it is impossible
to apply backpropagation: for example when the model has
stochastic neurons, hard non-linearities, discrete sampling
operations, or when the objective function is unknown to
the agent (like in reinforcement learning). To get a learning
signal in such situations one has to construct a gradient
estimator.</p>
      <p>For models with continuous latent variables the
reparametrisation trick is commonly used [Kingma and Welling, 2013] to
achieve an unbiased low-variance gradient estimator. In
the discrete latent variable case, advantage actor-critic
methods (A2C) give unbiased gradient estimates with reduced
variance [Sutton et al., 2000], and the more recent framework
RELAX [Grathwohl et al., 2017], which outperforms A2C, is
applicable even when no continuous relaxation of the discrete
random variable is available.
In this work, we investigate the possibilities of
modeling the question interpretation distribution during reading
comprehension using the discrete VAE for inference [Mnih
and Gregor, 2014], which parametrizes the interpretation space
through a discrete latent variable. This framework is capable
of combining different learning paradigms, such as
semi-supervised learning, reinforcement learning, and
sample-based variational inference, to bootstrap performance.</p>
    </sec>
    <sec id="sec-3">
      <title>Baseline and APSN Integration</title>
      <p>The APSN module is integrated with the question-answering
baseline DrQA. Suppose we are given a document d
consisting of m tokens {d1, ..., dm}, a question q consisting of
l tokens {q1, ..., ql}, and a total of ni latent question
interpretations. We divide the model parameters θ into two sets: the policy
π parameters θ2 and all the remaining ones, θ1.</p>
      <sec id="sec-3-1">
        <title>Question Encoding</title>
        <p>
          First, we obtain the question encodings with a multi-layer
bidirectional Simple Recurrent Unit
          <xref ref-type="bibr" rid="ref10 ref14">(SRU, [Lei and Zhang,
2017])</xref>
applied on top of the word embeddings qe =
{qe1, ..., qel}, where femb(qi) = E(qi) = qei:
        </p>
<p>{q1, ..., ql} = SRU1({qe1, ..., qel})    (1)</p>
        <p>The resulting encodings are combined into a single question encoding
through a parametrized weighted sum: qw = Σj bj qj.</p>
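<p>The weighted sum above can be sketched as follows; this is an illustrative NumPy version (the released implementation is in PyTorch), where the scoring vector w is an assumed stand-in for the parameters behind the weights bj:</p>

```python
import numpy as np

# Sketch of the parametrized weighted sum q_w = sum_j b_j q_j.
# The weights b_j are assumed to come from a learned scoring vector w
# via a softmax over the question tokens; all names are illustrative.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_question_encoding(q, w):
    # q: (l, hidden) token encodings from SRU1; w: (hidden,) scoring vector
    b = softmax(q @ w)            # (l,) attention weights, summing to 1
    return b @ q                  # (hidden,) pooled question encoding q_w

rng = np.random.default_rng(0)
q = rng.standard_normal((10, 128))    # encodings of 10 question tokens
w = rng.standard_normal(128)
q_w = weighted_question_encoding(q, w)
print(q_w.shape)                      # (128,)
```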
      </sec>
      <sec id="sec-3-2">
        <title>Document Encoding</title>
<p>Each token di in the document is first preprocessed
into a feature vector dei comprised of the concatenation
of: the word embedding femb(di) = E(di), the exact-match feature
fem(di) = I(di ∈ q), token features ftoken(di) =
(POS(di), NER(di), TF(di)), and the aligned question
embedding falign(di) = Σj ai,j E(qj), as in the original DrQA. To
encode the document we apply another recurrent network.
For reference, in the original single-layer SRU the linear
transformation of the input de is performed by a grouped matrix
multiplication:</p>
        <p>U^T = [W ⊕ Wh ⊕ Wf ⊕ Wr]^T [de1, de2, ..., dem]    (2)</p>
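<p>The grouped multiplication of Eq. 2 can be sketched as one stacked matrix product; sizes below are illustrative:</p>

```python
import numpy as np

# Sketch of Eq. 2: the SRU computes all of its input projections in a
# single grouped matrix multiplication by stacking W, W_h, W_f, W_r
# column-wise and multiplying the whole input sequence at once.
rng = np.random.default_rng(0)
d_in, d_hid, m = 300, 128, 12             # feature size, hidden size, tokens
W  = rng.standard_normal((d_in, d_hid))
Wh = rng.standard_normal((d_in, d_hid))
Wf = rng.standard_normal((d_in, d_hid))
Wr = rng.standard_normal((d_in, d_hid))
de = rng.standard_normal((m, d_in))       # document feature vectors de_1..de_m

W_all = np.concatenate([W, Wh, Wf, Wr], axis=1)   # (d_in, 4 * d_hid)
U = de @ W_all                                    # one matmul for all four blocks
assert np.allclose(U[:, :d_hid], de @ W)          # first block equals de @ W
print(U.shape)                                    # (12, 512)
```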
      </sec>
      <sec id="sec-3-3">
        <title>Interpretation Policy</title>
<p>The policy network encodes q into qw and de (the word
embedding part only) into dw with SRU1, since empirically we
found it beneficial to share the question encoding
parameters with the prior policy. Then, the latent interpretation
z is parametrized by a three-layered MLP:</p>
        <p>πθ2(z|qe, de) = σ(W4^T · relu(W3^T · relu(W2^T · relu(W1^T [qw ⊕ dw]))))    (3)</p>
        <p>where σ stands for a softmax, biases are omitted for
simplicity, W1, W2, W3, W4 are trainable parameters, and ⊕ stands for
concatenation. The latent interpretation z(n) ∈ {0, 1, ..., ni −
1} is sampled from a discrete conditional multinomial
distribution z(n) ∼ πθ2(z|qe, de).</p>
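<p>Eq. 3 can be sketched as follows, with plain NumPy in place of the PyTorch modules and illustrative layer sizes:</p>

```python
import numpy as np

# Sketch of the interpretation policy (Eq. 3): a feed-forward stack maps
# the concatenation [q_w (+) d_w] to a softmax over n_i interpretations,
# from which z is sampled. W1..W4 follow the text; sizes are illustrative.
def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def interpretation_policy(q_w, d_w, W1, W2, W3, W4):
    h = np.concatenate([q_w, d_w])                    # [q_w (+) d_w]
    h = relu(W3.T @ relu(W2.T @ relu(W1.T @ h)))
    return softmax(W4.T @ h)                          # pi(z | qe, de)

rng = np.random.default_rng(0)
hid, n_i = 128, 5
W1 = rng.standard_normal((2 * hid, hid))
W2 = rng.standard_normal((hid, hid))
W3 = rng.standard_normal((hid, hid))
W4 = rng.standard_normal((hid, n_i))
pi = interpretation_policy(rng.standard_normal(hid), rng.standard_normal(hid),
                           W1, W2, W3, W4)
z = rng.choice(n_i, p=pi)                             # sampled interpretation
print(pi.shape)                                       # (5,)
```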
      </sec>
      <sec id="sec-3-4">
        <title>Parameter Adaptation</title>
<p>In the APSN, the sampled interpretation is used to adapt
the central SRU parameters W from a single layer. For
each value of the latent interpretation z(n) = i there is
an individual set of weights Wi associated with it. These
weights are used to distort the central parameters in order to
adapt them for new interpretations:</p>
        <p>W = W ⊕ Wh ⊕ Wf ⊕ Wr</p>
        <p>W = {W0, ..., W(ni−1)}</p>
        <p>In this work, we present multiple methods for combining W
with Wz(n) to obtain the new parameters Wz(n)^new, which are
optimized for a particular interpretation:
• Addition: Wz(n)^new = W + Wz(n)
• Multiplication: Wz(n)^new = W ⊙ σ(Wz(n))
• Convolutional: Wz(n)^new = CNN(W, Wz(n)), which
consists of multiple layers of 2D convolutions with a
kernel size of 3 × 3, ReLU activations between layers,
and zero padding to keep the original size of the matrix.</p>
        <p>The above procedure is used to obtain the adapted parameters
Wz(n)^new of a single SRU layer. Similar steps can be followed to
find adapted parameters in another layer, but with a separate
set of perturbation weights. The set of adapted parameters
in the initial layers of the SRU, along with unchanged parameters in
the remaining layers, is used in what we call SRU2 to get
the encodings of the document information:</p>
        <p>{d1, ..., dm} = SRU2{de1, ..., dem}</p>
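<p>The additive and multiplicative adaptation variants can be sketched as below (the convolutional variant would instead run a small CNN over the stacked matrices); shapes and names are illustrative:</p>

```python
import numpy as np

# Sketch of APSN parameter adaptation: the central matrix W is distorted
# by the interpretation-specific matrix W_z to give the adapted W_new.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adapt_add(W, W_z):
    return W + W_z                  # addition: W_new = W + W_z

def adapt_mul(W, W_z):
    return W * sigmoid(W_z)         # multiplication: W_new = W (elementwise) sigma(W_z)

rng = np.random.default_rng(0)
n_i = 5
W = rng.standard_normal((300, 128))              # central SRU parameters
W_per_z = rng.standard_normal((n_i, 300, 128))   # one distortion per value of z
z = 2                                            # sampled interpretation
W_new = adapt_mul(W, W_per_z[z])
print(W_new.shape)                               # (300, 128)
```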
      </sec>
      <sec id="sec-3-5">
        <title>Prediction</title>
<p>Similarly to the original DrQA model, we use a bilinear term
to capture the similarity between di and qw and compute the
probabilities of each token being the start and the end of an answer:
pθ(as = i|qe, de) ∝ exp(di Ws qw),
pθ(ae = i|qe, de) ∝ exp(di We qw).</p>
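<p>The bilinear scoring can be sketched as below for the start index (the end index is scored identically with We); sizes are illustrative:</p>

```python
import numpy as np

# Sketch of the bilinear span scoring: p(a_s = i) is proportional to
# exp(d_i W_s q_w), computed for all document tokens at once.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
m, hid = 12, 128
D = rng.standard_normal((m, hid))         # document token encodings d_1..d_m
q_w = rng.standard_normal(hid)            # pooled question encoding
W_s = rng.standard_normal((hid, hid))     # bilinear term for the start index

p_start = softmax(D @ W_s @ q_w)          # distribution over start positions
a_s = int(p_start.argmax())               # most likely start index
print(p_start.shape)                      # (12,)
```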
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Training Framework</title>
      <sec id="sec-4-1">
        <title>Inference</title>
<p>To implement sampling from the variational posterior for
a given observation, we construct an inference network
qφ(z|qe, a) with parameters φ as the variational approximation
of the posterior distribution p(z|qe, a) [Mnih and Gregor,
2014; Miao and Blunsom, 2016; Wen et al., 2017]:</p>
        <p>L = E_{qφ(z|qe,a)}[log pθ1(a|z, qe, de)] − βDKL(qφ(z|qe, a)||πθ2(z|qe, de))
≤ log Σz pθ1(a|z, qe, de) πθ2(z|qe, de)
= log pθ(a|qe, de)    (9)</p>
        <p>Note that the coefficient β = 0.1 scales the learning signal of
the KL divergence [Higgins et al., 2016]. Although we are
not optimizing the exact variational lower bound, the final
goal of learning an effective answering model based on
the question interpretation mostly depends on the reconstruction
error.</p>
<p>The inference network qφ(z|qe, a) is conditioned on the
answer embedding, which is a document sub-span {dei}_{i=as}^{ae} ⊂
{dei}_{i=1}^{m}, and on the question embeddings {qei}_{i=1}^{l}, on top of
which a recurrent neural network is applied. Then, to obtain
a multinomial distribution over the latent interpretations,
the concatenation of the resulting hidden units of the answer
and question encodings is passed through a network similar to the one
described in Eq. 3.</p>
<p>During training we draw N samples z(n) ∼ qφ(z|qe, a)
independently for computing the gradients. Parameters θ1 are
directly updated by backpropagating the stochastic gradients:</p>
        <p>∇θ1 L ≈ (1/N) Σn ∂log pθ1(a|z(n), qe, de) / ∂θ1.    (12)</p>
        <p>Parameters of the prior network θ2 are trained by mimicking
the posterior network:</p>
        <p>∇θ2 L = β Σz qφ(z|qe, a) ∂log πθ2(z|qe, de) / ∂θ2.    (13)</p>
        <p>For the parameters φ of the posterior network, we first
define the learning signal as:</p>
        <p>l(a, z(n), qe, de) = log pθ1(a|z(n), qe, de) −
β (log qφ(z(n)|qe, a) − log πθ2(z(n)|qe, de)).    (14)</p>
        <p>Then the parameters φ are updated by:</p>
        <p>∇φ L ≈ (1/N) Σn (l(a, z(n), qe, de) − b(qe, de)) · ∂log qφ(z(n)|qe, a) / ∂φ.    (15)</p>
        <p>To reduce the variance of this gradient estimator, which relies
on samples from qφ(z|qe, a), we follow the REINFORCE
algorithm [Mnih and Gregor, 2014] and introduce a baseline
critic network b(qe, de) = MLP(qw ⊕ dw). During
training, the baseline is updated by minimising the mean
square error with the learning signal.</p>
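<p>The score-function update of Eq. 15 can be sketched as follows; here the posterior qφ is reduced to a bare softmax over logits, so its score function has the closed form onehot(z) − q, and the learning signal and baseline are random stand-ins:</p>

```python
import numpy as np

# Sketch of the REINFORCE-style gradient for the inference network (Eq. 15):
# sample z ~ q_phi, scale the score function of log q_phi by the learning
# signal l minus the baseline b, and average over N samples.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_i, N = 5, 64
phi = rng.standard_normal(n_i)            # logits of q_phi(z | qe, a)
q = softmax(phi)
baseline = 0.0                            # critic b(qe, de), constant stand-in

grad = np.zeros(n_i)
for _ in range(N):
    z = rng.choice(n_i, p=q)
    signal = rng.standard_normal()        # stand-in for l(a, z, qe, de)
    onehot = np.eye(n_i)[z]
    grad += (signal - baseline) * (onehot - q)   # d log q_phi(z) / d phi
grad /= N
print(grad.shape)                         # (5,)
```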
      </sec>
      <sec id="sec-4-2">
        <title>Semi-Supervision</title>
<p>While learning interpretations in a completely unsupervised
manner, one major difficulty remains: the high variance of the
inference network in the early stages of training. Thus, we
adopt a semi-supervised training framework [Kingma et al.,
2014]. We used a standard clustering algorithm to generate
labels ẑ for question-answer pairs. In this case our training
examples are separated into two sets, (ẑ, qe, de, a) ∈ L
and (qe, de, a) ∈ U, that together produce a joint objective
function:</p>
        <p>Lss = α Σ_{(ẑ,qe,de,a)∈L} log[pθ1(a|ẑ, qe, de) πθ2(ẑ|qe, de) qφ(ẑ|qe, a)] +
Σ_{(qe,de,a)∈U} [E_{qφ(z|qe,a)} log pθ1(a|z, qe, de) − βDKL(qφ(z|qe, a)||πθ2(z|qe, de))]    (16)</p>
        <p>where α is a balancing parameter between updates from the
modified variational bound (Eq. 9) and the joint log-likelihood
of the fully observed data.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Interpretation Diversity</title>
<p>While training the system in the semi-supervised variational
inference framework, the interpretation policy suffers from
mode collapse. To prevent this, we maximize a new
regularization objective:</p>
        <p>Lreg = Σ_{i=0}^{ni−1} [−0.1 · cos(U, Ui) + 0.001 · MSE(Gram(U), Gram(Ui))]    (17)</p>
        <p>where U and Ui are linear transformations of the average-pooled
input to the SRU across the time steps (Eq. 2), without and with
parameter adaptation respectively, cos is the cosine similarity, and
Gram is a Gramian matrix divided by size(U).</p>
<p>By optimizing this objective, the proximity of document
encodings under various interpretations, measured by
feature correlations (i.e., the Gram matrix) and cosine similarity,
is minimized. The Gram matrix has a remarkable ability to
capture texture information and style [Gatys et al., 2015;
Gatys et al., 2016], while cosine similarity is useful for
measuring how semantically related documents are. Lreg, along
with the main objective (Eq. 9), aims to find parameters
W, Wi that help make the document encodings different in
semantics and in style across the latent interpretations, while
still producing the correct answers.</p>
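<p>Eq. 17 can be sketched as follows, with random stand-ins for the pooled SRU input transformations U and Ui:</p>

```python
import numpy as np

# Sketch of the diversity objective (Eq. 17): for every interpretation i,
# push the adapted encoding U_i away from the unadapted U via a negative
# cosine-similarity term and a Gram-matrix MSE term (coefficients as in
# the text). The inputs here are random stand-ins.
def gram(U):
    return (U @ U.T) / U.size             # Gramian divided by size(U)

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def diversity_objective(U, U_per_i):
    total = 0.0
    for U_i in U_per_i:
        mse = float(np.mean((gram(U) - gram(U_i)) ** 2))
        total += -0.1 * cos_sim(U, U_i) + 0.001 * mse
    return total                          # maximized during training

rng = np.random.default_rng(0)
U = rng.standard_normal((32, 128))            # without parameter adaptation
U_per_i = rng.standard_normal((5, 32, 128))   # one U_i per interpretation
print(round(diversity_objective(U, U_per_i), 6))  # scalar value of L_reg
```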
      </sec>
      <sec id="sec-4-4">
        <title>Policy Gradient</title>
<p>After the interpretation policy πθ2(z|qe, de) and the
answering policy pθ1(a|z, qe, de) are learned, we apply a policy
gradient-based reinforcement learning algorithm to fine-tune
the parameters θ [Xiong et al., 2017; Paulus et al., 2017;
Li et al., 2016]. By sampling z(n) ∼ πθ2(z|qe, de) and â ∼
pθ1(a|z(n), qe, de), the system receives a reward r_score^(n)(a, â).
The new expected gradient from a mixed objective that
includes a cross-entropy and a policy gradient term is computed
as:</p>
        <p>∇θ Lce+pg ≈ (1/N) Σn ∂/∂θ [(1 − γ) · log pθ(a|z(n), qe, de) +
γ · r_score^(n)(a, â) log(pθ(â|z(n), qe, de) πθ(z(n)|qe, de))].    (18)</p>
        <p>We evaluated the performance of the model with different
scores used to compute rewards: F1 and EM (between
the ground truth and the predicted span). The final value of
r_score^(n)(a, â) was normalized over the batch.</p>
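<p>The two reward scores can be sketched as standard SQuAD-style span metrics; this is an illustrative version with batch normalization of the rewards, not the authors' exact implementation:</p>

```python
import numpy as np

# Sketch of the rewards used in Eq. 18: EM is an exact match of the
# predicted span against the gold span, F1 is token-level overlap.
def f1_reward(pred_tokens, gold_tokens):
    remaining = list(gold_tokens)
    common = 0
    for t in pred_tokens:
        if t in remaining:
            remaining.remove(t)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def em_reward(pred_tokens, gold_tokens):
    return 1.0 if list(pred_tokens) == list(gold_tokens) else 0.0

r = f1_reward("20 minutes by train".split(), "20 minutes".split())
batch = np.array([em_reward(["a"], ["a"]), r, 0.0])
normalized = (batch - batch.mean()) / (batch.std() + 1e-8)  # per-batch normalization
print(round(r, 3))    # 0.667
```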
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Rationale</title>
      <p>We will now provide the intuition behind the parameter
adaptation and the training framework. Current works in
hierarchical reinforcement learning are based on the options
framework [Sutton et al., 1999], where a master policy
selects among options (sub-policies) to accomplish the final
goal. Similarly, our algorithm learns a hierarchical
policy, where a master policy πθ2(z(n)|qe, de) switches between
the interpretation-specific weights Wz(n) that fine-tune the
shared central weights W and form a sub-policy (sub-policies
correspond to pθ(a|qe, de) with Wz(n)^new) for a particular
interpretation value.</p>
<p>Next, we consider the VAE framework. It is used to approximate
the posterior distribution over the latent interpretations, so
that the system can optimize the variational lower bound of
the joint distribution. Hence, by sampling the interpretations
for each question and correct document sub-span (answer),
the model is able to learn the interpretation distribution on
SQuAD. To reduce the variance of the inference network
in the early stage of training, we introduce a semi-supervised
learning signal. While maintaining this framework, the
system suffers from mode collapse in the interpretation
policy. The mode collapse is prevented by the use
of the interpretation diversity objective. In effect, it leads
to maximally effective behaviour in the question-answering
task.</p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <sec id="sec-6-1">
        <title>Implementation Details</title>
        <p>For the word embeddings we use GloVe embeddings
pretrained on the 840B Common Crawl corpus [Pennington et
al., 2014]. Each recurrent network is a bidirectional SRU that
has 5 layers and the hidden state size 128, as in the baseline
DrQA. We apply dropout with p = 0.8 to all hidden units
of the SRU and use mini-batches of size 64. The model is trained
with Adamax [Kingma and Ba, 2014] and tuned with early
stopping on the validation set. In SQuAD some questions
have several ground-truth answers; however, during training
only a single answer per question was used. We apply
the spaCy English language models [Honnibal and Montani,
2017] for tokenization and for generating lemma, part-of-speech,
and named entity tags.</p>
        <p>The trade-off coefficients α and γ are set to 0.1. The
final objective in the semi-supervised variational inference
framework is Lreg+Lss. The parameters from the pre-trained
DrQA are used as the initialization for the APSN model.
The number of features in the convolutional parameter distortion
is set to 64. The baseline critic network is a 3-layer MLP
with hidden size 128. The reported accuracies are obtained on
the SQuAD validation set.</p>
<p>To produce self-labelled question clusters for
semi-supervised learning of the interpretations, we used Sent2Vec
[Pagliardini et al., 2017] to obtain sentence embeddings for
question-answer pairs, and KMeans for clustering. The
number of labelled interpretations ranged from 30%
to 50% of the whole dataset, depending on the number of
interpretations ni.</p>
        <p>[Table 1: F1 scores of APSN conv3-conv5 variants with ni from 3 to 10,
trained with Lss + Lreg and fine-tuned with Lce+pg using rEM and rF1 rewards.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>SQuAD Accuracy</title>
        <p>Table 2. Accuracy on the SQuAD validation set under the Lss objective:</p>
        <p>Model | ni | EM | F1</p>
        <p>DrQA | - | 70.28 | 79.50</p>
        <p>APSN add 1 | 5 | 70.91 | 80.32</p>
        <p>APSN mul 1 | 5 | 70.87 | 79.86</p>
        <p>APSN mul 2 | 5 | 71.30 | 80.33</p>
        <p>APSN conv4 1 | 5 | 71.29 | 80.72</p>
        <p>APSN conv5 1 | 10 | 71.88 | 81.09</p>
        <p>Empirically obtained evaluation results in Table 2 indicate
that convolutional operations for adapting parameters are the
most effective with our interpretation policy.</p>
        <p>The performance of the model in Table 1 illustrates that the
accuracy improves as the number of latent interpretations
ni increases from 3 to 5, and then goes down. Also, it is
crucial to find a proper number of layers in the convolutional
parameter adaptation module individually for each value of
ni. The policy gradient framework consistently improves
upon the accuracy achieved by applying solely the semi-supervised
variational inference training. The APSN model outperforms
the baseline DrQA in all cases.</p>
        <p>Interestingly, when the regularization objective is used, the
model arrives at its best performance on the SQuAD after
being fine-tuned with PG, compared to the case with a
single Lss objective. This is the result of the mode collapse
that happens in the latter case, when the central parameters
of the SRU, W, get adapted only for a single task. In this
case W becomes insensitive to changes in the latent question
interpretation neuron.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Analysis of Samples</title>
        <p>The sample answers based on the induced values of latent
interpretation are illustrated in Table 3. Among the generated
spans, some contain new sequences that have no word
overlap with the first ground-truth option (which the model
was trained with) but are nevertheless plausible answers (samples
#1-3 in Table 3, marked in violet). This was the main goal of the
interpretation neuron. Other things to note:
1. While the model was trained only with a single
answer per question, it is able to find multiple alternative
answers in cases when several different options are
included in the gold reference (sample #4-6).
2. We also note that the predicted spans of some interpretations
are implicitly related to the correct answer by a
causal relationship (samples #7, #8). In such cases, the
produced answers contain helpful information about the
ground truth even when they are not directly answering
the question. It may be a valuable path for future
investigation to use such spans as an intermediate step
for refining the final answers.
3. A paraphrasing behaviour of a question (sample #9) may
be useful in making a question-answering model elicit
the best answers [Buck et al., 2017].
4. In 80% of cases, the model finds a span that has an
overlap with a true answer but either contains
additional words (samples #10-12 answer questions more
thoroughly) or is more concise. It can be interpreted as
the fact that some people are more talkative while others
are laconic.
dynamos in a power house six miles away were repeatedly burned out, due to the powerful high frequency currents set
up in them, and which caused heavy sparks to jump through the windings and destroy the insulation
What did the sparks do to the insulation?
[’destroy’, ’jump through the windings and destroy the insulation’]
100.0) jump through the windings and destroy the insulation
100.0) destroy
The situation in New France was further exacerbated by a poor harvest in 1757, a difficult winter, and the allegedly
corrupt machinations of François Bigot, the intendant of the territory.</p>
        <p>What other reason caused poor supply of New France from a difficult winter?
[’poor harvest’, ’allegedly corrupt machinations of François Bigot’]
100.0) poor harvest
80.0) the allegedly corrupt machinations of François Bigot, the intendant of the territory
As the D-loop moves through the circular DNA, it adopts a theta intermediary form, also known as a Cairns replication
intermediate, and completes replication with a rolling circle mechanism.</p>
        <p>What is a Cairns replication intermediate?
[’a theta intermediary form’]
0.0) a rolling circle mechanism
100.0) a theta intermediary form
Research shows that student motivation and attitudes towards school are closely linked to student-teacher relationships.
Enthusiastic teachers are particularly good at creating beneficial relations with their students.</p>
        <p>What type of relationships do enthusiastic teachers cause?
[’beneficial’]
0.0) student-teacher
66.7) beneficial relations
Thus, the marginal utility of wealth per person (”the additional dollar”) decreases as a person becomes richer.
What the marginal utility of wealth per income per person do as that person becomes richer?
[’decreases’]
100) decreases
0.0) the additional dollar
Oxfam’s claims have however been questioned on the basis of the methodology used: by using net wealth (adding up
assets and subtracting debts), the Oxfam report, for instance, finds that there are more poor people in the United States
and Western Europe than in China (due to a greater tendency to take on debts). Anthony Shorrocks, the lead author
of the Credit Suisse report which is one of the sources of Oxfam’s data, considers the criticism about debt to be a "silly
argument" and "a non-issue . . . a diversion".</p>
        <p>Why does Oxfam and Credit Suisse believe their findings are being doubted?
[’a diversion’, ’there are more poor people in the United States and Western Europe than in China’]
100.0) there are more poor people in the United States and Western Europe than in China
0.0) the criticism about debt to be a "silly argument"
Thus, the APSN clearly has multiple modes of understanding
the question and, therefore, answering it.</p>
      </sec>
    </sec>
    <sec id="sec-7">
<title>Conclusion and Future Work</title>
<p>In this paper, we have proposed a training framework and the
APSN model for learning question interpretations that help
to find various valid answers within the same document. The
role of the discrete interpretation neuron is to make the central
weights W more sensitive to a particular interpretation. This
allows the model to implement multiple modes of answering,
since these weights control the document representations that
are used to derive an answer. An important implication of this
study is that updating the latent distribution with the rewards
from a variational lower bound, and then fine-tuning the final
policy with the rewards from the answer accuracy, provides an
effective learning approach for the neural network.
The sample answers with induced latent interpretations
indicate that the model has successfully discovered multiple ways
of understanding a question. Lastly, empirical evaluation
results on SQuAD suggest that integrating the APSN
into the baseline DrQA is an effective approach for question
answering.</p>
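<p>To make the weight-adaptation mechanism concrete, the following minimal sketch (not our actual PyTorch implementation; the dimensions, the additive adaptation tensor deltas, and the function names are hypothetical) shows how a discrete interpretation neuron can select an interpretation-specific perturbation of central weights W:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K discrete interpretations, d-dimensional representations.
K, d = 3, 8

W = rng.standard_normal((d, d))                 # central weights shared across interpretations
deltas = 0.1 * rng.standard_normal((K, d, d))   # one additive adaptation per interpretation

def adapt_and_transform(x, z):
    """Adapt the central weights with the perturbation selected by the
    discrete interpretation neuron z, then transform the representation x."""
    W_z = W + deltas[z]
    return np.tanh(W_z @ x)

q = rng.standard_normal(d)
# Each value of the interpretation neuron induces a different transformation
# of the same question representation, i.e., a different mode of answering.
modes = [adapt_and_transform(q, z) for z in range(K)]
```

Different values of the neuron yield different document-side transformations, which is the sense in which the adapted weights implement multiple modes of answering.</p>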
<p>In a fair number of cases the model produces sub-spans or
super-spans, failing to detect multiple question
interpretations. Further work needs to be done to establish whether
having a single question interpretation is a property of SQuAD
or of language in general. Moreover, a single sentence in one
language can be mapped to multiple variants in another language,
so another direction worth investigating is to connect the APSN
with a machine translation model. There, the APSN would learn
a complex distribution over interpretations when mapping source
to target sentences, and the latent interpretation neuron could
be seen as multiple personas translating a sentence.</p>
<p>The APSN module is integrated with the question-answering
model DrQA; however, we believe that other baseline models
could bring further insights and better results. It may also be
fruitful to apply the RELAX framework to compute a low-variance
gradient estimator for the APSN model instead of
semi-supervised variational inference, given its strong
performance in a game domain. Further research in this
area could make the multi-interpretation approach a standard
component of question-answering systems.</p>
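<p>The policy fine-tuning step described above, in which the final interpretation policy is updated with rewards from the answer accuracy, can be illustrated with a toy REINFORCE sketch (the categorical policy, learning rate, and the stand-in reward function are all hypothetical; in the paper the reward would come from answer accuracy, e.g., F1):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3                  # number of latent interpretations
logits = np.zeros(K)   # parameters of the categorical interpretation policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_reward(z):
    # Hypothetical stand-in reward: interpretation 2 is assumed
    # to lead to accurate answers.
    return 1.0 if z == 2 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    z = rng.choice(K, p=probs)     # sample an interpretation from the policy
    r = answer_reward(z)
    grad_logp = -probs             # gradient of log prob wrt logits ...
    grad_logp[z] += 1.0            # ... for the sampled category
    logits += lr * r * grad_logp   # REINFORCE update
```

After training, the policy concentrates its mass on the interpretation that earns reward, which mirrors how rewards from answer accuracy sharpen the latent interpretation distribution.</p>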
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by Brain Korea 21+ Project,
BK Electronics and Communications Technology Division,
KAIST in 2018. This research was also funded by the
Hyundai NGV Company (Project No. G01170378).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Bahdanau et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
, Kyunghyun Cho, and
<string-name>
  <given-names>Yoshua</given-names>
  <surname>Bengio</surname>
</string-name>
.
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Buck et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Buck</surname>
          </string-name>
          , Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Ask the right questions: Active question reformulation with reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1705.07830</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Chen et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Danqi</given-names>
            <surname>Chen</surname>
          </string-name>
          , Adam Fisch, Jason Weston, and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          .
          <article-title>Reading wikipedia to answer opendomain questions</article-title>
          .
          <source>arXiv preprint arXiv:1704.00051</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Cho et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
, Bart Van Merriënboer, Dzmitry Bahdanau, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>On the properties of neural machine translation: Encoder-decoder approaches</article-title>
          .
          <source>arXiv preprint arXiv:1409.1259</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Gatys et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Leon</given-names>
            <surname>Gatys</surname>
          </string-name>
          , Alexander S Ecker, and
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Bethge</surname>
          </string-name>
          .
          <article-title>Texture synthesis using convolutional neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>262</fpage>
          -
          <lpage>270</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Gatys et al.,
          <year>2016</year>
          ]
Leon A Gatys, Alexander S Ecker,
          and
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Bethge</surname>
          </string-name>
          .
          <article-title>Image style transfer using convolutional neural networks</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2016 IEEE Conference on</source>
          , pages
          <fpage>2414</fpage>
          -
          <lpage>2423</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Grathwohl et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Will</given-names>
            <surname>Grathwohl</surname>
          </string-name>
          , Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud.
          <article-title>Backpropagation through the void: Optimizing control variates for black-box gradient estimation</article-title>
          .
          <source>arXiv preprint arXiv:1711.00123</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Higgins et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Irina</given-names>
            <surname>Higgins</surname>
          </string-name>
          , Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner.
          <article-title>beta-vae: Learning basic visual concepts with a constrained variational framework</article-title>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[Hochreiter and Schmidhuber,
<year>1997</year>
] Sepp Hochreiter and Jürgen Schmidhuber.
<article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[Honnibal and Montani,
<year>2017</year>
] Matthew Honnibal and Ines Montani.
<article-title>spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          . To appear,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
[Kingma and Ba,
<year>2014</year>
] Diederik P Kingma and Jimmy Ba.
<article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
[Kingma and Welling,
<year>2013</year>
] Diederik P Kingma and Max Welling.
          <article-title>Auto-encoding variational bayes</article-title>
          .
          <source>arXiv preprint arXiv:1312.6114</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Kingma et al.,
          <year>2014</year>
          ]
<string-name>
  <given-names>Diederik P</given-names>
  <surname>Kingma</surname>
</string-name>
          , Shakir Mohamed, Danilo Jimenez Rezende, and
          <string-name>
            <given-names>Max</given-names>
            <surname>Welling</surname>
          </string-name>
          .
<article-title>Semi-supervised learning with deep generative models</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>3581</fpage>
          -
          <lpage>3589</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
[Lei and Zhang,
<year>2017</year>
]
          <string-name>
            <given-names>Tao</given-names>
            <surname>Lei</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yu</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Training rnns as fast as cnns</article-title>
          .
          <source>arXiv preprint arXiv:1709.02755</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
[Li et al.,
<year>2016</year>
]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Will</given-names>
            <surname>Monroe</surname>
          </string-name>
          , Alan Ritter, Michel Galley,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <article-title>Deep reinforcement learning for dialogue generation</article-title>
          .
          <source>arXiv preprint arXiv:1606.01541</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
[Miao and Blunsom,
<year>2016</year>
]
          <string-name>
            <given-names>Yishu</given-names>
            <surname>Miao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <article-title>Language as a latent variable: Discrete generative models for sentence compression</article-title>
          .
          <source>arXiv preprint arXiv:1609.07317</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
[Mnih and Gregor,
<year>2014</year>
]
          <string-name>
            <given-names>Andriy</given-names>
            <surname>Mnih</surname>
          </string-name>
          and
          <string-name>
            <given-names>Karol</given-names>
            <surname>Gregor</surname>
          </string-name>
          .
          <article-title>Neural variational inference and learning in belief networks</article-title>
          .
          <source>arXiv preprint arXiv:1402.0030</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Pagliardini et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Pagliardini</surname>
          </string-name>
          , Prakhar Gupta, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Jaggi</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of sentence embeddings using compositional n-gram features</article-title>
          .
          <source>arXiv preprint arXiv:1703.02507</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Paulus et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Romain</given-names>
            <surname>Paulus</surname>
          </string-name>
          , Caiming Xiong, and
          <string-name>
            <given-names>Richard</given-names>
            <surname>Socher</surname>
          </string-name>
          .
          <article-title>A deep reinforced model for abstractive summarization</article-title>
          .
          <source>arXiv preprint arXiv:1705.04304</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Pennington et al.,
          <year>2014</year>
          ] Jeffrey Pennington, Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
.
<article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Rajpurkar et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Pranav</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          , Jian Zhang, Konstantin Lopyrev, and
          <string-name>
            <given-names>Percy</given-names>
            <surname>Liang</surname>
          </string-name>
.
<article-title>Squad: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>arXiv preprint arXiv:1606.05250</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Rumelhart et al.,
          <year>1986</year>
          ]
<string-name>
  <given-names>David E</given-names>
  <surname>Rumelhart</surname>
</string-name>
, Geoffrey E Hinton, and Ronald J Williams
          .
          <article-title>Learning representations by back-propagating errors</article-title>
          .
          <source>nature</source>
          ,
          <volume>323</volume>
          (
          <issue>6088</issue>
          ):
          <fpage>533</fpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Seo et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Minjoon</given-names>
            <surname>Seo</surname>
          </string-name>
          , Aniruddha Kembhavi, Ali Farhadi, and
          <string-name>
            <given-names>Hannaneh</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          .
          <article-title>Bidirectional attention flow for machine comprehension</article-title>
          .
          <source>arXiv preprint arXiv:1611.01603</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Sutskever et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and Quoc V Le.
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
[Sutton et al.,
          <year>1999</year>
          ] Richard S Sutton,
          <string-name>
            <given-names>Doina</given-names>
            <surname>Precup</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Satinder</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <article-title>Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning</article-title>
          .
          <source>Artificial intelligence</source>
          ,
          <volume>112</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>181</fpage>
          -
          <lpage>211</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
[Sutton et al.,
          <year>2000</year>
          ] Richard S Sutton,
          <string-name>
            <given-names>David A McAllester</given-names>
            ,
            <surname>Satinder P Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yishay</given-names>
            <surname>Mansour</surname>
          </string-name>
          .
          <article-title>Policy gradient methods for reinforcement learning with function approximation</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1057</fpage>
          -
          <lpage>1063</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
[Vinyals and Le,
<year>2015</year>
]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>A neural conversational model</article-title>
          .
          <source>arXiv preprint arXiv:1506.05869</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Vinyals et al., 2015a]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Meire Fortunato, and
          <string-name>
            <given-names>Navdeep</given-names>
            <surname>Jaitly</surname>
          </string-name>
          .
          <article-title>Pointer networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2692</fpage>
          -
          <lpage>2700</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Vinyals et al., 2015b]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Alexander Toshev, Samy Bengio, and
          <string-name>
            <given-names>Dumitru</given-names>
            <surname>Erhan</surname>
          </string-name>
          .
          <article-title>Show and tell: A neural image caption generator</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
[Wang and Jiang,
<year>2016</year>
]
          <string-name>
            <given-names>Shuohang</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jing</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>Machine comprehension using match-lstm and answer pointer</article-title>
          .
          <source>arXiv preprint arXiv:1608.07905</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
[Wang et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Wenhui</given-names>
            <surname>Wang</surname>
          </string-name>
,
<string-name>
  <given-names>Nan</given-names>
  <surname>Yang</surname>
</string-name>
, Furu Wei, Baobao Chang, and
<string-name>
  <given-names>Ming</given-names>
  <surname>Zhou</surname>
</string-name>
          .
          <article-title>Gated self-matching networks for reading comprehension and question answering</article-title>
          .
<source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>189</fpage>
          -
          <lpage>198</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [Wen et al.,
          <year>2017</year>
          ]
          <string-name>
<given-names>Tsung-Hsien</given-names>
<surname>Wen</surname>
          </string-name>
          , Yishu Miao, Phil Blunsom, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Latent intention dialogue models</article-title>
          .
          <source>arXiv preprint arXiv:1705.10229</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [Xiong et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Caiming</given-names>
            <surname>Xiong</surname>
          </string-name>
          , Victor Zhong, and Richard Socher.
          <article-title>Dynamic coattention networks for question answering</article-title>
          .
          <source>arXiv preprint arXiv:1611.01604</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [Xiong et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Caiming</given-names>
            <surname>Xiong</surname>
          </string-name>
, Victor Zhong, and Richard Socher.
<article-title>Dcn+: Mixed objective and deep residual coattention for question answering</article-title>
          .
          <source>arXiv preprint arXiv:1711.00106</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>