Tenth International Workshop Modelling and Reasoning in Context (MRC) – 13.07.2018 – Stockholm, Sweden

Latent Question Interpretation Through Parameter Adaptation Using Stochastic Neuron

Tetiana Parshakova, Dae-Shik Kim
School of Electrical Engineering, KAIST
ten10@kaist.ac.kr, daeshik@kaist.ac.kr

Abstract

Many neural network-based question-answering models rely on complex attention mechanisms, but they are limited in their ability to capture natural language variability and to generate diverse and/or reasonable answers. To address this limitation, we propose a module that learns the diversity of possible interpretations for a given question. In order to identify the possible spans of the respective answers, the parameters of our question-answering model are adapted using the value of a discrete "interpretation neuron". Additionally, we formulate a semi-supervised variational inference framework and fine-tune the final policy using the rewards from the answer accuracy with policy gradient optimization. We demonstrate sample answers with induced latent interpretations, suggesting that our model has successfully discovered multiple ways of understanding a given question. When tested on the Stanford Question Answering Dataset (SQuAD), our model outperformed the current baseline, suggesting the potential validity of the approach described in this work. We open-source our implementation in PyTorch¹.

¹ https://github.com/parshakova/apsn

1 Introduction

The task of machine reading comprehension can be defined through paragraph understanding and answering questions that are related to it. It is a crucial task in Natural Language Processing that has led to the development of diverse deep learning models. A wide range of these models use the encoder-decoder structure to map a sequence (e.g. paragraph and question) to a sequence (answer), by encoding the input with a long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] into a fixed-dimensional vector representation, and then decoding the output from that vector with another LSTM [Sutskever et al., 2014]. Variations of this framework have been extensively exploited in the conversation modeling task, where neural networks (NNs) learn the mapping between queries and responses [Vinyals and Le, 2015], as well as in machine translation [Bahdanau et al., 2014; Cho et al., 2014], text summarization [Paulus et al., 2017], image captioning [Vinyals et al., 2015b] and more.

SQuAD [Rajpurkar et al., 2016] is a benchmark dataset that is composed of 100,000+ questions posed by crowd-workers on a set of Wikipedia articles. The answer to each question is a span within a document, and the objective is to predict the starting and ending indices of the answer: a_s, a_e. Hence most models generate two probability distributions over the document, in such a way that P = {P(a_s), P(a_e|a_s)}.

Existing state-of-the-art models attempt to capture the most relevant information for answering the question using complex attention mechanisms. In particular, the key idea lies in multi-layered attention that fuses semantic information from the question into the document. It is achieved by coattention encoders that build richer question-document representations, as well as by various self-matching structures. These models learn to output distributions over the span indices, and during training they are equally penalized for producing answers in positions distinct from the ground truth, even if the meaning is similar. Thus, they cannot take the basic actions needed to generate natural answers. For example, consider the following triplet (document, question, answer):

D: Newcastle International Airport is located approximately 6 miles (9.7 km) from the city centre on the northern outskirts of the city near Ponteland and is the larger of the two main airports serving the North East. It is connected to the city via the Metro Light Rail system and a journey into Newcastle city centre takes approximately 20 minutes by train.
Q: How far is Newcastle's airport from the center of town?
A: 6 miles

The span "20 minutes by train" is also a correct answer if the question is interpreted from the perspective of travel time (which can sometimes be more practical), but since it differs from the ground-truth span, the cross entropy loss will discourage this answer. As a consequence, these attention-based discriminative models are limited in their ability to exhibit the stochasticity and variability of natural language and to generate diverse yet reasonable answers.

To address this problem, we propose integrating a module that Adapts Parameters through a Stochastic Neuron (APSN) with a basic question-answering model (in our case DrQA [Chen et al., 2017]), together with a training framework for learning a complex distribution of latent query interpretations during question answering.
The discrete stochastic neuron here represents the interpretation of a question, and its values can be considered as different personas of the answering agent. This stochastic neuron is inferred from the question, and based on its value the central document encoding parameters get adapted to produce an answer for a particular interpretation.

The APSN framework employs a discrete latent variable [Mnih and Gregor, 2014], because a continuous latent space is harder to interpret and to apply in a semi-supervised learning environment [Kingma et al., 2014]. The objective is to perform Bayesian inference for the posterior distribution of latent interpretations conditioned on the questions and document sub-spans.

In the framework of the variational auto-encoder (VAE), we construct an inference network as the variational approximation of the posterior, and by sampling the interpretation for each question-answer pair the model is able to learn the interpretation distribution on SQuAD by optimizing the variational lower bound [Mnih and Gregor, 2014; Miao and Blunsom, 2016; Wen et al., 2017]. To reduce the variance further, we develop a semi-supervised framework that trains jointly on labelled and unlabelled latent interpretations [Kingma et al., 2014].

In order to prevent mode collapse into a single interpretation, we introduce a new objective that discourages the cosine similarity and penalizes the feature-correlation proximity between the original document encoding and the document encodings under different latent interpretations. The latter is computed as a mean square error between Gram matrices [Gatys et al., 2015; Gatys et al., 2016]. In addition, after training the model in the semi-supervised variational inference framework, we fine-tune it with a mixed objective that combines the traditional cross entropy loss over the position of a span with policy gradient (PG) reinforcement learning [Xiong et al., 2017; Paulus et al., 2017; Li et al., 2016]. In the mixed-objective scenario, the latent interpretation is sampled from the prior distribution, and the span distribution is considered to be a policy for PG optimization. We compared the performance of the model with two different scores for obtaining rewards: the F1 score and the exact match (EM).

In summary, our results suggest that the neural variational inference framework is able to detect discrete latent interpretations of a question. Finding various reasonable answers within the same document is important, because it is a stepping stone towards building large-scale open-domain QA, where one must first retrieve the few relevant articles and then scan them to identify an answer. By allowing multiple question interpretations, the agent may discover new connections in the knowledge and arrive at more interesting responses. The experimental results also indicate that introducing the APSN module and its training framework into the baseline DrQA improves the accuracy of answers on SQuAD. Lastly, the quality of the sample answers with induced latent interpretations indicates that the model has successfully discovered multiple ways of understanding the question.

2 Related Work

Among the state-of-the-art end-to-end machine comprehension models on the SQuAD dataset, attention mechanisms play a crucial role.

Bidirectional Attention Flow for Machine Comprehension (BiDAF, [Seo et al., 2016]) was built upon a hierarchical multi-stage architecture. It filters the document using the question. Additionally, BiDAF symmetrically filters the question using the document, to extract the relevant parts of the question.

The Dynamic Coattention Network (DCN, [Xiong et al., 2016]) uses coattention encoders to fuse the question and the paragraph into one representation. It also employs a dynamic decoder that iteratively estimates the start and end indices using an LSTM and a Highway Maxout Network. The extension of DCN, DCN+ [Xiong et al., 2017], introduces a mixed objective of cross entropy loss over span positions and self-critical policy learning [Paulus et al., 2017].

The R-Net [Wang et al., 2017] is based on match-LSTMs [Wang and Jiang, 2016] that first incorporate question information into the passage representation and then use it for recurrent self-matching attention. Start and end indices are predicted with the use of pointer networks [Vinyals et al., 2015a].

These models are equipped with a large number of parameters, owing to the structural complexity of their attention mechanisms, which are loaded with various information pathways and tangled connections between layers. In contrast, DrQA [Chen et al., 2017] is a fairly small and simple model, but it is powerful enough to achieve high accuracy on SQuAD. That is why it was chosen as the baseline model: it is more amenable to fast learning and modifications.

Gradient-based learning has been key to most neural-network-based algorithms. Backpropagation [Rumelhart et al., 1986] computes exact gradients when the relationship between the training objective and the parameters is continuous and generally smooth. However, in many cases it is impossible to apply backpropagation: for example, when the model has stochastic neurons, hard non-linearities, or discrete sampling operations, or when the objective function is unknown to the agent (as in reinforcement learning). To get a learning signal in such situations, one has to construct a gradient estimator.

For models with continuous latent variables, the reparametrisation trick is commonly used [Kingma and Welling, 2013] to achieve an unbiased low-variance gradient estimator. In the discrete latent variable case, advantage actor-critic methods (A2C) give unbiased gradient estimates with reduced variance [Sutton et al., 2000], and a more recent framework, RELAX [Grathwohl et al., 2017], which outperforms A2C, is applicable even when no continuous relaxation of the discrete random variable is available.

Figure 1: Structure overview of the integrated APSN module with DrQA. In this illustration n_i = 2 and z(n) = 1.
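As groundwork for the model description, recall from the introduction that the prediction target is the pair of distributions P = {P(a_s), P(a_e|a_s)} over token positions. The following is a minimal numpy sketch of how per-token scores can induce these distributions; the masking that constrains the end to follow the start is one common way to realize the conditional P(a_e|a_s), not a detail the paper commits to, and all scores here are arbitrary illustrative values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_distributions(start_scores, end_scores):
    """Return P(a_s) and P(a_e | a_s) over m token positions."""
    p_start = softmax(start_scores)
    m = len(end_scores)
    p_end_given_start = np.zeros((m, m))
    for s in range(m):
        masked = np.full(m, -np.inf)
        masked[s:] = end_scores[s:]   # an answer cannot end before it starts
        p_end_given_start[s] = softmax(masked)
    return p_start, p_end_given_start

p_s, p_e = span_distributions(np.array([1.0, 2.0, 0.5]),
                              np.array([0.3, 1.2, 2.0]))
```

Each row of `p_e` is a valid distribution conditioned on one start position; positions before the start receive exactly zero probability.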
In this work, we investigate the possibilities of modeling the question interpretation distribution during reading comprehension using the discrete VAE for inference [Mnih and Gregor, 2014], which parametrizes the interpretation space through a discrete latent variable. This framework is capable of combining different learning paradigms, such as semi-supervised learning, reinforcement learning, and sample-based variational inference, to bootstrap performance.

3 Baseline and APSN Integration

The APSN module is integrated with the question-answering baseline DrQA. Suppose we are given a document d consisting of m tokens {d_1, ..., d_m}, a question q consisting of l tokens {q_1, ..., q_l}, and a total of n_i latent question interpretations. We divide the model parameters θ into two sets: the policy π parameters θ_2 and all the remaining ones θ_1.

Question Encoding

First, we obtain the question encodings with a multi-layer bidirectional Simple Recurrent Unit (SRU, [Lei and Zhang, 2017]) applied on top of the word embeddings q̃ = {q̃_1, ..., q̃_l}, where f_emb(q_i) = E(q_i) = q̃_i:

    {q_1, ..., q_l} = SRU_1{q̃_1, ..., q̃_l}    (1)

The resulting encodings are combined into a single question encoding through a parametrized weighted sum q_w = Σ_j b_j q_j.

Document Encoding

Each token d_i in the document is first preprocessed into a feature vector d̃_i that is the concatenation of: the word embedding f_emb(d_i) = E(d_i), the exact match feature f_em(d_i) = I(d_i ∈ q), the token features f_token(d_i) = (POS(d_i), NER(d_i), TF(d_i)), and the aligned question embedding f_align(d_i) = Σ_j a_{i,j} E(q_j), as in the original DrQA. To encode the document we apply another recurrent network.

For reference, in the original single-layer SRU the linear transformation of the input d̃ is performed by a grouped matrix multiplication that stacks the four SRU weight matrices:

    U^T = [W; W_h; W_f; W_r] [d̃_1, d̃_2, ..., d̃_m]    (2)

Interpretation Policy

The policy network encodes q̃ into q_w, and d̃ (the word embedding part only) into d_w, with SRU_1, since empirically we found it beneficial to share the question encoding parameters with the prior policy. Then the latent interpretation z is parametrized by a three-layered MLP,

    π_θ2(z|q̃, d̃) = σ(W_4^T · relu(W_3^T · relu(W_2^T · relu(W_1^T [q_w ⊕ d_w]))))    (3)

where σ stands for a softmax, biases are omitted for simplicity, W_1, W_2, W_3, W_4 are trainable parameters, and ⊕ stands for concatenation. The latent interpretation z(n) ∈ {0, 1, ..., n_i − 1} is sampled from the discrete conditional multinomial distribution z(n) ∼ π_θ2(z|q̃, d̃).
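Eq. (3) amounts to a small feed-forward network followed by a categorical draw. A minimal numpy sketch (dimensions, initialization, and variable names are ours, purely illustrative; the paper's implementation is in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_i, hidden, enc = 4, 32, 16                # hypothetical sizes
W1 = rng.normal(size=(2 * enc, hidden))
W2 = rng.normal(size=(hidden, hidden))
W3 = rng.normal(size=(hidden, hidden))
W4 = rng.normal(size=(hidden, n_i))

def interpretation_policy(q_w, d_w):
    """pi(z | q, d): three ReLU layers, then a softmax over n_i interpretations."""
    h = np.concatenate([q_w, d_w])          # [q_w ⊕ d_w]
    h = relu(W1.T @ h)
    h = relu(W2.T @ h)
    h = relu(W3.T @ h)
    return softmax(W4.T @ h)

pi = interpretation_policy(rng.normal(size=enc), rng.normal(size=enc))
z = rng.choice(n_i, p=pi)                   # z ~ pi(z | q, d)
```

The sampled integer `z` plays the role of the interpretation neuron that selects the adaptation weights in the next subsection.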
Then, to obtain W = W ⊕ Wh ⊕ Wf ⊕ Wr (4) a multinomial distribution over the latent interpretations, W = {W0 , ..., Wni −1 } (5) the concatenation of the resulting hidden units of answer and question encodings is passed through a similar network In this work, we present multiple methods for combining W described in Eq. 3. with Wz(n) to obtain new parameters Wznew (n) , which are being During the training we draw N samples z (n) ∼ qφ (z|e q, a) optimized for a particular interpretation: independently for computing the gradients. Parameters θ1 are • Addition: Wznew (n) = W + Wz (n) directly updated by backpropagating the stochastic gradients: • Multiplication: Wznew (n) = W σ(Wz(n) ) 1 X ∂logpθ1 (a|z (n) , q e e, d) • Convolutional: Wznew = CNN(W, Wz(n) ), which ∇θ 1 L ≈ . (12) (n) N n ∂θ1 consists of multiple layers of 2D convolutions with a kernel zise of 3 × 3, ReLU activation between layers Parameters of the prior network θ2 are trained by mimicking and zero padding to keep the original size of the matrix the posterior network: The above procedure is used to obtain adapted parameters (n) of a single SRU layer. Similar steps can be followed to Wznew X ∂logπθ2 (z|e e q, d) find adapted parameters in another layer but with a separate ∇θ 2 L = β qφ (z|e q, a) . (13) ∂θ2 set of perturbation weights. The set of adapted parameters z in initial layers of SRU, along with unchanged parameters on the remaining layers, are used in what we call SRU2 , to get For the parameters φ in the posterior network, we firstly the encodings of the document information: define the learning signal as: e 1 , ..., d {d1 , ..., dm } = SRU2 {d em} (6) e = logpθ (a|z (n) , q e l(a, z (n) , q e, d) 1 e, d)−   Prediction β logqφ (z (n) |e q, a) − logπθ2 (z (n) |e e . 
Prediction

Similarly to the original DrQA model, we use a bilinear term to capture the similarity between d_i and q_w, and compute the probabilities of each token being the start and the end of an answer:

    p_θ(a_s = i | q̃, d̃) ∝ exp(d_i W_s q_w)    (7)
    p_θ(a_e = i | q̃, d̃) ∝ exp(d_i W_e q_w)    (8)

4 Training Framework

Inference

To implement sampling from the variational posterior for a given observation, we construct an inference network q_φ(z|q̃, a) with parameters φ as the variational approximation of the posterior distribution p(z|q̃, a) [Mnih and Gregor, 2014; Miao and Blunsom, 2016; Wen et al., 2017]:

    L = E_{q_φ(z|q̃,a)}[ log p_θ1(a|z, q̃, d̃) ] − β D_KL( q_φ(z|q̃, a) || π_θ2(z|q̃, d̃) )    (9)
      ≤ log Σ_z p_θ1(a|z, q̃, d̃) π_θ2(z|q̃, d̃)    (10)
      = log p_θ(a|q̃, d̃)    (11)

Note that the coefficient β = 0.1 scales the learning signal of the KL divergence [Higgins et al., 2016]. Although we are not optimizing the exact variational lower bound, the final goal of learning an effective answering model that is based on the question interpretation is mostly up to the reconstruction error.

The inference network q_φ(z|q̃, a) is conditioned on the answer embedding, which is a document sub-span {d̃_i}_{i=s}^{e} ⊂ {d̃_i}_{i=1}^{m}, and on the question embeddings {q̃_i}_{i=1}^{l}, on top of which a recurrent neural network is applied. Then, to obtain a multinomial distribution over the latent interpretations, the concatenation of the resulting hidden units of the answer and question encodings is passed through a network similar to the one described in Eq. 3.

During training we draw N samples z(n) ∼ q_φ(z|q̃, a) independently for computing the gradients. The parameters θ_1 are directly updated by backpropagating the stochastic gradients:

    ∇_θ1 L ≈ (1/N) Σ_n ∂ log p_θ1(a|z(n), q̃, d̃) / ∂θ_1    (12)

The parameters of the prior network θ_2 are trained by mimicking the posterior network:

    ∇_θ2 L = β Σ_z q_φ(z|q̃, a) ∂ log π_θ2(z|q̃, d̃) / ∂θ_2    (13)

For the parameters φ of the posterior network, we first define the learning signal as:

    l(a, z(n), q̃, d̃) = log p_θ1(a|z(n), q̃, d̃) − β ( log q_φ(z(n)|q̃, a) − log π_θ2(z(n)|q̃, d̃) )    (14)

Then the parameters φ are updated by:

    ∇_φ L ≈ (1/N) Σ_n [ l(a, z(n), q̃, d̃) − b(q̃, d̃) ] · ∂ log q_φ(z(n)|q̃, a) / ∂φ    (15)

To reduce the variance of this gradient estimator, which relies on samples from q_φ(z|q̃, a), we follow the REINFORCE algorithm [Mnih and Gregor, 2014] and introduce a baseline critic network b(q̃, d̃) = MLP(q_w ⊕ d_w). During training, the baseline is updated by minimising the mean square error with the learning signal.
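The update of Eq. (15) is a score-function (REINFORCE) estimator with a baseline subtracted from the learning signal. Its mechanics can be checked on a toy categorical distribution, with a fixed reward and a constant baseline standing in for the learning signal l(·) and the critic b(·):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_grad(logits, reward_fn, baseline, n_samples, rng):
    """Score-function estimate of d/d logits E_{z~q}[r(z)], with a baseline
    subtracted from the signal to reduce variance."""
    q = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        z = rng.choice(len(q), p=q)
        one_hot = np.eye(len(q))[z]
        grad_log_q = one_hot - q            # d log q(z) / d logits for a softmax
        grad += (reward_fn(z) - baseline) * grad_log_q
    return grad / n_samples

rng = np.random.default_rng(0)
# reward 1 only for category 2; baseline set to the mean reward 1/3
g = reinforce_grad(np.zeros(3), lambda z: float(z == 2),
                   baseline=1 / 3, n_samples=2000, rng=rng)
```

With uniform logits the true gradient is (−1/9, −1/9, 2/9), so the estimate should be positive on the rewarded category and negative elsewhere; the baseline changes none of the expectations, only the variance.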
Semi-Supervision

While learning interpretations in a completely unsupervised manner, one major difficulty remains: the high variance of the inference network in the early stages of training. Thus, we adopt a semi-supervised training framework [Kingma et al., 2014]. We used a standard clustering algorithm to generate labels ẑ for question-answer pairs. In this case our training examples are separated into two sets, (ẑ, q̃, d̃, a) ∈ L and (q̃, d̃, a) ∈ U, that together produce a joint objective function:

    L_ss = α Σ_{(q̃,d̃,a)∈U} [ E_{q_φ(z|q̃,a)}[ log p_θ1(a|z, q̃, d̃) ] − β D_KL( q_φ(z|q̃, a) || π_θ2(z|q̃, d̃) ) ]
         + Σ_{(ẑ,q̃,d̃,a)∈L} log [ p_θ1(a|ẑ, q̃, d̃) π_θ2(ẑ|q̃, d̃) q_φ(ẑ|q̃, a) ]    (16)

where α is a balancing parameter between the updates from the modified variational bound (Eq. 9) and the joint log-likelihood of the fully observed data.

Interpretation Diversity

While training the system in the semi-supervised variational inference framework, the interpretation policy suffers from mode collapse. To prevent that, we maximize a new regularization objective:

    L_reg = − Σ_{i=0}^{n_i−1} [ 0.1 · cos(U, U_i) + 0.001 · MSE(Gram(U), Gram(U_i)) ]    (17)

where U and U_i are linear transformations of the average-pooled input to the SRU across the time steps (Eq. 2), without and with parameter adaptation respectively, cos is the cosine similarity, and Gram is a Gramian matrix divided by size(U).

By optimizing this objective, the proximity of the document encodings under various interpretations, measured by feature correlations (i.e., the Gram matrix) and cosine similarity, gets minimized. The Gram matrix has a remarkable ability to capture texture information and style [Gatys et al., 2015; Gatys et al., 2016], while cosine similarity is useful for measuring how documents are semantically related. L_reg, together with the main objective (Eq. 9), aims to find parameters W, W that make the document encodings differ in semantics and in style across the latent interpretations, while still producing the correct answers.

Policy Gradient

After the interpretation policy π_θ2(z|q̃, d̃) and the answering policy p_θ1(a|z, q̃, d̃) are learned, we apply a policy-gradient-based reinforcement learning algorithm to fine-tune the parameters θ [Xiong et al., 2017; Paulus et al., 2017; Li et al., 2016]. By sampling z(n) ∼ π_θ2(z|q̃, d̃) and â ∼ p_θ1(a|z(n), q̃, d̃), the system receives a reward r_score(a, â). The new expected gradient from a mixed objective that includes a cross entropy and a policy gradient is computed as:

    ∇_θ L_ce+pg ≈ (1/N) Σ_n ∂/∂θ [ (1 − γ) · log p_θ(a|z(n), q̃, d̃) + γ · r_score(a, â) · log( p_θ(â|z(n), q̃, d̃) π_θ(z(n)|q̃, d̃) ) ]    (18)

We evaluated the performance of the model with different scores used for computing the rewards: F1 and EM (between the ground truth and a predicted span). The final value of r_score(a, â) was normalized over the batch.
5 Rationale

We will now provide the intuition behind the parameter adaptation and the training framework. Current works in hierarchical reinforcement learning are based on the options framework [Sutton et al., 1999], where a master policy selects among options (sub-policies) to accomplish the final goal. Similarly, our algorithm learns a hierarchical policy, where a master policy π_θ2(z(n)|q̃, d̃) switches between the interpretation-specific weights W_z(n) that fine-tune the shared central weights W and form a sub-policy (sub-policies correspond to p_θ(a|q̃, d̃) with W^new_z(n)) for a particular interpretation value.

Next, we consider the VAE framework. It is used to approximate the posterior distribution over the latent interpretations, so that the system can optimize the variational lower bound of the joint distribution. Hence, by sampling the interpretations for each question and correct document sub-span (answer), the model is able to learn the interpretation distribution on SQuAD. To reduce the variance of the inference network in the early stage of training, we introduce a semi-supervised learning signal. While maintaining such a framework, the system suffers from mode collapse in the interpretation policy. The mode collapse has been prevented by the use of the interpretation diversity objective. In effect, it led to maximally effective behaviour in the question-answering task.

6 Experiments

Implementation Details

For the word embeddings we use GloVe embeddings pre-trained on the 840B Common Crawl corpus [Pennington et al., 2014]. Each recurrent network is a bidirectional SRU with 5 layers and a hidden state size of 128, as in the baseline DrQA. We apply dropout with p = 0.8 to all hidden units of the SRU and use mini-batches of size 64. The model is trained with Adamax [Kingma and Ba, 2014] and tuned with early stopping on the validation set. In SQuAD some questions contain several ground truth answers; however, during training only a single answer per question was used. We apply the Spacy English language models [Honnibal and Montani, 2017] for tokenization and for generating lemma, part-of-speech, and named entity tags.

The trade-off coefficients α and γ are set to 0.1. The final objective in the semi-supervised variational inference framework is L_reg + L_ss. The parameters from the pre-trained DrQA are used as the initialization for the APSN model. The number of features in the convolutional parameter distortion is set to 64. The baseline critic network is a 3-layered MLP with a hidden size of 128. The provided accuracies are obtained on the SQuAD validation set.

To produce self-labelled question clusters for semi-supervised learning of the interpretations, we used Sent2Vec [Pagliardini et al., 2017] to obtain sentence embeddings for question-answer pairs, and KMeans for clustering. The number of labelled interpretations ranged from 30% to 50% across the whole dataset, depending on the value of n_i.

                     L_ce+pg & r_EM    L_ce+pg & r_F1    L_ss + L_reg
Model         n_i    F1      EM        F1      EM        F1      EM
APSN conv3     3     80.60   71.54     80.65   71.51     80.57   71.40
APSN conv4     3     80.56   71.21     80.64   71.31     80.56   71.32
APSN conv4     4     80.72   71.58     80.79   71.66     80.75   71.40
APSN conv5     4     80.45   71.40     80.62   71.43     80.49   71.33
APSN conv4     5     80.66   71.31     80.59   71.20     80.58   71.23
APSN conv5     5     80.98   71.48     81.11   71.80     80.91   71.44
APSN conv3     8     80.61   71.23     80.67   71.27     80.59   71.38
APSN conv3    10     80.59   71.69     80.69   71.52     80.67   71.55

Table 1: Evaluation on different numbers of latent interpretations. In "APSN conv5", the number 5 denotes the number of layers in the convolutional architecture for obtaining the adapted parameters W^new. L_ss + L_reg is the semi-supervised variational inference objective. L_ce+pg is the mixed objective of cross entropy and policy gradient, where r_F1 and r_EM mean that the F1 score and the EM, respectively, are used in computing the rewards.
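The self-labelling step described in the implementation details — embedding question-answer pairs and clustering them into n_i groups — can be sketched with a plain k-means over synthetic vectors standing in for the Sent2Vec embeddings (the paper itself uses Sent2Vec and an off-the-shelf KMeans; this toy version only illustrates the clustering step):

```python
import numpy as np

def kmeans(X, k, n_iter, rng):
    """Plain k-means: the cluster index of each question-answer embedding
    serves as its self-label z_hat for semi-supervision."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
# two well-separated toy "embedding" blobs standing in for Sent2Vec vectors
X = np.vstack([rng.normal(0, 0.1, (10, 5)), rng.normal(5, 0.1, (10, 5))])
labels, _ = kmeans(X, k=2, n_iter=10, rng=rng)
```

On such well-separated blobs the two clusters recover the generating groups; on real question-answer embeddings the clusters are only a noisy but useful proxy for interpretations.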
SQuAD Accuracy

                      L_ss
Model          n_i    EM      F1
DrQA            -     70.28   79.50
APSN add 1      5     70.91   80.32
APSN mul 1      5     70.87   79.86
APSN mul 2     10     71.30   80.33
APSN conv4 1    5     71.29   80.72
APSN conv5 1    5     71.88   81.09

Table 2: Evaluation of different architectures for obtaining the adapted parameters W^new, among modules with additive (add), multiplicative (mul) and convolutional (conv) operations. In "APSN conv5 1", the number 1 corresponds to the number of layers of the multi-layered SRU whose parameters get adapted; the number 5 denotes the number of layers in the convolutional architecture.

The empirically obtained evaluation results in Table 2 indicate that convolutional operations for adapting parameters are the most effective with our interpretation policy.

The performance of the model in Table 1 illustrates that the accuracy improves as the number of latent interpretations n_i increases from 3 to 5, and then goes down. Also, it is crucial to find a proper number of layers in the convolutional parameter adaptation module individually for each value of n_i. The policy gradient framework consistently improves on the accuracy achieved by applying solely the semi-supervised variational inference training. The APSN model outperforms the baseline DrQA in all cases.

Interestingly, when the regularization objective is used, the model arrives at its best performance on SQuAD after being fine-tuned with PG, compared to the case with a single L_ss objective. The latter case suffers from the mode collapse, in which the central parameters of the SRU, W, get adapted only for a single task. In this case W becomes insensitive to changes in the latent question interpretation neuron.

Analysis of Samples

The sample answers based on the induced values of the latent interpretation are illustrated in Table 3. Among the generated spans, some contain new sequences that have no word overlap with the first option of the ground truth (which the model was trained with) but are nevertheless plausible answers (samples #1-3 in Table 3, marked in violet). This was the main goal of the interpretation neuron. Other things to note:

1. While the model was trained with only a single answer per question, it is able to find multiple alternative answers in cases when several different options are included in the gold reference (samples #4-6).

2. We also note that the predicted spans of some interpretations are implicitly related to the correct answer by a causal relationship (samples #7, #8). In such cases, the produced answers contain helpful information about the ground truth even when they do not directly answer the question. It may be a valuable path for future investigations to use such spans as an intermediate step for refining the final answers.

3. A paraphrasing behaviour of a question (sample #9) may be useful in making a question-answering model elicit the best answers [Buck et al., 2017].

4. In 80% of cases, the model finds a span that has an overlap with a true answer but either contains additional words (samples #10-12 answer the questions more thoroughly) or is more concise. This can be interpreted as some people being more talkative while others are laconic.

Table 3: Sample answers from the APSN model with n_i = 5, produced by inducing the value of a latent interpretation given the document D (only a part of it is shown) and a question Q on the SQuAD validation set. In this dataset some questions contain several gold reference answers A; however, during training only a single answer per question was used. The tuple (1 33.3) represents the value of a latent interpretation, 1, and the F1 score, 33.3%. In each sample, two predicted answers are shown, among which the one whose tuple is highlighted in bold was chosen by the policy during testing.
1 A: [’high supply’, ’low demand’] (0 0.0) willing to work a large amount of time (1 33.3) high supply) competing for a job that few require (low demand ITV Tyne Tees was based at City Road for over 40 years after its launch in January 1959. In 2005 it moved to a new D: facility on The Watermark business park next to the MetroCentre in Gateshead. Q: Where did ITV Tyne Tees move in 2005? 2 A: [’a new facility’, ’The Watermark business park’] (1 100.0) The Watermark business park (2 0.0) Gateshead D: It is believed that the civilization was later devastated by the spread of diseases from Europe, such as smallpox. Q: What was believed to be the cause of devastation to the civilization? 3 A: [’spread of diseases from Europe’] (1 0.0) smallpox (4 100.0) spread of diseases from Europe For Luther, also Christ’s life, when understood as an example, is nothing more than an illustration of the Ten D: Commandments, which a Christian should follow in his or her vocations on a daily basis. Q: What should a Christian follow in his life? 4 A: [’Ten Commandments’, ’his or her vocations on a daily basis’] (1 100.0) Ten Commandments (4 72.7) vocations on a daily basis dynamos in a power house six miles away were repeatedly burned out, due to the powerful high frequency currents set D: up in them, and which caused heavy sparks to jump through the windings and destroy the insulation Q: What did the sparks do to the insulation? 5 A: [’destroy’, ’jump through the windings and destroy the insulation’] (2 100.0) jump through the windings and destroy the insulation (3 100.0) destroy The situation in New France was further exacerbated by a poor harvest in 1757, a difficult winter, and the allegedly D: corrupt machinations of François Bigot, the intendant of the territory. Q: What other reason caused poor supply of New France from a difficult winter? 
6 A: [’poor harvest’, ’allegedly corrupt machinations of François Bigot’] (0 100.0) poor harvest (1 80.0) the allegedly corrupt machinations of François Bigot, the intendant of the territory As the D-loop moves through the circular DNA, it adopts a theta intermediary form, also known as a Cairns replication D: intermediate, and completes replication with a rolling circle mechanism. Q: What is a Cairns replication intermediate? 7 A: [’a theta intermediary form’] (0 0.0) a rolling circle mechanism (1 100.0) a theta intermediary form Research shows that student motivation and attitudes towards school are closely linked to student-teacher relationships. D: Enthusiastic teachers are particularly good at creating beneficial relations with their students. Q: What type of relationships do enthusiastic teachers cause? 8 A: [’beneficial’] (0 0.0) student-teacher (4 66.7) beneficial relations D: Thus, the marginal utility of wealth per person (”the additional dollar”) decreases as a person becomes richer. Q: What the marginal utility of wealth per income per person do as that person becomes richer? 9 A: [’decreases’] (0 100) decreases (4 0.0) the additional dollar 52 Tenth International Workshop Modelling and Reasoning in Context (MRC) – 13.07.2018 – Stockholm, Sweden D: Plastoglobuli (...), are spherical bubbles of lipids and proteins about 45–60 nanometers across. Q: What shape are plastoglobuli? 10 A: [’spherical bubbles’, ’spherical’] (1 100.0) spherical (2 36.4) spherical bubbles of lipids and proteins about 45–60 nanometers This behaviour started with his learning of the execution of Johann Esch and Heinrich Voes, the first individuals to be D: martyred by the Roman Catholic Church for Lutheran views Q: Why were Johann Esch and Heinrich Voes executed by the Catholic Church? 
11 A: ['for Lutheran views', 'Lutheran views']
   (0 100.0) Lutheran views
   (1 40.0) the first individuals to be martyred by the Roman Catholic Church for Lutheran views

D: the rainforest could be threatened though the 21st century by climate change in addition to deforestation
Q: What are the main threats facing the Amazon rainforest in the current century?
12 A: ['climate change in addition to deforestation']
   (0 100.0) climate change in addition to deforestation
   (3 50.0) climate change

D: protesters attempted to enter the test site knowing that they faced arrest (...) they stepped across the "line" and were immediately arrested
Q: What was the result of the disobedience protesting the nuclear site?
13 A: ['arrest', 'were immediately arrested']
   (1 50.0) they faced arrest
   (2 0.0) Heistler

D: Oxfam's claims have however been questioned on the basis of the methodology used: by using net wealth (adding up assets and subtracting debts), the Oxfam report, for instance, finds that there are more poor people in the United States and Western Europe than in China (due to a greater tendency to take on debts). Anthony Shorrocks, the lead author of the Credit Suisse report which is one of the sources of Oxfam's data, considers the criticism about debt to be a "silly argument" and "a non-issue . . . a diversion".
Q: Why does Oxfam and Credit Suisse believe their findings are being doubted?
14 A: ['a diversion', 'there are more poor people in the United States and Western Europe than in China']
   (1 100.0) there are more poor people in the United States and Western Europe than in China
   (2 0.0) the criticism about debt to be a "silly argument"

Thus, the APSN clearly has multiple modes of understanding the question and, therefore, answering it.
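The scores in parentheses next to each interpretation value are consistent with SQuAD's standard token-level F1 (in percent) between the induced answer and a ground-truth span; for instance, "climate change" against the six-token gold answer in example 12 scores exactly 50.0. The following is a minimal sketch of that metric for illustration, not the authors' evaluation code; the official SQuAD script additionally strips punctuation and articles before tokenizing, which this sketch omits.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted span and a gold answer,
    in the style of the SQuAD evaluation (whitespace tokens,
    lowercasing only; no punctuation/article stripping here)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as
    # often as it occurs in both spans.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example 3: "smallpox" shares no tokens with the gold answer,
# while the exact span scores 1.0.
print(token_f1("smallpox", "spread of diseases from Europe"))  # 0.0
print(token_f1("spread of diseases from Europe",
               "spread of diseases from Europe"))              # 1.0
# Example 12: a 2-token sub-span of a 6-token gold answer.
print(token_f1("climate change",
               "climate change in addition to deforestation"))  # 0.5
```

Multiplying these scores by 100 reproduces the percentages shown with the sample answers above, up to the extra normalization performed by the official script.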
In a fair amount of cases the model produces sub-spans or super-spans, failing to detect multiple question interpretations. Further work is needed to establish whether having a single question interpretation is a property of SQuAD or of language in general.

7 Conclusion and Future Works

In this paper we have proposed a training framework and the APSN model for learning question interpretations that help to find various valid answers within the same document. The role of the discrete interpretation neuron is to make the central weights W more sensitive to a particular interpretation. This allows the model to implement multiple modes of answering, since these weights control the document representations that are used to produce an answer. An important implication of this study is that first updating the latent distribution with rewards from a variational lower bound and then fine-tuning the final policy with rewards from answer accuracy provides an effective learning approach for the network. The sample answers with induced latent interpretations indicate that the model has successfully discovered multiple ways of understanding a given question. Lastly, empirical evaluation results on SQuAD suggest that integrating the APSN into the baseline DrQA is an effective approach to question answering.

A single sentence in one language can be mapped to multiple variants in another, so another direction worth investigating is to connect the APSN with a machine translation model. There, the APSN would learn a complex distribution over interpretations when mapping source sentences to target sentences, and the latent interpretation neuron could be seen as multiple personas translating a sentence.

The APSN module is currently integrated with the question-answering model DrQA; however, we believe that other baseline models could bring more insights and better results. It may also be fruitful to apply the RELAX framework to compute a low-variance gradient estimator for the APSN model instead of semi-supervised variational inference, given its outstanding performance in a game domain. Further research in this area could make the multi-interpretation approach a standard component in building answering systems.

Acknowledgments

This work was supported by the Brain Korea 21+ Project, BK Electronics and Communications Technology Division, KAIST in 2018. This research was also funded by the Hyundai NGV Company (Project No. G01170378).

References

[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[Buck et al., 2017] Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830, 2017.
[Chen et al., 2017] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[Gatys et al., 2015] Leon Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[Gatys et al., 2016] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.
[Grathwohl et al., 2017] Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
[Higgins et al., 2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Honnibal and Montani, 2017] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Kingma and Welling, 2013] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[Kingma et al., 2014] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[Lei and Zhang, 2017] Tao Lei and Yu Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.
[Li et al., 2016] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
[Miao and Blunsom, 2016] Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. arXiv preprint arXiv:1609.07317, 2016.
[Mnih and Gregor, 2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
[Pagliardini et al., 2017] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv preprint arXiv:1703.02507, 2017.
[Paulus et al., 2017] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[Rumelhart et al., 1986] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
[Seo et al., 2016] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[Sutton et al., 1999] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
[Sutton et al., 2000] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[Vinyals and Le, 2015] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
[Vinyals et al., 2015a] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
[Vinyals et al., 2015b] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[Wang and Jiang, 2016] Shuohang Wang and Jing Jiang. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905, 2016.
[Wang et al., 2017] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198, 2017.
[Wen et al., 2017] Tsung-Hsien Wen, Yishu Miao, Phil Blunsom, and Steve Young. Latent intention dialogue models. arXiv preprint arXiv:1705.10229, 2017.
[Xiong et al., 2016] Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604, 2016.
[Xiong et al., 2017] Caiming Xiong, Victor Zhong, and Richard Socher. DCN+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106, 2017.