Cross Attention for Selection-based Question Answering

Alessio Gravina, Federico Rossetto, Silvia Severini and Giuseppe Attardi
Dipartimento di Informatica, Università di Pisa
gravina.alessio@gmail.com, fedingo@gmail.com, sissisev@gmail.com, attardi@di.unipi.it

Abstract. Answer Sentence Selection (ASS) is one of the steps typically involved in Question Answering, a hard task for natural language processing since full solutions would require both natural language understanding and world knowledge. We present a new approach to tackle ASS, based on a Cross-Attentive Convolutional Neural Network. The approach was designed for competing in the Fujitsu AI-NLP Challenge [4], which evaluates systems on their performance on the SelQA [7] dataset. This dataset was created on purpose as a benchmark to stress the ability of systems to go beyond simple word co-occurrence criteria. Our submission achieved the top score in the challenge.

1 Introduction

Typical approaches to Question Answering involve primarily the following steps: question analysis, which determines what to look for; candidate extraction, which exploits Information Retrieval (IR) techniques to search through documents for candidate answers; answer selection, which prunes the set of candidates; and answer extraction, which extracts the correct answer from the selected sentences.

Given large enough document collections, IR techniques are often capable of providing satisfactory results for both candidate extraction and answer selection. However, relying on simple keyword matching is not sufficient when question and answer do not match closely enough, e.g. when the question is phrased in terms different from those in the document containing the answer. More sophisticated techniques have been proposed, such as query rewriting or query expansion [6, 8, 18], involving for example dictionaries of synonyms or word embeddings [9], or using topic modeling to identify a shared latent topic between question and answer [19]. These approaches fail, though, when deeper knowledge is required, for example world knowledge or inference from given facts.

Answer Sentence Selection is an important sub-task of Question Answering that aims at selecting the correct answers to a given question among a set of candidate sentences. Answer extraction involves Natural Language Processing techniques for interpreting candidate answer sentences and establishing how they relate to questions.

More sophisticated methods of ASS that go beyond IR approaches involve, for example, tree edit models [5] and semantic distances based on word embeddings [15]. Recently, Deep Neural Networks have also been applied to this task [11], providing performance improvements over previous techniques. The most common approaches exploit either recurrent or convolutional neural networks. These models are good at capturing contextual information from sentences, making them a good fit for the problem of answer sentence selection. The improvements in the state of the art on ASS over the years are listed in [1], with the current top score of 0.863 MRR [2] on the TREC QA dataset reported by Tayyar Madabushi et al. [14].

Research on this problem has benefited in the last few years from the development of specific datasets for training systems on this task, like SelQA [7]. This dataset is notable for its larger size, reaching more than 60,000 question-sentence pairs, which allows for the creation of deeper and more complex models at little risk of over-fitting.
Moreover, the SelQA dataset has been specifically crafted to be hard to handle for systems based on purely Information Retrieval techniques that rely on word co-occurrences: all questions were paraphrased using different terms, in order to ensure that solutions would involve more sophisticated techniques such as reading comprehension capabilities.

In this paper we present a new model for the task of answer sentence selection that improves on the current state-of-the-art performance. The model relies on a Convolutional Neural Network with a double mechanism of attention between question and answer. It is inspired by the light attentive mechanism proposed by Yin and Schütze [17], which we improve and apply in both directions to question and answer pairs.

In the sections below we first survey the most relevant literature, highlighting the Question Answering context into which our model fits. We then explain the model architecture and the results achieved in our experiments on the SelQA dataset.

2 Related work

Deep learning (DL) approaches have been exploited for the task of answer selection, significantly outperforming traditional methods. Attention-based mechanisms have shown very promising results on a variety of NLP tasks and have recently been proposed also for the answer selection task. In particular, we mention the approaches based on either Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM) networks, with various types of attention mechanisms, for example the attentive pooling network by dos Santos et al. [12] and the LSTM-based models for non-factoid answer selection by Tan et al. [13].

dos Santos et al. [12] introduce the mechanism of attentive pooling, which makes the pooling layer aware of the current question/answer pair, so that information from the two items influences the computation of each other's representation. This enables joint learning of the representations of the input pairs as well as of a measure of their similarity. An attention vector is created, which guides the subsequent pooling. This model is able to embed two inputs that are not semantically comparable into a common representation space, to handle input pairs of different lengths, and to remain independent of the underlying representation learning, be it a CNN or an RNN. Attentive pooling can be effectively used with CNNs (AP-CNN) and biLSTMs (AP-biLSTM) in the context of the answer selection task, achieving the best reported results on the WikiQA dataset.

Tan et al. [13] present four Deep Learning models for answer selection based on biLSTM and CNN, with different complexities and capabilities. The basic model, called QA-LSTM, implements two similar flows, one for the question and one for the answer: a biLSTM creates a representation of the question/answer, which is processed by a max or average pooling layer. The two flows are then merged with a cosine similarity matching that expresses how close question and answer are, as in the sketch below.
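To make the matching step concrete, the following is a minimal sketch of the final scoring step of a QA-LSTM-style model: cosine similarity between the pooled question and answer representations. The vectors stand in for biLSTM-plus-pooling outputs, and all names here are our own, not from [13].

```python
import numpy as np

def cosine_similarity(q: np.ndarray, a: np.ndarray) -> float:
    # Cosine of the angle between the pooled question and answer vectors.
    return float(q @ a / (np.linalg.norm(q) * np.linalg.norm(a)))

# Toy pooled representations standing in for biLSTM + pooling outputs.
rng = np.random.default_rng(0)
question = rng.standard_normal(128)
candidates = [rng.standard_normal(128) for _ in range(5)]

# The candidate most similar to the question is selected as the answer.
best = max(candidates, key=lambda a: cosine_similarity(question, a))
```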
More complexity is obtained with QA-LSTM/CNN, a model similar to the previous one which, instead of the pooling layer, exploits a more complex CNN: the output of the biLSTM is sent to a convolution filter, in order to give a more complete representation of questions and answers, and this filter is followed by a 1-max pooling layer and a fully connected layer. Finally, the paper presents the most complex models, QA-LSTM with attention and QA-LSTM/CNN with attention, which extend the previous models with the addition of a simple attention mechanism between question and answer, aimed at better identifying the best candidate answer to the question. The mechanism consists in multiplying the biLSTM hidden units of the answer with the output computed by the question pooling layer. These models are tested on the InsuranceQA [3] and TREC-QA [16] datasets, achieving quite good performance.

Wang et al. [15] propose an approach to answer selection that takes into account similarities and dissimilarities between sentences by decomposing and composing lexical semantics over sentences. In particular, they represent each word as a vector and calculate a semantic matching vector for each word based on all the words in the other sentence. Each word vector is then decomposed into a similar and a dissimilar component, based on the semantic matching vector. A CNN is then used to capture features by composing these parts and to estimate a similarity score over the composed feature vectors, predicting which sentence is the answer to the question.

The most influential work for our approach is the one on attentive convolution by Yin and Schütze [17]. The authors apply an attention mechanism not only to the pooling operation, as in dos Santos et al. [12], but also to the convolutional layer itself. They present two different models: a simpler mechanism called light attentive ConvNet, and a more complex one where the attention computations and the convolution itself are kept separate. This type of model is quite effective at comparing a text with a reference text, and has been tested on many different applications, like Textual Entailment, Answer Sentence Selection and Text Classification. On all the tested tasks it achieved state-of-the-art performance, overcoming previous models applied to those tasks.

We now describe in more detail the light attentive ConvNet, since it is the foundation of the model that we present in the next section.

2.1 Light Attentive ConvNets

The aim of the model presented by Yin and Schütze [17] is to compute a representation for the main sentence in such a way that convolution filters encode not only the local context, but also an attentive context over the reference sentence.

A first layer generates the attentive context vector. To do this, an energy function is used to determine how much each hidden state in the reference sentence is relevant to the current hidden state of the question. Then the average of the hidden states of the sentence, weighted by these matching scores, is computed in order to obtain the attentive context for the current question hidden state.

After this layer comes the Attentive Convolution layer proper. This layer first performs a standard convolution, without attention, over the window (h_{i-1}, h_i, h_{i+1}), where h_i is the i-th hidden state of the question; secondly, it performs a convolution using the attentive context. The two results are added element-wise, a bias term is added, and a non-linear activation function is applied. The i-th hidden state of the (n+1)-th layer is:

    h_i^{n+1} = tanh(W_1 · [h_{i-1}^n, h_i^n, h_{i+1}^n] + W_2 · c_i^n + b)    (1)

where W_1 ∈ R^{d×3d} and W_2 ∈ R^{d×d} are weights, b ∈ R^d is the bias, and c_i^n is the i-th attentive context of the n-th layer.
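To make equation (1) concrete, the following is a minimal numpy sketch of one light attentive convolution step, under our reading of [17]: dot-product energies between question and reference hidden states (the exact energy function is a choice), a softmax over them, a weighted average as the attentive context, and the two convolution terms summed inside the tanh. All variable names are ours.

```python
import numpy as np

def light_attentive_convolution(H_q, H_s, W1, W2, b):
    """One step of equation (1).

    H_q: question hidden states, shape (lq, d)
    H_s: reference-sentence hidden states, shape (ls, d)
    W1:  window weights, shape (d, 3d); W2: attentive weights, shape (d, d)
    b:   bias, shape (d,)
    """
    lq, d = H_q.shape

    # Energy function: here a plain dot product between hidden states.
    energies = H_q @ H_s.T                                    # (lq, ls)
    e = np.exp(energies - energies.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)                # softmax per row

    # Attentive context c_i: weighted average of the reference states.
    C = weights @ H_s                                         # (lq, d)

    # Build the (h_{i-1}, h_i, h_{i+1}) window for every position i.
    padded = np.vstack([np.zeros(d), H_q, np.zeros(d)])
    windows = np.hstack([padded[:-2], padded[1:-1], padded[2:]])  # (lq, 3d)

    # Standard convolution over the window plus the attentive term.
    return np.tanh(windows @ W1.T + C @ W2.T + b)             # (lq, d)
```

Stacking this step yields the multi-layer hidden states h^n of equation (1).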
3 Model description

In this section we describe the architecture of our model, illustrated in Figure 1; the detailed design of the cross-attentive layer can be found in Figure 2.

3.1 Network Architecture

At first, the inputs are transformed by an embedding layer initialized with the GloVe word embeddings. Then we apply a number of stacked Cross-Attentive Convolutional layers, whose construction is explained in Section 3.2. For each layer, we apply a global max pooling to extract both a question and a sentence representation for that layer.

These representations are then concatenated together to obtain two vectors Q and S that represent each question/answer sentence pair. These vectors are both added and multiplied, and the results are concatenated before being fed to a simple Feed-Forward Neural Network. Finally, we take the predicted distribution from this network, augment it with additional information, and feed it to a Logistic Regression layer. Based on our experiments, we found it useful to supply this final classifier with some Information Retrieval features, like tf-idf or the number of co-occurring words.

[Fig. 1: Cross-attentive Convolutional Network]

3.2 Cross-Attentive Convolutional Layer

This section describes the Cross-Attentive Convolutional Layer, as shown in Figure 2. Our solution is derived from the Attentive Convolution layer presented in [17]: we use the light attentive mechanism described there in both directions, between the question and the candidate answer.

[Fig. 2: Cross-attentive Convolutional Network]

The basic idea is to use a function f that creates a similarity matrix for the two sentence representations that we are convolving:

    f : Q, S → A,    R^{l×e} × R^{l×e} → R^{l×l}

where l is the fixed (padded) length of each sentence, and e is the dimension of the embeddings in use. After generating the similarity matrix A, we apply a softmax function to normalize its columns c_i and its rows r_j. We then use these weighting vectors to create the attentive context for each of the two sentences; the context is transformed by a Dense layer and added to the base convolution.

As a final operation we route the results in two directions. Towards the next layer, we apply a Max-Pooling operator with a default 2-by-2 window. In the other direction, we apply a global pooling to extract a layer representation of the sentence. These per-layer representations are finally concatenated as described above. A minimal sketch of the whole layer follows.
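The numpy sketch below shows one cross-attentive convolutional layer under our assumptions: f is a dot product, the softmax-weighted contexts are computed in both directions from the same matrix A, and the weights (W1, W2, b) are shared between the two directions. None of these choices is spelled out above, so treat this as an illustration rather than the exact implementation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_conv(H, W, b):
    # Standard convolution over (h_{i-1}, h_i, h_{i+1}) windows, as in eq. (1).
    d = H.shape[1]
    padded = np.vstack([np.zeros(d), H, np.zeros(d)])
    windows = np.hstack([padded[:-2], padded[1:-1], padded[2:]])
    return windows @ W.T + b

def max_pool_2x2(H):
    # 2-by-2 max pooling over the (length, feature) grid.
    l, d = H.shape
    return H[: l // 2 * 2, : d // 2 * 2].reshape(l // 2, 2, d // 2, 2).max(axis=(1, 3))

def cross_attentive_layer(Q, S, W1, W2, b):
    """One cross-attentive convolutional layer.

    Q, S: question / sentence representations, shape (l, e).
    Returns the two inputs for the next layer and the two
    globally-pooled layer representations.
    """
    A = Q @ S.T                      # similarity matrix, f as a dot product

    # Rows weight the sentence states for each question position;
    # columns weight the question states for each sentence position.
    ctx_Q = softmax(A, axis=1) @ S
    ctx_S = softmax(A, axis=0).T @ Q

    # Dense-transformed attentive context added to the base convolution.
    H_Q = np.tanh(window_conv(Q, W1, b) + ctx_Q @ W2.T)
    H_S = np.tanh(window_conv(S, W1, b) + ctx_S @ W2.T)

    # Two directions: 2x2 max pooling feeds the next layer, global max
    # pooling yields this layer's question/sentence representations.
    return max_pool_2x2(H_Q), max_pool_2x2(H_S), H_Q.max(axis=0), H_S.max(axis=0)
```

Stacking such layers and concatenating the globally-pooled outputs yields the vectors Q and S of Section 3.1.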
4 Experimental Results

We investigate the performance of our model on the SelQA dataset. The dataset requires a preprocessing phase before its data can be fed to the model. As the evaluation metric we use the Mean Reciprocal Rank (MRR), discussed below. After presenting the results of our model in terms of MRR, we compare them with the state-of-the-art outcomes and finally we include some discussion with an error analysis.

4.1 Data preprocessing

Before feeding the data to the model, some preprocessing of the sentences is needed. This operation consists of the following steps:

1. removal of non-ASCII characters, to improve the coverage of words present in the GloVe [10] embeddings, which are used to initialize the embedding layer;
2. replacement of digits with the 0 character, to reduce the number of tokens denoting numbers, since for the task at hand they have similar meanings;
3. removal of punctuation, since it does not carry much relevance when dealing with single sentences; this also improves the tokenization process;
4. normalization of sentences to a fixed length of 50 by adding padding, in order to feed them to our Neural Network.

The sentence representation consists of the list of word embeddings of its tokens. Each sentence is tokenized using the Keras Tokenizer API and the vector of each token is looked up in the GloVe word embeddings: if no exact match is found, the closest token within an edit distance of 2 is used, if present; otherwise the word embedding for the unknown token is used. A sketch of these steps and of the lookup follows.
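The following is a minimal sketch of preprocessing steps 1-4 and of the embedding lookup with its edit-distance-2 fallback. The actual system tokenizes with the Keras Tokenizer; here we use a plain whitespace split and a textbook Levenshtein distance, and the linear scan over the vocabulary is only adequate for a sketch.

```python
import re
import string

def preprocess(sentence: str, max_len: int = 50) -> list:
    """Steps 1-4: strip non-ASCII, map digits to 0, drop punctuation, pad."""
    s = sentence.encode("ascii", errors="ignore").decode()        # step 1
    s = re.sub(r"\d", "0", s)                                     # step 2
    s = s.translate(str.maketrans("", "", string.punctuation))    # step 3
    tokens = s.lower().split()[:max_len]
    return tokens + ["<pad>"] * (max_len - len(tokens))           # step 4

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used for the fallback lookup."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lookup(token, glove, unk_vector):
    """Exact GloVe match, else nearest word within edit distance 2, else <unk>."""
    if token in glove:
        return glove[token]
    near = [w for w in glove if edit_distance(token, w) <= 2]
    return glove[min(near, key=lambda w: edit_distance(token, w))] if near else unk_vector
```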
4.2 Evaluation Metrics

As the metric to measure the accuracy of our model, we used the Mean Reciprocal Rank (MRR) [2], a statistical measure for evaluating a process that produces a ranked list of possible responses to each query in a test sample. The mean reciprocal rank is the average of the reciprocals of the ranks of the first correct result over a sample of queries Q:

    MRR = (1 / |Q|) · Σ_{i=1}^{|Q|} 1 / rank_i

where rank_i refers to the rank position of the first correct result for the i-th query.
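As a worked example of the formula, a minimal implementation follows; ranks are the 1-based positions of the first correct answer for each query.

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/|Q|) * sum of 1/rank_i over the queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# First correct answers ranked 1st, 2nd and 4th for three queries:
# (1/1 + 1/2 + 1/4) / 3 = 0.5833...
print(mean_reciprocal_rank([1, 2, 4]))
```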
4.3 SelQA dataset

We tested our model on the SelQA dataset [7]. The dataset introduces a corpus annotation scheme that supports the generation of large, diverse and challenging datasets by explicitly aiming to reduce word co-occurrences between questions and answers. SelQA consists of questions generated through crowd-sourcing and of answer sentences drawn from the ten most prevalent topics in the English Wikipedia. A total of 486 articles were uniformly sampled from the topics Arts, Country, Food, Historical Events, Movies, Music, Science, Sports, Travel and TV. The original data was then split into smaller chunks, resulting in 8,481 sections, 113,709 sentences and 2,810,228 tokens. For each section, human annotators generated a question that can be answered in that same section by one or more sentences, and selected the corresponding sentence or sentences that answer it. As an additional noise process, annotators were also asked to create another set of questions from the same sections, excluding the sentences originally selected as answers. Then all questions were paraphrased using different terms, to make sure that a QA algorithm would be evaluated on reading comprehension rather than on its ability to count word co-occurrences. Lastly, any ambiguous questions found were rephrased again by a human annotator.

Table 1 shows the results obtained by our model and by the two models proposed by Jurczyk et al. [7]. The result on the test set outperforms the other two models by more than 3%.

Table 1: SelQA results

                         Validation MRR    Test MRR
  Cross-Attentive CNN        91.37%         90.61%
  CNN SelQA                  86.67%         85.68%
  RNN SelQA                  88.25%         87.59%

5 Conclusions

The aim of this work was to improve on the state-of-the-art results for the task of Answer Sentence Selection using an architecture based on Convolutional Neural Networks. We implemented a Cross-Attentive CNN and tested it on the SelQA dataset. The experiments show that our model beats the current state of the art on SelQA; more precisely, we achieved over 90% MRR on the test set. We think this is due to two main factors. First, the dataset is fairly new and has not yet been experimented with extensively. Second, our model has a deep structure and a large number of parameters, which means that with more data its performance might improve further.

The strong points of our model are its simplicity and its ability to generalize. The simplicity is shown, for example, by the speed of the training phase, which took only two hours on a 24-core machine without any GPU acceleration.

An interesting future development would be to test our model on other datasets reported in the literature, in order to obtain a more direct comparison with published results.

6 Acknowledgments

The experiments were carried out on a Dell server with 4 Nvidia Tesla P100 GPUs, partly funded by the University of Pisa under grant Grandi Attrezzature 2016. We thank Fujitsu for organizing the challenge and giving us the opportunity to participate in a stimulating experiment.

Bibliography

[1] ACL: Question answering (state of the art). https://aclweb.org/aclwiki/Question_Answering_(State_of_the_art) (2018), accessed: 2018-10-30
[2] Craswell, N.: Mean Reciprocal Rank. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems. Springer US, Boston, MA (2009), https://doi.org/10.1007/978-0-387-39940-9_488
[3] Feng, M., Xiang, B., Glass, M.R., Wang, L., Zhou, B.: Applying deep learning to answer selection: A study and an open task. arXiv preprint arXiv:1508.01585 (2015)
[4] Fujitsu: Fujitsu AI-NLP Challenge. https://openinnovationgateway.com/ai-nlp-challenge/challenge.php (2018), accessed: 2018-05-30
[5] Heilman, M., Smith, N.A.: Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10. pp. 1011–1019. Association for Computational Linguistics (2010)
[6] Jeon, J., Croft, W.B., Lee, J.H.: Finding similar questions in large question and answer archives. In: Proc. of the 14th ACM International Conference on Information and Knowledge Management (CIKM 2005). pp. 84–90. ACM (2005)
[7] Jurczyk, T., Zhai, M., Choi, J.D.: SelQA: A new benchmark for selection-based question answering. pp. 820–827 (2016)
[8] Komiya, K., Abe, Y., Morita, H., Kotani, Y.: Question answering system using Q&A site corpus query expansion and answer candidate evaluation. SpringerPlus 396(2), 1–11 (2013)
[9] Kuzi, S., Shtok, A., Kurland, O.: Query expansion using word embeddings. In: Proc. of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016). pp. 1929–1932. ACM (2016)
[10] Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014), http://www.aclweb.org/anthology/D14-1162
[11] Rao, J., He, H., Lin, J.: Noise-contrastive estimation for answer selection with deep neural networks. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). pp. 1913–1916. ACM (2016)
[12] dos Santos, C.N., Tan, M., Xiang, B., Zhou, B.: Attentive pooling networks. CoRR (2016)
[13] Tan, M., Xiang, B., Zhou, B.: LSTM-based deep learning models for non-factoid answer selection. CoRR abs/1511.04108 (2015), http://arxiv.org/abs/1511.04108
[14] Tayyar Madabushi, H., Lee, M., Barnden, J.: Integrating question classification and deep learning for improved answer selection. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 3283–3294. Association for Computational Linguistics (2018), http://aclweb.org/anthology/C18-1278
[15] Wang, Z., Mi, H., Ittycheriah, A.: Sentence similarity learning by lexical decomposition and composition. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. pp. 1340–1349. The COLING 2016 Organizing Committee (2016), http://www.aclweb.org/anthology/C16-1127
[16] Yao, X., Van Durme, B., Callison-Burch, C., Clark, P.: Answer extraction as sequence tagging with tree edit distance. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 858–867 (2013)
[17] Yin, W., Schütze, H.: Attentive convolution. CoRR (2017)
[18] Yu, Z.T., Zheng, Z.Y., Tang, S.P., Guo, J.Y.: Query expansion for answer document retrieval in Chinese question answering system. In: Proc. of the 2005 International Conference on Machine Learning and Cybernetics. pp. 72–77 (2005)
[19] Zhang, K., Wu, W., Wu, H., Li, Z., Zhou, M.: Question retrieval with high quality answers in community question answering. pp. 371–380 (2014)