A Comparative Study of Models for Answer Sentence Selection

Alessio Gravina, Federico Rossetto, Silvia Severini, Giuseppe Attardi
Università di Pisa
gravina.alessio@gmail.com, fedingo@gmail.com, sissisev@gmail.com, attardi@di.unipi.it

All authors contributed equally to this manuscript. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Answer Sentence Selection is one of the steps typically involved in Question Answering. Question Answering is considered a hard task for natural language processing systems, since full solutions would require both natural language understanding and inference abilities. In this paper, we explore how the state of the art in answer selection has improved recently, comparing two of the best proposed models for tackling the problem: the Cross-attentive Convolutional Network and the BERT model. The experiments are carried out on two datasets, WikiQA and SelQA, both created for and used in open-domain question answering challenges. We also report on cross-domain experiments with the two datasets.

1 Introduction

Answer Sentence Selection is an important subtask of Question Answering that aims at selecting the sentence containing the correct answer to a given question among a set of candidate sentences. Table 1 shows an example of a question and a list of its candidate answers, taken from the SelQA dataset (Jurczyk et al., 2016). The last column contains a binary value, representing whether the sentence contains the answer or not.

Table 1: Sample question/candidate answers.
  Question: How much cholesterol is there in an ounce of bacon?
  One rasher of cooked streaky bacon contains 5.4g of fat, and 4.4g of protein.   0
  Four pieces of bacon can also contain up to 800mg of sodium.                    0
  The fat and protein content varies depending on the cut and cooking method.     0
  Each ounce of bacon contains 30mg of cholesterol.                               1

Answer extraction involves natural language processing techniques for interpreting candidate sentences and establishing whether they relate to questions and contain an answer. More sophisticated methods of Answer Sentence Selection that go beyond Information Retrieval approaches involve for example tree edit models (Heilman and Smith, 2010) and semantic distances based on word embeddings (Wang et al., 2016).

Recently, Deep Neural Networks have also been applied to this task (Rao et al., 2016), providing performance improvements over previous techniques. The most common approaches exploit either recurrent or convolutional neural networks. These models are good at capturing contextual information from sentences, making them a good fit for the problem of answer sentence selection.

Research on this problem has benefited in the last few years from the development of better datasets for training systems on this task. These datasets include WikiQA (Yang et al., 2015) and SelQA (Jurczyk et al., 2016). The latter is notable for its larger size, which reaches more than 60,000 sentence-question pairs. This allows for the creation of deeper and more complex models, with less risk of overfitting.

The state-of-the-art model on the SelQA dataset (Jurczyk et al., 2016), up to 2018, was the Cross-attentive Convolutional Network (Gravina et al., 2018), with a score of 0.906 MRR (Craswell, 2009).

In this paper we present further experiments with the Cross-attentive Convolutional Network model as well as experiments that exploit the BERT language model by Devlin et al. (2018). In the following sections we survey relevant literature on the topic, describe the datasets used in our experiments and present the models tested. Finally, we describe the experiments conducted with these models and report the results achieved.
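All results in this paper are reported in terms of MRR (Mean Reciprocal Rank, Craswell, 2009): for each question, the candidate sentences are ranked by the model score and the reciprocal rank of the first correct sentence is averaged over all questions. As a reference, the following minimal Python sketch computes the measure; it is illustrative and not code from the paper, and the function and variable names are ours.

    def mean_reciprocal_rank(questions):
        """Each element of `questions` is a list of (score, is_correct) pairs,
        one per candidate sentence of that question."""
        total = 0.0
        for candidates in questions:
            # Rank candidates by descending model score.
            ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
            rr = 0.0
            for rank, (_, is_correct) in enumerate(ranked, start=1):
                if is_correct:
                    rr = 1.0 / rank  # reciprocal rank of the first correct sentence
                    break
            total += rr
        return total / len(questions)

    # Example: the correct sentence is ranked 2nd for the first question
    # and 1st for the second, giving MRR = (1/2 + 1) / 2 = 0.75.
    print(mean_reciprocal_rank([
        [(0.9, 0), (0.7, 1), (0.1, 0)],
        [(0.8, 1), (0.3, 0)],
    ]))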
2 Related work

We present a brief survey of the most recent approaches for answer selection in question answering.

Tan et al. (2015) present four Deep Learning models for answer selection based on biLSTM (bidirectional LSTM) and CNN (Convolutional Neural Network), with different complexities and capabilities. The basic model, called QA-LSTM, implements two similar flows, one for the question and one for the answer. The biLSTM builds a representation of the question/answer pair that is passed to a max or average pooling layer. The two flows are then merged with a cosine similarity matching that expresses how close question and answer are.

A more complex solution, called QA-LSTM/CNN, uses a similar model, which replaces the pooling layer with a CNN. The output of the biLSTM is sent to a convolution filter, in order to give a more complete representation of questions and answers. This filter is followed by a 1-max pooling layer and a fully connected layer. Finally, the paper presents the most complex models, QA-LSTM with attention and QA-LSTM/CNN with attention, which extend the previous models with the addition of a simple attention mechanism between question and answer, aiming to better identify the best candidate answer to the question. The mechanism consists in multiplying the biLSTM hidden units of the answers with the output computed from the question pooling layer. These models are tested on the InsuranceQA (Feng et al., 2015) and TREC-QA (Yao et al., 2013) datasets, achieving quite good performance.

The HyperQA (Tay et al., 2017) model uses a pairwise ranking objective to represent the relationship between question and answer embeddings in a hyperbolic space instead of a Euclidean space. This empowers the model with a self-organizing ability and enables automatic discovery of latent hierarchies while learning embeddings of questions and answers.

Wang et al. (2016) present a model that takes into account similarities and dissimilarities between sentences by decomposing and composing lexical semantics over sentences. In particular, the model represents each word as a vector and calculates a semantic matching vector for each word based on all words in the other sentence. Each word vector is then decomposed into a similar and a dissimilar component, based on the semantic matching vector. Afterwards, a CNN model is used to capture features by composing these parts, and a similarity score is estimated over the composed feature vectors to predict which sentence is the answer to the question.

3 Models

We describe here the models used in our experiments.

3.1 Simple Logistic Regression Classifier

Jurczyk et al. (2016) state that the SelQA dataset was created through a process that tried to reduce the number of co-occurring words, so that simple word matching methods would be less effective. To evaluate whether this aim was indeed achieved, we built a simple logistic regression classifier using as features the sentence and question length, the number of co-occurring words and the idf coefficients of the word co-occurrences.
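As an illustration of this kind of baseline, the sketch below builds the features just listed and trains a scikit-learn classifier. The tokenization, the idf source and the feature details are simplifications of our own, not the authors' exact implementation.

    import math
    from collections import Counter
    from sklearn.linear_model import LogisticRegression

    def idf_table(sentences):
        # Document frequency computed over the candidate sentences.
        df = Counter(w for s in sentences for w in set(s.lower().split()))
        n = len(sentences)
        return {w: math.log(n / df[w]) for w in df}

    def features(question, sentence, idf):
        q = question.lower().split()
        s = sentence.lower().split()
        overlap = set(q) & set(s)
        return [
            len(q),                                 # question length
            len(s),                                 # sentence length
            len(overlap),                           # number of co-occurring words
            sum(idf.get(w, 0.0) for w in overlap),  # idf weight of the co-occurrences
        ]

    def train_baseline(pairs):
        # pairs: list of (question, candidate_sentence, label) triples
        idf = idf_table([s for _, s, _ in pairs])
        X = [features(q, s, idf) for q, s, _ in pairs]
        y = [label for _, _, label in pairs]
        return LogisticRegression().fit(X, y), idf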
3.2 Cross-attentive Convolutional Network

The Cross-attentive Convolutional Network (CACN) is a model designed for the task of Answer Sentence Selection and in 2018 had achieved state-of-the-art performance (Gravina et al., 2018). The model relies on a Convolutional Neural Network with a double mechanism of attention between questions and answers. The model is inspired by the light attentive mechanism proposed by Yin and Schütze (2017), which it improves by applying it in both directions to question and answer pairs.

The CACN model achieved the top score in the "Fujitsu AI NLP Challenge 2018" (https://openinnovationgateway.com/ai-nlp-challenge/), which used the SelQA dataset.
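For the full architecture we refer to Gravina et al. (2018). Purely to illustrate the idea of attention applied in both directions over convolutional representations, the following PyTorch sketch re-weights each sentence by its similarity to the other one before pooling and scoring; the dimensions, pooling and scoring choices are our own simplifications and this is not the published CACN model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttentiveEncoder(nn.Module):
        """Illustrative bidirectional (question <-> answer) attention over CNN features."""
        def __init__(self, emb_dim=300, hidden=128, kernel=3):
            super().__init__()
            self.conv = nn.Conv1d(emb_dim, hidden, kernel, padding=kernel // 2)

        def forward(self, q_emb, a_emb):
            # q_emb: (batch, q_len, emb_dim); a_emb: (batch, a_len, emb_dim)
            q = F.relu(self.conv(q_emb.transpose(1, 2))).transpose(1, 2)  # (batch, q_len, hidden)
            a = F.relu(self.conv(a_emb.transpose(1, 2))).transpose(1, 2)  # (batch, a_len, hidden)

            # Similarity between every question position and every answer position.
            sim = torch.bmm(q, a.transpose(1, 2))                # (batch, q_len, a_len)

            # Attention in both directions: each side is re-weighted by how much
            # it matches the other sentence, then max-pooled into one vector.
            q_weights = F.softmax(sim.max(dim=2).values, dim=1)  # (batch, q_len)
            a_weights = F.softmax(sim.max(dim=1).values, dim=1)  # (batch, a_len)
            q_vec = (q * q_weights.unsqueeze(2)).max(dim=1).values
            a_vec = (a * a_weights.unsqueeze(2)).max(dim=1).values

            # Score the question/answer pair with cosine similarity.
            return F.cosine_similarity(q_vec, a_vec, dim=1)

    # Example with random embeddings for a batch of 2 question/answer pairs.
    model = CrossAttentiveEncoder()
    scores = model(torch.randn(2, 12, 300), torch.randn(2, 30, 300))
    print(scores.shape)  # torch.Size([2])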
3.3 BERT language representation model

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is a language representation model. BERT usage involves two steps: pre-training and fine-tuning. During pre-training, the model is trained on a large collection of unlabeled text on a language modeling task. Fine-tuning BERT on a downstream task involves extending the model with additional layers tailored to the task, initializing the model with the pre-trained parameters, and then training the extended model with labeled data from the task. The extended model might consist of just a single output layer. Such models have been shown capable of achieving state-of-the-art accuracy for a wide range of tasks, such as question answering, machine translation, summarization and language inference.

Several pre-trained BERT models are publicly available, including the following ones that we used in our experiments:

• BERT-Base Uncased: 12 layers, hidden size of 768 and a total of 110M parameters;

• BERT-Large Uncased: 24 layers, hidden size of 1024 and a total of 340M parameters.
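For answer sentence selection, fine-tuning amounts to sentence-pair classification over question/candidate pairs. The sketch below shows one way to set this up with the HuggingFace transformers library; the library choice is an assumption on our part, and the optimizer, batching and hyper-parameters are omitted, so this is indicative of the setup rather than the training script used for the experiments reported here.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    question = "How much cholesterol is there in an ounce of bacon?"
    candidate = "Each ounce of bacon contains 30mg of cholesterol."

    # The question and the candidate sentence form a single [CLS] q [SEP] s [SEP] input.
    inputs = tokenizer(question, candidate, return_tensors="pt", truncation=True)
    labels = torch.tensor([1])  # 1 = the sentence answers the question

    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()  # one training step (optimizer update omitted)

    # At test time, the positive-class probability ranks the candidates of a question.
    score = outputs.logits.softmax(dim=-1)[0, 1].item()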
4 Datasets

We tested the models on two datasets: SelQA and WikiQA. The first one is the one used in the Fujitsu AI-NLP Challenge, while the second one is a commonly used dataset for open-domain Question Answering. A more detailed description follows.

4.1 SelQA

The SelQA dataset (Jurczyk et al., 2016) was specifically created to be challenging for question answering systems, in particular by explicitly reducing word co-occurrences between questions and answers. Questions with associated long sentence answers were generated through crowd-sourcing from articles drawn from the ten most prevalent topics in the English Wikipedia.

The dataset consists of a total of 486 articles that were randomly sampled from the topics of: Arts, Country, Food, Historical Events, Movies, Music, Science, Sports, Travel, TV. The original data was preprocessed into smaller chunks, resulting in 8,481 sections, 113,709 sentences and 2,810,228 tokens.

For each section, a question that can be answered in that same section by one or more sentences was generated by human annotators. The corresponding sentence or sentences that answer the question were selected. To add some noise, annotators were also asked to create another set of questions from the same selected sections excluding the original sentences previously selected as answers. Then all questions were paraphrased using different terms, in order to ensure that QA algorithms would be evaluated on their reading comprehension ability rather than on statistical measures like counting word co-occurrences. Lastly, if ambiguous questions were found, they were rephrased again by a human annotator.

4.2 WikiQA

The WikiQA dataset (Yang et al., 2015) consists of 3047 questions sampled from Bing query logs from the period of May 1st, 2010 to July 31st, 2011. Each question is associated with sentences taken from the Wikipedia page assumed to be the topic of the question based on the user clicks. In order to eliminate answer sentence biases caused by keyword matching, the sentences were taken from the summary of the selected page.

The WikiQA dataset also contains questions for which there are no correct sentences, to enable researchers to work on answer triggering.

This dataset has the drawback of being smaller compared to SelQA. Because of this, a model is more likely to overfit the training set. To avoid this problem we added some strong regularization to the models.

5 Experiments

5.0.1 GloVe, ELMo and FastText

We carried out some preliminary experiments on the SelQA dataset, in order to determine which embeddings would work best with the CACN. We tested three types of embeddings: GloVe (size 300), ELMo (Che et al., 2018) (size 1024) and FastText (Joulin et al., 2016) (size 300). With ELMo the model achieved results comparable to GloVe, but the training time was almost twice as long.

Table 2: Results for CACN on SelQA with various embeddings.
  Model      Dev MRR   Test MRR
  ELMo       91.09%    90.00%
  FastText   89.47%    88.43%
  GloVe      91.37%    90.61%

5.1 SelQA results

The logistic regression classifier obtains a score of 83.36%, which is 7 points lower than CACN, not bad considering the simplicity of the model. Nevertheless this confirms that a simple word matching method is not competitive with more sophisticated methods on SelQA.

CACN was the best performing model in the Fujitsu AI NLP Challenge 2018, with an MRR of 90.61%.

After the introduction of BERT, we decided to compare CACN with several versions of BERT, both alone and in combination with CACN. We tried a few variant approaches. First, we fine-tuned a fully connected layer on top of BERT, leaving its parameters frozen, on the SelQA training set. This model achieved 91.17, a marginal improvement over CACN.

We then explored adding different networks on top of the BERT architecture. We added a full CACN on top of either the BERT-Base or the BERT-Large model, with no improvement and even a drop with BERT-Large. Also in this case we froze the parameters of the BERT model. Since these experiments did not provide improvements, we did not try to train the entire model.

The best results were achieved by fine-tuning the BERT model on the SelQA dataset with a simple feed-forward layer, which achieved an impressive improvement of about 5 points, to an MRR score of 95.29%. Fine-tuning required about 4 hours on a server with an Nvidia P100 GPU.

The results of all our experiments on SelQA are summarized in Table 3.

Table 3: Results on SelQA with various models.
  Model                  MRR
  LR Classifier          83.36
  CACN GloVe             90.61
  BERT-Base + FCN        91.17
  BERT-Base + CACN       91.11
  BERT-Large + CACN      89.97
  BERT-Base Fine-tuned   95.29

5.2 WikiQA results

In the experiments with CACN on WikiQA, we removed from the training set the questions with no correct answer, but left the test set unchanged, so that the results are comparable with those in the literature. This was done to preserve a structure similar to the SelQA dataset, which contains at least one correct answer for each question. This significantly reduced the number of training examples but, despite this, the MRR score of the CACN model improved.
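The filtering of unanswerable questions from the training split can be done with a simple group-by over the question identifier, as in the sketch below; the data layout (tuples of question id, question, sentence, label) is assumed for illustration and does not reflect the exact preprocessing code used for the experiments.

    from collections import defaultdict

    def drop_unanswered(pairs):
        """Remove training questions that have no positive candidate.
        `pairs` is a list of (question_id, question, sentence, label) tuples."""
        has_answer = defaultdict(bool)
        for qid, _, _, label in pairs:
            has_answer[qid] |= bool(label)
        return [p for p in pairs if has_answer[p[0]]]

    # Applied to the training split only; dev and test are left unchanged
    # so that the scores stay comparable with the literature.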
Also in this case we kept the word embeddings fixed while training the CACN. We also added dropout and normalization to regularize the model, which helped it to better learn from the training set.

We then fine-tuned BERT on the WikiQA training set, performing full updates to the model, achieving again a significant improvement, to a top score of 87.53% MRR.

From the current leaderboard on the WikiQA dataset (https://paperswithcode.com/sota/question-answering-on-wikiqa), we extracted the top 5 entries and added the results with CACN and BERT-Base fine-tuned, as reported in Table 4.

Table 4: Experimental results on WikiQA.
  Model                         MRR       Year
  BERT-Base Fine-tuned          87.53%    2019
  Comp-Clip + LM + LC           78.40%    2019
  RE2                           76.18%    2019
  HyperQA (Tay et al., 2017)    72.70%    2017
  PWIM                          72.34%    2016
  CACN (Gravina et al., 2018)   72.12%    2018

5.3 Cross-domain experiments

In this section we report the results of our cross-domain experiments. The aim was to evaluate how well the CACN model performs in a context different from the one in which it was trained. In other words, we test the transfer learning ability of the model to a different domain.

The experiments consisted in training a model on one dataset and then testing it on the other one. We report in Table 5 the results of these experiments.

Table 5: Cross-domain experiments.
  Trainset   Testset   MRR       Transfer score
  SelQA      SelQA     90.61%
  SelQA      WikiQA    59.94%    82.95%
  WikiQA     WikiQA    72.12%
  WikiQA     SelQA     69.45%    76.64%

The drop in MRR score is small when training on WikiQA and testing on SelQA, and larger in the other direction. This is possibly due to the size of the datasets: when training on WikiQA and testing on SelQA, in fact, we are training on only 8,000 pairs and testing on more than 80,000 question/answer pairs.

However, the transfer score, computed as the ratio between the out-of-domain MRR and the in-domain MRR on the same test set, is fairly good: about 83% in the SelQA to WikiQA case and over 76% in the other direction.
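The transfer score can be reproduced directly from the MRR values in Table 5, as in the small sketch below; minor differences with respect to the reported percentages are due to rounding of the MRR values.

    def transfer_score(cross_domain_mrr, in_domain_mrr):
        # Ratio between the MRR obtained by the out-of-domain model and the MRR
        # obtained by the in-domain model on the same test set.
        return 100.0 * cross_domain_mrr / in_domain_mrr

    print(transfer_score(59.94, 72.12))  # SelQA -> WikiQA, about 83%
    print(transfer_score(69.45, 90.61))  # WikiQA -> SelQA, about 77%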
6 Conclusions

We compared the Cross-attentive Convolutional Network and several BERT-based models on the task of Answer Sentence Selection on two datasets.

The experiments show that a BERT model, fine-tuned on an Answer Sentence Selection dataset, significantly improves the state of the art, with a gain of 5 to 9 points of MRR score on SelQA and WikiQA respectively. As a drawback, this approach takes a considerable amount of time to train, even on GPUs.

The BERT-Base model without fine-tuning achieves almost the same accuracy as the CACN with GloVe embeddings, which uses a much smaller number of parameters. The CACN also requires less data to train. On the other hand, BERT is quite effective at leveraging the knowledge collected from large amounts of unlabeled text, and at transferring it across tasks.

We also evaluated the transfer learning abilities of CACN. BERT is a model that has been pre-trained on a large corpus, while CACN leverages the GloVe embeddings as a starting point for training. We exploited the WikiQA and SelQA datasets in a cross-domain experiment using CACN and found that the model maintains a good score across domains, with a transfer score of about 83% from SelQA to WikiQA.

We confirmed that the SelQA dataset is not easily solvable using simple word-occurrence methods like a logistic regression classifier on word count features.

BERT models confirmed their superiority over previous state-of-the-art models for the task of Answer Sentence Selection. This was to be expected, since they perform quite well also on the more complex task of Reading Comprehension, which requires not only selecting a sentence but also extracting the answer from that sentence.

7 Acknowledgements

The experiments were carried out on a Dell server with 4 Nvidia Tesla P100 GPUs, partly funded by the University of Pisa under grant Grandi Attrezzature 2016.

References

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium, October. Association for Computational Linguistics.

Nick Craswell. 2009. Mean Reciprocal Rank. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of Database Systems. Springer US, Boston, MA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. arXiv preprint arXiv:1508.01585.

Alessio Gravina, Federico Rossetto, Silvia Severini, and Giuseppe Attardi. 2018. Cross attention for selection-based question answering. In NL4AI@AI*IA, pages 53–62.

Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 1011–1019. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Tomasz Jurczyk, Michael Zhai, and Jinho D. Choi. 2016. SelQA: A new benchmark for selection-based question answering. In Proceedings of the 28th International Conference on Tools with Artificial Intelligence, ICTAI '16, pages 820–827.

Jinfeng Rao, Hua He, and Jimmy Lin. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16), pages 1913–1916. ACM.

Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. CoRR, abs/1511.04108.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2017. Enabling efficient question answer retrieval via hyperbolic neural networks. CoRR, abs/1707.07847.

Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016. Sentence similarity learning by lexical decomposition and composition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1340–1349. The COLING 2016 Organizing Committee.

Yi Yang, Wen-tau Yih, and Chris Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, September.

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 858–867.

Wenpeng Yin and Hinrich Schütze. 2017. Attentive convolution. CoRR.