A Comparative Study of Models for Answer Sentence Selection

Alessio Gravina, Federico Rossetto, Silvia Severini, Giuseppe Attardi
Università di Pisa
gravina.alessio@gmail.com, fedingo@gmail.com, sissisev@gmail.com, attardi@di.unipi.it

All authors contributed equally to this manuscript. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Answer Sentence Selection is one of the steps typically involved in Question Answering. Question Answering is considered a hard task for natural language processing systems, since full solutions would require both natural language understanding and inference abilities. In this paper, we explore how the state of the art in answer selection has improved recently, comparing two of the best proposed models for tackling the problem: the Cross-attentive Convolutional Network and the BERT model. The experiments are carried out on two datasets, WikiQA and SelQA, both created for and used in open-domain question answering challenges. We also report on cross-domain experiments with the two datasets.

1 Introduction

Answer Sentence Selection is an important subtask of Question Answering that aims at selecting the sentence containing the correct answer to a given question among a set of candidate sentences. Table 1 shows an example of a question and a list of its candidate answers, taken from the SelQA dataset (Jurczyk et al., 2016). The last column contains a binary value, representing whether the sentence contains the answer or not.

Table 1: Sample question/candidate answers.
  Question: How much cholesterol is there in an ounce of bacon?
  One rasher of cooked streaky bacon contains 5.4g of fat, and 4.4g of protein.   0
  Four pieces of bacon can also contain up to 800mg of sodium.                    0
  The fat and protein content varies depending on the cut and cooking method.     0
  Each ounce of bacon contains 30mg of cholesterol.                               1

Answer extraction involves natural language processing techniques for interpreting candidate sentences and establishing whether they relate to questions and contain an answer. More sophisticated methods of Answer Sentence Selection that go beyond Information Retrieval approaches involve for example tree edit models (Heilman and Smith, 2010) and semantic distances based on word embeddings (Wang et al., 2016).

Recently, Deep Neural Networks have also been applied to this task (Rao et al., 2016), providing performance improvements over previous techniques. The most common approaches exploit either recurrent or convolutional neural networks. These models are good at capturing contextual information from sentences, making them a good fit for the problem of answer sentence selection.

Research on this problem has benefited in the last few years from the development of better datasets for training systems on this task. These datasets include WikiQA (Yang et al., 2015) and SelQA (Jurczyk et al., 2016). The latter is notable for its larger size, which reaches more than 60,000 sentence-question pairs. This allows for the creation of deeper and more complex models, with less risk of overfitting.

The state-of-the-art model on the SelQA dataset (Jurczyk et al., 2016), up to 2018, was the Cross-attentive Convolutional Network (Gravina et al., 2018), with a score of 0.906 MRR (Craswell, 2009).

In this paper we present further experiments with the Cross-attentive Convolutional Network model as well as experiments that exploit the BERT language model by Devlin et al. (2018). In the following sections we survey relevant literature on the topic, describe the datasets used in our experiments and present the models tested. Finally, we describe the experiments conducted with these models and report the results achieved.
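All results in this paper are reported in terms of MRR (Mean Reciprocal Rank, Craswell, 2009): for each question, the candidate sentences are ranked by the model score and the reciprocal rank of the first correct sentence is averaged over all questions. As a reference, the following minimal Python sketch computes the measure; it is illustrative and not code from the paper, and the function and variable names are ours.

    def mean_reciprocal_rank(questions):
        """Each element of `questions` is a list of (score, is_correct) pairs,
        one per candidate sentence of that question."""
        total = 0.0
        for candidates in questions:
            # Rank candidates by descending model score.
            ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
            rr = 0.0
            for rank, (_, is_correct) in enumerate(ranked, start=1):
                if is_correct:
                    rr = 1.0 / rank  # reciprocal rank of the first correct sentence
                    break
            total += rr
        return total / len(questions)

    # Example: the correct sentence is ranked 2nd for the first question
    # and 1st for the second, giving MRR = (1/2 + 1) / 2 = 0.75.
    print(mean_reciprocal_rank([
        [(0.9, 0), (0.7, 1), (0.1, 0)],
        [(0.8, 1), (0.3, 0)],
    ]))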
2 Related work

We present a brief survey of the most recent approaches for answer selection in question answering.

Tan et al. (2015) present four Deep Learning models for answer selection based on biLSTM (bidirectional LSTM) and CNN (Convolutional Neural Network), with different complexities and capabilities. The basic model, called QA-LSTM, implements two similar flows, one for the question and one for the answer. The biLSTM builds a representation of the question/answer pair that is passed to a max or average pooling layer. The two flows are then merged with a cosine similarity matching that expresses how close question and answer are.

A more complex solution, called QA-LSTM/CNN, uses a similar model, which replaces the pooling layer with a CNN. The output of the biLSTM is sent to a convolution filter, in order to give a more complete representation of questions and answers. This filter is followed by a 1-max pooling layer and a fully connected layer. Finally, the paper presents the most complex models, QA-LSTM with attention and QA-LSTM/CNN with attention, which extend the previous models with the addition of a simple attention mechanism between question and answer, aiming to better identify the best candidate answer to the question. The mechanism consists in multiplying the biLSTM hidden units of the answers with the output computed from the question pooling layer. These models are tested on the InsuranceQA (Feng et al., 2015) and TREC-QA (Yao et al., 2013) datasets, achieving quite good performance.

The HyperQA (Tay et al., 2017) model uses a pairwise ranking objective to represent the relationship between question and answer embeddings in a hyperbolic space instead of a Euclidean space. This empowers the model with a self-organizing ability and enables automatic discovery of latent hierarchies while learning embeddings of questions and answers.

Wang et al. (2016) present a model that takes into account similarities and dissimilarities between sentences by decomposing and composing lexical semantics over sentences. In particular, the model represents each word as a vector and calculates a semantic matching vector for each word based on all words in the other sentence. Each word vector is then decomposed into a similar and a dissimilar component, based on the semantic matching vector. Afterwards, a CNN model is used to capture features by composing these parts, and a similarity score is estimated over the composed feature vectors to predict which sentence is the answer to the question.

3 Models

We describe here the models used in our experiments.

3.1 Simple Logistic Regression Classifier

Jurczyk et al. (2016) state that the SelQA dataset was created through a process that tried to reduce the number of co-occurring words, so that simple word matching methods would be less effective. To evaluate whether this aim was indeed achieved, we built a simple logistic regression classifier using as features the sentence and question length, the number of co-occurring words and the idf coefficients of the word co-occurrences.
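As an illustration of this kind of baseline, the sketch below builds the features just listed and trains a scikit-learn classifier. The tokenization, the idf source and the feature details are simplifications of our own, not the authors' exact implementation.

    import math
    from collections import Counter
    from sklearn.linear_model import LogisticRegression

    def idf_table(sentences):
        # Document frequency computed over the candidate sentences.
        df = Counter(w for s in sentences for w in set(s.lower().split()))
        n = len(sentences)
        return {w: math.log(n / df[w]) for w in df}

    def features(question, sentence, idf):
        q = question.lower().split()
        s = sentence.lower().split()
        overlap = set(q) & set(s)
        return [
            len(q),                                 # question length
            len(s),                                 # sentence length
            len(overlap),                           # number of co-occurring words
            sum(idf.get(w, 0.0) for w in overlap),  # idf weight of the co-occurrences
        ]

    def train_baseline(pairs):
        # pairs: list of (question, candidate_sentence, label) triples
        idf = idf_table([s for _, s, _ in pairs])
        X = [features(q, s, idf) for q, s, _ in pairs]
        y = [label for _, _, label in pairs]
        return LogisticRegression().fit(X, y), idf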
3.2 Cross-attentive Convolutional Network

The Cross-attentive Convolutional Network (CACN) is a model designed for the task of Answer Sentence Selection and in 2018 had achieved state-of-the-art performance (Gravina et al., 2018). The model relies on a Convolutional Neural Network with a double mechanism of attention between questions and answers. The model is inspired by the light attentive mechanism proposed by Yin and Schütze (2017), which it improves by applying it in both directions to question and answer pairs.

The CACN model achieved the top score in the "Fujitsu AI NLP Challenge 2018" (https://openinnovationgateway.com/ai-nlp-challenge/), which used the SelQA dataset.
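For the full architecture we refer to Gravina et al. (2018). Purely to illustrate the idea of attention applied in both directions over convolutional representations, the following PyTorch sketch re-weights each sentence by its similarity to the other one before pooling and scoring; the dimensions, pooling and scoring choices are our own simplifications and this is not the published CACN model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttentiveEncoder(nn.Module):
        """Illustrative bidirectional (question <-> answer) attention over CNN features."""
        def __init__(self, emb_dim=300, hidden=128, kernel=3):
            super().__init__()
            self.conv = nn.Conv1d(emb_dim, hidden, kernel, padding=kernel // 2)

        def forward(self, q_emb, a_emb):
            # q_emb: (batch, q_len, emb_dim); a_emb: (batch, a_len, emb_dim)
            q = F.relu(self.conv(q_emb.transpose(1, 2))).transpose(1, 2)  # (batch, q_len, hidden)
            a = F.relu(self.conv(a_emb.transpose(1, 2))).transpose(1, 2)  # (batch, a_len, hidden)

            # Similarity between every question position and every answer position.
            sim = torch.bmm(q, a.transpose(1, 2))                # (batch, q_len, a_len)

            # Attention in both directions: each side is re-weighted by how much
            # it matches the other sentence, then max-pooled into one vector.
            q_weights = F.softmax(sim.max(dim=2).values, dim=1)  # (batch, q_len)
            a_weights = F.softmax(sim.max(dim=1).values, dim=1)  # (batch, a_len)
            q_vec = (q * q_weights.unsqueeze(2)).max(dim=1).values
            a_vec = (a * a_weights.unsqueeze(2)).max(dim=1).values

            # Score the question/answer pair with cosine similarity.
            return F.cosine_similarity(q_vec, a_vec, dim=1)

    # Example with random embeddings for a batch of 2 question/answer pairs.
    model = CrossAttentiveEncoder()
    scores = model(torch.randn(2, 12, 300), torch.randn(2, 30, 300))
    print(scores.shape)  # torch.Size([2])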
3.3 BERT language representation model

BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) is a language representation model. BERT usage involves two steps: pre-training and fine-tuning. During pre-training, the model is trained on a large collection of unlabeled text on a language modeling task. Fine-tuning BERT on a downstream task involves extending the model with additional layers tailored to the task, initializing the model with the pre-trained parameters, and then training the extended model with labeled data from the task. The extended model might consist of just a single output layer. Such models have been shown capable of achieving state-of-the-art accuracy for a wide range of tasks, such as question answering, machine translation, summarization and language inference.

Several pre-trained BERT models are publicly available, including the following ones that we used in our experiments:

• BERT-Base Uncased: 12 layers, hidden size of 768 and a total of 110M parameters;

• BERT-Large Uncased: 24 layers, hidden size of 1024 and a total of 340M parameters.
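For answer sentence selection, fine-tuning amounts to sentence-pair classification over question/candidate pairs. The sketch below shows one way to set this up with the HuggingFace transformers library; the library choice is an assumption on our part, and the optimizer, batching and hyper-parameters are omitted, so this is indicative of the setup rather than the training script used for the experiments reported here.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    question = "How much cholesterol is there in an ounce of bacon?"
    candidate = "Each ounce of bacon contains 30mg of cholesterol."

    # The question and the candidate sentence form a single [CLS] q [SEP] s [SEP] input.
    inputs = tokenizer(question, candidate, return_tensors="pt", truncation=True)
    labels = torch.tensor([1])  # 1 = the sentence answers the question

    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()  # one training step (optimizer update omitted)

    # At test time, the positive-class probability ranks the candidates of a question.
    score = outputs.logits.softmax(dim=-1)[0, 1].item()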
4 Datasets

We tested the models on two datasets: SelQA and WikiQA. The first one is the one used in the Fujitsu AI-NLP Challenge, while the second one is a commonly used dataset for open-domain Question Answering. A more detailed description follows.

4.1 SelQA

The SelQA dataset (Jurczyk et al., 2016) was specifically created to be challenging for question answering systems, in particular by explicitly reducing word co-occurrences between questions and answers. Questions with associated long sentence answers were generated through crowd-sourcing from articles drawn from the ten most prevalent topics in the English Wikipedia.

The dataset consists of a total of 486 articles that were randomly sampled from the topics of: Arts, Country, Food, Historical Events, Movies, Music, Science, Sports, Travel, TV. The original data was preprocessed into smaller chunks, resulting in 8,481 sections, 113,709 sentences and 2,810,228 tokens.

For each section, a question that can be answered in that same section by one or more sentences was generated by human annotators. The corresponding sentence or sentences that answer the question were selected. To add some noise, annotators were also asked to create another set of questions from the same selected sections excluding the original sentences previously selected as answers. Then all questions were paraphrased using different terms, in order to ensure that QA algorithms would be evaluated on their reading comprehension ability rather than on statistical measures like counting word co-occurrences. Lastly, if ambiguous questions were found, they were rephrased again by a human annotator.

4.2 WikiQA

The WikiQA dataset (Yang et al., 2015) consists of 3047 questions sampled from Bing query logs from the period of May 1st, 2010 to July 31st, 2011. Each question is associated with sentences taken from the Wikipedia page assumed to be the topic of the question based on the user clicks. In order to eliminate answer sentence biases caused by keyword matching, the sentences were taken from the summary of the selected page.

The WikiQA dataset also contains questions for which there are no correct sentences, to enable researchers to work on answer triggering.

This dataset has the drawback of being smaller compared to SelQA. Because of this, a model is more likely to overfit the training set. To avoid this problem we added some strong regularization to the models.

5 Experiments

5.0.1 GloVe, ELMo and FastText

We carried out some preliminary experiments on the SelQA dataset, in order to determine which embeddings would work best with the CACN. We tested three types of embeddings: GloVe (size 300), ELMo (Che et al., 2018) (size 1024) and FastText (Joulin et al., 2016) (size 300). With ELMo the model achieved results comparable to GloVe, but the training time was almost twice as long.

Table 2: Results for CACN on SelQA with various embeddings.
  Model      Dev MRR   Test MRR
  ELMo       91.09%    90.00%
  FastText   89.47%    88.43%
  GloVe      91.37%    90.61%

5.1 SelQA results

The logistic regression classifier obtains a score of 83.36%, which is 7 points lower than CACN, not bad considering the simplicity of the model. Nevertheless this confirms that a simple word matching method is not competitive with more sophisticated methods on SelQA.

CACN was the best performing model in the Fujitsu AI NLP Challenge 2018, with an MRR of 90.61%.

After the introduction of BERT, we decided to compare CACN with several versions of BERT, both alone and in combination with CACN. We tried a few variant approaches. First, we fine-tuned a fully connected layer on top of BERT, leaving its parameters frozen, on the SelQA training set. This model achieved 91.17, a marginal improvement over CACN.

We then explored adding different networks on top of the BERT architecture. We added a full CACN on top of either the BERT-Base or the BERT-Large model, with no improvement and even a drop with BERT-Large. Also in this case we froze the parameters of the BERT model. Since these experiments did not provide improvements, we did not try to train the entire model.

The best results were achieved by fine-tuning the BERT model on the SelQA dataset with a simple feed-forward layer, which achieved an impressive improvement of about 5 points, to an MRR score of 95.29%. Fine-tuning required about 4 hours on a server with an Nvidia P100 GPU.

The results of all our experiments on SelQA are summarized in Table 3.

Table 3: Results on SelQA with various models.
  Model                  MRR
  LR Classifier          83.36
  CACN GloVe             90.61
  BERT-Base + FCN        91.17
  BERT-Base + CACN       91.11
  BERT-Large + CACN      89.97
  BERT-Base Fine-tuned   95.29

5.2 WikiQA results

In the experiments with CACN on WikiQA, we removed from the training set the questions with no correct answer, but left the test set unchanged, so that the results are comparable with those in the literature. This was done to preserve a structure similar to the SelQA dataset, which contains at least one correct answer for each question. This significantly reduced the number of training examples but, despite this, the MRR score of the CACN model improved.
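The filtering of unanswerable questions from the training split can be done with a simple group-by over the question identifier, as in the sketch below; the data layout (tuples of question id, question, sentence, label) is assumed for illustration and does not reflect the exact preprocessing code used for the experiments.

    from collections import defaultdict

    def drop_unanswered(pairs):
        """Remove training questions that have no positive candidate.
        `pairs` is a list of (question_id, question, sentence, label) tuples."""
        has_answer = defaultdict(bool)
        for qid, _, _, label in pairs:
            has_answer[qid] |= bool(label)
        return [p for p in pairs if has_answer[p[0]]]

    # Applied to the training split only; dev and test are left unchanged
    # so that the scores stay comparable with the literature.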
Also in this case we kept the word embeddings fixed while training the CACN. We also added dropout and normalization to regularize the model, which helped it to better learn from the training set.

We then fine-tuned BERT on the WikiQA training set, performing full updates to the model, achieving again a significant improvement, to a top score of 87.53% MRR.

From the current leaderboard on the WikiQA dataset (https://paperswithcode.com/sota/question-answering-on-wikiqa), we extracted the top 5 entries and added the results with CACN and BERT-Base fine-tuned, as reported in Table 4.

Table 4: Experimental results on WikiQA.
  Model                         MRR       Year
  BERT-Base Fine-tuned          87.53%    2019
  Comp-Clip + LM + LC           78.40%    2019
  RE2                           76.18%    2019
  HyperQA (Tay et al., 2017)    72.70%    2017
  PWIM                          72.34%    2016
  CACN (Gravina et al., 2018)   72.12%    2018

5.3 Cross-domain experiments

In this section we report the results of our cross-domain experiments. The aim was to evaluate how well the CACN model performs in a context different from the one in which it was trained. In other words, we test the transfer learning ability of the model to a different domain.

The experiments consisted in training a model on one dataset and then testing it on the other one. We report in Table 5 the results of these experiments.

Table 5: Cross-domain experiments.
  Trainset   Testset   MRR       Transfer score
  SelQA      SelQA     90.61%
  SelQA      WikiQA    59.94%    82.95%
  WikiQA     WikiQA    72.12%
  WikiQA     SelQA     69.45%    76.64%

The drop in MRR score is small when training on WikiQA and testing on SelQA, and larger in the other direction. This is possibly due to the size of the datasets: when training on WikiQA and testing on SelQA, in fact, we are training on only 8,000 pairs and testing on more than 80,000 question/answer pairs.

However, the transfer score, computed as the ratio between the out-of-domain MRR and the in-domain MRR on the same test set, is fairly good: about 83% in the SelQA to WikiQA case and over 76% in the other direction.
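The transfer score can be reproduced directly from the MRR values in Table 5, as in the small sketch below; minor differences with respect to the reported percentages are due to rounding of the MRR values.

    def transfer_score(cross_domain_mrr, in_domain_mrr):
        # Ratio between the MRR obtained by the out-of-domain model and the MRR
        # obtained by the in-domain model on the same test set.
        return 100.0 * cross_domain_mrr / in_domain_mrr

    print(transfer_score(59.94, 72.12))  # SelQA -> WikiQA, about 83%
    print(transfer_score(69.45, 90.61))  # WikiQA -> SelQA, about 77%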
6 Conclusions

We compared the Cross-attentive Convolutional Network and several BERT-based models on the task of Answer Sentence Selection on two datasets.

The experiments show that a BERT model, fine-tuned on an Answer Sentence Selection dataset, significantly improves the state of the art, with a gain of 5 to 9 points of MRR score on SelQA and WikiQA respectively. As a drawback, this approach takes a considerable amount of time to train, even on GPUs.

The BERT-Base model without fine-tuning achieves almost the same accuracy as the CACN with GloVe embeddings, which uses a much smaller number of parameters. The CACN also requires less data to train. On the other hand, BERT is quite effective at leveraging the knowledge collected from large amounts of unlabeled text, and at transferring it across tasks.

We also evaluated the transfer learning abilities of CACN. BERT is a model that has been pre-trained on a large corpus, while CACN leverages the GloVe embeddings as a starting point for training. We exploited the WikiQA and SelQA datasets in a cross-domain experiment using CACN and found that the model maintains a good score across domains, with a transfer score of about 83% from SelQA to WikiQA.

We confirmed that the SelQA dataset is not easily solvable using simple word-occurrence methods like a logistic regression classifier on word count features.

BERT models confirmed their superiority over previous state-of-the-art models for the task of Answer Sentence Selection. This was to be expected, since they perform quite well also on the more complex task of Reading Comprehension, which requires not only selecting a sentence but also extracting the answer from that sentence.

7 Acknowledgements

The experiments were carried out on a Dell server with 4 Nvidia Tesla P100 GPUs, partly funded by the University of Pisa under grant Grandi Attrezzature 2016.

References

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium, October. Association for Computational Linguistics.

Nick Craswell. 2009. Mean Reciprocal Rank. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of Database Systems. Springer US, Boston, MA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. arXiv preprint arXiv:1508.01585.

Alessio Gravina, Federico Rossetto, Silvia Severini, and Giuseppe Attardi. 2018. Cross attention for selection-based question answering. In NL4AI@AI*IA, pages 53–62.

Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 1011–1019. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Tomasz Jurczyk, Michael Zhai, and Jinho D. Choi. 2016. SelQA: A new benchmark for selection-based question answering. In Proceedings of the 28th International Conference on Tools with Artificial Intelligence, ICTAI '16, pages 820–827.

Jinfeng Rao, Hua He, and Jimmy Lin. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16), pages 1913–1916. ACM.

Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. CoRR, abs/1511.04108.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2017. Enabling efficient question answer retrieval via hyperbolic neural networks. CoRR, abs/1707.07847.

Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016. Sentence similarity learning by lexical decomposition and composition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1340–1349. The COLING 2016 Organizing Committee.

Yi Yang, Wen-tau Yih, and Chris Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, September.

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 858–867.

Wenpeng Yin and Hinrich Schütze. 2017. Attentive convolution. CoRR.