<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Post-processing BioBERT And Using Voting Methods for Biomedical Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Margarida M. Campos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco M. Couto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <addr-line>1749-016 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
<p>There have been remarkable advances in the field of Biomedical Question Answering (QA) through the application of Transfer Learning to overcome the scarcity of domain-specific corpora. Fine-tuning BioBERT on larger general purpose datasets prior to fine-tuning on a specific biomedical task has proven to significantly improve performance. There are, however, many post-processing techniques for the outputs of fine-tuned models still to be explored. In this paper we present the QA system developed by our team - LASIGE_ULISBOA - for the BioASQ 9th challenge, Task B, Phase B. Using the outputs from fine-tuning BioBERT on both the Multi-Genre Natural Language Inference (MNLI) and the Stanford Question Answering Dataset (SQuAD) datasets, we compare different post-processing strategies for prediction retrieval for Yes/No, Factoid, and List type questions. We show that applying Softmax in the proper location of the answer retrieval pipeline leads to better performance and also increases the explainability of a prediction's confidence level in QA. We also present a method for applying voting system algorithms to choose candidates for List type answers, show how they can increase the MacroF1 score, and how one can use them to optimize for either Precision or Recall. The obtained results, averaged over batches, were 0.798 MacroF1 for Yes/No, 0.478 MRR for Factoid, and 0.466 F1 for List questions. The software used is available in an open access repository.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• Yes/No - binary answer
• Factoid - answer is a string
• List - answer is a list of strings, each identifying a different entity
Our approach is concerned only with the retrieval of exact answers, and therefore it was not
designed to retrieve answers to Summary type questions or ideal answers (paragraph-sized
summaries).</p>
      <p>For factoid and list questions, predictions are always substrings of the provided passages
(snippets), making the success of the previous task of snippet retrieval paramount to obtain
good results.</p>
      <p>
Although the most significant advances in the area have been made by fine-tuning on different
and bigger datasets or by developing new and complex transformer architectures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we
aim to show the importance of post-processing and the use of proper final layers for each task.
      </p>
<p>Considering that a tractable and meaningful measure of the level of confidence of a prediction
is as important as the prediction itself, we also present a proposal for such a confidence level for
Yes/No and Factoid questions.</p>
<p>All the software used can be found at https://github.com/lasigeBioTM/BioASQ9B.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. BioBERT</title>
        <p>
Our baseline approach was inspired by the work done by the DMIS Laboratory (Korea University)
for the previous edition of the BioASQ challenge [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          The base model for our system is BioBERT, a BERT[
          <xref ref-type="bibr" rid="ref4">4</xref>
] model, pre-trained using PubMed
abstracts and PubMed Central (PMC) articles. BioBERT has obtained state-of-the-art results in
several biomedical NLP tasks, including QA [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sequential Transfer Learning</title>
        <p>
Substantial advances have been made in Natural Language Processing (NLP), especially in
domain-specific tasks, with the use of Transfer Learning - using the model learnt on one task
for a subsequent task [
          <xref ref-type="bibr" rid="ref5">5</xref>
]. The use of extra corpora for training is particularly important given
the reduced size of the BioASQ dataset. Research has found that fine-tuning on the SQuAD
dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] improves the performance of QA systems where the correct answer is a segment
of a provided passage. Another dataset that has proven important is the Multi-Genre Natural
Language Inference (MNLI)[
          <xref ref-type="bibr" rid="ref7">7</xref>
], which is widely used to improve performance on Yes/No questions,
but has also proven to be useful for factoid and list question types, as was shown by the DMIS
Laboratory (DMIS)[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Data &amp; Pre-Processing</title>
<p>MNLI Training data consists of pairs of sentences, each classified with a label from {entailment,
neutral, contradiction}. The cardinality of each label set can be found in Table 1. Intuitively
there is a mapping (MNLI ↔ BioASQ): entailment ↔ Yes and contradiction ↔ No.
This could suggest that training without the neutral pairs could improve performance;
however, our experiments showed that our system's performance did not benefit from this strategy,
hence the entire dataset was used.</p>
<p>SQuAD Training data consists of {question, passage} pairs together with the correct answer
and its starting position. For training the QA model, the end position was identified and added as
input.</p>
        <p>
          BioASQ Training of the systems was done using BioASQ 8B training data, and evaluation was
done on BioASQ 8B test batches. In Table 2 we can see the number of questions in the BioASQ
training data, and in Figure 1 we can see the distribution of the number of snippets associated
with a question. Examples of questions can be found in Table 3, and the number of train and test
questions for each type of question can be found in Table 4. It is important to mention that
only 177 (20%) of the Yes/No questions have the label No, making the classification extremely
imbalanced. To handle this, undersampling[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] of the Yes class was performed, resulting in an
even smaller set of 354 unique questions. Oversampling the No class proved to be ineffective.
        </p>
<p>For list questions, each entity in the golden label was considered a correct answer for the
given {question, snippet} pair, i.e. a pair whose golden list contains $n$ entities will appear
as $n$ distinct input observations, each labeled with a different correct answer. A summary of
the different types of inputs can be seen in Table 5.</p>
        <p>Both factoid and list inputs were converted to the mentioned SQuAD format - containing
the answer’s start and end positions. Observations whose snippets did not contain the correct
answer were discarded.</p>
<p>As with all BERT inputs, {question, snippet} pairs are prepended with a [CLS] token
- for classification - and a separation token ([SEP]) is added between the two input texts, as
well as at the end of the input.</p>
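        <p>As an illustration, the following is a minimal Python sketch of how such a pair is encoded; the checkpoint name is an assumption (any BERT-family tokenizer behaves the same way):</p>
        <preformat>
from transformers import BertTokenizer

# assumed BioBERT checkpoint name; used here only for illustration
tokenizer = BertTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

question = "Is dizziness a side effect of rivaroxaban?"
snippet = "Common adverse reactions of rivaroxaban include dizziness and headache."

encoding = tokenizer(question, snippet)
# yields: [CLS] question tokens [SEP] snippet tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
        </preformat>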
<p>Additional biomedical datasets could have been curated for fine-tuning the system;
however, this was not done due to time constraints.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Fine Tuning</title>
        <p>
          For the fine-tuning of BioBERT the best performing sequences of training reported in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] were
used. For Yes/No questions the sequence is BioBERT-MNLI-BioASQ, while for factoid and list
type questions BioBERT-MNLI-SQuAD-BioASQ was used.
        </p>
        <p>
          For fine-tuning on the MNLI dataset we used a slightly altered version of the
BertForSequenceClassification model from the Transformers [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] library, which consists of adding a linear layer that
receives as input the hidden vector of the [CLS] token [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. To train on the binary classification
of BioASQ Yes/No questions, the following variants of the layers after the final 3-neuron
layer of BertForSequenceClassification were tested (a sketch of the best-performing variant is given after the list):
• one extra binary layer ([CLS]-3-2)
• a fully-connected 256 neuron layer followed by a binary one ([CLS]-3-256-2)
• a fully-connected 512 neuron layer followed by a binary one ([CLS]-3-512-2)
• replacing the MNLI 3 neuron layer with a binary one ([CLS]-2)
        </p>
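        <p>A minimal PyTorch sketch of the best-performing head ([CLS]-3-256-2) follows; the layer names and the choice of activation function are our assumptions, not the exact implementation:</p>
        <preformat>
import torch.nn as nn

class YesNoHead(nn.Module):
    """[CLS]-3-256-2: the 3-neuron layer kept from MNLI fine-tuning,
    followed by a fully-connected 256-neuron layer and a binary layer."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mnli = nn.Linear(hidden_size, 3)  # layer inherited from MNLI
        self.middle = nn.Linear(3, 256)
        self.binary = nn.Linear(256, 2)
        self.act = nn.Tanh()  # assumed activation

    def forward(self, cls_hidden):
        # cls_hidden: hidden vector of the [CLS] token, shape (batch, hidden_size)
        x = self.act(self.mnli(cls_hidden))
        x = self.act(self.middle(x))
        return self.binary(x)  # Yes/No logits
        </preformat>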
<p>For training on the SQuAD corpus, the final classification layers are removed and the
architecture of the BertForQuestionAnswering model from the Transformers library is used. A simplified
overview of the Input/Output of BertForQuestionAnswering can be seen in Figure 2. In QA the
input provided contains the start and end positions of the tokens representing the span of the
correct answer within the passage. Training is done by creating two new vectors - the start logits
and the end logits, of shape (sequence_length, 1) - that represent the likelihood of each token being the
start and the end of the answer, respectively.</p>
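        <p>The following sketch shows how the two logits vectors are obtained; the generic checkpoint name is a placeholder, since in our pipeline the weights come from the BioBERT-MNLI-SQuAD sequence:</p>
        <preformat>
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# placeholder checkpoint; any BERT-family QA model exposes the same outputs
model = BertForQuestionAnswering.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer("What drug causes dizziness?",
                   "Rivaroxaban may cause dizziness.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one logit per input token: likelihood of being the start/end of the answer
start_logits = outputs.start_logits  # shape (1, sequence_length)
end_logits = outputs.end_logits      # shape (1, sequence_length)
        </preformat>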
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Post-Processing and Output Aggregation</title>
<p>Given that the same question can have multiple snippets associated with it, leading to different
{question, snippet} pairs as input, a strategy is needed to combine the different outputs
into single predictions. Each type of question demands a different approach, hence they are
presented separately.</p>
<p>3.3.1. Yes/No
Let $y_s$ and $n_s$ represent the model's output probabilities that question $q$ has answer Yes and No,
respectively, given snippet $s$. The predicted answer is the one with the highest mean
probability over the $S$ snippets associated with question $q$:
$$P_{yes} = \frac{1}{S}\sum_{s=1}^{S} y_s, \qquad P_{no} = \frac{1}{S}\sum_{s=1}^{S} n_s$$
The larger of the two, $\max(P_{yes}, P_{no})$, represents the level of confidence
that the provided answer is correct.</p>
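        <p>A minimal sketch of this aggregation (the probability values are illustrative):</p>
        <preformat>
import numpy as np

# per-snippet output probabilities for one question (illustrative values)
yes_probs = np.array([0.91, 0.78, 0.85])  # y_s for each of the S snippets
no_probs = 1.0 - yes_probs                # n_s

p_yes, p_no = yes_probs.mean(), no_probs.mean()
answer = "yes" if p_yes >= p_no else "no"
confidence = max(p_yes, p_no)  # level of confidence in the prediction
        </preformat>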
<p>3.3.2. Factoid
The relevant outputs from the fine-tuned QA model are the start and end logits vectors. In
Figure 3 an output example from the BioASQ golden set from Batch 1 of task 8B can be seen.</p>
<p>Let $s_i$ and $e_i$ be the start and end logit values corresponding to token $i$, the $i$th token of the
$n$th snippet associated with question $q$, and let $preds_n$ and $scores_n$ be the lists of predictions and
associated confidence levels for the same input.</p>
        <p>In order to choose the best prediction for each input, one should find the span $(i, j)$ that
maximizes some combination of $s_i$ and $e_j$. Given that the logits are not normalized, merely using
the sum of start and end logits would result in an unfounded comparison between confidence
levels for predictions of different snippets.</p>
<p>To minimize this discrepancy, our approach for each input was implemented as follows (a sketch is given after the list):
1. Create an upper triangular matrix $M$ where $M_{i,j} = s_i + e_j$ (see Figure 4), for $j \ge i$,
guaranteeing the end does not precede the start
2. Choose the positions $i$ and $j$ that maximize $M_{i,j}$
3. If the expression resulting from the span from $i$ to $j$ satisfies the admission rules, append the
expression to $preds_n$ and $M_{i,j}$ to $scores_n$
4. Remove entry $M_{i,j}$ from $M$
5. Repeat steps 2 to 4 until the lists have length $k$, where $k$ is a hyperparameter chosen by the
user
6. Apply the softmax function to the vector $scores_n$ of length $k$:
$$\mathrm{softmax}(scores_n)_j = \frac{e^{scores_{n,j}}}{\sum_{l=1}^{k} e^{scores_{n,l}}}$$</p>
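        <p>A sketch of this procedure for a single snippet; the span-length limit below stands in for the admission rules, which we do not spell out here:</p>
        <preformat>
import numpy as np

def top_k_candidates(start_logits, end_logits, k=2, max_len=30):
    # start_logits, end_logits: 1-D float arrays, one logit per token
    n = len(start_logits)
    # M[i, j] = s_i + e_j, kept upper triangular (j >= i) so the
    # end of the span never precedes its start
    M = start_logits[:, None] + end_logits[None, :]
    M[np.tril_indices(n, -1)] = -np.inf

    spans, scores = [], []
    while len(spans) &lt; k:
        i, j = np.unravel_index(np.argmax(M), M.shape)
        if np.isneginf(M[i, j]):
            break                      # matrix exhausted
        if j - i &lt; max_len:            # admission rule (assumed)
            spans.append((i, j))
            scores.append(M[i, j])
        M[i, j] = -np.inf              # remove entry and repeat

    if not spans:
        return [], np.array([])
    scores = np.asarray(scores)
    probs = np.exp(scores - scores.max())  # softmax over the k scores
    return spans, probs / probs.sum()
        </preformat>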
<p>To select the top 5 predictions for question $q$, we simply select the 5 expressions from
the concatenation of the per-snippet vectors $preds_n$ with the 5 highest corresponding values in the
concatenated $scores_n$.
3.3.3. List
Potential answers for list questions are retrieved using the same method as for factoid questions.
The process however requires some extra processing steps, given that for list questions different
entities need to be discriminated.</p>
<p>To select the best list of candidates, we used voting systems, treating each distinct obtained
answer as a candidate and its frequency in the answers as votes. The systems of Single
Transferable Vote (STV) and Preferential Block Voting (PBV) were tested, with STV having the
best performance. Elections are performed in rounds; in each round candidates are categorized
into states: Elected - if the candidate has already won; Rejected - if the candidate is already unable
to win; Hopeful - if the candidate has neither won nor yet been discarded.</p>
<p>Candidates for answers are obtained by splitting the predictions by all the usual separator
characters and words (e.g. ',', 'and', ';', 'or'). We tested doing the splitting after
the voting - treating full answers as candidates for the STV (STV + PostProcess) - and doing the
splitting before the voting, where separate distinct entities are treated as votes, with the score for the
ranked ballot being the average score of all the answers that contain that entity. An example of
ranked candidates before and after being processed can be seen in Tables 6 and 7. E.g. the score
of the candidate "dizziness" will be the average of the scores of the answers containing it: 0.21,
0.20 and 0.18 (1st, 2nd and 5th entries of Table 6). Each snippet contributes to the voting with
a ballot of ranked candidates, which then enters the voting algorithm.</p>
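        <p>A sketch of the splitting-before-voting step; the separator set and the input layout are illustrative:</p>
        <preformat>
import re
from collections import defaultdict

SEPARATORS = r",|;| and | or "  # usual separator characters and words

def entity_ballot(answers):
    """Turn one snippet's scored answers into a ranked ballot of entities;
    an entity's score is the average score of all answers containing it."""
    per_entity = defaultdict(list)
    for answer, score in answers.items():
        for entity in re.split(SEPARATORS, answer):
            entity = entity.strip().lower()
            if entity:
                per_entity[entity].append(score)
    avg = {e: sum(v) / len(v) for e, v in per_entity.items()}
    return sorted(avg, key=avg.get, reverse=True)

# "dizziness" is scored as the average of every answer it appears in
ballot = entity_ballot({"dizziness, headache": 0.21,
                        "dizziness": 0.20,
                        "nausea and dizziness": 0.18})
print(ballot)  # ['headache', 'dizziness', 'nausea']
        </preformat>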
<p>A potential handicap of using voting system algorithms for answer selection is the need
to predefine the number of elected entities, since in an election the number of winners is
established beforehand. This is not ideal, since the correct number of answers for a given list
question is not defined. Two characteristics of the implemented algorithms that allow us to
minimize this problem are:
• If the number of non-rejected candidates is smaller than the number of winners to be selected,
all of them are elected
• If there are ties in the election, all tied candidates are elected - even if this means electing
more candidates than requested</p>
<p>Although these factors allow for some flexibility in the number of predictions, a more adaptive
approach can be used. Since elections are performed in rounds, one can define the selected
answers as the ones that are not rejected in the pre-final round, i.e., all candidates with states
in {Hopeful, Elected}. When referencing this approach we call the number of candidates Hopeful.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Software</title>
<p>Our team tried to replicate the results of 4 state-of-the-art systems and found some reproducibility
issues. Among the causes were outdated versions of packages and compatibility issues due to the
use of conflicting code libraries, such as the use of both Tensorflow and PyTorch for different stages
of the pipeline.</p>
        <p>
          To avoid the aforementioned issues our implementation was done in a modularized fashion,
built in Python 3.6[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], using the Pytorch[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] versions of model implementations from the
Transformers[
          <xref ref-type="bibr" rid="ref9">9</xref>
] library as the main structure. In spite of its fully PyTorch architecture, the system
accepts as input Tensorflow [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] checkpoints (model’s saved parameters).
        </p>
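        <p>Loading a Tensorflow checkpoint into the PyTorch model is done through the Transformers library; the path below is a placeholder:</p>
        <preformat>
from transformers import BertForQuestionAnswering

# from_tf=True converts a TensorFlow checkpoint into the PyTorch
# model used throughout the rest of the pipeline
model = BertForQuestionAnswering.from_pretrained(
    "path/to/biobert_tf_checkpoint", from_tf=True)
        </preformat>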
<p>Fine-tuning was performed using parallelization on 6 GPUs (Tesla M10) with 8GB of memory
each. The total batch size is 18 (3 samples per GPU). A summary of the training details of the reported
results can be found in Table 8.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Metrics</title>
        <p>
For evaluation and comparison of the different models, the official BioASQ performance measures
were used [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
<p>For Yes/No questions the official metric is the MacroF1 - the mean of the F1 scores of the Yes and No
classes. Accuracy is also calculated for completeness. Factoid questions are evaluated using
Mean Reciprocal Rank (MRR); Strict Accuracy (SAcc) and Lenient Accuracy (LAcc) are also
calculated. List questions are evaluated by the average F1 score over all questions, with the mean
precision and recall also reported.</p>
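        <p>As an illustration, a minimal sketch of the MRR computation for factoid questions, assuming one golden answer per question and exact string matching:</p>
        <preformat>
def mean_reciprocal_rank(ranked_predictions, golden_answers):
    # reciprocal rank of the first correct answer among the (up to 5)
    # returned predictions; 0 if the answer is absent
    total = 0.0
    for preds, gold in zip(ranked_predictions, golden_answers):
        for rank, pred in enumerate(preds, start=1):
            if pred.lower() == gold.lower():
                total += 1.0 / rank
                break
    return total / len(golden_answers)
        </preformat>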
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Results</title>
<p>In this section we present the experimental results of the described approaches. Training was
done on the task 8B training set and evaluation was done on the aggregation of all 8B Phase B
batches. Results are compared with the average (weighted by the number of questions) of the
results of all DMIS systems in the 5 batches.
4.1.1. Yes/No
In Table 9 we can see the results of the different classification architectures for the Yes/No
question type. Results are significantly better with the extra fully connected layer before the
final binary one. Experiments showed that performance differs only slightly with the number of
neurons of the middle layer if it lies between 128 and 512. The highest MacroF1 was obtained with
256 neurons ([CLS]-3-256-2).
4.1.2. Factoid</p>
<p>For factoid questions performance increased substantially with the use of the k-candidates
approach. The results can be seen in Table 10. The best results were obtained with $k = 2$
candidate answers per snippet. It is interesting to point out that for $k > 4$ the results barely
differ. This is due to the fact that candidates of order higher than 4 typically have extremely
low scores and end up with probabilities close to 0, and are therefore discarded when the top 5
predictions are extracted.
4.1.3. List
In Table 11 we can see the results of the experiments with the list questions. We can observe
the impact of requesting different numbers of winners from the algorithm. Unsurprisingly, a
larger number of winners leads to an increase in Recall and a decrease in Precision. Maximum
performance (MacroF1) is obtained with the Hopeful strategy, for both processing strategies.</p>
        <p>Results show that splitting candidates prior to the voting leads to better results.</p>
      </sec>
      <sec id="sec-4-2">
<title>4.2. BioASQ Official Results</title>
        <p>A summary of the official results from BioASQ Task 9B - Phase B can be seen in Table 12, where
we present the results of the top teams along with ours (LASIGE), considering the BioASQ
ordering. The place in each batch is taken to be the place of the best scoring system of
each team, treating all systems of each team as one.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Unanswerability</title>
<p>An important aspect to consider when evaluating performance is the unanswerability of some
questions in the dataset. Several questions in the test sets have an answer that cannot be
extracted from the provided snippets. Ideally, to measure the actual performance of answer
extraction systems, these would be removed from the test set. Examples of such questions can
be seen in Table 13.</p>
<p>For the test set of task 8B (resulting from the aggregation of the 5 test batches), 22.5% of
factoid questions do not contain the golden answer in any provided snippet, and 25.3% of list
questions have at least one entity that is not contained in the snippets. For Yes/No questions
unanswerability would have to be assessed manually.</p>
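        <p>A sketch of how factoid unanswerability can be measured; the field names are assumptions based on the BioASQ JSON format:</p>
        <preformat>
def is_answerable(question):
    # a factoid question counts as answerable if the golden answer
    # appears verbatim in at least one of its snippets
    gold = question["exact_answer"].lower()
    return any(gold in snippet.lower() for snippet in question["snippets"])

# unanswerability rate over a set of factoid questions:
# rate = 1 - sum(map(is_answerable, factoids)) / len(factoids)
        </preformat>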
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Analysis of Results</title>
<p>All reported results were obtained using BioBERT Base (12 stacked encoding layers). Although
we tested BioBERT Large (24 stacked encoding layers), which usually obtains better results,
the results were very poor. This is probably due to memory restrictions: since BioBERT Large
has over three times more trainable parameters, reductions in input size and batch size had to
be made, which are probably the cause of the low performance.
5.1.1. Yes/No
The addition of a fully connected layer between the MNLI classification layer (3 neurons) and
the BioASQ binary classification layer improved performance on the test set. This indicates
that the relation between the knowledge obtained from the NLI data and the knowledge needed for the
BioASQ questions is not as direct as one might expect. This is not uncommon when we are
dealing with corpora from different domains (general purpose vs biomedical), and might also be
related to the existence of unanswerable questions in the dataset, which hinder the model's learning
of what represents agreement between question and snippet, since the inputs with no relation
induce noise for the binary task.</p>
<p>In Figure 5 we can see the distribution of the confidence levels for Yes ($P_{yes}$) and No
($P_{no}$) predictions, compared against the actual correct answer. Note that the
model's discriminatory power (the distance between $P_{yes}$ and $P_{no}$) is much greater for
answers with the Yes label. This can also be seen by looking at the differences between the F1
scores of both classes, noting that $F1_{yes}$ is much higher than $F1_{no}$ across experiments. This
is not surprising in NLI, as it is easier to identify entailment than it is to distinguish between
contradiction and neutral relations. Entailment is usually distinctly expressed in the passage,
whilst contradiction sometimes needs to be inferred from more complicated relations between
sentences.
5.1.2. Factoid
Looking at the experimental results (Table 10) we can see that sorting predictions using scores
obtained by applying Softmax to the $k$ predictions for each snippet strongly improved all
metrics. Moreover, we can look at the fitness of the scores by analysing Figure 6, where we
compare the distributions of confidence levels for predictions that were in fact correct and
incorrect. We can see that for the classic approach there is an almost 100% overlap of
incorrect scores with correct ones, which implies the scoring is not strong. Although there is
still some expected overlap in the k-candidates approach, one can distinctly see a higher level
of confidence for correct answers, indicating the validity of the proposed score as a confidence
level metric.
5.1.3. List
Using voting systems for the selection of answers to list questions proved to be effective, and we can see in
Table 12 that the proposed system obtained overall strong results for List type questions, with
the exception of Batch 5.</p>
<p>By using the Hopeful approach, one has flexibility in the number of entities that are selected,
and in fact this approach has the best MacroF1 scores across experiments. With the application
of the voting systems, as opposed to using a predefined threshold for answer selection, we make
use not only of the confidence level of each answer but also of the occurrence of the answer
and its relative certainty amongst other answers from the same input.</p>
<p>Figure 6 panel captions: (a) Scores from Softmax(Start Logits) plus Softmax(End Logits); (b) Softmax(k top predictions).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this paper we used transfer learning to fine-tune BioBERT on general purpose datasets (MNLI
and SQuAD) prior to fine-tuning on the BioASQ dataset. We showed how the post-processing of
the model outputs greatly impacts performance, revealing that applying Softmax to the output
scores of only the $k$ selected candidates, to obtain the predictions' confidence levels, improves
overall performance and makes the scores more meaningful. We also showed that using the Single
Transferable Vote system for electing answer candidates for list questions yields promising
results, outperforming the previous approach of selecting candidates merely based on a defined
threshold.</p>
<p>To increase the current model's performance in the future, one can enrich the transfer learning
sequences with additional biomedical domain corpora, or train the current system using BioBERT
Large on GPUs with more memory, with the same training parameters (input size, learning rate and
batch size). Another possibility is to adapt the BERT architecture to allow for the combined
training of start and end logits, i.e., train QA to find the exact span of the answer within the text
- conditioning the end of the answer on its start - instead of training them separately and doing the
conditioning in the post-processing phase.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by FCT through project DeST: Deep Semantic Tagger project, ref.
PTDC/CCI-BIO/28685/2017, and the LASIGE Research Unit, ref. UIDB/00408/2020 and ref.
UIDP/00408/2020.</p>
<p>We would like to thank Doctor Maria Fernandes from the University of Luxembourg, who
provided us access to larger GPUs for running experiments, for all her help and support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          (
          <year>2019</year>
). URL: http://dx.doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Transferability of natural language inference to biomedical question answering</article-title>
          ,
          <year>2021</year>
. arXiv:2007.00217.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <article-title>How transferable are features in deep neural networks</article-title>
          ?,
          <year>2014</year>
. arXiv:1411.1792.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
, SQuAD: 100,000+ questions for machine comprehension of text,
          <year>2016</year>
          . arXiv:1606.05250.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>A broad-coverage challenge corpus for sentence understanding through inference</article-title>
          ,
          <year>2018</year>
. arXiv:1704.05426.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dendamrongvit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kubat</surname>
          </string-name>
          ,
          <article-title>Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains</article-title>
          ,
          <year>2009</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>52</lpage>
. doi:10.1007/978-3-642-14640-4_4.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
<string-name>
            <surname>Python Core Team</surname>
          </string-name>
          ,
          <article-title>Python: A dynamic, open source programming language</article-title>
          ,
          <source>Python Software Foundation</source>
          , Vienna, Austria,
          <year>2016</year>
          . URL: https://www.python.org/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
, Pytorch:
          <article-title>An imperative style, high-performance deep learning library</article-title>
          , in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          ,
Curran Associates, Inc.,
          <year>2019</year>
          , pp.
          <fpage>8024</fpage>
          -
          <lpage>8035</lpage>
. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brevdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Citro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Devin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harp</surname>
          </string-name>
          , G. Irving,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jozefowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kudlur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Levenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Warden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <source>TensorFlow: Large-scale machine learning on heterogeneous systems</source>
          ,
          <year>2015</year>
. URL: https://www.tensorflow.org/, software available from tensorflow.org.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rodriguez-Penagos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
, G. Paliouras, Overview of BioASQ
          <year>2020</year>
          :
          <article-title>The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020. URL: https://link.springer.com/chapter/10.1007/978-3-030-58219-7_16.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>