=Paper=
{{Paper
|id=Vol-2936/paper-27
|storemode=property
|title=NCU-IISR/AS-GIS: Results of Various Pre-trained Biomedical Language Models and Linear Regression Model in BioASQ Task 9b Phase B
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-27.pdf
|volume=Vol-2936
|authors=Yu Zhang,Jen-Chieh Han,Richard Tzong-Han Tsai
|dblpUrl=https://dblp.org/rec/conf/clef/ZhangHT21
}}
==NCU-IISR/AS-GIS: Results of Various Pre-trained Biomedical Language Models and Linear Regression Model in BioASQ Task 9b Phase B==
NCU-IISR/AS-GIS: Results of Various Pre-trained Biomedical Language Models and Linear Regression Model in BioASQ Task 9b Phase B

Yu Zhang 1, Jen-Chieh Han 1 and Richard Tzong-Han Tsai 1,2,3,*
1 Department of Computer Science and Information Engineering, National Central University, Taiwan
2 IoX Center, National Taiwan University, Taiwan
3 Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania.
EMAIL: phoenix000.taipei@gmail.com (A.1); joyhan@cc.ncu.edu.tw (A.2); thtsai@csie.ncu.edu.tw (A.3)
* Corresponding author

Abstract
The Transformer architecture has been widely applied in the Natural Language Processing (NLP) field and has given rise to a number of pre-trained language models such as BioBERT, SciBERT, NCBI BlueBERT, and PubMedBERT. In this paper, we introduce our system for BioASQ Task 9b Phase B. We employed various pre-trained biomedical language models, including BioBERT, BioBERT-MNLI, and PubMedBERT, to generate “exact” answers for the questions, and a linear regression model with our sentence embeddings to construct the top-n sentences as a prediction for the “ideal” answers.

Keywords
Biomedical Question Answering, Pre-trained Language Model, Linear Regression

1. Introduction

With the rapid growth of interest in Artificial Intelligence (AI), biomedical question answering has been receiving increasing attention [1-3]. Is AI able to answer a biomedical question, such as “Does metformin interfere thyroxine absorption?”, correctly? Is AI able to give textual evidence for its answer? To facilitate answering these questions, we participated in BioASQ Task 9b Phase B (QA task), where participants must return either an exact answer or an ideal answer based on a given biomedical question and a list of question-relevant articles/snippets.

BioASQ Task 9b Phase B provided 3,743 training questions, including the previous year's test set with gold annotations, plus 500 test questions for evaluation, divided into five batches of 100 questions each. All questions and answers were constructed by a team of biomedical experts from across Europe and were classified into four types: Yes/no, Factoid, List, and Summary. Three question types required exact answers: Yes/no, Factoid, and List. For all four types, participants were asked to submit ideal answers. In Task 9b, each participant was allowed to submit up to five results per batch.

Figure 1 illustrates examples of the four QA types of BioASQ Task 9b Phase B (QA task). As shown in Figure 1, each BioASQ QA example provides a question and several relevant PubMed abstract fragments as relevant snippets. We therefore formulated the task as query-based multi-document (a) extraction for the exact answer and (b) summarization for the ideal answer.

Figure 1. The QA examples of the BioASQ Task 9b Phase B (QA task).
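To make the task input and output concrete, the sketch below shows, in Python, roughly what one BioASQ 9b training question looks like. The field names follow the public BioASQ releases as we understand them and the identifier and URL are hypothetical, so treat this as an illustration rather than the official schema.

```python
# Illustrative structure of one BioASQ 9b training question.
# Field names are assumed from public BioASQ releases; the official schema may differ.
example_question = {
    "id": "5c9904eaecadf2e73f000023",   # hypothetical identifier
    "type": "yesno",                    # one of: yesno, factoid, list, summary
    "body": "Does metformin interfere thyroxine absorption?",
    "snippets": [
        {
            "document": "http://www.ncbi.nlm.nih.gov/pubmed/00000000",  # placeholder URL
            "text": "Metformin could interfere with thyroxine absorption ...",
        },
        # ... more question-relevant abstract fragments
    ],
    "exact_answer": "yes",              # absent for summary questions
    "ideal_answer": ["Yes, metformin interferes with thyroxine absorption."],
}

# Exact answers are extracted from the snippets; ideal answers are short,
# paragraph-style summaries written by the BioASQ experts.
```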
Last year, we used the BioBERT model combined with logistic regression and achieved the best result in generating ideal answers in batch 5 [4]. In this paper, we employ pre-trained language models to improve our results, including BioBERT [5], BioBERT-MNLI [6], and PubMedBERT [7]. BioBERT-MNLI is BioBERT fine-tuned on the MultiNLI (Multi-Genre Natural Language Inference) corpus, a dataset created for the sentence understanding task [8]. BioBERT-related models achieved the best performance in extracting exact answers last year [6]. PubMedBERT is the latest BERT model pre-trained on a biomedical corpus, and it outperformed BioBERT on BLURB (the Biomedical Language Understanding and Reasoning Benchmark). We applied the pre-trained models' [CLS] embeddings as input to a linear regression model for predicting ideal answers.

The remainder of this paper is organized as follows. Section 2 briefly reviews recent work on biomedical QA and pre-trained models. Section 3 describes the details of our methods. Section 4 describes the configurations we submitted to the BioASQ 9b Phase B challenge and the results. Section 5 discusses and summarizes our system's performance in the BioASQ QA task.

2. Related Work

The acquisition of biomedical knowledge is usually carried out by reading academic papers. This process is time-consuming and labor-intensive and has a high professional threshold: biomedical professionals cannot quickly obtain the required knowledge in a short period of time, and the general public is unable to acquire biomedical knowledge without expert assistance. QA in natural language processing has the potential to alleviate these problems by providing direct answers to users' questions. It tests the ability of machine learning systems to semantically understand, retrieve, and generate answers from existing text. Many QA models based on deep learning have been developed and even deployed in recent years [9].

Biomedical QA tasks: Biomedical QA tasks require a large amount of annotated data to train models, which is a prerequisite for deep learning. In addition to BioASQ, several QA datasets annotated by biomedical experts have been published recently [1, 2]. The PubMedQA dataset is a set of research questions, each with a reference text from a PubMed abstract and the span of the text providing the answer (yes/maybe/no) to the research question. BioBERT generally outperformed other deep learning methods such as BiLSTM and ESIM on the PubMedQA dataset [1]. Another biomedical QA task that deserves attention is COVID-QA, a SQuAD-like question answering dataset consisting of 2,019 question/answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19. It differs from traditional MRC datasets such as SQuAD in that the answers come from much longer contexts [2].

PubMedBERT: Following the successful application of BERT to natural language processing tasks in various fields, more and more specialized pre-trained language models have been developed for the biomedical domain, including BioBERT, SciBERT [10], ClinicalBERT [11], BlueBERT [12], PubMedBERT, and so on. Among them, PubMedBERT is the state-of-the-art model developed by Microsoft. Its pre-training differs from that of existing biomedical language models: PubMedBERT is trained from scratch on professional texts (PubMed papers) instead of continuing training from a model built on general-domain texts [7]. It has outperformed BioBERT on many biomedical NER, QA, and relation extraction tasks.

Sequential learning with BioBERT: Pre-trained language models effectively improve the performance of target tasks, and sequential transfer learning can be used to further improve the performance of biomedical question answering.
In the general QA domain, learning relationships between sentence pairs first is effective in sequential transfer learning [13]. BioBERT's research team found that this approach can also be applied to biomedical QA: they demonstrated that fine-tuning on a language-understanding dataset and then a question-answering dataset can improve BioBERT's performance on BioASQ tasks, and they released new fine-tuned models such as BioBERT-MNLI and BioBERT-MNLI-SQuAD [6].

3. Method

For the ideal answers, we basically used the same method as in our BioASQ 8b participation last year and tested it with different pre-trained language models to boost performance. The goal of our method is to select the most relevant segments for each question in a BioASQ QA instance, and our work was inspired by the logistic regression framework proposed by Diego Mollá [14]. The approach follows the two steps of his summarization process: in the first step, the input text is segmented into candidate sentences and each candidate sentence is scored; in the second step, the top n sentences with the highest scores are returned. Our variant uses a pre-trained language model and replaces the original features with its contextual embeddings.

Figure 2. How a candidate answer (sentence) and the corresponding question obtain the contextual embeddings in the last layer of the BERT model (BioBERT, PubMedBERT, etc.).

The training steps are as follows:
1. For the snippets and ideal answers from the training set released by the BioASQ organizers, we used NLTK's sentence tokenizer to divide the snippets into sentences.
2. We calculated the ROUGE-SU4 F1 score [15] between each sentence and the associated ideal answer.
3. All sentences from the snippets, together with their generated scores, were considered candidate answers. The candidate answers, their corresponding questions, and their scores became the training set for our linear regression model.
4. We input a candidate answer (sentence) and the corresponding question at the same time, using the score as the prediction target, and fine-tuned the pre-trained BERT language model to fit this task. We used the [CLS] embedding, which represents the relation between a candidate sentence and a question, as the feature and appended a dense layer with ReLU activation after the output layer of the BERT model. Mean squared error was used as the loss function. Our script is modified from Google BERT's official TensorFlow code and takes the default settings of BERT trained on SQuAD [16].

For inference, we used the model fine-tuned in step 4 to predict scores for the test data and then re-ranked the candidate sentences for each question. Because the ideal answers in the training set mostly consist of only one sentence, we selected only the top 1 sentence as our system output (ideal answer).
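The sentence-scoring step can be summarized with the short sketch below. This is a minimal re-implementation in PyTorch with the Hugging Face transformers library, not the authors' actual script (which modifies Google BERT's TensorFlow code); the checkpoint name, head dimensions, and optimizer settings are illustrative assumptions.

```python
# Minimal sketch of the ideal-answer scorer: a pre-trained biomedical BERT whose
# [CLS] embedding of a (question, candidate sentence) pair is mapped to a
# ROUGE-SU4 F1 target through a small regression head trained with MSE loss.
# Assumed: PyTorch + Hugging Face transformers; hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # swap for BioBERT-MNLI / PubMedBERT

class SentenceScorer(nn.Module):
    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # Dense layer with ReLU on the [CLS] output, then a scalar score
        # (the exact head size is our assumption).
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, **encoded):
        cls = self.bert(**encoded).last_hidden_state[:, 0]  # [CLS] embedding
        return self.head(cls).squeeze(-1)                   # predicted ROUGE-SU4 F1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = SentenceScorer()
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(questions, sentences, target_scores):
    """One gradient step on a batch of (question, candidate sentence, ROUGE-SU4) triples."""
    encoded = tokenizer(questions, sentences, padding=True, truncation=True,
                        max_length=384, return_tensors="pt")
    preds = model(**encoded)
    loss = loss_fn(preds, torch.tensor(target_scores, dtype=torch.float))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def pick_ideal_answer(question, candidate_sentences):
    """Re-rank the snippet sentences for a question and return the top 1 as the ideal answer."""
    encoded = tokenizer([question] * len(candidate_sentences), candidate_sentences,
                        padding=True, truncation=True, max_length=384, return_tensors="pt")
    scores = model(**encoded)
    return candidate_sentences[int(scores.argmax())]
```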
The improvement over the above method mainly lies in replacing the pre-trained language model. We used BioBERT-MNLI (NCU-IISR/AS-GIS-2) and PubMedBERT (NCU-IISR/AS-GIS-3) in place of BioBERT (NCU-IISR/AS-GIS-1) in an attempt to improve performance. BioBERT-MNLI is a BioBERT model fine-tuned on the MultiNLI dataset. MultiNLI (Multi-Genre Natural Language Inference), published by New York University, is a textual entailment task that requires determining whether a hypothesis holds given a premise, or whether the hypothesis contradicts or is neutral to the premise. MultiNLI's main feature is that it is a collection of texts from many different domains.

We believe that the MultiNLI task is highly similar to the ideal answer selection task. On the one hand, the questions and ideal answers in the BioASQ 9b training set are usually single sentences, and the premises and hypotheses in the MultiNLI dataset are also single sentences, so the data lengths are basically the same. On the other hand, a question and its ideal answer need to maintain a logical entailment relationship: we can analogize the question to the premise and the ideal answer to the hypothesis, and only answers that are logically related should be considered.

PubMedBERT is similar to BioBERT in that both are trained using the PubMed corpus. However, BioBERT adopts a continued pre-training approach based on BERT, so it uses the vocabulary built from Wikipedia and the books corpus. PubMedBERT, on the other hand, is pre-trained from scratch on PubMed text. This means that PubMedBERT is less influenced by general-domain texts and focuses on the biomedical research corpus.

In addition, to test the applicability of PubMedBERT to BioASQ tasks, we also used PubMedBERT with KU-DMIS's method [6] for the exact answer task (similar to SQuAD). This method converts BioASQ's List and Factoid question-answer data into a format similar to SQuAD and then fine-tunes the model in a way similar to Google's BERT on SQuAD; a schematic of this conversion is sketched at the end of this section. For Yes/no questions, it adds a linear layer on top of the BERT model for binary sequence classification. These methods have performed well in past challenges. To simplify the parameter tuning process, we used Microsoft's open-source AutoML system NNI [17] to tune the parameters of this task automatically. However, for the ideal answer task, we did not perform multiple experiments because of the time limit.

For the hardware, we used an NVIDIA GeForce GTX 1080 GPU for the Factoid and List exact answer tasks. The ideal answer tasks and the Yes/no exact answer task were trained on an NVIDIA Tesla T4 GPU provided by Google Colab. Because of the limited GPU memory, we reduced the batch size for the Factoid and List question tasks to 4, which may affect the performance in the following experiments.
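To make the exact-answer preprocessing concrete, the sketch below shows one way a BioASQ factoid or list entry could be converted into a SQuAD-style record, following the description above. It is an assumption-laden illustration (the snippet-concatenation strategy and field handling are ours), not KU-DMIS's actual conversion script.

```python
# Hedged sketch: converting a BioASQ factoid/list question into a SQuAD-style
# record so that a standard BERT-on-SQuAD fine-tuning script can be reused.
# The snippet concatenation and field names are illustrative assumptions.

def bioasq_to_squad(question):
    """Turn one BioASQ factoid/list question (dict as in the earlier sketch)
    into a SQuAD-style entry with answer character offsets."""
    # Use the concatenated snippets as the reading-comprehension context.
    context = " ".join(s["text"] for s in question["snippets"])

    # Factoid/list exact answers are short entity strings (possibly grouped with
    # synonyms); keep those found verbatim in the context so that start offsets
    # can be computed.
    answers = question.get("exact_answer", [])
    flat = [a if isinstance(a, str) else a[0] for a in answers]
    spans = [{"text": a, "answer_start": context.find(a)}
             for a in flat if context.find(a) != -1]

    qas = [{"id": question["id"], "question": question["body"], "answers": spans}]
    return {"title": question["id"], "paragraphs": [{"context": context, "qas": qas}]}
```

For Yes/no questions no span is needed; the question and the concatenated snippets would instead be fed to a sequence classifier with a yes/no label, matching the binary classification setup mentioned above.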
4. Submission

Our submitted configurations are summarized in Table 1. We tested the performance of the pre-trained language models by conducting experiments on the BioASQ 9b data for each exact answer task. Since the results of PubMedBERT were not as good as those of the BioBERT-related models in these experiments, we only submitted the best KU-DMIS BioBERT-related model results: the BioBERT-MNLI model was used for the Yes/no questions, while the BioBERT-MNLI-SQuAD model was used for both the Factoid and List questions. We should additionally mention that all three systems submitted the same exact answers.

Table 1. Descriptions of our three systems.
System Name       | System Description                                                                                                           | Participating Batches
NCU-IISR/AS-GIS-1 | Exact answers: KU-DMIS BioBERT-related models. Ideal answers: BioBERT with predicted ROUGE-SU4 scores to select the top 1 sentence of the snippets.       | 4, 5
NCU-IISR/AS-GIS-2 | Exact answers: KU-DMIS BioBERT-related models. Ideal answers: BioBERT-MNLI with predicted ROUGE-SU4 scores to select the top 1 sentence of the snippets.  | 4, 5
NCU-IISR/AS-GIS-3 | Exact answers: KU-DMIS BioBERT-related models. Ideal answers: PubMedBERT with predicted ROUGE-SU4 scores to select the top 1 sentence of the snippets.    | 4, 5

Table 2. Results of test batches 4 and 5 for exact answers in the BioASQ QA task. Total systems counts the number of participants for each batch in the given category. For example, our system ranked third in batch 5 for Yes/no questions. Best Score indicates the best result across all participants, and Median Score the median result.
Batch | System        | Yes/no (Macro F1) | Factoid (MRR) | List (F-measure)
4     | Best Score    | 0.9480            | 0.6929        | 0.7061
4     | Ours          | 0.8441            | 0.4232        | 0.4261
4     | Median Score  | 0.4186            | 0.5030        | 0.4960
4     | Total systems | 52                | 41            | 30
5     | Best Score    | 0.8246            | 0.5880        | 0.5175
5     | Ours          | 0.7738 (#3)       | 0.5287        | 0.3673
5     | Median Score  | 0.5522            | 0.4722        | 0.3438
5     | Total systems | 56                | 45            | 38

Model performance in predicting exact answers is shown in Table 2. Our system performed better than the median score for all three question types in batch 5. In particular, our system generally performed better on the Yes/no category than on the other two question types and scored near the best Macro F1 in both batch 4 and batch 5; it ranked third on Yes/no questions in the fifth batch.

Table 3. Results (ROUGE-2 and ROUGE-SU4 F1 and Recall scores) of test batches 4 and 5 for ideal answers in the BioASQ QA task. Total systems counts the number of participants in each batch. In batches 4 and 5, our system “NCU-IISR/AS-GIS-2” took first place among submitted systems in both F1 scores. However, the Recall scores of our systems are lower than the best scores.
Metric           | System Name       | Batch 4     | Batch 5
ROUGE-2 F1       | Best Score        | 0.3790 (#2) | 0.3846 (#2)
ROUGE-2 F1       | NCU-IISR/AS-GIS-1 | 0.3280      | 0.2839
ROUGE-2 F1       | NCU-IISR/AS-GIS-2 | 0.4454 (#1) | 0.3946 (#1)
ROUGE-2 F1       | NCU-IISR/AS-GIS-3 | 0.2694      | 0.2817
ROUGE-2 F1       | Median Score      | 0.3414      | 0.2629
ROUGE-SU4 F1     | Best Score        | 0.3681 (#2) | 0.3733 (#2)
ROUGE-SU4 F1     | NCU-IISR/AS-GIS-1 | 0.3318      | 0.2846
ROUGE-SU4 F1     | NCU-IISR/AS-GIS-2 | 0.4402 (#1) | 0.3893 (#1)
ROUGE-SU4 F1     | NCU-IISR/AS-GIS-3 | 0.2674      | 0.2666
ROUGE-SU4 F1     | Median Score      | 0.3330      | 0.2573
ROUGE-2 Recall   | Best Score        | 0.7124      | 0.6056
ROUGE-2 Recall   | NCU-IISR/AS-GIS-1 | 0.3370      | 0.2962
ROUGE-2 Recall   | NCU-IISR/AS-GIS-2 | 0.4505      | 0.4072
ROUGE-2 Recall   | NCU-IISR/AS-GIS-3 | 0.2830      | 0.2817
ROUGE-2 Recall   | Median Score      | 0.4505      | 0.2863
ROUGE-SU4 Recall | Best Score        | 0.7107      | 0.6077
ROUGE-SU4 Recall | NCU-IISR/AS-GIS-1 | 0.2851      | 0.3093
ROUGE-SU4 Recall | NCU-IISR/AS-GIS-2 | 0.4550      | 0.4087
ROUGE-SU4 Recall | NCU-IISR/AS-GIS-3 | 0.3471      | 0.2939
ROUGE-SU4 Recall | Median Score      | 0.4550      | 0.2926
Total Systems    |                   | 31          | 28

The performance of the model in predicting the ideal answer is shown in Table 3. For ideal answers, BioASQ uses two evaluation approaches: ROUGE and human evaluation. Roughly speaking, ROUGE calculates the n-gram overlap between an automatically constructed summary and a set of human-written (golden) summaries, with higher ROUGE scores being better. Specifically, ROUGE-2 and ROUGE-SU4 were used to evaluate ideal answers; these are the most widely used versions of ROUGE and have been found to correlate well with human judgments when multiple reference summaries are available for each question. The organizers have not yet reported the results of the human evaluation (manual scoring); all ideal system answers will also be evaluated by biomedical experts.
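As a concrete illustration of the automatic metric, the snippet below computes ROUGE-2 F1 for a candidate ideal answer with the rouge-score package. ROUGE-SU4, which additionally counts skip-bigrams with a gap of up to four tokens, is not provided by that package, so this is only an approximate stand-in for the official BioASQ evaluation, not the organizers' scoring code.

```python
# Approximate illustration of the ROUGE evaluation used for ideal answers.
# BioASQ reports ROUGE-2 and ROUGE-SU4; the rouge-score package only covers
# ROUGE-N/L, so ROUGE-2 is shown here as a stand-in.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

reference = "Yes, metformin interferes with thyroxine absorption."   # golden ideal answer
candidate = "Metformin could interfere with thyroxine absorption."   # system output

scores = scorer.score(reference, candidate)
print(scores["rouge2"].precision, scores["rouge2"].recall, scores["rouge2"].fmeasure)
```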
In batches 4 and 5, our system “NCU-IISR/AS-GIS-2” took first place among the submitted systems in ROUGE-2 F1 and ROUGE-SU4 F1. In particular, in batch 4 the “NCU-IISR/AS-GIS-2” system scored 0.0664 (ROUGE-2) and 0.0721 (ROUGE-SU4) higher than the second-ranked system in terms of F1. However, the Recall scores of our systems are lower than the best scores. This may be related to the fact that we submitted only the top 1 sentence. We considered increasing the number of sentences submitted, but in the end we did not have time to test it.

Results of internal experiments for exact answers on the BioASQ 9b dataset are shown in Table 4. Both the Factoid and List experiments used NNI to tune the parameters, and each model was run at least 20 times to find its best performance. We conducted this experiment to examine whether PubMedBERT could achieve better results than BioBERT on the BioASQ task. The results are generally in line with our expectations: for Factoid and List questions, PubMedBERT (especially the Fulltext version) outperformed the basic version of BioBERT, but for Yes/no questions, PubMedBERT was not even as good as the basic version of BioBERT. This result is similar to what we observed for the ideal answers, where the ROUGE scores of the PubMedBERT-based system “NCU-IISR/AS-GIS-3” are worse than those of the BioBERT-based system “NCU-IISR/AS-GIS-1”.

Table 4. Results of internal experiments for exact answers on the BioASQ 9b dataset. Because of the differences in question types, not all BioBERT-related models were used in the experiments. Although PubMedBERT-Fulltext outperformed the basic version of BioBERT for Factoid and List questions, its scores were still much lower than the BioBERT-MNLI-SQuAD results.
Pretrained Model Name | Yes/no* (Macro F1) | Factoid** (MRR) | List*** (F-Measure)
BioBERT               | 0.7659             | 0.3990          | 0.3518
BioBERT-MNLI          | 0.8671             | -               | -
BioBERT-MNLI-SQuAD    | -                  | 0.4509          | 0.3740
PubMedBERT-Abstract   | 0.7199             | 0.4020          | 0.3470
PubMedBERT-Fulltext   | 0.6960             | 0.4248          | 0.3548
* Because the results for Yes/no questions were too disparate, we did not conduct enough experiments to tune the parameters for the best performance; except for BioBERT-MNLI, we ran the experiments only three times for each model.
** Parameter search space for the Factoid question task: [1e-6, 5e-5] for the learning rate, {4, 6} for the batch size, and {2, 3, 4} for the number of epochs.
*** Parameter search space for the List question task: [1e-6, 1e-5] for the learning rate, {4} for the batch size, and {1, 2} for the number of epochs.

Although PubMedBERT gave better results than BioBERT on some tasks, there is still a gap with BioBERT-MNLI and BioBERT-MNLI-SQuAD, which have been fine-tuned with external datasets. Therefore, we did not use the PubMedBERT-trained exact answer system for our formal submissions. In addition, we also tried to fine-tune PubMedBERT on the MultiNLI and SQuAD datasets to obtain better results, but we were not able to make progress before the end of the competition.
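To show how the NNI-based tuning looks in practice, below is a minimal sketch of the Factoid search space from the Table 4 footnote expressed for NNI, together with the two calls a trial script makes to receive parameters and report its metric. The search space mirrors the footnote above; the tuner choice, trial layout, and placeholder metric are our assumptions, not the authors' configuration.

```python
# Hedged sketch of the NNI hyperparameter search for the Factoid exact-answer task.
# The search space mirrors the Table 4 footnote; everything else is illustrative.

# Equivalent of search_space.json, written as a Python dict:
search_space = {
    "learning_rate": {"_type": "loguniform", "_value": [1e-6, 5e-5]},
    "batch_size":    {"_type": "choice",     "_value": [4, 6]},
    "num_epochs":    {"_type": "choice",     "_value": [2, 3, 4]},
}

# Inside the trial script, the tuned values are fetched and the resulting
# dev-set MRR is reported back to NNI:
import nni

params = nni.get_next_parameter()   # e.g. {"learning_rate": 2.3e-5, "batch_size": 4, ...}
# ... fine-tune the QA model with params and evaluate on the development split ...
dev_mrr = 0.42                      # placeholder for the measured MRR
nni.report_final_result(dev_mrr)
```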
5. Discussions and Conclusions

In the 9th BioASQ QA task, we used pre-trained models including BioBERT, BioBERT-MNLI, BioBERT-MNLI-SQuAD, and PubMedBERT to generate both the exact and the ideal answers. For the exact answers, we used the KU-DMIS approach to find the offsets (start and end positions) of the answer within the given passage (snippets). Although PubMedBERT outperforms the basic version of BioBERT on Factoid and List questions, it still cannot reach the performance of BioBERT-MNLI-SQuAD, which has been fine-tuned with external datasets. This result indicates the significant effect of sequential learning on existing datasets. For the ideal answers, the most relevant fragment or sentence was selected in order to maintain the integrity of the ideal answer, rather than taking the fragment-offset approach, which may focus on the wrong location and produce imperfect answers.

Our results combining BioBERT-MNLI with linear regression ranked first in both ROUGE-2 F1 and ROUGE-SU4 F1 in batches 4 and 5, showing that using a linear regression model to select sentences can yield excellent results in generating ideal answers. At the same time, BioBERT's performance on this task improved significantly after fine-tuning with the MultiNLI dataset, which suggests that the sentence entailment relationships contained in MultiNLI are useful for finding the ideal answer. However, we also found that PubMedBERT scored worse than the basic version of BioBERT in generating ideal answers and in exact answers to Yes/no questions. We speculate that this may be related to the difference between the pre-training corpora of PubMedBERT and BioBERT: PubMedBERT was not trained on general-domain corpora such as Wikipedia and books, but pre-trained from scratch on the PubMed research paper corpus. Do the differences between the general-domain corpus and the research paper corpus contribute to the differences in the predictive results on these two tasks? Are there linguistic elements that are present in general-domain texts but missing from biomedical research texts? We do not yet have sufficient evidence to answer these questions; future research could further explore the possible reasons for this discrepancy and conduct more experiments.

Directions for improving our system also include expanding the range of snippets to full abstracts and comparing activation or loss functions to find a better one. In the regression method, we only processed the snippet context and did not use the complete PubMed abstracts, so these can be utilized in the future. All told, we hope to keep the pre-trained language model as our base and make an effort to combine it with different approaches.

6. Acknowledgments

This study is supported by the Ministry of Science and Technology, Taiwan (No. MOST 109-2221-E-008-062-MY3).

7. References

[1] Jin, Q., et al., PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
[2] Möller, T., et al., COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, 2020.
[3] Tsatsaronis, G., et al., An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 2015. 16: p. 138.
[4] Han, J.-C. and R.T.-H. Tsai, NCU-IISR: Using a pre-trained language model and logistic regression model for BioASQ Task 8b Phase B. 2020.
[5] Lee, J., et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2020. 36(4): p. 1234-1240.
[6] Jeong, M., et al., Transferability of natural language inference to biomedical question answering. arXiv preprint arXiv:2007.00217, 2020.
[7] Gu, Y., et al., Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.
[8] Williams, A., N. Nangia, and S.R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.
[9] Jin, Q., et al., Biomedical question answering: A comprehensive review. arXiv preprint arXiv:2102.05281, 2021.
[10] Beltagy, I., K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019.
[11] Huang, K., J. Altosaar, and R. Ranganath, ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.
[12] Peng, Y., S. Yan, and Z. Lu, Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019.
[13] Clark, C., et al., BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
[14] Mollá, D. and C. Jones, Classification betters regression in query-based multi-document summarisation techniques for question answering. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019. Springer.
[15] Lin, C.-Y., ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
[16] Devlin, J., et al., BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] Microsoft, Neural Network Intelligence (NNI). Available from: https://github.com/microsoft/nni.