=Paper=
{{Paper
|id=Vol-2936/paper-14
|storemode=property
|title=Large Biomedical Question Answering Models with ALBERT and ELECTRA
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-14.pdf
|volume=Vol-2936
|authors=Sultan Alrowili,Vijay Shanker
|dblpUrl=https://dblp.org/rec/conf/clef/AlrowiliS21
}}
==Large Biomedical Question Answering Models with ALBERT and ELECTRA==
Sultan Alrowili, K. Vijay-Shanker

Department of Computer and Information Science, University of Delaware, Newark, Delaware, USA

===Abstract===
The majority of the systems that participated in the BioASQ8 challenge are based on the BioBERT model [1]. We adopt a different approach in our participation in the BioASQ9B challenge by taking advantage of large biomedical language models built on the ELECTRA [2] and ALBERT [3] architectures, namely BioM-ELECTRA and BioM-ALBERT [4]. Moreover, we examine the benefit of transferability [5] between BioASQ and other text classification tasks such as Multi-Genre Natural Language Inference (MultiNLI) [6]. Our results show that both BioM-ELECTRA and BioM-ALBERT significantly outperform the BioBERT model on the BioASQ9B task.

Keywords: BERT, ELECTRA, ALBERT, BioASQ

===1. Introduction===
The BioBERT model [7] represents the early success of domain adaptation of the BERT model [8] to the biomedical domain. BioBERT showed impressive results on the BioASQ7B challenge, taking the lead in most of the five batches of that challenge [9]. Furthermore, BioBERT was used in the majority of the biomedical models that competed in the BioASQ8 challenge [1]. However, since the introduction of BERT in 2018, new Transformer-based models have been introduced to the NLP community, including RoBERTa [10], ELECTRA [2], XLNet [11], Megatron-LM [12], and ALBERT [3]. Adaptations of some of these models to the biomedical domain were later introduced as BioRoBERTa [13], BioMegatron [14], and PubMedBERT [15]. Additionally, we have introduced the BioM-ELECTRA and BioM-ALBERT models [4], both of which are large-scale models adapted to the biomedical domain by pretraining on PubMed abstracts.

As noted earlier, a majority of the participant systems in the BioASQ8B challenge are based on the base-scale BioBERT model. This motivates us to examine the effectiveness of large-scale biomedical models. The main findings of our investigation are: (i) both BioM-ALBERT and BioM-ELECTRA, models that we have recently developed, are effective in addressing BioASQ factoid and list questions; (ii) treating BioASQ yes/no questions as a classification problem is an effective approach that can lead to competitive performance.

===2. System Description===
We use large-scale biomedical language models, which is one of the primary differences between our system and the prior systems that participated in the BioASQ8B challenge. In our participation in the BioASQ9B challenge, we use our BioM-ELECTRA and BioM-ALBERT models [4].

====2.1. BioM-ELECTRA====
ELECTRA builds on the Transformer encoder and attention mechanism [16] that the BERT model uses. However, ELECTRA changes the training objective, eliminating the Next Sentence Prediction (NSP) objective, a decision also taken by the RoBERTa model [10]. Moreover, ELECTRA incorporates ideas from the GAN model [17]: it generates corrupted (fake) tokens by employing a small Masked Language Model (MLM), and a discriminator model then judges those tokens and decides whether each one is an "original" or a "replaced" token.
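To make the replaced-token-detection objective concrete, the fragment below is a minimal, illustrative sketch in PyTorch; it is not our pretraining code. The generator and discriminator callables and their signatures are stand-ins for the small MLM and the full-size discriminator encoder.

<syntaxhighlight lang="python">
# Illustrative sketch of ELECTRA's replaced-token-detection objective.
# `generator` and `discriminator` are assumed callables standing in for
# the small MLM and the full-size encoder; `tokens` is (batch, seq).
import torch
import torch.nn.functional as F

def electra_loss(tokens, generator, discriminator, mask_prob=0.15):
    # 1) Mask a random subset of positions, as in BERT's MLM.
    masked_pos = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()

    # 2) The small generator samples plausible replacements ("fake"
    #    tokens) for the masked positions from its MLM distribution.
    gen_logits = generator(corrupted, masked_pos)   # (batch, seq, vocab)
    samples = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted[masked_pos] = samples[masked_pos]

    # 3) The discriminator labels every position "original" vs
    #    "replaced"; a sampled token that happens to match the input
    #    still counts as "original".
    labels = (corrupted != tokens).float()
    disc_logits = discriminator(corrupted)          # (batch, seq)
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
</syntaxhighlight>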
To shift the contextual representation of ELECTRA toward the biomedical domain, we pretrain ELECTRA on PubMed abstracts, using a domain-specific vocabulary learned from those abstracts. We pretrain BioM-ELECTRA for 434K steps on TPUv3-512 units with a batch size of 4096.

====2.2. BioM-ALBERT====
The ALBERT model [3] takes a decision similar to ELECTRA regarding the loss function by dropping the Next Sentence Prediction (NSP) objective. Furthermore, ALBERT introduces a self-supervised sentence-order prediction (SOP) objective. Additionally, ALBERT improves the efficiency of the Transformer architecture by introducing two techniques: parameter sharing and factorized embedding layers. Parameter sharing reduces parameter redundancy inside the model, while embedding factorization allows the model to increase its hidden size up to 4096 while having only 235M parameters in the case of ALBERT-xxlarge. We build BioM-ALBERT-xxlarge by pretraining ALBERT-xxlarge on PubMed abstracts using TPUv3-512 units for 264K steps with a batch size of 8192. As with BioM-ELECTRA, we pretrain BioM-ALBERT on PubMed abstracts only.

Table 1 shows the architecture design and the reported results [4] of our models on SQuAD2.0 [18] and the BioASQ7B-Factoid task against other state-of-the-art models. We include this table to give a head-to-head comparison between the different architectures that have been used by participants' systems in the BioASQ9B challenge [19]. We should also note that it is common practice in the literature to fine-tune a biomedical language model on the SQuAD dataset first and then on the BioASQ dataset, because the SQuAD2.0 dataset has more than 130K examples, which is much larger than the BioASQ dataset.

Table 1: Results of BioM-ALBERT and BioM-ELECTRA on the BioASQ7B-Factoid task. Evaluation metrics are F1 score for the SQuAD task and mean reciprocal rank (MRR) for the BioASQ task. We use the reported results of BioBERT-Base, BioBERT-Large, and BioMegatron [14], and of PubMedBERT, BioM-ELECTRA, and BioM-ALBERT [4].

{| class="wikitable"
! Model !! #Parameters !! Hidden Size !! SQuAD2.0 !! BioASQ7B-Factoid
|-
| BioBERT-Base || 110M || 768 || – || 41.1
|-
| BioM-ELECTRA-Base || 110M || 768 || 84.4 || 52.3
|-
| PubMedBERT-PMC-base || 110M || 768 || 80.9 || 51.9
|-
| BioBERT-Large || 335M || 1024 || – || 50.1
|-
| BioMegatron345m || 345M || 1024 || 84.2 || 52.5
|-
| BioM-ELECTRA-Large || 335M || 1024 || 88.3 || 54.1
|-
| BioM-ALBERT-xxLarge || 235M || 4096 || 87.0 || 56.9
|}

===3. Experimental Setup===
====3.1. Pre-Processing Phase====
For BioASQ9B factoid and list questions, we converted all questions to SQuAD v1.1 format. Accordingly, we pair each snippet (context) with a duplicated copy of its question in the training and test datasets, instead of attaching a group of snippets to one question. For yes/no questions, we adopted a binary classification approach, taking the context (snippet) as "sentence 1", the question as "sentence 2", and the answer (yes/no) as the label. We use a pre-processing script developed by [15] to generate the BioASQ classification dataset.
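To illustrate the question-duplication step, the sketch below converts BioASQ factoid items into SQuAD v1.1-style entries, one per (question, snippet) pair. The field names follow the SQuAD v1.1 schema; bioasq_questions (the parsed "questions" list from the BioASQ JSON release) and the ID scheme are illustrative assumptions, and answer-offset handling for training data is elided.

<syntaxhighlight lang="python">
# Hedged sketch: each (question, snippet) pair becomes its own
# SQuAD v1.1-style entry, so a question with k snippets yields k
# duplicated question entries.
def bioasq_to_squad(bioasq_questions):
    paragraphs = []
    for q in bioasq_questions:
        for j, snippet in enumerate(q["snippets"]):
            paragraphs.append({
                "context": snippet["text"],
                "qas": [{
                    "id": f"{q['id']}_{j:03d}",  # illustrative ID scheme
                    "question": q["body"],
                    # For training data, the gold answer text and its
                    # character offset in the snippet would go here.
                    "answers": [],
                }],
            })
    return {"version": "v1.1",
            "data": [{"title": "BioASQ9B", "paragraphs": paragraphs}]}

# Usage (assuming `data` holds the parsed BioASQ JSON):
#   squad_style = bioasq_to_squad(data["questions"])
</syntaxhighlight>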
====3.2. Environmental Design====
We fine-tune our models on factoid and list questions using Google Cloud Compute Engine with TPUv3-8 units and TensorFlow 1.15. For the yes/no task, we use the Hugging Face Transformers library [20] and a V100 GPU on the Google Colab Pro environment.

====3.3. Hyperparameters====
For factoid and list questions, we use the same hyperparameter settings as in our previous work [4], shown in Table 2. We made this decision to examine the consistency and reproducibility of both BioM-ELECTRA and BioM-ALBERT on the BioASQ9B challenge. For yes/no questions, we use the training and testing datasets of the BioASQ8B challenge to determine our hyperparameter choices.

====3.4. Task-to-Task Transfer Learning====
Early work by [5] and [21] shows that transferability (task-to-task transfer learning) from general-domain tasks such as MultiNLI [6] and SQuAD helps improve results on the SQuAD and BioASQ8B tasks. We take a similar approach by fine-tuning both BioM-ALBERT and BioM-ELECTRA on the MNLI task, then on SQuAD, and finally on the BioASQ training dataset; see the sketch after Table 2. We investigate and report the impact of this transferability on BioASQ9B in the results section.

Table 2: Details of the fine-tuning hyperparameters that we use for both BioM-ALBERT and BioM-ELECTRA (MSL = max sequence length).

{| class="wikitable"
! Task !! Model !! Learning Rate !! Batch !! Epochs !! MSL
|-
| Factoid/List || BioM-ELECTRA || 2e-5 || 24 || 4 || 512
|-
| Factoid/List || BioM-ALBERT || 1e-5 || 128 || 3 || 384
|-
| Yes/No || All our models || 3e-5 || 8 || 5 || 256
|}
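The sketch below outlines this MNLI, then SQuAD, then BioASQ chain with the Hugging Face Transformers Trainer. It is a schematic under stated assumptions, not our actual pipeline (our factoid/list runs use TensorFlow on TPUs, Section 3.2): mnli_ds, squad_ds, and bioasq_ds stand for already-tokenized datasets, the base checkpoint is a placeholder, and only the BioASQ-stage hyperparameters follow Table 2.

<syntaxhighlight lang="python">
# Hedged sketch of sequential (task-to-task) fine-tuning with the
# Hugging Face Trainer; datasets are assumed to be pre-tokenized.
from transformers import (AutoModelForQuestionAnswering,
                          AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

def finetune(model, dataset, outdir, lr, epochs, batch):
    args = TrainingArguments(output_dir=outdir, learning_rate=lr,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=batch)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    model.save_pretrained(outdir)
    return model

# Stage 1: MNLI, a three-way sentence-pair classification task
# (checkpoint name and MNLI/SQuAD hyperparameters are illustrative).
nli = AutoModelForSequenceClassification.from_pretrained(
    "google/electra-large-discriminator", num_labels=3)
finetune(nli, mnli_ds, "out/mnli", 3e-5, 3, 32)

# Stages 2-3: reload the shared encoder under a span-prediction (QA)
# head. The MNLI head is discarded and the new QA head is randomly
# initialized -- the source of the run-to-run fluctuation discussed in
# Section 4.1. The BioASQ stage follows the BioM-ELECTRA row of Table 2.
qa = AutoModelForQuestionAnswering.from_pretrained("out/mnli")
qa = finetune(qa, squad_ds, "out/squad", 2e-5, 2, 24)
qa = finetune(qa, bioasq_ds, "out/bioasq", 2e-5, 4, 24)
</syntaxhighlight>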
===4. Results and Discussion===
We participated in the BioASQ9B challenge under the name "UDEL-LAB". The results reported in this section are taken from the official BioASQ9B leaderboard. We participated in the BioASQ9B-Factoid challenge starting from batch 3, having used batch 2 to test the format of our submission; we therefore only include factoid results from batch 3 onward. We participated in the yes/no and list questions in batch five only, since both types of question require extra pre-processing that we could not develop at an early stage.

====4.1. Factoid Task====
Table 3 shows the results of our system on the BioASQ9B-Factoid challenge. We show only the top five systems for each batch, ranked by mean reciprocal rank (MRR). The Fudan University team participated with four systems under the name ir_sys [19]; their systems combine the SpanBERT [22], PubMedBERT [15], and XLNet [11] models. The "bio-answerfinder" system uses the BioELECTRA model [23], which its authors developed earlier based on the ELECTRA architecture. The results of BioM-ALBERT and BioM-ELECTRA against other models on batches three and five suggest that our models perform more consistently on BioASQ than the other models. The results also highlight that language model scale is a dominant factor in performance on BioASQ-Factoid questions: only large-scale models based on ALBERT-xxlarge, ELECTRA-large, and XLNet take the lead across the three batches. Using transferability between the MNLI and SQuAD tasks improves the score of our systems in the third batch by almost 2% in MRR. However, this improvement is not consistent across batches 4 and 5. We attribute this inconsistency to the fact that the fine-tuning layer of BERT-like models is randomly initialized; this randomness causes fluctuation in the results, especially with a small evaluation dataset [15]. On the other hand, the scores of BioM-ALBERT and BioM-ELECTRA in batches 3 and 5 suggest that an ensemble model could further improve the results.

Table 3: Results of BioM-ALBERT and BioM-ELECTRA on the BioASQ9B-Factoid task. Strict Acc. is based on the evaluation of the first predicted answer by the system; Lenient Acc. is based on whether the system returns the exact answer within its top five predicted answers.

{| class="wikitable"
! Batch !! Model !! Strict Acc. !! Lenient Acc. !! MRR
|-
| rowspan="5" | 9B Batch 3 || BioM-ALBERTxxlarge+MNLI+SQuAD+BioASQ || 0.5405 || 0.7027 || 0.6149
|-
| Ir_sys2 || 0.5946 || 0.6486 || 0.6135
|-
| BioM-ALBERTxxlarge+SQuAD+BioASQ || 0.5405 || 0.6757 || 0.5946
|-
| BioM-ELECTRA-large+SQuAD+BioASQ || 0.5135 || 0.7027 || 0.5923
|-
| bio-answerfinder || 0.5676 || 0.5946 || 0.5811
|-
| rowspan="5" | 9B Batch 4 || Ir_sys1 || 0.6429 || 0.7857 || 0.6929
|-
| Ir_sys2 || 0.6071 || 0.7500 || 0.6464
|-
| BioM-ELECTRA-large+SQuAD+BioASQ || 0.5357 || 0.7857 || 0.6351
|-
| BioM-ELECTRA-large+MNLI+SQuAD+BioASQ || 0.5000 || 0.7857 || 0.6321
|-
| BioM-ALBERTxxlarge+SQuAD+BioASQ || 0.5357 || 0.7143 || 0.5982
|-
| rowspan="5" | 9B Batch 5 || BioM-ELECTRA-large+SQuAD+BioASQ || 0.5000 || 0.7222 || 0.5880
|-
| BioM-ELECTRA-large+MNLI+SQuAD+BioASQ || 0.4722 || 0.6944 || 0.5694
|-
| finetuning1 || 0.5000 || 0.6667 || 0.5671
|-
| BioM-ALBERTxxlarge+SQuAD+BioASQ || 0.4444 || 0.7222 || 0.5588
|-
| BioM-ALBERTxxlarge+MNLI+SQuAD+BioASQ || 0.4722 || 0.6667 || 0.5556
|}

Table 4: Results of BioM-ALBERT and BioM-ELECTRA on the BioASQ9B challenge for list and yes/no questions. The official evaluation metric for the yes/no task is macro-F1 score, and for list questions it is F-measure. We only participated in batch five for these types of question.

{| class="wikitable"
! Task !! Model !! Rank !! Score
|-
| rowspan="5" | List || BioM-ALBERTxxlarge+MNLI+SQuAD+BioASQ || #1 || 0.5175
|-
| BioM-ALBERTxxlarge+SQuAD+BioASQ || #2 || 0.4927
|-
| Ir_sys2 || #3 || 0.4804
|-
| BioM-ELECTRA-large+SQuAD+BioASQ || #7 || 0.4031
|-
| BioM-ELECTRA-large+MNLI+BioASQ || #8 || 0.3936
|-
| rowspan="3" | Yes/No || KU-DMIS-2 || #1 || 0.8246
|-
| BioM-ALBERTxxlarge+SQuAD+BioASQ || #4 || 0.7564
|-
| BioM-ELECTRA-large+SQuAD+BioASQ || #5 || 0.6801
|}

====4.2. List and Yes/No Tasks====
Table 4 shows the results of our system on the BioASQ9B list and yes/no challenge. In the list task, our systems ranked first and second. We achieved this score for list questions despite using the same hyperparameters that we use for the factoid task. On the yes/no task, BioM-ALBERT performs significantly better than BioM-ELECTRA but falls behind the "KU-DMIS-2" system, which uses BioBERT-Large [19]. We should also note that the numbers of list questions (18) and yes/no questions (19) are relatively small compared with the number of factoid questions (36). Tasks with small datasets are usually sensitive to hyperparameter choices, and their results fluctuate between fine-tuning runs, especially in the case of a binary classification (yes/no) task.
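For completeness, the factoid measures used in Table 3 can be computed as in the sketch below. Gold synonyms are represented simply as a set of acceptable strings per question; the official BioASQ evaluation service applies its own answer normalization, which we omit here.

<syntaxhighlight lang="python">
# Sketch of the factoid measures defined in the caption of Table 3:
# strict accuracy (gold answer is the top prediction), lenient accuracy
# (gold answer appears in the top five), and mean reciprocal rank.
def factoid_scores(predictions, golds):
    """predictions: ranked answer lists (top five) per question;
    golds: a set of acceptable gold answers per question."""
    strict = lenient = mrr = 0.0
    for ranked, gold in zip(predictions, golds):
        hits = [i + 1 for i, ans in enumerate(ranked[:5]) if ans in gold]
        if hits:
            lenient += 1
            mrr += 1.0 / hits[0]       # reciprocal rank of first hit
            if hits[0] == 1:
                strict += 1
    n = len(golds)
    return strict / n, lenient / n, mrr / n

# Example: one question answered at rank 2, one missed entirely.
print(factoid_scores([["a", "b"], ["x"]], [{"b"}, {"y"}]))
# -> (0.0, 0.5, 0.25)
</syntaxhighlight>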
===5. Conclusion and Future Work===
We demonstrate that the BioM-ELECTRA and BioM-ALBERT models are effective in addressing the BioASQ challenge. Our systems take the lead in two batches of the factoid task, with a significant margin (2% MRR) in batch 5. Additionally, we show that applying transferability between MNLI and SQuAD led our systems to first place on factoid (batch 3) and list (batch 5) questions. For future work, we plan to build a large ensemble QA system based on both BioM-ELECTRA and BioM-ALBERT to address the BioASQ and pandemic challenges.

===Acknowledgement===
We would like to acknowledge the support of the TensorFlow Research Cloud (TFRC) team in granting us access to TPUv3 units.

===References===
[1] A. Nentidis, A. Krithara, K. Bougiatiotis, M. Krallinger, C. Rodriguez-Penagos, M. Villegas, G. Paliouras, Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2020. URL: https://link.springer.com/chapter/10.1007/978-3-030-58219-7_16.

[2] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, 2020. arXiv:2003.10555.

[3] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, 2020. arXiv:1909.11942.

[4] S. Alrowili, V. Shanker, BioM-Transformers: Building large biomedical language models with BERT, ALBERT and ELECTRA, in: Proceedings of the 20th Workshop on Biomedical Language Processing, Association for Computational Linguistics, Online, 2021, pp. 221–227. URL: https://www.aclweb.org/anthology/2021.bionlp-1.24. doi:10.18653/v1/2021.bionlp-1.24.

[5] M. Jeong, M. Sung, G. Kim, D. Kim, W. Yoon, J. Yoo, J. Kang, Transferability of natural language inference to biomedical question answering, 2021. arXiv:2007.00217.

[6] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 353–355. URL: https://www.aclweb.org/anthology/W18-5446. doi:10.18653/v1/W18-5446.

[7] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics (2019). URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.

[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.

[9] A. Nentidis, K. Bougiatiotis, A. Krithara, G. Paliouras, Results of the seventh edition of the BioASQ challenge, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019. URL: https://arxiv.org/pdf/2006.09174.pdf.

[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.

[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, 2020. arXiv:1906.08237.

[12] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020. arXiv:1909.08053.

[13] P. Lewis, M. Ott, J. Du, V. Stoyanov, Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art, in: Proceedings of the 3rd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Online, 2020, pp. 146–157. URL: https://www.aclweb.org/anthology/2020.clinicalnlp-1.17. doi:10.18653/v1/2020.clinicalnlp-1.17.

[14] H.-C. Shin, Y. Zhang, E. Bakhturina, R. Puri, M. Patwary, M. Shoeybi, R. Mani, BioMegatron: Larger biomedical domain language model, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4700–4706. URL: https://www.aclweb.org/anthology/2020.emnlp-main.379. doi:10.18653/v1/2020.emnlp-main.379.
[15] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, 2021. arXiv:2007.15779.

[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.

[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

[18] P. Rajpurkar, R. Jia, P. Liang, Know what you don't know: Unanswerable questions for SQuAD, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 784–789. URL: https://www.aclweb.org/anthology/P18-2124. doi:10.18653/v1/P18-2124.

[19] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. Paliouras, Overview of BioASQ 2021: The ninth BioASQ challenge on large-scale biomedical semantic indexing and question answering, 2021. arXiv:2106.14885.

[20] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.

[21] H. Zhang, H. Zhao, C. Liu, D. Yu, Task-to-task transfer learning with parameter-efficient adapter, in: X. Zhu, M. Zhang, Y. Hong, R. He (Eds.), Natural Language Processing and Chinese Computing, Springer International Publishing, Cham, 2020, pp. 391–402.

[22] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, O. Levy, SpanBERT: Improving pre-training by representing and predicting spans, arXiv preprint arXiv:1907.10529 (2019).

[23] I. B. Ozyurt, On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining, in: Proceedings of the First Workshop on Scholarly Document Processing, Association for Computational Linguistics, Online, 2020, pp. 104–112. URL: https://aclanthology.org/2020.sdp-1.12. doi:10.18653/v1/2020.sdp-1.12.