MDS_UNCC Question Answering System for Biomedical Data with Preliminary Error Analysis

Seethalakshmi Gopalakrishnan1, Swathi Padithala2, Hilmi Demirhan1 and Wlodek Zadrozny1,2
1 College of Computing and Informatics, University of North Carolina at Charlotte, United States
2 School of Data Science, University of North Carolina at Charlotte, United States
sgopala4@uncc.edu (S. Gopalakrishnan); spaditha@uncc.edu (S. Padithala); hdemirha@uncc.edu (H. Demirhan); wzadrozn@uncc.edu (W. Zadrozny)
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract
In this paper, we describe our submission to the ninth edition of the BioASQ competition. The BioASQ challenge aims to promote methodologies and systems for large-scale biomedical semantic indexing and question answering. Benchmark datasets for the shared tasks are released yearly through the BioASQ website, and they reflect the information needs of biomedical experts. In this work, we address the ninth task using the BioASQ BioBERT model, which is based on the Bidirectional Encoder Representations from Transformers (BERT) model. We fine-tuned the BioASQ BioBERT model and submitted results obtained both with and without this additional training; the results show that fine-tuning the model with additional training gives better performance. For the Batch 5 factoid submission, we obtained an MRR of 0.52, which is higher than that of the original version of BioASQ BioBERT. For Yes/No questions we obtained an F1 score of 0.81, and for List questions the F1 was 0.26. We also present preliminary results of our error analysis, where we hypothesize about the causes of some errors and run simple experiments to confirm or disprove them. For example, we see that the presence of natural language modalities, which are quite common in questions, answers and snippets, influences the accuracy.

Keywords: BioASQ, Question Answering, Error Analysis, Factoid, Yes/No, List, Modal Verbs

1. Introduction

Question Answering (QA) focuses the Information Retrieval (IR) system on the actual information need of the user. Performance on a QA task reflects the domain knowledge and the natural language understanding ability of a system.

BioASQ, the Biomedical Semantic Question Answering Challenge, is one of the well-known competitions that asks practitioners to develop systems for unstructured biomedical text and decision support from PubMed articles. The systems are supposed to automatically respond to biomedical questions with relevant concepts, articles, snippets, exact answers and summaries. The BioASQ challenge is a competition of document classification, information retrieval and question answering. The participants are expected to submit ideal and user-understandable answers to four types of natural language questions by combining information from biomedical articles and ontologies. The four types of questions in the benchmark datasets are Yes/No questions, Factoid questions, List questions, and Summary questions. An example of a Yes/No question is "Do CpG islands colocalize with transcription start sites?". An example of a Factoid question is "Which virus is best known as the cause of infectious mononucleosis?". An example of a List question is "Which are the Raf kinase inhibitors?".
An example of a Summary question is "What is the treatment of infectious mononucleosis?".

There are two phases in the challenge: Phase A and Phase B. During Phase A, participating systems have to reply with related concepts (from designated terminologies and ontologies), related articles in English, related snippets, and RDF triples. During Phase B, participating systems need to respond with exact and ideal answers in English.

This project builds on the 2019 BioASQ experiments reported in [1]. In this project, we are using Task 9b data (the shared task of the 9th year of the BioASQ competition). In Task 9a, participants were asked to classify new abstracts written in English as they became available online. The classes came from the MeSH hierarchy, i.e., the subject headings that are currently used to manually index the abstracts. As new manual annotations became available, they were used to evaluate the classification performance of participating systems (which classified the articles before they were manually annotated), using standard information retrieval (IR) measures (e.g., precision, recall, accuracy), as well as hierarchical variants of these measures. The model we are using is BioBERT, which is based on the Bidirectional Encoder Representations from Transformers (BERT) model [2] pre-trained on Wikipedia articles; the BioBERT model was pre-trained on biomedical text using PubMed and PMC articles [3]. As participants, we had to annotate input natural language questions with biomedical concepts and retrieve relevant documents, snippets and triples (Phase A). Eventually, we needed to find and report the answers to the questions (Phase B), given as additional input the golden responses of Phase A.

Our contributions in this paper are as follows:
— We describe our system, which placed 4th in Task 9b Phase B. In data preparation, we created a new way of converting the BioASQ format to the BioBERT format, and we make our code available at https://github.com/seetagopal/BioASQ-2021.git
— We performed an error analysis (on training data) of the Yes/No questions and hypothesized about the causes of some errors. These include, for example, the presence of modal verbs/auxiliaries such as 'may'.
— We performed a statistical analysis of Wh-questions and of the modal verbs/auxiliaries.

2. BioASQ Related Work and the Competition Data

2.1. BERT and BioBERT

BERT, which stands for "Bidirectional Encoder Representations from Transformers", is a contextual word embedding model developed in 2018 by Google [2]. It is a contextualized word representation model that is pre-trained using bidirectional transformers [4]. The model takes a sentence as input and outputs contextual embeddings of its words. Currently, the Google Search Engine uses the BERT model for over 70 languages [5]. In addition to search, BERT can be used for further tasks such as question answering and language inference: its pre-trained deep bidirectional representations, learned from unlabeled text, can be adapted with an additional output layer for these different tasks. Bidirectional representations are crucial in biomedical text mining to represent relationships in a biomedical corpus [6]. In this work, we use BERT for question answering: the question and a paragraph (the context) are given as input to the model. Specifically, we used BioBERT models.
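As a purely illustrative sketch of this question/context interface (not the exact BioASQ-BioBERT code we ran), extractive QA with a BERT-family model can be exercised through the Hugging Face transformers pipeline. With no model argument the pipeline loads its default SQuAD-trained model; in our setting, a BioBERT checkpoint fine-tuned on SQuAD-style data would take its place.

```python
# Minimal extractive QA sketch with the Hugging Face transformers pipeline.
# Without a `model=` argument the pipeline loads its default SQuAD model;
# to mirror our setting one would instead pass a BioBERT checkpoint that
# has been fine-tuned on SQuAD-style data (checkpoint name not shown here).
from transformers import pipeline

qa = pipeline("question-answering")

question = "Which virus is best known as the cause of infectious mononucleosis?"
context = ("Epstein-Barr virus (EBV) is best known as the cause of infectious "
           "mononucleosis, and it is also associated with several malignancies.")

result = qa(question=question, context=context)
# `result` holds the highest-scoring answer span, its character offsets
# in the context, and a confidence score.
print(result["answer"], result["score"])
```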
BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model pre-trained on large-scale biomedical corpora, developed by Lee et al. [7]. BioBERT and BERT have very similar architectures. Since BioBERT is pre-trained on biomedical corpora, it achieves better performance than BERT on biomedical text mining tasks. Lee et al. [7] fine-tuned BioBERT for question answering using the same BERT architecture used for SQuAD [8]. They used the BioASQ factoid datasets for fine-tuning, as the factoid datasets are similar to SQuAD. Some of the BioASQ factoid questions were unanswerable because exact answers were not present in the given texts; the unanswerable questions were removed from the training sets. They used the same pre-training process as Wiese et al. [9], who also used SQuAD. They used strict accuracy, lenient accuracy and mean reciprocal rank as evaluation metrics. Yoon et al. [10] also used BioBERT for answering biomedical questions, including factoid, list, and yes/no questions. Jin et al. [11] reviewed biomedical question answering approaches by classifying them into six major methodologies: open-domain, knowledge base, information retrieval, machine reading comprehension, question entailment and visual QA.

2.2. Prior BioASQ work

This year, the ninth edition of the BioASQ Challenge is being held. The previous year's 8th BioASQ challenge can be summarized as follows [12]. There were three tasks last year: Task 8a, Task 8b, and Task MESINESP. Task 8a was a large-scale biomedical semantic indexing task. Task 8b was a biomedical question answering task. Task MESINESP was a new task on medical semantic indexing in Spanish. Some of the Task 8b submissions were as follows. Kazaryan et al. [13] participated as the ITMO team; they used BERT fine-tuned on SQuAD [8], and also a model based on BioMed-RoBERTa [14] to improve the produced answers. Ozyurt et al. [15] used Electra [16] and BioBERT [7] on the SQuAD and BioASQ datasets combined. Pappas et al. [17] experimented with a SciBERT-based model for exact answer extraction [18], modelled for cloze-style biomedical machine reading comprehension (MRC) [19].

2.3. BioASQ Data

For the 9th BioASQ tasks, the training dataset for Task 9a is about semantic indexing. The Task 9a data contains the MeSH terms that MEDLINE curators annotated in the biomedical articles from PubMed. Versions from different years differ in the MeSH terms used in the PubMed articles, as well as in size. Task 9b is about Question Answering; as part of the task, the data provides concepts, articles, snippets, RDF triples, "exact" answers and "ideal" answers in JSON format.

Task 9a, named "Large-scale online biomedical semantic indexing", has 15,559,157 articles. Each article is annotated with an average of 12.68 MeSH terms, and 29,369 MeSH terms are covered in total. The dataset is 7.9 GB zipped and 25.6 GB unzipped.

Task 9b, named "Introductory biomedical semantic QA," uses benchmark datasets containing development and test questions, in English, along with gold standard (reference) answers. The benchmark datasets are constructed by a team of biomedical experts from around Europe. The number of questions in each category in the 9b training data is: Yes/No: 1033, Factoid: 1092, List: 719, Summary: 899. The number of questions in each category in the Task 9b Batch 5 test data is: Yes/No: 19, Factoid: 36, List: 18, Summary: 27.
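To make the data format concrete, the question counts above can be recomputed directly from the Task 9b training file. The sketch below assumes the standard BioASQ JSON layout (a top-level "questions" list whose entries carry a "type" field); the file name is a placeholder for the released training file.

```python
# Count the question types in the BioASQ Task 9b training file.
# Assumes the usual BioASQ layout: {"questions": [{"type": ..., "body": ...}, ...]}.
# The file name is a placeholder for the released training9b JSON.
import json
from collections import Counter

with open("BioASQ-training9b.json", encoding="utf-8") as f:
    data = json.load(f)

type_counts = Counter(q["type"] for q in data["questions"])
print(type_counts)  # e.g. Counter({'factoid': 1092, 'yesno': 1033, ...})
```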
Figure 1: Overview of the model.

The first step is to prepare the data (Section 3) in the BioBERT format, which is then used by the BioASQ-BioBERT model for training and prediction. The last step is the error analysis, which is explained in Section 5. The major finding of our error analysis of the Yes/No questions was that understanding coreference resolution, modality, synonyms and antonyms may help in improving the model accuracy.

3. Data Preparation and Experiments

3.1. Data Preparation

The main purpose of our data preparation is to convert the BioASQ data into the format accepted by BioBERT. For the factoid questions, we create a dictionary to which the question id, question, answer and context are added. BioBERT expects the start index of the answers during training, so one of the main tasks of the data preparation is to find the start of the answers. To find the start index, we perform a set operation on the ideal and exact answers to obtain the unique answers for the question. Once we have the answers, the next step is to find the index of those answers in the given snippet. If the answer is present in the snippet, its position is recorded as the start index (we do this by setting flags). Apart from creating a dictionary holding all the information given above, we also append three more numbers to the given id. This is necessary because the given id has length 24, but the evaluation script in BioBERT that converts the BioBERT predictions into the BioASQ format expects an id of length 28. The same procedure is followed for the List questions. For Yes/No questions, the same procedure is followed, except that no answer start index is needed.

3.2. Experiments: Fine-tuning BioBERT

We use the pretrained weights provided by the BioASQ BioBERT model [20, 3]. The first step for training this model is to convert the given data into the BioBERT format, which includes the answer start index; how we do this is explained in the data preparation (Section 3.1). The BioASQ BioBERT model releases pretrained weights for the factoid, list and Yes/No questions separately. These weights are pretrained on the SQuAD dataset on top of the BioBERT model. We use those weights to train the BioBERT model on the BioASQ 9b training data. Once the training is done, we use the weights obtained from training for prediction. We follow this procedure for factoid, list and Yes/No question answering. Once we obtained the predictions, we ran the script provided with the BioASQ BioBERT model to convert the predictions into the BioASQ format. The overview of the model is given in Figure 1.

4. Results

We submitted our results to the 9th BioASQ competition by fine-tuning the BioASQ BioBERT model on the BioASQ Task 9b Phase B dataset. We submitted our predictions in Batch 5 of the competition and are placed 4th on the leaderboard. Our system name is MDS_UNCC. The results of our Test Batch 5 submission are given in Table 1.

                 Yes/No                                        Factoid
System        Macro F1   F1 Yes   F1 No    Accuracy     Strict Acc.   Lenient Acc.   MRR
KU-DMIS-2     0.8246     0.88     0.7692   0.8421       0.3889        0.6111         0.4722
KU-DMIS-3     0.8246     0.88     0.7692   0.8421       0.4167        0.5556         0.4745
KU-DMIS-5     0.8246     0.83     0.7692   0.8421       0.3889        0.5833         0.4583
MDS_UNCC      0.7841     0.8182   0.75     0.7895       0.4167        0.6944         0.5204

Table 1
BioASQ competition results for the Batch 5 submission. KU-DMIS is placed first on the leaderboard. MDS_UNCC is our submission, which is placed 4th in the competition.
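For reference, the factoid metrics in Table 1 follow the standard BioASQ definitions: each system returns a ranked list of candidate answers per question, strict accuracy counts questions whose top candidate matches a gold answer, lenient accuracy counts questions with a match anywhere in the list, and MRR averages the reciprocal rank of the first correct candidate (0 when there is none). A minimal sketch of this computation, with hypothetical inputs, is given below; the official evaluation additionally normalizes answer strings.

```python
# Strict accuracy, lenient accuracy and MRR for factoid predictions.
# `predictions` maps a question id to a ranked list of candidate answers;
# `gold` maps a question id to the set of acceptable gold answers.
# Inputs here are hypothetical; the official script also normalizes strings.
def factoid_metrics(predictions, gold):
    strict = lenient = rr_sum = 0.0
    for qid, candidates in predictions.items():
        answers = {a.lower() for a in gold[qid]}
        ranked = [c.lower() for c in candidates]
        if ranked and ranked[0] in answers:
            strict += 1
        if any(c in answers for c in ranked):
            lenient += 1
            rank = next(i for i, c in enumerate(ranked, start=1) if c in answers)
            rr_sum += 1.0 / rank
    n = len(predictions)
    return strict / n, lenient / n, rr_sum / n

# Example with two toy questions:
preds = {"q1": ["EBV", "CMV"], "q2": ["aspirin"]}
gold = {"q1": {"ebv", "epstein-barr virus"}, "q2": {"ibuprofen"}}
print(factoid_metrics(preds, gold))  # (0.5, 0.5, 0.5)
```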
For the List questions, we got a mean precision of 0.3259, a recall of 0.2963 and an F-measure of 0.2678 (which is about 12% below the best system).

5. Error Analysis

In this section we discuss some aspects of our error analysis. We start with Batch 3, but we mostly focus on Batch 5. We perform both error analysis and data analysis in this section. In particular, we discuss the presence of modalities in the data, the types of questions, and the accuracy on these types.

The Yes/No test data prediction file from Batch 3 shows 87% Yes accuracy and 54% No accuracy; that is, most of the Yes answers were identified correctly, but almost half of the No answers were identified wrongly. Upon further analysis, we see that the probability of No answers was underestimated. We hypothesize that corpus expansion [21, 22] would help in improving the accuracy, but we have not yet tested this hypothesis. For Factoid questions, we observed a pattern of lower probabilities for the best and longest match, while the probability for initial sentences was higher. For List questions, the probability was again smaller for long text and higher for initial sentences; moreover, the best matching words were often left out. Thus, we need to find a solution to include the best matching words and to match the longest sentence.

                      Questions   Answers   Snippets
Can                       58        319       140
Could                      0         55        45
May                        6        201       107
Might                      0         22        21
Should                     5         30        16
Will                       0         24        33
Would                      4         12         9
Must                       0         11         4
Possible                   8         19        23
Percent of modals          7%        59%       33%

Table 2
Number of modals present in the BioASQ training9b dataset. Each row corresponds to a modal verb, and the columns give its count in the questions, answers and snippets of the dataset. The last row indicates the percentage of modals present in the questions, answers and snippets. We hypothesize that a better understanding of the role of modals can improve the accuracy of question answering.

In our BioASQ Batch 5 submission, we got an accuracy of 78% for Yes/No questions. In order to find out where the machine fails on Yes/No questions, we present a more detailed error analysis of the Yes/No questions in this section. Since the golden data for Task 9b has not yet been published, we divided the training data into train and test sets using the scikit-learn library, with 80% train data and 20% test data. We had 835 Yes/No questions in the train data and 207 questions in the test data. We ran BioASQ-BioBERT on this train data and made predictions. Out of 207 Yes/No questions, 184 were answered correctly and 23 were answered incorrectly. For further analysis, we checked these 23 questions manually along with the snippets. Our analysis shows that in many of the questions, the machine cannot understand the synonyms and antonyms. Also, for a few complex questions, coreference resolution is needed. For example:

Example 1:
Question: Does dronedarone affect T3 and T4 levels?
Actual answer: No
Predicted answer: Yes
Snippet: Amiodarone resulted in increased T4, T4/T3 and rT3, whereas dronedarone did not alter the thyroid hormone profile in normal animals.

In the above example, the expected answer is No but the predicted answer was Yes. In the snippet, the first part talks about Amiodarone, whereas the second part gives the answer to the question and says that dronedarone did not alter the thyroid hormone profile.
This question can be answered correctly only if the machine can understand that T3 and T4 refer to thyroid hormones. To check whether our hypothesis is correct, we changed the above question into "Does dronedarone alter the thyroid hormone level?" and ran BioASQ-BioBERT. With the changed question, the model predicted the answer correctly. We can infer that proper handling of coreference resolution is needed.

                                                  Training   Batch5 Test
Starts with "Wh"                                    54%         64%
Starts with "What is"                               23%         29%
  in the form "What is X?"                           5%          6%
  in the form "What is .. of/using .. ?"            12%         16%
Starts with "What ..." but not "What is/are"         4%          4%
Which                                               22%         25%
What are                                             3%          3%
Where                                                1%          1%
When                                                 0.3%        1%
Complement of "Wh"                                  46%         36%

Table 3
Percentage allocation of the most frequent question types in the Task 9b training data and the Batch 5 test dataset. Here the complement of "Wh" indicates the questions which do not start with "Wh." (The percentages are rounded off, and some types of "What is ..." questions are not counted here.)

Another example, which shows why understanding modality is important, is given below:

Example 2:
Question: Is cardiac magnetic resonance imaging indicated in the pre-participation screening of athletes?
Actual answer: No
Predicted answer: Yes
Snippet: As modern imaging further enhances our understanding of the spectrum of athlete's heart, its role may expand from the assessment of athletes with suspected disease to being part of comprehensive pre-participation screening in apparently healthy athletes.

In the above example, the expected answer was No and it was predicted as Yes. Looking into the snippet, we can understand that modern imaging may be used for pre-participation screening, but this is not necessary. If the machine can understand the modal verb "may", then this question can be answered correctly.

                                          Yes/No   List    Factoid
Starts with "Wh"                           N/A     37%     30%
Starts with "Which"                        N/A     44%     43%
Starts with "List"                         N/A     53%     N/A
Starts with "What is"                      N/A      0%     32%
Starts with "What are"                     N/A     19%      0%
Starts with "Where"                        N/A      0%     33%
Other type of "What" questions             N/A     12.5%   16%
Complement of "Wh" and "List"              N/A     42%     N/A
Starts with "What is X?"                   N/A     N/A      0%
Starts with "What is .. of/using ..?"      N/A     N/A     13%
Complement of "Wh" questions               89%     N/A     24%
Overall accuracy                           89%     43%     31%

Table 4
Accuracy of the different types of questions and the overall accuracy. There are no "Wh" questions in the Yes/No type, and there are no "What is X?" or "What is .. of/using ..?" questions in the List and Yes/No types. In most of the categories with 0% accuracy the number of questions is very small (between 1 and 10 questions).

To further analyze the role of modals in the BioASQ data, we collected the counts of modals in the training data, which are given in Table 2. These counts show that a considerable number of modal verbs is present in the training data. If the machine can understand the modals during training, this may help in improving the accuracy. Also, from Table 2 we can see that the percentage of modals in the answers is higher than in the questions. A better understanding of the relationship between the modals in the questions and answers may help improve the accuracy further. In order to investigate our claim about the role of the modalities, we changed the question in Example 2 into "May cardiac magnetic resonance imaging be indicated in the pre-participation screening of athletes?". After changing the question, the model predicted the answer correctly, which suggests the importance of understanding the modality.
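The modal counts reported in Table 2 can be approximated with a simple word match over the training file. The sketch below is not our exact counting script; it assumes the standard BioASQ JSON fields ("body" for the question, "snippets" entries with a "text" field, and "ideal_answer" as the answer text) and uses a plain case-insensitive token match.

```python
# Rough recount of the modal-verb occurrences reported in Table 2.
# Assumes the usual BioASQ fields: "body" (question), "snippets" (each with
# a "text" field) and "ideal_answer" (string or list of strings).
# A plain case-insensitive word match is only an approximation of our count.
import json
import re
from collections import Counter

MODALS = ["can", "could", "may", "might", "should",
          "will", "would", "must", "possible"]

def count_modals(texts):
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for modal in MODALS:
            counts[modal] += tokens.count(modal)
    return counts

with open("BioASQ-training9b.json", encoding="utf-8") as f:
    questions = json.load(f)["questions"]

question_texts = [q["body"] for q in questions]
snippet_texts = [s["text"] for q in questions for s in q.get("snippets", [])]
answer_texts = []
for q in questions:
    ideal = q.get("ideal_answer", [])
    answer_texts.extend([ideal] if isinstance(ideal, str) else list(ideal))

for name, texts in [("questions", question_texts),
                    ("answers", answer_texts),
                    ("snippets", snippet_texts)]:
    print(name, count_modals(texts))
```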
Obviously, we need to do more work on a larger set of examples to prove or disprove this hypothesis about the role of modality.

Example 3:
Question: Does the BRAFV600E mutation have an effect on clinical response to radioiodine therapy?
Actual answer: Yes
Predicted answer: No
Snippet: Preclinical studies showed that BRAF mutation significantly reduced radioiodine uptake and decreased the sensitivity to radioactive iodine (RAI) therapy.

In the above example, the question asks about the effect of BRAFV600E. The snippet shows that the BRAF mutation significantly reduced radioiodine uptake, which is an effect of BRAF. This question can be answered correctly if the machine can understand that 'effect' and 'reduced' are used synonymously here. To investigate this claim, we changed the question to "Does the BRAFV600E mutation have reduced on clinical response to radioiodine therapy?". However, the system still predicts the answer wrongly. We can infer from this result that a better understanding of how the machine interprets these synonyms and antonyms is necessary.

Another aspect of the analysis is shown in Table 3: the distribution of question types is different in the training and test data. The accuracy results for the three types of questions (Yes/No, List, and Factoid) are computed on the test data we obtained by splitting the original training data, as described earlier. We calculate the accuracy by finding exact matches between the correct answers in the training data and the predictions we got for the List and Factoid questions. Out of the obtained list of predictions, even if one of the predictions is correct, we consider that question to be answered correctly. The accuracies we obtained for the different types of questions are given in Table 4.

6. Summary and Conclusions

In this article, we described our contribution to the 9th BioASQ competition. We showed that retraining the BioASQ-BioBERT model used in our experiments leads to better results. However, to improve our results further, we need a better understanding of the models, as well as of the data, that is, of the questions, answers, and snippets. Regarding the models, in accordance with common knowledge, we believe that both larger datasets for training and corpus expansion for background knowledge [21, 22] should lead to improved results. Based on a few examples, we also predict that deeper language understanding, e.g. coreference and synonym resolution, could have an impact on accuracy. As for the data, we showed that modalities are potentially important and are present surprisingly frequently in both questions and answers. Again, this suggests that we should experiment with deeper NLP methods.

Acknowledgments

The authors acknowledge the help of David Ruddell in data preparation.

References

[1] S. K. Telukuntla, A. Kapri, W. Zadrozny, UNCC biomedical semantic question answering systems. BioASQ: Task-7B, Phase-B, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 695–710.
[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2018.
[3] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics (2019). doi:10.1093/bioinformatics/btz682.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv preprint arXiv:1706.03762 (2017).
[5] Wikipedia contributors, BERT (language model), 2021. URL: https://en.wikipedia.org/wiki/BERT_(language_model), [Online; accessed 11-May-2021].
[6] M. Krallinger, O. Rabal, S. A. Akhondi, M. P. Pérez, J. Santamaría, G. P. Rodríguez, et al., Overview of the BioCreative VI chemical-protein interaction track, in: Proceedings of the sixth BioCreative challenge evaluation workshop, volume 1, 2017, pp. 141–146.
[7] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
[8] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, arXiv preprint arXiv:1606.05250 (2016).
[9] G. Wiese, D. Weissenborn, M. Neves, Neural domain adaptation for biomedical question answering, arXiv preprint arXiv:1706.03610 (2017).
[10] W. Yoon, J. Lee, D. Kim, M. Jeong, J. Kang, Pre-trained language model for biomedical question answering, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 727–740.
[11] Q. Jin, Z. Yuan, G. Xiong, Q. Yu, C. Tan, M. Chen, S. Huang, X. Liu, S. Yu, Biomedical question answering: A comprehensive review, arXiv preprint arXiv:2102.05281 (2021).
[12] A. Nentidis, A. Krithara, K. Bougiatiotis, G. Paliouras, Overview of BioASQ 8a and 8b: Results of the eighth edition of the BioASQ tasks a and b (2020).
[13] A. Kazaryan, U. Sazanovich, V. Belyaev, Transformer-based open domain biomedical question answering at BioASQ8 challenge (????).
[14] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020).
[15] I. B. Ozyurt, A. Bandrowski, J. S. Grethe, Bio-AnswerFinder: a system to find answers to questions from biomedical texts, Database 2020 (2020).
[16] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[17] D. Pappas, P. Stavropoulos, I. Androutsopoulos, AUEB-NLP at BioASQ 8: Biomedical document and snippet retrieval, CLEF (2020).
[18] D. Chen, A. Fisch, J. Weston, A. Bordes, Reading Wikipedia to answer open-domain questions, arXiv preprint arXiv:1704.00051 (2017).
[19] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
[20] W. Yoon, J. Lee, D. Kim, M. Jeong, J. Kang, Pre-trained language model for biomedical question answering, in: P. Cellier, K. Driessens (Eds.), Machine Learning and Knowledge Discovery in Databases, Springer International Publishing, Cham, 2020, pp. 727–740.
[21] N. Schlaefer, J. Chu-Carroll, E. Nyberg, J. Fan, W. Zadrozny, D. Ferrucci, Statistical source expansion for question answering, in: Proceedings of the 20th ACM international conference on Information and knowledge management, 2011, pp. 345–354.
[22] J. Chu-Carroll, J. J. Fan, N. M. Schlaefer, W. W. Zadrozny, Source expansion for information retrieval and information extraction, 2014. US Patent 8,892,550.