=Paper=
{{Paper
|id=Vol-2936/paper-21
|storemode=property
|title=End-to-end Biomedical Question Answering via Bio-AnswerFinder and Discriminative Language Representation Models
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-21.pdf
|volume=Vol-2936
|authors=Ibrahim Burak Ozyurt
|dblpUrl=https://dblp.org/rec/conf/clef/Ozyurt21
}}
==End-to-end Biomedical Question Answering via Bio-AnswerFinder and Discriminative Language Representation Models==
Ibrahim Burak Ozyurt
FDI Lab, Department of Neurosciences, University of California at San Diego, La Jolla, CA, USA

Abstract
Generative Transformer-based language representation models such as BERT and its biomedical domain-adapted version BioBERT have been shown to be highly effective for biomedical question answering. Here, discriminative, sample-efficient biomedical language representation models based on the ELECTRA language representation model architecture were introduced to enhance an end-to-end biomedical question answering system, Bio-AnswerFinder, for the BioASQ challenge. The introduced language representation models outperformed other language models, including BioBERT, in answer span classification, answer candidate re-ranking and yes/no answer classification tasks. The resulting end-to-end system participated in BioASQ Synergy and both phases of Task 9B with promising results.

Keywords: question answering, language representation models, biomedical information retrieval

1. Introduction

Transformer-based language representation models such as BERT [1], XLNet [2] and ALBERT [3] are becoming increasingly popular for many downstream NLP tasks due to their consistent performance advantages over previous methods. Domain adaptation of the general-purpose BERT model to the biomedical domain has shown significant performance improvements for downstream biomedical NLP tasks [4]. BERT, XLNet and ALBERT use a masked language modeling (MLM) approach, masking 15% of the tokens in the training sentences and learning to predict the masked tokens in a generative manner, so the model learns from only 15% of the tokens per example. Recently, a new pretraining approach named ELECTRA [5] was introduced for the BERT Transformer-based encoder architecture, in which a discriminative model is trained to detect whether each token in a corrupted input was replaced by a sample from a co-trained generator model. ELECTRA has been shown to be computationally more efficient than BERT and to outperform it given the same model size, data and computation resources [5]. The improvements of ELECTRA over BERT are most pronounced at small model sizes, and that effectiveness translates to the biomedical domain for domain-adapted small ELECTRA models [6].

The development and evaluation of a question answering system is impossible without an expert-generated training/evaluation question answer data set. BioASQ, an EU-funded biomedical semantic indexing and question answering challenge [7, 8], yearly provides cumulative sets of biomedical questions with gold standard answers and an evaluation platform for the advancement of biomedical question answering. In this paper, enhancements to a sentence-level end-to-end biomedical question answering system, Bio-AnswerFinder [9], to provide answers to all four types of BioASQ challenge questions (factoid, list, yes/no and summary) are introduced. To achieve this, three new biomedical domain-adapted pretrained ELECTRA models are introduced.
The introduced Bio-ELECTRA models are compared against many language representation models on question keyword selection, question answer span classification, answer candidate re-ranking and yes/no answer classification tasks, showing superior performance. An abstractive summarization module based on the Transformer-based text-to-text generation model T5 [10] is also introduced. The resulting system can answer any biomedical domain natural language question and was used in the 9th BioASQ Challenge for the Synergy and 9B tasks.

The rest of the paper is organized as follows. After a brief overview of Bio-AnswerFinder and the proposed enhancements, details of the pretraining of the ELECTRA based biomedical language models are provided. This is followed by the experiments on answer span classification for the BioASQ factoid/list questions, answer candidate re-ranking, search engine keyword selection and yes/no question answer determination. After the introduction of the extractive and abstractive summarizer systems, details of the BioASQ Synergy and BioASQ 9B systems are provided. Finally, the results of the challenge are discussed together with an error analysis on the BioASQ 8B ground truth data for factoid questions.

2. Overview of Bio-AnswerFinder

Bio-AnswerFinder [9] is a biomedical question answering system that takes a natural language question and returns a list of sentences from biomedical texts ranked in the order of confidence that they answer the question. An overview of the system is shown in Figure 1. The original system was retrofitted with new modules to provide answers for the four question types of BioASQ Task B. The retrofitted modules, together with the enhanced existing modules, are shown in blue in Figure 1. The modules of Bio-AnswerFinder can be grouped logically into question processing, document processing and answer processing phases.

In the question processing phase, the natural language question is parsed, followed by the detection of the focus of the question. Afterwards, search keywords are selected from the words of the question using a supervised long short-term memory (LSTM) [11] based keyword classifier. For BioASQ 9B, this module is replaced by a Bio-ELECTRA++ [6] based keyword tagger.

In the document processing phase, query-relevant documents are retrieved from a traditional keyword based information retrieval system (Elasticsearch). Bio-AnswerFinder uses an iterative, most-specific to most-generic keyword search guided by the keywords selected by the keyword classifier to retrieve a relevant set of documents from an Elasticsearch index. The order in which keywords are dropped from iteration to iteration is learned from a set of annotated BioASQ 5B questions using a ranking classifier based on RankNet [12] with an LSTM using attention [13].

In the answer processing phase, the question type (focus, definition question or other) is detected. Definition questions are handled by definition-pattern based filtering of the sentences from the retrieved documents. For questions whose focus has a detected entity type, the entity type is used for filtering out candidate sentences that do not contain entities of that type. For both focus and other non-definition questions, the answer candidate sentences are ranked by a weighted version of the relaxed word mover's distance [14]. Afterwards, up to the first 100 of these sentences are further re-ranked by a fine-tuned BERT [1] classifier. For BioASQ 9, the BERT re-ranker is replaced by a better performing Bio-ELECTRA re-ranker.
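To make the first-stage ranking more concrete, the following is a minimal sketch of an IDF-weighted relaxed word mover's distance over word embeddings. It is an illustration under stated assumptions (unit-normalized GloVe vectors held in a plain dictionary, out-of-vocabulary tokens skipped), not the Bio-AnswerFinder implementation itself.

```python
import numpy as np

def weighted_rwmd(question_tokens, sentence_tokens, embeddings, idf, default_idf=1.0):
    """Sketch of IDF-weighted relaxed word mover's distance (lower = more similar).

    embeddings: dict mapping a token to its unit-normalized GloVe vector.
    idf: dict mapping a token to its inverse document frequency weight.
    """
    q_vecs = [(t, embeddings[t]) for t in question_tokens if t in embeddings]
    s_vecs = np.array([embeddings[t] for t in sentence_tokens if t in embeddings])
    if len(q_vecs) == 0 or len(s_vecs) == 0:
        return float("inf")
    total = 0.0
    for token, vec in q_vecs:
        # relaxed WMD: each question word "travels" only to its nearest sentence word
        cos_dist = 1.0 - s_vecs.dot(vec)  # cosine distance for unit-normalized vectors
        total += idf.get(token, default_idf) * cos_dist.min()
    return total

# Candidate sentences would then be sorted in ascending order of this distance.
```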
For factoid and list questions, a Bio-ELECTRA based question answer span classifier is used. For yes/no questions, two different Bio-ELECTRA based classification approaches are introduced. For summary questions, both extractive and abstractive summarization approaches are introduced. These approaches are explained in more detail in the following sections.

Figure 1: Overview of the Bio-AnswerFinder system

3. ELECTRA Based Biomedical Language Representation Models

Both PubMed abstracts and PubMed Central (PMC) open access full-length papers were used as the pretraining corpus. The main pretraining corpus was built using 21.2 million PubMed abstracts from the January 2021 baseline distribution. From the abstracts, title and abstract text sentences were extracted, resulting in a corpus of 3.6 billion words. A second, 12.3 billion word corpus was built using the sentences extracted from the sections of PMC open access papers, excluding the references sections, which, unlike the other paper sections, do not have a regular sentence format. A domain-specific word piece vocabulary was generated from PubMed abstract texts using a SentencePiece byte-pair-encoding (BPE) model [15]. The Bio-ELECTRA Mid and Base models were pretrained for one million steps on the PubMed abstracts corpus, followed by 200,000 steps of training on the PMC open access papers corpus. The Bio-ELECTRA Mid Combined model was pretrained on the combination of the abstract and full-text paper corpora for 1.2 million steps.

Since the improvements of ELECTRA over BERT and other Transformer-based language models are most pronounced at small model sizes [5], a mid-sized model with a hidden layer size halfway between the small and base ELECTRA architectures is introduced to investigate its competitiveness against the base model, which has more than twice as many trainable parameters. The Bio-ELECTRA model architectures are summarized in Table 1. All the mid and base sized Bio-ELECTRA models were pretrained on a single 8-core version 3 tensor processing unit (TPU) with 128 GB of RAM. The small Bio-ELECTRA++ model [6] was pretrained on a consumer grade 8 GB Nvidia RTX 2070 GPU. The hardware and pretraining times for all Bio-ELECTRA models are summarized in Table 2.

Table 1: ELECTRA Model Architectures for Biomedical Domain
Model | Params | Architecture
Bio-ELECTRA++ [6] | 11M | hidden:256, layers:12, batch:64, attention heads:4
Bio-ELECTRA Mid | 50M | hidden:512, layers:12, batch:256, attention heads:8
Bio-ELECTRA Base | 110M | hidden:768, layers:12, batch:256, attention heads:12
Bio-ELECTRA Mid Combined | 50M | hidden:512, layers:12, batch:256, attention heads:8

Table 2: Pretraining of ELECTRA Models for Biomedical Domain
Model | Params | Steps | Train Time/Hardware
Bio-ELECTRA++ | 11M | 3.6M | 48 days on RTX 2070 8GB GPU
Mid | 50M | 1.2M | 6.5 days on 8 TPUv3s
Base | 110M | 1.2M | 12.5 days on 8 TPUv3s
Mid Combined | 50M | 1.2M | 6.5 days on 8 TPUv3s

4. Experiments with Factoid/List Question Answer Span Detection

Since most factoid and list questions can be answered by a word or phrase (multiple words/phrases for list questions), the answers can be detected by learning to estimate scores for spans in the sequence of tokens of an answer candidate passage. To this end, the training/testing sets were generated from the factoid and list questions of the BioASQ 8b training data. From the roughly 30% of the list and factoid questions that could not be aligned to their exact answers, 152 more questions were recovered via manual inspection for synonyms and transliterations. The labeled data set was split into 85%/15% training/testing data sets of size 9557 and 1809, respectively.
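Concretely, each answer candidate passage is paired with the question in SQuAD style and the model predicts start and end token scores. The sketch below illustrates inference with such a fine-tuned span classifier using the Hugging Face transformers API as a stand-in; the checkpoint name and the example texts are assumptions for illustration, not the exact Bio-ELECTRA configuration used in the experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Hypothetical checkpoint name, used purely for illustration.
model_name = "some-org/bio-electra-mid-squad-bioasq"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "What gene is mutated in Huntington's Disease patients?"
passage = ("Huntington's disease is caused by an expanded CAG repeat "
           "in the huntingtin (HTT) gene.")

# SQuAD-style input: [CLS] question [SEP] passage [SEP]
inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Greedily pick the highest scoring start and end token positions.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)  # e.g. a span such as "huntingtin (HTT) gene"
```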
To increase performance over the smaller BioASQ data, the training set was combined with the out-of-domain SQuAD [16] v1.1 data set. Altogether, ten language representation models, including the BERT based biomedical domain specific BioBERT model, were evaluated on the factoid and list question answer span detection task. The performance of the models was evaluated with the standard SQuAD evaluation metrics, exact match and F1 score. Ten randomly initialized answer span classifiers were fine-tuned for each language representation model. The experiment results are summarized in Table 3. All non-small Bio-ELECTRA models significantly outperformed BioBERT on this task. Given that the mid sized Bio-ELECTRA models have less than half the parameters of BioBERT, the results are very encouraging. The best performing Bio-ELECTRA Mid model, pretrained for 1.2 million steps, was chosen for the final system.

Table 3: Biomedical Question Answering Test Results
Model | Exact Match | F1
Bio-ELECTRA++ | 57.93 (0.66) | 67.48 (0.44)
ELECTRA Small++ | 57.78 (0.64) | 67.10 (0.55)
BERT | 59.98 (0.66) | 70.25 (0.48)
BioBERT | 63.58 (0.66) | 72.72 (0.48)
ELECTRA Base | 65.01 (0.84) | 72.82 (0.70)
Bio-ELECTRA Mid (1M) | 68.71 (0.76) | 75.52 (0.49)
Bio-ELECTRA Mid (1.2M) | 69.50 (0.54) | 75.82 (0.40)
Bio-ELECTRA Base (1M) | 68.44 (0.56) | 75.02 (0.60)
Bio-ELECTRA Base (1.2M) | 68.44 (0.38) | 75.50 (0.35)
Bio-ELECTRA Mid Combined (1.2M) | 66.46 (0.65) | 74.05 (0.44)

Snippets provided by the BioASQ challenge were first passed through Bio-AnswerFinder, bypassing the candidate document retrieval section. The re-ranked candidate sentences were then used as input for the factoid/list question classifier. The answer candidate word sequences were scored by a combination of their span classification probabilities, their number of occurrences and the rank of the sentence in which they first occurred. The answer candidates were normalized and filtered to remove sub-phrases, singular/plural differences and acronyms. For list questions, answer candidates were enriched by coordinated phrase detection and processing. A classifier score threshold of 0.65, selected to maximize F1 performance on a holdout set of questions, was used to select the subset of answer candidate spans submitted for list questions in the BioASQ challenge.

5. Experiments with Answer Candidate Re-ranking

In Bio-AnswerFinder, answer candidate sentences are first ranked by the inverse document frequency weighted relaxed word mover's distance using GloVe word and phrase embeddings trained on PubMed abstracts. While this ranking usually produces decent results, supervised re-ranking improves performance as measured on blind, multiple curator tests [9]. By casting the ranking problem as a 0/1 loss classification problem, the learned probability estimates can be used to rank the candidate sentences by relevance.
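As a concrete illustration of this re-ranking step, the sketch below scores each (question, candidate sentence) pair with a binary relevance classifier and sorts the candidates by the positive-class probability. The sequence classification checkpoint name is a placeholder assumption; this is a minimal sketch, not the exact Bio-AnswerFinder re-ranker code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; a fine-tuned relevance classifier would be loaded here.
model_name = "some-org/bio-electra-mid-reranker"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()

def rerank(question, candidate_sentences):
    """Return candidate sentences sorted by P(relevant) in descending order."""
    scored = []
    for sentence in candidate_sentences:
        inputs = tokenizer(question, sentence, return_tensors="pt",
                           truncation=True, max_length=256)
        with torch.no_grad():
            logits = model(**inputs).logits
        p_relevant = torch.softmax(logits, dim=-1)[0, 1].item()
        scored.append((p_relevant, sentence))
    return [s for _, s in sorted(scored, key=lambda x: x[0], reverse=True)]
```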
For Bio-AnswerFinder, up to 100 answer candidates per question, as returned by the weighted rWMD ranker, were annotated as relevant or not (up to the first occurrence of a correct answer). The questions were selected from the BioASQ 5b training set. In total, 44,933 sentences for 492 training questions and 9,064 sentences for 100 testing questions were annotated. Nine language representation models were tested. Due to the highly unbalanced nature of the data set (on average one positive example per 99 negative examples), a weighted loss function, in which an error on a positive example is weighted 99 times more than an error on a negative example, was also used. The mean reciprocal rank (MRR) results, averaged over ten randomly initialized runs for 14 model configurations (including the weighted-loss variants), are summarized in Table 4. Based on these results, Bio-ELECTRA Mid (1M) was chosen for the BioASQ 9 challenge, since it had a more stable score distribution than the Bio-ELECTRA++ model and is twice as fast as the larger Bio-ELECTRA Base re-ranking models. While all Bio-ELECTRA models were significantly better than both BioBERT and BERT Base, the performance differences among the best performing Bio-ELECTRA models were not statistically significant.

Table 4: Biomedical Question Answer Candidate Re-ranking Test Results
Model | MRR
ELECTRA Small++ | 0.281 (0.014)
ELECTRA Small++ (weighted) | 0.281 (0.008)
Bio-ELECTRA++ | 0.335 (0.017)
Bio-ELECTRA++ (weighted) | 0.332 (0.013)
BERT Base | 0.246 (0.007)
BioBERT | 0.283 (0.020)
ELECTRA Base | 0.294 (0.017)
Bio-ELECTRA Mid (1M) | 0.333 (0.017)
Bio-ELECTRA Mid (1M) (weighted) | 0.336 (0.017)
Bio-ELECTRA Mid (1.2M) | 0.316 (0.015)
Bio-ELECTRA Mid (1.2M) (weighted) | 0.322 (0.015)
Bio-ELECTRA Base (1M) | 0.333 (0.024)
Bio-ELECTRA Base (1.2M) | 0.328 (0.013)
Bio-ELECTRA Base (1.2M) (weighted) | 0.336 (0.023)

6. Search Engine Keyword Selection via Bio-ELECTRA++

Keyword selection is a vital step in question answering, since missing even a single important keyword can prevent the retrieval of relevant candidate documents. The original Bio-AnswerFinder used an LSTM based multi-class classifier with GloVe word embeddings trained on PubMed abstracts. To minimize out-of-vocabulary (OOV) effects on the GloVe embeddings, the LSTM based model also uses inputs from part-of-speech tags of the question words, encoded by a separate LSTM layer. Encouraged by the performance of the discriminative language representation models, a Bio-ELECTRA++ model based approach was introduced, in which keyword selection from a question is cast as a sequence tagging problem. Bio-ELECTRA++ was selected over the other, larger Bio-ELECTRA models because its inference time performance is up to eight times better. From BioASQ 5b, 752 training questions and 100 test questions were annotated, labeling each word in the question as a keyword or not. The performance of both models, averaged over ten randomly initialized training/testing runs, is shown in Table 5; the Bio-ELECTRA++ based keyword selection significantly outperforms the LSTM based multi-input model.

Table 5: Test Performance for Keyword Selection Classifiers
Model | Precision | Recall | F1
LSTM Multi-input Model | 91.72 (0.99) | 89.39 (1.70) | 90.53 (0.67)
Bio-ELECTRA++ | 97.58 (0.47) | 96.93 (0.54) | 97.25 (0.24)
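A minimal sketch of this sequence tagging formulation is given below using a token classification head. The binary label scheme (keyword vs. other), the checkpoint name and the first-sub-token aggregation per word are simplifying assumptions for illustration, not the exact tagger implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint for a fine-tuned keyword tagger (assumption).
model_name = "some-org/bio-electra-plus-plus-keyword-tagger"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)
model.eval()

def select_keywords(question):
    """Tag each question word as keyword (label 1) or not (label 0)."""
    words = question.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        labels = model(**enc).logits.argmax(dim=-1)[0].tolist()
    keywords = []
    word_ids = enc.word_ids(batch_index=0)
    for idx, word in enumerate(words):
        # use the label predicted for the first sub-token of each word
        token_pos = word_ids.index(idx)
        if labels[token_pos] == 1:
            keywords.append(word)
    return keywords

print(select_keywords("What gene is mutated in Huntington's Disease patients?"))
```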
7. Yes/No Question Answer Determination

Yes/no question answer determination from provided passages can be cast as a binary classification problem, similar to sentiment classification, where the goal is to determine the implicit positive (yes) or negative (no) sentiment from the given context. However, some of the candidate passages might not provide enough evidence for either answer. In these cases, the binary nature of the decision can introduce spurious decisions; this is especially a problem given the sentence-level operation of Bio-AnswerFinder. To remedy this, a third label (neutral) is introduced. Negative sampling for the neutral label was done using weighted rWMD based sentence similarity: snippet sentences were selected as neutral candidates if their weighted rWMD score was less than or equal to 0.6 when compared to the sentences of the ideal answer, while snippet sentences with a weighted rWMD score greater than or equal to 0.8 were chosen as additional label support sentences besides the ideal answers. The thresholds were selected by minimizing the number of questions without any neutral sentence for the given threshold values. While random sampling from other questions could easily have been used for negative (neutral) sampling, the goal is to differentiate between candidate sentences that are related to the question but do not provide an answer. The neutral sentences selected this way were afterwards checked and labeled manually.

From the BioASQ 8b training data, 727 yes/no questions were selected for training and 128 for testing. Training/testing instances were prepared from the sentences of the ideal answers and snippets. For yes/no classification there were 727 training instances and 128 test instances. For yes/no/neutral classification there were 2938 training instances and 539 testing instances. Nine language representation models were evaluated for yes/no classification to decide on the model for further yes/no/neutral answer classification. Test results, averaged over ten randomly initialized classifiers per language representation model, are shown in Table 6 together with their standard deviations. The best performing Bio-ELECTRA Base model pretrained for 1M steps was selected for comparison experiments between the yes/no and yes/no/neutral classifiers, together with three different voting strategies for the final decision.

At inference time, either the ten highest ranked answer candidate sentences selected by Bio-AnswerFinder or the snippets as provided by the BioASQ challenge are passed to the classifiers, which make a yes/no decision on each candidate sentence/snippet. The final decision is made by a voting strategy; three voting strategies were used. The majority voting strategy uses the most common yes/no decision as the final decision. The best score strategy uses the decision of the answer candidate with the highest score. The score voting strategy compares the summed scores of the yes-predicted and no-predicted answer candidates and takes the larger side as the final decision. For evaluation, the snippets provided for the 128 test questions were scored by both types of classifiers. For the yes/no/neutral classifier, any snippet with a neutral score greater than 0.5 was excluded from the voting. The test results are shown in Table 7. The yes/no/neutral classifier with score voting was the best performing configuration and was chosen for the BioASQ 9 challenge.

Table 6: Biomedical Yes/No Question Answer Classification Test Results
Model | P (Yes) | R (Yes) | F1 (Yes) | P (No) | R (No) | F1 (No)
Bio-ELECTRA++ | 91.24 (1.57) | 95.29 (2.31) | 93.19 (0.75) | 78.91 (7.41) | 63.85 (7.92) | 69.84 (3.87)
ELECTRA Small++ | 88.18 (0.71) | 94.31 (1.74) | 91.14 (1.00) | 69.92 (7.34) | 50.38 (3.19) | 58.40 (3.61)
BERT Base | 87.02 (2.57) | 95.49 (2.64) | 90.99 (1.00) | 65.15 (22.99) | 43.46 (15.20) | 51.71 (17.49)
BioBERT | 92.94 (1.19) | 93.04 (1.55) | 92.63 (0.91) | 71.94 (4.10) | 69.23 (5.16) | 70.42 (3.46)
ELECTRA Base | 94.73 (1.67) | 96.19 (0.82) | 95.44 (0.88) | 82.02 (3.08) | 76.32 (8.01) | 78.86 (5.06)
Bio-ELECTRA Mid (1M) | 98.07 (0.71) | 94.71 (1.40) | 96.36 (0.96) | 81.83 (4.15) | 92.69 (2.69) | 86.89 (3.23)
Bio-ELECTRA Base (1M)* | 97.43 (1.14) | 95.98 (1.20) | 96.69 (0.58) | 85.31 (3.64) | 90.00 (4.62) | 87.46 (2.31)
Bio-ELECTRA Mid (1.2M) | 95.71 (1.97) | 94.80 (1.76) | 95.23 (0.89) | 80.69 (4.44) | 83.08 (8.28) | 81.52 (4.16)
Bio-ELECTRA Base (1.2M) | 97.22 (1.21) | 95.49 (1.26) | 96.34 (0.83) | 83.61 (3.78) | 89.23 (4.80) | 86.23 (3.17)

Table 7: Yes/No versus Yes/No/Neutral Classification Performance
Model / Voting Strategy | P (Yes) | R (Yes) | F1 (Yes) | P (No) | R (No) | F1 (No)
Yes/No classifier, Bio-ELECTRA Base (1M)
  majority voting | 77.78 | 95.79 | 85.85 | 88.57 | 54.39 | 67.39
  best score | 80.91 | 93.68 | 86.83 | 85.71 | 63.16 | 72.73
  score voting | 80.91 | 93.68 | 86.83 | 85.71 | 63.16 | 72.73
Yes/No/Neutral classifier, Bio-ELECTRA Base (1M)
  majority voting | 84.62 | 93.62 | 88.89 | 86.67 | 70.91 | 78.00
  best score | 84.85 | 89.36 | 87.05 | 80.00 | 72.73 | 76.19
  score voting | 85.44 | 94.68 | 89.90 | 88.89 | 72.73 | 79.21
Yes/No/Neutral classifier, Bio-ELECTRA Base (1M), seq length: 256
  majority voting | 84.76 | 94.68 | 89.45 | 86.64 | 70.91 | 78.79
  best score | 85.29 | 92.55 | 88.78 | 85.11 | 72.73 | 78.43
  score voting | 85.58 | 94.68 | 89.90 | 88.89 | 72.73 | 80.0
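The three voting strategies can be made concrete with a small sketch. The per-candidate prediction format below (a label plus the classifier probability for that label) is an illustrative assumption rather than the exact data structure used in the system.

```python
from collections import Counter

def final_yes_no(predictions, strategy="score_voting", neutral_cutoff=0.5):
    """predictions: list of (label, score) pairs, one per candidate sentence/snippet,
    where label is "yes", "no" or "neutral" and score is the classifier probability
    for that label (illustrative format)."""
    # Drop candidates judged neutral with probability above the cutoff.
    votes = [(lbl, s) for lbl, s in predictions
             if lbl in ("yes", "no") or s <= neutral_cutoff]
    votes = [(lbl, s) for lbl, s in votes if lbl in ("yes", "no")]
    if not votes:
        return "yes"  # arbitrary fallback when no usable evidence remains

    if strategy == "majority":
        return Counter(lbl for lbl, _ in votes).most_common(1)[0][0]
    if strategy == "best_score":
        return max(votes, key=lambda v: v[1])[0]
    # score voting: compare summed scores of the yes- and no-predicted candidates
    yes_sum = sum(s for lbl, s in votes if lbl == "yes")
    no_sum = sum(s for lbl, s in votes if lbl == "no")
    return "yes" if yes_sum >= no_sum else "no"
```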
8. Summary Question Handling

8.1. Extractive Summarization for BioASQ Summary Questions

In extractive summarization, a summary is generated by selecting sentences from the documents/snippets to be summarized. The introduced salient sentence selection strategy leverages the ranked sentences output by Bio-AnswerFinder, using the top 10 ranked answer candidate sentences. To minimize repetition, hierarchical agglomerative clustering using weighted relaxed word mover's distance (wRWMD) similarity is introduced to group similar sentences; the cluster merge stop similarity threshold that maximizes the ROUGE-2 score was determined on the training set answer summaries. From each cluster, the highest Bio-AnswerFinder ranked sentence is selected. The selected sentences are then ordered by their order of occurrence in the abstracts and concatenated.

8.2. Abstractive Summarization for BioASQ Summary Questions

Unlike extractive summarization, where the summary is generated from the sentences of the candidate documents/snippets, in abstractive summarization new content summarizing the candidate documents/snippets is generated. To this end, a unified text-to-text transformer model, T5 [10], is trained with the combined snippets as the document and the ideal answer as the summary, for all summary questions from the BioASQ 8B training data. As a preprocessing step, any overlapping snippets are detected and only the longest of the overlapping snippets is included in the document to be summarized. A T5 Base model is fine-tuned with a maximum input sequence length of 512 and a batch size of 2 for 2 epochs, generating summaries of at most 150 tokens.
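The sketch below illustrates this kind of abstractive summary generation with a fine-tuned T5 checkpoint via the Hugging Face transformers API. The checkpoint name, the "summarize:" prompt prefix and the beam search settings are assumptions for illustration, not the exact training and decoding setup described above.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Placeholder for a T5 Base model fine-tuned on (snippets -> ideal answer) pairs.
model_name = "some-org/t5-base-bioasq-summary"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def summarize(snippets, max_summary_tokens=150):
    """Concatenate the (non-overlapping) snippets and generate an abstractive summary."""
    document = " ".join(snippets)
    inputs = tokenizer("summarize: " + document, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_length=max_summary_tokens,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```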
9. BioASQ 2021 Synergy Task Systems

In the BioASQ 9 Synergy task, all questions were on the developing problem of COVID-19, without any guarantee that all of them could be answered at the time. There are no separate information retrieval and question answering (from provided snippets) phases, making the task suitable only for end-to-end systems. In addition, feedback from the domain experts is provided after each round of the task, allowing the participating systems to take advantage of that feedback in the next round.

For BioASQ Synergy document retrieval and snippet selection, Bio-AnswerFinder [9] was used. Instead of the LSTM based keyword selection classifier, the Bio-ELECTRA++ [6] based keyword selection classifier described in Section 6 was used for better performance. Starting from Synergy round 2, the provided expert feedback data was used to augment the training data of the BERT [1] based re-ranker classifier that Bio-AnswerFinder applies after weighted relaxed word mover's distance (wRWMD) similarity based ranking and focus word based filtering. At each round, the BERT Base based re-ranker was retrained with the cumulative Synergy expert feedback. For rounds 2 and 3, an alternative keyword search engine (instead of Elasticsearch) was used after keyword query generation, based on the Pyserini search engine with MonoT5 based document re-ranking [17]. For round 4, a GloVe [18] embedding vector similarity based boolean search engine was developed, in which an approximate KNN GloVe vector similarity index was used for efficient similarity based retrieval of expansions for the query keywords. The candidate set of abstracts retrieved by this search engine was combined with the Elasticsearch results for downstream processing by Bio-AnswerFinder. The results of this system were entered as 'bio-answerfinder-2' on the Synergy task web site.

The GloVe vectors used for rounds 1 and 2 were generated from the 2017 PubMed abstracts and therefore contained no COVID-19 related terms, causing Bio-AnswerFinder to exclude COVID-19 related terms both from the keywords selected for abstract retrieval and from the wRWMD similarity based ranking, which degraded the system. After this was noticed, new GloVe vectors were trained on the 2021 PubMed baseline abstracts and used to retrain the affected classifiers in Bio-AnswerFinder for rounds 3 and 4.

9.1. Exact Answers/Ideal Answers

The re-ranked candidate sentences from Bio-AnswerFinder are the input to the Synergy challenge subsystems.

9.1.1. Factoid and List Questions

For factoid and list questions, the answer span classifier and post-processing described in Section 4 were used. Since the Bio-ELECTRA models were not yet pretrained at the time of the Synergy challenge, ELECTRA Base [5] models were fine-tuned using the combined SQuAD v1.1 and BioASQ 8b training data. For factoid ideal answers, the highest Bio-AnswerFinder re-ranked sentence containing the highest scored exact answer was selected. For list ideal answers, the sentence containing the largest number of highest scored exact answers was selected among the top ten Bio-AnswerFinder re-ranked sentences.

9.1.2. Yes/No Questions

For yes/no questions, both the binary and the ternary classifiers described in Section 7 were used in different rounds. As for the factoid and list questions, ELECTRA Base [5] models were used. In round 1, the top 10 Bio-AnswerFinder selected sentences were passed to the binary classifier and the yes/no decision was based on majority voting. The three-way ELECTRA Base based classifier for yes, no and neutral sentences was used with majority voting for rounds 2, 3 and 4. The highest re-ranked sentence from Bio-AnswerFinder was selected as the ideal answer.
9.1.3. Summary Questions

For summary questions, the extractive system described in Section 8.1 was used.

10. BioASQ 2021 9B Systems

Similar to the Synergy task, for the BioASQ 9B Phase A task Bio-AnswerFinder [9] was used with the Bio-ELECTRA++ based keyword classifier and the Bio-ELECTRA Mid based re-ranker as described in Section 2. For batches 4 and 5, the iterative keyword query mechanism against the Elasticsearch based document retrieval was enhanced by a word embeddings based keyword synonym expansion mechanism. For each keyword selected by the Bio-ELECTRA++ based keyword classifier, up to the four most similar terms (by cosine similarity of GloVe word vectors) were added as synonyms to the Elasticsearch query, which was iteratively refined until enough documents were returned. This approach was used for the challenge Task A system "bio-answerfinder-2".

For Task 9B Phase B, the snippets provided by the BioASQ challenge were first passed through Bio-AnswerFinder, bypassing the retrieval section. The re-ranked candidate sentences were the input to the challenge subsystems.

10.1. Factoid and List Questions

For factoid and list questions, the Bio-ELECTRA Base based answer span classifier and post-processing described in Section 4 were used. For ideal answers, the same mechanism as in the Synergy task was used.

10.2. Yes/No Questions

For yes/no questions, the best performing Bio-ELECTRA Mid model based ternary yes/no/neutral classifier described in Section 7 was used. The final decision was made by score voting. For ideal answers, the same mechanism as in the Synergy task was also used.

10.3. Summary Questions

For summary questions, both the extractive and the abstractive systems described in Section 8.1 and Section 8.2, respectively, were used. The abstractive summarization system was used for the BioASQ challenge Task B system "bio-answerfinder-2".

11. Discussion

Bio-AnswerFinder, together with the extensions introduced in this paper, is one of the few end-to-end systems participating in the BioASQ challenges that can handle Synergy and both phases of Task B for all question types. For the Synergy task systems described in Section 9, the GloVe vectors used in rounds 1 and 2 did not have any COVID-19 related terms since they were generated from 2017 PubMed abstracts. This had detrimental effects on document retrieval and ranking, which rely on GloVe vectors for keyword ranking in the greedy iterative retrieval and for wRWMD based ranking. Since documents that received feedback in previous rounds need to be excluded from the eligible document pool for the questions in subsequent rounds, the detrimental effect from the first two rounds adversely affected the later rounds as well. In addition, because of a misunderstanding of the instructions for the Synergy challenge, only abstracts with a PubMed ID were indexed for search, leaving out all preprint abstracts, which make up about half of the CORD-19 corpus. This was not noticed until the Synergy version 2 challenge. Even after these setbacks, the system performance was decent based on the official BioASQ Synergy Task results (on average, 12th out of the 23 individual systems on documents F1, 12th out of the 24 systems on snippets F1, 6th out of the 24 systems on yes/no overall F1, 6th out of the 24 systems on factoid MRR and 5th out of the 24 systems on list F1).
The GloVe embedding vector similarity based boolean search engine introduced in round 4 to increase coverage over the iterative keyword query based document retrieval improved performance over the default retrieval, based on the official Synergy Task test results.

In BioASQ 9B Phase A, the introduced system was the best system on document retrieval in four out of the five test batches based on F1 score and second on the remaining batch. For snippets, the system was second best in two batches and third in three batches. The keyword synonym expansion approach described in Section 10 ('bio-answerfinder-2'), used in batches 4 and 5, had slightly worse performance on document retrieval. For snippets, the results were more mixed: the expansion approach performed better than the original system in batch 4 and worse in batch 5.

In 9B Phase B, the introduced systems were second for yes/no questions in test batches 2 and 3. For list questions, the performance was better than last year. While factoid question performance was decent, there is room for improvement. However, based on the factoid question error analysis of last year's submissions described in the next section, the observed near-miss issue is suspected this year as well. This will be investigated once the gold standard annotations are available. For the ideal answers, only automatic ROUGE scores are available. Based on both ROUGE-2 (F1) and ROUGE-SU4 (F1) scores, the introduced system was the best scoring system for test batches 1 and 2. Despite the terse nature of its results (usually a single sentence), the T5 [10] based abstractive summarization system 'bio-answerfinder-1' seems to work well, outperforming extractive summarization in batches 2 and 4. In BioASQ 8B, Bio-AnswerFinder won second place for human evaluated ideal answers in three test batches. Since Bio-AnswerFinder was mainly designed as a practical knowledge discovery tool for biomedical researchers, who prefer ideal answers (an answer with evidence and context), this result was very encouraging validation of the main design goal of Bio-AnswerFinder.

11.1. Error Analysis for Factoid Questions

Based on the analysis of the BioASQ 8b factoid question 'bio-answerfinder' system submissions against the ground truth answers in the BioASQ 9B training data, about 53% of the errors can be attributed to near misses, i.e. singular/plural differences, differences in stop words (e.g. articles), single special character differences, an acronym versus its expansion, and other transliterations or paraphrasings. Another common issue is the provided ground truth being a paraphrased sentence, more akin to an ideal answer than a factoid answer, that does not occur in any of the supplied documents for the question. Representative near miss errors of different types are shown in Table 8. For examples 4 and 5, the ground truth does not exist in the provided phrases. Even though these count as errors for the automatic evaluation, for a human user the predicted answers would be correct. QA systems are designed for human use, and while automatic evaluation provides fast, systematic evaluation, a near-miss rate of more than 50% underlines the importance of human evaluation for QA systems, even though it is more costly than automatic evaluation.

Table 8: Examples of near miss factoid errors for the BioASQ 8b bio-answerfinder system
No | Type | Description
1 | Question | What gene is mutated in Huntington's Disease patients?
  | Ground Truth | HTT gene encoding the protein huntingtin
  | Prediction | HTT gene; huntingtin (HTT) gene; huntingtin gene
2 | Question | Which diagnostic test is approved for coronavirus infection screening?
  | Ground Truth | real-time reverse transcription-PCR
  | Prediction | rRT-PCR
3 | Question | How large is a lncRNAs?
  | Ground Truth | >200 nucleotides
  | Prediction | more than 200 nucleotides
4 | Question | When was vaxchora first licensed by the FDA?
  | Ground Truth | 10 June 2016
  | Prediction | June 10, 2016
5 | Question | What is the LINCS Program?
  | Ground Truth | NIH-funded program to generate a library of integrated, network-based, cellular signatures
  | Prediction | NIH Common Fund Library of Integrated Network-based Cellular Signatures; Library of Integrated Network-based Cellular Signatures

12. Conclusions

In this paper, extensions to an end-to-end biomedical QA system, Bio-AnswerFinder [9], for the BioASQ biomedical question answering challenge were introduced.
To this end, three ELECTRA [5] discriminative language representation models were pretrained from scratch on PubMed abstracts and PMC open access papers. Based on performance comparisons against numerous other language representation models, including BioBERT, the introduced Bio-ELECTRA models showed superior performance for the classifiers used in the Bio-AnswerFinder subsystems. The resulting system(s) showed very good performance in BioASQ 9B Phase A and good performance for yes/no questions and ideal answers in Phase B, based on the official automatic evaluation results. In the future, the sensitivity of some subsystems, such as keyword ranking and weighted relaxed WMD based answer candidate ranking, to out-of-vocabulary terms will be addressed. Based on the insights from an in-depth analysis of the questions that could not be properly answered in the BioASQ Synergy and 9B tasks, Bio-AnswerFinder will be further improved with the eventual goal of answering all answerable biomedical domain questions.

13. Software and Data Availability

Bio-AnswerFinder source code and documentation is available on GitHub (https://github.com/scicrunch/bio-answerfinder). The datasets used for Bio-ELECTRA model evaluations and the Bio-ELECTRA++ source code are available on GitHub (https://github.com/SciCrunch/bio_electra). The small Bio-ELECTRA++ models are available on Zenodo (https://doi.org/10.5281/zenodo.3971235). The mid and base sized pre-trained Bio-ELECTRA models are available on Zenodo (https://doi.org/10.5281/zenodo.4699034).

14. Acknowledgments

This work was supported by the NIDDK Information Network (dkNET; http://dknet.org) via NIH's National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) award U24DK097771. I would also like to thank the Google TensorFlow Research Cloud (TFRC) program for providing free TPUs, which allowed me to pretrain the Bio-ELECTRA models.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[2] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 5753–5763.
[3] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, 2020. arXiv:1909.11942.
[4] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240. URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[5] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, 2020. arXiv:2003.10555.
[6] I. B. Ozyurt, On the effectiveness of small, discriminatively pre-trained language representation models for biomedical text mining, in: Proceedings of the First Workshop on Scholarly Document Processing, Association for Computational Linguistics, 2020, pp. 104–112. URL: https://www.aclweb.org/anthology/2020.sdp-1.12. doi:10.18653/v1/2020.sdp-1.12.
[7] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artieres, A. Ngonga, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, G. Paliouras, An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition, BMC Bioinformatics 16 (2015) 138. URL: http://www.biomedcentral.com/content/pdf/s12859-015-0564-6.pdf. doi:10.1186/s12859-015-0564-6.
[8] A. Nentidis, A. Krithara, K. Bougiatiotis, G. Paliouras, Overview of BioASQ 8a and 8b: Results of the eighth edition of the BioASQ tasks a and b, in: Proceedings of the 8th BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2020. URL: http://ceur-ws.org/Vol-2696/paper_164.pdf.
[9] I. B. Ozyurt, A. Bandrowski, J. S. Grethe, Bio-AnswerFinder: a system to find answers to questions from biomedical texts, Database 2020 (2020) baz137. URL: https://doi.org/10.1093/database/baz137. doi:10.1093/database/baz137.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.
[11] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
[12] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, Association for Computing Machinery, New York, NY, USA, 2005, pp. 89–96. URL: https://doi.org/10.1145/1102351.1102363. doi:10.1145/1102351.1102363.
[13] I. B. Ozyurt, J. Grethe, Iterative document retrieval via deep learning approaches for biomedical question answering, in: 2019 15th International Conference on eScience (eScience), 2019, pp. 533–538. doi:10.1109/eScience.2019.00072.
[14] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: F. Bach, D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, PMLR, Lille, France, 2015, pp. 957–966.
[15] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1715–1725. URL: https://www.aclweb.org/anthology/P16-1162. doi:10.18653/v1/P16-1162.
[16] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2383–2392. URL: https://www.aclweb.org/anthology/D16-1264. doi:10.18653/v1/D16-1264.
[17] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.
[18] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. URL: https://www.aclweb.org/anthology/D14-1162. doi:10.3115/v1/D14-1162.