=Paper=
{{Paper
|id=Vol-2696/paper_75
|storemode=property
|title=Transformer-Based Open Domain Biomedical Question Answering at BioASQ8 Challenge
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_75.pdf
|volume=Vol-2696
|authors=Ashot Kazaryan,Uladzislau Sazanovich,Vladislav Belyaev
|dblpUrl=https://dblp.org/rec/conf/clef/KazaryanSB20
}}
==Transformer-Based Open Domain Biomedical Question Answering at BioASQ8 Challenge==
Transformer-Based Open Domain Biomedical Question Answering at BioASQ8 Challenge

Ashot Kazaryan (1,2), Uladzislau Sazanovich (1,2), and Vladislav Belyaev (1,3)
1 JetBrains Research, Russia
{ashot.kazaryan,uladzislau.sazanovich,vladislav.belyaev}@jetbrains.com
2 ITMO University, Russia
3 National Research University Higher School of Economics, Russia
{287371,191872}@niuitmo.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. BioASQ Task B focuses on biomedical information retrieval and question answering. This paper describes the participation and proposed solutions of our team. We build a system based on recent advances in the general domain as well as on approaches from previous years of the competition. We adapt a system based on a pretrained BERT for document and snippet retrieval, question answering and summarization. We describe all approaches we experimented with and show that while neural approaches do well, baseline approaches sometimes achieve high automatic metrics. The proposed system achieves competitive performance while remaining general, so it can be applied to other domains as well.

Keywords: BioASQ Challenge · Biomedical Question Answering · Open Domain Question Answering · Information Retrieval · Deep Learning

1 Introduction

BioASQ [27] is a large-scale competition for biomedical research. It provides evaluation measures for various setups such as semantic indexing, information retrieval and question answering, all in the biomedical domain. The competition takes place annually online, and each year gains more attention from research groups all around the world. BioASQ provides the necessary datasets, evaluation metrics and leaderboards for each of its sub-challenges.

More specifically, the BioASQ challenge consists of two major objectives, which are called “tasks”. The first is semantic indexing, whose goal is to construct a search index for a given set of documents such that certain semantic relationships hold between the index terms. The second objective is passage ranking and question answering in various forms: given a question, the system must return a piece of text. The returned text must either answer the question directly or contain enough information to derive the answer. In BioASQ terms, these objectives are called Task A and Task B, respectively.

In this work, we explore applications of state-of-the-art models in natural language processing and deep learning to biomedical question answering. As a result, we develop a system that is capable of providing answers in the form of documents, snippets, exact answers or abstractive text, given biomedical questions from various domains. We evaluate our system on the recent BioASQ 2020 challenge, where it achieves competitive performance.

1.1 BioASQ Tasks

Our team participated in Task B, which involves information retrieval, question answering, summarization and more. This task uses benchmark datasets containing development and test questions, in English, along with gold standard (reference) answers constructed by a team of biomedical experts. The task is separated into two phases.

Phase A. The first phase measures the ability of systems to answer biomedical questions with a list of relevant documents and snippets of text from the retrieved documents. The main metric for documents and snippets is the mean average precision (MAP).
The average precision is defined as

AP = (1 / |L_R|) · Σ_{r=1}^{|L|} P(r) · rel(r),

where |L| is the number of items in the list predicted by the system, |L_R| is the number of relevant items, P(r) is the precision when only the first r returned items are considered, and rel(r) is equal to 1 if the r-th returned item is relevant (and 0 otherwise). MAP and GMAP are the arithmetic and geometric means of the average precision over all questions in the evaluation set. For snippet retrieval, precision is measured in terms of characters, and rel(r) is equal to 1 if the returned item has non-zero overlap with at least one relevant snippet. Additional metrics are precision, recall and F1 score. A more detailed description is available in the original paper [13].

Phase B. The second phase evaluates the performance of question answering, given a list of relevant documents and snippets from the previous phase. The questions are of several types: questions where the answer is either yes or no (“yes/no”), questions where the answer is a single term (“factoid”), and questions where the answer is a list of terms (“list”). Additionally, each question has an “ideal” answer, where the aim is to measure the systems’ ability to generate a paragraph-sized passage that answers the question.

The metrics of Phase B are macro-F1 for yes/no questions, mean reciprocal rank (MRR) [29] for factoid questions and F1 score for list questions. To evaluate answers in natural language, ROUGE [16] scores are used. We should note that human experts additionally evaluate all systems after the contest. However, those results are not available at the time of writing, so we use only the automatic measurements to draw our conclusions.

1.2 Related Work

Most contemporary large-scale QA systems attempt to bridge the gap between a massive source of knowledge and a complex neural reasoning model. Many of the popular knowledge sources are sets of unstructured or semi-structured texts, like Wikipedia [5], [33]. This is also the case in the biomedical domain, where PubMed [4] is among the largest sources of biomedical scientific knowledge. During document retrieval, a question answering system can benefit from structured knowledge as well. There is a rich set of biomedical ontologies, such as UMLS [2] and GO [1], whose successful use has been shown in various QA systems, including ones submitted by previous years’ BioASQ participants [12]. However, in our work we do not leverage such information and instead explore a more general approach, applicable to any other domain.

Many systems perform re-ranking after initial document retrieval. Specialized neural models like DRMM [8] have been successfully used in previous BioASQ challenges [3]. More recent approaches utilize transformer-based language models [28] like BERT [6] for a wide variety of tasks. Applications of transformers to document re-ranking have set a new state of the art [21], including in last years’ BioASQ challenges [22]. There are also systems that perform document re-ranking based on snippet extraction [22], but they did not achieve the highest positions.

Some systems solve snippet extraction by utilizing methods that were originally developed for document re-ranking. [22] uses the earlier-mentioned DRMM in Task 7B and achieves top results. [23] proposes another neural approach, employing both textual and conceptual information from the question and candidate text.
In our work, we experiment with different methods and show how strong baselines consistently demonstrate high metrics, given a proper document retriever.

Deep learning has shown its superiority in question answering. In Task 5B, [30] achieved top scores by training an RNN-based neural network. However, most of the modern advancements in question answering can be attributed to transformer-based models. Recent years’ challenges were dominated by systems that used BERT or its task-specific adaptations, such as BioBERT [34], [10]. In this work we experiment with a similar approach.

Deep neural models, and transformer-based models in particular, have shown their ability to tackle summarization in different setups [17], including QA summarization [15]. However, as [19] noticed, BioASQ summaries tend to look very similar to the input examples. They exploit this observation, introduce several solutions based on sentence re-ranking, and achieve top automatic and human scores in several batches. There is also an attempt to utilize pointer-generator networks [26] for BioASQ ideal answers [7]. During the competition we extend the snippet re-ranking approach by using transformer models. Moreover, we introduce a fully generative approach, also based on transformers.

2 Methods

In this section, we describe the system we implemented for document and snippet retrieval, as well as our question answering system. We provide the results of the different approaches we experimented with during the competition. To assess the performance of different methods more accurately, we merge all test batches of Task 8B into one and use the resulting 500 questions as an evaluation set. Here and during the competition, we created a development set using 100 questions from Task 6B and 200 questions from Task 7B. Our system evolved from batch to batch, achieving its final shape in batch 5. All ablation experiments and retrospective evaluations are performed with the system that was used for the fifth-batch submission.

2.1 Document Retrieval

For document retrieval, we implement a system conceptually similar to [21]. First we extract a list of N document candidates using the Anserini implementation of the BM25 algorithm [32]. Then we use a BERT model to re-rank candidate documents and output at most ten top-scored documents.

BM25. For initial document retrieval, we used Anserini [32]. We created an index using the 2019 PubMed Baseline Repository [4]. For each paper in the PubMed Baseline, we extracted the PubMed identifier, the title, and the abstract, storing the title and abstract as separate fields in the index. We applied Anserini’s default stopword filtering and Porter stemming [31] to the title and abstract. Overall, the searcher index contains 19 million documents.

BERT Re-ranking. The initial set of documents obtained with BM25 is passed to the BERT re-ranker, which assigns relevance scores to documents based on a question. We consider all documents with a score higher than a threshold to be relevant and output at most ten papers with the highest scores.

To train the BERT re-ranker, we created a binary classification dataset. We obtained positive examples from the gold documents of the BioASQ dataset. We collected negative examples using BM25 by retrieving 200 documents with the question as a query and considering all documents from position 100 onward to be non-relevant if they are not in the gold document set. As the BioASQ question answering dataset contains questions collected from past years’ contests, the relevant documents include only papers published before the year of the corresponding competition, while several documents relevant to a question are usually published after that year. To exclude such papers from the negative examples, we calculated the maximum publication year over all relevant documents and filtered out documents published after this year from the negative examples.
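To make this training-data construction concrete, here is a minimal, library-free sketch. The helpers bm25_search and pub_year are assumptions standing in for the Anserini searcher and a PubMed metadata lookup; only the rank cut-offs and the year filter follow the description above.

```python
# Sketch of building the binary classification set for the BERT re-ranker:
# positives come from BioASQ gold documents, negatives from low-ranked BM25
# hits (positions 100-199 of 200) that are not gold and not published after
# the newest gold document. `bm25_search` and `pub_year` are assumed helpers.
def build_reranker_examples(question: str, gold_docs: set,
                            bm25_search, pub_year) -> list:
    examples = [(question, doc, 1) for doc in gold_docs]        # positives
    max_gold_year = max(pub_year(doc) for doc in gold_docs)
    candidates = bm25_search(question, k=200)                   # ranked doc ids
    for doc in candidates[100:]:                                # low-ranked tail
        if doc not in gold_docs and pub_year(doc) <= max_gold_year:
            examples.append((question, doc, 0))                 # negatives
    return examples
```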
Experiments. We evaluated several approaches to document retrieval. First, we evaluated the performance of the BM25 algorithm alone, and then we applied different modifications of the BERT-based re-ranker. We examined the effect of the relevance score threshold as well as of the number of documents obtained from the BM25 stage. The results are presented in Table 1. The BERT-based re-ranker consistently improves the base BM25 performance. Since our re-ranker is trained as a logistic regression, we can vary the decision boundary to achieve an appropriate trade-off between precision and recall. However, the MAP metric, which is used as the final ranking measure, does not penalize the system for additional non-relevant documents, which means a system should always output as many documents as possible to achieve the highest score, even though this reduces its practical usefulness. We decided to orient our system towards both precision and recall, and as a result we achieve the highest F1 scores across all batches while maintaining competitive MAP scores.

Table 1. Results of different approaches to document retrieval on the combined test set of 500 questions from 8B. N is the number of documents returned from the BM25 stage; T is the score threshold for relevant documents.

Method                          Precision  Recall  F-Measure  MAP     GMAP
BM25 (N = 10)                   0.1190     0.5022  0.1730     0.3579  0.0128
BM25+BERT (N = 50, T = 0.5)     0.2892     0.5158  0.3334     0.3979  0.0155
BM25+BERT (N = 50, T = 0.0)     0.1358     0.5481  0.1954     0.4114  0.0221
BM25+BERT (N = 500, T = 0.5)    0.2734     0.5387  0.3249     0.4046  0.0191

2.2 Snippet Retrieval

Snippet extraction systems extract a continuous span of text from one of the relevant documents for the given question. We observe that snippets from the BioASQ training set are usually one sentence long, thus our system is designed as a sentence retriever and snippet extraction is formulated as a sentence ranking problem. We experiment with both neural and statistical approaches to tackle this challenge.

Baseline. We use a simple statistical baseline for sentence ranking based on measuring entity co-occurrence in the question and the candidate sentence. For each question and sentence we extract sets of entities Q and S, respectively, and compute the relevance score:

relevance(q, s) = |Q ∩ S| / |Q|

We use the ScispaCy [20] en_core_web_sm model for extracting entities.

Word2Vec Similarity. One approach to determining sentence similarity is to map both the query and the candidate into the same vector space and measure the distance between them. For embedding word sequences, we use a Word2Vec model pretrained on PubMed texts [18] and compute the mean of the individual word embeddings. Let E_q and E_s be the embeddings of the question and the snippet, respectively. The relevance of a snippet for a given question is the cosine similarity between the embeddings:

relevance(q, s) = (E_q · E_s) / (||E_q|| ||E_s||)
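The two scoring baselines above can be sketched in a few lines, assuming a spaCy/ScispaCy pipeline for entity extraction and a gensim-loadable Word2Vec model; the model names and file path below are placeholders, not the exact artifacts used in our pipeline.

```python
# Minimal sketch of the entity-overlap baseline and the Word2Vec cosine
# similarity described above. Model names/paths are placeholders.
import numpy as np
import spacy
from gensim.models import KeyedVectors

nlp = spacy.load("en_core_web_sm")  # or a ScispaCy model
w2v = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)  # placeholder path

def entity_overlap_relevance(question: str, sentence: str) -> float:
    """relevance(q, s) = |Q ∩ S| / |Q| over extracted entity sets."""
    q_ents = {e.text.lower() for e in nlp(question).ents}
    s_ents = {e.text.lower() for e in nlp(sentence).ents}
    return len(q_ents & s_ents) / len(q_ents) if q_ents else 0.0

def mean_embedding(text: str) -> np.ndarray:
    vecs = [w2v[t] for t in text.lower().split() if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def w2v_relevance(question: str, sentence: str) -> float:
    """Cosine similarity between mean-pooled question and sentence embeddings."""
    e_q, e_s = mean_embedding(question), mean_embedding(sentence)
    denom = np.linalg.norm(e_q) * np.linalg.norm(e_s)
    return float(e_q @ e_s / denom) if denom > 0 else 0.0
```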
BERT Similarity. As a transformer pretrained on the biomedical domain should contain a lot of transferable knowledge, we check the zero-shot performance of the pretrained model. Similar to the embedding similarity above, the relevance is the cosine similarity between the embeddings of the question and the snippet, where the embedding of a text span is the contextualized embedding of the special [CLS] token inserted before the tokenized text.

BERT Relevance. As the task of snippet retrieval is very similar to document retrieval, we test a similar approach: we use the BERT_rel model, trained for document ranking, to assign a relevance score to a question-snippet pair:

relevance(q, s) = BERT_rel(q, s)

Document Scores. Finally, after assigning each question-sentence pair a relevance score, we scale it by an additional score based on the position, within the list of relevant documents, of the document from which the candidate sentence is extracted. Despite the simplicity of this trick, experiments show considerable improvements in evaluation metrics, which points to a strong correlation between the rank of the abstracts and the rank of the snippets from those abstracts. For each document d_i from the list of ranked relevant documents D = d_1, d_2, ..., d_n there is a list of sentences S_i = s_{i,1}, s_{i,2}, ..., s_{i,m}, and the score of sentence s_{i,j} for query q is:

score(q, s_{i,j}) = relevance(q, s_{i,j}) / i

Experiments. We evaluated all the described approaches to snippet retrieval. The results are presented in Table 2. The heuristic of folding the document score into the snippet score improves MAP significantly for all approaches. In line with document retrieval, the BERT relevance model has higher precision and recall but lower MAP. Surprisingly, retrieval based on BioBERT cosine similarity performed well even without training on any BioASQ data; we can consider this the zero-shot performance of BioBERT on the task of snippet retrieval.

Table 2. Results of different approaches to snippet retrieval on the combined test set of 500 questions from 8B. “Docs” means scaling the snippet score by the position of the source document.

Method                        Precision  Recall  F-Measure  MAP     GMAP
Baseline                      0.1631     0.2871  0.1841     0.6521  0.0036
Baseline + Docs               0.1733     0.2876  0.1934     0.8902  0.0020
Word2Vec Similarity           0.1702     0.2941  0.1904     0.6408  0.0054
Word2Vec Similarity + Docs    0.1727     0.2850  0.1928     0.9350  0.0019
BERT Similarity               0.1607     0.2621  0.1763     0.6338  0.0031
BERT Similarity + Docs        0.1733     0.2847  0.1927     0.9374  0.0019
BERT Relevance                0.1931     0.3383  0.2174     0.6926  0.0102
BERT Relevance + Docs         0.1921     0.3344  0.2161     0.8098  0.0071

2.3 Exact Answers

Factoid and list questions. For factoid and list questions we generate answers with a single extractive question-answering system. Its design follows the classical transformer-based approach described in [6]. As the underlying neural model, we use ALBERT [14] fine-tuned on SQuAD 2.0 [24] and the BioASQ training set. SQuAD is an extractive question answering dataset, so it is well suited for BioASQ tasks. In essence, list and factoid questions can be handled by the same span extraction technique, so we use the same model for both question types, differing only at the postprocessing stage.

Throughout all five batches we experimented mainly with the pre- and postprocessing stages, without substantial changes to the architecture of the system itself. During preprocessing, we convert input questions to the SQuAD format [25], where contexts are built from the relevant snippets that come with each input question.
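A minimal sketch of this preprocessing step is shown below. It assumes the public BioASQ JSON layout (question “body”, a list of “snippets” with “text” fields, and “exact_answer”); answer spans are located naively for illustration, and our actual pipeline differs in details.

```python
# Sketch: convert a BioASQ question into a SQuAD 2.0-style example whose
# context is the concatenation of the question's relevant snippets.
# Field names follow the public BioASQ JSON layout; intended for
# factoid/list questions, whose exact_answer is a list.
def bioasq_to_squad(question: dict) -> dict:
    context = " ".join(s["text"] for s in question.get("snippets", []))
    exact = question.get("exact_answer", [])
    exact = exact if isinstance(exact, list) else []
    answers = []
    for ans in exact:
        text = ans[0] if isinstance(ans, list) else ans   # first synonym
        start = context.lower().find(text.lower())
        if start >= 0:
            answers.append({"text": context[start:start + len(text)],
                            "answer_start": start})
    return {
        "title": question.get("id", ""),
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": question.get("id", ""),
                "question": question["body"],
                "answers": answers,
                "is_impossible": len(answers) == 0,
            }],
        }],
    }
```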
The postprocessing stage is implemented in the same manner as in [34]. However, for list questions we additionally split the resulting extracted spans by the “and/or” and “or” conjunctions, which we observed to be frequently used in chemical/gene enumerations in biomedical abstracts. Table 3 shows the importance of this step.

Table 3. The performance of the QA model for list questions with and without splitting answers by conjunctions as a postprocessing step. The evaluation is performed on the first batch of 8B.

Method                  Mean Precision  Mean Recall  F-Measure
BioBERT                 0.2750          0.2250       0.2305
BioBERT + conj split    0.3884          0.5629       0.4315

Yes/No questions. For yes/no questions we formulate the task as a logistic regression over question-snippet pairs and implement a transformer-based approach similar to [34]. We use the ALBERT model and fine-tune it on the SQuAD and BioASQ datasets. In the fifth batch, we additionally use the PubMedQA dataset [11] and replace the model with BioBERT. Although PubMedQA contains more than 200 thousand labelled examples, its average question length is twice that of BioASQ questions. We sampled 2 thousand questions with a distribution similar to that of BioASQ questions and incorporated them into the final training set.

2.4 Summarization

Phase B also includes a summarization objective, where a participating system has to generate a paragraph-sized text answering the question. We developed several approaches for tackling this challenge.

Weak baseline. BioASQ does not impose any limitations on the source of the summary. We observed that summaries tend to be one or two sentences long, resembling how snippets are composed. A straightforward approach is therefore to compute the summary from the snippets provided with the question. Our weak baseline selects the first snippet of the question for this purpose.

Snippet Reranking. Naturally, the first snippet may not answer the question directly and clearly, despite being marked as the most relevant. A logical improvement over the baseline is to select an appropriate snippet, potentially in a question-aware manner. To make answers more granular, we split snippets into sentences, so the resulting candidate pool contains both snippets and snippet sentences. Sometimes, however, snippets are absent for a given question; in that case we extract candidate sentences from the relevant abstracts. For re-ranking, we use BERT_rel trained for document re-ranking, as described in Section 2.2. Overall, this system can be described as sentence-level extractive summarization.

Abstractive Summarization. Our final system performs abstractive summarization over the provided snippets. We use a traditional encoder-decoder transformer architecture [28], where the encoder is based on BioMed-RoBERTa [9] and the decoder is trained from scratch, following BertSum [17]. First, we pretrain the model on a summarization dataset based on PubMed, where the target is an arbitrary span from the abstract and the source is a piece of text from which the target can be derived. After that, we fine-tune the model to produce summaries given the question and the concatenation of relevant snippets from the BioASQ training dataset, separated by a special token.

3 Results

In this section, we present the official automatic evaluation of our system, compared to the top competitor systems. We denote our system as “PA”, which stands for the Paper Analyzer team. We additionally perform a retrospective evaluation of Phase A, where the gold answers are available.
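For the retrospective evaluation we recompute the Phase A measures of Section 1.1 over the gold document lists. A simplified sketch of that computation is given below; the official BioASQ evaluation code is authoritative, and the GMAP epsilon and the cap of ten returned documents are assumptions of this sketch.

```python
# Simplified sketch of the Phase A document measures from Section 1.1:
# AP = (1/|L_R|) * sum_r P(r)*rel(r), MAP/GMAP over questions, plus F1.
# The eps for GMAP and the cap of 10 returned documents are assumptions.
from math import exp, log

def average_precision(predicted: list, gold: set) -> float:
    hits, score = 0, 0.0
    for r, doc in enumerate(predicted[:10], start=1):
        if doc in gold:
            hits += 1
            score += hits / r          # P(r) * rel(r)
    return score / len(gold) if gold else 0.0

def phase_a_metrics(runs: list, eps: float = 0.01) -> dict:
    """`runs` is a list of (predicted_doc_list, gold_doc_set) pairs."""
    aps, f1s = [], []
    for predicted, gold in runs:
        aps.append(average_precision(predicted, gold))
        returned = predicted[:10]
        tp = len(set(returned) & gold)
        p = tp / len(returned) if returned else 0.0
        rec = tp / len(gold) if gold else 0.0
        f1s.append(2 * p * rec / (p + rec) if p + rec else 0.0)
    return {
        "MAP": sum(aps) / len(aps),
        "GMAP": exp(sum(log(ap + eps) for ap in aps) / len(aps)),
        "F1": sum(f1s) / len(f1s),
    }
```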
3.1 Document Retrieval

In Table 4, we present the results of our document retrieval system on all batches, compared to the top competitor. The final design of our system was implemented only in the fifth batch. So, to evaluate our proposed system against our own and other participants’ systems from previous batches, we computed evaluation metrics over the gold answers provided by BioASQ for Phase B. We were able to fully reproduce the official leaderboard scores for the fifth batch and show that our final system outperforms all our previous submissions. The retrospective evaluation shows that we significantly improved our system during the contest and achieved better results with the final system.

Table 4. The performance of the document and snippet retrieval system on all batches of task 8B. “final” represents the retrospective evaluation of the batch-5 system on previous batches. “Top Competitor” is the top-scoring submission from other teams.

                                Documents                     Snippets
Batch  System              F-Measure  MAP     GMAP     F-Measure  MAP     GMAP
1      PA final            0.3389     0.3718  0.0156   0.1951     0.8935  0.0019
       PA batch-1          0.2680     0.3346  0.0078   0.1678     0.5449  0.0028
       Top Competitor      0.1748     0.3398  0.0120   0.1752     0.8575  0.0017
2      PA final            0.2689     0.3315  0.0141   0.1487     0.7383  0.0008
       PA batch-2          0.2300     0.3304  0.0185   0.1627     0.3374  0.0047
       Top Competitor      0.2205     0.3181  0.0165   0.1773     0.6821  0.0015
3      PA final            0.3381     0.4303  0.0189   0.1958     0.9422  0.0028
       PA batch-3          0.2978     0.4351  0.0143   0.1967     0.6558  0.0062
       Top Competitor      0.1932     0.4510  0.0187   0.2140     1.0039  0.0056
4      PA final            0.3239     0.4049  0.0189   0.1753     0.9743  0.0015
       PA batch-4          0.3177     0.3600  0.0163   0.1810     0.7163  0.0056
       Top Competitor      0.1967     0.4163  0.0204   0.2151     1.0244  0.0055
5      PA final            0.3963     0.4825  0.0254   0.2491     1.1267  0.0038
       PA batch-5 (final)  0.3963     0.4825  0.0254   0.2491     1.1267  0.0038
       Top Competitor      0.1978     0.4842  0.0330   0.2652     1.0831  0.0086

3.2 Snippet Retrieval

In Table 4, we also present the results of our snippet retrieval system on all batches, compared to the top competitor. As for document retrieval, we performed a retrospective evaluation of the final system on all batches. The evaluation shows that we significantly improved our system during the contest.

3.3 Question Answering

We submitted only baselines for batches 1 and 2, so we present results only for batches 3 onward. Overall, we achieved moderate results on the question answering task, as we mainly focused on Phase A. We believe this was caused by poor selection of the training dataset. We will analyze errors and perform additional experiments in the future. The performance of our system is presented in Tables 5 and 6.

Table 5. The performance of the proposed system on the yes/no questions. “Top Competitor” is the top-scoring submission from other teams.

Batch  System                          Accuracy  F1 yes  F1 no   F1 macro
3      ALBERT (SQuAD, BioASQ)          0.9032    0.9189  0.8800  0.8995
       Top competitor                  0.9032    0.9091  0.8966  0.9028
4      ALBERT (SQuAD, BioASQ)          0.7308    0.7879  0.6316  0.7097
       Top competitor                  0.8462    0.8571  0.8333  0.8452
5      BioBERT (SQuAD, BioASQ, PMQ)    0.8235    0.8333  0.8125  0.8229
       Top competitor                  0.8529    0.8571  0.8485  0.8528
Table 6. The performance of the proposed system on the list and factoid questions. “Top Competitor” is the top-scoring submission from other teams. SAcc, LAcc and MRR refer to factoid questions; Mean Precision, Recall and F-Measure refer to list questions.

Batch  System            SAcc    LAcc    MRR       Mean Prec.  Rec     F-Measure
3      PA                0.2500  0.4643  0.3137    0.5278      0.4778  0.4585
       Top Competitor    0.3214  0.5357  0.3970    0.7361      0.4833  0.5229
4      PA                0.4706  0.5588  0.5098    0.3571      0.3661  0.3030
       Top Competitor    0.5588  0.7353  0.6284    0.5375      0.5089  0.4571
5      PA                0.4375  0.6250  0.5260    0.3075      0.3214  0.3131
       Top Competitor    0.5625  0.7188  0.6354    0.5516      0.5972  0.5618

3.4 Summarization

We evaluated our systems in all five batches; however, we were able to experiment with only one system per batch. The results are presented in Table 7. A simple snippet re-ranker can achieve top scores in the automatic evaluation. Meanwhile, the abstractive summarizer, while providing readable and coherent responses, achieves lower, though still very competitive, scores. We hope that human evaluation will show the opposite result. We include a side-by-side comparison of answers produced by both systems in the appendix (Table 8).

Table 7. The performance of the proposed system on the ideal answers. “Top Competitor” is the top-scoring submission from other teams, chosen by R-SU4 (F1).

Batch  System                  R-2 (Rec)  R-2 (F1)  R-SU4 (Rec)  R-SU4 (F1)
1      Baseline                0.1118     0.1118    0.1116       0.1117
       Top competitor          0.6004     0.3660    0.6035       0.3556
2      Baseline                0.0600     0.0655    0.0615       0.0650
       Top competitor          0.5651     0.3451    0.5725       0.3376
3      Snippet Reranking       0.5235     0.3297    0.5303       0.3256
       Top competitor          0.4980     0.3833    0.5045       0.3811
4      Snippet Reranking       0.5470     0.3087    0.5471       0.3001
       Top competitor          0.5281     0.3069    0.5329       0.2987
5      Abstractive Generation  0.3696     0.3006    0.3688       0.2895
       Top competitor          0.3867     0.3668    0.3805       0.3548

4 Conclusion and Future Work

In this work, we demonstrate several approaches from our first participation in the BioASQ competition. We propose a working system for each of the evaluation objectives of Task 8B, achieving top positions in several batches.

We identify a few areas for improvement. First, a more robust experimentation process is required: after the challenge, we found several errors in the document and snippet retrieval pipelines that could have affected the resulting performance. Second, our system does not utilize semantic indexing, which could be implemented using different biomedical ontologies.

Appendix: Side-by-Side Comparison of Snippet Ranking and Generative Models

Table 8. Examples of answers generated by the snippet ranking and generative question answering systems.

Question: Sclerostin regulates what process?
Gold: Sclerostin plays a critical role in bone homeostasis and its deficiency or pharmacological neutralization increases bone formation.
Snippet Ranking: Sclerostin regulates release of bone mineral by osteocytes by induction of carbonic anhydrase 2.
Generative: Sclerostin is a locally acting regulator of late-osteoblast/preosteocyte differentiation and regulates mineralization through a local/paracrine regulator of bone metabolism rather than as an endocrine hormone. The osteocyte product sclostin, a secreted inhibitor of Wnt signaling and plays an essential role in the regulation of bone mass.

Question: Can CD55 deficiency cause thrombosis?
Gold: Yes, loss of CD55 is associated with thrombosis in patients with paroxysmal nocturnal hemoglobinuria. CD55 deficiency with hyperactivation of complement, angiopathic thrombosis, and protein-losing enteropathy (the CHAPLE syndrome) is caused by abnormal complement activation due to biallelic loss-of-function mutations in CD55.
Snippet Ranking: CD55 Deficiency, Early-Onset Protein-Losing Enteropathy, and Thrombosis
Generative: Yes.
CD55 deficiency with hyperactivation of complement, angiopathic thrombosis, and protein-losing enteropathy (the CHAPLE syndrome) is caused by abnormal complement activation due to biallelic loss-of-function mutations in CD55.

Question: Can prevnar 13 be used in children?
Gold: Yes, PCV13 is approved for routine vaccination of all infants as a 4-dose series at age 2, 4, 6, and 12-15 months, for children who previously received 1 or more doses of the 7-valent pneumococcal conjugate vaccine (PCV7), and for children with underlying medical conditions that increase their risk for pneumococcal disease or its complications.
Snippet Ranking: PCV13 is approved for routine vaccination of all infants as a 4-dose series at age 2, 4, 6, and 12-15 months for children who previously received 1 or more doses of the 7-valent pneumococcal conjugate vaccine (PCV7), and for children with underlying medical conditions that increase their risk for pneumococcal disease or its complications.
Generative: PCV13 is approved for routine vaccination of all infants. PCV 13 is a revision of pneumococcal conjugate vaccine that should be included on pharmacy formularies.

References

[1] Michael Ashburner et al. “Gene Ontology: tool for the unification of biology”. In: Nature Genetics 25 (2000), pp. 25-29.
[2] Olivier Bodenreider. “The Unified Medical Language System (UMLS): integrating biomedical terminology”. In: Nucleic Acids Research 32 Database issue (2004), pp. D267-70.
[3] George Brokos et al. “AUEB at BioASQ 6: Document and Snippet Retrieval”. In: Proceedings of the 6th BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 30-39. doi: 10.18653/v1/W18-5304. url: https://www.aclweb.org/anthology/W18-5304.
[4] Kathi Canese and Sarah Weis. “PubMed: the bibliographic database”. In: The NCBI Handbook [Internet]. 2nd edition. National Center for Biotechnology Information (US), 2013.
[5] Danqi Chen et al. “Reading Wikipedia to Answer Open-Domain Questions”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017). doi: 10.18653/v1/p17-1171. url: http://dx.doi.org/10.18653/v1/P17-1171.
[6] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: (Oct. 2018). arXiv: 1810.04805. url: http://arxiv.org/abs/1810.04805.
[7] Alexios Gidiotis and Grigorios Tsoumakas. “Structured Summarization of Academic Publications”. In: PKDD/ECML Workshops. 2019.
[8] Jiafeng Guo et al. “A Deep Relevance Matching Model for Ad-hoc Retrieval”. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management (2016).
[9] Suchin Gururangan et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks”. In: ACL. 2020.
[10] Stefan Hosein, Daniel Andor, and Ryan T. McDonald. “Measuring Domain Portability and Error Propagation in Biomedical QA”. In: PKDD/ECML Workshops. 2019.
[11] Qiao Jin et al. “PubMedQA: A Dataset for Biomedical Research Question Answering”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019). doi: 10.18653/v1/d19-1259. url: http://dx.doi.org/10.18653/v1/D19-1259.
[12] Zan-Xia Jin et al. “A Multi-strategy Query Processing Approach for Biomedical Question Answering: USTB PRIR at BioASQ 2017 Task 5B”. In: BioNLP. 2017.
[13] Martin Krallinger et al. “BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering”. In: European Conference on Information Retrieval. Springer. 2020, pp. 550-556.
[14] Zhenzhong Lan et al. “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”. In: (Sept. 2019). arXiv: 1909.11942. url: http://arxiv.org/abs/1909.11942.
[15] Mike Lewis et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: ArXiv abs/1910.13461 (2020).
[16] Chin-Yew Lin. “ROUGE: A package for automatic evaluation of summaries”. In: Text Summarization Branches Out. 2004, pp. 74-81.
[17] Yang Liu and Mirella Lapata. “Text Summarization with Pretrained Encoders”. In: EMNLP/IJCNLP. 2019.
[18] Ryan McDonald, George Brokos, and Ion Androutsopoulos. “Deep Relevance Ranking Using Enhanced Document-Query Interactions”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018). doi: 10.18653/v1/d18-1211. url: http://dx.doi.org/10.18653/v1/D18-1211.
[19] Diego Mollá and Christopher Jones. “Classification Betters Regression in Query-Based Multi-document Summarisation Techniques for Question Answering”. In: Communications in Computer and Information Science (2020), pp. 624-635. issn: 1865-0937. doi: 10.1007/978-3-030-43887-6_56. url: http://dx.doi.org/10.1007/978-3-030-43887-6_56.
[20] Mark Neumann et al. “ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing”. In: Proceedings of the 18th BioNLP Workshop and Shared Task. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 319-327. doi: 10.18653/v1/W19-5034. eprint: arXiv:1902.07669. url: https://www.aclweb.org/anthology/W19-5034.
[21] Rodrigo Nogueira and Kyunghyun Cho. “Passage Re-ranking with BERT”. In: arXiv e-prints, arXiv:1901.04085 (Jan. 2019). arXiv: 1901.04085 [cs.IR].
[22] Dimitris Pappas et al. “AUEB at BioASQ 7: Document and Snippet Retrieval”. In: Machine Learning and Knowledge Discovery in Databases. Ed. by Peggy Cellier and Kurt Driessens. Cham: Springer International Publishing, 2020, pp. 607-623. isbn: 978-3-030-43887-6.
[23] Mónica Pineda-Vargas et al. “A Mixed Information Source Approach for Biomedical Question Answering: MindLab at BioASQ 7B”. In: Machine Learning and Knowledge Discovery in Databases. Ed. by Peggy Cellier and Kurt Driessens. Cham: Springer International Publishing, 2020, pp. 595-606. isbn: 978-3-030-43887-6.
[24] Pranav Rajpurkar, Robin Jia, and Percy Liang. “Know What You Don’t Know: Unanswerable Questions for SQuAD”. In: arXiv e-prints, arXiv:1806.03822 (June 2018). arXiv: 1806.03822 [cs.CL].
[25] Pranav Rajpurkar et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text”. In: arXiv:1606.05250 (June 2016). arXiv: 1606.05250 [cs.CL]. url: http://arxiv.org/abs/1606.05250.
[26] Abigail See, Peter J. Liu, and Christopher D. Manning. “Get To The Point: Summarization with Pointer-Generator Networks”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017). doi: 10.18653/v1/p17-1099. url: http://dx.doi.org/10.18653/v1/P17-1099.
[27] George Tsatsaronis et al. “An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition”. In: BMC Bioinformatics 16 (Apr. 2015), p. 138. doi: 10.1186/s12859-015-0564-6.
[28] Ashish Vaswani et al. “Attention is All you Need”. In: ArXiv abs/1706.03762 (2017).
[29] Ellen M. Voorhees. “The TREC question answering track”. In: Natural Language Engineering 7.4 (2001), p. 361.
[30] Georg Wiese, Dirk Weissenborn, and Mariana Neves. “Neural Question Answering at BioASQ 5B”. In: BioNLP 2017 (2017). doi: 10.18653/v1/w17-2309. url: http://dx.doi.org/10.18653/v1/W17-2309.
[31] Peter Willett. “The Porter stemming algorithm: then and now”. In: Program (2006).
[32] Peilin Yang, Hui Fang, and Jimmy Lin. “Anserini: Reproducible Ranking Baselines Using Lucene”. In: J. Data and Information Quality 10.4 (Oct. 2018). issn: 1936-1955. doi: 10.1145/3239571. url: https://doi.org/10.1145/3239571.
[33] Wei Yang et al. “End-to-end open-domain question answering with BERTserini”. In: arXiv preprint arXiv:1902.01718 (2019).
[34] Wonjin Yoon et al. “Pre-trained Language Model for Biomedical Question Answering”. In: arXiv:1909.08229 (Sept. 2019). arXiv: 1909.08229 [cs.CL]. url: http://arxiv.org/abs/1909.08229.