<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of BioASQ 8a and 8b: Results of the eighth edition of the BioASQ tasks a and b</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anastasios Nentidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Krithara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Bougiatiotis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Paliouras</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>National Center for Scientific Research "Demokritos"</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National and Kapodistrian University of Athens</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we present an overview of the eighth edition of tasks a and b of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ aims at promoting methodologies and systems for large-scale biomedical semantic indexing and question answering through the organization of yearly challenges since 2012. These shared tasks offer teams around the world the opportunity to develop and compare their methods on the same benchmark datasets, which represent the demanding information needs of biomedical experts. This year, apart from the introduction of a new task on medical semantic indexing in Spanish (MESINESP8), the eighth versions of the two established BioASQ tasks on semantic indexing (8a) and question answering (8b) in English were also offered. In total, 34 teams with more than 100 systems participated in the three tasks of the challenge, with seven of them focusing on task 8a and 23 on task 8b. As in previous versions of the tasks, the evaluation of system responses reveals that some participating systems managed to outperform the strong baselines, indicating that continuous advancements in state-of-the-art systems keep pushing the frontier of research and leading to performance improvements.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical knowledge</kwd>
        <kwd>Semantic Indexing</kwd>
<kwd>Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        This paper presents the shared tasks 8a and 8b of the eighth edition of the
BioASQ challenge in 2020, the corresponding datasets, and the approaches and
achieved results of the participating systems. A detailed description of the new
task on medical indexing in Spanish is offered in the MESINESP task overview.
A condensed BioASQ 2020 Lab overview [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is also available, describing the
eighth edition of the BioASQ challenge as a whole, in the context of the
Conference and Labs of the Evaluation Forum (CLEF) 2020. In this direction, in
section 2 we provide an overview of the shared tasks 8a and 8b, which took place
from February to May 2020, as well as the corresponding datasets developed for
training and testing the participating systems. In section 3, we briefly overview
the participating systems and the approaches proposed by the corresponding
teams for these two tasks. Detailed descriptions for some of the systems are
also available in the proceedings of the BioASQ lab. In section 4, we present
the results of the evaluation of the participating systems, based on manual
assessment or state-of-the-art evaluation measures, depending on the nature of the
required system response. Finally, we conclude and discuss the eighth version of
the BioASQ tasks a and b in section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Overview of the Tasks</title>
      <p>
        In the eighth version of the BioASQ challenge, three tasks were offered: (1) a
large-scale biomedical semantic indexing task (task 8a), (2) a biomedical question
answering task (task 8b), both considering documents in English, and (3) a new
task on medical semantic indexing in Spanish (task MESINESP). In this section
we provide a brief description of the two established tasks (8a and 8b), with a focus
on differences from previous versions of the challenge [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. A detailed overview
of the initial versions of the tasks and the general structure of BioASQ is also
already available [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Large-scale semantic indexing - Task 8a</title>
        <p>In Task 8a the aim is to classify articles from the PubMed/MedLine4 digital
library into concepts of the MeSH hierarchy. In particular, new PubMed articles
that are not yet annotated by the indexers in NLM are gathered to form the test
sets for the evaluation of the participating systems. Some basic details about
each test set and batch are provided in Table 1. As in previous versions of
the task, the task is divided into three independent batches of 5 weekly test sets
each, providing an on-line and large-scale scenario, and the test sets consist of
new articles without any restriction on the journal of publication. The performance of
the participating systems is calculated using standard flat information retrieval
measures, as well as hierarchical ones, once the annotations from the NLM
indexers become available. As usual, participants have 21 hours to provide their
answers for each test set. However, as it has been observed that new MeSH
annotations are released in PubMed earlier than in previous years, we shifted
the submission period accordingly, to avoid having some annotations available
from NLM while the task is still running. For training, a dataset of 14,913,939
articles with 12.68 labels per article, on average, was provided to the participants.</p>
        <sec id="sec-2-1-1">
          <title>4 https://pubmed.ncbi.nlm.nih.gov/</title>
          <p>
            Task 8b aims at providing a realistic large-scale question answering challenge,
offering the participating teams the opportunity to develop systems for all the
stages of question answering in the biomedical domain. Four types of questions
are considered in the task: "yes/no", "factoid", "list" and "summary" questions
[
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. A training dataset of 3,243 questions annotated with golden relevant elements
and answers is provided for the participants to develop their systems. Table 2
presents some statistics about the training dataset as well as the five test sets.
          </p>
          <p>As in previous versions of the challenge, the task is structured into two phases
that focus on the retrieval of the required information (phase A) and on
answering the question (phase B). In addition, the task is split into five independent
bi-weekly batches, and the two phases for each batch run on two
consecutive days. In each phase, the participants receive the corresponding test set and
have 24 hours to submit the answers of their systems. In particular, in phase
A, a test set of 100 questions written in English is released, and the
participants are expected to identify and submit relevant elements from designated
resources, including PubMed/MedLine articles, snippets extracted from these
articles, concepts and RDF triples. In phase B, the manually selected relevant
articles and snippets for these 100 questions are also released, and the
participating systems are asked to respond with exact answers, that is, entity names
or short phrases, and ideal answers, that is, natural language summaries of the
requested information.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Overview of participation</title>
      <p>This year, 34 teams from institutes around the world participated in the
three tasks of the challenge with more than 100 distinct systems. Seven of these
teams focused on task 8a and 23 on task 8b. As presented in Fig. 1, the
institutions hosting the teams that participated in tasks 8a and 8b are distributed
around the world, highlighting the international interest in the tasks. Compared
to previous versions of the challenge, we observe a shift towards the more
complex question answering task b, where the number of participating teams and
systems has been increasing over the last years, as shown in Fig. 2.
This year, 7 teams participated in the eighth edition of task a, submitting
predictions from 16 different systems in total. Here, we provide a brief overview of
those systems for which a description was available, stressing their key
characteristics. A summary of the participating systems and corresponding approaches
is presented in Table 3.</p>
      <p>
        This year, the LASIGE team from the University of Lisboa, in its "X-BERT
BioASQ" system, proposes a novel approach for biomedical semantic indexing
combining a solution based on Extreme Multi-Label Classification (XMLC) with
a Named Entity Recognition (NER) tool. In particular, their system is based on
X-BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], an approach to scale BERT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to XMLC, combined with the use of
the MER [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] tool to recognize MeSH terms in the abstracts of the articles. The
system is structured into three steps. The first step is the semantic indexing of
the labels into clusters using ELMo [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]; then a second step matches the indices
using a Transformer architecture; and finally, the third step focuses on ranking
the labels retrieved from the previous indices.
      </p>
      <p>
        Other teams improved upon existing systems already participating in
previous versions of the task. Namely, the National Library of Medicine (NLM)
team, in its "NLM CNN" system, enhances the previous version of their "ceb"
systems [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], based on an end-to-end Deep Learning (DL) architecture with
Convolutional Neural Networks (CNN), with SentencePiece tokenization [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The
Fudan University team also builds upon their previous "AttentionXML" [55]
and "DeepMeSH" [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] systems, as well as their new "BERTMeSH" [54] system,
which are based on document-to-vector (d2v) and tf-idf feature embeddings,
learning to rank (LTR), DL-based extreme multi-label text classification,
Attention Mechanisms and Probabilistic Label Trees (PLT) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Finally, this
year's versions of the "Iria" systems [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ] are also based on the same techniques
used by the systems in previous versions of the challenge, which are summarized
in Table 3.
      </p>
      <p>
        Similarly to the previous versions of the challenge, two systems developed by
NLM to facilitate the annotation of articles by indexers in MedLine/PubMed
were available as baselines for the semantic indexing task: MTI [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] as enhanced
in [56], and an extension based on features suggested by the winners of the first
version of the task [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Task 8b</title>
        <p>This version of Task b was tackled by 94 different systems in total, developed by
23 teams. In particular, 8 teams participated in the first phase, on the retrieval of
relevant material required for answering the questions, submitting results from
30 systems. In the second phase, on providing the exact and ideal answers for the
questions, 18 teams participated with 72 distinct systems. Three of the teams
participated in both phases. An overview of the approaches, technologies and
datasets used by the teams is provided in Table 4, and a graphical representation
of them as a word cloud, weighted by their frequency on a logarithmic scale, is also
provided in Fig. 3. Only systems for which a description was available are included
in this section. Detailed descriptions for some of the systems are available in the
proceedings of the workshop.</p>
        <p>
          The "ITMO" team participated in both phases of the task, experimenting
in its "pa" systems with differing solutions across the batches. In general, for
document retrieval the systems follow a two-stage approach. First, they
identify initial candidate articles based on BM25, and then they re-rank them
using variations of BERT [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], fine-tuned for the binary classification task with
the BioASQ dataset and pseudo-negative documents. They extract snippets from
the top documents and rerank them using biomedical Word2Vec, based on cosine
similarity with the question. To extract exact answers they use BERT fine-tuned
on the SQuAD [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] and BioASQ datasets, and employ a post-processing step to split
the answer for list questions and additional fine-tuning on PubMedQA [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for
yes/no questions. Finally, for ideal answers they generate some candidates from
the snippets and their sentences and rerank them using the model used for
phase A. In the last batch, they also experiment with generative summarization,
developing a model based on BioMed-RoBERTa [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] to improve the readability
and consistency of the produced ideal answers.
        </p>
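The two-stage retrieve-then-rerank pattern described above can be sketched as follows. This is a minimal illustration, not the ITMO implementation: the `rerank_score` callback stands in for the fine-tuned BERT classifier, and the BM25 stage is a plain Okapi scorer over whitespace tokens.

```python
from collections import Counter
import math

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

def retrieve_then_rerank(query, docs, rerank_score, k_candidates=100, k_final=10):
    """Stage 1: cheap BM25 candidate selection over the whole collection.
    Stage 2: rescore only the candidates with the expensive neural reranker."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for d in tokenized for t in set(d))
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    q = query.lower().split()
    candidates = sorted(
        range(len(docs)),
        key=lambda i: bm25_score(q, tokenized[i], df, len(docs), avgdl),
        reverse=True)[:k_candidates]
    return sorted(candidates,
                  key=lambda i: rerank_score(query, docs[i]),
                  reverse=True)[:k_final]
```

In practice the reranker sees only `k_candidates` documents per question, which is what makes a cross-encoder affordable at PubMed scale.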
        <p>
          Another team participating in both phases of the task is the "UCSD" team
with its "bio-answerfinder" system. In particular, for phase A they rely on
their previously developed Bio-AnswerFinder system [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], which is also used as a first
step in phase B, for re-ranking the sentences of the snippets provided in the
test set. For identifying the exact answers for factoid and list questions they
experimented with fine-tuning Electra [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and BioBERT [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] on the SQuAD and
BioASQ datasets combined. The answer candidates are then scored considering
the classification probability, the top ranking of corresponding snippets and the number
of occurrences. Finally, a normalization and filtering step is performed and, for
list questions, an enrichment step based on coordinated phrase detection. For
yes/no questions they fine-tune BioBERT on the BioASQ dataset and use
majority voting. For summary questions, they employ hierarchical clustering, based
on weighted relaxed word mover's distance (wRWMD) similarity [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], to group
the top sentences, and select the sentence ranked highest by Bio-AnswerFinder
to be concatenated to form the summary.
        </p>
        <p>
          The "AUEB" team also participated in both phases, focusing on phase A and
briefly experimenting with phase B. Working on extending their previous
top-performing model [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], they experimented with graph-node embeddings
generated from a biomedical entity co-occurrence graph built from publications [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
Moreover, they experimented with new ways to encode and retrieve relevant snippets,
but concluded that conventional BM25 pre-fetching was more efficient. For phase
B, they worked on exact answer extraction. To this end, they experimented
with a SciBERT-based model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] designed for cloze-style biomedical machine
reading comprehension [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] (MRC). However, their initial results indicated that
the MRC task differs greatly from the exact answer extraction task, and they did
not pursue this research direction further.
        </p>
        <p>
          In phase A, the team from the University of Aveiro participated with its
"bioinfo" systems, which consist of a fine-tuned BM25 retrieval model based on
ElasticSearch [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], followed by a neural reranking step. For the latter, they use
an interaction-based model inspired by the DeepRank [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] architecture, building
upon previous versions of their system [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The focus of the improvements was on
the sentence splitting strategy, on the extraction of multiple relevance signals, and on
the independent contribution of each sentence to the final score. The "Google"
team also participated in phase A, with four distinct systems for document
retrieval based on different approaches. In particular, they used a BM25 retrieval
model, a neural retrieval model, initialized with BioBERT and trained on a
large set of questions developed through Synthetic Query Generation (QGen),
and a hybrid retrieval model5 based on a linear blend of BM25 and the neural
model [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. In addition, they also used a reranking model, rescoring the results
of the hybrid model with a cross-attention BERT rescorer [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ].
        </p>
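A linear blend of lexical and neural scores of the kind mentioned above can be sketched as follows. The min-max normalization and the blend weight `alpha` are illustrative assumptions, not details of the Google system; the point is that the two score distributions live on different scales and must be made comparable before blending.

```python
def minmax(xs):
    """Rescale a list of scores into [0, 1]; constant lists map to 0."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_rank(bm25_scores, neural_scores, alpha=0.5, k=10):
    """Rank documents by a convex combination of normalized
    lexical (BM25) and neural retrieval scores."""
    b = minmax(bm25_scores)
    n = minmax(neural_scores)
    blended = [alpha * bi + (1 - alpha) * ni for bi, ni in zip(b, n)]
    return sorted(range(len(blended)), key=blended.__getitem__, reverse=True)[:k]
```

Setting `alpha=1.0` recovers pure BM25 ranking and `alpha=0.0` pure neural ranking, so the blend weight can be tuned on held-out queries.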
        <p>
          In phase B, this year the "KU-DMIS" team participated in both exact and
ideal answers. For exact answers, they build upon their previous
BioBERT-based systems [53] and try to adapt the sequential transfer learning of Natural
Language Inference (NLI) to biomedical question answering. In particular, they
investigate whether learning knowledge of entailment between two sentence pairs
can improve exact answer generation, enhancing their BioBERT-based models
with alternative fine-tuning configurations based on the MultiNLI dataset [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ].
For ideal answer generation, they develop a deep neural abstractive
summarization model based on BART [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and beam search, with a particular focus on
pre-processing and post-processing steps. In particular, alternative systems were
developed, either considering the answers predicted by the exact answer
prediction system in their input or not. In the post-processing step, the generated
candidate ideal answers for each question were scored using the predicted
exact answers and grammar scores provided by the language check tool6. For
factoid and list questions in particular, the BERN [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] tool was also employed
to recognize named entities in the candidate ideal answers for the scoring step.
        </p>
        <p>The "NCU-IISR" team also participated in both parts of phase B,
constructing two BioBERT-based models for extracting the exact answer and ranking the
ideal answers, respectively. The first model is fine-tuned on the BioASQ dataset
formulated as a SQuAD-type QA task that extracts the answer span. For the
second model, they regard the sentences of the provided snippets as candidate
ideal answers and build a ranking model with two parts. First, a BioBERT-based
model takes as input the question and one of the snippet sentences and provides
their representation. Then, a logistic regressor, trained to predict the
similarity between a question and each snippet sentence, takes this representation
and outputs a score, which is used for selecting the final ideal answer.</p>
        <p>The "UoT" team participated with three different DL approaches for
generating exact answers. In their first approach, they fine-tune separately two distinct
BioBERT-based models extended with an additional neural layer depending on
the question type, one for yes/no and one for factoid and list questions
together. In their second system, they use a joint-learning setting, where the same
BioBERT layer is connected with both additional layers and jointly trained
for all types of questions. Finally, in their third system they propose a multi-task
model to learn recognizing biomedical entities and answers to questions
simultaneously, aiming at transferring knowledge from the biomedical entity
recognition task to question answering.</p>
        <sec id="sec-3-1-1">
          <title>5 https://ai.googleblog.com/2020/05/an-nlu-powered-tool-to-explore-covid-19.html</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>6 https://pypi.org/project/language-check/</title>
          <p>
            In particular, they extend their joint BioBERT-based
model with simultaneous training on the BC2GM dataset [
            <xref ref-type="bibr" rid="ref45">45</xref>
            ] for recognizing
gene and protein entities.
          </p>
          <p>
            The "BioNLPer" team also participated in the exact answers part of phase
B, focusing on factoids. They proposed 5 BioBERT-based systems, using
external feature enhancement and auxiliary task methodologies. In particular, in
their "factoid qa model" and "Parameters retrained" systems they consider the
prediction of answer boundaries (start and end positions) as the main task and
the whole answer content prediction as an auxiliary task. In their "Features
Fusion" system they leveraged external features, including NER and part-of-speech
(POS) tags extracted by the NLTK [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ] and ScispaCy [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ] tools, as additional textual
information and fused them with the pre-trained language model representations
to improve answer boundary prediction. Then, in their "BioFusion" system they
combine the two methodologies together. Finally, their "BioLabel" system
employed general and biomedical domain corpus classification as the auxiliary
task to help answer boundary prediction.
          </p>
          <p>
            The "LabZhu" systems participated in phase B as well, with a focus on exact
answers for the factoid and list questions. They treat answer generation as an
extractive machine comprehension task and explore several different pretrained
language models, including BERT, BioBERT, XLNet [51] and SpanBERT [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ].
They also follow a transfer learning approach, training the models on the SQuAD
dataset and then fine-tuning them on the BioASQ datasets. Finally, they also
rely on voting to integrate the results of multiple models. The "umass czi" team
also focused on the exact answer part of phase B, experimenting with
unsupervised representation learning approaches in the context of biomedical QA. In
particular, they considered pretrained representations based on BioBERT,
SciBERT and BioSentVec [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], and experimented with transferring knowledge from the
SQuAD and PubMedQA datasets into the BioASQ 8b QA task. Finally, they
also developed a new pre-training method based on a self-supervised de-noising
approach. In this method, they first generate a QA dataset by randomly replacing
entities automatically recognized by PubTator [
            <xref ref-type="bibr" rid="ref48">48</xref>
            ] in PubMed abstracts. Then,
they train their model on extracting the span of the new entities, given the original
ones as queries.
          </p>
          <p>
            The "MQ" team, as in past years, focused on ideal answers, approaching
the task as query-based summarisation. In some of their systems they retrain
their previous classification and regression approaches [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] on the new training
dataset. In addition, they also employ reinforcement learning with Proximal
Policy Optimization (PPO) [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] and two variants to represent the input
features, namely Word2Vec-based and BERT-based embeddings. The "DAIICT"
team also participated in ideal answer generation, using the standard extractive
summarization techniques TextRank [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] and LexRank [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], as well as sentence
selection techniques based on their similarity with the query. They also modified
these techniques, investigating the effect of query expansion based on UMLS [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]
for sentence selection and summarization.
          </p>
          <p>
            Finally, the "sbert" team also focused on ideal answers. They experimented
with different embedding models and multi-task learning in their systems, using
parts of previous "MQU" systems for the pre-processing of data and the
prediction step based on classification and regression [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]. In particular, they
used a Universal Sentence Embedding Model [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] (BioBERT-NLI7) based on a
version of BioBERT fine-tuned on the SNLI [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] and MultiNLI datasets, as
in Sentence-BERT [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ]. The features were fed to either a single logistic regression
or a classification model to derive the ideal answers. Additionally, in a multi-task
setting, they trained the model on both the classification and regression tasks,
selecting one of them for the final prediction.
          </p>
          <p>
            In this challenge too, the open source OAQA system proposed by [52] served
as a baseline for phase B exact answers. This system, which achieved among the
highest performances in previous versions of the challenge, remains a strong
baseline for the exact answer generation task. The system is developed on top of
the UIMA framework. ClearNLP is employed for question and snippet parsing.
MetaMap, TmTool [
            <xref ref-type="bibr" rid="ref49">49</xref>
            ], C-Value and LingPipe [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] are used for concept
identification, and UMLS Terminology Services (UTS) for concept retrieval. The final
steps include identification of concept, document and snippet relevance based on
classifier components, and finally scoring and ranking techniques.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        7 https://huggingface.co/gsarti/biobert-nli
Standard flat and hierarchical measures are used for measuring the classification performance of the systems. In
particular, the micro F-measure (MiF) and the Lowest Common Ancestor F-measure
(LCA-F) were used to identify the winners for each batch [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. As suggested
by Demsar [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the appropriate way to compare multiple classification systems
over multiple datasets is based on their average rank across all the datasets. In
this task, the system with the best performance on a test set gets rank 1.0 for
this test set, the second best gets rank 2.0, and so on. In case two or more systems tie,
they all receive the average rank. Then, according to the rules of the challenge,
the average rank of each system for a batch is calculated based on the four best
ranks of the system in the five test sets of the batch. The average ranks of each
system, based on both the flat MiF and the hierarchical LCA-F scores, for the
three batches of the task are presented in Table 5.
      </p>
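The ranking scheme described above (rank 1.0 for the best system per test set, shared average ranks for ties, and a batch score averaging each system's four best ranks out of five) can be sketched as:

```python
def ranks(scores):
    """Rank systems on one test set: highest score gets rank 1.0;
    tied systems all receive the average of the tied positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def batch_average_rank(score_table, keep=4):
    """Average each system's `keep` best (lowest) ranks over the
    test sets of a batch; score_table has one score list per test set."""
    per_set = [ranks(col) for col in score_table]
    n_sys = len(score_table[0])
    return [sum(sorted(rs[s] for rs in per_set)[:keep]) / keep
            for s in range(n_sys)]
```

Dropping the worst of the five ranks means a single missed submission does not dominate a system's batch score.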
      <p>The results in Task 8a show that in all test batches and for both flat and
hierarchical measures, the best systems outperform the strong baselines. In
particular, the "dmiip fdu" systems from the Fudan University team achieve the
best performance in all three batches of the task. More detailed results can
be found in the online results page8. Comparing these results with the
corresponding results from previous versions of the task suggests that both the MTI
baseline and the top performing systems keep improving through the years of
the challenge, as shown in Figure 4.</p>
      <sec id="sec-4-1">
        <title>8 http://participants-area.bioasq.org/results/8a/</title>
        <p>Phase A: In the first phase of Task 8b, the systems are ranked according to
the Mean Average Precision (MAP) measure for each of the four types of
annotations, namely documents, snippets, concepts and RDF triples. This year, the
calculation of Average Precision (AP) in MAP for phase A was reconsidered, as
described in the official description of the evaluation measures for Task 8b9. In
brief, since BioASQ3 the participating systems have been allowed to return up to 10
relevant items (e.g. documents), and the calculation of AP was modified to reflect
this change. However, the number of golden relevant items has in recent years
been observed to be lower than 10 in some cases, resulting in relatively small AP
values even for submissions containing all the golden elements. For this reason, this
year we modified the MAP calculation to consider both the limit of 10 elements
and the actual number of golden elements. In Tables 6 and 7 some indicative
preliminary results from batch 2 are presented. The full results are available on
the online results page of Task 8b, phase A10. The results presented here are
preliminary, as the final results for task 8b will be available after the manual
assessment of the system responses by the BioASQ team of biomedical experts.</p>
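The revised AP calculation described above can be sketched as follows. This is a minimal interpretation, under the assumption stated in the text: the denominator becomes min(10, number of golden items), so a submission returning all golden elements can reach an AP of 1 even when fewer than 10 golden items exist.

```python
def average_precision(returned, golden, limit=10):
    """AP over a ranked list truncated at `limit`, normalized by
    min(limit, |golden|) rather than by the fixed limit."""
    golden = set(golden)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(returned[:limit], start=1):
        if item in golden:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(limit, len(golden))

def mean_average_precision(run, qrels, limit=10):
    """MAP over a set of questions; `run` and `qrels` map
    question ids to ranked submissions and golden items."""
    return sum(average_precision(run[q], qrels[q], limit)
               for q in qrels) / len(qrels)
```

With the old fixed denominator of 10, a question with only two golden documents could score at most AP = 0.2 even for a perfect submission; the revised normalization removes that ceiling.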
        <p>
          Phase B: In the second phase of task 8b, the participating systems were
expected to provide both exact and ideal answers. Regarding the ideal answers,
the systems will be ranked according to manual scores assigned to them by
the BioASQ experts during the assessment of system responses [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. For the
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>9 http://participants-area.bioasq.org/Tasks/b/eval meas 2020/ 10 http://participants-area.bioasq.org/results/8b/phaseA/</title>
        <p>exact answers, which are required for all questions except the summary ones,
the measure considered for ranking the participating systems depends on the
question type. For the yes/no questions, the systems were ranked according to
the macro-averaged F1-measure on prediction of no and yes answer. For factoid
questions, the ranking was based on mean reciprocal rank (MRR) and for list
questions on mean F1-measure. Some indicative results for exact answers for the
third batch of Task 8b are presented in Table 8. The full results of phase B of
Task 8b are available online11. These results are preliminary, as the nal results
for Task 8b will be available after the manual assessment of the system responses
by the BioASQ team of biomedical experts.</p>
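The three ranking measures above can be sketched as follows. This is an illustrative assumption, not the official evaluation code: macro-averaged F1 over the "yes" and "no" classes, reciprocal rank of the first correct factoid answer, and set-based F1 for list answers.

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_f1_yesno(golds, preds):
    """Average the per-class F1 of the 'yes' and 'no' answers."""
    scores = []
    for cls in ("yes", "no"):
        tp = sum(g == cls and p == cls for g, p in zip(golds, preds))
        fp = sum(g != cls and p == cls for g, p in zip(golds, preds))
        fn = sum(g == cls and p != cls for g, p in zip(golds, preds))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / 2


def reciprocal_rank(ranked_answers, golden):
    """1/rank of the first correct factoid answer, 0 if none is correct;
    MRR averages this over all factoid questions."""
    for rank, ans in enumerate(ranked_answers, start=1):
        if ans in golden:
            return 1.0 / rank
    return 0.0


def list_f1(predicted, golden):
    """F1 between the predicted and golden sets of list-question entities;
    the official ranking uses its mean over all list questions."""
    tp = len(set(predicted) & set(golden))
    return f1(tp, len(set(predicted)) - tp, len(set(golden)) - tp)
```

Macro-averaging matters for yes/no questions because the class distribution is skewed: a system answering "yes" everywhere gets a high "yes" F1 but zero "no" F1, and the macro average penalizes it accordingly.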
<p>Figure 5 presents the performance of the top systems for each question type
in exact answers during the eight years of the BioASQ challenge. The diagram
reveals that this year the performance of systems in the yes/no questions keeps
improving. For instance, in batch 3, presented in Table 8, various systems manage
to far outperform the strong baseline, which is based on a version of the OAQA
system that achieved top performance in previous years. Improvements are also
observed in the preliminary results for list questions, whereas the top system
performance in factoid questions fluctuates in the same range as last year. In
general, Figure 5 suggests that for the latter two question types there is
still more room for improvement.
11 http://participants-area.bioasq.org/results/8b/phaseB/
Fig. 5. The official evaluation scores of the best performing systems in Task B, Phase
B, exact answer generation, across the eight years of the BioASQ challenge. Since
BioASQ6 the official measure for Yes/No questions is the macro-averaged F1 score
(macro F1), but accuracy (Acc) is also presented, as it was the former official measure.
The results for BioASQ8 are preliminary, as the final results for Task 8b will be
available after the manual assessment of the system responses.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
<p>This paper provides an overview of the eighth version of the BioASQ tasks
a and b, on biomedical semantic indexing and question answering in English
respectively. These tasks, already established through the previous seven years
of the challenge, together with the new MESINESP task on semantic indexing
of medical content in Spanish, which ran for the first time, constituted the eighth
edition of the BioASQ challenge.</p>
      <p>
The overall shift of participating systems towards deep neural approaches,
already noticed in previous years, is even more apparent this year.
State-of-the-art methodologies have been successfully adapted to biomedical question
answering and novel ideas have been investigated. In particular, most of the
systems adopted neural embedding approaches, notably based on BERT and
BioBERT models, for both tasks. In the QA task in particular, different teams
attempted to transfer knowledge from general-domain QA datasets, notably
SQuAD, or from other NLP tasks such as NER and NLI, also experimenting with
multi-task learning settings. In addition, recent advancements in NLP, such as
XLNet [51], BART [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and SpanBERT [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] have also been tested for the tasks
of the challenge.
      </p>
<p>Overall, as in previous versions of the tasks, the top performing systems
were able to advance the state of the art, outperforming the strong
baselines on the challenging shared tasks offered by the organizers. Therefore, we
consider that the challenge keeps meeting its goal of pushing the research frontier
in biomedical semantic indexing and question answering. The future plans for
the challenge include the extension of the benchmark data through a
community-driven acquisition process.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
<p>Google was a proud sponsor of the BioASQ Challenge in 2019. The eighth edition
of BioASQ is also sponsored by Atypon Systems Inc. BioASQ is grateful to
NLM for providing the baselines for task 8a and to the CMU team for providing
the baselines for task 8b. The MESINESP task is sponsored by the Spanish
Plan for advancement of Language Technologies (Plan TL) and the Secretaría
de Estado para el Avance Digital (SEAD). BioASQ is also grateful to LILACS,
SCIELO, Biblioteca virtual en salud and Instituto de salud Carlos III for
providing data for the BioASQ MESINESP task.
51. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.:
XLNet: Generalized autoregressive pretraining for language understanding. CoRR
abs/1906.08237 (2019), http://arxiv.org/abs/1906.08237
52. Yang, Z., Zhou, Y., Nyberg, E.: Learning to answer biomedical questions: OAQA at
BioASQ 4B. ACL 2016 p. 23 (2016)
53. Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained Language Model for
Biomedical Question Answering. In: Seventh BioASQ Workshop: A challenge on
large-scale biomedical semantic indexing and question answering (2019)
54. You, R., Liu, Y., Mamitsuka, H., Zhu, S.: BERTMeSH: Deep
contextual representation learning for large-scale high-performance MeSH
indexing with full text. bioRxiv (2020). https://doi.org/10.1101/2020.07.04.187674,
https://www.biorxiv.org/content/early/2020/07/06/2020.07.04.187674
55. You, R., Zhang, Z., Wang, Z., Dai, S., Mamitsuka, H., Zhu, S.: AttentionXML: Label
tree-based attention-aware deep model for high-performance extreme multi-label
text classification. arXiv preprint arXiv:1811.01727 (2018)
56. Zavorin, I., Mork, J.G., Demner-Fushman, D.: Using learning-to-rank to enhance
NLM Medical Text Indexer results. ACL 2016 p. 8 (2016)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Calling attention to passages for biomedical question answering</article-title>
          .
          <source>In: European Conference on Information Retrieval</source>
          . pp.
          <volume>69</volume>
          –
          <fpage>77</fpage>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Nentidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougiatiotis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez-Penagos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          : Overview of BioASQ
          <year>2020</year>
          :
          <article-title>The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction Proceedings of the Eleventh International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), Thessaloniki, Greece,
          <source>September</source>
          <volume>22</volume>
          –
          <fpage>25</fpage>
          ,
          <year>2020</year>
          , Proceedings. vol.
          <volume>12260</volume>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpenter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          : LingPipe. Available from World Wide Web: http://alias-i.com/lingpipe (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Balikas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partalas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosmopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malakasiotis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baskiotis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artieres</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gallinari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Evaluation framework specifications</article-title>
          .
          <source>Project deliverable D4</source>
          .1,
          <string-name>
            <surname>UPMC</surname>
          </string-name>
          (05/
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SciBERT: A pretrained language model for scientific text</article-title>
          . arXiv preprint arXiv:1903.10676
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The unified medical language system (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic acids research 32(suppl 1)</source>
          ,
          <source>D267–D270</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Angeli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potts</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.:</given-names>
          </string-name>
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          .
          <source>arXiv preprint arXiv:1508.05326</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I.:</given-names>
          </string-name>
          <article-title>X-BERT: eXtreme multi-label text classification using bidirectional encoder representations from transformers</article-title>
          . arXiv preprint arXiv:1905.02331
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>BioSentVec: creating sentence embeddings for biomedical texts</article-title>
          .
          <source>In: 2019 IEEE International Conference on Healthcare Informatics (ICHI)</source>
          . pp.
          <volume>1</volume>
          –
          <issue>5</issue>
          . IEEE (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          : ELECTRA:
          <article-title>Pre-training text encoders as discriminators rather than generators</article-title>
          . arXiv preprint arXiv:2003.10555
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrault</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          .
          <source>arXiv preprint arXiv:1705.02364</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lamurias</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>MER: a shell script and annotation server for minimal named entity recognition and linking</article-title>
          .
          <source>Journal of Cheminformatics</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          ),
          <volume>58</volume>
          (dec
          <year>2018</year>
          ). https://doi.org/10.1186/s13321-018-0312-9
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Demsar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>Journal of Machine Learning Research 7</source>
          ,
          <issue>1</issue>
          –
          <fpage>30</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>NAACL HLT</source>
          2019
          <article-title>- 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies -</article-title>
          <source>Proceedings of the Conference</source>
          <volume>1</volume>
          (
          <issue>Mlm</issue>
          ),
          <volume>4171</volume>
          –4186 (oct
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Erkan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          : LexRank:
          <article-title>Graph-based lexical centrality as salience in text summarization</article-title>
          .
          <source>Journal of artificial intelligence research 22</source>
          ,
          <volume>457</volume>
          –
          <fpage>479</fpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Gormley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Elasticsearch: The definitive guide: A distributed real-time search and analytics engine</article-title>
          . O'Reilly Media, Inc.
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Gururangan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marasovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swayamdipta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          . arXiv preprint arXiv:2004.10964
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking &amp; Other Missing Label Applications</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16</source>
          . pp.
          <volume>935</volume>
          –
          <fpage>944</fpage>
          . ACM Press, New York, New York, USA (
          <year>2016</year>
          ). https://doi.org/10.1145/2939672.2939756
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhingra</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>PubMedQA: a dataset for biomedical research question answering</article-title>
          . arXiv preprint arXiv:1909.06146
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>SpanBERT: Improving pre-training by representing and predicting spans</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          ,
          <issue>64</issue>
          –
          <fpage>77</fpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeon</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sung</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A neural named entity recognition and multi-type normalization tool for biomedical text mining</article-title>
          .
          <source>IEEE Access 7</source>
          ,
          <issue>73729</issue>
          –
          <fpage>73740</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Kosmopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partalas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Evaluation measures for hierarchical classification: a unified view and novel approaches</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>29</volume>
          (
          <issue>3</issue>
          ),
          <volume>820</volume>
          –
          <fpage>865</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Kotitsas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pappas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Apidianaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Embedding biomedical ontologies by jointly encoding network structure and textual node descriptors</article-title>
          . arXiv preprint arXiv:1906.05939
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Kudo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
. pp.
<fpage>66</fpage>
–
<lpage>71</lpage>
          . Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2018</year>
). https://doi.org/10.18653/v1/D18-2012
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
<string-name>
<surname>Kang</surname>
,
<given-names>J.</given-names>
</string-name>
:
<article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
. arXiv preprint arXiv:1901.08746
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghazvininejad</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
<article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
. arXiv preprint arXiv:1910.13461
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>arXiv preprint cs/0205028</source>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
<string-name>
<surname>Ma</surname>
,
<given-names>J.</given-names>
</string-name>
,
          <string-name>
            <surname>Korotkov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Zero-shot neural retrieval via domain-targeted synthetic query generation</article-title>
. arXiv preprint arXiv:2004.14503
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarau</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
:
<article-title>TextRank: Bringing order into text</article-title>
          .
          <source>In: Proceedings of the 2004 conference on empirical methods in natural language processing</source>
. pp.
<fpage>404</fpage>
–
<lpage>411</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Molla</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
<article-title>Classification betters regression in query-based multi-document summarisation techniques for question answering</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
. pp.
<fpage>624</fpage>
–
<lpage>635</lpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Mork</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aronson</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
<article-title>Recent enhancements to the NLM Medical Text Indexer</article-title>
          .
          <source>In: Proceedings of Question Answering Lab at CLEF</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Nentidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougiatiotis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
<article-title>Results of the seventh edition of the BioASQ challenge</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
. pp.
<fpage>553</fpage>
–
<lpage>568</lpage>
          . Springer (
          <year>2019</year>
). https://doi.org/10.1007/978-3-030-43887-6_51
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>King</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ammar</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
<article-title>ScispaCy: Fast and robust models for biomedical natural language processing</article-title>
. arXiv preprint arXiv:1902.07669
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Ozyurt</surname>
            ,
            <given-names>I.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandrowski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grethe</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          :
<article-title>Bio-AnswerFinder: a system to find answers to questions from biomedical texts</article-title>
          .
          <source>Database</source>
<volume>2020</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
<string-name>
<surname>Xu</surname>
,
<given-names>J.</given-names>
</string-name>
,
<string-name>
<surname>Cheng</surname>
,
<given-names>X.</given-names>
</string-name>
:
<article-title>DeepRank: A new deep architecture for relevance ranking in information retrieval</article-title>
          .
          <source>In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</source>
. pp.
<fpage>257</fpage>
–
<lpage>266</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Pappas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brokos</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>AUEB at BioASQ 7: Document and Snippet Retrieval</article-title>
.
<source>In: Seventh BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
(
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Pappas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stavropoulos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
<article-title>BioMRC: A dataset for biomedical machine reading comprehension</article-title>
          .
          <source>In: Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing</source>
. pp.
<fpage>140</fpage>
–
<lpage>149</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>You</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mamitsuka</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>DeepMeSH: deep semantic representation for improving large-scale MeSH indexing</article-title>
          .
          <source>Bioinformatics</source>
          <volume>32</volume>
          (
          <issue>12</issue>
          ),
<fpage>i70</fpage>
–
<lpage>i79</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
<source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
. pp.
<fpage>31</fpage>
–
<lpage>40</lpage>
(Feb
<year>2018</year>
), http://arxiv.org/abs/1802.05365
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Rae</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mork</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Network for Automatic MeSH Indexing</article-title>
.
<source>In: Seventh BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
(
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
,
<string-name>
<surname>Zhang</surname>
,
<given-names>J.</given-names>
</string-name>
,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
:
<article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
<article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
. arXiv preprint arXiv:1908.10084
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <surname>Ribadas</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Campos</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darriba</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romero</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          :
<article-title>CoLe and UTAI at BioASQ 2015: Experiments with similarity based descriptor assignment</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          <volume>1391</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Proximal policy optimization algorithms</article-title>
          .
          <source>arXiv preprint arXiv:1707.06347</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          45.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanabe</surname>
            ,
            <given-names>L.K.</given-names>
          </string-name>
          ,
<string-name>
<surname>nee Ando</surname>
,
<given-names>R.J.</given-names>
</string-name>
,
          <string-name>
            <surname>Kuo</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klinger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganchev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
<article-title>Overview of BioCreative II gene mention recognition</article-title>
          .
          <source>Genome biology</source>
          <volume>9</volume>
          (
          <issue>S2</issue>
          ),
<fpage>S2</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          46.
          <string-name>
            <surname>Tsatsaronis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balikas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malakasiotis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Partalas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zschunke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alvers</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weissenborn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polychronopoulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almirantis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlopoulos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baskiotis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gallinari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artieres</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngonga</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heino</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrio-Alvers</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroeder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
<article-title>An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>16</volume>
          ,
          <issue>138</issue>
          (
          <year>2015</year>
          ). https://doi.org/10.1186/s12859-015-0564-6
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          47.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laliotis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Markantonatos</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Large-Scale Semantic Indexing of Biomedical Publications</article-title>
.
<source>In: 1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering</source>
(
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          48.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kao</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
<article-title>PubTator: a web-based text mining tool for assisting biocuration</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>41</volume>
          (
          <issue>W1</issue>
          ),
<fpage>W518</fpage>
–
<lpage>W522</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          49.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leaman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Beyond accuracy: creating interoperable and scalable text-mining web services</article-title>
          .
          <source>Bioinformatics</source>
          (Oxford, England)
          <volume>32</volume>
          (
          <issue>12</issue>
          ),
<fpage>1907</fpage>
–
<lpage>1910</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1093/bioinformatics/btv760
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          50.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nangia</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowman</surname>
            ,
<given-names>S.R.</given-names>
</string-name>
:
          <article-title>A broad-coverage challenge corpus for sentence understanding through inference</article-title>
          .
          <source>arXiv preprint arXiv:1704.05426</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>