<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Results of the BioASQ tasks of the Question Answering Lab at CLEF 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Balikas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aris Kosmopoulos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Krithara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Paliouras</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kakadiaris</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire d'Informatique de Grenoble</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NCSR "Demokritos"</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Houston</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The goal of the BioASQ challenge is to push research towards highly precise biomedical information access systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of biomedicine. The third challenge consisted of two tasks: semantic indexing and question answering. 59 systems by 18 different teams participated in the semantic indexing task (Task 3a). The question answering task was further subdivided into two phases. 24 systems from 9 different teams participated in the annotation phase (Task 3b, phase A), while 26 systems of 10 different teams participated in the answer generation phase (Task 3b, phase B). Overall, the best systems were able to outperform the strong baselines provided by the organizers. In this paper, we present the data used during the challenge as well as the technologies which were used by the participants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The aim of this paper is to present an overview of the BioASQ challenge in
CLEF 2015. The overview provides information about:
1. the two BioASQ tasks of the Question Answering Lab at CLEF 2015,
2. the data provided during the BioASQ tasks,
3. the systems that participated in the challenge, according to the system
descriptions that we received; detailed descriptions of some of the systems
are given in the lab proceedings which we cite, and
4. the evaluation results of the participating systems, compared to those of
dedicated baseline systems.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview of the Tasks</title>
      <p>
        The challenge comprised two tasks: (1) a large-scale semantic indexing task
(Task 3a) and (2) a question answering task (Task 3b). Information about the
challenge and the nature of the data it provides is available at [
        <xref ref-type="bibr" rid="ref2 ref21">21, 2</xref>
        ].
Large-scale semantic indexing. In Task 3a the goal is to classify documents from
the MEDLINE (http://www.ncbi.nlm.nih.gov/pubmed/) digital library onto concepts of
the MeSH (http://www.ncbi.nlm.nih.gov/mesh/) hierarchy. New MEDLINE articles that
are not yet annotated are collected on a weekly basis and used as test sets for the
evaluation of the participating systems. As soon as the annotations become available
from the MEDLINE curators, the performance of each system is assessed using
standard information retrieval measures as well as hierarchical ones. The winners of
each batch are decided based on their performance in the Micro F-measure (MiF) from
the family of flat measures [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and the Lowest Common Ancestor F-measure (LCA-F) from
the family of hierarchical measures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For completeness, several other flat and
hierarchical measures are reported [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
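      <p>To make the flat MiF measure concrete, the following sketch (ours, for illustration; not the official BioASQ evaluation code) pools true positives, false positives and false negatives over all articles before computing precision, recall and F1. The MeSH heading identifiers in the example are hypothetical.</p>
      <preformat>
# Illustrative micro F-measure (MiF) for multi-label annotations.
def micro_f_measure(gold, predicted):
    """gold, predicted: lists of sets of labels, one set per article."""
    tp = sum(len(g.intersection(p)) for g, p in zip(gold, predicted))
    fp = sum(len(p.difference(g)) for g, p in zip(gold, predicted))
    fn = sum(len(g.difference(p)) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with hypothetical MeSH heading identifiers.
gold = [{"D001921", "D009369"}, {"D003920"}]
pred = [{"D001921"}, {"D003920", "D006973"}]
print(micro_f_measure(gold, pred))  # 0.666...
      </preformat>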
      <p>In order to provide an on-line and large-scale scenario, the task was divided
into three independent batches. In each batch 5 test sets of biomedical articles
were released following a pre-announced schedule. The test sets were released
on a weekly basis (on Monday 17.00 CET) and the participants were asked to
provide their system's answers within 21 hours. Figure 1 gives an overview of
the time plan of Task 3a.</p>
      <sec id="sec-2-1">
        <title>February09</title>
      </sec>
      <sec id="sec-2-2">
        <title>February16</title>
      </sec>
      <sec id="sec-2-3">
        <title>February23</title>
        <p>arch02
M
arch09
M
arch16
M
arch23
M
arch30
M
pril 06
A
pril 13
A
pril 20
A
pril 27
A
ay04
M
ay11
M</p>
        <p>
          Biomedical semantic QA. The goal of Task 3b was to assess the performance
of participating systems in different stages of the question answering process,
ranging from the retrieval of relevant concepts and articles to the generation of
natural-language answers. Task 3b comprised two phases: In phase A, BioASQ
released questions in English from benchmark datasets created by a group of
biomedical experts. There were four types of questions: "yes/no" questions,
"factoid" questions, "list" questions and "summary" questions [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Participants were
asked to respond with relevant concepts (from specific terminologies and
ontologies), relevant articles (from PubMed and PubMed Central,
http://www.ncbi.nlm.nih.gov/pmc/), relevant snippets
extracted from the relevant articles and relevant RDF triples (from specific
ontologies). In phase B, the released questions were accompanied by the correct
answers for a subset of the required elements of phase A, namely documents and
snippets. (In the first two editions of the BioASQ challenge, the datasets released
for Phase B contained relevant articles, snippets, concepts and RDF triples for
each question.) The participants had to answer with exact answers as well as with
paragraph-sized summaries in natural language (dubbed ideal answers).
        </p>
          <p>
            The task was split into five independent batches (see Fig. 2). For each phase,
the participants had 24 hours to submit their answers. We used well-known
measures such as mean precision, mean recall, mean F-measure, mean average
precision (MAP) and geometric MAP (GMAP) to evaluate the performance of
the participants in Phase A. The winners were selected based on MAP. The
evaluation in phase B was carried out manually by biomedical experts on the
ideal answers provided by the systems. For the sake of completeness, ROUGE [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]
was also reported.
          </p>
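          <p>As a rough illustration of the ROUGE family, the sketch below (ours) computes ROUGE-N recall against a single reference summary; the actual toolkit [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] additionally supports stemming, stopword removal and multiple references.</p>
          <preformat>
# Minimal ROUGE-N recall: the proportion of reference n-grams (with clipped
# counts) that also appear in the candidate summary. Shown for N = 2.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=2):
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(1, sum(ref.values()))
          </preformat>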
          <p>
            Task 3a. The systems that participated in the semantic indexing task of the
BioASQ challenge adopted a variety of approaches, based mostly on flat
classification. In the rest of this section we describe the participating systems
and stress their key characteristics.
          </p>
          <p>
            The NCBI system [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], called MeSH Now, was contributed as a baseline
system for the semantic indexing task of 2015. This allowed other participants
to use its predictions, in order to improve their own results. The system is very
similar to that developed by NCBI for the BioASQ2 challenge, based on the
generic learning-to-rank approach presented in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. The main improvements were
the addition of new training data from the third iteration of the challenge and
the submission of two separate runs each week, one favoring high F1 and one
favoring high recall. Improvements were also made to the scalability of the
system, which now runs in parallel on a computer cluster.
          </p>
          <p>
            The AUTH-Atypon system [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] also adopted a flat classification approach.
The approach is based on binary linear SVM models, one for each class. A
MetaLabeler [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is used to predict the number of classes that an instance should be
assigned to. An ensemble of such classifiers, trained on training sets of varying
size and from different time periods, is then used to deal with the problem
of concept drift.
          </p>
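          <p>A minimal sketch of the MetaLabeler idea [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is given below, assuming scikit-learn; it is not the AUTH-Atypon implementation, which additionally uses an ensemble of such models trained on different data sizes and time periods.</p>
          <preformat>
# MetaLabeler sketch: one binary linear SVM per label, plus a regressor that
# predicts how many labels each document should receive; the top-scoring
# labels are then kept.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import Ridge
from sklearn.multiclass import OneVsRestClassifier

def train_metalabeler(X, Y):
    """X: (n_docs, n_features); Y: binary indicator matrix (n_docs, n_labels)."""
    ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    counter = Ridge().fit(X, Y.sum(axis=1))  # regress the label count
    return ovr, counter

def predict_metalabeler(ovr, counter, X):
    scores = ovr.decision_function(X)        # (n_docs, n_labels)
    counts = np.maximum(1, np.rint(counter.predict(X)).astype(int))
    preds = np.zeros_like(scores, dtype=bool)
    for i, k in enumerate(counts):
        preds[i, np.argsort(scores[i])[-k:]] = True  # keep top-k labels
    return preds
          </preformat>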
          <p>
            A domain-independent k-nearest-neighbor approach is adopted by the IIIT
team [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Initially the system uses k-NN in order to find the most relevant MeSH
headings. Then a series of procedures based on POS-tagging, IDF
computation and SVM-rank are used to assign some extra classes to each test
instance and improve the recall of the initial k-NN results. In the final step,
tree-based one-versus-all classifiers (FastXML) are used, which actually take
into account the hierarchical relations between the MeSH terms.
          </p>
          <p>
            Another k-nearest-neighbor approach is that of USI [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], which does not take
into account the hierarchy. The authors claim that the method is generic, since it
does not take the domain into account or use any NLP, although they believe that
an NLP module would boost its performance. Given an instance, the system
finds the k nearest instances in the training corpus and then uses the labels of
these instances to annotate it, by computing semantic similarities. During the
challenge they experimented with various parameters of their system, such as
the value of k, and they also took into account the predictions of the baselines
in order to improve their results.
          </p>
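          <p>Several of the systems above share the same underlying k-nearest-neighbor scheme; a generic version is sketched below. The cosine similarity and the voting threshold are our own illustrative simplifications, not any team's exact settings.</p>
          <preformat>
# Generic k-NN multi-label annotation: retrieve the k most similar training
# documents and let their labels vote, weighted by similarity.
import numpy as np

def knn_labels(x, train_X, train_labels, k=10, min_votes=0.3):
    """x: query vector; train_X: (n, d); train_labels: list of label sets."""
    norms = np.linalg.norm(train_X, axis=1) * np.linalg.norm(x) + 1e-12
    sims = train_X @ x / norms            # cosine similarities
    nearest = np.argsort(sims)[-k:]
    votes = {}
    for i in nearest:
        for label in train_labels[i]:
            votes[label] = votes.get(label, 0.0) + sims[i]
    total = sims[nearest].sum() + 1e-12
    return {label for label, v in votes.items() if v / total >= min_votes}
          </preformat>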
          <p>
            The CoLe and UTAI [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] teams introduce a new approach, compared to
their participation in the previous challenges. This year they use only
conventional information retrieval tools, such as Lucene, combined with k-NN methods.
The authors also experimented with several approaches to index term extraction,
ranging from simple ones to more complex ones requiring the use of NLP.
          </p>
          <p>The ESIS* systems used the Lucene index in order to find useful features for
each of the MeSH classes separately. In this direction, they selected words that
co-occur often with a particular class, as well as the most common terms,
excluding stop words. The decision function follows a k-nearest-neighbor approach:
for each test instance, given the feature extraction process, they find in the
Lucene index the closest training examples, which decide the class of the test
instance. Intuitively, the probability of a class increases if a term that is
strongly associated with it is present and decreases if a frequent term is absent.</p>
          <p>
            The Fudan system [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] uses a learning-to-rank (LTR) method for predicting
MeSH headings. The MeSHLabeler algorithm consists of two components. The
first component, called MeSHRanker, returns an ordered list of MeSH headings
for each test instance. The ranking is determined by a combination of (a) binary
classifiers, one for each MeSH heading, (b) the most similar citations to the
test instance, (c) pattern matching between the MeSH headings and the title of
the abstract and (d) the prediction of the MTI system. The second component,
called MeSHNumber, predicts the actual number of MeSH headings that must be
assigned to each test instance.
          </p>
          <p>
            Table 1 describes the principal technologies that were employed by the
participating systems and whether a hierarchical or a flat approach was adopted.
Baselines. Five systems served as baseline systems for BioASQ Task 3a.
The first one, dubbed BioASQ Baseline, follows a simplistic unsupervised
approach to the problem and is thus easy to beat. The rest of the systems are
implementations of state-of-the-art methods: the Medical Text Indexer (MTI)
and the MTI First Line Index [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] were developed and are maintained by the
National Library of Medicine (NLM; http://ii.nlm.nih.gov/MTI/index.shtml).
They serve as classification systems for
articles of MEDLINE and are actively used by the MEDLINE curators in
order to assist them in the annotation process. Furthermore, MeSH Now BF and
MeSH Now HR were developed by NCBI and were among the best-performing
systems in the second edition of the BioASQ challenge [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. Consequently, we
expected these baselines to be hard to beat.
          </p>
          <p>Task 3b. As mentioned above, the second task of the challenge is further divided into
two phases. In the first phase, where the goal is to annotate questions with
relevant concepts, documents, snippets and RDF triples, 9 teams with 24 systems
participated. In the second phase, where teams are requested to submit exact and
paragraph-sized answers for the questions, 10 teams with 26 different systems
participated.</p>
          <p>
            The OAQA system described in [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] focuses on learning to answer factoid
and list questions. The participants trained three supervised models, using
factoid and list questions from the previous editions of the task. The first is an
answer type prediction model, the second assigns a score to each predicted answer,
while the third is a collective re-ranking model. Although the system also
participated in phase A of Task 3b, its performance was much better in the factoid
and list questions of phase B.
          </p>
          <p>In contrast, the USTB system [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] participated only in phase A of the
challenge. This approach initially uses a sequential dependence model for document
retrieval. It then uses word embeddings (specifically the Word2Vec tool) to rank
the results and improve the document retrieval of the previous step. In the final
step, biomedical concepts and corresponding RDF triples are extracted using
concept recognition tools, such as MetaMap and Banner.</p>
          <p>Another system that focused on phase A is by the IIIT team and is described
in [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]. The authors relied on the PubMed search engine to retrieve relevant
documents. They then applied their own snippet extraction method, which is
based on the similarity of the top 10 sentences of the retrieved documents to
the query.</p>
          <p>
            The HPI system [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] participated in both phases of Task 3b. The system
relies on in-memory database technology in order to map the given questions
to concepts. The Stanford CoreNLP package is used for question tokenization
and the BioASQ services are used for relevant document retrieval. The selection
of snippets from the retrieved documents is performed using string similarity
between terms of the question and words of the documents. Exact and ideal
answers are both extracted using the gold-standard snippets that were provided to
the participants.
          </p>
          <p>
            The Fudan system [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] also participated in the second task of the challenge. For
phase A a language model is used in order to retrieve relevant documents. For
snippet extraction, the retrieved documents are searched for query keywords,
giving extra credit to terms that appear close to the query keywords. Regarding
exact and ideal answers, the system is split into three main components: question
analysis, candidate answer generation and candidate answer ranking.
          </p>
          <p>
            In the system of ILSP and AUEB [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] a different approach to question
answering is presented, based on multi-document summarization of relevant
documents. The system first uses an SVR to assign a score to each
sentence of the relevant documents. The most relevant sentences are then combined
to form an answer. In order to avoid redundancy, two main approaches are
examined: the use of an ILP model and the use of a greedier strategy. Several
versions of the system were examined, which differ in the features and training
data that were used.
          </p>
          <p>
            The YodaQA system, described in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], is a pipeline question answering system
that was altered in order to make it compatible with the BioASQ task. The system
first extracts natural language features from the questions and then searches
its knowledge base for existing answers. It then either directly provides these
passages as answers or performs passage analysis in order to produce answers
from the extracted texts. Each answer is evaluated using a logistic regression
classifier and those with the highest scores are provided as the final answer. The
initial system was designed to answer only factoid questions, so modifications
were necessary in order for it to be able to answer list questions.
          </p>
          <p>
            The final system is SNUMedinfo, described in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Regarding Phase A,
the system participated only in the document retrieval task. The approach was
based on the Indri search engine [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] and the semantic concept-enriched model
presented in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. In phase B, the system participated only in the ideal answer
generation subtask, where it ranked each passage from the provided list based
on the unique keywords it contained. A set of m (a parameter of the system)
passages was then selected, in rank order, keeping only passages that contain a
minimum proportion of new tokens compared to the already selected ones.
          </p>
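          <p>A sketch of this novelty-based selection follows; the whitespace tokenization and the value of the novelty threshold are our own illustrative assumptions rather than the exact choices of [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].</p>
          <preformat>
# Novelty-based passage selection: keep up to m passages in rank order,
# requiring each to contribute a minimum proportion of unseen tokens.
def select_passages(ranked_passages, m, min_new=0.3):
    selected, seen = [], set()
    for passage in ranked_passages:
        tokens = passage.lower().split()
        new = [t for t in tokens if t not in seen]
        if tokens and len(new) / len(tokens) >= min_new:
            selected.append(passage)
            seen.update(tokens)
        if len(selected) == m:
            break
    return selected
          </preformat>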
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        Table 2 describes the principal technologies that were employed by the
participating systems and the phase (A and/or B) in which they participated.
Baselines. The BioASQ baseline of Task 3b phase B is a system similar to
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It applies a multi-document summarization method using Integer Linear
Programming and Support Vector Regression.
      </p>
      <p>During the evaluation phase of Task 3a, the participants submitted their
results on a weekly basis to the online evaluation platform of the challenge
(http://participants-area.bioasq.org/). The
evaluation period was divided into three batches containing 5 test sets each. 18
teams participated in the task with a total of 59 systems. Two training datasets
were provided: the first contains 11,804,715 articles that cover 27,097 MeSH
labels; the second is a subset containing 4,607,922 articles that covers 26,866
MeSH labels. The latter dataset focuses on the journals that also appear in the
test sets. The uncompressed size of these training sets in text format is 19 GB
and 7.4 GB respectively. Table 3 shows the number of articles in each test set of
each batch of the challenge.</p>
      <p>Table 4 presents the correspondence between the system names in the BioASQ
Participants Area Leaderboard for Task 3a and the system descriptions submitted
in the track's working notes. Systems that participated in fewer than 4 test sets
of a batch are not reported in the results; according to the rules of BioASQ, a
system had to participate in at least 4 test sets of a batch in order to be able
to win that batch.</p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the appropriate way to compare multiple classification
systems over multiple datasets is based on their average rank across all the datasets.
      </p>
      <sec id="sec-3-1">
        <title>9 http://participants-area.bioasq.org/</title>
        <p>10 According to the rules of BioASQ, each system had to participate in at least 4 test
sets of a batch in order to be able to win the batch.</p>
      <p>[Tables 3 and 4 are not reproduced here. Table 3 lists the number of articles per test set and batch; Table 4 maps the leaderboard names (MeSH Now HR, MeSH Now BF, auth*, qaiiit system*, Abstract framework, USI 20 neighbours, USI baseline, USI 10 neighbours, iria-*, MeSHLabeler-*) to the corresponding system descriptions.]</p>
      <p>
        On each dataset the system with the best performance gets rank 1.0, the
second best rank 2.0, and so on. In case two or more systems tie, they all receive
the average rank. Table 5 presents the average rank (according to MiF and
LCA-F) of each system over all the test sets of the corresponding batches. Note
that the average ranks are calculated over the 4 best results of each system in
the batch, according to the rules of the challenge
(http://participants-area.bioasq.org/general_information/Task3a/). The
best-ranked system is highlighted in bold typeface; a dash (-) is used whenever
a system participated in fewer than 4 test sets of a batch, and systems that did
not participate in the challenge regularly, i.e., did not submit results for at
least four test sets in at least one of the three batches, are excluded from the
table. As can be noticed, on all three batches and for both flat and hierarchical
measures, the Fudan system [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] clearly outperforms the other
approaches. The AUTH-Atypon system [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] managed to score second in two out
of three batches, while the MeSH-UK0 system scored second in one of the batches.
      </p>
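      <p>The following sketch (ours) implements this ranking scheme of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], including the tie-handling rule; restricting each system to its 4 best results per batch, as done in the challenge, is omitted for brevity.</p>
      <preformat>
# Average-rank comparison across test sets: rank systems per test set
# (best gets 1.0, tied systems share the average of their positions),
# then average each system's ranks over all test sets.
from collections import defaultdict
from itertools import groupby

def average_ranks(scores_per_testset):
    """scores_per_testset: list of {system: score} dicts, higher is better."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in scores_per_testset:
        ordered = sorted(scores.items(), key=lambda kv: -kv[1])
        position = 0
        for _, group in groupby(ordered, key=lambda kv: kv[1]):
            tied = list(group)                 # systems sharing a score
            first, last = position + 1, position + len(tied)
            tied_rank = (first + last) / 2.0   # average of tied positions
            for system, _ in tied:
                totals[system] += tied_rank
                counts[system] += 1
            position = last
    return {s: totals[s] / counts[s] for s in totals}
      </preformat>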
      <p>
        Phase A. In phase A of Task 3b the systems were allowed to submit up to
10 responses per question for any of the corresponding types of annotation, that
is, documents, concepts, snippets and RDF triples. For each of these categories
we rank the systems according to the Mean Average Precision (MAP) measure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The final ranking for each batch is
calculated as the average of the individual rankings in the different categories.
Tables 7 and 8 present the scores of the participating systems for document and
snippet retrieval in the first batch of Phase A. (In contrast to the first two
editions of the challenge, the biomedical experts of BioASQ were not asked to
produce golden concepts and triples prior to the challenge; the ground truth for
concepts and snippets will be constructed by the experts on the basis of the
material provided by the systems.) Note that systems are allowed to participate
in any or all four parts of the task, e.g., SNUMedinfo* retrieved only documents.
      </p>
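      <p>For reference, a standard formulation of MAP over ranked response lists is sketched below (capped at the 10 allowed responses); the official BioASQ implementation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] may differ in details such as the treatment of questions with no relevant items.</p>
      <preformat>
# Average precision of one ranked list and its mean over all questions.
def average_precision(ranked, relevant):
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:10], start=1):
        if item in relevant:
            hits += 1
            score += hits / i        # precision at each relevant hit
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per question."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
      </preformat>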
      <p>
        It is worth noting that document retrieval for the given questions
was the most popular aspect of the task; far fewer systems returned document
snippets, concepts and RDF triples. The detailed results for Task 3b phase A can
be found at http://participants-area.bioasq.org/results/3b/phaseA/.
Phase B. In phase B of Task 3b, the systems were asked to generate exact and
ideal answers. The systems will be ranked according to the manual evaluation
of ideal answers by the BioASQ experts [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For reasons of completeness we
report also the results of the systems for the exact answers. In contrast to the
previous editions of the BioASQ challenge, the test files of Phase B included only
relevant documents and snippets for each question instead of relevant documents,
snippets, concepts and RDF triples. As a result, the participating systems had
less information available in order to construct the exact and the ideal answers.
        </p>
        <p>
          Table 9 shows the results for the exact answers in the first batch of Task 3b.
For systems that did not provide exact answers for a particular kind of question
we use the dash symbol "-". The results of the other batches are available at
http://participants-area.bioasq.org/results/3b/phaseB/; they are not
reproduced here in the interest of space. From those results we can see that
some of the systems achieve very high performance (&gt; 80% accuracy)
in the yes/no questions. The performance in factoid and list questions is not as
good, indicating that there is room for improvement. On the other hand, the
performance on ideal answers has improved compared to the previous years [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
which, in combination with the increase in participation, leads us to believe that
a significant amount of effort was invested by the participants and that the task
is gaining attention. Note that these conclusions are based only on
the automated evaluation measures; the manual assessment was still in progress
at the time of writing this document.
        </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The third edition of the BioASQ challenge has led to a number of interesting
results by the participating systems. Although the baselines that we provided
were quite advanced systems, they were beaten by the best participating systems.
Both tasks have attracted an increasing number of participants and the
number of submissions to the workshop has also increased. Therefore, we believe that
the third edition of the challenge has been another contribution towards better
biomedical information systems. This encourages us to continue the effort and
establish BioASQ as a reference point for research in the area. In future editions
of the challenge, we aim to provide even more benchmark data derived from a
community-driven acquisition process.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The third edition of BioASQ is supported by a conference grant from the
NIH/NLM (number 1R13LM012214-01) and sponsored by the companies Viseo
and Atypon.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Avinash</given-names>
            <surname>Kamineni</surname>
          </string-name>
          , Fatma Nausheen,
          <string-name>
            <given-names>Arpita</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manish</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Manoj</given-names>
            <surname>Chinnakotla</surname>
          </string-name>
          .
          <article-title>Extreme Classification of PubMed Articles using MeSH Labels</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>George</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas,
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Anastasia Krithara, Eric Gaussier, and
          <string-name>
            <given-names>George</given-names>
            <surname>Paliouras</surname>
          </string-name>
          .
          <article-title>Results of the BioASQ Track of the Question Answering Lab at CLEF 2014</article-title>
          ,
          <volume>1181</volume>
          :
          <fpage>93</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Balikas</surname>
          </string-name>
          , Ioannis Partalas, Aris Kosmopoulos, Sergios Petridis, Prodromos Malakasiotis, Ioannis Pavlopoulos, Ion Androutsopoulos, Nicolas Baskiotis, Eric Gaussier, Thierry Artieres, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Gallinari</surname>
          </string-name>
          .
          <article-title>Evaluation Framework Specifications</article-title>
          .
          <source>Project deliverable D4.1</source>
          ,
          <issue>05</issue>
          /
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Petr</given-names>
            <surname>Baudis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Sedivy</surname>
          </string-name>
          .
          <article-title>Biomedical Question Answering using the YodaQA System: Prototype Notes</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>SNUMedinfo at CLEF QA track BioASQ 2015</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          , Jinwook Choi, Sooyoung Yoo, Heechun Kim, and
          <string-name>
            <given-names>Youngho</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Semantic concept-enriched dependence model for medical information retrieval</article-title>
          .
          <source>Journal of biomedical informatics</source>
          ,
          <volume>47</volume>
          :
          <fpage>18</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Janez</given-names>
            <surname>Demsar</surname>
          </string-name>
          .
          <article-title>Statistical Comparisons of Classifiers over Multiple Data Sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Fiorini</surname>
          </string-name>
          , Sylvie Ranwez,
          <string-name>
            <given-names>Sebastien</given-names>
            <surname>Harispe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jacky</given-names>
            <surname>Montmain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Ranwez</surname>
          </string-name>
          .
          <article-title>USI at BioASQ 2015: a semantic similarity-based approach for semantic indexing</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Minlie</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>Recommending MeSH terms for annotating biomedical articles</article-title>
          .
          <source>JAMIA</source>
          ,
          <volume>18</volume>
          (
          <issue>5</issue>
          ):
          <fpage>660</fpage>
          -
          <lpage>667</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. James G. Mork, Dina Demner-Fushman, Susan C. Schmidt, and Alan R. Aronson.
          <article-title>Recent enhancements to the NLM Medical Text Indexer</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , volume
          <volume>1180</volume>
          , Sheffield, UK,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Aris</given-names>
            <surname>Kosmopoulos</surname>
          </string-name>
          , Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and
          <string-name>
            <given-names>Ion</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          .
          <article-title>Evaluation Measures for Hierarchical Classification: a unified view and novel approaches</article-title>
          .
          <source>CoRR, abs/1306.6802</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Chin-Yew Lin</surname>
          </string-name>
          .
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          .
          <source>In Proceedings of the ACL workshop `Text Summarization Branches Out'</source>
          , pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          , Barcelona, Spain,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Prodromos</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , Emmanouil Archontakis, Ion Androutsopoulos, Dimitrios Galanis, and
          <string-name>
            <given-names>Harris</given-names>
            <surname>Papageorgiou</surname>
          </string-name>
          .
          <article-title>Biomedical question-focused multi-document summarization: ILSP and AUEB at BioASQ3</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Yuqing</given-names>
            <surname>Mao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhiyong</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>NCBI at the 2015 BioASQ challenge task: Baseline results from MeSH Now</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Mariana</given-names>
            <surname>Neves</surname>
          </string-name>
          .
          <article-title>HPI question answering system in the BioASQ 2015 challenge</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Papanikolaou</surname>
          </string-name>
          , Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>AUTH-Atypon at BioASQ 3: Large-Scale Semantic Indexing in Biomedicine</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Shengwen</given-names>
            <surname>Peng</surname>
          </string-name>
          , Ronghui You, Zhikai Xie, Yanchun Zhang, and
          <string-name>
            <given-names>Shanfeng</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>The Fudan participation in the 2015 BioASQ Challenge: Large-scale Biomedical Semantic Indexing and Question Answering</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Francisco J. Ribadas, Luis M. de Campos, Víctor M. Darriba, and Alfonso E. Romero.
          <article-title>CoLe and UTAI at BioASQ 2015: experiments with similarity based descriptor assignment</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Strohman</surname>
          </string-name>
          , Donald Metzler, Howard Turtle, and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Indri: A language model-based search engine for complex queries</article-title>
          .
          <source>In Proceedings of the International Conference on Intelligent Analysis</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>2</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Lei</given-names>
            <surname>Tang</surname>
          </string-name>
          , Suju Rajan, and
          <string-name>
            <given-names>Vijay K.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Large scale multi-label classification via MetaLabeler</article-title>
          .
          <source>In Proceedings of the 18th international conference on World wide web, WWW '09</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>220</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>George</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis,
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          , et al.
          <article-title>An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>138</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , Ioannis Katakis, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          .
          <article-title>Mining Multi-label Data</article-title>
          .
          <source>In Oded Maimon and Lior Rokach</source>
          , editors,
          <source>Data Mining and Knowledge Discovery Handbook</source>
          , pages
          <fpage>667</fpage>
          -
          <lpage>685</lpage>
          . Springer US
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Zi</given-names>
            <surname>Yang</surname>
          </string-name>
          , Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Nyberg</surname>
          </string-name>
          .
          <article-title>Learning to Answer Biomedical Factoid and List Questions: OAQA at BioASQ 3B</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24. Harish Yenala, Avinash Kamineni, Manish Shrivastava, and Manoj Chinnakotla.
          <article-title>BioASQ 3b Challenge 2015: Bio-Medical Question Answering System</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. Zhi-Juan Zhang, Tian-Tian Liu, Bo-Wen Zhang, Yan Li, Chun-Hua Zhao, Shao-Hui Feng, Xu-Cheng Yin, and Fang Zhou.
          <article-title>A generic retrieval system for biomedical literatures: USTB at BioASQ2015 Question Answering Task</article-title>
          .
          <source>In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF)</source>
          , Toulouse, France,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>