
Results of the BioASQ Track of the Question Answering Lab at CLEF 2014

George Balikas

Ioannis Partalas

Axel-Cyrille Ngonga Ngomo

Anastasia Krithara

Eric Gaussier

George Paliouras

Pages 1181-1193

The goal of this task is to push the research frontier towards hybrid information systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of bio-medicine. This goal is pursued by the organization of challenges. The second challenge consisted of two tasks: semantic indexing and question answering. For the semantic indexing task, 18 teams participated with 61 systems, of which between 25 and 45 participated in each batch. Phase A of the question answering task was tackled by 22 systems, which were developed by 8 different organizations. Between 15 and 19 of these systems addressed each batch. Phase B of the question answering task was tackled by 18 different systems, developed by 7 different organizations. Between 9 and 15 of these systems submitted results in each batch. Overall, the best systems were able to outperform the strong baselines provided by the organizers.

Introduction

The aim of this paper is twofold. First, we give an overview of the data issued during the BioASQ track of the Question Answering Lab at CLEF 2014. Second, we present the systems that participated in the challenge and for which we received system descriptions. In particular, we evaluate their performance with respect to dedicated baseline systems. To achieve these goals, we begin with a brief overview of the tasks included in the track, including the timing of the different tasks and the challenge data. Thereafter, we give an overview of the systems that participated in the challenge and provided us with a description of the technologies they relied upon. Detailed descriptions of some of the systems are given in the lab proceedings. The evaluation of the systems, which was carried out using state-of-the-art measures or manual assessment, is the last focal point of this paper. The conclusion sums up the results of the track.

Overview of the Tasks

The challenge comprised two tasks: (1) a large-scale semantic indexing task (Task 2a) and (2) a question answering task (Task 2b).

Large-scale semantic indexing. In Task 2a the goal is to classify documents from the PubMed¹ digital library onto concepts of the MeSH² hierarchy. Here, new PubMed articles that are not yet annotated are collected on a weekly basis. These articles are used as test sets for the evaluation of the participating systems. As soon as the annotations are available from the PubMed curators, the performance of each system is calculated using standard information retrieval measures as well as hierarchical ones. The winners of each batch were decided based on their performance in the Micro F-measure (MiF) from the family of flat measures [ 23 ], and the Lowest Common Ancestor F-measure (LCA-F) from the family of hierarchical measures [ 9 ]. For completeness, several other flat and hierarchical measures were reported [ 3 ]. In order to provide an on-line and large-scale scenario, the task was divided into three independent batches. In each batch, 5 test sets of biomedical articles were released consecutively. Each of these test sets was released on a weekly basis, and the participants had 21 hours to provide their answers. Figure 1 gives an overview of the time plan of Task 2a.

[Figure 1: Time plan of Task 2a. Dates marked on the timeline: February 4, March 11, April 15, May 20.]
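As a rough illustration of the Micro F-measure (MiF) used to decide the winners of each batch, the sketch below pools true positives, false positives and false negatives over all documents before computing precision and recall. It assumes the standard micro-averaged definition for multi-label data; the function name `micro_f1` and the toy MeSH-like labels are illustrative only and are not taken from the BioASQ evaluation code.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F-measure over a collection of documents.

    Both arguments are lists of label collections, one per document.
    Counts are pooled over all documents before computing P, R and F.
    """
    tp = fp = fn = 0
    for gold, pred in zip(true_labels, predicted_labels):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # labels correctly assigned
        fp += len(pred - gold)   # labels assigned but not in the gold set
        fn += len(gold - pred)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with two documents (hypothetical labels).
gold = [{"Humans", "Neoplasms"}, {"Mice"}]
pred = [{"Humans", "Mutation"}, {"Mice", "Animals"}]
score = micro_f1(gold, pred)
print(round(score, 4))
```

Because the counts are pooled, frequent labels dominate the score, which is one reason the hierarchical LCA-F measure is reported alongside it.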

Biomedical semantic QA. The goal of Task 2b was to provide a large-scale question answering challenge where the systems should be able to cope with all the stages of a question answering task, including the retrieval of relevant concepts and articles, as well as the provision of natural-language answers. Task 2b comprised two phases. In phase A, BioASQ released questions in English from benchmark datasets created by a group of biomedical experts. There were four types of questions: "yes/no" questions, "factoid" questions, "list" questions and "summary" questions [ 3 ]. Participants had to respond with relevant concepts (from specific terminologies and ontologies), relevant articles (PubMed and PubMedCentral³ articles), relevant snippets extracted from the relevant articles and relevant RDF triples (from specific ontologies). In phase B, the released questions contained the correct answers for the required elements (concepts, articles, snippets and RDF triples) of the first phase. The participants had to answer with exact answers as well as with paragraph-sized summaries in natural language (dubbed ideal answers).

¹ http://www.ncbi.nlm.nih.gov/pubmed/
² http://www.ncbi.nlm.nih.gov/mesh/
³ http://www.ncbi.nlm.nih.gov/pmc/

[Figure: timeline of Task 2b showing Phase A and Phase B. Legible dates: March 3, March 4, March 19, March 20, April 2, April 3.]

The task was split into five independent batches. The two phases for each batch were run with a time gap of 24 hours. For each phase, the participants had 24 hours to submit their answers. We used well-known measures such as mean precision, mean recall, mean F-measure, mean average precision (MAP) and geometric MAP (GMAP) to evaluate the performance of the participants in phase A. The winners were selected based on MAP. The evaluation in phase B was carried out manually by biomedical experts on the ideal answers provided by the systems. For the sake of completeness, ROUGE [ 11 ] is also reported.
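To make the phase A ranking measures concrete, the following sketch computes MAP and GMAP over a set of questions. It assumes the textbook definition of average precision (normalised by the number of relevant items) and a small smoothing constant `epsilon` for GMAP; the exact normalisation and constant used by the BioASQ evaluation are not stated here, so both are assumptions, as are the function names.

```python
import math


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items.

    Assumes the ranked list contains no duplicate items.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per question."""
    aps = [average_precision(ranked, rel) for ranked, rel in runs]
    return sum(aps) / len(aps)


def geometric_map(runs, epsilon=1e-5):
    """Geometric mean of per-question AP; epsilon avoids log(0)."""
    aps = [average_precision(ranked, rel) for ranked, rel in runs]
    return math.exp(sum(math.log(ap + epsilon) for ap in aps) / len(aps))


# Two hypothetical questions with document identifiers d1..d5.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]
map_score = mean_average_precision(runs)
gmap_score = geometric_map(runs)
print(round(map_score, 4), round(gmap_score, 4))
```

GMAP penalises systems that do very poorly on a few questions more strongly than MAP does, since a single near-zero average precision pulls the geometric mean down sharply.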

Overview of Participants

Task 2a

The participating systems in the semantic indexing task of the BioASQ challenge adopted a variety of approaches, including hierarchical and flat algorithms as well as search-based approaches that relied on information retrieval techniques. In the rest of this section we describe the proposed systems and stress their key characteristics.

The new NCBI system [ 26 ] for Task 2a is an extension of the work presented in 2013 and relies on the generic learning-to-rank approach presented in [ 7 ]. This novel approach, dubbed LAMBDA-MART, differs from the previous approach in the following aspects. First, the set of features has been extended to include binary classifier results. In addition, the set of documents used as neighbor documents was reduced to documents indexed after 2009. Moreover, the score function for the selection of the number of features was changed from a linear to a logarithmic approach. Overall, the novel approach achieves an F-measure between 0 (RDF triples) and 0.38 (concepts).

In [ 18 ] flat classification processes were employed for the semantic indexing task. In particular, the authors trained binary SVM classifiers for each label that was present in the data. In order to reduce the complexity, they trained the SVMs on fractions of the data. They trained two systems on different corpora: Asclepius on 950 thousand documents and Hippocrates on 1.5 million. These systems output ranked lists of labels, and a meta-model, namely MetaLabeler [ 22 ], is used to decide the number of labels that will be submitted for each document. The remaining three systems of the team employ ensemble learning methods. The approach that worked best was a combination of Hippocrates with a model of simple binary SVMs, which were trained by changing the weights parameter for positive instances [ 10 ]. When training a classifier with very few positive instances, one can choose to penalize a false negative (a positive instance being misclassified) more than a false positive (a negative instance being misclassified). The proposed approaches, although relatively simple, require a lot of processing power and memory. For that reason the authors used a machine with 40 processors and 1TB RAM.
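The idea of penalizing false negatives more heavily than false positives can be illustrated with a class-weighted hinge loss. This is only a minimal sketch of the general cost-sensitive principle, not the training procedure of [ 18 ]; the function `weighted_hinge_loss` and the parameter `pos_weight` are hypothetical names introduced for this example.

```python
def weighted_hinge_loss(scores, labels, pos_weight=1.0):
    """Mean hinge loss where violations on positive instances are scaled.

    scores: real-valued decision scores of a binary classifier.
    labels: +1 for positive instances, -1 for negative instances.
    pos_weight: multiplier applied to the loss of positive instances, so a
        missed positive costs more than a false alarm when pos_weight > 1.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        violation = max(0.0, 1.0 - label * score)  # standard hinge term
        weight = pos_weight if label == 1 else 1.0
        total += weight * violation
    return total / len(scores)


# One missed positive (score -0.5) and one false alarm (score +0.5).
scores = [-0.5, 0.5]
labels = [1, -1]
unweighted = weighted_hinge_loss(scores, labels)
weighted = weighted_hinge_loss(scores, labels, pos_weight=5.0)
print(unweighted, weighted)
```

With equal weights both errors cost the same; raising `pos_weight` makes the missed positive dominate the objective, which pushes a learner towards higher recall on rare labels.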

Ribadas et al. [ 20 ] employ hierarchical models based on a top-down hierarchical classification scheme [ 21 ] and a Bayesian network which models the hierarchical relations among the labels as well as the training data. The team participated in the first edition of the BioASQ challenge using the same technologies [ 19 ]. In the current competition they focused on the pre-processing of the textual data while keeping the same classification models. More specifically, the authors employ techniques for identifying abbreviations in the text and expanding them afterwards in order to enrich the document. Also, a part-of-speech tagger is used to tokenize the text and identify nouns, verbs, adjectives and unknown (unidentified) elements. Finally, a lemmatization step extracts the canonical forms of those words. Additionally, the authors extract word bigrams and keep only those that are identified as multiword terms. The rationale is that multiword terms in a domain with complex terminology, like biomedicine, provide higher discriminant power.

In [ 5 ] the authors use a standard flat classification scheme, where an SVM is trained for each class label in MeSH. Different training set methodologies are used, resulting in different trained classifiers. Due to computational issues only 50,000 documents were used for training. The selection of the best classification scheme is optimized on the precision at top k labels on a validation set.
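Selecting a classification scheme by precision at top k can be sketched as follows. The sketch assumes each candidate scheme produces a ranked label list per validation document and that the scheme with the highest mean precision at k is kept; the names `precision_at_k` and `select_best_scheme` and the toy data are illustrative assumptions, not details from [ 5 ].

```python
def precision_at_k(ranked_labels, gold_labels, k):
    """Fraction of the top-k ranked labels that are in the gold set (k > 0)."""
    gold = set(gold_labels)
    return sum(1 for label in ranked_labels[:k] if label in gold) / k


def select_best_scheme(schemes, validation_gold, k):
    """Return the scheme name with the highest mean precision at k.

    schemes: dict mapping a scheme name to a list of ranked label lists,
        one list per validation document.
    validation_gold: list of gold label sets, aligned with those lists.
    """
    def mean_precision(rankings):
        values = [precision_at_k(r, g, k)
                  for r, g in zip(rankings, validation_gold)]
        return sum(values) / len(values)

    return max(schemes, key=lambda name: mean_precision(schemes[name]))


# Two hypothetical schemes evaluated on two validation documents.
validation_gold = [{"A", "B"}, {"C"}]
schemes = {
    "scheme_1": [["A", "X"], ["Y", "C"]],
    "scheme_2": [["A", "B"], ["C", "Z"]],
}
best = select_best_scheme(schemes, validation_gold, k=2)
print(best)
```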

In [ 13 ] the authors used the learning to rank (LTR) method for predicting MeSH headings. However, in addition to the information from similar citations, they also used the prediction scores from individual MeSH classifiers to improve the prediction accuracy. In particular, they trained a binary classifier (logistic regression) for each label (MeSH heading). For a target citation, using the trained classifiers, they calculated the annotation probability (score) of every MeSH heading. Then, using NCBI efetch⁴, they retrieved similar citations for the neighbor scores. Finally, these two scores, together with the default results of MTI, the official NLM solution, were considered as features in the LTR framework. LambdaMART [ 4 ] was used as the ranking method in the learning to rank framework.

The authors of [ 1 ] proposed a system which uses Latent Semantic Analysis to identify semantically similar documents in MEDLINE and then constructs a list of MeSH headers from candidates selected from the documents most similar to a new abstract.

⁴ http://www.ncbi.nlm.nih.gov/books/NBK25499/

Table 1 summarizes the principal technologies that were employed by the participating systems and whether a hierarchical or a flat approach was followed.

Baselines. During the first challenge, two systems served as baseline systems. The first one, dubbed BioASQ Baseline, follows an unsupervised approach to tackle the problem, and so it is expected that the systems developed by the participants will outperform it. The second baseline is a state-of-the-art method called Medical Text Indexer (MTI) [ 8 ], which is developed by the National Library of Medicine⁵ and serves as a classification system for articles of MEDLINE. MTI is used by curators to assist them in the annotation process. The new annotator is an extension of the system presented in [ 16 ] with the approaches of last year's winner [ 24 ]. Consequently, we expected the baseline to be difficult to beat.

Task 2b

As mentioned above, the second task of the challenge is split into two phases. In the first phase, where the goal is to annotate questions with relevant concepts, documents, snippets and RDF triples, 8 teams with 22 systems participated. In the second phase, where teams are requested to submit exact and paragraph-sized answers for the questions, 7 teams with 18 different systems participated.

The system presented in [ 17 ] relies on the Hana Database for text processing. It uses the Stanford CoreNLP package for tokenizing the questions. Each token is then sent to the BioPortal and to the Hana database for concept retrieval. The concepts retrieved from the two stores are finally merged into a single list that is used to retrieve relevant text passages from the documents at hand. To this end, four different types of queries are sent to the BioASQ services. Overall, the approach achieves between 0.18 and 0.23 F-measure.

The approach proposed by NCBI [ 26 ] for Task 2b can be used in combination with the approach by the same group for Task 2a. In phase A, NCBI's framework used the cosine similarity between question and sentence to compute their similarity. The best scoring sentence from an abstract was chosen as the relevant snippet for an answer. Concept recognition was achieved by a customized dictionary lookup algorithm in combination with MetaMap. For phase B, tailored approaches were used depending on the question types. For example, a manual set of rules was crafted to determine the answers to factoid and list questions based on the benchmark data for 2013. The system achieved an F-measure between 0.2% (RDF triples) and 38.48% (concepts). It performed very well on yes/no questions (up to 100% accuracy). Factoid and list questions led to an MRR of up to 20.57%.

⁵ http://ii.nlm.nih.gov/MTI/index.shtml
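Snippet selection by cosine similarity between a question and the sentences of an abstract can be sketched as below. The sketch assumes raw term-frequency vectors over lowercased word tokens; the term weighting actually used in [ 26 ] is not specified here, and the helper names `tokenize`, `cosine_similarity` and `best_snippet` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_snippet(question, sentences):
    """Return the sentence most similar to the question."""
    return max(sentences, key=lambda s: cosine_similarity(question, s))


# Hypothetical question and abstract sentences.
question = "Which gene is mutated in cystic fibrosis?"
sentences = [
    "Cystic fibrosis is caused by mutations in the CFTR gene.",
    "The study enrolled 120 patients.",
]
snippet = best_snippet(question, sentences)
print(snippet)
```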

In [ 5 ] the authors participated only in the document retrieval of phase A and in the generation of ideal answers in phase B. The Indri search engine is used to index the PubMed articles, and different models are used to retrieve documents, such as pseudo-relevance feedback, the sequential dependence model and the semantic concept-enriched dependence model, where the recognised UMLS concepts in the query are used as additional dependence features for ranking documents. For the generation of ideal answers, the authors retrieve sentences from documents and identify the common keywords. Then the sentences are ranked according to the number of times these keywords appear in each of them, and finally the top-ranked m sentences are used to form the ideal answer.
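The keyword-based generation of ideal answers can be sketched roughly as follows. The source does not say how "common keywords" are identified, so this sketch assumes they are non-stopword tokens occurring in at least two candidate sentences; the stopword list, the threshold `min_sentences` and all function names are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def common_keywords(sentences, min_sentences=2):
    """Non-stopword tokens that occur in at least min_sentences sentences."""
    sentence_freq = Counter()
    for sentence in sentences:
        sentence_freq.update(set(tokenize(sentence)) - STOPWORDS)
    return {word for word, n in sentence_freq.items() if n >= min_sentences}


def ideal_answer(sentences, m):
    """Join the m sentences containing the most keyword occurrences."""
    keywords = common_keywords(sentences)

    def keyword_count(sentence):
        return sum(1 for token in tokenize(sentence) if token in keywords)

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(sentences, key=keyword_count, reverse=True)
    return " ".join(ranked[:m])


# Hypothetical candidate sentences retrieved for one question.
candidates = [
    "CFTR mutations cause cystic fibrosis.",
    "Cystic fibrosis affects the lungs.",
    "The weather was mild.",
]
answer = ideal_answer(candidates, m=2)
print(answer)
```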

The authors of [ 12 ] propose a method for the retrieval of relevant documents and snippets in Task 2b. They develop a figure-inspired text retrieval method as a way of retrieving documents and text passages from biomedical publications. The method is based on the insight that in biomedical publications the figures play an important role, to the point that their captions can be used to provide abstract-like summaries. The proposed approach takes an information retrieval perspective on the problem. In principle, the steps followed are: (i) the question is enriched by query expansion with information from UMLS, Wikipedia, and figures, (ii) a ranking of full documents and snippets is retrieved from a corpus of PubMed Central articles, which is the set of articles available in full text, (iii) features are extracted for each document and snippet that provide evidence of its relevance to the question, and (iv) the documents and snippets are re-ranked with a learning-to-rank approach.

In the context of phase B of Task 2b, the authors of [ 18 ] attempted to replicate work that already exists in the literature and was presented in the BioASQ 2013 workshop [ 25 ]. They provided exact answers only for the factoid questions. Their system tries to extract the lexical answer type by manipulating the words of the question. Then, the relevant snippets of the question, which are provided as inputs for this task, are processed with the 2013 release of MetaMap [ 2 ] in order to extract candidate answers.

Baselines. Two baselines were used in phase A. The systems return the list of the top-50 and the top-100 entities respectively that may be retrieved using the keywords of the input question as a query to the BioASQ services. As a result, two lists for each of the main entities (concepts, documents, snippets, triples) are produced, of a maximum length of 50 and 100 items respectively.

For the creation of a baseline approach in Task 2b phase B, three approaches were created that address, respectively, the answering of factoid and list questions, summary questions, and yes/no questions [ 25 ]. The three approaches were combined into one system, and they constitute the BioASQ baseline for this phase of Task 2b. The baseline approach for the list/factoid questions uses an ensemble of scoring schemes that attempt to prioritize the concepts that answer the question by assuming that the type of the answer aligns with the lexical answer type (type coercion). The baseline approach for the summary questions introduces a multi-document summarization method using Integer Linear Programming and Support Vector Regression.

Results

Task 2a

During the evaluation phase of Task 2a, the participants submitted their results on a weekly basis to the online evaluation platform of the challenge. The evaluation period was divided into three batches containing 5 test sets each. 18 teams participated in the task with a total of 61 systems. 12,628,968 articles with 26,831 labels (20.31GB) were provided as training data to the participants. Table 2 shows the number of articles in each test set of each batch of the challenge.

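Table 2 (partially recovered). Labels per article for the test sets, in the order extracted: 13.20, 13.13, 13.32, 13.02, 13.07, 13.15, 13.05, 12.28, 12.90, 13.23, 13.58, 13.01, 12.71, 13.37, 13.32, 13.90, 12.70, 13.20, 13.12.

Table 3. Correspondence of reference and submitted systems for Task 2a.

Reference   Systems
[ 18 ]      Asclepius, Hippocrates, Sisyphus
[ 20 ]      cole_hce1, cole_hce2, cole_hce_ne, utai_rebayct, utai_rebayct_2
[ 5 ]       SNUMedInfo*
[ 13 ]      Antinomyra-*
[ 26 ]      L2R*
Baselines   MTIFL, MTI-Default, bioasq_baseline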

Table 3 presents the correspondence of the systems for which a description was available and the submitted systems in Task 2a. The systems MTIFL, MTI-Default and BioASQ Baseline were the baseline systems used throughout the challenge. MTIFL and MTI-Default refer to the NLM Medical Text Indexer system [ 16 ]. Systems that participated in fewer than 4 test sets in each batch are not reported in the results⁷.

According to [ 6 ], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. On each dataset the system with the best performance gets rank 1.0, the second best rank 2.0 and so on. In case two or more systems tie, they all receive the average rank. Table 4 presents the average rank (according to MiF and LCA-F) of each system over all the test sets for the corresponding batches. Note that the average ranks are calculated for the 4 best results of each system in the batch, according to the rules of the challenge⁸. The best ranked system is highlighted with bold typeface.
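The average-rank comparison described above can be sketched as follows: systems are ranked on each dataset, tied systems share the mean of the positions they occupy, and the ranks are then averaged over datasets. The sketch does not implement the challenge rule of keeping only the 4 best results per system, and the function names and scores are illustrative.

```python
def rank_with_ties(scores):
    """Rank systems on one dataset; higher score is better.

    scores: dict mapping system name to its score.
    Tied systems all receive the average of the positions they occupy.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every system tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions i+1..j+1
        for system in ordered[i:j + 1]:
            ranks[system] = average
        i = j + 1
    return ranks


def average_ranks(per_dataset_scores):
    """Average each system's rank over all datasets in which it appears."""
    totals, counts = {}, {}
    for scores in per_dataset_scores:
        for system, rank in rank_with_ties(scores).items():
            totals[system] = totals.get(system, 0.0) + rank
            counts[system] = counts.get(system, 0) + 1
    return {system: totals[system] / counts[system] for system in totals}


# Hypothetical MiF scores of three systems on two test sets.
datasets = [
    {"sys_a": 0.60, "sys_b": 0.58, "sys_c": 0.58},
    {"sys_a": 0.55, "sys_b": 0.61, "sys_c": 0.50},
]
result = average_ranks(datasets)
print(result)
```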

First, we can observe that several systems outperform the strong MTI baseline in terms of the MiF and LCA-F measures, exhibiting state-of-the-art performance. During the first batch, the flat classification approach (Asclepius system) used in [ 18 ] ranked best. In the other two batches, the learning-to-rank systems proposed by NCBI (L2R systems) and Fudan University (Antinomyra systems) were the best-performing ones, occupying the first two places in both measures.

According to the available descriptions, the only systems that made use of the MeSH hierarchy were the ones introduced by [ 19 ]. The top-down hierarchical systems cole_hce1, cole_hce2 and cole_hce_ne achieved mediocre results, while the utai_rebayct systems performed poorly. For the systems based on a Bayesian network this behavior was expected, as they cannot scale well to large problems.

Task 2b

Phase A. Table 5 presents the statistics of the training and test data provided to the participants. The evaluation included five test batches. For phase A of Task 2b, the systems were allowed to submit responses to any of the corresponding types of annotations, that is, documents, concepts, snippets and RDF triples. For each of the categories we rank the systems according to the Mean Average Precision (MAP) measure [ 3 ]. The final ranking for each batch is calculated as the average of the individual rankings in the different categories. The detailed results for Task 2b phase A can be found at http://bioasq.lip6.fr/results/

⁷ According to the rules of BioASQ, each system had to participate in at least 4 test sets of a batch in order to be eligible for the prizes.
⁸ http://bioasq.lip6.fr/general_information/Task1a/

Focusing on specific categories (e.g., concepts or documents), we observe that the Wishart system achieves a balanced behavior with respect to the baselines (Table 7 and Table 6). This is evident from the value of the F-measure, which is much higher than the values of the two baselines. This can be explained by the fact that the Wishart-S1 system responded with short lists, while the baselines always return long lists (50 and 100 items respectively). Similar observations also hold for the other four batches, the results of which are available online.

Phase B. [...] ideal answers. The systems were ranked according to the manual evaluation of ideal answers by the BioASQ experts [ 3 ]. For reasons of completeness we also report the results of the systems for the exact answers.

Conclusion

The participation in the second BioASQ challenge signals a growing recognition of the significance of biomedical question answering in the research community. We observed increased participation in both Tasks 2a and 2b. The baseline that we used this year in Task 2a incorporated techniques from last year's winning system. Although we had more data and thus more possible sources of errors (but also more training data), the best system in the first challenge clearly outperformed the baseline. This suggests an improvement of large-scale classification systems over the last year. The results achieved in Task 2b also suggest that the state of the art was pushed a step further. Consequently, we regard the outcome of the challenge as a success towards pushing the research on bio-medical information systems a step further. In future editions of the challenge, we aim to provide even more benchmark data derived from a community-driven acquisition process.

References

1. Joel Robert Adams and Steven Bedrick. Automatic classification of PubMed abstracts with latent semantic indexing: Working notes. In Proceedings of Question Answering Lab at CLEF, 2014.

2. Alan Aronson and François-Michel Lang. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17:229-236, 2010.

3. Georgios Balikas, Ioannis Partalas, Aris Kosmopoulos, Sergios Petridis, Prodromos Malakasiotis, Ioannis Pavlopoulos, Ion Androutsopoulos, Nicolas Baskiotis, Eric Gaussier, Thierry Artieres, and Patrick Gallinari. Evaluation Framework Specifications. Project deliverable D4.1, 05/2013.

4. Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Technical Report MSR-TR-2010-82, June 2010.

5. Sungbin Choi and Jinwook Choi. Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014. In Proceedings of Question Answering Lab at CLEF, 2014.

6. Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1-30, 2006.

7. Minlie Huang, Aurélie Névéol, and Zhiyong Lu. Recommending MeSH terms for annotating biomedical articles. JAMIA, 18(5):660-667, 2011.

8. James G. Mork, Dina Demner-Fushman, Susan Schmidt, and Alan R. Aronson. Recent enhancements to the NLM Medical Text Indexer. In Proceedings of Question Answering Lab at CLEF, 2014.

9. Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and Ion Androutsopoulos. Evaluation Measures for Hierarchical Classification: a unified view and novel approaches. CoRR, abs/1306.6802, 2013.

10. David Lewis et al. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361-397, 2004.

11. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL workshop "Text Summarization Branches Out", pages 74-81, Barcelona, Spain, 2004.

12. Jessa Lingeman and Laura Dietz. UMass at BioASQ 2014: Figure-inspired text retrieval. In 2nd BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2014.

13. Ke Liu, Junqiu Wu, Shengwen Peng, Chengxiang Zhai, and Shanfeng Zhu. The Fudan-UIUC participation in the BioASQ challenge Task 2a: The Antinomyra system. In Proceedings of Question Answering Lab at CLEF, 2014.

14. Yifeng Liu. BioASQ System Descriptions (Wishart team). Technical report, 2013.

15. Yuqing Mao and Zhiyong Lu. NCBI at the 2013 BioASQ challenge task: Learning to rank for automatic MeSH Indexing. Technical report, 2013.

16. James Mork, Antonio Jimeno-Yepes, and Alan Aronson. The NLM Medical Text Indexer System for Indexing Biomedical Literature, 2013.

17. Mariana Neves. HPI in-memory-based database system in Task 2b of BioASQ. In Proceedings of Question Answering Lab at CLEF, 2014.

18. Yannis Papanikolaou, Dimitrios Dimitriadis, Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, and Ioannis Vlahavas. Ensemble Approaches for Large-Scale Multi-Label Classification and Question Answering in Biomedicine. In 2nd BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2014.

19. Francisco Ribadas, Luis de Campos, Victor Darriba, and Alfonso Romero. Two hierarchical text categorization approaches for BioASQ semantic indexing challenge. In 1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2013.

20. Francisco J. Ribadas-Pena, Luis M. de Campos Ibanez, Victor Manuel Darriba-Bilbao, and Alfonso E. Romero. Cole and UTAI participation at the 2014 BioASQ semantic indexing challenge. In Proceedings of Question Answering Lab at CLEF, 2014.

21. Carlos N. Silla Jr. and Alex A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22:31-72, 2011.

22. Lei Tang, Suju Rajan, and Vijay Narayanan. Large scale multi-label classification via MetaLabeler. In Proceedings of the 18th international conference on World wide web, WWW '09, pages 211-220, New York, NY, USA, 2009. ACM.

23. Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining Multi-label Data. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 667-685. Springer, 2010.

24. Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, and Ioannis Vlahavas. Large-Scale Semantic Indexing of Biomedical Publications. In 1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2013.

25. Dirk Weissenborn, George Tsatsaronis, and Michael Schroeder. Answering Factoid Questions in the Biomedical Domain. In 1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2013.

26. Yuqing Mao, Chih-Hsuan Wei, and Zhiyong Lu. NCBI at the 2014 BioASQ challenge task: large-scale biomedical semantic indexing and question answering. In Proceedings of Question Answering Lab at CLEF, 2014.

27. Dongqing Zhu, Dingcheng Li, Ben Carterette, and Hongfang Liu. An Incremental Approach for MEDLINE MeSH Indexing. In 1st BioASQ Workshop: A challenge on large-scale biomedical semantic indexing and question answering, 2013.