=Paper=
{{Paper
|id=Vol-1391/inv-pap7-CR
|storemode=property
|title=Results of the BioASQ Tasks of the Question Answering Lab at CLEF 2015
|pdfUrl=https://ceur-ws.org/Vol-1391/inv-pap7-CR.pdf
|volume=Vol-1391
|dblpUrl=https://dblp.org/rec/conf/clef/BalikasKKPK15
}}
==Results of the BioASQ Tasks of the Question Answering Lab at CLEF 2015==
Results of the BioASQ tasks of the Question Answering Lab at CLEF 2015

Georgios Balikas (2), Aris Kosmopoulos (1), Anastasia Krithara (1), Georgios Paliouras (1; contact email: paliourg@iit.demokritos.gr), and Ioannis Kakadiaris (3)

(1) NCSR “Demokritos”, Greece
(2) Laboratoire d'Informatique de Grenoble, France
(3) University of Houston, USA

Abstract. The goal of the BioASQ challenge is to push research towards highly precise biomedical information access systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of biomedicine. The third challenge consisted of two tasks: semantic indexing and question answering. 59 systems by 18 different teams participated in the semantic indexing task (Task 3a). The question answering task was further subdivided into two phases: 24 systems from 9 different teams participated in the annotation phase (Task 3b, phase A), while 26 systems from 10 different teams participated in the answer generation phase (Task 3b, phase B). Overall, the best systems were able to outperform the strong baselines provided by the organizers. In this paper, we present the data used during the challenge as well as the technologies used by the participants.

1 Introduction

The aim of this paper is to present an overview of the BioASQ challenge in CLEF 2015. The overview provides information about:
1. the two BioASQ tasks of the Question Answering Lab at CLEF 2015,
2. the data provided during the BioASQ tasks,
3. the systems that participated in the challenge, according to the system descriptions that we have received; detailed descriptions of some of the systems are given in the lab proceedings which we cite,
4. the evaluation results of the participating systems, compared to dedicated baseline systems.

2 Overview of the Tasks

The challenge comprised two tasks: (1) a large-scale semantic indexing task (Task 3a) and (2) a question answering task (Task 3b). Information about the challenge and the nature of the data it provides is available in [21, 2].

Large-scale semantic indexing. In Task 3a the goal is to classify documents from the MEDLINE digital library (http://www.ncbi.nlm.nih.gov/pubmed/) onto concepts of the MeSH hierarchy (http://www.ncbi.nlm.nih.gov/mesh/). Here, new MEDLINE articles that are not yet annotated are collected on a weekly basis. These articles are used as test sets for the evaluation of the participating systems. As soon as the annotations become available from the MEDLINE curators, the performance of each system is assessed using standard information retrieval measures as well as hierarchical ones. The winners of each batch are decided based on their performance in the Micro F-measure (MiF) from the family of flat measures [22], and the Lowest Common Ancestor F-measure (LCA-F) from the family of hierarchical measures [11]. For completeness, several other flat and hierarchical measures are reported [3].

In order to provide an on-line and large-scale scenario, the task was divided into three independent batches. In each batch 5 test sets of biomedical articles were released following a pre-announced schedule. The test sets were released on a weekly basis (on Monday 17.00 CET) and the participants were asked to provide their systems' answers within 21 hours.
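To make the flat measure concrete, the following is a minimal Python sketch (not the official BioASQ evaluation code) of how a micro F-measure can be computed over multi-label MeSH annotations. The article annotations are invented for illustration, and the hierarchical LCA-F measure, which additionally takes the MeSH hierarchy into account via lowest common ancestors, is not covered by this sketch.

```python
def micro_f1(gold, predicted):
    """Micro F-measure over parallel lists of gold and predicted MeSH label sets.

    True positives, false positives and false negatives are pooled over all
    articles before precision and recall are computed, so frequent labels
    dominate the score, as in the flat MiF measure of Task 3a.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented annotations for two articles, purely for illustration.
gold = [{"Humans", "Neoplasms"}, {"Humans", "Alzheimer Disease"}]
pred = [{"Humans", "Neoplasms", "Mice"}, {"Humans"}]
print(micro_f1(gold, pred))  # 0.75 for this toy example
```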
Figure 1 gives an overview of the time plan of Task 3a.

Fig. 1. The time plan of Task 3a.

Biomedical semantic QA. The goal of Task 3b was to assess the performance of participating systems in different stages of the question answering process, ranging from the retrieval of relevant concepts and articles to the generation of natural-language answers. Task 3b comprised two phases. In phase A, BioASQ released questions in English from benchmark datasets created by a group of biomedical experts. There were four types of question: “yes/no” questions, “factoid” questions, “list” questions and “summary” questions [3]. Participants were asked to respond with relevant concepts (from specific terminologies and ontologies), relevant articles (PubMed and PubMed Central articles, http://www.ncbi.nlm.nih.gov/pmc/), relevant snippets extracted from the relevant articles, and relevant RDF triples (from specific ontologies). In phase B, the released questions were accompanied by the correct answers for a subset of the required elements of phase A, namely documents and snippets. (In the first two editions of the BioASQ challenge, the datasets released for Phase B contained relevant articles, snippets, concepts and RDF triples for each question.) The participants had to answer with exact answers as well as with paragraph-sized summaries in natural language (dubbed ideal answers). The task was split into five independent batches (see Fig. 2). For each phase, the participants had 24 hours to submit their answers.

We used well-known measures such as mean precision, mean recall, mean F-measure, mean average precision (MAP) and geometric MAP (GMAP) to evaluate the performance of the participants in Phase A. The winners were selected based on MAP. The evaluation in phase B was carried out manually by biomedical experts on the ideal answers provided by the systems. For the sake of completeness, ROUGE [12] was also reported.

Fig. 2. The time plan of Task 3b. The two phases of each batch ran on consecutive days.

3 Technology Overview of the Participating Systems

3.1 Task 3a

The systems that participated in the semantic indexing task of the BioASQ challenge adopted a variety of approaches, based mostly on flat classification. In the rest of this section we describe the participating systems and stress their key characteristics.

The NCBI system [14], called MeSH Now, was contributed as a baseline system for the semantic indexing task of 2015. This allowed other participants to use its predictions in order to improve their own results. The system is very similar to the one developed by NCBI for the BioASQ2 challenge, based on the generic learning-to-rank approach presented in [9]. The main improvements were the addition of new training data from the third iteration of the challenge and the submission of two separate runs each week, one favoring high F1 and one favoring high recall. The scalability of the system was also improved; it now runs in parallel on a computer cluster.
The AUTH-Atypon system [16] also adopted a flat classification approach. The approach is based on binary linear SVM models for each class. A MetaLabeler [20] is used to predict the number of classes that an instance should be assigned to. An ensemble of such classifiers, trained on variable training data sizes and different time periods, is then used in order to deal with the problem of concept drift.

A domain-independent k-nearest-neighbor approach is adopted by the IIIT team [1]. Initially the system uses k-NN in order to find the most relevant MeSH headings. Then a series of procedures based on POS-tagging, IDF computation and SVM-rank is used in order to assign some extra classes to each test instance and improve the recall of the initial k-NN results. In the final step, tree-based one-versus-all classifiers (FastXML) are used, which actually take into account the hierarchical relations between the MeSH terms.

Another k-nearest-neighbor approach is that of USI [8], which does not take the hierarchy into account. The authors claim that the method is generic, since it does not take into account the domain or use any NLP, although they believe that an NLP module would boost their performance. Given an instance, the system finds the k nearest instances in the training corpus and then uses the labels of these instances for annotating it, by computing semantic similarities. During the challenge they experimented with various parameters of their system, such as the value of k, and they also took into account the predictions of the baselines in order to improve their results.

The CoLe and UTAI [18] teams introduced a new approach, compared to their approach in the previous challenges. This year they used only conventional information retrieval tools, such as Lucene, combined with k-NN methods. The authors also experimented with several approaches to index-term extraction, ranging from simple ones to more complex ones requiring the use of NLP. The ESIS* systems used the Lucene index in order to find useful features for each of the MeSH classes separately. In this direction, they selected words that co-occur often with a particular class, as well as the most common terms excluding stop words. The decision function follows a k-nearest-neighbor approach: for each test instance, and given the feature extraction process, the most relevant training examples are retrieved from the Lucene index and decide the classes of the test instance. Intuitively, the probability of a class increases if a term that is strongly associated with it is present, and decreases if a frequent term is absent.

The Fudan system [17] uses a learning-to-rank (LTR) method for predicting MeSH headings. The MeSHLabeler algorithm consists of two components. The first component, called MeSHRanker, returns an ordered list of MeSH headings for each test instance. The ranking is determined by a combination of (a) binary classifiers, one for each MeSH heading, (b) the most similar citations to the test instance, (c) pattern matching between the MeSH headings and the title of the abstract, and (d) the prediction of the MTI system. The second component, called MeSHNumber, predicts the actual number of MeSH headings that must be assigned to each test instance.
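Several of the systems above share a common skeleton: retrieve the k most similar training citations for a new abstract, let their MeSH headings vote (weighted by similarity), and then decide how many of the top-ranked headings to keep, for instance with a MetaLabeler-style prediction of the label count. The sketch below shows only this generic skeleton; the retrieval, weighting and cut-off models of the actual systems differ and are described in the cited working notes, and all names and numbers here are illustrative.

```python
from collections import defaultdict

def knn_label_scores(neighbours):
    """Score candidate MeSH headings from the k most similar training citations.

    `neighbours` is a list of (similarity, labels) pairs, e.g. the result of a
    Lucene/BM25 search over the annotated training corpus. Each neighbour votes
    for its own MeSH headings with a weight equal to its similarity score.
    """
    scores = defaultdict(float)
    for similarity, labels in neighbours:
        for label in labels:
            scores[label] += similarity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def annotate(neighbours, predicted_label_count):
    """Keep the top-n headings; n would come from a model trained to predict
    how many labels an article should receive (a MetaLabeler-style cut-off)."""
    ranked = knn_label_scores(neighbours)
    return [label for label, _ in ranked[:predicted_label_count]]

# Invented neighbours of one test abstract: (similarity, MeSH headings).
neighbours = [(0.9, {"Humans", "Neoplasms"}),
              (0.7, {"Humans", "Mice", "Neoplasms"}),
              (0.4, {"Mice"})]
print(annotate(neighbours, predicted_label_count=2))  # ['Humans', 'Neoplasms']
```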
Table 1 describes the principal technologies that were employed by the participating systems and whether a hierarchical or a flat approach was adopted.

Table 1. Technologies used by participants in Task 3a.

Team | Approach | Technologies
NCBI [14] | flat | k-NN, learning-to-rank
AUTH-Atypon [16] | flat | SVMs, MetaLabeler [20], Ensembles
IIIT [1] | hierarchical | k-NN, POS-tagging, SVM-rank, FastXML
USI [8] | flat | k-NN, semantic similarities, used Baseline
CoLe and UTAI [18] | flat | k-NN, Lucene
Fudan [17] | flat | Logistic regression, learning-to-rank, used Baseline

Baselines. Five systems served as baselines for BioASQ Task 3a. The first one, dubbed BioASQ Baseline, follows a simplistic unsupervised approach to the problem and is thus easy to beat. The rest of the systems are implementations of state-of-the-art methods: the Medical Text Indexer (MTI) and the MTI First Line Index [10] were developed and are maintained by the National Library of Medicine (NLM, http://ii.nlm.nih.gov/MTI/index.shtml). They serve as classification systems for articles of MEDLINE and are actively used by the MEDLINE curators in order to assist them in the annotation process. Furthermore, MeSH Now BF and MeSH Now HR were developed by NCBI and were among the best-performing systems in the second edition of the BioASQ challenge [14]. Consequently, we expected these baselines to be hard to beat.

3.2 Task 3b

As mentioned above, the second task of the challenge is further divided into two phases. In the first phase, where the goal is to annotate questions with relevant concepts, documents, snippets and RDF triples, 9 teams with 24 systems participated. In the second phase, where teams were requested to submit exact and paragraph-sized answers for the questions, 10 teams with 26 different systems participated.

The OAQA system described in [23] focuses on learning to answer factoid and list questions. The participants trained three supervised models, using factoid and list questions of the previous editions of the task. The first is an answer type prediction model, the second assigns a score to each predicted answer, while the third is a collective re-ranking model. Although the system also participated in phase A of Task 3b, its performance was much better in the factoid and list questions of phase B.

In contrast, the USTB system [25] participated only in phase A of the challenge. This approach initially uses a sequential dependence model for document retrieval. It then uses word embeddings (specifically the Word2Vec tool) to rank the results and improve the document retrieval of the previous step. In the final step, biomedical concepts and corresponding RDF triples are extracted, using concept recognition tools such as MetaMap and Banner.

Another system that focused on phase A is that of the IIIT team, described in [24]. The authors relied on the PubMed search engine to retrieve relevant documents. They then applied their own snippet extraction method, which is based on the similarity between the top 10 sentences of the retrieved documents and the query.

The HPI system [15] participated in both phases of Task 3b. The system relies on in-memory database technology in order to map the given questions to concepts. The Stanford CoreNLP package is used for question tokenization and the BioASQ services are used for relevant document retrieval. The selection of snippets from the retrieved documents is performed using string similarity between terms of the question and words of the documents. Exact and ideal answers are both extracted using the gold-standard snippets that were provided to the participants.
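Snippet selection in several of these systems amounts to scoring candidate sentences of the retrieved documents against the question. The sketch below uses plain token overlap (Jaccard similarity) as a stand-in for the string- and sentence-similarity functions the participating systems actually use; the tokenizer, the question and the candidate sentences are all invented for illustration.

```python
import re

def tokens(text):
    """Very rough tokenizer; the real systems use proper NLP tooling."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_snippets(question, sentences, n=3):
    """Rank candidate sentences by Jaccard token overlap with the question."""
    q = tokens(question)
    scored = []
    for sentence in sentences:
        s = tokens(sentence)
        score = len(q & s) / len(q | s) if q | s else 0.0
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]

# Invented question and candidate sentences from retrieved abstracts.
question = "Which gene is mutated in cystic fibrosis?"
candidates = [
    "Cystic fibrosis is caused by mutations in the CFTR gene.",
    "The study enrolled 42 patients with asthma.",
    "CFTR encodes a chloride channel expressed in epithelial cells.",
]
print(top_snippets(question, candidates, n=2))  # keeps the two CFTR sentences
```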
The Fudan system [17] also participated in the second task of the challenge. For phase A, a language model is used in order to retrieve relevant documents. For snippet extraction, the retrieved documents are searched for query keywords, giving extra credit to terms that appear close to the query keywords. Regarding exact and ideal answers, the system is split into three main components: question analysis, candidate answer generation and candidate answer ranking.

In the system of ILSP and AUEB [13], a different approach to question answering is presented, based on multi-document summarization from relevant documents. The system first uses an SVR in order to assign scores to each sentence of the relevant documents. The most relevant sentences are then combined to form an answer. In order to avoid redundancy, two main approaches are examined: the use of an ILP model and the use of a greedy strategy. Several versions of the system were examined, which differ in the features and training data that were used.

The YodaQA system, described in [4], is a pipeline question answering system that was altered in order to make it compatible with the BioASQ task. The system first extracts natural language features from the questions and then searches its knowledge base for existing answers. It then either directly provides these passages as answers or performs passage analysis in order to produce answers from the extracted texts. Each answer is evaluated using a logistic regression classifier and those with the highest scores are provided as the final answer. The initial system was designed to answer only factoid questions, so modifications were necessary in order to be able to answer list questions.

The final system is SNUMedinfo, described in [5]. Regarding Phase A, the system participated only in the document retrieval task. The approach was based on the Indri search engine [19] and the semantic concept-enriched model presented in [6]. In phase B, the system participated only in the ideal answer generation subtask, where it ranked each passage from the provided list based on the unique keywords it contained. A set of m (a parameter of the system) passages was selected, in rank order, by selecting only passages that contain a minimum proportion of new tokens compared to the already selected ones.

Table 2 describes the principal technologies that were employed by the participating systems and the phase(s) (A and/or B) in which they participated.

Table 2. Technologies used by participants in Task 3b.

Reference | Phase | Technologies
OAQA [23] | A, B | supervised learning, collective re-ranking model
USTB [25] | A | Word Embeddings, MetaMap, Banner
IIIT [24] | A | PubMed search engine, sentence similarity
HPI [15] | A, B | Stanford CoreNLP, string similarity
Fudan [17] | A, B | language model, word similarity, ranking
ILSP-AUEB [13] | A, B | multi-document summarization, ILP model, greedy strategy
YodaQA [4] | A, B | natural language features, logistic regression
SNUMedinfo [5] | A, B | Indri search engine, semantic concept-enriched model

Baselines. The BioASQ baseline of Task 3b phase B is a system similar to [13]. It applies a multi-document summarization method using Integer Linear Programming and Support Vector Regression.
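The greedy strategy for ideal answers mentioned above can be summarized as: repeatedly take the highest-scoring remaining sentence, but skip sentences that add too little new content, in the spirit of the SNUMedinfo new-token criterion and the ILSP/AUEB greedy variant. The sketch below is only an illustration of that idea; the sentence scores (e.g., from an SVR), the redundancy threshold and the length limit are placeholders, not the settings of any participating system.

```python
def greedy_ideal_answer(scored_sentences, min_new_ratio=0.5, max_sentences=3):
    """Build a paragraph-sized answer from (score, sentence) pairs.

    Sentences are considered in decreasing score order; a sentence is kept only
    if at least `min_new_ratio` of its tokens have not already been covered by
    the selected sentences (a crude anti-redundancy filter).
    """
    selected, seen_tokens = [], set()
    for _, sentence in sorted(scored_sentences, reverse=True):
        words = sentence.lower().split()
        new = [w for w in words if w not in seen_tokens]
        if not words or len(new) / len(words) < min_new_ratio:
            continue  # mostly redundant with what has been selected already
        selected.append(sentence)
        seen_tokens.update(words)
        if len(selected) == max_sentences:
            break
    return " ".join(selected)

# Invented sentence scores (e.g. produced by an SVR relevance model).
scored = [(0.9, "CFTR mutations cause cystic fibrosis."),
          (0.8, "CFTR mutations cause cystic fibrosis in children."),
          (0.6, "The CFTR protein is a chloride channel.")]
print(greedy_ideal_answer(scored))  # the near-duplicate second sentence is skipped
```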
4 Results

4.1 Task 3a

During the evaluation phase of Task 3a, the participants submitted their results on a weekly basis to the online evaluation platform of the challenge (http://participants-area.bioasq.org/). The evaluation period was divided into three batches containing 5 test sets each. 18 teams participated in the task with a total of 59 systems.

Two training datasets were provided: the first contains 11,804,715 articles that cover 27,097 MeSH labels; the second is a subset containing 4,607,922 articles and covers 26,866 MeSH labels. The latter dataset focuses on the journals that also appear in the test sets. The uncompressed size of these training sets in text format is 19 GB and 7.4 GB respectively. Table 3 shows the number of articles in each test set of each batch of the challenge. Table 4 presents the correspondence between the system names on the BioASQ Participants Area leaderboard for Task 3a and the system descriptions submitted in the track's working notes. Systems that participated in fewer than 4 test sets of a batch are not reported in the results (according to the rules of BioASQ, each system had to participate in at least 4 test sets of a batch in order to be able to win the batch).

Table 3. Statistics on the test datasets of Task 3a.

Batch | Articles | Annotated articles | Labels per article
1 | 21,014 | 14,145 | 13.03
1 | 4,435 | 3,338 | 13.27
1 | 3,638 | 2,906 | 13.29
1 | 2,153 | 1,625 | 13.27
1 | 5,725 | 4,223 | 13.10
Subtotal | 36,965 | 26,237 | 13.12
2 | 3,617 | 2,634 | 12.60
2 | 4,725 | 3,020 | 12.97
2 | 4,861 | 3,342 | 13.41
2 | 2,902 | 2,254 | 12.89
2 | 4,059 | 2,911 | 12.67
Subtotal | 20,164 | 14,161 | 12.93
3 | 3,902 | 2,937 | 13.40
3 | 4,027 | 2,822 | 13.49
3 | 3,162 | 2,116 | 13.29
3 | 3,621 | 2,299 | 13.56
3 | 3,842 | 2,362 | 12.82
Subtotal | 18,554 | 12,536 | 13.32
Total | 72,430 | 52,934 | 13.11

Table 4. Correspondence between the public names of the participating teams on the BioASQ Participants Area leaderboard and their submissions in the lab working notes.

Reference | Systems
[14] | MeSH Now HR, MeSH Now BF
[16] | auth*
[1] | qaiiit system *
[8] | Abstract framework, USI 20 neighbours, USI baseline, USI 10 neighbours
[18] | iria-*
[17] | MeSHLabeler-*

According to [7], the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. On each dataset the system with the best performance gets rank 1.0, the second best rank 2.0, and so on. In case two or more systems tie, they all receive the average rank. Table 5 presents the average rank (according to MiF and LCA-F) of each system over all the test sets of the corresponding batches. Note that the average ranks are calculated over the 4 best results of each system in the batch, according to the rules of the challenge (http://participants-area.bioasq.org/general_information/Task3a/). The best ranked system is highlighted with bold typeface. As can be noticed, on all three batches and for both flat and hierarchical measures, the Fudan system [17] clearly outperforms the other approaches. The AUTH-Atypon system [16] managed to score second in two out of three batches, while MeSH-UK0 scored second in one of the batches.
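The average ranks reported in Table 5 below follow exactly this procedure: on each test set the systems are ranked by the measure (rank 1 is best), tied systems share the average of the positions they span, and the per-test-set ranks of each system are then averaged. A small illustrative sketch, with invented system names and scores:

```python
def ranks_with_ties(scores):
    """Rank systems on one test set (higher score is better, rank 1 is best).

    Tied systems receive the average of the rank positions they span, e.g. two
    systems tied for first place both get rank 1.5.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of positions i+1 .. j
        for system, _ in ordered[i:j]:
            ranks[system] = avg_rank
        i = j
    return ranks

# Invented MiF scores of three systems on two test sets.
test_sets = [{"A": 0.61, "B": 0.61, "C": 0.55},
             {"A": 0.63, "B": 0.60, "C": 0.58}]
per_set = [ranks_with_ties(s) for s in test_sets]
average_rank = {sys: sum(r[sys] for r in per_set) / len(per_set) for sys in per_set[0]}
print(average_rank)  # {'A': 1.25, 'B': 1.75, 'C': 3.0}
```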
Table 5. Average ranks for each system across the batches of Task 3a for the measures MiF and LCA-F. A dash (-) is used whenever the system participated fewer than 4 times in the batch. Systems that did not participate in the challenge regularly, i.e., did not submit results for at least four test sets in at least one of the three batches, are excluded from the table.

System | Batch 1 MiF | Batch 1 LCA-F | Batch 2 MiF | Batch 2 LCA-F | Batch 3 MiF | Batch 3 LCA-F
auth1 | 7.5 | 7.0 | 10.5 | 8.5 | 10.0 | 8.0
qaiiit system 1 | - | - | 25.0 | 25.0 | - | -
TextCategorisation5 | 8.5 | 9.5 | - | - | - | -
MeSH-UK2 | - | - | - | - | 9.0 | 7.75
Dexstr system | - | - | 24.25 | 23.5 | - | -
USI 20 neighbours | 15.25 | 13.0 | 15.25 | 14.75 | 16.75 | 16.75
iria-1 | 16.5 | 16.0 | 21.25 | 20.75 | 17.25 | 17.25
pseudo n-grams | - | - | - | - | 27.25 | 27.0
iria-4 | - | - | - | - | 24.25 | 24.5
auth2 | 7.5 | 8.75 | 7.0 | 9.0 | 7.0 | 7.5
test unibitri | - | - | - | - | 20.0 | 20.5
auth3 | 4.25 | 3.75 | 5.25 | 6.5 | 5.0 | 4.75
it is a test submit | 22.5 | 22.0 | 26.0 | 25.5 | 28.0 | 28.0
MeSHLabeler-3 | 2.25 | 3.25 | 1.0 | 1.0 | 2.25 | 2.5
MeSH-UK0 | - | - | 8.0 | 7.25 | 7.75 | 10.75
MeSHLabeler-1 | 2.5 | 1.75 | 2.5 | 3.0 | 2.0 | 2.0
fork-fork | 17.75 | 18.0 | 16.75 | 17.75 | - | -
TextCategorisation3 | 8.75 | 11.0 | - | - | - | -
MeSH-UK4 | - | - | 6.25 | 7.75 | 10.5 | 11.5
TextCategorisation1 | 11.25 | 12.75 | - | - | - | -
testLee15 | - | - | - | - | 27.0 | 26.5
spoon-spoon | 16.25 | 16.5 | 14.5 | 16.0 | - | -
IMI-KOI | - | - | - | - | 30.0 | 30.5
auth4 | 10.0 | 10.5 | 8.0 | 9.75 | 4.5 | 4.5
MeSHLabeler-4 | 1.0 | 2.25 | 2.75 | 2.5 | 2.5 | 3.5
MeSH-UK3 | - | - | 7.75 | 11.0 | 10.25 | 12.0
BioASQ Baseline | 24.25 | 24.25 | 27.75 | 27.5 | 29.25 | 29.0
MeSHLabeler-2 | 4.5 | 3.75 | 1.75 | 1.75 | 3.0 | 2.0
Default MTI | 12.0 | 9.5 | 14.0 | 13.0 | 15.75 | 13.75
MeSHLabeler | 3.5 | 2.25 | 2.5 | 2.25 | 3.75 | 3.25
Abstract framework | 17.5 | 18.75 | 17.25 | 18.0 | 19.5 | 20.25
iria-2 | - | - | 21.0 | 20.0 | 21.75 | 22.0
MeSH Now BF | 8.25 | 7.75 | 11.25 | 7.75 | 13.0 | 9.75
MeSH Now HR | 23.75 | 23.75 | 20.75 | 20.75 | 31.25 | 31.75
USI 10 neighbours | 18.5 | 17.75 | 18.25 | 17.5 | 20.5 | 19.75
IMI-KOI R | - | - | - | - | 31.25 | 30.75
iria-3 | - | - | 20.0 | 19.75 | 22.0 | 22.25
iria-mix | - | - | - | - | 14.0 | 13.0
MTI First Line Index | 16.0 | 12.75 | 16.0 | 15.25 | 18.5 | 18.0
USI baseline | 6.25 | 6.0 | 11.5 | 9.25 | 14.5 | 13.5
TextCategorisation4 | 8.5 | 9.5 | - | - | - | -
IIIT system 2 | - | - | 18.75 | 19.0 | - | -
MeSH-UK1 | - | - | 4.5 | 4.5 | 9.25 | 9.75
TextCategorisation2 | 10.0 | 11.25 | - | - | - | -

4.2 Task 3b

Phase A. Table 6 presents the statistics of the test data that were provided to the participants. The evaluation included five test batches. For phase A of Task 3b the systems were allowed to submit up to 10 responses per question for any of the corresponding types of annotation, that is, documents, concepts, snippets and RDF triples. For each of the categories we rank the systems according to the Mean Average Precision (MAP) measure [3]. The final ranking for each batch is calculated as the average of the individual rankings in the different categories. Tables 7 and 8 present the scores of the participating systems for document and snippet retrieval in the first batch of Phase A. (In contrast to the first two editions of the challenge, the biomedical experts of BioASQ were not asked to produce golden concepts and triples prior to the challenge; the ground truth for concepts and snippets will be constructed by the experts on the basis of the material provided by the systems.) Note that systems are allowed to participate in any or all four parts of the task; e.g., SNUMedinfo* retrieved only documents. It is worth noting that document retrieval for the given questions was the most popular aspect of the task; far fewer systems returned document snippets, concepts and RDF triples. The detailed results for Task 3b phase A can be found at http://participants-area.bioasq.org/results/3b/phaseA/.

Table 6. Statistics on the test datasets of Task 3b. The numbers concerning the documents and snippets refer to averages.

Batch | Size | # of documents | # of snippets | Yes/No | List | Factoid | Summary
1 | 100 | 11.27 | 13.33 | 33 | 22 | 26 | 19
2 | 100 | 10.96 | 12.95 | 16 | 28 | 32 | 24
3 | 100 | 9.3 | 10.98 | 28 | 17 | 26 | 29
4 | 97 | 9.37 | 11.97 | 29 | 23 | 25 | 20
5 | 100 | 5.84 | 8.53 | 28 | 24 | 22 | 26
Total | 497 | 9.35 | 11.55 | 134 | 114 | 131 | 118

Table 7. Results for batch 1 for documents in phase A of Task 3b.

System | Mean Precision | Mean Recall | Mean F-measure | MAP | GMAP
SNUMedinfo1 | 0.2430 | 0.3055 | 0.2220 | 0.1733 | 0.0117
SNUMedinfo2 | 0.2440 | 0.3076 | 0.2231 | 0.1731 | 0.0115
SNUMedinfo4 | 0.2420 | 0.3062 | 0.2220 | 0.1724 | 0.0117
fdu3 | 0.2320 | 0.3275 | 0.2232 | 0.1719 | 0.0071
fdu2 | 0.2290 | 0.3242 | 0.2201 | 0.1703 | 0.0066
SNUMedinfo3 | 0.2340 | 0.2900 | 0.2117 | 0.1695 | 0.0076
fdu4 | 0.2320 | 0.3290 | 0.2242 | 0.1695 | 0.0078
ustb prir3 | 0.2430 | 0.3092 | 0.2245 | 0.1687 | 0.0120
testtext | 0.2410 | 0.3042 | 0.2226 | 0.1681 | 0.0124
ustb prir4 | 0.2430 | 0.3088 | 0.2241 | 0.1666 | 0.0105
ustb prir1 | 0.2370 | 0.3045 | 0.2194 | 0.1663 | 0.0105
fdu | 0.2200 | 0.3045 | 0.2091 | 0.1590 | 0.0067
SNUMedinfo5 | 0.2240 | 0.2854 | 0.2050 | 0.1569 | 0.0070
qaiiit system 1 | 0.1957 | 0.1757 | 0.1559 | 0.1099 | 0.0006
fa1 | 0.1385 | 0.0888 | 0.0935 | 0.0489 | 0.0001
ilsp.aueb.1 | 0.1264 | 0.1103 | 0.0922 | 0.0485 | 0.0001
HPI-S2 | 0.1027 | 0.1250 | 0.0841 | 0.0464 | 0.0005
fdu5 | 0.0370 | 0.0314 | 0.0276 | 0.0138 | 0.0000
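The MAP and GMAP columns aggregate, over the questions of a batch, the average precision of each system's ranked list of returned items; GMAP is the geometric mean of the same values, typically computed with a small epsilon added so that questions with no relevant item returned do not collapse the score. The following is a rough sketch of these measures; the normalization and the epsilon value used by the official BioASQ evaluation may differ, and the document identifiers are invented.

```python
import math

def average_precision(ranked_items, relevant):
    """Average precision of one ranked list against the set of relevant items."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def map_and_gmap(runs, epsilon=0.01):
    """`runs` is a list of (ranked_items, relevant_set) pairs, one per question."""
    aps = [average_precision(items, rel) for items, rel in runs]
    map_score = sum(aps) / len(aps)
    gmap_score = math.exp(sum(math.log(ap + epsilon) for ap in aps) / len(aps))
    return map_score, gmap_score

# Invented document ids returned for two questions.
runs = [(["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2 = 0.833
        (["d4", "d5"], {"d9"})]              # AP = 0.0
print(map_and_gmap(runs))  # roughly (0.417, 0.092) for this toy input
```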
Table 8. Results for batch 1 for snippets in phase A of Task 3b.

System | Mean Precision | Mean Recall | Mean F-measure | MAP | GMAP
ustb prir3 | 0.0845 | 0.0967 | 0.0785 | 0.0570 | 0.0004
ustb prir1 | 0.0829 | 0.0970 | 0.0774 | 0.0546 | 0.0003
qaiiit system 1 | 0.0616 | 0.0697 | 0.0511 | 0.0545 | 0.0002
testtext | 0.0887 | 0.0948 | 0.0797 | 0.0529 | 0.0004
ustb prir4 | 0.0772 | 0.0882 | 0.0706 | 0.0513 | 0.0003
HPI-S2 | 0.0545 | 0.0686 | 0.0501 | 0.0347 | 0.0002

Phase B. In phase B of Task 3b, the systems were asked to generate exact and ideal answers. The systems will be ranked according to the manual evaluation of the ideal answers by the BioASQ experts [3]. For reasons of completeness we also report the results of the systems for the exact answers. In contrast to the previous editions of the BioASQ challenge, the test files of Phase B included only relevant documents and snippets for each question, instead of relevant documents, snippets, concepts and RDF triples. As a result, the participating systems had less information available in order to construct the exact and the ideal answers. Table 9 shows the results for the exact answers in the first batch of Task 3b. For systems that did not provide exact answers for a particular kind of question we use the dash symbol (-). The results of the other batches are available at http://participants-area.bioasq.org/results/3b/phaseB/. They are not reproduced here in the interest of space. From those results we can see that some of the systems achieve a very high performance (> 80% accuracy) in the yes/no questions. The performance in factoid and list questions is not as good, indicating that there is room for improvement. On the other hand, the performance on ideal answers has improved compared to the previous years [2], which, in combination with the increase in participation, leads us to believe that a significant amount of effort was invested by the participants and that the task is gaining attention. It should be noted that these conclusions are based only on the automated evaluation measures; the manual assessment was still in progress at the time of writing this document.
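For the exact answers reported in Table 9 below, yes/no questions are scored by plain accuracy, while factoid questions are scored on a ranked list of candidate answers using strict accuracy (correct answer at rank 1), lenient accuracy (correct answer anywhere in the list) and mean reciprocal rank (MRR). A simplified sketch with invented answers follows; the official evaluation normalizes answer strings more carefully than the lowercasing used here.

```python
def factoid_scores(per_question):
    """`per_question` is a list of (ranked_candidate_answers, gold_answers) pairs.

    Strict accuracy: a gold answer at rank 1; lenient accuracy: a gold answer
    anywhere in the list; MRR: mean of 1/rank of the first correct candidate
    (0 when no candidate is correct).
    """
    strict = lenient = rr_sum = 0.0
    for candidates, gold in per_question:
        candidates = [c.lower() for c in candidates]
        gold = {g.lower() for g in gold}
        if candidates and candidates[0] in gold:
            strict += 1
        rank = next((i for i, c in enumerate(candidates, 1) if c in gold), None)
        if rank is not None:
            lenient += 1
            rr_sum += 1.0 / rank
    n = len(per_question)
    return strict / n, lenient / n, rr_sum / n

# Two invented factoid questions.
data = [(["CFTR", "TP53"], {"CFTR"}),     # correct answer at rank 1
        (["BRCA2", "BRCA1"], {"BRCA1"})]  # correct answer at rank 2
print(factoid_scores(data))  # (0.5, 1.0, 0.75)
```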
Table 9. Results for batch 1 for exact and ideal answers in phase B of Task 3b.

System | Yes/no Acc. | Factoid Strict Acc. | Factoid Lenient Acc. | Factoid MRR | List Prec. | List Recall | List F-meas. | Ideal Rouge-2 | Ideal Rouge-SU4
fa1 | .8485 | .0769 | .0769 | .0769 | - | - | - | - | -
fdu | .8485 | .0769 | .1538 | .1038 | .0477 | .2362 | .0756 | .2634 | .2648
fdu2 | .8485 | .0769 | .1538 | .1038 | .0477 | .2362 | .0756 | .2669 | .2781
fdu3 | .8485 | .0769 | .1538 | .1038 | .0477 | .2362 | .0756 | .2760 | .2973
fdu4 | .8485 | .1154 | .1923 | .1423 | .0379 | .2957 | .0650 | .2760 | .2973
main system | .8485 | .1154 | .3077 | .1936 | .1311 | .1802 | .1362 | .2934 | .3066
HPI-S2 | .6667 | - | - | - | .0292 | .0603 | .0364 | .1884 | .2008
BioASQ Baseline 2 | .5455 | - | - | - | - | - | - | .3604 | .3787
BioASQ Baseline | .4545 | - | - | - | - | - | - | .4033 | .4217
SNUMedinfo1 | - | - | - | - | - | - | - | .2929 | .3069
SNUMedinfo2 | - | - | - | - | - | - | - | .2940 | .3071
SNUMedinfo3 | - | - | - | - | - | - | - | .2894 | .3034
SNUMedinfo4 | - | - | - | - | - | - | - | .2567 | .2703
SNUMedinfo5 | - | - | - | - | - | - | - | .2650 | .2784
ilsp.aueb.1 | - | - | - | - | - | - | - | .3829 | .4052
ilsp.aueb.2 | - | - | - | - | - | - | - | .4050 | .4213

5 Conclusions

The third edition of the BioASQ challenge has led to a number of interesting results by the participating systems. Although the baselines that we provided are quite advanced systems, they were beaten by the best participating systems. Both tasks have attracted an increasing number of participants, and the number of submissions to the workshop has also increased. Therefore, we believe that the third edition of the challenge has been another contribution towards better biomedical information systems. This encourages us to continue the effort and establish BioASQ as a reference point for research in the area. In future editions of the challenge, we aim to provide even more benchmark data derived from a community-driven acquisition process.

Acknowledgments

The third edition of BioASQ is supported by a conference grant from the NIH/NLM (number 1R13LM012214-01) and sponsored by the companies Viseo and Atypon.

References

1. Kamineni Avinash, Fatma Nausheen, Das Arpita, Shrivastava Manish, and Chinnakotla Manoj. Extreme Classification of PubMed Articles using MeSH Labels. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
2. George Balikas, Ioannis Partalas, Axel-Cyrille Ngonga Ngomo, Anastasia Krithara, Eric Gaussier, and George Paliouras. Results of the BioASQ track of the Question Answering Lab at CLEF 2014. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), pages 1181–1193, 2014.
3. Georgios Balikas, Ioannis Partalas, Aris Kosmopoulos, Sergios Petridis, Prodromos Malakasiotis, Ioannis Pavlopoulos, Ion Androutsopoulos, Nicolas Baskiotis, Eric Gaussier, Thierry Artieres, and Patrick Gallinari. Evaluation Framework Specifications. Project deliverable D4.1, 05/2013.
4. Petr Baudis and Jan Sedivy. Biomedical Question Answering using the YodaQA System: Prototype Notes. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
5. Sungbin Choi. SNUMedinfo at CLEF QA track BioASQ 2015. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
6. Sungbin Choi, Jinwook Choi, Sooyoung Yoo, Heechun Kim, and Youngho Lee. Semantic concept-enriched dependence model for medical information retrieval. Journal of Biomedical Informatics, 47:18–27, 2014.
7. Janez Demsar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1–30, 2006.
8. Nicolas Fiorini, Sylvie Ranwez, Sébastien Harispe, Jacky Montmain, and Vincent Ranwez. USI at BioASQ 2015: a semantic similarity-based approach for semantic indexing. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
9. Minlie Huang, Aurélie Névéol, and Zhiyong Lu. Recommending MeSH terms for annotating biomedical articles. JAMIA, 18(5):660–667, 2011.
10. James G. Mork, Dina Demner-Fushman, Susan C. Schmidt, and Alan R. Aronson. Recent enhancements to the NLM Medical Text Indexer. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), volume 1180, Sheffield, UK, 2014.
11. Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and Ion Androutsopoulos. Evaluation Measures for Hierarchical Classification: a unified view and novel approaches. CoRR, abs/1306.6802, 2013.
12. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL workshop ‘Text Summarization Branches Out’, pages 74–81, Barcelona, Spain, 2004.
13. Prodromos Malakasiotis, Emmanouil Archontakis, Ion Androutsopoulos, Dimitrios Galanis, and Harris Papageorgiou. Biomedical question-focused multi-document summarization: ILSP and AUEB at BioASQ3. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
14. Yuqing Mao and Zhiyong Lu. NCBI at the 2015 BioASQ challenge task: Baseline results from MeSH Now. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
15. Mariana Neves. HPI question answering system in the BioASQ 2015 challenge. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
16. Yannis Papanikolaou, Grigorios Tsoumakas, Manos Laliotis, Nikos Markantonatos, and Ioannis Vlahavas. AUTH-Atypon at BioASQ 3: Large-Scale Semantic Indexing in Biomedicine. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
17. Shengwen Peng, Ronghui You, Zhikai Xie, Yanchun Zhang, and Shanfeng Zhu. The Fudan participation in the 2015 BioASQ Challenge: Large-scale Biomedical Semantic Indexing and Question Answering. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
18. Francisco J. Ribadas, Luis M. de Campos, Víctor M. Darriba, and Alfonso E. Romero. CoLe and UTAI at BioASQ 2015: experiments with similarity based descriptor assignment. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
19. Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce Croft. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, volume 2, pages 2–6, 2005.
20. Lei Tang, Suju Rajan, and Vijay K. Narayanan. Large scale multi-label classification via MetaLabeler. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pages 211–220, New York, NY, USA, 2009. ACM.
21. George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138, 2015.
22. Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining Multi-label Data. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer US, 2010.
23. Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi Zhang, and Eric Nyberg. Learning to Answer Biomedical Factoid and List Questions: OAQA at BioASQ 3B. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
24. Harish Yenala, Avinash Kamineni, Manish Shrivastava, and Manoj Chinnakotla. BioASQ 3b Challange 2015: Bio-Medical Question Answering System. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.
25. Zhi-Juan Zhang, Tian-Tian Liu, Bo-Wen Zhang, Yan Li, Chun-Hua Zhao, Shao-Hui Feng, Xu-Cheng Yin, and Fang Zhou. A generic retrieval system for biomedical literatures: USTB at BioASQ2015 Question Answering Task. In Working Notes for the Conference and Labs of the Evaluation Forum (CLEF), Toulouse, France, 2015.