Overview of CLEF QA Entrance Exams Task 2015

Álvaro Rodrigo¹, Anselmo Peñas¹, Yusuke Miyao², Eduard Hovy³ and Noriko Kando²
¹ NLP&IR group, UNED, Spain ({anselmo,alvarory}@lsi.uned.es)
² National Institute of Informatics, Japan ({yusuke,kando}@nii.ac.jp)
³ Carnegie Mellon University, USA (hovy@cmu.edu)

Abstract. This paper describes the Entrance Exams task at the CLEF QA Track 2015. As in the two previous editions, the data set has been extracted from actual university entrance examinations and covers a variety of topics and question types. Systems receive a set of multiple-choice Reading Comprehension tests, and the task is to select the correct answer among a finite set of candidates, according to the given text. The questions were originally designed to test human examinees rather than to evaluate computer systems, so the data set challenges the reader's ability to demonstrate understanding of the text. Questions and answers are therefore lexically distant from their supporting excerpts in the text, requiring not only a high degree of textual inference but also strategies for selecting the correct answer.

1 INTRODUCTION

Following the 2013 and 2014 editions, the Entrance Exams task at the CLEF QA Track 2015 focuses on solving Reading Comprehension (RC) tests from English examinations. Reading Comprehension tests are routinely used to assess the degree to which people comprehend what they read, so we work under the hypothesis that it is reasonable to use these tests to assess the degree to which a machine "comprehends" what it is reading. Despite the difficulty of the challenge, we believe we are building a real benchmark that will serve to measure real progress in the field over the coming years.

With this goal in mind, CLEF and NTCIR started a collaboration in 2013 around the idea of testing systems against university entrance exams, the same exams humans have to pass to enter university. The data set was prepared and distributed by NTCIR, while the other organization efforts, including announcements, collecting and evaluating submissions, etc., were managed by UNED. The success of this coordination also owes to the standard data format and evaluation methodology followed in past editions.

2 TASK DESCRIPTION

Participant systems are asked to read a given document and answer a set of questions. Questions are given in multiple-choice format, with several options from which a single answer must be selected. Systems have to answer questions by referring to the "common sense knowledge" that high school students aiming to enter university are expected to have. In addition, we do not intend to restrict question types: any type of reading comprehension question found in real entrance exams can appear in the test data.

3 DATA

Japanese university entrance exams include questions formulated at various levels of complexity and test a wide range of capabilities. The Entrance Exams challenge aims at evaluating systems under the same conditions under which humans are evaluated to enter university.

3.1 Sources

The data set is extracted from standardized English examinations for university admission in Japan. Exams are created by the Japanese National Center for University Entrance Examinations. The original examinations include various styles of questions, such as word filling, grammatical error recognition, sentence filling, etc. One such style is reading comprehension, where a test provides a text describing some daily life situation, together with questions about the text. As in the previous edition, we reduced the challenge to the Reading Comprehension exercises contained in the English exams. For each examination, one text is given and several questions (between 4 and 8) about the text are asked. Each question has four choices, with only one correct answer.
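Purely for illustration, a reading test with this structure could be represented as in the following sketch. The field names are ours and do not reflect the official distribution format, which is not described here.

    # Hypothetical sketch of one Entrance Exams reading test (4-8 questions,
    # four candidate answers each, exactly one of them correct). Field names
    # are illustrative only, not the official data format.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Question:
        text: str                  # the question as asked in the exam
        candidates: List[str]      # exactly four candidate answers
        correct_index: int         # gold answer, available only to evaluators

    @dataclass
    class ReadingTest:
        document: str              # the narrative text given to examinees
        questions: List[Question]  # between 4 and 8 questions about the document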
For this year's campaign, we reused as development data 24 examinations from previous campaigns, with a total of 115 questions and 460 candidate answers. In addition, we provided a new test set of 19 documents, 89 questions and 356 candidate answers to be validated.

3.2 Languages

The test data sets, originally in English, were manually translated into German, Russian, French, Spanish and Italian.¹ They are parallel translations of texts, questions and candidate answers. All these collections represent a benchmark for evaluating systems in different languages. In addition to the official data, we collected several unofficial translations for each language. These collections have the same meaning as the original collection, but they use different words, expressions, syntax, semantics and anaphora, which produces collections with different levels of difficulty. The study of results over these variations should offer useful conclusions about systems' performance and the main issues for current technologies.

¹ Development data was translated into the same languages in the previous edition.

4 EVALUATION

We obtain the score of each system by comparing its answers against the gold standard collection annotated by humans. This is an automatic evaluation in which no manual assessments are needed. Each test receives an evaluation score between 0 and 1 using c@1 [1]. This measure, used in previous CLEF QA Tracks, encourages systems to reduce the number of incorrect answers while maintaining the number of correct ones by leaving some questions unanswered; a minimal sketch of how it can be computed is given at the end of this section. Systems receive evaluation scores from two different perspectives:

1. At the question-answering level: correct answers are counted individually, without grouping them.
2. At the reading-test level: we first obtain a score for each reading test; we then consider that a system passes a test if its score is at least 0.5; finally, we count the number of passed tests. A system passes the task if it passes more than half of the tests.
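Following the definition of c@1 in [1], where n is the total number of questions, n_R the number of questions answered correctly and n_U the number of questions left unanswered, the question-answering-level score can be computed roughly as below. This is a minimal sketch for illustration, not the official evaluation script.

    # Minimal sketch of the c@1 measure [1]:
    #   c@1 = (n_R + n_U * n_R / n) / n
    def c_at_1(n_right: int, n_unanswered: int, n_total: int) -> float:
        return (n_right + n_unanswered * n_right / n_total) / n_total

    # When every question is answered (n_U = 0), c@1 reduces to plain accuracy:
    print(round(c_at_1(52, 0, 89), 2))   # 0.58, the Synapse-English row of Table 2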
5 RESULTS

Table 1 lists the participating groups and their reference papers in the CLEF 2015 Working Notes. Although the number of participating groups was the same as last year, they submitted fewer systems (only 18 runs). Only LIMSI-CNRS has participated in all three editions, two other teams (CICNLP and SYNAPSE) also took part in the last edition, and only one team (Synapse) participated in a second language other than English (French).

Table 1. Participants and reference papers

Group ID    Group Name                                                                        #runs  Reference paper
SYNAPSE     Synapse Développement, France                                                     2      Laurent et al. 2015 [2]
NTUNLG      National Taiwan University, Taiwan                                                3      -
CICNLP      Centro de Investigación en Computación - Instituto Politécnico Nacional, Mexico   8      -
CoMIC       Universität Tübingen, Germany                                                     1      Ziai 2015 [4]
LIMSI-CNRS  ILES - LIMSI, France                                                              4      Gleize et al. 2015 [3]

Results are summarized in Tables 2 and 3 for the QA and the Reading perspectives, respectively. Table 2 shows that only the two systems from Synapse [2] gave more correct answers than incorrect ones and obtained a c@1 score greater than 0.5. In fact, Synapse also obtained the best results in the previous edition. While the French results remain similar, the English results rose from a c@1 score of 0.45 in the last edition to 0.58 in this edition. Furthermore, the LIMSI group also improved its performance with respect to the previous edition, while CICNLP obtained similar scores.

Overall results were lower in this edition. This may mean that the current collection is more complex, but participants did not report whether they performed better over past collections. For this reason, we would find it interesting to propose baseline systems, based on lexical and syntactic similarity, able to offer reference scores for the collections.

Table 2. Overall results for all runs, QA perspective

Run name         c@1   Right  Wrong  Answered  Prec.  Unanswered
Synapse-English  0.58  52     37     89        0.58   0
Synapse-French   0.56  50     39     89        0.56   0
LIMSI-2          0.36  32     57     89        0.36   0
LIMSI-1          0.34  30     59     89        0.34   0
LIMSI-3          0.31  28     61     89        0.31   0
LIMSI-4          0.31  28     61     89        0.31   0
cicnlp-8         0.30  27     62     89        0.30   0
cicnlp-2         0.29  26     63     89        0.29   0
NTUNLG-2         0.29  26     63     89        0.29   0
CoMiC-1          0.29  26     63     89        0.29   0
cicnlp-3         0.28  25     64     89        0.28   0
cicnlp-5         0.28  25     64     89        0.28   0
cicnlp-4         0.27  24     65     89        0.27   0
cicnlp-6         0.26  23     66     89        0.26   0
cicnlp-1         0.26  23     66     89        0.26   0
Random           0.25  22     67     89        0.25   0
NTUNLG-3         0.24  21     68     89        0.24   0
NTUNLG-1         0.22  17     57     74        0.23   15
cicnlp-7         0.21  19     70     89        0.21   0

Only one system (NTUNLG-1) decided to leave some questions unanswered, whereas in the previous edition several systems from two participants did so. This is in fact why NTUNLG-1 is the only system whose c@1 and precision scores differ: c@1 gives the same score as accuracy and precision when all questions are answered. The number of systems leaving questions unanswered has therefore decreased every year. However, the reduction of unanswered questions did not bring a reduction of incorrect answers. It seems that neither the campaign nor the evaluation measure have been able to promote this change in systems, so we must think about new ways of encouraging it.

Synapse reported additional experiments in which they left some questions unanswered and increased precision over the answered questions. However, they obtained fewer correct answers and lower c@1 scores. The objective of c@1 is to reward systems able to reduce the number of incorrect answers while keeping the number of correct ones, but systems are not yet able to do so.
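As a concrete illustration of the difference between c@1 and precision discussed above, the NTUNLG-1 row of Table 2 can be reproduced from its published counts using the c@1 definition of Section 4. This is a sketch for illustration only.

    # Reproducing the NTUNLG-1 row of Table 2 (17 right, 57 wrong,
    # 15 unanswered out of 89 questions).
    n_total, n_right, n_unanswered = 89, 17, 15
    n_answered = n_total - n_unanswered                   # 74 answered questions

    precision = n_right / n_answered                      # 17 / 74
    c_at_1 = (n_right + n_unanswered * n_right / n_total) / n_total

    print(round(precision, 2), round(c_at_1, 2))          # 0.23 0.22, as in Table 2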
On the other hand, Table 3 shows the results from the reading perspective. The first column corresponds to the run id, the second column to the overall c@1 score obtained, the third column shows the number of tests the system would have passed considering a c@1 threshold of 0.5, and the remaining columns give the c@1 value for each single test.

Table 3. Overall results for all runs, reading perspective

Run         c@1   Pass   T1    T2    T3    T4    T5    T6    T7    T8    T9    T10   T11   T12   T13   T14   T15   T16   T17   T18   T19
Synapse-En  0.58  16/19  0.25  0.5   0.33  0.5   0.5   1     0.8   0.67  0.6   0.8   0.4   0.5   0.57  0.75  0.5   0.5   0.67  0.75  0.67
Synapse-Fr  0.56  16/19  0.25  0.5   0.5   0.25  0.5   0.67  0.6   1     0.6   0.8   0.4   0.75  0.57  0.5   0.5   0.5   0.5   0.5   0.83
LIMSI-2     0.36  8/19   0     0.25  0.5   0.25  0.5   0.67  0.2   0.33  0.2   0.6   0.6   0.25  0.29  0.25  0.5   0.5   0.5   0     0.33
LIMSI-1     0.34  5/19   0.25  0.5   0.33  0.25  0     0.33  0.4   0     0.4   0.2   0.6   0.25  0.43  0.25  0.5   0.25  0.5   0.75  0.17
LIMSI-3     0.31  6/19   0.5   0.25  0.5   0     0.17  0.33  0.4   0.67  0.4   0.6   0     0.25  0.29  0.5   0.25  0.5   0.17  0.25  0.17
LIMSI-4     0.31  4/19   0.25  0.25  0.17  0.25  0.17  0.33  0.6   0     0.2   0.4   0.6   0.75  0.43  0     0.75  0     0.33  0.25  0.17
Average     0.31  -      0.24  0.29  0.25  0.31  0.38  0.31  0.20  0.37  0.32  0.31  0.30  0.42  0.30  0.26  0.50  0.36  0.26  0.36  0.33
cicnlp-8    0.30  6/19   0     0.25  0.17  0.75  0.33  0.67  0     0.67  0.6   0.2   0.4   0.25  0     0     0.5   0.25  0.17  0.25  0.67
cicnlp-2    0.29  5/19   0.25  0.25  0.17  0.25  0.5   0.33  0     0     0.4   0.2   0.2   0.5   0.43  0     0.5   0.5   0     0.25  0.67
NTUNLG-2    0.29  6/19   0.5   0     0.33  0.25  0.17  0     0     0     0.2   0.2   0.8   0.5   0.29  0.5   0.5   0     0.33  0.5   0.33
CoMiC-1     0.29  5/19   0.25  0.5   0.33  0.25  0.83  0.67  0.2   0.33  0.4   0.2   0     0     0.29  0     0.5   0.25  0.17  0.5   0
Median      0.29  -      0.25  0.25  0.17  0.25  0.47  0.33  0.1   0.33  0.4   0.2   0.3   0.5   0.29  0.25  0.5   0.5   0.17  0.25  0.17
cicnlp-3    0.28  7/19   0     0.5   0.17  0.25  0.5   0     0     0.67  0     0.4   0.2   0.5   0.14  0.25  0.75  0.5   0.17  0.5   0.17
cicnlp-5    0.28  5/19   0.25  0.25  0.5   0.75  0.17  0     0.2   0.33  0.4   0.2   0     0     0.14  0.5   0.25  0.5   0.17  0.25  0.5
cicnlp-4    0.27  5/19   0.25  0.5   0.17  0.25  0.33  0     0     0.33  0.4   0     0     0.5   0.29  0.25  0.75  0.5   0     0.25  0.5
cicnlp-6    0.26  5/19   0.5   0.25  0     0.25  0.5   0     0     0.33  0.6   0.4   0     0.5   0.29  0.25  0.25  0.5   0     0.25  0.17
cicnlp-1    0.26  4/19   0.25  0.25  0.17  0.25  0.5   0.33  0     0.33  0.4   0     0.2   0.5   0.43  0     0.5   0.5   0     0.25  0.17
Random      0.25  -      0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25  0.25
NTUNLG-3    0.24  5/19   0.25  0     0     0.25  0.5   0     0     0     0     0.2   0.6   0.5   0.14  0.25  0.5   0.25  0.33  0.5   0.17
NTUNLG-1    0.22  3/19   0.25  0     0     0.25  0.44  0     0     0     0     0     0.48  0.5   0.2   0.25  0.5   0.25  0.44  0.5   0
cicnlp-7    0.21  3/19   0     0.25  0.17  0.25  0.17  0.33  0.2   1     0     0.2   0     0.5   0.14  0.25  0.5   0.25  0.17  0     0.17

Under the reading perspective, we say that a system passes the global exam if it passes 50% or more of the tests, that is, if it passes at least 10 reading tests. According to this requirement, only the two systems from Synapse passed the 2015 Entrance Exams task.

Although results from the QA perspective are worse than in the previous edition, results from the RC perspective are slightly better. In fact, the proportion of passed tests this year (84%) is higher than in the previous edition (75%).

We have also observed that the ranking of systems sometimes changes between the QA and the Reading perspective. For instance, system cicnlp-3 ranked fourth in the Reading perspective, while it ranked eleventh in the QA perspective. We observed similar changes for other systems. We think this is because participants have focused more on the Reading perspective, building systems with low results in some tests but good results in others.

We think Tables 4 and 5, which show the number of systems passing each test and the maximum score per test, support a similar conclusion. We see in Tables 4 and 5 that the maximum scores remain similar to or better than those of the previous edition. Moreover, more systems pass a given test, despite the fact that there are fewer systems in this edition.

Tables 4 and 5 also show a different degree of difficulty across tests. This difficulty mainly depends on the lexical gap between the text and the candidate answers. Besides, systems also encounter difficulties depending on different formulations of the same text, as Synapse showed last year [5] and as we are now studying with different versions of the same collection.
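For clarity, the reading-perspective aggregation described above (a test is passed if its c@1 score is at least 0.5; the task is passed if more than half of the 19 tests are passed) can be sketched as follows. The per-test scores are taken from Table 3, and the snippet is illustrative only.

    # Sketch of the reading-perspective aggregation: pass a test if c@1 >= 0.5,
    # pass the task if more than half of the 19 tests are passed (at least 10).
    def passed_tests(test_scores, threshold=0.5):
        return sum(1 for s in test_scores if s >= threshold)

    def passes_task(test_scores):
        return passed_tests(test_scores) > len(test_scores) / 2

    # Synapse-En row of Table 3 (19 per-test c@1 scores):
    synapse_en = [0.25, 0.5, 0.33, 0.5, 0.5, 1, 0.8, 0.67, 0.6, 0.8,
                  0.4, 0.5, 0.57, 0.75, 0.5, 0.5, 0.67, 0.75, 0.67]
    print(passed_tests(synapse_en), passes_task(synapse_en))  # 16 True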
Table 4. Number of runs (out of 18) that passed each test (tests 1 to 10), and maximum c@1 score achieved per test.

             T1    T2    T3    T4    T5    T6    T7    T8    T9    T10
# runs pass  3     12    2     5     15    10    4     8     6     6
Max. score   0.50  0.75  0.57  0.75  0.75  0.50  0.67  0.75  0.83  0.50

Table 5. Number of runs (out of 18) that passed each test (tests 11 to 19), and maximum c@1 score achieved per test.

             T11   T12   T13   T14   T15   T16   T17   T18   T19
# runs pass  4     3     9     5     3     6     4     4     5
Max. score   0.50  0.75  0.83  1.00  0.80  1.00  0.60  0.80  0.80

6 SUMMARY OF SYSTEMS

In this section we give more details about the systems of the groups that submitted a reference paper describing their participation.

The general architecture of participant systems includes the following components:

1. a preprocessing step that prepares texts for the subsequent steps;
2. the retrieval of relevant passages in order to reduce the search space;
3. the creation of graph-style structures from texts, questions and candidates;
4. the enrichment of these structures with, for instance, background knowledge;
5. the comparison of structures as a way of finding the most probable answer;
6. the ranking of candidates with respect to the comparison score; and
7. the selection of the candidate with the best score.

Most systems follow this architecture, with slight changes at some steps or by removing some of them; such a pipeline is sketched below.
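The sketch below reduces the generic pipeline to a simple lexical-overlap baseline, in the spirit of the reference baselines suggested in Section 5; the graph-building and enrichment steps (3 and 4) are omitted. It is a schematic composite with hypothetical function names and does not correspond to any particular participant system.

    # Minimal runnable sketch of the generic architecture, as a lexical-overlap
    # baseline. Illustrative only, not any participant's system.
    import re

    def tokenize(text):
        return set(re.findall(r"[a-z']+", text.lower()))   # (1) crude preprocessing

    def retrieve_passage(document, question, window=3):
        # (2) pick the sentence window with the largest word overlap with the question
        sentences = re.split(r"(?<=[.!?])\s+", document)
        q = tokenize(question)
        best = max(range(len(sentences)),
                   key=lambda i: len(q & tokenize(" ".join(sentences[i:i + window]))))
        return " ".join(sentences[best:best + window])

    def answer(document, question, candidates):
        passage = tokenize(retrieve_passage(document, question))
        scores = [len(passage & tokenize(c)) / (len(tokenize(c)) or 1)
                  for c in candidates]                      # (5) compare candidates vs. passage
        ranked = sorted(zip(candidates, scores),
                        key=lambda p: p[1], reverse=True)   # (6) rank by score
        return ranked[0][0]                                 # (7) best-scoring candidate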
The general architecture shows that systems relied on ranking methods rather than validation. In fact, only one system (NTUNLG-1) decided to leave some questions unanswered. Systems know that there is a correct answer, and they take the risk of always returning the candidate most similar to the text, no matter how low this similarity is. As pointed out above, this is not the expected behaviour in the task, and we must think about new ways of promoting the desired change in that direction.

The impact of selecting relevant passages, rather than working with the whole document, is still unclear. Systems working with passages do so as a way of reducing the amount of work in the following steps. Unfortunately, participant groups did not report whether they returned incorrect answers as a consequence of this selection. We think this might be an interesting study for other researchers. In any case, it is clear that this step must focus on recall rather than precision.

Some participants prefer to work with graph-like representations of texts and answers instead of raw text. We think these participants try to exploit the properties of such structures for representing connections between concepts, for including extra knowledge, etc.

Regarding the use of external knowledge, we think it is one of the main issues in this task. Reading comprehension texts contain a lot of implicit information that automatic systems might not be able to extract, as LIMSI reported [3]. However, the best performing systems, from Synapse, did not use any kind of external knowledge. We think current systems are still quite far from a proper way of representing, exploiting and using external knowledge in this task.

A more detailed analysis of each system showed that Synapse [2] built Clause Description Structures (CDS), which are similar to graphs, over whole documents. They preferred not to include external knowledge from resources such as DBpedia or Wikipedia because they considered that the given text contained enough information for finding the correct candidate. They also removed candidates that did not match the expected answer type, as a way of reducing the search space. Then, they compared the CDSs of texts and candidates, measuring proximity and the number of common elements. Finally, they chose the candidate with the highest coefficient.

On the other hand, LIMSI [3] selected a set of passages in order to reduce the computation time. Next, they represented the passages as graphs and enriched those graphs using external sources, in order to reduce the gap between the knowledge extracted from texts by humans and by computers. Then, they recorded the changes required to pass from passage graphs to candidate graphs. Finally, they applied two classifiers, one for validation and one for rejection, using the set of changes as features. The selected candidate was the one with the highest final score, where finalScore = validationScore - rejectionScore.

CoMIC [4] also retrieved a set of relevant passages. For this purpose, they took into consideration that passages relevant to the first questions usually appear at the beginning of the text, while passages relevant to the last questions appear near the end. Then, they measured the similarity between the relevant passages and the candidates without using any intermediate graph structure, combining vector-space-model measures, similarity measures based on WordNet, and syntactic and semantic similarity measures. Finally, they applied a Ranking SVM model to obtain the final answer.

7 CONCLUSIONS

In the third edition of the task we expected a jump in performance with respect to previous campaigns. However, we have seen similar results from the Question Answering perspective and slight improvements from the Reading Comprehension perspective. Only the systems from Synapse could give more correct than incorrect answers.

We think the current task is still very hard for current technologies, and it is not clear what the bottleneck is. We know that there are several issues, such as (1) the semantic gap between texts, questions and answers, (2) external knowledge management, etc. Participants tried different approaches and offered some insight into how to make progress on this task, but it is not clear what the right direction is. In any case, the availability of the created resources and methodology provides a benchmark able to assess real progress in the field over the coming years.

8 ACKNOWLEDGEMENTS

This collaboration has been developed in the framework of the Todai Robot Project in Japan, the CHIST-ERA Readers project in Europe (MINECO PCIN-2013-002-C02-01) and the Voxpopuli project (TIN2013-47090-C3-1-P). The Todai Robot Project is a grand challenge headed by NII that aims to develop an end-to-end AI system able to solve real university entrance examinations in Japan by integrating heterogeneous AI technologies, such as natural language processing, situation understanding, math formula processing and vision processing.

REFERENCES

1. Anselmo Peñas and Alvaro Rodrigo. A Simple Measure to Assess Non-response. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA, 2011.
2. Dominique Laurent, Baptiste Chardon, Sophie Negre, Camille Pradel and Patrick Seguela. Reading Comprehension at Entrance Exams 2015. CLEF 2015 Working Notes, Toulouse, 2015.
3. Martin Gleize and Brigitte Grau. LIMSI-CNRS@CLEF 2015: Tree Edit Beam Search for Multiple Choice Question Answering. CLEF 2015 Working Notes, Toulouse, 2015.
4. Ramon Ziai. CoMiC: Exploring Text Segmentation and Similarity in the English Entrance Exams Task. CLEF 2015 Working Notes, Toulouse, 2015.
5. Dominique Laurent, Baptiste Chardon, Sophie Negre and Patrick Seguela. English run of Synapse Développement at Entrance Exams 2014. CLEF 2014 Working Notes, Sheffield, 2014.