Overview of QA4MRE at CLEF 2011: Question Answering for Machine Reading Evaluation

Anselmo Peñas

anselmo@lsi.uned.es

Eduard Hovy

hovy@isi.edu 2

Pamela Forner

forner@celct.it 1

Álvaro Rodrigo

Richard Sutcliffe

richard.sutcliffe@ul.ie 4

Corina Forascu

Caroline Sporleder 3

0 Al. I. Cuza University of Iasi, Romania
1 CELCT, Italy
2 Information Sciences Institute of the University of Southern California, USA
3 Saarland University, Germany
4 University of Limerick, Ireland

This paper describes the first steps towards developing a methodology for testing and evaluating the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the aim of the QA4MRE challenge, which was run as a Lab at CLEF 2011. This year a major innovation was introduced: the traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts. Systems were required to choose one answer by analysing the corresponding test document in conjunction with the background collections provided by the organization. Besides the main task, one pilot task was also offered, namely Processing Modality and Negation for Machine Reading. This task was aimed at evaluating whether systems were able to understand extra-propositional aspects of meaning such as modality and negation. This paper describes the preparation of the data sets, the creation of the background collections to allow systems to acquire the required knowledge, the metric used for the evaluation of the systems' submissions, and the results of this first attempt. Twelve groups participated in the task, submitting a total of 62 runs in three languages: English, German and Romanian.

1. INTRODUCTION

Machine Reading (MR) is defined as a task that deals with the automatic understanding of texts. The evaluation of this “automatic understanding” can be approached in two ways. The first is to define a formal language (target ontology), ask the systems to translate texts into the formal language representation, and then evaluate systems by using structured queries formulated in the formal language. The second approach is agnostic about any particular representation of the text: systems are queried about the text with natural language questions. The first option is the one taken by Information Extraction. The second is related to how Question Answering (QA) has been articulated over the last decade. In this evaluation we follow the second approach, but with a significant change with respect to previous QA campaigns. Why?

By 2005 we realized that there was an upper bound of 60% accuracy in system performance, even though more than 80% of the questions were answered by at least one participant. We understood that we had a problem of error propagation in the traditional QA pipeline (Question Analysis, Retrieval, Answer Extraction, Answer Selection/Validation). Thus, in 2006 we proposed a pilot task called the Answer Validation Exercise (AVE). The aim was to produce a change in QA architectures, giving more responsibility to the validation step. In AVE we assumed there was a previous step of hypothesis over-generation and that the hard work was in the validation step. This is a kind of classification task that could take advantage of Machine Learning. The same idea is behind the architecture of IBM’s Watson (DeepQA project), which successfully participated in Jeopardy (Ferrucci et al., 2010).

After the three editions of AVE we tried to transfer our conclusions to the main QA task at CLEF 2009 and 2010. The first step was to introduce the option of leaving questions unanswered. This is related to the development of validation technologies. We needed a measure able to reward systems that reduce the number of questions answered incorrectly without affecting system accuracy, by leaving unanswered the questions they estimated they could not answer. The measure was an extension of accuracy called c@1 (Peñas and Rodrigo, 2011), tested during the 2009 and 2010 QA campaigns at CLEF, and also used in the current evaluation.

However, this change was not enough. Almost all systems continued using IR engines to retrieve relevant passages and then tried to extract the exact answer from them. This is not the change in architecture we expected, and again, results did not go beyond the 60% pipeline upper bound. Finally, we understood that the change in architecture requires the prior development of answer validation/selection technologies. For this reason, in the current formulation of the task, the retrieval step is put aside for a while, and the focus is on the development of technologies able to work with a single document and answer questions about it.

The idea of a hypothesis generation and validation architecture is applicable to the new setting where only one document is considered, but of course the generation of hypotheses would be very limited if one only considered the given document. Systems should consider a large collection related to the given document in the task of hypothesis generation. Then, the validation must be performed according to the given document.

In the new setting, we started again by decomposing the problem into generation and validation. Thus, in this first edition, we test the systems only on the validation step. Together with the questions, the organization provides a set of candidate answers. Moreover, in this first edition, systems know there is one and only one correct answer among the candidates. This gives the evaluation the format of traditional Multiple Choice Reading Comprehension tests. From this starting point, a natural roadmap could be the following:

1. Focus on validation: questions have a set of candidate answers attached.

a. Step 1. All questions have one and only one correct candidate answer.
b. Step 2. Introduce questions that require inference (e.g. about time and space).
c. Step 3. Introduce questions with no correct candidate answer.
d. Step 4. Introduce questions that require textual inference after reading a large set of documents related to the test (e.g. expected actions of agents with a particular role, etc.)

2. Introduce hypothesis generation: the organization provides reference collections of documents related to the tests.

a. Step 5. Questions about a single document, but no candidate answers are provided.
b. Step 6. Full QA setting, where systems have to generate hypotheses considering the reference collection and provide the answer together with the set of documents that support it.

We are just at the beginning of this roadmap, giving space and resources for the evaluation of new QA systems with new architectures. The success of this new initiative is only measurable by the development of these new architectures able to produce a qualitative jump in performance. This vision will guide the concrete definition of the tasks year by year.

2. TASK DESCRIPTION

The QA4MRE 2011 task focuses on the reading of single documents and the identification of the answers to a set of questions. Questions are multiple choice, each having five options and only one correct answer. The detection of correct answers may require various kinds of inference and the consideration of previously acquired background knowledge from reference document collections. Although the additional knowledge obtained through the background collection may be used to assist with answering the questions, the principal answer is to be found among the facts contained in the test documents given. Thus, reading comprehension tests do not require only semantic understanding; they assume a cognitive process which involves using implications and presuppositions, retrieving stored information, and performing inferences to make implicit information explicit. Many different forms of knowledge take part in this process: linguistic, procedural, and world and common-sense knowledge. All these forms coalesce in the memory of the reader, and it is sometimes difficult to clearly distinguish and reconstruct them in a system which needs additional knowledge and inference rules in order to understand the text and to give sensible answers.

2.1 Main Task

By giving only a single document per test, systems are required to understand every statement and to form connections across statements in case the answer is spread over more than one sentence. Systems are requested to (i) understand the test questions, (ii) analyze the relation among entities contained in questions and entities expressed by the candidate answers, (iii) understand the information contained in the documents, (iv) extract useful pieces of knowledge from the background collections, and (v) select the correct answer from the five alternatives proposed.

Tests were divided as follows:
- 3 topics, namely “Aids”, “Climate change” and “Music and Society”
- Each topic had 4 reading tests
- Each reading test consisted of one single document, with 10 questions and a set of five choices per question.

Overall, the evaluation in this campaign comprised:
- 12 test documents (4 documents for each of the three topics)
- 120 questions (10 questions for each document)
- 600 choices/options (5 for each question)

Test documents and questions were made available in English, German, Italian, Romanian, and Spanish. These materials were exactly the same in all languages, created using parallel translations.
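The test-set layout described above can be sketched as a small data structure. This is only an illustration: the class and field names (`Question`, `ReadingTest`, `build_placeholder_test_set`) are hypothetical and the contents are placeholders, not the actual QA4MRE data or distribution format.

```python
from dataclasses import dataclass, field
from typing import List

# Figures taken from the task description.
TOPICS = ["Aids", "Climate change", "Music and Society"]
TESTS_PER_TOPIC = 4
QUESTIONS_PER_TEST = 10
OPTIONS_PER_QUESTION = 5


@dataclass
class Question:
    text: str
    options: List[str]   # five candidate answers
    correct_index: int   # position of the single correct option


@dataclass
class ReadingTest:
    topic: str
    document: str        # the single test document
    questions: List[Question] = field(default_factory=list)


def build_placeholder_test_set() -> List[ReadingTest]:
    """Build an empty-shell test set with the sizes given in the text."""
    tests = []
    for topic in TOPICS:
        for d in range(TESTS_PER_TOPIC):
            questions = [
                Question(
                    text=f"question {q + 1}",
                    options=[f"option {o + 1}" for o in range(OPTIONS_PER_QUESTION)],
                    correct_index=0,  # placeholder value only
                )
                for q in range(QUESTIONS_PER_TEST)
            ]
            tests.append(ReadingTest(topic, f"{topic} document {d + 1}", questions))
    return tests


tests = build_placeholder_test_set()
n_documents = len(tests)
n_questions = sum(len(t.questions) for t in tests)
n_options = sum(len(q.options) for t in tests for q in t.questions)
print(n_documents, n_questions, n_options)
```

The three counts reproduce the totals stated above: 12 documents, 120 questions and 600 options.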

2.2 Pilot Exercises

Besides the main task, one pilot task was also offered this year at QA4MRE, namely Processing Modality and Negation for Machine Reading [11]. It was coordinated by CLiPS, a research center associated with the University of Antwerp, Belgium. The task was aimed at evaluating whether systems are able to understand extra-propositional aspects of meaning such as modality and negation. Modality is a grammatical category that allows the expression of aspects related to the attitude of the speaker towards his/her statements. Modality understood in a broader sense is also related to the expression of certainty, factuality, and evidentiality. Negation is a grammatical category that allows the truth value of a proposition to be changed. Modality and negation interact to express extra-propositional aspects of meaning. More information is available at http://www.cnts.ua.ac.be/BiographTA/qa4mre.html. The pilot task used the same topics and background collections as the main exercise. Test documents, instead, were specifically selected in order to ensure the properties required for the questionnaires. The pilot task was offered in English only.

3. THE BACKGROUND COLLECTIONS

One focus of the task is the ability to extract different types of knowledge and to combine them as a way to answer the questions. In order to allow systems to acquire the same background knowledge, ad-hoc collections were created. At an early stage, a background collection related to the renewable energy domain was released to participants together with some sample data. The background collection for the sample, of about 11,000 documents, was in English only. For the real test, three background collections (one for each of the topics) were released in all the languages involved in the exercise, i.e., English, German, Italian, Spanish and Romanian. Overall, fifteen large repositories were created as sources of “background knowledge” to enable the inference of information that is implicit in the text. These background collections are comparable (but not identical), topic-related (but not specialized) collections, made available at the beginning of April to all participants who signed a license agreement. Thus, systems could “learn” and acquire knowledge in one language or several.

The only way to acquire large comparable corpora in the three domains in which we were interested was to crawl the web. Crawling refers to the acquisition of material specific to a given subject from the Web. The Web, with its vast volumes of data in almost any domain and language, offers a natural source of naturally occurring texts. To this end, a web crawler was specifically created by CELCT in order to gather domain-specific texts from the Web.

As for the distribution of documents among the collections, the final number of documents fetched for each language collection was different, but this is supposed to reflect the real distribution. Table 1 shows the sizes of the corpora which were acquired and the number of documents contained in each language background collection for each of the three topics. The corpora obtained from the crawling process contain a set of documents which are related to the test documents. Unfortunately, the proportion of noisy documents introduced is unknown.

As a final step, in order to ensure that each language background collection really contained documents which supported the inferences of the questions, each language organizer was also asked to manually search the web for documents, in their own language, which were to be manually added to each language collection. A list of the respective documents that should be looked for was provided by the question creators to each language group.

Once all collections were ready in all languages, the zipped files were transferred to the CELCT FTP server. All documents inside each collection were then re-numbered, giving them a progressive unique identifier.

3.1 Keywords and Crawling

A web crawler is a relatively simple automated program, or script, that methodically scans or "crawls" through Internet pages to create an index of the data it's looking for.

The QA4MRE crawler is a flexible application designed to download a large number of documents from the World Wide Web around a specified list of keywords. It was developed using the Google API, downloading documents in ranked order and obeying the Robots Exclusion Standard. After downloading, documents are converted to .txt format and each text is named according to the source from which it has been downloaded, for example: “articles.latimes.com_68”.

Keywords play a central role in the crawling process as they are used in acquiring the seed URLs. Before fixing the final set of keywords, all the people in charge of creating the respective language collections experimented with a preliminary pool of keywords and suggested changes to the others. Then, once the sets of keywords were standardised in English, they were translated into the other languages and loaded into CELCT’s crawler. Keywords must not be too generic, and combinations of keywords that restrict the domain helped to retrieve relevant documents. Synonyms or words which have very similar meanings (for example, “climate change” and “climate variability”; “carbon dioxide” and “CO2”) were kept as separate queries, as the documents which could be obtained could be different. Also, acronyms were always expanded (for example, Joint United Nations Programme on HIV/AIDS (UNAIDS)), and both forms were entered in the same query into the crawler.

In addition, as building a comparable corpus requires control over the selection of source texts in the various languages, each language group was asked to prepare a list of (trusted) web sites, indicatively around 40, which were likely to have plenty of documents related to the topic in their own language. This was required as a way to increase the number of relevant documents while avoiding the introduction of noise (or virus files). The longer the list of domains, the higher the number of documents which could be downloaded for each single query. Texts were drawn from a variety of sources, e.g. newspapers, newswire, web, journals, blogs, Wikipedia entries, etc.

All keywords and all domains were entered in one crawling run. This solution allowed the removal of duplicate URLs retrieved by different queries, as the URLs encountered were kept in memory so that every URL was visited only once. On average, it took 2-3 days to build one background collection for one topic.

Other parameters could also be set, namely the number of documents to be downloaded for each single query. By default it was set to 1000 since, due to Google restrictions, this is the maximum number of documents per query which can be downloaded for a specified source/domain. For the English language, this parameter was set to 500. In an attempt to reduce the number of indices and other useless files in the corpus, documents which were too short were automatically discarded by setting the minimum length of a document to 1000 characters. For the English language it was set to 1500.
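The post-processing rules described in this section (each URL visited only once across queries, a cap on documents per query, a minimum document length, and source-based file names) can be sketched as follows. This is not the CELCT crawler: the function `filter_crawl`, its input format and the per-source counter used for naming are assumptions made for illustration, and the cap is applied per query only, without modelling the per-domain aspect.

```python
from urllib.parse import urlparse

# Defaults reported in the text; for English the values were 500 and 1500.
DEFAULT_MAX_DOCS_PER_QUERY = 1000
DEFAULT_MIN_CHARS = 1000


def filter_crawl(results_by_query,
                 max_docs_per_query=DEFAULT_MAX_DOCS_PER_QUERY,
                 min_chars=DEFAULT_MIN_CHARS):
    """Filter already-downloaded results.

    results_by_query maps a keyword query to a ranked list of (url, text)
    pairs (hypothetical input format). Returns a dict mapping a file name
    such as "articles.latimes.com_1" to the document text.
    """
    seen_urls = set()        # URLs kept in memory so each is handled once
    per_source_counter = {}  # assumed naming scheme: host + running number
    kept = {}
    for query, hits in results_by_query.items():
        for url, text in hits[:max_docs_per_query]:
            if url in seen_urls:
                continue  # duplicate retrieved by a different query
            seen_urls.add(url)
            if len(text) < min_chars:
                continue  # too short: likely an index page or similar
            host = urlparse(url).netloc
            per_source_counter[host] = per_source_counter.get(host, 0) + 1
            kept[f"{host}_{per_source_counter[host]}"] = text
    return kept


sample = {
    "climate change": [
        ("http://articles.latimes.com/a", "x" * 1200),
        ("http://articles.latimes.com/b", "short"),
    ],
    "climate variability": [
        ("http://articles.latimes.com/a", "x" * 1200),  # duplicate URL
        ("http://example.org/c", "y" * 1000),
    ],
}
kept_default = filter_crawl(sample)
kept_english = filter_crawl(sample, max_docs_per_query=500, min_chars=1500)
print(sorted(kept_default))
```

With the default settings the sample keeps one document per source; with the stricter English settings both remaining documents fall below the 1500-character minimum.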

4. TEST SET PREPARATION

As we have seen, the task this year was to answer a series of multiple choice tests, each based on a short document.

4.1 Test Documents

In order to allow participants to tune their systems, a set of pilot data was first devised. This consisted of three English documents concerned with the topic of renewable energy taken from Green Blog (http://www.greenblog.org/) together with three sets of questions, one for each document, and a background collection of about 11,000 documents. For each document there were ten multiple choice questions; each question had five candidate answers, one clearly correct answer and four clearly incorrect answers. The task of each system was therefore to choose one answer for each question, by analysing the corresponding test document in conjunction with the background collection.

Following the creation of the pilot data, attention was turned to the materials for the actual evaluation. The languages this year were English, German, Italian, Romanian and Spanish. The intention was to set identical questions for these five languages. This required access to a suitable parallel collection of documents, so that each test document was exactly translated into each language of the task. Unfortunately, even after decades of interest in parallel corpora, very few publicly available high quality collections exist in these five languages. The main possibilities available to us were "Eurobabble" and technical manuals, but each was somewhat unsuitable for the task. Another option was for us to commission special translations of selected documents in, say, English, just for the purposes of QA4MRE.

After some consideration, we took up a suggestion of Igal Gabbay to use documents taken from the Technology, Entertainment, Design (TED) conferences (www.ted.com). Each TED event consists of a series of invited presentations by prestigious speakers, from fields such as politics, entertainment and industry. The speakers are fluent, persuasive, and mostly speak from memory with no repetition or hesitation. Each talk lasts for twenty minutes or less and is aimed at a non-specialised but reasonably educated audience. The organisers provide for each talk a high-quality text transcription. In the case of the talks used, this ranges in length between 1125 and 3580 words. They also provide an infrastructure for the transcriptions to be translated by volunteers. These translations are carefully refereed and are generally of very high quality. The number of languages in which a talk is available varies, depending on its popularity, but is typically 20-40.

From the perspective of QA4MRE, TED transcriptions have some good points and some bad ones. On the one hand, they are of high typographical and syntactic quality, they discuss clearly-defined topics, they are at a reasonable intellectual level, they are available translated accurately into many languages and they are of course publicly available. On the other hand, they are on the short side and, length-for-length, contain fewer facts amenable to the generation of questions than might be the case for other kinds of document. They may also contain jokes or digressions, or material which can only be comprehended in the context of film clips, photographs or recordings which are used in the talk but which of course do not appear in the transcription. Finally, the transcriptions can contain phrases such as "laughter", "applause" or "music" from time to time. These, of course, describe events at the talk itself and are thus not a transcription of anything that was said.

Having decided on the source of documents, three topics were then chosen: AIDS, Climate Change, and Music and Society. For each topic, four TED talks were selected, each having transcripts available in English, German, Italian, Romanian and Spanish. Table 2 lists the selected talks. Ten multiple-choice questions were then devised for each talk. As in the pilot materials, a question always had five candidate answers from which to choose, with one clearly correct answer and four clearly incorrect answers.

Once the questions had been composed in the language of the original author, each was then translated into English. The English versions of the questions and candidate answers were carefully checked by a referee to verify that they were clear, that the intended answer was clearly correct, that the intended answer was in the test document, and that the other candidate answers were clearly incorrect. Questions were modified accordingly. The English versions were then used to translate each question into each of the five languages of the task. The same process was used to translate each candidate answer (five per question) into the five languages.

The result of this process was a set of 120 questions in five languages, each with five multiple-choice answers, also in those five languages. The final step was to check that the answer to each question was in fact present in the test document for all the languages of the task. Occasionally, certain parts of the original English text were left out of the translation in a particular target language, or perhaps modified or interpreted in a particular manner which made the question impossible to answer in that language. In such cases, the question had to be withdrawn from all languages and a new one devised to take its place.

In parallel with the above activity, a background collection was created for each of the three topics, as described in Section 3 above. The questions, test documents and background collections were now ready to be used in the QA4MRE task.

4.2 Questions

Unlike previous campaigns, where the aim was mainly to ask factoid questions involving the extraction of simple information (mainly Named Entities) from large collections of long documents, the intention in QA4MRE was to ask more searching questions which required a deep knowledge of individual short texts. Concerning test queries, as is usual practice in the QA campaign, they were artificially constructed from portions of the text to match the criteria we wanted to test in this task.

The QA4MRE questions were also created taking into consideration different levels of difficulty. They may refer to:
- facts that (as in traditional QA evaluation) are explicitly present in the text
- facts that are explicitly present but are not explicitly related (for example, they do not appear in the same sentence, although any human would understand they are connected)
- facts that are not explicitly mentioned in the text, but that are one inferential step away (as in the RTE challenge)
- facts that are explicitly mentioned in the text but that require some inference to be connected to form the answer

Out of the 120 questions given in the test set, 44 needed some extra information from the background collection in order to be answered, while for 76 questions the information present in the test document alone was enough to select the correct answer. In more detail, as Table 3 shows, 38 questions had the answer contained in the same sentence/paragraph, while for 38 questions the system had to assemble information from different paragraphs in order to answer the question. In addition, questions were posed so that the answers were not merely a mechanical repetition of the input question; all kinds of textual inference could be required, i.e., lexical (acronym, synonymy, hyperonymy-hyponymy), syntactic (nominalization-verbalization, causative, paraphrase, active-passive), and discourse (co-reference, anaphora, ellipsis).

Example questions with their candidate answers.
Type column: CAUSE; COMPOSITE; DEGREEOFTRUTH; FACTOID LOCATION; FACTOID NUMBER; FACTOID PERSON; FACTOID LIST
Topic column: Climate Change; Climate Change; AIDS; AIDS; Music & Society; AIDS

Question: When was a newspaper article published on climate change?
Candidate answers: at the beginning of 2000 / in the 90s / in 1965 / in the 1950s / in 2075

Question: What other information important for climate change is stored in Antarctic ice?
Candidate answers: ozone gasses / a register of ocean currents / the amount of precipitation / crystals / measurements of the Earth's temperature

Question: What consequence would the use of renewable energies have in the US?
Candidate answers: new job opportunities / a higher dependency on fossil fuels / a decrease in toxic dumps / a higher use of clean coal / less responsible use of energy

Question: How are people infected by HIV?
Candidate answers: through aerial transmission / through genetic transmission / through direct contact with infected people / through the faecal-oral route / through sexual intercourse

Question: What is Bono's attitude towards the digital age?
Candidate answers: sorrow / sadness / enthusiasm / indifference / anger

Question: Why did the Bayreuth Festspielhaus have a large orchestra pit?
Candidate answers: to eat, drink and yell out / to be more intricate / to help Mozart / to suggest an encore / to accommodate low-end instruments

Question: Where has carbon from the Earth's atmosphere gone?
Candidate answers: it is still in the atmosphere / to form fossil fuels / to be part of the arctic ice cap / to pollute the air / to create acid rain

Question: What is the worst thing about being poor?
Candidate answers: the enjoyment of music / the satisfaction of playing / the feeling of being no-one / the lack of food / the lack of shelter

Concerning the types of questions which would be asked, it had originally been proposed that there would be four: FACTOID, CAUSE, HYPOTHETICAL and COMPOSITE. However, following the creation of the pilot materials, six further question types were suggested: DEGREE-OF-TRUTH, METHOD, OPINION, PURPOSE, RESULTS and WHICH-IS-TRUE. Furthermore, FACTOIDs are broken down into LOCATION, NUMBERCALC, PERSON, STATED-LIST, TIME and UNKNOWN-TYPE. Examples of the types can be seen in Table 4, with a breakdown by frequency in Table 5. Unlike in previous campaigns, questions were not required to fall into the ten types in a pre-determined distribution. As can be seen in Table 5, about half the questions (64 out of 120) were FACTOID, 17 were CAUSE and 16 were WHICH-IS-TRUE. There were between one and five instances of each of the remaining types. Table 6 shows the proportion of correct answers and of NoA answers given by all systems to each different question type.

Degree-of-truth questions seem to be the easiest type to answer, while composite and hypothetical questions appear to be the most difficult. However, systems seem to be less confident in answering method and opinion questions.

4.3 Tools and Infrastructure

This year too, CELCT developed a set of infrastructures to help manage the QA4MRE exercise. Many processes and requirements had to be dealt with:
- the need to develop a proper and coherent tool for the management of the data produced during the campaign, to store it and make it re-usable, and to facilitate the analysis and comparison of results
- the necessity of assisting the different organizing groups in the various tasks of data set creation, and of facilitating the collection and translation of questions
- the possibility for participants to access the data directly, submit their own runs (which also implied some syntax checks of the format) and, later, view detailed results and statistics.

A series of automatic web interfaces were specifically designed for each of these purposes, with the aim of facilitating the data processing and, at the same time, showing users only what they needed for the task they had to accomplish. So, the main characteristic of these interfaces is their flexibility, centred specifically on the user’s requirements.

While designing the interfaces for question collection and translation, one of the first issues to be dealt with was the fact of having many assessors, a large amount of data, and a long process. The tools therefore had to ensure an efficient and consistent management of the data, allowing:
- Editing of the data already entered at any time.
- Revision of the data by the users themselves.
- Consistency propagation, ensuring that modifications automatically re-model the output in which they are involved.
- Calculation and updating of statistics and evaluation measures in real time.

In particular, ensuring the consistency of data is a key feature in data management. For example, if a typo is corrected in the Translation Interface, the modification is automatically updated also in the GoldStandard files, in the Test Set files, etc.

5. EVALUATION

Participating systems could give one of two possible responses for each question in the test collection:
- to give one answer selected from the five candidate answers of the question
- not to answer the question, if the system considered that it did not have enough evidence to select one of the candidate answers as the correct one. This option is called a NoA answer. In order to evaluate the system's ability to validate its answers, the system could in this case return the candidate answer that it would have selected had it been required to answer the question.

Taking into consideration these two possible responses, each question receives one (and only one) of the following three assessments:
- correct if the system selected the correct answer among the five candidate ones of the given question
- incorrect if the system selected one of the wrong answers
- NoA if the system chose not to answer the question

The evaluation of the output given by participating systems was performed automatically by comparing the answers of systems against the gold standard collection with human-made annotations. No manual assessment was required.
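The automatic assessment described above amounts to a comparison against the gold standard. A minimal sketch follows, assuming a hypothetical response format in which each question identifier maps to a pair `(answered, selected_index)`. It is not the official scoring tool, and treating a missing response as NoA is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# (answered?, index of the selected or candidate answer, or None if empty)
Response = Tuple[bool, Optional[int]]


def assess(response: Response, gold_index: int) -> str:
    """Return exactly one of 'correct', 'incorrect' or 'NoA' for a question."""
    answered, selected = response
    if not answered:
        return "NoA"
    return "correct" if selected == gold_index else "incorrect"


def tally(responses: Dict[str, Response], gold: Dict[str, int]) -> Counter:
    """Count assessments over all gold questions.

    A question with no response at all is treated as NoA with an empty
    candidate (assumption of this sketch).
    """
    return Counter(
        assess(responses.get(qid, (False, None)), gold_index)
        for qid, gold_index in gold.items()
    )


gold = {"q1": 2, "q2": 0, "q3": 4, "q4": 1}
responses = {
    "q1": (True, 2),    # answered, right option
    "q2": (True, 1),    # answered, wrong option
    "q3": (False, 4),   # NoA, with a candidate answer attached
}
counts = tally(responses, gold)
print(counts)
```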

The task developed this year allowed us to evaluate systems from two different perspectives:
1. A question-answering evaluation, like the traditional evaluation performed in past campaigns. In this evaluation, we simply counted answers without grouping them.
2. A reading-test evaluation, obtaining figures for each particular reading test, and for each test as part of a topic.

5.1 Evaluation Measure

The purpose of allowing NoA answers is to reduce the number of incorrect responses, while keeping the number of correct ones, by leaving some questions unanswered. The main evaluation measure for this year's campaign was c@1, which takes into account the option of not answering certain questions. c@1 was first introduced in ResPubliQA 2009 [8] and is fully described in (Peñas and Rodrigo, 2011). The formulation of c@1 is given in (1).

$$\mathrm{c@1} = \frac{1}{n}\left(n_R + n_U \frac{n_R}{n}\right) \qquad (1)$$

where
nR: number of questions correctly answered.
nU: number of questions unanswered.
n: total number of questions.

c@1 acknowledges the option of giving NoA answers in the proportion that a system answers questions correctly, which is measured using accuracy. Thus, a higher accuracy over answered questions would give more value to unanswered questions, and therefore, a higher final c@1 value.
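As a worked illustration of formula (1), the following sketch computes c@1 from the three counts. The function name `c_at_1` and the sample figures are chosen here for illustration; they are not results from the campaign.

```python
def c_at_1(n_r: int, n_u: int, n: int) -> float:
    """c@1 = (nR + nU * nR / n) / n, as in formula (1).

    n_r: questions correctly answered
    n_u: questions left unanswered (NoA)
    n:   total number of questions
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if n_r < 0 or n_u < 0 or n_r + n_u > n:
        raise ValueError("counts are inconsistent with n")
    return (n_r + n_u * n_r / n) / n


# A system answering all 120 questions with 60 correct scores plain accuracy.
all_answered = c_at_1(n_r=60, n_u=0, n=120)

# Same 60 correct answers, but 20 of the remaining questions left unanswered
# instead of being answered wrongly: each NoA is credited with nR / n.
some_unanswered = c_at_1(n_r=60, n_u=20, n=120)

# Leaving everything unanswered earns nothing.
nothing_answered = c_at_1(n_r=0, n_u=120, n=120)

print(all_answered, some_unanswered, nothing_answered)
```

In the second case the 20 unanswered questions contribute 20 × 60/120 = 10, so the score is 70/120 rather than 60/120.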

As a secondary measure, we also provided scores according to accuracy (2), the traditional measure applied to QA evaluations, which does not distinguish between answered and unanswered questions. To obtain accuracy values, we also used the candidate answers given to unanswered questions.

$$\mathrm{accuracy} = \frac{n_R + n_{UR}}{n} \qquad (2)$$

where
nR: number of questions correctly answered.
nUR: number of unanswered questions whose candidate answer was correct.
n: total number of questions.
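Formula (2) can be sketched in the same way. The record format used here, a tuple `(answered, selected, gold)` per question, is a hypothetical one chosen for the example.

```python
from typing import List, Optional, Tuple

# (answered?, selected or candidate option index or None, gold option index)
Record = Tuple[bool, Optional[int], int]


def accuracy(records: List[Record]) -> float:
    """accuracy = (nR + nUR) / n, as in formula (2).

    Candidate answers attached to unanswered questions count as if they
    had been given as answers.
    """
    n = len(records)
    if n == 0:
        raise ValueError("no questions to evaluate")
    n_r = sum(1 for answered, selected, gold in records
              if answered and selected == gold)
    n_ur = sum(1 for answered, selected, gold in records
               if not answered and selected == gold)
    return (n_r + n_ur) / n


records = [
    (True, 1, 1),      # answered correctly          -> counts towards nR
    (True, 0, 1),      # answered incorrectly
    (False, 3, 3),     # NoA, candidate was correct  -> counts towards nUR
    (False, None, 2),  # NoA, empty candidate
]
acc = accuracy(records)
print(acc)
```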

5.2 Question Answering perspective evaluation

A question-answering evaluation has been performed over the whole test collection. This evaluation measures the overall performance of a system, without analyzing the behaviour over a particular reading test. The information taken into account for each system at this level is:

• number of questions ANSWERED
  o number of questions ANSWERED with RIGHT answer
  o number of questions ANSWERED with WRONG answer
• number of questions UNANSWERED
  o number of questions UNANSWERED with RIGHT candidate answer
  o number of questions UNANSWERED with WRONG candidate answer
  o number of questions UNANSWERED with EMPTY candidate answer

In more detail, the evaluation at this level includes:

• Overall c@1 (over the 120 questions of the test collection)
• c@1 per topic (over the 40 questions of each topic)
• Overall accuracy (over the 120 questions of the test collection, considering also the candidate answers given to unanswered questions)
• Proportion of answers correctly discarded (see (3))

$$\mathrm{correctly\_discarded} = \frac{n_{UW} + n_{UE}}{n_{UR} + n_{UW} + n_{UE}} \qquad (3)$$

where:

• n_{UR}: number of unanswered questions whose candidate answer was correct
• n_{UW}: number of unanswered questions whose candidate answer was incorrect
• n_{UE}: number of unanswered questions whose candidate answer was empty
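The counts listed above and formula (3) can be derived from per-question records, as sketched below. The `Response` record layout is a hypothetical one chosen for illustration; it is not the official QA4MRE submission format.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Response(NamedTuple):
    # Hypothetical record layout, assumed for this sketch only.
    answered: bool             # False means the system returned NoA
    candidate: Optional[str]   # chosen option, or None if left empty
    gold: str                  # correct option in the gold standard


def tally(responses: Iterable[Response]) -> Counter:
    """Count nR, nW, nUR, nUW and nUE over a set of responses."""
    counts = Counter()
    for r in responses:
        if r.answered:
            counts["nR" if r.candidate == r.gold else "nW"] += 1
        elif r.candidate is None:
            counts["nUE"] += 1
        else:
            counts["nUR" if r.candidate == r.gold else "nUW"] += 1
    return counts


def correctly_discarded(counts: Counter) -> Optional[float]:
    """Formula (3); returns None when no question was left unanswered."""
    unanswered = counts["nUR"] + counts["nUW"] + counts["nUE"]
    if unanswered == 0:
        return None
    return (counts["nUW"] + counts["nUE"]) / unanswered


# Five made-up responses, one of each kind.
run = [
    Response(True, "2", "2"),    # answered, right
    Response(True, "1", "3"),    # answered, wrong
    Response(False, "4", "4"),   # unanswered, right candidate
    Response(False, "5", "1"),   # unanswered, wrong candidate
    Response(False, None, "2"),  # unanswered, empty candidate
]
counts = tally(run)
ratio = correctly_discarded(counts)
```

In this made-up run, two of the three unanswered questions had a wrong or empty candidate, so the proportion of answers correctly discarded is 2/3. Returning `None` for a run with no unanswered questions is a design choice of this sketch, since formula (3) is undefined in that case.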

5.3 Reading perspective evaluation

The objective of the reading-test evaluation is to offer information about the performance of a system in “understanding” the meaning of each single document. This understanding is evaluated by means of multiple-choice tests consisting of ten questions per document.

This evaluation takes as reference the c@1 values achieved for each test (one document with ten questions about it). The c@1 values were then aggregated at topic and global levels by means of the median, the average and the standard deviation.

The median c@1 has been provided under the consideration that it can be more informative at reading-test level than average values. This is because the median is less affected by outliers than the average, and therefore it offers more information about the ability of a system to understand a text. For example, if we have three high c@1 values in a topic but the last one is very low, the median is hardly affected by this low result (because it is an isolated result in comparison with the other three), while the average accounts for this bad behaviour.
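The difference between the two aggregates can be seen with a small numeric sketch. The four c@1 values below are invented for illustration; four tests per topic follows from 40 questions per topic and ten questions per test.

```python
from statistics import mean, median

# Hypothetical c@1 values for the four reading tests of one topic:
# three reasonably high results and one isolated low result.
topic_scores = [0.6, 0.7, 0.6, 0.1]

topic_median = median(topic_scores)   # middle of the sorted values
topic_average = mean(topic_scores)    # pulled down by the low result
```

For these assumed values the median is 0.6 while the average drops to 0.5, so the single low test lowers the average but leaves the median at the level of the other three tests.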

5.4 Random Baselines

In order to offer some baselines for this task, it must be considered that participating systems can decide whether or not to answer a given question. We therefore first propose a random baseline in which all the questions are answered. This baseline has five possibilities when trying to answer a question: it can select the correct answer to the question, or it can select one of the four incorrect answers. In this case, the expected overall result is 0.2 (both for accuracy and for c@1).
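The 0.2 figure follows from formulas (1) and (2). A short derivation is sketched below, assuming that the baseline picks one of the five options uniformly at random and independently for each question (the text implies this but does not state it explicitly):

```latex
\begin{align*}
\mathbb{E}[n_R] &= \frac{n}{5}, \qquad n_U = 0, \qquad n_{UR} = 0 \\
\mathbb{E}[\mathrm{accuracy}] &= \frac{\mathbb{E}[n_R] + n_{UR}}{n} = \frac{1}{5} = 0.2 \\
\mathbb{E}[\mathrm{c@1}] &= \frac{1}{n}\left(\mathbb{E}[n_R] + n_U \, \frac{\mathbb{E}[n_R]}{n}\right) = \frac{1}{5} = 0.2
\end{align*}
```

Because no question is left unanswered, the second term of c@1 vanishes and both measures reduce to the fraction of correct answers.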

6. PARTICIPATION and RESULTS

Out of the 25 groups which had previously registered and signed the license agreement to download the background collections, a total of 12 groups participated in the QA4MRE tasks, submitting 62 runs in 3 different languages (German, English, and Romanian). Table 7 shows the runs submitted in each language. No runs were submitted in Italian or, quite surprisingly, in Spanish (usually the second most chosen language). All runs were monolingual; no team attempted a cross-language task. This was probably because crossing the language boundary is currently not core to the task, even though multilinguality is directly addressed through the provision of collections and tests in five languages.

Participants were allowed to submit a maximum of 10 runs. The first run was to be produced using nothing more than the knowledge provided in the background collections. Additional runs could include other sources of information, e.g. ontologies, rule bases, web, Wikipedia, etc., or other types of inferences. All resources used to acquire the knowledge were to be listed in the submission file.

Besides specifying the resources used, systems were also required to list the document(s) and sentence(s) that helped them (directly or indirectly) to identify the correct answer. Such provenance was not used for formal evaluation, but for informal analysis and discussion.

As usual, the vast majority of the runs were in English, as Table 7 shows. The list of participating teams and the references to their reports are shown in Table 8. Besides Europe, participants also came from the USA, China and India. Concerning the overall difficulty of the exercise, Topic 3 was the easiest and Topic 1 was the hardest, but the range of difficulty is not large, as Table 10 demonstrates. So, the three topics look fairly balanced. Also, average performance does not greatly exceed the random baseline (0.2). The following three tables (14, 15 and 16) show the best run for each participating group, reporting the mean of the tests for each topic. Except for one case, the overall mean is higher than the baseline.

Table 14: Results for English

[Columns: Topic 1, Topic 2, Topic 3, Overall]

As for system performance at the question-answering evaluation level, we can see that only one team (jucs) scores above 50%, which leaves considerable room for improvement.

From a reading-test perspective, in general no group passed the reading tests, and all systems seem to be very close to random guessing. Overall results at reading-test level, i.e., median, average, and standard deviation for all runs, are given in Appendix 1.

[German runs, in table order: System combination, uhei1109dede, uhei1102dede, loga1101dede, loga1102dede, uhei1103dede, random baseline, uhei1106dede, uhei1104dede, uhei1108dede, uhei1105dede, uhei1107dede, uhei1101dede]

[Romanian runs, in table order: System combination, uaic1107roro, uaic1101roro, uaic1109roro, uaic1103roro, uaic1104roro, uaic1106roro, random baseline, uaic1108roro, uaic1105roro, uaic1102roro]

A summary of the applied methods and techniques reported by participants is given in Tables 17, 18 and 19 in Appendix 2.

8. RELATED WORK

The current state of development of NLP technologies offers a good opportunity for proposing an evaluation of MR systems. The opportunity arises from the clear evolution of NLP systems towards a deeper level of text analysis that allows a better understanding of documents. In fact, interest in MR among research groups around the world has increased recently, as the creation of the MR program at DARPA¹ testifies. The large community involved in Machine Reading is searching for a way to evaluate its systems. But the problem of how to evaluate these machines is still an open research issue.

Over the last years, the QA Track at CLEF has changed its evaluation methodology in order to promote deeper text understanding. Clearly, the task of retrieving just text excerpts (facts, sentences, paragraphs or documents) is not enough to develop the technology. Besides QA, other evaluation activities were also performed which required deeper analyses of texts, for example Recognizing Textual Entailment (RTE), Answer Validation (AV), and Knowledge Base Population (KBP).

Question Answering: a system receives questions formulated in natural language and returns one or more exact answers to these questions, possibly with the locations from which the answers were drawn as justification. The evaluation of QA systems began at the Text Retrieval Conference (TREC)², and was continued at the Cross Language Evaluation Forum (CLEF)³ in the EU, and at the NII-NACSIS Test Collection for IR Systems (NTCIR)⁴ in Japan. Most of the questions used in these evaluations ask about facts (e.g. Who is the president of XYZ?) or definitions (e.g. What does XYZ mean?). Since systems could search for answers among several documents (using IR engines), it was generally possible to find in some document a “system-friendly” statement that contained exactly the answer information stated in an easily matched form. This made QA both shallow and relatively easy.

Recognizing Textual Entailment (RTE): a system must decide whether the meaning of a text (the Text T) entails the meaning of another text (the Hypothesis H), that is, whether the meaning of the hypothesis can be inferred from the meaning of the text [4]. RTE systems have been evaluated at the RTE Challenges, whose first competition was proposed in 2005. The RTE Challenges encourage the development of systems that have to treat different semantic phenomena.

Answer Validation Exercise (AVE) [5, 6, 7]: a combination of QA and RTE evaluations. Answer Validation (AV) is the task of deciding, given a question and an answer from a QA system, whether the answer is correct or not. AVE was a task focused on the evaluation of AV systems, and it was defined as an RTE problem in order to promote a deeper analysis in QA.

Another application of RTE, similar to AVE, in the context of Information Extraction was performed in a pilot task at RTE-6⁵ with the aim of studying the impact of RTE systems on Knowledge Base Population (KBP)⁶. The objective of this pilot task is to validate the output of participant systems at the KBP slot filling task that was held at the Text Analysis Conference (TAC)⁷. Systems participating in the KBP slot filling task must extract from documents some values for a set of attributes of a certain entity. Given the output of participant systems at KBP, the RTE KBP validation pilot consists of deciding whether each of the values detected for an entity is correct according to the supporting document. To take this decision, participant systems at the RTE KBP validation pilot receive a set of T-H pairs, where the hypothesis is built by combining an entity, an attribute and a value.

Other efforts closer to our proposal for evaluating system understanding have taken place, such as the “ANLP/NAACL 2000 Workshop on Reading comprehension tests as evaluation for computer-based language understanding systems”⁸. This workshop proposed to evaluate understanding systems by means of Reading Comprehension (RC) tests. The evaluation consisted of a set of texts and a series of questions about each text. Quite interestingly, most of the approaches presented at that workshop showed how to adapt QA systems to this kind of evaluation.

A more complete evaluation methodology for MR systems has been reported in [7], where the authors also proposed to use RC tests. However, the objective of these tests was to extract correct answers from documents, which is similar to QA without an IR engine.

A natural step in this area is an evaluation methodology that requires a deeper level of inference and of analysis of text.

¹ http://www.darpa.mil/ipto/programs/mr/mr.asp
² http://trec.nist.gov/
³ http://www.clef-campaign.org/
⁴ http://research.nii.ac.jp/ntcir/
⁵ http://www.nist.gov/tac/2010/RTE/index.html
⁶ http://nlp.cs.qc.cuny.edu/kbp/2010/
⁷ http://www.nist.gov/tac/2010/
⁸ http://www.aclweb.org/anthology/W/W00/#0600

9. CONCLUSIONS

This year, the QA @ CLEF task was characterized by a major innovation, namely the transition from the traditional Question Answering (QA) task, proposed in the last eight QA challenges at CLEF, to a new evaluation focused on the reading of a single document. The main reason behind this choice was the feeling that most systems were ready to make a definitive move towards a deeper understanding of the text. Over the years, the QA challenges adopted simple questions which required almost no inferences to find the correct answers. These surface-level evaluations have promoted QA architectures based on Information Retrieval (IR) techniques, in which the final answer(s) is/are obtained after focusing on selected portions of retrieved documents and matching sentence fragments or sentence parse trees. No real understanding of documents was performed, since none was required by the evaluation. Machine Reading (MR), instead, requires the automatic understanding of texts at a deeper level, so this methodology encourages the development of systems able to perform a deep analysis of the text.

One way of evaluating the understanding of a text is to assess the ability to answer a set of questions about it. In particular, reading comprehension tests are designed to measure how well human readers understand what they read. Each text comes with a set of questions about information that is stated or implied in the text.

The objectives of the task are twofold: (i) to propose a task where a deeper level of understanding is required, and (ii) to extract the knowledge contained in texts as a way to improve the performance of systems where some kinds of reasoning are required. Hence, the development of MR technologies should be fostered and the number of groups interested in the task should increase. This is also an opportunity to create a common framework and community in the field of text understanding.

ACKNOWLEDGMENTS

Special thanks are also due to Giovanni Moretti (CELCT, Trento, Italy) for the technical support in the management of all data of the campaign.

This work has been partially supported by the Research Network MA2VICMR (S2009/TIC-1542) and Holopedia project (TIN2010-21128-C02).

APPENDIX 1: Overall results at reading test level: Median, Average, and Standard Deviation for all runs

[Columns: RUN_NAME; Overall Median; Overall Average; Overall Standard Deviation; Median, Average and Standard Deviation for each of Topic 1, Topic 2 and Topic 3]

[Runs: base1101enen, diue1101enen, diue1102enen, fdcs1102enen, fdcs1103enen, ifln1101enen, ifln1102enen, ifln1104enen, ifln1105enen, iles1101enen, iles1102enen, iles1103enen, iles1104enen, iles1105enen, iles1106enen, iles1107enen, iles1108enen, iles1109enen, iles1110enen, jucs1101enen, jucs1102enen, jucs1103enen, jucs1104enen, jucs1105enen, jucs1106enen, jucs1107enen, loga1101dede, loga1102dede, swai1101enen, swai1102enen, swai1103enen, swai1104enen, swai1105enen]

[Tables 17-19: methods and techniques reported by each system: base, diue, fdcs, ifln, iles, jucs, loga, swai, uaic, uhei, uned, vens]

REFERENCES

1. Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo. Overview of the Answer Validation Exercise 2007. In C. Peters, V. Jijkoun, Th. Mandl, H. Müller, D.W. Oard, A. Peñas, V. Petras, and D. Santos (Eds.): Advances in Multilingual and Multimodal Information Retrieval. LNCS 5152. September 2008.

2. Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo. Overview of the Answer Validation Exercise 2006. In C. Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, M. Stempfhuber (Eds.): Evaluation of Multilingual and Multi-modal Information Retrieval. 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20-22, 2006. Revised Selected Papers.

3. Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo. Overview of the Answer Validation Exercise 2008. In C. Peters, Th. Mandl, V. Petras, A. Peñas, H. Müller, D.W. Oard, V. Jijkoun, D. Santos (Eds.): Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008. Revised Selected Papers.

4. Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL Recognising Textual Entailment Challenge. In Lecture Notes in Computer Science, volume 3944, pages 177-190. Springer, 2005.

5. Oren Etzioni, Michele Banko, and Michael J. Cafarella. Machine Reading. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.

6. Ellen Voorhees and Dawn M. Tice. The TREC-8 Question Answering Track Evaluation. In Text Retrieval Conference TREC-8, pages 83-105, 1999.

7. B. Wellner, L. Ferro, W. Greiff and L. Hirschman. Reading Comprehension Tests for Computer-based Understanding Evaluation. Nat. Lang. Eng. 12, 4, 305-334, 2006.

8. Anselmo Peñas and Álvaro Rodrigo. A Simple Measure to Assess Non-response. In Proceedings of 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT 2011). Portland, Oregon, USA, June 19-24, 2011.

10. David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3).

11. Roser Morante and Walter Daelemans. Annotating Modality and Negation for a Machine Reading Evaluation. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

12. Juan Martinez-Romo and Lourdes Araujo. Graph-based Word Clustering Applied to Question Answering and Reading Comprehension Tests. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

13. Adrian Iftene, Alexandru-Lucian Gînscă, Alex Moruz, Diana Trandabăt, Maria Husarciuc. Question Answering for Machine Reading Evaluation on Romanian and English. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

14. Gaurav Arora. Cosine similarity as Machine Reading Technique. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

15. Suzan Verberne. Retrieval-based Question Answering for Machine Reading Evaluation. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

16. Partha Pakray, Pinaki Bhaskar, Somnath Banerjee, Bidhan Chandra Pal, Alexander Gelbukh and Sivaji Bandyopadhyay. JU_CSE_TE: System Description QA4MRE@CLEF 2011. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

17. Svitlana Babych, Alexander Henn, Jan Pawellek, and Sebastian Padò. Dependency-Based Answer Validation for German. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

18. Ling Cao, Xipeng Qiu and Xuanjing Huang. Question Answering for Machine Reading with Lexical Chain. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.

19. Ingo Glockner, Bjorn Pelzer, and Tiansi Dong. The LogAnswer Project at QA4MRE 2011. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September, Amsterdam, The Netherlands. Online Proceedings.