<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of QA4MRE at CLEF 2011: Question Answering for Machine Reading Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anselmo Peñas</string-name>
          <email>anselmo@lsi.uned.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard Hovy</string-name>
          <email>hovy@isi.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pamela Forner</string-name>
          <email>forner@celct.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Álvaro Rodrigo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Sutcliffe</string-name>
          <email>richard.sutcliffe@ul.ie</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corina Forascu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caroline Sporleder</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Al. I. Cuza University of Iasi</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CELCT</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Sciences Institute of the University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Saarland University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Limerick</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>This paper describes the first steps towards developing a methodology for testing and evaluating the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the aim of the QA4MRE challenge, which was run as a Lab at CLEF 2011. This year a major innovation was introduced: the traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts, and in which systems were required to choose one answer by analysing the corresponding test document in conjunction with the background collections provided by the organization. Besides the main task, one pilot task was also offered, namely Processing Modality and Negation for Machine Reading. This task was aimed at evaluating whether systems were able to understand extra-propositional aspects of meaning such as modality and negation. This paper describes the preparation of the data sets, the creation of the background collections to allow systems to acquire the required knowledge, the metric used for the evaluation of the systems' submissions, and the results of this first attempt. Twelve groups participated in the task, submitting a total of 62 runs in three languages: English, German and Romanian.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Machine Reading (MR) is defined as a task that deals with the automatic understanding of texts. The evaluation
of this “automatic understanding” can be approached in two ways. The first is to define a formal language
(target ontology), ask the systems to translate texts into the formal language representation, and then evaluate
systems by using structured queries formulated in the formal language. The second approach is agnostic to any
particular representation of the text: systems are queried about the text with natural language questions. The
first option is the one taken by Information Extraction. The second is related to how Question Answering (QA) has
been articulated over the last decade. In this evaluation we follow the second approach, but with a significant
change with respect to previous QA campaigns. Why?</p>
      <p>
        By 2005 we realized that system performance had an upper bound of 60% accuracy, even though
more than 80% of the questions were answered by at least one participant. We understood that we had a problem
of error propagation in the traditional QA pipeline (Question Analysis, Retrieval, Answer Extraction, Answer
Selection/Validation). Thus, in 2006 we proposed a pilot task called the Answer Validation Exercise (AVE). The
aim was to produce a change in QA architectures, giving more responsibility to the validation step. In AVE we
assumed there was a previous step of hypothesis over-generation and that the hard work was in the validation step.
This is a kind of classification task that could take advantage of Machine Learning. The same idea is behind the
architecture of IBM’s Watson (DeepQA project), which successfully competed at Jeopardy
        <xref ref-type="bibr" rid="ref9">(Ferrucci et al., 2010)</xref>
        .
      </p>
      <p>
        After the three editions of AVE we tried to transfer our conclusions to the main QA task at CLEF 2009
and 2010. The first step was to introduce the option of leaving questions unanswered. This is related to the
development of validation technologies. We needed a measure able to reward systems that reduce the number of
questions answered incorrectly, without affecting system accuracy, by leaving unanswered the questions they
estimated they could not answer. The measure was an extension of accuracy called c@1
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref17 ref18 ref8">(Peñas and Rodrigo,
2011)</xref>
        , tested during the 2009 and 2010 QA campaigns at CLEF and also used in the current evaluation.
      </p>
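      <p>To make the effect of leaving questions unanswered concrete, the following is a minimal sketch of c@1, assuming the definition given in Peñas and Rodrigo (2011): c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of questions answered correctly, n_U the number of questions left unanswered, and n the total number of questions. The function name c_at_1 and its argument names are illustrative only and are not part of the campaign's evaluation software.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """Compute c@1 = (n_R + n_U * n_R / n) / n.

    n_correct    -- n_R, questions answered correctly
    n_unanswered -- n_U, questions deliberately left unanswered
    n_total      -- n, all questions in the test set
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0 or n_correct + n_unanswered > n_total:
        raise ValueError("inconsistent counts")
    # Each unanswered question is credited with the accuracy observed over
    # the whole test set, so abstaining beats answering wrongly.
    accuracy = n_correct / n_total
    return (n_correct + n_unanswered * accuracy) / n_total


if __name__ == "__main__":
    # 120 questions, 60 correct, 60 wrong, none unanswered: plain accuracy.
    print(c_at_1(60, 0, 120))
    # Same 60 correct, but 30 of the wrong answers withheld instead.
    print(c_at_1(60, 30, 120))
```
]]></preformat>
      <p>With no unanswered questions the value reduces to plain accuracy, and a system that answers nothing at all scores zero, so the measure only rewards abstention that accompanies correct answers.</p>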
      <p>However, this change was not enough. Almost all systems continued to use IR engines to retrieve
relevant passages and then tried to extract the exact answer from them. This was not the change in architecture we
expected and, again, results did not go beyond the 60% pipeline upper bound. Finally, we understood that the
change in architecture requires the prior development of answer validation/selection technologies. For this
reason, in the current formulation of the task, the retrieval step is put aside for a while, and the focus is on the
development of technologies able to work with a single document and answer questions about it.</p>
      <p>The idea of a hypothesis generation and validation architecture is applicable to the new setting, where only
one document is considered, but of course the generation of hypotheses would be very limited if one
considered only the given document. Systems should draw on a large collection related to the given document for the
task of hypothesis generation. The validation must then be performed against the given document.</p>
      <p>In the new setting, we again started by decomposing the problem into generation and validation. Thus,
in this first edition, we test systems only on the validation step. Together with the questions, the
organization provides a set of candidate answers. In addition, in this first edition, systems know there is one and
only one correct answer among the candidates. This gives the evaluation the format of traditional Multiple
Choice Reading Comprehension tests. From this starting point, a natural roadmap could be the following:</p>
      <p>1. Focus on validation: questions have a set of candidate answers attached.
a. Step 1. All questions have one and only one correct candidate answer.
b. Step 2. Introduce questions that require inference (e.g. about time and space).
c. Step 3. Introduce questions with no correct candidate answer.
d. Step 4. Introduce questions that require textual inference after reading a large set of
documents related to the test (e.g. expected actions of agents with a particular role, etc.)</p>
      <p>2. Introduce hypothesis generation: the organization provides reference collections of documents related
to the tests.
a. Step 5. Questions about a single document, but no candidate answers are provided.
b. Step 6. Full QA setting, where systems have to generate hypotheses considering the
reference collection and provide the answer together with the set of documents that
support the answer.</p>
      <p>We are just at the beginning of this roadmap, giving space and resources for the evaluation of new QA
systems with new architectures. The success of this new initiative is only measurable by the development of
these new architectures able to produce a qualitative jump in performance. This vision will guide the concrete
definition of the tasks year by year.</p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>The QA4MRE 2011 task focuses on the reading of single documents and the identification of the
answers to a set of questions. Questions are multiple choice, each having five options and only
one correct answer. Detecting the correct answer may require various kinds of inference and the
consideration of previously acquired background knowledge from reference document collections. Although the
additional knowledge obtained through the background collection may be used to assist with answering the
questions, the principal answer is to be found among the facts contained in the test documents given. Thus,
reading comprehension tests do not require only semantic understanding; they assume a cognitive process
which involves using implications and presuppositions, retrieving stored information, and performing inferences
to make implicit information explicit. Many different forms of knowledge take part in this process: linguistic,
procedural, and world and common-sense knowledge. All these forms coalesce in the memory of the reader, and it is
sometimes difficult to clearly distinguish and reconstruct them in a system which needs additional knowledge
and inference rules in order to understand the text and to give sensible answers.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Main Task</title>
      <p>By giving only a single document per test, systems are required to understand every statement and to form
connections across statements in case the answer is spread over more than one sentence. Systems are requested to
(i) understand the test questions, (ii) analyze the relation between entities contained in questions and entities
expressed by the candidate answers, (iii) understand the information contained in the documents, (iv) extract
useful pieces of knowledge from the background collections, and (v) select the correct answer from the five
alternatives proposed.</p>
      <p>Tests were organized as follows:
- 3 topics, namely “AIDS”, “Climate Change” and “Music and Society”
- Each topic had 4 reading tests
- Each reading test consisted of one single document, with 10 questions and a set of five choices per
question.</p>
      <p>Overall, the evaluation in this campaign comprised:
- 12 test documents (4 documents for each of the three topics)
- 120 questions (10 questions for each document)
- 600 choices/options (5 for each question)</p>
      <p>Test documents and questions were made available in English, German, Italian, Romanian, and Spanish. These
materials were exactly the same in all languages, created using parallel translations.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Pilot Exercises</title>
      <p>Besides the main task, one pilot task was also offered this year at QA4MRE: Processing Modality
and Negation for Machine Reading [11]. It was coordinated by CLiPS, a research center associated with the
University of Antwerp, Belgium. The task was aimed at evaluating whether systems are able to understand
extra-propositional aspects of meaning such as modality and negation. Modality is a grammatical category that
allows speakers to express aspects of their attitude towards their statements. Modality
understood in a broader sense is also related to the expression of certainty, factuality, and evidentiality.
Negation is a grammatical category that allows changing the truth value of a proposition. Modality and
negation interact to express extra-propositional aspects of meaning. More information is available at
http://www.cnts.ua.ac.be/BiographTA/qa4mre.html.</p>
      <p>The pilot task used the same topics and background collections as the main exercise. Test documents,
however, were specifically selected in order to ensure the properties required for the questionnaires. The pilot
task was offered in English only.</p>
    </sec>
    <sec id="sec-5">
      <title>3. THE BACKGROUND COLLECTIONS</title>
      <p>One focus of the task is the ability to extract different types of knowledge and to combine them in order to
answer the questions. To allow all systems to acquire the same background knowledge, ad hoc collections
were created. At an early stage, a background collection related to the renewable energy domain was
released to participants together with some sample data. This sample background collection, of about
11,000 documents, was in English only. For the real test, three background collections (one for each of the
topics) were released in all the languages involved in the exercise, i.e., English, German, Italian, Spanish and
Romanian. Overall, fifteen large repositories were created as sources of “background knowledge” to enable
systems to infer information that is implicit in the text. These background collections are comparable (but not identical),
topic-related (but not specialized) collections, made available to all participants at the beginning of April upon
signing a license agreement. Thus, systems could “learn” and acquire knowledge in one language or several.</p>
      <p>The only way to acquire large comparable corpora in the three domains in which we were interested was to crawl
the web. Crawling refers to the acquisition of material specific to a given subject from the Web. The Web, with
its vast volumes of data in almost any domain and language, is an obvious source of naturally occurring texts.
To this end, a web crawler was specifically created by CELCT in order to gather domain-specific texts from the
Web.</p>
      <p>As for the distribution of documents among the collections, the final number of documents fetched for
each language collection was different, but this is assumed to reflect the real distribution. Table 1 shows the
sizes of the corpora which were acquired and the number of documents contained in each language background
collection for each of the three topics.
The corpora obtained from the crawling process contain a set of documents which are related to the test
documents. Unfortunately, the proportion of noisy documents introduced is unknown.</p>
      <p>As a final step, in order to ensure that each language background collection really contained documents
which supported the inferences required by the questions, each language organizer was also asked to search
the web manually for such documents in their own language and to add them to the respective language collection.
A list of the documents to look for was provided by the question creators to each language group.</p>
      <p>Once all collections were ready in all languages, the zipped files were transferred to the CELCT FTP server.
All documents inside each collection were then renumbered, giving each a progressive unique identifier.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Keywords and Crawling</title>
      <p>A web crawler is a relatively simple automated program, or script, that methodically scans or "crawls" through
Internet pages to create an index of the data it's looking for.</p>
      <p>The QA4MRE crawler is a flexible application designed to download a large number of documents
from the World Wide Web around a specified list of keywords. It was developed using the Google API,
downloads documents in ranked order, and obeys the Robots Exclusion Standard. After downloading,
documents are converted to .txt format and each text is named according to the source from which it was
downloaded, for example: “articles.latimes.com_68”.</p>
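      <p>The naming convention can be illustrated with a short sketch. It is not the CELCT crawler's code: it assumes that a name is formed from the host part of the source URL followed by an underscore and a running number, and that the number is counted per host. The class SourceNamer and the example URLs are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from urllib.parse import urlparse


class SourceNamer:
    """Assign names of the form '<host>_<n>' to downloaded documents.

    Assumption: the counter is kept per host; the paper does not say
    whether the number is per source or global.
    """

    def __init__(self) -> None:
        self._counts = Counter()

    def name_for(self, url: str) -> str:
        host = urlparse(url).netloc  # e.g. 'articles.latimes.com'
        self._counts[host] += 1
        return f"{host}_{self._counts[host]}"


if __name__ == "__main__":
    namer = SourceNamer()
    # Hypothetical URLs, used only to show the resulting file names.
    print(namer.name_for("http://articles.latimes.com/2010/a.html"))
    print(namer.name_for("http://articles.latimes.com/2010/b.html"))
```
]]></preformat>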
      <p>Keywords play a central role in the crawling process, as they are used to acquire the seed URLs.
Before the final set of keywords was fixed, all those in charge of creating the respective language collections
experimented with a preliminary pool of keywords and suggested changes to the others. Then, once the sets of
keywords were standardised in English, they were translated into the other languages and loaded into CELCT’s
crawler. Keywords must not be too generic, and combinations of keywords that restrict the domain helped to
retrieve relevant documents. Synonyms or words with very similar meanings (for example, “climate
change” and “climate variability”, or “carbon dioxide” and “CO2”) were kept as separate queries, as the
documents obtained could be different. Acronyms were always expanded (for example,
Joint United Nations Programme on HIV/AIDS (UNAIDS)), and both forms were entered in the same query into the crawler.</p>
      <p>In addition, as building a comparable corpus requires control over the selection of source texts in the
various languages, each language group was asked to prepare a list of trusted web sites (around
40) which were likely to have plenty of documents related to the topic in their own language.
This was required as a way to increase the number of relevant documents while avoiding the introduction of noise (or virus
files). The longer the list of domains, the higher the number of documents which could be downloaded for
each single query. Texts were drawn from a wide range of sources, e.g. newspapers, newswire, web, journals,
blogs, Wikipedia entries, etc.</p>
      <p>All keywords and all domains were entered in one crawling run. This made it possible to remove
duplicate URLs retrieved by different queries: the URLs already encountered were kept in memory, so that every
URL was visited only once. On average, it took 2-3 days to build one background collection for one topic.</p>
      <p>Other parameters could also be set, namely the number of documents to be downloaded for each single
query. By default it was set to 1000 since, due to Google restrictions, this is the maximum number of documents
per query which can be downloaded for a specified source/domain. For the English language, this parameter was
set to 500. In an attempt to reduce the number of index pages and other useless files in the corpus,
documents which were too short were automatically discarded by setting the minimum document length to
1000 characters. For the English language it was set to 1500.</p>
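      <p>The two filtering rules described in this section (visit each URL only once, and discard documents below a minimum length) can be sketched as follows. This is an illustration under simplifying assumptions, not the CELCT crawler: it takes already-fetched (URL, text) pairs rather than querying a search API, and the function name filter_downloads and its parameter names are hypothetical. The default of 1000 characters and the English setting of 1500 are the values reported above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, Iterator, Tuple


def filter_downloads(
    pages: Iterable[Tuple[str, str]],
    min_chars: int = 1000,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs that pass both crawler filters.

    pages     -- candidate (url, plain text) pairs from all queries
    min_chars -- minimum document length in characters
                 (1000 by default, 1500 for English in the campaign)
    """
    seen_urls = set()  # URLs encountered so far, kept in memory
    for url, text in pages:
        if url in seen_urls:
            continue  # same URL returned by another query
        seen_urls.add(url)
        if len(text) < min_chars:
            continue  # too short: likely an index page or similar
        yield url, text


if __name__ == "__main__":
    sample = [
        ("http://example.org/a", "x" * 1200),
        ("http://example.org/a", "x" * 1200),  # duplicate URL
        ("http://example.org/b", "x" * 10),    # too short
        ("http://example.org/c", "x" * 1600),
    ]
    print([url for url, _ in filter_downloads(sample)])
    print([url for url, _ in filter_downloads(sample, min_chars=1500)])
```
]]></preformat>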
    </sec>
    <sec id="sec-7">
      <title>4. TEST SET PREPARATION</title>
      <p>As we have seen, the task this year was to answer a series of multiple choice tests, each based on a short
document.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Test Documents</title>
      <p>In order to allow participants to tune their systems, a set of pilot data was first devised. This consisted of three
English documents concerned with the topic of renewable energy taken from Green Blog
(http://www.greenblog.org/) together with three sets of questions, one for each document, and a background collection of about
11,000 documents. For each document there were ten multiple choice questions; each question had five
candidate answers, one clearly correct answer and four clearly incorrect answers. The task of each system was
therefore to choose one answer for each question, by analysing the corresponding test document in conjunction
with the background collection.</p>
      <p>Following the creation of the pilot data, attention turned to the materials for the actual evaluation.
The languages this year were English, German, Italian, Romanian and Spanish. The intention was to set identical
questions for these five languages. This required access to a suitable parallel collection of documents,
so that each test document was exactly translated into each language of the task. Unfortunately, even after
decades of interest in parallel corpora, very few publicly available high-quality collections exist in these five
languages. The main possibilities available to us were "Eurobabble" and technical manuals, but each was
somewhat unsuitable for the task. Another option was for us to commission special translations of selected
documents in, say, English, just for the purposes of QA4MRE.</p>
      <p>After some consideration, we took up a suggestion of Igal Gabbay to use documents taken from the
Technology, Entertainment, Design (TED) conferences (www.ted.com). Each TED event consists of a series of
invited presentations by prestigious speakers from fields such as politics, entertainment and industry. The
speakers are fluent, persuasive, and mostly speak from memory with no repetition or hesitation. Each talk lasts
for twenty minutes or less and is aimed at a non-specialised but reasonably educated audience. The organisers
provide a high-quality text transcription for each talk. In the case of the talks used, this ranges in length between
1125 and 3580 words. They also provide an infrastructure for the transcriptions to be translated by
volunteers. These translations are carefully refereed and are generally of very high quality. The number of
languages in which a talk is available varies, depending on its popularity, but is typically 20-40.</p>
      <p>From the perspective of QA4MRE, TED transcriptions have some good points and some bad ones. On the one
hand, they are of high typographical and syntactic quality, they discuss clearly-defined topics, they are at a
reasonable intellectual level, they are available translated accurately into many languages and they are of course
publicly available. On the other hand, they are on the short side and, length for length, contain fewer facts
amenable to the generation of questions than might be the case for other kinds of document. They may also
contain jokes or digressions, or material which can only be comprehended in the context of film clips,
photographs or recordings which are used in the talk but which of course do not appear in the transcription.
Finally, the transcriptions can contain phrases such as "laughter", "applause" or "music" from time to time.
These, of course, describe events at the talk itself and are thus not a transcription of anything that was said.</p>
      <p>Having decided on the source of documents, three topics were then chosen: AIDS, Climate Change, and Music
and Society. For each topic, four TED talks were selected, each having transcripts available in English, German,
Italian, Romanian and Spanish. Table 2 lists the selected talks. Ten multiple-choice questions were then devised
for each talk. As in the pilot materials, a question always had five candidate answers from which to choose, with
one clearly correct answer and four clearly incorrect answers.</p>
      <p>Once the questions had been composed in the language of the original author, each was then translated
into English. The English versions of the questions and candidate answers were carefully checked by a referee to
verify that they were clear, that the intended answer was clearly correct, that the intended answer was in the test
document, and that the other candidate answers were clearly incorrect. Questions were modified accordingly.
The English versions were then used to translate each question into each of the five languages of the task. The
same process was used to translate each candidate answer (five per question) into the five languages.</p>
      <p>The result of this process was a set of 120 questions in five languages, each with five multiple-choice
answers, also in those five languages. The final step was to check that the answer to each question was in fact
present in the test document for all the languages of the task. Occasionally, certain parts of the original English
text were left out of the translation in a particular target language, or perhaps modified or interpreted in a
particular manner which made the question impossible to answer in that language. In such cases, the question
had to be withdrawn from all languages and a new one devised to take its place.</p>
      <p>In parallel with the above activity, a background collection was created for each of the three topics, as
described in Section 3 above. The questions, test documents and background collections were now ready to be
used in the QA4MRE task.</p>
    </sec>
    <sec id="sec-9">
      <title>4.2 Questions</title>
      <p>Unlike previous campaigns, where the aim was mainly to ask factoid questions involving the extraction of
simple information (mainly Named Entities) from large collections of long documents, the intention in
QA4MRE was to ask more searching questions which required a deep knowledge of individual short texts.
As is usual practice in QA campaigns, the test questions were artificially constructed from portions
of the text to match the criteria we wanted to test in this task.</p>
      <p>The QA4MRE questions were also created taking into consideration different levels of difficulty. They may refer
to:
- facts that (as in traditional QA evaluation) are explicitly present in the text
- facts that are explicitly present but are not explicitly related (for example, they do not appear in the
same sentence, although any human would understand they are connected)
- facts that are not explicitly mentioned in the text, but that are one inferential step away (as in the RTE
challenge)
- facts that are explicitly mentioned in the text but that require some inference to be connected to form the
answer</p>
      <p>Out of the 120 questions in the test set, 44 needed some extra information from the background
collection in order to be answered, while for 76 questions the information present in the test document alone was
enough to select the correct answer. In more detail, as Table 3 shows, 38 questions had the answer contained in
the same sentence/paragraph, while for 38 questions the system had to assemble information from different
paragraphs in order to answer the question. In addition, questions were posed so that the answers were not
merely a mechanical repetition of the input question; all kinds of textual inference could be required, i.e.,
lexical (acronym, synonymy, hyperonymy-hyponymy), syntactic (nominalization-verbalization, causative,
paraphrase, active-passive), and discourse (co-reference, anaphora, ellipsis).</p>
      <p>Table 4 (examples of question types; columns Type and Topic, followed by each question and its five candidate answers).
Type labels appearing in the table: CAUSE, COMPOSITE, DEGREEOFTRUTH, FACTOID LOCATION, FACTOIDNUMBER, FACTOIDPERSON, FACTOIDLIST.
Topic labels appearing in the table: Climate Change, AIDS, Music &amp; Society.
Questions and candidate answers:
- When was a newspaper article published on climate change? at the beginning of 2000 / in the 90s / in 1965 / in the 1950s / in 2075
- What other information important for climate change is stored in Antarctic ice? ozone gasses / a register of ocean currents / the amount of precipitation / crystals / measurements of the Earth's temperature
- What consequence would the use of renewable energies have in the US? new job opportunities / a higher dependency on fossil fuels / a decrease in toxic dumps / a higher use of clean coal / less responsible use of energy
- How are people infected by HIV? through aerial transmission / through genetic transmission / through direct contact with infected people / through the faecal-oral route / through sexual intercourse
- What is Bono's attitude towards the digital age? sorrow / sadness / enthusiasm / indifference / anger
- Why did the Bayreuth Festspielhaus have a large orchestra pit? to eat, drink and yell out / to be more intricate / to help Mozart / to suggest an encore / to accommodate low-end instruments
- Where has carbon from the Earth's atmosphere gone? it is still in the atmosphere / to form fossil fuels / to be part of the arctic ice cap / to pollute the air / to create acid rain
- What is the worst thing about being poor? the enjoyment of music / the satisfaction of playing / the feeling of being no-one / the lack of food / the lack of shelter</p>
      <p>Concerning the types of questions which would be asked, it had originally been proposed that there would be
four: FACTOID, CAUSE, HYPOTHETICAL and COMPOSITE. However, following the creation of the pilot
materials, six further question types were suggested: DEGREE-OF-TRUTH, METHOD, OPINION, PURPOSE,
RESULTS and WHICH-IS-TRUE. Furthermore, FACTOIDs are broken down into LOCATION,
NUMBERCALC, PERSON, STATED-LIST, TIME and UNKNOWN-TYPE. Examples of the types can be seen in Table 4,
with a breakdown by frequency in Table 5. Unlike in previous campaigns, questions were not required to fall
into the ten types in a pre-determined distribution. As can be seen in Table 5, about half the questions (64 out of
120) were FACTOID, 17 were CAUSE and 16 were WHICH-IS-TRUE. There were between one and five
instances of each of the remaining types.</p>
      <p>Table 6 shows the proportion of correct answers and of NoA answers given by all systems for each
question type. Degree-of-truth questions seem to be the easiest type to answer, while composite and
hypothetical questions appear to be the most difficult. However, systems seem to be less
confident in answering method and opinion questions.</p>
    </sec>
    <sec id="sec-10">
      <title>4.3 Tools and Infrastructure</title>
      <p>This year too, CELCT developed infrastructure to help manage the QA4MRE exercise.
Many processes and requirements had to be dealt with:
- the need to develop a proper and coherent tool for the management of the data produced during the
campaign, to store it and to make it re-usable, as well as to facilitate the analysis and comparison of
results
- the necessity of assisting the different organizing groups in the various tasks of data set creation and
of facilitating the process of collection and translation of questions
- the possibility for the participants to directly access the data, submit their own runs (this also implied
some syntax checks of the format) and, later, view the results and statistics in detail.</p>
      <p>A series of automatic web interfaces were specifically designed for each of these purposes, with the aim of
facilitating the data processing and, at the same time, showing users only what they needed for the task they
had to accomplish. The main characteristic of these interfaces is therefore their flexibility, specifically
centred on the user’s requirements.</p>
      <p>While designing the interfaces for question collection and translation, one of the first issues to be dealt
with was the combination of many assessors, a large amount of data, and a long process. The tools therefore had to ensure
efficient and consistent management of the data, allowing:</p>
      <p>Editing of the data already entered, at any time.</p>
      <p>Revision of the data by the users themselves.</p>
      <p>Consistency propagation ensuring that modifications automatically re-model the output in which they are
involved.</p>
      <p>Statistics and evaluation measures are calculated and updated in real time.</p>
      <p>In particular, ensuring the consistency of data is a key feature in data management. For example, if a typo is
corrected in the Translation Interface, the modification is automatically propagated to the GoldStandard files,
the Test Set files, etc.</p>
    </sec>
    <sec id="sec-11">
      <title>5. EVALUATION</title>
      <p>Participating systems could give one of two possible responses for each question in the test collection:</p>
      <p>• to give one answer selected from the five candidate answers of the question;
• not to answer the question, if the system considered that it did not have enough evidence for selecting one
of the candidate answers as the correct one. This option is called a NoA answer. In order to evaluate the
system's ability to validate its answers, the system could in this case still return the candidate answer that it would
have selected had it been required to answer the question.</p>
      <p>Taking these two possible responses into consideration, each question receives one (and only one) of the
following three assessments:</p>
      <p>• correct, if the system selected the correct answer among the five candidates for the given question;
• incorrect, if the system selected one of the wrong answers;
• NoA, if the system chose not to answer the question.</p>
      <p>The evaluation of the output given by participating systems was performed automatically by comparing the
answers of systems against the gold standard collection with human-made annotations. No manual assessment
was required.</p>
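      <p>The automatic assessment described above can be illustrated with a short sketch. It is not the campaign's evaluation software: the function name assess, the dictionary layout of the gold standard, and the (answered, candidate) response pairs are assumptions made only for this example.</p>
      <preformat>
```python
from typing import Optional


def assess(gold_answer: int, answered: bool, candidate: Optional[int]) -> str:
    """Return 'correct', 'incorrect' or 'NoA' for a single question.

    gold_answer: index (1-5) of the correct candidate in the gold standard.
    answered:    False when the system chose the NoA option.
    candidate:   the candidate the system selected (may be None for NoA).
    """
    if not answered:
        # A NoA response is never scored as right or wrong, even if a
        # candidate answer was attached to it.
        return "NoA"
    return "correct" if candidate == gold_answer else "incorrect"


# Hypothetical gold standard and system responses for three questions.
gold = {"q1": 3, "q2": 1, "q3": 5}
responses = {"q1": (True, 3), "q2": (True, 4), "q3": (False, 5)}

assessments = {q: assess(gold[q], *responses[q]) for q in gold}
```
      </preformat>
      <p>In this sketch each question receives exactly one of the three labels, and no manual judgement is involved.</p>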
      <p>The task developed this year allowed us to evaluate systems from two different perspectives:
1. A question-answering evaluation, like the traditional evaluation performed in past campaigns. In this
evaluation, we simply counted answers without grouping them.
2. A reading-test evaluation, which yields figures for each individual
reading test and for the tests grouped by topic.</p>
    </sec>
    <sec id="sec-12">
      <title>5.1 Evaluation Measure</title>
      <p>
        The purpose of allowing NoA answers is to reduce the number of incorrect responses, while keeping the number
of correct ones, by leaving some questions unanswered. The main evaluation measure for this year's campaign
was c@1, which takes into account the option of not answering certain questions. c@1 was first
introduced in ResPubliQA 2009 [8] and is fully described in
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref17 ref18 ref8">(Peñas and Rodrigo, 2011)</xref>
        . The formulation of c@1
is given in (1).
      </p>
      <p>$$\mathrm{c@1} = \frac{1}{n}\left(n_R + n_U\,\frac{n_R}{n}\right) \qquad (1)$$</p>
      <p>where
• $n_R$: number of questions correctly answered.
• $n_U$: number of questions unanswered.
• $n$: total number of questions.</p>
      <p>c@1 acknowledges the option of giving NoA answers in the proportion that a system answers questions
correctly, which is measured using accuracy. Thus, a higher accuracy over answered questions would give more
value to unanswered questions, and therefore, a higher final c@1 value.</p>
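      <p>As a worked illustration, c@1 can be computed directly from the three counts defined above (nR, nU and n). The sketch below assumes the formulation c@1 = (nR + nU · nR / n) / n; the function name c_at_1 and the example counts are hypothetical and are not taken from the campaign results.</p>
      <preformat>
```python
def c_at_1(n_r: int, n_u: int, n: int) -> float:
    """c@1 = (n_R + n_U * n_R / n) / n.

    n_r: number of questions correctly answered
    n_u: number of questions left unanswered (NoA)
    n:   total number of questions
    """
    if n == 0:
        raise ValueError("n must be positive")
    # Each unanswered question is credited with the system's overall
    # proportion of correct answers, n_r / n.
    return (n_r + n_u * (n_r / n)) / n


# Hypothetical system A: answers all 120 questions, 60 of them correctly.
all_answered = c_at_1(n_r=60, n_u=0, n=120)

# Hypothetical system B: same 60 correct answers, but withholds 20 of the
# answers that would otherwise have been wrong.
some_unanswered = c_at_1(n_r=60, n_u=20, n=120)
```
      </preformat>
      <p>With no unanswered questions the value reduces to nR / n, while the second hypothetical system is rewarded for leaving questions unanswered instead of answering them incorrectly.</p>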
      <p>As a secondary measure, we also provided scores according to accuracy (2), the traditional measure applied
to QA evaluations, which does not distinguish between answered and unanswered questions. We also used the
candidate answers given to unanswered questions to obtain accuracy values.</p>
      <p>$$\mathrm{accuracy} = \frac{n_R + n_{UR}}{n} \qquad (2)$$</p>
      <p>where
• $n_R$: number of questions correctly answered.
• $n_{UR}$: number of unanswered questions whose candidate answer was correct.
• $n$: total number of questions.</p>
    </sec>
    <sec id="sec-13">
      <title>5.2 Question Answering perspective evaluation</title>
      <p>A question-answering evaluation has been performed over the whole test collection. This evaluation measures
the overall performance of a system, without analyzing its behaviour on any particular reading test. The
information taken into account for each system at this level is:
• number of questions ANSWERED
o number of questions ANSWERED with RIGHT answer
o number of questions ANSWERED with WRONG answer
• number of questions UNANSWERED
o number of questions UNANSWERED with RIGHT candidate answer
o number of questions UNANSWERED with WRONG candidate answer
o number of questions UNANSWERED with EMPTY candidate answer
In more detail, the evaluation at this level includes:</p>
      <p>• Overall c@1 (over the 120 questions of the test collection)
• c@1 per topic (over the 40 questions of each topic)
• Overall accuracy (over the 120 questions of the test collection, considering also the candidate answers
given to unanswered questions)
• Proportion of answers correctly discarded (see (3))</p>
      <p>$$\mathrm{correctly\_discarded} = \frac{n_{UW} + n_{UE}}{n_{UR} + n_{UW} + n_{UE}} \qquad (3)$$</p>
      <p>where:
• $n_{UR}$: number of unanswered questions whose candidate answer was correct
• $n_{UW}$: number of unanswered questions whose candidate answer was incorrect
• $n_{UE}$: number of unanswered questions whose candidate answer was empty</p>
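      <p>The secondary measures at this level can be derived from the same per-system counts. The following sketch assumes accuracy = (nR + nUR) / n and correctly_discarded = (nUW + nUE) / (nUR + nUW + nUE). The class name Counts, the field n_w (answered with a wrong answer, not named in the text), the example figures, and the choice to return None when nothing was left unanswered are all assumptions of this example.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Counts:
    n_r: int   # ANSWERED with RIGHT answer
    n_w: int   # ANSWERED with WRONG answer (hypothetical field name)
    n_ur: int  # UNANSWERED with RIGHT candidate answer
    n_uw: int  # UNANSWERED with WRONG candidate answer
    n_ue: int  # UNANSWERED with EMPTY candidate answer

    @property
    def n(self) -> int:
        """Total number of questions."""
        return self.n_r + self.n_w + self.n_ur + self.n_uw + self.n_ue


def accuracy(c: Counts) -> float:
    """Accuracy that also credits correct candidates of unanswered questions."""
    return (c.n_r + c.n_ur) / c.n


def correctly_discarded(c: Counts) -> Optional[float]:
    """Share of unanswered questions whose candidate was wrong or empty."""
    unanswered = c.n_ur + c.n_uw + c.n_ue
    if unanswered == 0:
        return None  # undefined when every question was answered
    return (c.n_uw + c.n_ue) / unanswered


# Hypothetical run over 120 questions.
run = Counts(n_r=50, n_w=40, n_ur=6, n_uw=20, n_ue=4)
run_accuracy = accuracy(run)
run_discarded = correctly_discarded(run)
```
      </preformat>
      <p>A high proportion of correctly discarded answers indicates that the system tends to withhold answers that would indeed have been wrong.</p>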
    </sec>
    <sec id="sec-14">
      <title>5.3 Reading perspective evaluation</title>
      <p>The objective of the reading-test evaluation is to offer information about a system's performance in
“understanding” the meaning of each single document. This understanding is evaluated by means of
multiple-choice tests consisting of ten questions per document.</p>
      <p>This evaluation is performed taking as reference the c@1 values achieved for each test (one document with
ten questions about it). Then, the c@1 values were aggregated at topic and global levels:</p>
      <p>The median c@1 has been provided because it can be more informative at reading-test level
than the average. The median is less affected by outliers than the average, and therefore offers
more information about the ability of a system to understand a text. For example, if we have three high c@1
values in a topic, but the last one is very low, the median is barely affected by this low result (because it is an
isolated result in comparison with the other three), while the average accounts for this bad behaviour.</p>
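      <p>The contrast between median and average can be seen in a small numeric sketch. The four c@1 values below are invented for illustration (three high results and one isolated low one within a topic) and do not correspond to any submitted run.</p>
      <preformat>
```python
from statistics import mean, median

# Hypothetical c@1 values for the four reading tests of one topic.
c_at_1_per_test = [0.60, 0.70, 0.65, 0.10]

topic_median = median(c_at_1_per_test)  # middle of the sorted values
topic_mean = mean(c_at_1_per_test)      # pulled down by the outlier
```
      </preformat>
      <p>Here the median stays above 0.6, close to the three high results, whereas the mean falls to roughly 0.51 because of the single low test.</p>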
    </sec>
    <sec id="sec-15">
      <title>5.4 Random Baselines</title>
      <p>In order to offer some baselines for this task, it must be considered that participating systems can decide whether to
answer a given question or not. We therefore first propose a random baseline in which all the
questions are answered. This baseline has five options when trying to answer a question: it can select the
correct answer, or it can select one of the four incorrect answers. In this case, the expected overall result is
0.2 (both for accuracy and for c@1).</p>
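      <p>The 0.2 figure can be checked with a small simulation. Because this baseline answers every question, nU = 0 and c@1 coincides with accuracy (nR / n), so a single score is simulated. The function name, the number of trials and the fixed seed are assumptions of this sketch.</p>
      <preformat>
```python
import random


def random_baseline_score(n_questions: int = 120, n_candidates: int = 5,
                          trials: int = 2000, seed: int = 0) -> float:
    """Average score of a baseline that picks one candidate uniformly at
    random for every question and never uses the NoA option."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Without loss of generality, candidate 0 is the correct answer.
        n_r = sum(1 for _ in range(n_questions)
                  if rng.randrange(n_candidates) == 0)
        # With no unanswered questions, c@1 equals accuracy: n_R / n.
        total += n_r / n_questions
    return total / trials


baseline = random_baseline_score()
```
      </preformat>
      <p>Averaged over many simulated test collections, the score converges to 1/5, the expected value stated above.</p>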
    </sec>
    <sec id="sec-16">
      <title>6. PARTICIPATION and RESULTS</title>
      <p>Out of the 25 groups which had previously registered and signed the license agreement to download the
background collections, a total of 12 groups participated in the QA4MRE tasks submitting 62 runs in 3 different
languages (German, English, and Romanian). Table 7 shows the runs submitted in each language. No runs were
submitted in Italian or, quite surprisingly, in Spanish (usually the second most chosen language). All
runs were monolingual; no team attempted a cross-language task. This was probably because crossing
the language boundary is currently not core to the task, even though multilinguality is directly addressed through
the provision of collections and tests in five languages.</p>
      <p>Participants were allowed to submit a maximum of 10 runs. The first run was to be produced using nothing
more than the knowledge provided in the background collections. Additional runs could include other sources of
information, e.g. ontologies, rule bases, web, Wikipedia, etc., or other types of inferences. All resources used to
acquire the knowledge were to be listed in the submission file.</p>
      <p>Besides specifying the resources used, systems were also required to list the document(s) and sentence(s) that
helped them (directly or indirectly) to identify the correct answer. Such provenance was not used for formal
evaluation, but for informal analysis and discussion.</p>
      <p>[Table 7: runs submitted in each language. Header: Source languages (questions)]</p>
      <p>As usual, the vast majority of the runs were in English, as Table 7 shows. The list of participating teams and the
references to their reports are shown in Table 8. Besides Europe, participants also came from the USA, China and
India.</p>
      <p>Concerning the overall difficulty of the exercise, Topic 3 was the easiest and Topic 1 was the hardest, but the
range of difficulty is not large, as Table 10 demonstrates. So, the three topics look fairly balanced. Also, average
performance does not greatly exceed the random baseline (0.2).</p>
      <p>The following three tables (14, 15 and 16) show the best run for each participating group, reporting the mean of the
tests for each topic. Except for one case, the overall mean is higher than the baseline.</p>
      <p>Table 14: Results for English (columns: Overall, Topic 1, Topic 2, Topic 3)</p>
      <p>As for system performance at the question-answering evaluation level, we can see that only one team
(jucs) is above 50%, which leaves considerable room for improvement.</p>
      <p>From a reading-test perspective, in general no group passed the reading tests, and all systems seem to be very
close to random guessing. Overall results at reading-test level, i.e., median, average, and standard deviation for
all runs, are given in Appendix 1.</p>
      <p>System: combination, uhei1109dede, uhei1102dede, loga1101dede, loga1102dede, uhei1103dede, random baseline, uhei1106dede, uhei1104dede, uhei1108dede, uhei1105dede, uhei1107dede, uhei1101dede</p>
      <p>System: combination, uaic1107roro, uaic1101roro, uaic1109roro, uaic1103roro, uaic1104roro, uaic1106roro, random baseline, uaic1108roro, uaic1105roro, uaic1102roro</p>
      <p>A summary of the methods and techniques reported by participants is given in Tables 17, 18 and 19 in
Appendix 2.</p>
    </sec>
    <sec id="sec-17">
      <title>8. RELATED WORK</title>
      <p>The current state of development of NLP technologies offers a good opportunity for proposing an evaluation
of MR systems. The opportunity arises from the clear evolution of NLP systems towards a deeper level of text
analysis that allows a better understanding of documents. In fact, interest in MR among research
groups around the world has increased recently, as the creation of the MR program at DARPA<sup>1</sup> testifies. The large
community involved in Machine Reading is searching for a way to evaluate its systems, but the problem of how
to evaluate these machines is still an open research issue.</p>
      <p>Over the last few years, the QA Track at CLEF has changed its evaluation methodology in order to promote
deeper text understanding. Clearly, the task of retrieving just text excerpts (facts, sentences, paragraphs or
documents) is not enough to develop the technology. Besides QA, other evaluation activities that required
deeper analysis of texts were also carried out, for example Recognizing Textual Entailment (RTE), Answer
Validation (AV), and Knowledge Base Population (KBP).</p>
      <p>Question Answering: a system receives questions formulated in natural language and returns one or more exact
answers to these questions, possibly with the locations from which the answers were drawn as justification. The
evaluation of QA systems began at the Text Retrieval Conference (TREC)<sup>2</sup>, and was continued at the Cross
Language Evaluation Forum (CLEF)<sup>3</sup> in the EU, and at the NII-NACSIS Test Collection for IR Systems
(NTCIR)<sup>4</sup> in Japan. Most of the questions used in these evaluations ask about facts (e.g. Who is the president of
XYZ?) or definitions (e.g. What does XYZ mean?). Since systems could search for answers among several
documents (using IR engines), it was generally possible to find in some document a “system-friendly” statement
that contained exactly the answer information stated in an easily matched form. This made QA both shallow and
relatively easy.</p>
      <p>Recognizing Textual Entailment (RTE): a system must decide whether the meaning of a text (the Text T)
entails the meaning of another text (the Hypothesis H), that is, whether the meaning of the hypothesis can be inferred
from the meaning of the text [4]. RTE systems have been evaluated at the RTE Challenges, whose first
competition was proposed in 2005. The RTE Challenges encourage the development of systems that have to treat
different semantic phenomena.</p>
      <p>Answer Validation Exercise (AVE) [1, 2, 3]: a combination of QA and RTE evaluations. Answer Validation
(AV) is the task of deciding, given a question and an answer from a QA system, whether the answer is correct or
not. AVE was a task focused on the evaluation of AV systems, and it was defined as an RTE problem in order
to promote a deeper analysis in QA.</p>
      <p>Another application of RTE, similar to AVE, in the context of Information Extraction was performed in a pilot
task at RTE-6<sup>5</sup> with the aim of studying the impact of RTE systems on Knowledge Base Population (KBP)<sup>6</sup>.
The objective of this pilot task was to validate the output of participant systems at the KBP slot filling task that was
held at the Text Analysis Conference (TAC)<sup>7</sup>. Systems participating in the KBP slot filling task must
extract from documents some values for a set of attributes of a certain entity. Given the output of participant
systems at KBP, the RTE KBP validation pilot consists of deciding whether each of the values detected for an
entity is correct according to the supporting document. To take this decision, participant systems in the RTE
KBP validation pilot receive a set of T-H pairs, where the hypothesis is built by combining an entity, an attribute
and a value.</p>
      <p>Other efforts closer to our proposal for evaluating system understanding have taken place, such as the “ANLP/NAACL
2000 Workshop on Reading comprehension tests as evaluation for computer-based language understanding
systems”<sup>8</sup>. This workshop proposed to evaluate understanding systems by means of Reading Comprehension
(RC) tests. The evaluation consisted of a set of texts and a series of questions about each text. Quite
interestingly, most of the approaches presented at that workshop showed how to adapt QA systems to this kind
of evaluation.</p>
      <p>A more complete evaluation methodology for MR systems has been reported in [7], where the authors also proposed
to use RC tests. However, the objective of these tests was to extract correct answers from documents, which
is similar to QA without an IR engine.</p>
      <p>A natural step in this area is an evaluation methodology that requires a deeper level of inference and of
analysis of text.</p>
      <p>1 http://www.darpa.mil/ipto/programs/mr/mr.asp
2 http://trec.nist.gov/
3 http://www.clef-campaign.org/
4 http://research.nii.ac.jp/ntcir/
5 http://www.nist.gov/tac/2010/RTE/index.html
6 http://nlp.cs.qc.cuny.edu/kbp/2010/
7 http://www.nist.gov/tac/2010/
8 http://www.aclweb.org/anthology/W/W00/#0600</p>
    </sec>
    <sec id="sec-18">
      <title>9. CONCLUSIONS</title>
      <p>This year, the QA @ CLEF task was characterized by a major innovation, namely the transition from the
traditional Question Answering (QA) task, proposed in the last eight QA challenges at CLEF, to a new
evaluation focused on the reading of a single document. The main reason behind this choice was the feeling that
most systems were ready to make a definitive move towards a deeper understanding of the text. Over the years,
the QA challenges adopted simple questions which required almost no inference to find the correct answers.
These surface-level evaluations have promoted QA architectures based on Information Retrieval (IR) techniques,
in which the final answer(s) is/are obtained after focusing on selected portions of retrieved documents and
matching sentence fragments or sentence parse trees. No real understanding of documents was performed, since
none was required by the evaluation. Machine Reading (MR), instead, requires the automatic understanding of
texts at a deeper level, so this methodology encourages the development of systems able to perform a deep
analysis of the text.</p>
      <p>One way of evaluating the understanding of a text is to assess the ability to answer a set of questions
about it. In particular, reading comprehension tests are designed to measure how well human readers understand
what they read. Each text comes with a set of questions about information that is stated or implied in the text.</p>
      <p>The objectives of the task are twofold: (i) to propose a task where a deeper level of understanding is
required, and (ii) to extract the knowledge contained in texts as a way to improve the performance of systems where
some kind of reasoning is required. Hence, the development of MR technologies should be fostered and the
number of groups interested in the task should increase. This is also an opportunity to create a common
framework and community in the field of text understanding.</p>
    </sec>
    <sec id="sec-19">
      <title>ACKNOWLEDGMENTS</title>
      <p>Special thanks are also due to Giovanni Moretti (CELCT, Trento, Italy) for technical support in the
management of all the data of the campaign.</p>
      <p>This work has been partially supported by the Research Network MA2VICMR (S2009/TIC-1542) and
Holopedia project (TIN2010-21128-C02).</p>
    </sec>
    <sec id="sec-20">
      <title>APPENDIX 1: Overall results at reading test level: Median, Average, and Standard Deviation for all runs</title>
      <p>[Table columns: RUN_NAME; Overall Median, Overall Average, Overall Standard Deviation; Median, Average and Standard Deviation for each of Topic 1, Topic 2 and Topic 3]</p>
      <p>RUN_NAME: base1101enen, diue1101enen, diue1102enen, fdcs1102enen, fdcs1103enen, ifln1101enen, ifln1102enen, ifln1104enen, ifln1105enen, iles1101enen, iles1102enen, iles1103enen, iles1104enen, iles1105enen, iles1106enen, iles1107enen, iles1108enen, iles1109enen, iles1110enen, jucs1101enen, jucs1102enen, jucs1103enen, jucs1104enen, jucs1105enen, jucs1106enen, jucs1107enen, loga1101dede, loga1102dede, swai1101enen, swai1102enen, swai1103enen, swai1104enen, swai1105enen</p>
      <p>[Tables on the methods and techniques reported by participants. Rows: base, diue, fdcs, ifln, iles, jucs, loga, swai, uaic, uhei, uned, vens]</p>
      <sec id="sec-20-1">
        <title>Tools</title>
        <p>Column labels: factoid extract., Stem, System name</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>Anselmo Peñas</string-name>,
          <string-name>Álvaro Rodrigo</string-name>,
          <string-name>Felisa Verdejo</string-name>.
          <article-title>Overview of the Answer Validation Exercise 2007</article-title>.
          In C. Peters, V. Jijkoun, Th. Mandl, H. Müller, D.W. Oard, A. Peñas, V. Petras, and D. Santos (Eds.):
          <source>Advances in Multilingual and Multimodal Information Retrieval</source>.
          LNCS 5152, September
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>Anselmo Peñas</string-name>,
          <string-name>Álvaro Rodrigo</string-name>,
          <string-name>Valentín Sama</string-name>,
          <string-name>Felisa Verdejo</string-name>.
          <article-title>Overview of the Answer Validation Exercise 2006</article-title>.
          In C. Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, M. Stempfhuber (Eds.):
          <source>Evaluation of Multilingual and Multi-modal Information Retrieval. 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006</source>,
          Alicante, Spain, September 20-22,
          <year>2006</year>.
          Revised Selected Papers.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>Álvaro Rodrigo</string-name>,
          <string-name>Anselmo Peñas</string-name>,
          <string-name>Felisa Verdejo</string-name>.
          <article-title>Overview of the Answer Validation Exercise 2008</article-title>.
          In C. Peters, Th. Mandl, V. Petras, A. Peñas, H. Müller, D. Oard, V. Jijkoun, D. Santos (Eds.):
          <source>Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008</source>,
          Aarhus, Denmark, September 17-19,
          <year>2008</year>.
          Revised Selected Papers.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>Ido Dagan</string-name>,
          <string-name>Oren Glickman</string-name>, and
          <string-name>Bernardo Magnini</string-name>.
          <article-title>The PASCAL Recognising Textual Entailment Challenge</article-title>.
          In
          <source>Lecture Notes in Computer Science</source>,
          volume
          <volume>3944</volume>,
          pages
          <fpage>177</fpage>-<lpage>190</lpage>.
          Springer,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>Oren Etzioni</string-name>,
          <string-name>Michele Banko</string-name>, and
          <string-name>Michael J. Cafarella</string-name>.
          <article-title>Machine reading</article-title>.
          In
          <source>Proceedings of the 21st National Conference on Artificial Intelligence</source>,
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>Ellen M. Voorhees</string-name>
          and
          <string-name>Dawn M. Tice</string-name>.
          <article-title>The TREC-8 Question Answering Track Evaluation</article-title>.
          In
          <source>Text Retrieval Conference TREC-8</source>,
          pages
          <fpage>83</fpage>-<lpage>105</lpage>,
          <year>1999</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>B. Wellner</string-name>,
          <string-name>L. Ferro</string-name>,
          <string-name>W. Greiff</string-name>
          and
          <string-name>L. Hirschman</string-name>.
          <article-title>Reading Comprehension Tests for Computer-based Understanding Evaluation</article-title>.
          <source>Nat. Lang. Eng.</source>
          <volume>12</volume>(<issue>4</issue>),
          <fpage>305</fpage>-<lpage>334</lpage>,
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9.
          <string-name>Anselmo Peñas</string-name>
          and
          <string-name>Alvaro Rodrigo</string-name>.
          <article-title>A Simple Measure to Assess Non-response</article-title>.
          In
          <source>Proceedings of 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT 2011)</source>.
          Portland, Oregon, USA, June 19-24,
          <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>David Ferrucci</string-name>,
          Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty.
          <year>2010</year>.
          <article-title>Building Watson: An Overview of the DeepQA Project</article-title>.
          <source>AI Magazine</source>,
          <volume>31</volume>(<issue>3</issue>).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Roser</given-names>
            <surname>Morante</surname>
          </string-name>
          and
          <string-name>
            <given-names>Walter</given-names>
            <surname>Daelemans</surname>
          </string-name>
          .
          <article-title>Annotating Modality and Negation for a Machine Reading Evaluation</article-title>
          .
          <article-title>CLEF 2011 Labs</article-title>
          and Workshop - Notebook Papers.
          <fpage>19</fpage>
          -22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          12.
          <string-name>
            <surname>Juan</surname>
            Martinez-Romo and
            <given-names>Lourdes</given-names>
          </string-name>
          <string-name>
            <surname>Araujo</surname>
          </string-name>
          .
          <article-title>Graph-based Word Clustering Applied to Question Answering and Reading Comprehension Tests</article-title>
          .
          <article-title>CLEF 2011 Labs</article-title>
          and Workshop - Notebook Papers.
          <fpage>19</fpage>
          -22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          13.
          <string-name>
            <surname>Adrian</surname>
            <given-names>Iftene</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexandru-Lucian</surname>
            <given-names>Gînscă</given-names>
          </string-name>
          , Alex Moruz, Diana Trandabăt,
          <string-name>
            <given-names>Maria</given-names>
            <surname>Husarciuc</surname>
          </string-name>
          .
          <source>Question Answering for Machine Reading Evaluation on Romanian and English. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September</source>
          , Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Gaurav</given-names>
            <surname>Arora</surname>
          </string-name>
          .
          <article-title>Cosine similarity as Machine Reading Technique</article-title>
          .
          <source>Question Answering for Machine Reading Evaluation on Romanian and English. CLEF 2011 Labs and Workshop - Notebook Papers. 19-22 September</source>
          , Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Suzan</given-names>
            <surname>Verberne</surname>
          </string-name>
          .
          <article-title>Retrieval-based Question Answering for Machine Reading Evaluation</article-title>
          .
          <source>CLEF 2011 Labs and Workshop - Notebook Papers</source>
          . 19-22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Partha</given-names>
            <surname>Pakray</surname>
          </string-name>
          , Pinaki Bhaskar, Somnath Banerjee, Bidhan Chandra Pal, Alexander Gelbukh and
          <string-name>
            <given-names>Sivaji</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          .
          <article-title>JU_CSE_TE: System Description QA4MRE@CLEF 2011</article-title>
          .
          <source>CLEF 2011 Labs and Workshop - Notebook Papers</source>
          . 19-22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Svitlana</given-names>
            <surname>Babych</surname>
          </string-name>
          , Alexander Henn, Jan Pawellek, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Padó</surname>
          </string-name>
          .
          <article-title>Dependency-Based Answer Validation for German</article-title>
          .
          <source>CLEF 2011 Labs and Workshop - Notebook Papers</source>
          . 19-22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Ling</given-names>
            <surname>Cao</surname>
          </string-name>
          , Xipeng Qiu and
          <string-name>
            <given-names>Xuanjing</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Question Answering for Machine Reading with Lexical Chain</article-title>
          .
          <source>CLEF 2011 Labs and Workshop - Notebook Papers</source>
          . 19-22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Ingo</given-names>
            <surname>Glöckner</surname>
          </string-name>
          , Björn Pelzer, and
          <string-name>
            <given-names>Tiansi</given-names>
            <surname>Dong</surname>
          </string-name>
          .
          <article-title>The LogAnswer Project at QA4MRE 2011</article-title>
          .
          <source>CLEF 2011 Labs and Workshop - Notebook Papers</source>
          . 19-22 September, Amsterdam, The Netherlands.
          <source>Online Proceedings</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>