<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OVERVIEW OF THE CLEF 2008 MULTILINGUAL QUESTION ANSWERING TRACK</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pamela Forner</string-name>
          <email>forner@celct.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anselmo Peñas</string-name>
          <email>anselmo@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eneko Agirre</string-name>
          <email>e.agirre@ehu.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iñaki Alegria</string-name>
          <email>i.alegria@ehu.es</email>
          <xref ref-type="aff" rid="aff10">10</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corina Forăscu</string-name>
          <xref ref-type="aff" rid="aff9">9</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Moreau</string-name>
          <email>moreau@elda.org</email>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petya Osenova</string-name>
          <email>petya@bultreebank.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prokopis Prokopidis</string-name>
          <email>prokopis@ilsp.gr</email>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo Rocha</string-name>
          <email>Paulo.Rocha@di.uminho.pt</email>
          <xref ref-type="aff" rid="aff8">8</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Sacaleanu</string-name>
          <email>bogdan@dfki.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Sutcliffe</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Tjong Kim Sang</string-name>
          <email>e.f.tjong.kim.sang@rug.nl</email>
          <xref ref-type="aff" rid="aff11">11</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BTB</institution>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CELCT</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Computer Science Department, University of Basque Country</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>DFKI</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>DLTG, University of Limerick</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Departamento de Lenguajes y Sistemas Informáticos, UNED</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>ELDA/ELRA</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>ILSP Greece, Athena Research Center</institution>
        </aff>
        <aff id="aff8">
          <label>8</label>
          <institution>Linguateca</institution>
          ,
          <addr-line>DEI UC</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff9">
          <label>9</label>
          <institution>UAIC and RACAI</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff10">
          <label>10</label>
          <institution>University of Basque Country</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff11">
          <label>11</label>
          <institution>University of Groningen</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>39</fpage>
      <lpage>77</lpage>
      <abstract>
        <p>The QA campaign at CLEF [1] was mainly the same as that proposed last year. The results and the analyses reported by last year's participants suggested that the changes introduced in the previous campaign had led to a drop in systems' performance. So for this year's competition it was decided to practically replicate last year's exercise. Following last year's experience, some QA pairs were grouped in clusters. Every cluster was characterized by a topic (not given to participants). The questions from a cluster contained co-references between one of them and the others. Moreover, as last year, the systems were given the possibility to search for answers in Wikipedia as a document corpus besides the usual newswire collection. In addition to the main task, three additional exercises were offered, namely the Answer Validation Exercise (AVE), the Question Answering on Speech Transcriptions (QAST), which continued last year's successful pilot, and Word Sense Disambiguation for Question Answering (QA-WSD). As a general remark, it must be said that the task still proved to be very challenging for participating systems. In comparison with last year's results, the Best Overall Accuracy dropped significantly from 41.75% to 19% in the multi-lingual subtasks, while it increased a little in the monolingual sub-tasks, going from 54% to 63.5%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>QA@CLEF 2008 was carried out in the spirit of the campaign as
consolidated in previous years. Besides the classical main task, three additional
exercises were proposed:
• the main task: several monolingual and cross-language sub-tasks were
offered; Bulgarian, Dutch, English, French, German, Italian, Portuguese, Romanian,
Greek, Basque and Spanish were proposed as both query and target languages.
• the Answer Validation Exercise (AVE) [2], in its third round, aimed at
evaluating answer validation systems based on textual entailment recognition. In
this task, systems were required to emulate human assessment of QA responses
and decide whether an Answer to a Question is correct or not according to a
given Text. Results were evaluated against the QA human assessments.
• the Question Answering on Speech Transcripts (QAST) task [3,14], which
continued last year’s successful pilot task and aimed at providing a framework in which
QA systems could be evaluated when the answers to factual and definition
questions must be extracted from spontaneous speech transcriptions.
• the Word Sense Disambiguation for Question Answering (QA-WSD) task [4], a
pilot task which provided the questions and collections with already
disambiguated word senses in order to study their contribution to QA performance.</p>
      <p>As far as the main task is concerned, following last year's experience, the
exercise consisted of topic-related questions, i.e. clusters of questions which were
related to the same topic and contained co-references between one question and the
others. The requirement for questions related to a topic necessarily implies that the
questions refer to common concepts and entities within the domain in question.
This is accomplished either by co-reference or by anaphoric reference to the topic,
implicitly or explicitly expressed in the first question or in its answer.</p>
      <p>Moreover, besides the usual news collections provided by ELRA/ELDA,
articles from Wikipedia were considered as an answer source. Some questions could
have answers only in one collection, i.e. either only in the news corpus or in
Wikipedia.</p>
      <p>As a general remark, this year we had the same number of participants as in
the 2007 campaign, but the number of submissions went up. Owing to the complexity of
the innovations introduced in 2007 (topics and anaphora, list
questions, the Wikipedia corpus), the questions became much more difficult and
the performance of systems dropped dramatically. As a result, groups were disinclined to
continue the following year (i.e. 2008), which reversed the positive trend in participation
registered in the previous campaigns.</p>
      <p>As reflected in the results, the task proved to be even more difficult than
expected. Results improved in the monolingual subtasks but are still very low in the
cross-lingual subtasks.</p>
      <p>This paper describes the preparation process and presents the results of the QA
track at CLEF 2008. In section 2, the tasks of the track are described in detail. The
results are reported in section 3. In section 4, some final analysis about this
campaign is given.</p>
    </sec>
    <sec id="sec-3">
      <title>2 Task Description</title>
      <p>As far as the main task is concerned, the consolidated procedure was followed,
capitalizing on the experience of the task proposed in 2007.</p>
      <p>The exercise consisted of topic-related questions, i.e. clusters of questions
which were related to the same topic and contained co-references between one
question and the others. Neither the question types (F, D, L) nor the topics were
given to the participants.</p>
      <p>The systems were fed with a set of 200 questions, which could concern facts or
events (Factoid questions), definitions of people, things or organisations
(Definition questions), or lists of people, objects or data (List questions), and were
asked to return up to three exact answers per question, where exact meant that
neither more nor less than the information required was given.</p>
      <p>The answer needed to be supported by the docid of the document in which the
exact answer was found, and by portion(s) of text which provided enough context
to support the correctness of the exact answer. Supporting texts could be taken
from different sections of the relevant documents, and could sum up to a
maximum of 700 bytes. There were no particular restrictions on the length of an
answer-string, but unnecessary pieces of information were penalized, since the
answer was then marked as ineXact. As in previous years, the exact answer could be
copied and pasted verbatim from the document, even if it was grammatically
incorrect (e.g. the inflectional case did not match the one required by the question).
However, systems were also allowed to use natural language generation in order to
correct morpho-syntactical inconsistencies (e.g., in German, changing dem
Presidenten into der President if the question implies that the answer is in
nominative case), and to introduce grammatical and lexical changes (e.g., QUESTION:
What nationality is X? TEXT: X is from the Netherlands EXACT ANSWER:
Dutch).</p>
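      <p>As an illustration of the 700-byte limit on supporting texts, the following minimal Python sketch adds up the encoded size of the snippets attached to one answer. The function names are hypothetical, and the choice of UTF-8 (the encoding declared in the sample output file) is an assumption rather than a rule taken from the track guidelines.</p>
      <preformat><![CDATA[
```python
MAX_SUPPORT_BYTES = 700  # limit stated in the task description


def support_size(snippets, encoding="utf-8"):
    """Total size in bytes of all supporting snippets for one answer.

    The encoding is an assumption; the limit is expressed in bytes, so
    accented characters may count for more than one byte each.
    """
    return sum(len(snippet.encode(encoding)) for snippet in snippets)


def within_support_limit(snippets, encoding="utf-8"):
    """True if the snippets together respect the 700-byte limit."""
    return support_size(snippets, encoding) <= MAX_SUPPORT_BYTES
```
]]></preformat>
      <p>Counting bytes rather than characters matters for most of the target languages, since non-ASCII letters take more than one byte in UTF-8.</p>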
      <p>The subtasks were both:
• monolingual, where the language of the question (Source language) and
the language of the news collection (Target language) were the same;
• cross-lingual, where the questions were formulated in a language
different from that of the news collection.</p>
      <p>Two new languages were added, i.e. Basque and Greek, both as source
and target languages. In total, eleven source languages were considered,
namely Basque, Bulgarian, Dutch, English, French, German, Greek, Italian,
Portuguese, Romanian and Spanish. All these languages were also considered
as target languages. The proposed subtasks were:
• 10 monolingual, i.e. Bulgarian (BG), German (DE), Greek (EL), Spanish
(ES), Basque (EU), French (FR), Italian (IT), Dutch (NL), Portuguese
(PT) and Romanian (RO);
• 33 cross-lingual (as customary in recent campaigns, in order to prepare
the cross-language subtasks for which at least one participant had
registered, some target language question sets were translated into the
combined source languages).</p>
      <p>However, as Table 2 shows, not all the proposed tasks were then carried out by
the participants.</p>
      <p>As is long established, the monolingual English (EN) task was not available, as it
seems to have been already thoroughly investigated in TREC campaigns. English
was still both source and target language in the cross-language tasks.</p>
      <sec id="sec-3-1">
        <title>2.1 Questions Grouped by Topic</title>
        <p>The procedure followed to prepare the test set was the same as that used in the
2007 campaign. First of all, each organizing group, responsible for a target
language, freely chose a number of topics. For each topic, one to four questions were
generated. Topics could be not only named entities or events, but also other
categories such as objects, natural phenomena, etc. (e.g. George W. Bush; Olympic
Games; notebooks; hurricanes; etc.). The set of ordered questions was related to
the topic as follows:
• the topic was named either in the first question or in the first answer
• the following questions could contain co-references to the topic expressed in
the first question/answer pair.</p>
        <p>Topics were not given in the test set, but could be inferred from the first
question/answer pair. For example, if the topic was George W. Bush, the cluster of
questions related to it could have been:
Q1: Who is George W. Bush?; Q2: When was he born?; Q3: Who is his wife?</p>
        <p>The requirement for questions related to the same topic necessarily implies that
the questions refer to common concepts and entities within the domain. The most
common form is pronominal anaphoric reference to the topic declared in the first
question, e.g.:
Q4: What is a polygraph?; Q5: When was it invented?</p>
        <p>However, other forms of co-reference occurred in the questions. Here is an
example:
Q6: Who wrote the song "Dancing Queen"?; Q7: How many people were in the
group?</p>
        <p>Here the group refers to an entity expressed not in the question but only in the
answer. However, the QA system does not know this and has to infer it, a task
which can be very complex, especially as the topic is not provided in the test set.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2 Document collections</title>
        <p>Besides the data collections composed of news articles provided by
ELRA/ELDA (see Table 3), Wikipedia was also considered.</p>
        <p>The Wikipedia pages in the target languages, as found in the version of
November 2006, could be used. Romanian had Wikipedia (see footnote 2) as its only document
collection, because there was no Romanian newswire corpus. The “snapshots” of
Wikipedia were made available for download in both XML and HTML versions.
The answers to the questions had to be taken from actual entries or articles of
Wikipedia pages. Other types of data such as images, discussions, categories,
templates, revision histories, as well as any files with user information and
meta-information pages, had to be excluded.</p>
        <p>One of the major reasons for using Wikipedia was to make a first step towards
web-formatted corpora in which to search for answers. In fact, as such large
information sources are now available on the web, this may be considered a desirable
next level in the evolution of QA systems. An important advantage of Wikipedia
is that it is freely available for all languages so far considered. However, the
variation in the size of Wikipedia, depending on the language, is still problematic.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3 Types of Questions</title>
        <p>As far as the question types are concerned, as in previous campaigns, the three
following categories were considered:</p>
        <p>1. Factoid questions, i.e. fact-based questions asking for the name of a person, a
location, the extent of something, the day on which something happened, etc. We
consider the following 8 answer types for factoids:
– PERSON, e.g.: Q8: Who was called the “Iron-Chancellor”? A8: Otto von
Bismarck.
– TIME, e.g.: Q9: What year was Martin Luther King murdered? A9: 1968.
– LOCATION, e.g.: Q10: Which town was Wolfgang Amadeus Mozart born
in? A10: Salzburg.
– ORGANIZATION, e.g.: Q11: What party does Tony Blair belong to?
A11: Labour Party.
– MEASURE, e.g.: Q12: How high is Kanchenjunga? A12: 8598m.
– COUNT, e.g.: Q13: How many people died during the Terror of Pol Pot?
A13: 1 million.
– OBJECT, e.g.: Q14: What does magma consist of? A14: Molten rock.
– OTHER, i.e. everything that does not fit into the other categories above,
e.g.: Q15: Which treaty was signed in 1979? A15: Israel-Egyptian peace
treaty.</p>
        <p>Footnote 2: http://static.wikipedia.org/downloads/November_2006/ro/</p>
        <sec id="sec-3-3-1">
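          <table-wrap>
            <label>Table 3</label>
            <table>
              <thead>
                <tr><th>TARGET LANG.</th><th>COLLECTION</th><th>PERIOD</th></tr>
              </thead>
              <tbody>
                <tr><td>[BG] Bulgarian</td><td>Sega</td><td>2002</td></tr>
                <tr><td/><td>Standart</td><td>2002</td></tr>
                <tr><td/><td>Novinar</td><td>2002</td></tr>
                <tr><td>[DE] German</td><td>Frankfurter Rundschau</td><td>1994</td></tr>
                <tr><td/><td>Der Spiegel</td><td>1994/1995</td></tr>
                <tr><td/><td>German SDA</td><td>1994</td></tr>
                <tr><td/><td>German SDA</td><td>1995</td></tr>
                <tr><td>[EL] Greek</td><td>The Southeast European Times</td><td>2002</td></tr>
                <tr><td>[EN] English</td><td>Los Angeles Times</td><td>1994</td></tr>
                <tr><td/><td>Glasgow Herald</td><td>1995</td></tr>
                <tr><td>[ES] Spanish</td><td>EFE</td><td>1994</td></tr>
                <tr><td/><td>EFE</td><td>1995</td></tr>
                <tr><td>[EU] Basque</td><td>Egunkaria</td><td>2001/2003</td></tr>
                <tr><td>[FR] French</td><td>Le Monde</td><td>1994</td></tr>
                <tr><td/><td>Le Monde</td><td>1995</td></tr>
                <tr><td/><td>French SDA</td><td>1994</td></tr>
                <tr><td/><td>French SDA</td><td>1995</td></tr>
                <tr><td>[IT] Italian</td><td>La Stampa</td><td>1994</td></tr>
                <tr><td/><td>Italian SDA</td><td>1994</td></tr>
                <tr><td/><td>Italian SDA</td><td>1995</td></tr>
                <tr><td>[NL] Dutch</td><td>NRC Handelsblad</td><td>1994/1995</td></tr>
                <tr><td/><td>Algemeen Dagblad</td><td>1994/1995</td></tr>
                <tr><td>[PT] Portuguese</td><td>Público</td><td>1994</td></tr>
                <tr><td/><td>Público</td><td>1995</td></tr>
                <tr><td/><td>Folha de São Paulo</td><td>1994</td></tr>
                <tr><td/><td>Folha de São Paulo</td><td>1995</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>SIZE: 120 MB (33,356 docs); 93 MB (35,839 docs); 320 MB (139,715 docs);
63 MB (13,979 docs); 144 MB (71,677 docs); 141 MB (69,438 docs);
425 MB (113,005 docs); 154 MB (56,472 docs); 509 MB (215,738 docs);
577 MB (238,307 docs)</p>
          <p>2. Definition questions, i.e. questions such as “What/Who is X?”, which are divided into
the following subtypes:</p>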
          <p>– PERSON, i.e., questions asking for the role/job/important information
about someone, e.g.: Q16: Who is Robert Altmann? A16: Film maker.
– ORGANIZATION, i.e., questions asking for the mission/full
name/important information about an organization, e.g.: Q17: What is the
Knesset? A17: Parliament of Israel.
– OBJECT, i.e., questions asking for the description/function of objects, e.g.:
Q18: What is Atlantis? A18: Space Shuttle.
– OTHER, i.e., questions asking for the description of natural phenomena,
technologies, legal procedures etc., e.g.: Q19: What is Eurovision? A19:
Song contest.</p>
          <p>3. Closed list questions, i.e., questions that require one answer containing a
determined number of items, e.g.: Q20: Name all the airports in London,
England. A20: Gatwick, Stansted, Heathrow, Luton and City.</p>
          <p>As only one answer was allowed, all the items had to be present in sequence in
the document and copied, one next to the other, in the answer slot.</p>
          <p>In addition, all types of questions could contain a temporal restriction, i.e. a
temporal specification that provided important information for the retrieval of the
correct answer, for example:</p>
          <p>Q21: Who was the Chancellor of Germany from 1974 to 1982?
A21: Helmut Schmidt.</p>
          <p>Q22: Which book was published by George Orwell in 1945?
A22: Animal Farm.</p>
          <p>Q23: Which organization did Shimon Perez chair after Isaac Rabin’s death?
A23: Labour Party Central Committee.</p>
          <p>Some questions could have no answer in the document collection, and in that
case the exact answer was "NIL" and the answer and support docid fields were left
empty. A question was assumed to have no right answer when neither human
assessors nor participating systems could find one.</p>
          <p>The distribution of the questions among these categories is described in Table 4.
Each question set was then translated into English, which served as the
pivot language during the translation of the datasets into the other languages for the
activated cross-lingual subtasks.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>2.4 Formats</title>
        <p>As far as the format is concerned, this year too both input and output files were
formatted as XML files. For example, the first four questions in the EN-FR test set, i.e.
English questions that hit a French document collection, were represented as
follows:</p>
        <preformat>&lt;input&gt;
&lt;q target_lang="FR" source_lang="EN" q_id="0001" q_group_id="1600"&gt;Which is the largest bird in Africa?&lt;/q&gt;
&lt;q target_lang="FR" source_lang="EN" q_id="0002" q_group_id="1600"&gt;How many species of ostriches are there?&lt;/q&gt;
&lt;q target_lang="FR" source_lang="EN" q_id="0003" q_group_id="1601"&gt;Who served as a UNICEF goodwill ambassador between 1988 and 1992?&lt;/q&gt;
&lt;q target_lang="FR" source_lang="EN" q_id="0004" q_group_id="1601"&gt;What languages did she speak?&lt;/q&gt;
...
&lt;/input&gt;</preformat>
        <p>An example of system output which answered the above questions was the
following:</p>
        <preformat>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE output SYSTEM "QA-CLEF-OUTPUT.dtd"&gt;
&lt;output&gt;</preformat>
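        <p>Since topics were not given explicitly, a system had to rebuild the question clusters from the q_group_id attribute of the input file. The following minimal Python sketch shows one way this could be done with the standard library; the helper name group_questions and the SAMPLE constant are hypothetical, and only the element and attribute names are taken from the input example above.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# The four EN-FR questions quoted in the text, as one well-formed document.
SAMPLE = """<input>
<q target_lang="FR" source_lang="EN" q_id="0001" q_group_id="1600">Which is the largest bird in Africa?</q>
<q target_lang="FR" source_lang="EN" q_id="0002" q_group_id="1600">How many species of ostriches are there?</q>
<q target_lang="FR" source_lang="EN" q_id="0003" q_group_id="1601">Who served as a UNICEF goodwill ambassador between 1988 and 1992?</q>
<q target_lang="FR" source_lang="EN" q_id="0004" q_group_id="1601">What languages did she speak?</q>
</input>"""


def group_questions(xml_text):
    """Map each q_group_id to its (q_id, question text) pairs, in file order."""
    root = ET.fromstring(xml_text)
    groups = OrderedDict()
    for q in root.iter("q"):
        text = (q.text or "").strip()
        groups.setdefault(q.get("q_group_id"), []).append((q.get("q_id"), text))
    return groups


if __name__ == "__main__":
    for group_id, questions in group_questions(SAMPLE).items():
        print(group_id, [q_id for q_id, _ in questions])
```
]]></preformat>
        <p>Keeping the questions of a cluster in file order matters, because the later questions (such as "What languages did she speak?") can only be resolved against the first question of the cluster or its answer.</p>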
      </sec>
      <sec id="sec-3-5">
        <title>2.5 Evaluation</title>
        <p>As far as the evaluation process is concerned, no changes were made with respect to
the previous campaigns. Human judges assessed the exact answer (i.e. the shortest
string of words which is supposed to provide the exact amount of information to
answer the question) as:
• R (Right) if correct;
• W (Wrong) if incorrect;
• X (ineXact) if it contained less or more information than that required by
the query;
• U (Unsupported) if either the docid was missing or wrong, or the
supporting snippet did not contain the exact answer.</p>
        <p>Most assessor-groups managed to guarantee a second judgement of all the runs.</p>
        <p>As regards the evaluation measures, the main one was accuracy, defined as the
average of SCORE(q) over all 200 questions q, where SCORE(q) is 1 if the first
answer to q in the submission file is assessed as R, and 0 otherwise.
In addition, most assessor groups computed the following measures:
• the Confidence Weighted Score (CWS). Answers are sorted in decreasing order of
confidence, and CWS rewards systems that give correct answers at the top
of the ranking [16];
• the Mean Reciprocal Rank (MRR) over N assessed answers per question
(to take the three answers into account), that is, the mean over all questions of the reciprocal of the
rank of the first correct answer. If the first correct answer is
ranked third, then the reciprocal rank (RR) is 1/3. If none of
the first N responses is correct, RR is 0. RR is 1 if the
highest ranked answer is correct.</p>
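        <p>The three measures can be sketched in a few lines of Python. This is a minimal illustration rather than the scoring script used by the assessors: the function names and the input layout (one list of R/W/X/U labels per question, in rank order) are assumptions, and the CWS formula is assumed to be the usual one, in which questions are sorted by decreasing confidence and the score averages, over each rank i, the fraction of correct first answers among the top i.</p>
        <preformat><![CDATA[
```python
def accuracy(judgements):
    """Share of questions whose first answer was judged R (Right).

    judgements: one list of labels ("R", "W", "X", "U") per question,
    in the order in which the system ranked its answers.
    """
    right_first = sum(1 for labels in judgements if labels and labels[0] == "R")
    return right_first / len(judgements)


def mean_reciprocal_rank(judgements, n=3):
    """Mean of 1/rank of the first R answer among the top n; 0 if none."""
    total = 0.0
    for labels in judgements:
        for rank, label in enumerate(labels[:n], start=1):
            if label == "R":
                total += 1.0 / rank
                break
    return total / len(judgements)


def confidence_weighted_score(first_answers):
    """CWS over (confidence, label) pairs, one per question (assumed formula).

    Questions are sorted by decreasing confidence; at each rank i the
    number of R answers seen so far is divided by i, and the values
    are averaged over all questions.
    """
    ranked = sorted(first_answers, key=lambda pair: pair[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for i, (_, label) in enumerate(ranked, start=1):
        if label == "R":
            correct_so_far += 1
        total += correct_so_far / i
    return total / len(ranked)
```
]]></preformat>
        <p>Only R counts as correct in this sketch; X (ineXact) and U (Unsupported) answers are treated like W, which matches the definition of accuracy given above.</p>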
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Results</title>
      <p>As far as accuracy is concerned, scores were generally far lower than usual, as
Figure 1 shows. Although comparison between different languages and years is
not possible, in Figure 1 we can observe some trends which characterized this
year’s competition: best accuracy in the monolingual task increased with respect
to last year, going up again to the values recorded in 2006. But systems, even
those that participated in all previous campaigns, did not achieve a brilliant
overall performance. Apparently systems could not cope well with the new
challenges, although they improved their performance when tackling issues already
treated in previous campaigns.</p>
      <p>In more detail, best accuracy in the monolingual task was 63.5%, almost ten
points up with respect to last year, while the overall performance of the
systems was quite low: average accuracy was 23.63%, practically the same as last
year. In contrast, performance in the cross-language tasks recorded a
drastic drop: best accuracy reached only 19% compared to 41.75% in the previous
year, more than 20 points lower, while average accuracy was
more or less the same as in 2007 (13.24% compared to 10.9%).</p>
      <p>[Figure 1: accuracy in the monolingual (Mono) and bilingual tasks, CLEF 03 to CLEF 08.]</p>
      <p>In contrast, best accuracy over the bilingual tasks decreased considerably.
This is also true for average performance. This year a small increase was
recorded in the bilingual tasks, but it seems that the high level of difficulty of the
question sets particularly affected the bilingual tasks, and the task proved to be
still difficult even for veterans.</p>
      <sec id="sec-4-1">
        <title>3.1 Participation</title>
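        <table-wrap>
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th/><th>America</th><th>Europe</th><th>Asia</th><th>Australia</th><th>TOTAL</th></tr>
            </thead>
            <tbody>
              <tr><td>CLEF 2003</td><td>3</td><td>5</td><td>0</td><td>0</td><td>8</td></tr>
              <tr><td>CLEF 2004</td><td>1</td><td>17</td><td>0</td><td>0</td><td>18</td></tr>
              <tr><td>CLEF 2005</td><td>1</td><td>22</td><td>1</td><td>0</td><td>24</td></tr>
              <tr><td>CLEF 2006</td><td>4</td><td>24</td><td>2</td><td>0</td><td>30</td></tr>
              <tr><td>CLEF 2007</td><td>3</td><td>16</td><td>1</td><td>1</td><td>21</td></tr>
              <tr><td>CLEF 2008</td><td>1</td><td>20</td><td>0</td><td>0</td><td>21</td></tr>
            </tbody>
          </table>
        </table-wrap>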
        <p>The number of participants remained almost the same as in 2007 (see Table
5). As noted, this is probably the consequence of the new challenges introduced
last year in the exercise.</p>
        <p>The geographical distribution also remained almost unchanged, even though
there was no participation from Australia and Asia. No runs were submitted
for either the Italian or the Greek tasks.</p>
        <p>However, the number of submitted runs increased from a total of 37 registered last
year to 51 (see Table 6). The breakdown of participants and runs, according to
language, is shown in Table 4 (Section 2.3). As in previous campaigns, more
participants chose the monolingual tasks, which once again proved to be more
approachable.</p>
        <p>The following subsections give a more detailed analysis of the results for each
language, with specific information on the performance of the
participating systems in the individual sub-tasks and on the different types of questions,
together with the relevant statistics and comments.</p>
        <p>3.2 Basque as Target</p>
        <p>In this first year with Basque as target, only one research group
submitted runs for evaluation in the track with Basque as target language: the Ixa
group from the University of the Basque Country. They sent four runs: one
monolingual, one English-Basque and two Spanish-Basque.</p>
        <p>The Basque question set consisted of 145 factoid questions, 39 definition
questions and 16 list questions. 39 questions contained a temporal restriction, and 10
had no answer in the Gold Standard. 40 answers were retrieved from Wikipedia,
the rest from the news collections. Half of the questions were linked to a
topic, so the second (and sometimes the third) question was more difficult to answer.</p>
        <p>The news articles came from the Egunkaria newspaper for the years 2000, 2001 and
2002, and the Wikipedia data came from the export corresponding to
the year 2006.</p>
        <p>Table 7 shows the evaluation results for the four submitted runs (one
monolingual and three cross-lingual). The table shows the number of Right, Wrong,
ineXact and Unsupported answers, as well as the percentage of correctly answered
Factoids, Temporally restricted questions, Definition and List questions.</p>
        <p>The monolingual run (ixag081eueu.xml) achieved an accuracy of 13%, lower than
most systems for other target languages in the 2007 evaluation but
better than some of them. It should be underlined that Basque is a highly inflected
language, which makes the matching of terms and entities more complex, and that this is the first
participation. The system achieved better accuracy on factoid questions (15.9%).
No correct answers were retrieved for list questions. It should also be noted that
57 answers were NIL (only four of them were correct); participants may be able to
improve this aspect.</p>
        <p>Turning to the cross-lingual runs, the loss of accuracy with respect to the
monolingual system is a bit more than 50% for the two best runs. This percentage is quite
similar to that of runs for other target languages in 2007. The overall accuracy is the
same for both (English and Spanish to Basque), but they agree on only five correct
answers (each system gives six other correct answers). The second system for
Spanish-Basque gets poorer results and is only slightly better in inexact answers.
These runs also return a lot of NIL answers.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.3 Bulgarian as Target</title>
        <p>This year, contrary to our optimistic expectations, only one run by one group
(BTB) was submitted for Bulgarian. As the table above shows, the result is far
from satisfactory. Again, definitions were detected better than other
question types. Also, the difference between the detection of factoids and of
temporally restricted questions is negligible. Compared with previous years, both
the number of participating groups and system performance decreased.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.4 Dutch as Target</title>
        <p>The questions for the Dutch subtask of CLEF-QA 2008 were written by four
native speakers. They selected random articles from either Wikipedia or the news
collection and composed questions based on the topics of the articles.</p>
        <p>The quartet produced a total of 222 question-answer pairs from which they
selected a set of 200 that satisfied the type distribution requirements of the task
organizers. An overview of the question types and answer types can be found in
Table 9.</p>
        <p>This year, only one team took part in the question answering task with Dutch as
target language: the University of Groningen. The team submitted two
monolingual runs and two cross-lingual runs (English to Dutch). All runs were assessed
twice by a single assessor. This resulted in a total of eight conflicts (1%). These
were corrected. The results of the assessment can be found in Table 10.</p>
        <p>The two cross-lingual runs gron081ennl and gron082ennl produced exactly the
same answers.</p>
        <p>The best monolingual run (gron082nlnl) achieved exactly the same score as the
best run of 2007 (25.5%). The same is true for the best cross-lingual run (13.5%).
That the two scores are in the same range as last year is no big surprise,
since the task has not changed considerably this year and all scores were
achieved by the same system.</p>
        <p>As in 2007, the system performed better for definition questions than for
other question types. The definition questions could be divided into two subtypes: those
that asked for a definition (26) and those that contained a definition and asked for
the name of the defined object (12). The monolingual runs performed similarly for
both subtypes but the cross-lingual runs did not contain a correct answer to any
question of the second subtype.</p>
        <p>None of the runs obtained any points for the list questions. The answers
contained some parts that were correct but none of them were completely correct. We
were unable to award points for partially correct answers in the current assessment
scheme.</p>
        <p>All the runs were produced by the same system and the differences between the
runs are small. The cross-lingual runs contained seven correct answers that were
not present in any of the monolingual runs (for questions 20, 25, 120, 131, 142,
150 and 200). Eight questions were only answered correctly in a single
monolingual run (1, 28, 54, 72, 83, 143, 193 and 199). Thirty-five questions were
answered correctly in two runs, three in three runs and seventeen in all four runs. 137
questions failed to receive any correct answer.</p>
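        <p>The counts above can be cross-checked with a short sketch. The variable names
are illustrative only; the figures are those reported in this section, under the
assumption that each of the 200 questions falls into exactly one category and that
the seven questions answered only by the two identical cross-lingual runs are counted
among those answered in two runs.</p>
        <preformat>
```python
# Number of questions answered correctly in exactly k of the four
# Dutch-target runs, as reported in the text (k = 0: no correct answer).
correct_in_k_runs = {0: 137, 1: 8, 2: 35, 3: 3, 4: 17}

# All categories together should cover the full test set.
total_questions = sum(correct_in_k_runs.values())

# Questions that received a correct answer from at least one run.
answered_by_some_run = total_questions - correct_in_k_runs[0]

print(total_questions, answered_by_some_run)
```
        </preformat>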
      </sec>
      <sec id="sec-4-4">
        <title>3.5 English as Target</title>
        <p>Creation of Questions. The task this year was exactly the same as in 2007,
and the three collections were also the same: Glasgow Herald, LA Times and
Wikipedia. However, given the considerable interest in Wikipedia shown by
Question Answering groups generally, it was decided to increase
the number of questions drawn from it to 75% overall, with just 25% coming from
the two newspaper collections. This means that 40 of the 160 Factoids came from
the newspapers, together with seven of the 30 Definitions and two of the ten Lists.
These questions were divided equally between the Glasgow Herald and LA Times.
All the remainder were drawn from Wikipedia.</p>
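        <p>As a quick check of the split described above, the following sketch recomputes
the share of newspaper questions from the per-type counts given in the text. The
dictionary layout and variable names are our own.</p>
        <preformat>
```python
# (total questions, questions drawn from the newspapers) per question type,
# taken from the paragraph above.
counts = {
    "factoid": (160, 40),
    "definition": (30, 7),
    "list": (10, 2),
}

total = sum(t for t, _ in counts.values())
news = sum(n for _, n in counts.values())

# Shares in percent; the remainder comes from Wikipedia.
news_share = 100.0 * news / total
wiki_share = 100.0 - news_share

print(total, news, round(news_share, 1), round(wiki_share, 1))
```
        </preformat>
        <p>The newspaper share comes out at 24.5%, which matches the stated target of
roughly 25%.</p>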
        <p>Considerable care was taken in the selection of the questions. The distribution
by answer type was controlled exactly as in previous years. As requested by the
organisers there were exactly twenty each of Factoid target type PERSON, TIME,
LOCATION, MEASURE, COUNT, ORGANIZATION, OBJECT and OTHER.
Similarly for Definitions there were eight PERSON, seven ORGANIZATION,
seven OBJECT and eight OTHER. For Lists there were four OTHER, two each of
PERSON and ORGANIZATION, and one each of LOCATION and OBJECT.</p>
        <p>In addition to the above distribution, we also controlled the distribution of
topics for the question groups, something which was made practicable by the use of
the Wikipedia. Questions were drawn from a number of predefined subject fields:
countries, towns, roads and bridges, shops, politicians and politics, sports and
sports people, foods and vegetables, cars, classical music including instruments,
popular music, literature, poetry and drama, philosophy, films, architecture,
languages, science, consumer goods, and finally organisations. Questions were
distributed among these topics. The maximum in any topic was twenty (sports) and the
minimum was two (shops). For the majority there were between four and six
question groups. For each such topic, one or more questions were set depending on
what information the texts contained. As a change from last year, the organisers
asked us to include 100 singleton topics. This effectively meant that half the
questions in the overall set of 200 were simple "one-off" queries as were set in CLEF
prior to 2007 and for the earlier TREC campaigns.</p>
        <p>Questions were entered via a web interface developed by the organisers last
year. However, this year they improved it considerably, for example allowing
modifications to be made to existing entries. This was a great help and a
commendable effort on their part.</p>
        <p>Summary Statistics. Five cross-lingual runs with English as target were
submitted this year, as compared with eight in 2007 and thirteen in 2006. Four groups
participated in three languages, Dutch, German and Romanian. Each group
worked with only one source language, and only DCUN submitted two runs. The
rest submitted only one run.</p>
        <p>Assessment Procedure. Last year we used the excellent Web-based
assessment system developed originally for the QiQA task by the University of Amsterdam.
However, we were asked not to use this in 2008 because it only allows one answer
per question per system to be assessed and it was required to assess multiple
answers per question per system. For this reason we used a Web-based tool
developed by UNED in Madrid.</p>
        <p>All answers were double-judged. Where the assessors differed, the case was
reviewed and a decision taken. There were 63 judgement differences in total. Three
of the runs contained multiple answers to individual questions in certain cases, and
these were all assessed, as per the requirement of the organisers. If we assume that
the number of judgements was in fact 200 questions * five runs, i.e. 1,000, we can
compute a lower bound for the agreement level. This gives a figure of
(1,000 - 63)/1,000, i.e. 93.7%. The equivalent figure for 2007 (called Agreement Level 2 in
the Working Notes for last year) was 97.6%. Given that we have computed a
lower bound this year (and not therefore the exact figure) this seems acceptable.</p>
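        <p>The lower-bound computation described above can be sketched as follows. The
function name is hypothetical; the inputs are the figures from the text. The result
is a lower bound because the true number of judgements exceeds 1,000 (some runs
returned several answers per question), which can only make the ratio of differences
smaller.</p>
        <preformat>
```python
def agreement_lower_bound(n_judgements, n_differences):
    """Percentage of judgements on which the two assessors agreed."""
    return 100.0 * (n_judgements - n_differences) / n_judgements


# 200 questions times five runs, assuming one judged answer per question.
n_judgements = 200 * 5
agreement = agreement_lower_bound(n_judgements, 63)

print(round(agreement, 1))
```
        </preformat>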
        <p>Results Analysis. Of the five runs with English as target, wlvs081roen was the
best with an accuracy of 19.00% overall. It also did very well on the
definitions, scoring 66.67%. The only source language for which there was more than
one run was German, for which there were three submissions from two groups.
dfki081deen scored the best with 14.00%, followed by dcun081deen with
8.00% and dcun082deen with 0.50%. dfki081deen also did very well on definitions, with
an accuracy of 60.00%. Interestingly, none of the systems answered any of the list
questions correctly. Only dcun082deen answered one list question inexactly.</p>
        <p>If we compare the results this year with those of last year, when the task was
very similar, performance has improved. The best score in 2007 was 14.00%
(wolv071roen), which has now improved to 19.00%.
Similarly, dfki071deen scored 7.00% in 2007 but increased this to 14.00% this
year in dfki081deen. An attempt was made to set easier questions this year, which
might have affected performance. In addition, many more questions came from
Wikipedia in 2008, with only a minority being drawn from the newspaper
corpora.</p>
      </sec>
      <sec id="sec-4-5">
        <title>3.6 QA-WSD subtask</title>
        <p>The QA-WSD task brings semantic and retrieval evaluation together. The
participants were offered the same queries and document collections as for the main
QA exercise, but with the addition of word sense tags as provided by two
automatic word sense disambiguation (WSD) systems. Contrary to the main QA task,
Wikipedia articles are not included, and thus systems need to reply to the questions
that have an answer in the news document collection. The goal of the task is to test
whether WSD can be used beneficially for Question Answering, and is closely
related to the Robust-WSD subtask of the ad-hoc track in CLEF 2008.</p>
        <p>The exercise scenario is event-targeted QA on a news document collection. In the
QA-WSD track only English monolingual and Spanish to English bilingual tasks
are offered, i.e. English is the only target language, and queries are available in
both English and Spanish. The queries were the same as for the main QA exercise,
and the participation followed the same process, except for the use of the
sense-annotated data.</p>
        <p>The goal of this task is to evaluate whether word sense information can help in
certain queries. For this reason, participants were required to send two runs for
each of the monolingual/bilingual tasks in which they participated: one which does
not use sense annotations and another one which does.
Whenever possible, the only difference between the two runs should be the
use or not of the sense information. Participants who sent a single run would be
discarded from the evaluation.</p>
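        <p>The WSD data is based on WordNet version 1.6 and was supplemented with
freely available data from the English and Spanish WordNets in order to test different
expansion strategies. Two leading WSD experts ran their systems [17][18] and
provided those WSD results for the participants to use.</p>
        <p>The task website [4] provides additional information on data formats and
resources.</p>
        <p>Results</p>
        <p>Of the 200 questions provided to participants, only 49 queries had a correct
answer in the news collection. The table below provides the results for the
participant on those 49 questions.</p>
        <table-wrap>
          <caption>
            <p>Results of the QA-WSD participant on the 49 questions with an answer in the news collection.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Run</th><th>R #</th><th>W #</th><th>X #</th><th>U #</th></tr>
            </thead>
            <tbody>
              <tr><td>First run (no WSD)</td><td>8</td><td>41</td><td>0</td><td>0</td></tr>
              <tr><td>Second run (WSD)</td><td>7</td><td>42</td><td>0</td><td>0</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The first run does not use WSD, while the second uses the sense tags returned
by the NUS WSD system. The WSD tags were used in the passage retrieval
module. The use of WSD does not provide any improvement, and causes one
more error. For the sake of completeness we also include below the results on all
200 queries. Surprisingly, the participant managed to find two (one in the WSD
run) correct answers for the Wikipedia questions in the news collection.</p>
        <table-wrap>
          <caption>
            <p>Results of the QA-WSD participant on all 200 questions.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Run</th><th>R #</th><th>W #</th><th>X #</th><th>U #</th></tr>
            </thead>
            <tbody>
              <tr><td>First run (no WSD)</td><td>10</td><td>188</td><td>0</td><td>2</td></tr>
              <tr><td>Second run (WSD)</td><td>8</td><td>189</td><td>0</td><td>3</td></tr>
            </tbody>
          </table>
        </table-wrap>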
      </sec>
      <sec id="sec-4-6">
        <title>3.7 French as Target</title>
        <p>This year only one group took part in the evaluation tasks using French as a
target language: the French group Synapse Développement. Last year’s second
participant, the Language Computer Corporation (LCC, USA), did not send any
submission this time.</p>
        <p>Synapse submitted three runs in total:
• one monolingual run: French to French (FR-to-FR),
• two bilingual runs: English-to-French (EN-to-FR) and Portuguese-to-French (PT-to-FR).</p>
        <p>In the following, these will be referred to as:
• syn08frfr (for FR-to-FR),
• syn08enfr (for EN-to-FR),
• syn08ptfr (for PT-to-FR).</p>
        <p>As last year, three types of questions were proposed: factual, definition and
closed list questions. Participants could return one exact answer per question and
up to two runs. Some questions (10%) had no answer in the document collection,
and in this case the exact answer is "NIL".</p>
        <p>The French test set consists of 200 questions:
• 135 Factual (F),
• 30 Definition (D),
• 35 closed List questions (L).</p>
        <p>Among these 200 questions, 66 were temporally restricted questions (T) and 12
were NIL questions (i.e. a “NIL” answer was expected, meaning that there is no
valid answer for this question in the document collection).</p>
        <p>Table 14 shows the final results of the assessment of the 3 runs submitted by
Synapse. For each run, the following statistics are provided:
• The number of correct (R), wrong (W), inexact (X) and unsupported answers (U),
• The accuracy calculated within each of the categories of questions: F, D, T and L questions,
• The number of NIL answers and the proportion of correct ones (i.e. corresponding to a NIL question),
• The Confidence Weighted Score (CWS) measure,
• The accuracy calculated over all answers.</p>
        <p>Figure 2 shows the best scores for systems using French as target in the last
five CLEF QA campaigns.</p>
        <p>For the monolingual task, the Synapse system returned 113 correct answers
(accuracy of 56.5%), slightly more than last year (accuracy of 54.0%). The
performance of the bilingual runs is quite low, with an accuracy of 18.0% for EN-to-FR and
16.5% for PT-to-FR. These results cannot be fairly compared to those of CLEF 2007,
because Synapse did not submit bilingual runs last year. Last year, LCC obtained an
accuracy of 41.7% for EN-to-FR, but did not submit anything this year.</p>
        <p>It appears that the level of performance strongly depends on the type of
questions. The monolingual run scores very high on the definition questions (86.7%).
The lowest performance is obtained with closed list questions (37.1%).</p>
        <p>This is even more obvious when looking at the bilingual runs. Although the systems
performed fairly well on the definition questions (50.0% and 43.3% for EN-to-FR
and PT-to-FR respectively), they could not cope with the closed list questions. The
PT-to-FR system gave only one correct answer to a closed list question, and the EN-to-FR
system could not answer any of these questions. The bilingual runs did not
reach high accuracy with factoid and temporally restricted questions (50.0% and
43.3% for EN-to-FR and PT-to-FR respectively). This year, the complexity of the
task, in particular regarding closed list questions, seems to have been hard to cope
with for the bilingual systems.</p>
        <p>The complexity of the task is also reflected by the number of NIL answers. The
monolingual system returned 20 NIL answers (to be compared with the 12
expected). The bilingual systems returned 60 (EN-to-FR) and 67 (PT-to-FR) NIL
answers, i.e. at least 5 times as many as expected.</p>
        <p>It is also interesting to look at the results when categorizing questions by the
size of the topic they belong to. This year, topics could contain from 1 single
question to 4 questions. The CLEF 2008 set consists of:
• 52 single question topics,
• 33 topics with 2 questions (66 questions in total),
• 18 topics with 3 questions (54 questions in total),
• 7 topics with 4 questions (28 questions in total).</p>
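        <p>The breakdown above can be verified with a small sketch. The mapping from topic
size to number of topics is copied from the list; the variable names are ours.</p>
        <preformat>
```python
# Topic size (questions per topic) mapped to the number of such topics,
# as listed for the French CLEF 2008 test set.
topics_by_size = {1: 52, 2: 33, 3: 18, 4: 7}

n_topics = sum(topics_by_size.values())
n_questions = sum(size * count for size, count in topics_by_size.items())

print(n_topics, n_questions)
```
        </preformat>
        <p>The four groups add up to 110 topics and to the full set of 200 questions.</p>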
        <p>Table 15, Table 16 and Table 17 give the results of each run according to the
size of the topics.</p>
        <p>The monolingual system (Table 15) is not sensitive to the size of the topic
question set. By contrast, the performance of the bilingual systems (Table 16
and Table 17) drops by half when comparing the 1- and 2-question sets to the
3- and 4-question sets. A possible explanation is that the bilingual systems
perform poorly with questions containing anaphoric references (which are more
likely to occur in the 3- and 4-question sets).</p>
        <p>In conclusion, there was unfortunately only one participant this year. In
particular, it would have been interesting to see how the LCC group, which submitted a
bilingual run last year, would have performed this year.</p>
        <p>This decrease in participation can be explained by the discouragement of some
participants. Some have complained that the task becomes harder each year (e.g. this
year, there were more closed list questions and anaphoric references than last
year), which can result in a decrease in system performance.</p>
        <p>This year, the number and complexity of closed list questions was clearly
higher than the previous year. In the same way, there were more temporally
restricted questions, more topics (comprising from 2 to 4 questions) and more
anaphoric references. It seems that this higher level of difficulty particularly impacted
the bilingual tasks. In spite of this, the monolingual Synapse system performed
slightly better than last year.</p>
      </sec>
      <sec id="sec-4-7">
        <title>3.8 German as Target</title>
        <p>Three research groups submitted runs for evaluation in the track having
German as target language: the German Research Center for Artificial Intelligence
(DFKI), the Fern Universität Hagen (FUHA) and the Universität Koblenz-Landau
(LOGA). All groups provided system runs for the monolingual scenario, DFKI
and FUHA submitted runs for the cross-language English-German scenario and
FUHA also submitted runs for the Spanish-German scenario.</p>
        <p>Compared to the previous editions of the evaluation forum, this year saw an increase in
the accuracy of the best performing system and of an aggregated virtual system for
the monolingual task, and a decrease in the accuracy of the best performing system and of
an aggregated virtual system for the cross-language tasks.</p>
        <sec id="sec-4-7-1">
          <title>WIKI</title>
          <p>The test set questions covered 120 topics, distributed as
follows: 74 topics consisting of 1 question, 24 topics of 2 related questions, 10
topics of 3 related questions, and 12 topics of 4 related questions. The distribution
of the topics over the document collections (CLEF vs. Wikipedia) is presented in
Table 18.</p>
          <p>According to Table 19, the most frequent topic types were OTHER (32), OBJECT
(29) and ORGANIZATION (24), with the first two types more frequent in the
Wikipedia collection of documents (WIKI).</p>
          <p>As regards the source of the answers, 97 questions from 57 topics asked for
information from the CLEF document collection, and the remaining 103 questions from 63
topics asked for information from Wikipedia. Table 20 shows a breakdown of the test set
questions by the expected answer type (EAType) for each collection of data.</p>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <title>3.9 Portuguese as Target</title>
        <p>The Portuguese track had six different participants: besides the veteran groups
of Priberam, Linguateca, Universidade de Évora, INESC and FEUP, we had a new
participant this year, Universidade Aberta. No bilingual task occurred this year.</p>
        <p>In this fourth year of Portuguese participation, Priberam again took first place,
as in previous years, with the University of Évora behind. Again we added
the classification X-, meaning incomplete, keeping the classification X+
for answers with extra text or other kinds of inexactness. In Table 22 we present
the overall results (all tables in these notes refer exclusively to the first answer by
each system).</p>
        <p>To provide a more direct comparison with pre-2006 results, in Table 23 we
present the results both for the first question of each topic (which we believe is more
readily comparable to such results) and for the linked questions.</p>
        <p>On the whole, compared to last year, Priberam and Senso (UE) improved their
results, which were already the best. The INESC system and Esfinge (Linguateca) also
showed some improvement. At a lower level, Raposa (FEUP) showed similar
results. The system of Universidade Aberta appeared with good results compared to
some veteran systems. We leave it to the participants to comment on whether it
might have been caused by harder questions or changes (or lack thereof) in the
systems.</p>
        <p>Unlike last year, the results over linked questions are significantly different
from (and below) those over non-linked questions. Question 180 was wrongly worded,
referring to Aida’s opera Verdi instead of the other way around, which also affected
two linked questions. Therefore, we accepted both NIL answers to those
questions, as well as correct ones.</p>
        <p>Table 24 shows the results for each answer type of definition questions, while
Table 25 shows the results for each answer type of factoid questions (including list
questions). As can be seen, four out of six systems perform clearly better when
it comes to definitions than to factoids. Senso in particular has a high accuracy
regarding definitions.</p>
        <p>We included in both Table 24 and Table 25 a virtual run, called combination, in
which a question is considered correct if at least one participating system found
a valid answer. The objective of this combination run is to show the potential
achievement when combining the capacities of all the participants. The
combination run can be considered, to some extent, the state of the art in monolingual Portuguese
question answering. All definition questions were answered by at least one
system.</p>
        <p>The system with the best results, Priberam, correctly answered 64.8% of the questions
with at least one correct answer. In all, 130 questions were answered by more than
one system.</p>
        <p>In Table 26, we present some values concerning answer and snippet size.</p>
        <p>Temporally restricted questions: Table 27 presents the results of the 17
temporally restricted questions. As in previous years, the effectiveness of the systems
in answering those questions is visibly lower than for non-TRQ questions.</p>
        <sec id="sec-4-8-1">
          <title>Run name</title>
          <p>List questions: ten questions were defined as list questions, all closed list
factoids with two to five elements each<sup>3</sup>. The results have not improved, with UE getting two
correct answers, Priberam three and all other systems zero. There were, however, seven
cases of incomplete answers (i.e. answering some elements of the list only),
although only two of them with more than one element of the answer.</p>
          <p>3 There were some open list questions as well, but they were
classified and evaluated as ordinary factoids.</p>
          <p>Answer source: Table 28 presents the distribution of questions by source
during their selection, together with the distribution of sources used by the different runs and their
correctness.</p>
        </sec>
      </sec>
      <sec id="sec-4-9">
        <title>3.10 Romanian as Target</title>
        <p>In the third year of Romanian participation in QA@CLEF, and the second one
with Romanian addressed as a target language, the question generation was based
on the collection of Wikipedia Romanian pages frozen in November 2006<sup>4</sup>, the
same corpus as in the previous edition<sup>5</sup>.</p>
        <p>Creation of Questions. The questions were generated starting from the corpus
and based on the Guidelines for Question Generation<sup>6</sup>, the Guidelines for
Participants<sup>7</sup> and the final decisions taken after email discussions between the organizers.
The 200 questions are distributed according to Table 29, where for each type of
question and expected answer we also indicate the temporally restricted questions
out of the total number of questions. Without counting the NIL questions, all
the questions have their answer in the Wikipedia collection.</p>
        <p>Table 29. Question and answer type distribution (temporally restricted questions in
parentheses). Totals per question type: FACTOID 162 (44), DEFINITION 28 (2), LIST 10 (2);
NIL questions: 8.</p>
        <p>4 http://static.wikipedia.org/downloads/November_2006/ro/</p>
        <p>5 At http://static.wikipedia.org/downloads/ the frozen versions of
Wikipedia exist for April 2007 and June 2008, for all languages
involved in QA@CLEF.</p>
        <p>6 http://celct.isti.cnr.it/ClefQA/QA@CLEF08_Question_Generation_Guidelines.pdf</p>
        <p>7 http://nlp.uned.es/clef-qa/QA@CLEF08_Guidelines-forParticipants.pdf</p>
        <p>As the Guidelines for Question Generation have not changed since the previous
edition, there were no major difficulties in creating the Romanian gold standard (GS) for
the 2008 QA@CLEF. The working version of the GS was uploaded on the
question generation interface developed at CELCT (Italy), by filling in all the required
fields.</p>
        <p>For the topic-related questions (clusters of up to four questions related to
the same topic) we kept about the same number as in the previous edition: in 2007 we
had 122 topics and now there are 119 topics. The percentage of topic-linked
questions is illustrated in Table 30, showing that 127 questions were grouped under 46
topics, hence 63.5% of the total 200 questions were linked in topics with more
than one question.</p>
        <p>In fact the questions contain not 127, but only 51 anaphoric elements of various
types, so that 25.5% of the questions are linked through coreferential relations.
Personal, possessive or demonstrative pronouns were used in most
cases to create anaphoric relations. The antecedents are mainly the focus of the
previous question, or the previous answer. A few such questions require inference in
order to be correctly answered. For example, in order to correctly answer the
F-Time question "When was the first Esperanto dictionary for Romanian published?"
and then the L-Other question "Name all the grammatical cases of this artificial language.",
one needs to correctly link the anaphor “artificial language” to its antecedent,
which is “Esperanto” and not “Romanian” (also a language but not artificial); this
is possible by establishing, based on a text snippet, that Esperanto is an artificial
language.</p>
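        <p>The percentages quoted above follow directly from the reported counts, as the
sketch below illustrates. The variable names are our own, and the number of
single-question topics is derived rather than taken from the text.</p>
        <preformat>
```python
# Figures reported for the Romanian 2008 test set.
total_questions = 200
total_topics = 119
multi_question_topics = 46   # topics with more than one question
linked_questions = 127       # questions grouped under those topics
anaphoric_questions = 51     # questions containing an anaphoric element

# Derived values.
single_question_topics = total_topics - multi_question_topics
linked_share = 100.0 * linked_questions / total_questions
anaphoric_share = 100.0 * anaphoric_questions / total_questions

print(single_question_topics, linked_share, anaphoric_share)
```
        </preformat>
        <p>The derived 73 single-question topics, together with the 127 linked questions,
account for all 200 questions.</p>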
        <p>The 8 NIL questions, even though they seem somewhat unnatural, were created
by including questions about facts that are impossible from a human perspective; for
example, the question "In which year did Paul Kline publish his work about the
natural phenomena called hail?" has no answer in any of the articles about the
psychologist. Another type of NIL question is based on inference: the question
"How many bicameral Parliaments are there in Cuba?" is a NIL question because
in all wiki articles one can find that Cuba has a unicameral parliament. A further
type of NIL question that we created (with an answer in English, but not in Romanian)
would not be a good item either in a cross-lingual evaluation where the
answers are to be found in any language, or in an evaluation based on an open text
collection such as the web. The question "What is a micron?" has no answer in the
Romanian wiki articles from 2006, but it can have an answer in other Romanian
webpages and, moreover, in the English wiki articles it has more than one correct
answer depending on the domain where the term is used (in the metric system or
in vacuum engineering).</p>
        <p>For the LIST type we created only questions whose answers are to be found in
a single text section. The 2007 evaluation for Romanian showed that “open list”
questions (with answers in various sections of an article or even in various
articles) are difficult to handle, therefore we made the LIST questions easier.</p>
        <p>Systems’ analysis and evaluation. As in the 2007 edition, this year two
Romanian groups took part in the monolingual task with Romanian as a target language:
the Faculty of Computer Science from the Al. I. Cuza University of Iasi (UAIC),
and the Research Institute for Artificial Intelligence from the Romanian Academy
(ICIA), Bucharest. Each group submitted two runs; on average, the ICIA runs returned
2.4 answers per question and the UAIC runs 1.92. The 2008 general
results are presented in Table 31 below.</p>
        <p>The statistics include a system, named combined, obtained through the
combination of the 4 participating RO-RO systems. Because at evaluation time we
observed that there are correct answers not only in the first position, but also in the
second or the third, the combined system considers that an answer is R if there
exists at least one R answer among all the answers returned by the four systems. If
there is no R answer, the same strategy is applied to X, U and finally W answers.
This “ideal” system makes it possible to calculate the percentage of the questions (and their
type) answered by at least one of the four systems in any of the maximum 3
answers returned for a question.</p>
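        <p>The selection rule of the combined system can be sketched as follows. This is an
illustration of the rule as described in the text, not the script actually used for
the evaluation; the function name and the example judgements are hypothetical.</p>
        <preformat>
```python
# Judgement labels in decreasing order of preference:
# Right, ineXact, Unsupported, Wrong.
PRIORITY = ["R", "X", "U", "W"]


def combined_judgement(judgements):
    """Return the best label among all answers of all systems for one question."""
    for label in PRIORITY:
        if label in judgements:
            return label
    raise ValueError("no judgements supplied")


# Hypothetical judgements for one question, pooled over the four runs
# (up to three answers per run).
pooled = ["W", "X", "W", "U", "W"]
best = combined_judgement(pooled)

print(best)
```
        </preformat>
        <p>Applying this function to every question yields the judgement distribution of
the combined run.</p>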
        <p>All three systems failed on the LIST questions. The best results were obtained
by ICIA for DEFINITION questions, whereas UAIC performed best with the
FACTOID questions. The combined system suggests that a joint system,
developed by both groups, would substantially improve the general results for
Romanian.</p>
        <p>In a first stage we used the web interface for assessing the QA runs, developed at
UNED in Spain; the assessment took into consideration one question with all its
answers at a time, ensuring that the same evaluation criteria were applied to all
answers. The judgment of the answers was based on the same Guidelines as in 2007,
and we therefore kept the same criteria as in 2007, in order to ensure consistency
within the Romanian language, which also makes it possible to evaluate the
systems in their evolution from one year to another. For example, one can easily
see that the UAIC systems had most of the answers for the DEFINITION
questions evaluated as ineXact, because the answers were judged as being “longer than
the minimum amount of information required” and hence “unnecessary pieces of
information were penalized”. Since all the 2007 and 2008 answers were evaluated
this way, we considered it more important to have uniformly applied rules
within one language than to change the evaluation in order to be consistent across
languages. On the other hand, the ICIA answers judged as ineXact are due to
answers that are too long, snippets shortened so much that they do not contain the
answer, or answers and snippets that have no connection.</p>
        <p>Table 31. Results in the monolingual task, Romanian as target language (runs:
Combined, icia081roro, icia082roro, uaic081roro, uaic082roro).</p>
        <p>The evaluation was made more difficult because two of the submitted runs
contained the answers in a totally arbitrary order, with topic-related questions having
their answers in various parts of the submitted file. Although in the first stage the UNED
interface was a great help, once the xml file had been generated with all the
evaluations, the corrections needed a thorough manual inspection. Anyway, it was nice to
find out that the answer to the question "Which terrorist organization does Osama
bin Laden belong to?" is Pentagon.</p>
      </sec>
      <sec id="sec-4-10">
        <title>3.11 Spanish as Target</title>
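        <p>Participation in the Spanish as Target subtask decreased from 5 groups in
2007 to 4 groups this year. 6 runs were monolingual and 3 runs were cross-lingual.
Table 32 summarizes the systems' results with the number of Right (R),
Wrong (W), Inexact (X) and Unsupported (U) answers. The table also shows the
accuracy (in percentage) of factoids (F), factoids with temporal restriction (T),
definitions (D) and list questions (L). Best values are marked in bold face.</p>
        <table-wrap>
          <label>Table 32</label>
          <table>
            <thead>
              <tr><th>Run</th><th>R</th><th>W</th><th>X</th><th>U</th><th>% F</th><th>% T</th><th>% D</th><th>% L</th><th/><th/><th>Overall accuracy (%)</th></tr>
            </thead>
            <tbody>
              <tr><td>prib081eses</td><td>86</td><td>105</td><td>5</td><td>4</td><td>41,13</td><td>41,67</td><td>75</td><td>20</td><td>3</td><td>0,4483</td><td>42,5</td></tr>
              <tr><td>inao082eses</td><td>44</td><td>152</td><td>3</td><td>1</td><td>19,35</td><td>8,33</td><td>80</td><td>5</td><td>4</td><td>0,2342</td><td>22</td></tr>
              <tr><td>inao081eses</td><td>42</td><td>156</td><td>1</td><td>1</td><td>15,32</td><td>8,33</td><td>95</td><td>5</td><td>3</td><td>0,2375</td><td>21</td></tr>
              <tr><td>qaua082eses</td><td>39</td><td>156</td><td>4</td><td>1</td><td>22,58</td><td>13,89</td><td>30</td><td/><td>6</td><td>0,2217</td><td>19,5</td></tr>
              <tr><td>mira081eses</td><td>32</td><td>156</td><td>3</td><td>9</td><td>12,90</td><td>2,78</td><td>75</td><td/><td>3</td><td>0,1766</td><td>16</td></tr>
              <tr><td>mira082eses</td><td>29</td><td>159</td><td>3</td><td>9</td><td>11,29</td><td>2,78</td><td>70</td><td/><td>3</td><td>0,1591</td><td>14,50</td></tr>
              <tr><td>qaua081enes</td><td>25</td><td>173</td><td>-</td><td>2</td><td>11,29</td><td>16,67</td><td>20</td><td>5</td><td>6</td><td>0,1450</td><td>12,50</td></tr>
              <tr><td>qaua082enes</td><td>18</td><td>176</td><td>3</td><td>3</td><td>9,68</td><td>8,33</td><td>15</td><td/><td>8</td><td>0,1108</td><td>9</td></tr>
              <tr><td>mira081fres</td><td>10</td><td>185</td><td>2</td><td>3</td><td>5,65</td><td/><td>15</td><td/><td>3</td><td>0,0533</td><td>5</td></tr>
            </tbody>
          </table>
        </table-wrap>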
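        <p>To make the relation between the answer counts and the overall figures concrete,
the following minimal Python sketch computes overall accuracy as the share of Right
answers among all judged first answers. Both this formula and the function name
overall_accuracy are assumptions made for illustration rather than definitions taken
from the track guidelines; the counts are those reported for run inao082eses.</p>
        <preformat>
```python
def overall_accuracy(right, wrong, inexact, unsupported):
    """Percentage of Right answers among all judged answers.

    Assumption: every question receives exactly one judgement
    (R, W, X or U), so the four counts sum to the number of questions.
    """
    total = right + wrong + inexact + unsupported
    if total == 0:
        raise ValueError("no judged answers")
    return 100.0 * right / total


# Counts reported for run inao082eses: R=44, W=152, X=3, U=1.
inao082eses_accuracy = overall_accuracy(44, 152, 3, 1)
print(round(inao082eses_accuracy, 2))  # 22.0
```
        </preformat>
        <p>For inao082eses this gives 22.0, in line with the overall figure reported for that run.</p>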
        <p>Table 33 shows that the first question of each topic group is answered much
more easily than the remaining questions, which require resolving references to
previous questions and answers.</p>
        <p>Regarding NIL questions, Table 34 shows the harmonic mean (F) of precision
and recall for self-contained questions, linked questions and all questions, taking
into account only the first answer. Most systems do not return NIL as a
second or third candidate answer.</p>
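        <p>The harmonic mean of precision and recall over NIL answers can be sketched as
follows. This is a minimal illustration under stated assumptions: the function name
nil_f_measure, the use of the literal string "NIL" as marker, and the toy data are all
hypothetical, and only the first answer per question is considered, as in the text above.</p>
        <preformat>
```python
def nil_f_measure(system_answers, gold_answers):
    """F = harmonic mean of precision and recall over NIL answers.

    Both arguments are equally long lists holding the first answer per
    question; the string "NIL" marks "no answer in the collection".
    """
    if len(system_answers) != len(gold_answers):
        raise ValueError("answer lists must have the same length")
    pairs = list(zip(system_answers, gold_answers))
    returned = sum(1 for s, _ in pairs if s == "NIL")   # NIL given by system
    expected = sum(1 for _, g in pairs if g == "NIL")   # NIL in gold standard
    correct = sum(1 for s, g in pairs if s == "NIL" and g == "NIL")
    if correct == 0:
        return 0.0
    precision = correct / returned
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration (not taken from any submitted run).
system = ["NIL", "Madrid", "NIL", "1992", "NIL"]
gold = ["NIL", "Madrid", "Paris", "NIL", "NIL"]
toy_f = nil_f_measure(system, gold)
print(round(toy_f, 4))
```
        </preformat>
        <p>In the toy data the system returns NIL three times, two of them correctly, and
three NIL answers are expected, so precision and recall are both 2/3.</p>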
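        <p>The correlation coefficient r between the self-score and the correctness of the
answers (shown in Table 34) is similar to the one obtained last year. It is still not
good enough, which explains the low results in the CWS and K1 [6] measures.</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Run</th><th/></tr>
            </thead>
            <tbody>
              <tr><td>prib081eses</td><td>0,26</td></tr>
              <tr><td>inao082eses</td><td>0,14</td></tr>
              <tr><td>inao081eses</td><td>0,19</td></tr>
              <tr><td>qaua082eses</td><td>0,27</td></tr>
              <tr><td>mira081eses</td><td>0,27</td></tr>
              <tr><td>mira082eses</td><td>0,29</td></tr>
              <tr><td>qaua081enes</td><td>0,26</td></tr>
              <tr><td>qaua082enes</td><td>0,20</td></tr>
              <tr><td>mira081fres</td><td>0,15</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>% Answer Extraction: 90,53; 80,56; 80,00; 80,00; 73,58; 67,74</p>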
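        <p>The link between self-scores and correctness can be illustrated with a short
Python sketch. It computes the Pearson correlation between self-scores and 0/1
correctness judgements, together with a confidence-weighted score. The CWS formula used
here is the usual TREC 2002 definition, which we assume is the one intended; the function
names and the toy data are hypothetical.</p>
        <preformat>
```python
import math


def pearson_r(scores, correct):
    """Pearson correlation between self-scores and 0/1 correctness."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(correct) / n
    cov = sum((s - mean_s) * (c - mean_c) for s, c in zip(scores, correct))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_c = sum((c - mean_c) ** 2 for c in correct)
    if var_s == 0 or var_c == 0:
        return 0.0  # undefined correlation; reported as 0 in this sketch
    return cov / math.sqrt(var_s * var_c)


def confidence_weighted_score(scores, correct):
    """Mean over ranks i of (correct answers among the top i) / i,
    with answers ranked by decreasing self-score."""
    ranked = sorted(zip(scores, correct), key=lambda pair: -pair[0])
    running_correct = 0
    total = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        running_correct += is_correct
        total += running_correct / rank
    return total / len(ranked)


# Toy data, invented for illustration: one self-score and one 0/1
# judgement per question.
toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_correct = [1, 0, 1, 0]
toy_r = pearson_r(toy_scores, toy_correct)
toy_cws = confidence_weighted_score(toy_scores, toy_correct)
print(round(toy_r, 3), round(toy_cws, 3))
```
        </preformat>
        <p>When high self-scores are assigned to wrong answers, as for the second toy answer,
both the correlation and the confidence-weighted score drop, which is the effect
described above.</p>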
        <p>With respect to the source of the answers, Table 36 shows that in this second
year of using Wikipedia, this collection is now the main source of correct answers
for most of the systems (with the exception of U. of Alicante).</p>
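        <table-wrap>
          <label>Table 36</label>
          <table>
            <thead>
              <tr><th>Run</th><th>% of correct answers found in EFE</th><th>% of correct answers found in Wikipedia</th><th>% of correct answers found NIL</th></tr>
            </thead>
            <tbody>
              <tr><td>prib081eses</td><td>36,97</td><td>60,50</td><td>2,52</td></tr>
              <tr><td>inao082eses</td><td>24,14</td><td>68,97</td><td>6,90</td></tr>
              <tr><td>inao081eses</td><td>25</td><td>70</td><td>5</td></tr>
              <tr><td>qaua082eses</td><td>48,53</td><td>42,65</td><td>8,82</td></tr>
              <tr><td>mira081eses</td><td>23,26</td><td>69,77</td><td>6,98</td></tr>
              <tr><td>mira082eses</td><td>21,62</td><td>70,27</td><td>8,11</td></tr>
              <tr><td>qaua081enes</td><td>52,27</td><td>29,55</td><td>18,18</td></tr>
              <tr><td>qaua082enes</td><td>48,57</td><td>34,29</td><td>17,14</td></tr>
              <tr><td>mira081fres</td><td>33,33</td><td>41,67</td><td>25</td></tr>
            </tbody>
          </table>
        </table-wrap>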
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Conclusions</title>
      <p>This year we proposed the same evaluation setting as in the 2007 campaign.
Last year the task was changed considerably, which affected both the general level
of results and the level of participation in the QA task. This year participation
increased slightly, but the task still proved to be very difficult. Wikipedia increased
its presence as a source of questions and answers. In line with last year’s
conclusions, Wikipedia seemed to be a good source for finding answers to simple factoid
questions.</p>
      <p>Moreover, the overall decrease in accuracy was probably due to linked
questions. This fact confirms that topic resolution is a weak point for QA systems.</p>
      <p>Only 5 out of 11 target languages had more than one participating
group. Thus, from the perspective of evaluation methodology, systems working under
similar circumstances cannot be compared. This impedes one of the major goals of
campaigns such as QA@CLEF, namely the comparison of systems, which could lead to
improved approaches to problematic QA issues.</p>
      <p>In six years of QA experimentation, a lot of resources and know-how have
been accumulated. Nevertheless, systems do not show a brilliant overall
performance, even those that have participated in most QA campaigns, and they still
seem unable to handle the different challenges proposed.</p>
      <p>In conclusion, it is clear that a redefinition of the task should be considered
for the next campaign. This new definition of the task should permit the evaluation and
comparison of systems even when they work in different languages. The new setting
should also take a real user scenario as reference, perhaps with a new document
collection.</p>
      <p>Acknowledgements. Special thanks to Danilo Giampiccolo (CELCT, Trento,
Italy), who has given his precious advice and valuable support at many levels for
the preparation and realization of the QA track at CLEF 2008.</p>
      <p>Jesús Herrera has been partially supported by the Spanish Ministry of
Education and Science (TIN2006-14433-C02-01 project).</p>
      <p>Anselmo Peñas has been partially supported by the Spanish Ministry of Science
and Technology within the Text-Mess-INES project (TIN2006-15265-C06-02).</p>
      <p>Paulo Rocha was supported by the Linguateca project, jointly funded by the
Portuguese Government and the European Union (FEDER and FSE), under
contract ref. POSC/339/1.3/C/NAC.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. QA@CLEF Website: http://clef-qa.itc.it/</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. AVE Website: http://nlp.uned.es/QA/ave/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. QAST Website: http://www.lsi.upc.edu/~qast/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. QA-WSD Website: http://ixa2.si.ehu.es/qawsd/</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. QA@CLEF 2007 Organizing Committee. Guidelines 2007. http://clef-qa.itc.it/2007/download/QA@CLEF07_Guidelines-for-Participants.pdf</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hartrumpf</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glöckner</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leveling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>University of Hagen at QA@CLEF 2007: Coreference Resolution for Questions and Answer Merging</article-title>
          . This volume.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Question Answering Pilot Task at CLEF 2004</article-title>
          . In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.):
          <source>Multilingual Information Access for Text, Speech and Images. Lecture Notes in Computer Science</source>
          , Vol.
          <volume>3491</volume>
          . Springer-Verlag, Berlin Heidelberg New York (
          <year>2005</year>
          )
          <fpage>581</fpage>
          -
          <lpage>590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ion</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Word Sense Disambiguation Methods Applied to English and Romanian</article-title>
          .
          <source>PhD thesis</source>
          , Romanian Academy, Bucharest (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ion</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mititelu</surname>
            ,
            <given-names>V.B.</given-names>
          </string-name>
          :
          <article-title>Constrained Lexical Attraction Models</article-title>
          .
          <source>In: Nineteenth International Florida Artificial Intelligence Research Society Conference</source>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>302</lpage>
          . AAAI Press, Menlo Park, California, USA (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Landis</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          :
          <article-title>The measurement of observer agreement for categorical data</article-title>
          .
          <source>Biometrics</source>
          ,
          <volume>33</volume>
          (
          <year>1977</year>
          )
          <fpage>159</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Laurent</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Séguéla</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nègre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Cross Lingual Question Answering using QRISTAL for CLEF 2007</article-title>
          . This volume.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giampiccolo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ayache</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jijkoun</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osenova</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocha</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sacaleanu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sutcliffe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF 2006 Multilingual Question Answering Track</article-title>
          .
          <source>In: Evaluation of Multilingual and Multi-modal Information Retrieval. Lecture Notes in Computer Science</source>
          , Vol.
          <volume>4730</volume>
          . Springer-Verlag, Berlin Heidelberg New York (
          <year>2007</year>
          )
          <fpage>223</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Overview of the Answer Validation Exercise 2007</article-title>
          . This volume.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Turmo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Comas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ayache</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mostefa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosset</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lamel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Overview of QAST 2007</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vallin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giampiccolo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aunimo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ayache</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osenova</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sacaleanu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sutcliffe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF 2005 Multilingual Question Answering Track</article-title>
          .
          <source>In: Accessing Multilingual Information Repositories. Lecture Notes in Computer Science</source>
          , Vol.
          <volume>4022</volume>
          . Springer-Verlag, Berlin Heidelberg New York (
          <year>2006</year>
          )
          <fpage>307</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Overview of the TREC 2002 Question Answering Track</article-title>
          .
          <source>In: NIST Special Publication 500-251: The Eleventh Text REtrieval Conference (TREC 2002)</source>
          . National Institute of Standards and Technology, USA (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez de Lacalle</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>UBC-ALM: Combining k-NN with SVD for WSD</article-title>
          .
          <source>In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007)</source>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>345</lpage>
          . Prague, Czech Republic (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>Y.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>H.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>NUS-PT: Exploiting Parallel Texts for Word Sense Disambiguation in the English All-Words Tasks</article-title>
          .
          <source>In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007)</source>
          , pp.
          <fpage>253</fpage>
          -
          <lpage>256</lpage>
          . Prague, Czech Republic (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>