=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-MLQA10-PenasEt2010
|storemode=property
|title=Overview of ResPubliQA 2010: Question Answering Evaluation over European Legislation
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-MLQA10-PenasEt2010.pdf
|volume=Vol-1176
|dblpUrl=https://dblp.org/rec/conf/clef/PenasFRSFM10
}}
==Overview of ResPubliQA 2010: Question Answering Evaluation over European Legislation==
Overview of ResPubliQA 2010: Question Answering Evaluation over European Legislation Anselmo Peñas1, Pamela Forner2, Álvaro Rodrigo3, Richard Sutcliffe4, Corina Forăscu5, Cristina Mota6 1 NLP&IR group, UNED, Spain (anselmo@lsi.uned.es) 2 CELCT, Italy (forner@celct.it) 3 NLP&IR group, UNED, Spain (alvarory@lsi.uned.es) 4 University of Limerick, Ireland (richard.sutcliffe@ul.ie) 5 UAIC and RACAI, Romania (corinfor@info.uaic.ro) 6 SINTEF ICT, Norway (cmota@ist.utl.pt) Abstract. This paper describes the second round of ResPubliQA, a Question Answering (QA) evaluation task over European legislation, a LAB of CLEF 2010. Two tasks have been proposed this year: Paragraph Selection (PS) and Answer Selection (AS). The PS task consisted of extracting a relevant paragraph of text that satisfies completely the information need expressed by a natural language question. In the AS task, the exercise was to demarcate the shorter string of text corresponding to the exact answer supported by the entire paragraph. The general aims of this exercise are (i) to move towards a domain of potential users; (ii) to propose a setting which allows the direct comparison of performance across languages; (iii) to allow QA technologies to be evaluated against IR approaches; (iv) to promote validation technologies to reduce the amount of incorrect answers by leaving some questions unanswered. These goals are achieved through the use of parallel aligned document collections (JRC-Acquis and EUROPARL) and the possibility to return two different types of answers, either passages or exact strings. The paper describes the task in more detail, presenting the different types of questions, the methodology for the creation of the test sets and the evaluation measure, and analyzing the results obtained by systems and the more successful approaches. Thirteen groups participated in both PS and AS tasks submitting 49 runs in total. 1. INTRODUCTION The ResPubliQA 2010 exercise is aimed at retrieving answers to a set of 200 questions over EUROPARL and ACQUIS collections. Questions were offered in 8 different languages: Basque (EU), English (EN), French (FR), German (DE), Italian (IT), Portuguese (PT), Romanian (RO) and Spanish (ES). All Monolingual and Cross- language subtasks combinations of questions between the last 7 languages above were activated, including monolingual English (EN). Basque (EU), instead, was included exclusively as a source language, as there is no Basque translation of the document collection, which means that no monolingual EU-EU sub-task could be enacted. The design of the ResPubliQA 2010 evaluation campaign was to a large extent the repetition of the previous year’s exercise [1] with the addition of a number of refinements. Thus, the main goals of the lab this year are basically the same: Moving towards a domain of potential users; Moving to an evaluation setting able to compare systems working in different languages; Comparing current QA technologies with pure Information Retrieval (IR) approaches; Allowing more types of questions; Introducing in QA systems the Answer Validation technologies developed in the past campaigns [2,3,5]. As a difference with the previous campaign, this year participants had the opportunity to return both paragraph and exact answers as system output. Another novelty this year is the addition of a portion of the EUROPARL collection1 in the languages involved in the task. 
The subject of EUROPARL’s parliamentary domains is different in style and content from ACQUIS while being fully compatible with it. This has given participants the opportunity to adapt their systems in a way which widens their coverage in compatible domains; and for the organizers it has represented the opportunity to widen the scope of the questions (through the introduction of new types of question, as for example opinion). 1 http://www.europarl.europa.eu/ The paper is organized as follows: Section 2 illustrates the document collection; Section 3 gives an overview of the different types of question developed; Section 4 addresses the various steps to create the ResPubliQA data set; Section 5 provides an explanation of the evaluation measure and of how systems have been evaluated; Section 6 gives some details about participation in this year evaluation campaign; Section 7 presents and discusses the results achieved by participating systems and across the different languages; Section 8 shows the approaches used by participating systems; and Sections 9 draws some conclusions. 2. DOCUMENT COLLECTION Two sets of documents have been included in ResPubliQA 2010 collection: a subset of the JRC-ACQUIS Multilingual Parallel Corpus2 and a small portion of the EUROPARL collection. Both are multilingual parallel collections. JRC-ACQUIS3 is a freely available parallel corpus containing the total body of European Union (EU) documents, mostly of legal nature. It comprises contents, principles and political objectives of the EU treaties; the EU legislation; declarations and resolutions; international agreements; and acts and common objectives. Texts cover various subject domains, including economy, health, information technology, law, agriculture, food, politics and more. This collection of legislative documents currently includes selected texts written between 1950 and 2006 with parallel translations in 22 languages. The corpus is encoded in XML, according to the TEI guidelines. The subset used in ResPubliQA consists of 10,700 parallel and aligned documents per language (Bulgarian, English, French, German, Italian, Portuguese, Romanian and Spanish). The documents are grouped by language, and inside each language directory, documents are grouped by year. All documents have a numerical identifier called the CELEX code, which helps to find the same text in the various languages. Each document contains a header (giving for instance the download URL and the EUROVOC codes) and a text (which consists of a title and a series of paragraphs). EUROPARL4 is a collection of the Proceedings of the European Parliament dating back to 1996. European legislation is a topic of great relevance to a large number of potential users from citizens to lawyers, government agencies politicians and many others. EUROPARL comprises texts in each of the 11 official languages of the European Union (Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese and Swedish). With the enlargement of the European Union to 25 member countries in May 2004, the European Union has begun to translate texts into even more languages. However, translations into Bulgarian and Romanian start from January 2009 and for this reason we only compiled documents from the European Parliament site (http://www.europarl.europa.eu/) starting from that date. In this way, we ensured a parallel collection for 9 languages (Bulgarian, Dutch, English, French, German, Italian, Portuguese, Romanian and Spanish). 3. 
TYPES OF QUESTIONS Beside the question types used last year (Factoid, Definition, Procedure) two additional question categories were added in the 2010 campaign: Opinion and a miscellanea called Other. Moreover, Reason and Purpose categories were merged into a single one as the distinction between them was a little blurred in the past edition. The following are examples of these types of questions: Factoid. Factoid questions are fact-based questions, asking for the name of a person, a location, the extent of something, the day on which something happened, etc. For example: Q: What percentage of people in Italy relies on television for information? P: In Italy, 80% of the people get their daily information from television. If that television is not broadcasting all voices, then people do not get the chance to make their own decisions. That is fundamental to democracy. A: 80% Q: What is the maximum efficiency index for a ten-place dishwasher? 2 Please note that it cannot be guaranteed that a document available on-line exactly reproduces an officially adopted text. Only European Union legislation published in paper editions of the Official Journal of the European Union is deemed authentic. 3 http://wt.jrc.it/lt/Acquis/ 4 http://www.europarl.europa.eu/ P: (a) Dishwashers with 10 or more place settings shall have an energy efficiency index lower than 0,58 as defined in Annex IV to Commission Directive 97/17/EC of 16 April 1997 implementing Council Directive 92/75/EEC with regard to energy labelling of household dishwashers(1), using the same test method EN 50242 and programme cycle as chosen for Directive 97/17/EC. A: 0,58 Definition. Definition questions are questions such as "What/Who is X?", i.e. questions asking for the role/job/important information about someone, or questions asking for the mission/full name/important information about an organization. For example: Q: What is avian influenza? P: (1) Avian influenza is an infectious viral disease in poultry and birds, causing mortality and disturbances which can quickly take epizootic proportions liable to present a serious threat to animal health and to reduce sharply the profitability of poultry farming. Under certain circumstances the disease may also pose a risk to human health. There is a risk that the disease agent might be spread to other holdings, to wild birds and from one Member State to other Member States and third countries through the international trade in live birds or their products. A: an infectious viral disease in poultry and birds, causing mortality and disturbances which can quickly take epizootic proportions liable to present a serious threat to animal health and to reduce sharply the profitability of poultry farming. Q: What does MFF signify in a financial context? P: 1. Recalls that its political priorities and its assessment of the budgetary framework for the year 2010 were set out in its resolution of 10 March 2009, where Parliament was highly critical of the tight margins available in most of the headings of the Multiannual Financial Framework (MFF); A: Multiannual Financial Framework Reason_Purpose. Reason_Purpose questions ask for the reasons/goals for something happening. For example: Q: Why was Perwiz Kambakhsh sentenced to death? P: I. 
whereas the 23 year-old Afghan journalist Perwiz Kambakhsh was sentenced to death for circulating an article about women's rights under Islam, and whereas, after strong international protests, that sentence was commuted to 20 years" imprisonment, A: for circulating an article about women's rights under Islam Q: What were the objectives of the 2001 Doha Round? P: A. whereas the Doha Round was launched in 2001 with the objectives of creating new trading opportunities, strengthening multilateral trade rules, addressing current imbalances in the trading system and putting trade at the service of sustainable development, with an emphasis on the economic integration of developing countries, especially the least developed countries (LDCs), arising from the conviction that a multilateral system, based on more just and equitable rules, can contribute to fair and free trade at the service of the development of all continents, A: creating new trading opportunities, strengthening multilateral trade rules, addressing current imbalances in the trading system and putting trade at the service of sustainable development, with an emphasis on the economic integration of developing countries, especially the least developed countries (LDCs), arising from the conviction that a multilateral system, based on more just and equitable rules, can contribute to fair and free trade at the service of the development of all continents, Procedure. Procedure questions ask for a set of actions which is the official or accepted way of doing something. For example: Q: How do you calculate the monthly gross occupancy rate of bed places? P: The gross occupancy rate of bed places in one month is obtained by dividing total overnight stays by the product of the bed places and the number of days in the corresponding month (sometimes termed bed-nights) for the same group of establishments, multiplying the quotient by 100 to express the result as a percentage. A: by dividing total overnight stays by the product of the bed places and the number of days in the corresponding month (sometimes termed bed-nights) for the same group of establishments, multiplying the quotient by 100 to express the result as a percentage Q: How do you make a blank test? P: 7.1. A blank test shall be made regularly using an ashless filter paper (5.8) moistened with a mixture of 90 ml (4.1) sodium citrate solution, 1 ml saturated solution of calcium chloride (4.2), 0,5 ml of liquid rennet (4.5), and washed with 3 x 15 ml of distilled water before mineralisation by the Kjeldahl method as described at IDF standard 20A 1986. A: using an ashless filter paper (5.8) moistened with a mixture of 90 ml (4.1) sodium citrate solution, 1 ml saturated solution of calcium chloride (4.2), 0,5 ml of liquid rennet (4.5), and washed with 3 x 15 ml of distilled water before mineralisation by the Kjeldahl method as described at IDF standard 20A 1986 Opinion. Opinion questions ask for the opinions/feelings/ideas about people, topics, events. For example: Q: What did the Council think about the terrorist attacks on London? P: (10) On 13 July 2005, the Council reaffirmed in its declaration condemning the terrorist attacks on London the need to adopt common measures on the retention of telecommunications data as soon as possible. A: condemning the terrorist attacks on London Q: What is the Socialist Group position with respect to the case of Manuel Rosales? 
P: − Madam President, concerning the next vote, on ‘Venezuela: the case of Manuel Rosales’, the Socialist Group, of course, has withdrawn its signature from the compromise resolution. We have not taken part in the debate and we will not take part in the vote.
A: has withdrawn its signature from the compromise resolution. We have not taken part in the debate and we will not take part in the vote.

Other. This category is used for any reasonable question which does not fall into the other categories. For example:
Q: What is the e-Content program about?
P: A multiannual programme "European digital content for the global networks" (hereinafter referred to as "eContent") is hereby adopted.
A: European digital content for the global networks
Q: By whom was the Treaty of Lisbon rejected?
P: The Treaty of Lisbon, which is 96 per cent identical to the draft Constitutional Treaty, was rejected in the referendum in Ireland. Prior to that, the draft Constitutional Treaty was rejected in referendums in France and the Netherlands.
A: was rejected in the referendum in Ireland
Q: Which ideals are central to the EU?
P: (1) Security incidents resulting from terrorism are among the greatest threats to the ideals of democracy, freedom and peace, which are the very essence of the European Union.
A: democracy, freedom and peace

4. TEST SET PREPARATION

Three hundred questions were initially formulated, manually verified against the document collection, translated into English and collected in a common XML format using a web interface specifically designed for this purpose. To avoid a bias towards any one language, the 300 questions were developed by 4 different annotators originally in 4 different languages (75 each). All questions had at least one answer in the target corpus of that language. Then, a second translation from English back into all the nine languages of the track was performed. Translators checked whether a question initially created for a particular language had an answer or not in all the other languages.

Besides the paragraph containing the answer, annotators were also required to demarcate the shorter string of text that answers the question in all the different languages. Pinpointing the precise extent of an answer is a more difficult problem than finding a paragraph that contains an answer. The purposes of demarcating exact responses are (i) to show the evaluators what the question creators considered to be the exact answer, and (ii) to create a GoldStandard which has been used to automatically compare the responses retrieved by the systems against those manually collected by the annotators. Nevertheless, the exact answers returned by systems were also judged by human assessors in addition to the automatic evaluation.

The final pool of 200 questions was selected out of the 300 produced, attempting to balance the question set according to the different question types (factoid, definition, reason/purpose, procedure, opinion and other). The distribution of the different question types in the collection is shown in Table 1. 130 questions had an answer in JRC-ACQUIS and 70 in EUROPARL. All the questions were formulated in such a way that they have an answer in all the collections, that is, there were no NIL questions.

Table 1: Distribution of question types
DEFINITION: 32
FACTOID: 35
REASON/PURPOSE: 33
PROCEDURE: 33
OPINION: 33
OTHER: 34
Total: 200

All language dependent tasks (question creation, translation and assessments of runs) have been performed by native speakers in a distributed setting.
For this reason, a complete set of guidelines for each of these tasks has been shared among annotators and central coordination has been maintained in order to ensure consistency.

5. EVALUATION METHODOLOGY

Systems were allowed to participate in one or both tasks (PS and/or AS), which operated simultaneously on the same input questions. A maximum of two runs in total per participant could be submitted, i.e. two PS runs, two AS runs or one PS plus one AS run. Participants were allowed to submit just one response per question. As in the previous campaign, systems had two options as output for each question:
1. To give an answer (which could be one full paragraph for the PS task, or the shortest possible string of text which contains an exact answer to the question for the AS task);
2. To choose not to answer the particular question (if the system considers that it is not able to find a correct answer). This option is called a NoA answer.

5.1 Evaluation Measure

NoA answers should be used when a system is not confident about the correctness of its answer to a particular question. The goal is to reduce the number of incorrect responses, while keeping the number of correct ones, by leaving some questions unanswered. Systems should ensure that only the portion of wrong answers is reduced, maintaining the number of correct answers as high as possible. Otherwise, a reduction in the number of correct answers is penalised by the evaluation measure, for both the answered and the unanswered questions. We used c@1 as a measure that accounts for this behaviour. c@1, which was introduced in ResPubliQA 2009, was used also this year as the main evaluation measure for both the PS and AS tasks. The formulation of c@1 is:

c@1 = (1/n) * (nR + nU * (nR / n))

where
nR: number of questions correctly answered;
nU: number of questions unanswered;
n: total number of questions (200 in this edition).

Regarding the evaluation of exact answers, we also provide a measure of Answer Extraction performance, that is, the proportion of exact answers correctly extracted from correctly selected paragraphs. Optionally, a system could also give the discarded candidate answer when responding NoA. These candidate answers were also assessed by evaluators in order to give feedback to the participants about the validation performance of their systems, even though they are not considered in the main evaluation measure.
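Since the NoA option and the assessment of discarded candidates both revolve around this measure, a small worked example may help. The following is a minimal sketch and not part of the official evaluation scripts; the function name and the toy counts are our own:

```python
def c_at_1(n_right: int, n_unanswered: int, n_total: int) -> float:
    """c@1 = (nR + nU * nR / n) / n, as given in Section 5.1."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_right + n_unanswered * n_right / n_total) / n_total

if __name__ == "__main__":
    n = 200  # number of test questions in this edition
    # Hypothetical system answering everything: 100 right, 100 wrong.
    print(f"{c_at_1(100, 0, n):.3f}")   # 0.500 (plain accuracy)
    # Same system, but 40 of its wrong answers are withheld as NoA:
    print(f"{c_at_1(100, 40, n):.3f}")  # 0.600 (rewarded for withholding wrong answers)
    # If the 40 withheld answers had been correct ones, the score drops:
    print(f"{c_at_1(60, 40, n):.3f}")   # 0.360 (losing correct answers is penalised)
```

As the toy values show, the measure only rewards unanswered questions in proportion to the accuracy the system achieves on the questions it does answer, which is exactly the behaviour the task aims to encourage.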
5.2 Assessment for Paragraph Selection

Each returned paragraph had a binary assessment: Right (R) or Wrong (W). Questions which were left unanswered were automatically filtered and marked as U (Unanswered). However, the discarded candidate answers given for these questions were also evaluated; human assessors did not know that these answers belonged to unanswered questions. The evaluators were guided by the initial “gold” paragraph, which contained the answers. This “gold” paragraph was only a hint, since there could be other responsive paragraphs in the same or different documents.

5.3 Assessment for Answer Selection

In order to judge the exact answer strings (AS task), assessors also had to take the paragraph into account, as it provided the context and the justification for the exactness of the answer. Each paragraph/answer pair was manually judged and assigned one of the following judgments:
- R (Right): the answer-string consists of an exact and correct answer, supported by the returned paragraph;
- X (ineXact): the answer-string contains either part of a correct answer present in the returned paragraph, or all of the correct answer plus unnecessary additional text; this option allowed the judge to indicate that the answer was only partially correct (for example, because of missing information, or because the answer was more general/specific than required by the question, etc.);
- M (Missed): the answer-string does not contain a correct answer even in part, but the returned paragraph does contain a correct answer. In other words, the answer was there but the system missed it completely (i.e. the system did not extract it correctly);
- W (Wrong): the answer-string does not contain a correct answer and the returned paragraph does not contain it either, or it contains an unsupported answer.

5.4 Automatic Assessments

As human assessment is a time consuming and resource expensive task, this year it was decided to experiment with automatic evaluation in order to reduce the amount of work for the human evaluators. The evaluation was performed in two steps:
1. Each run, for both the PS and AS tasks, was first automatically compared against the manually produced GoldStandard.
2. Non-matching paragraphs/answers were judged by human assessors.
The automatic script filtered out the answers that exactly matched the GoldStandard, assigning them the correct value (R) and leaving to human assessors only the evaluation of non-matching paragraphs/answers. The correctness of a response was determined automatically on the basis of an exact match of the document identifier, the paragraph identifier, and the text retrieved by the system with respect to those in the GoldStandard. Almost 31% of the answers (91% of them for Paragraph Selection and 9% for Answer Selection) matched the GoldStandard and so could be automatically marked as correct. The rest of the paragraphs and answers returned by systems were manually evaluated by native speaker assessors, who considered whether the system output was responsive or not. Answers for the same question were evaluated anonymously and simultaneously, to ensure that the same criteria were applied to all systems.
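The matching logic just described is simple enough to sketch. The official script is not reproduced here; the snippet below is only an illustration of the criteria listed above (document identifier, paragraph identifier and retrieved text), with our own field names, invented example identifiers, and an assumed lowercase/whitespace normalisation before the string comparison:

```python
from dataclasses import dataclass

def normalise(text: str) -> str:
    # Assumption: a simple lowercase/whitespace normalisation before the string
    # comparison; the official evaluation script may normalise differently.
    return " ".join(text.lower().split())

@dataclass(frozen=True)
class Response:
    doc_id: str        # document identifier (the collection uses CELEX-based codes)
    paragraph_id: str  # paragraph identifier within the document
    text: str          # the paragraph (PS) or answer string (AS) returned

def auto_judge(system: Response, gold: Response) -> str | None:
    """Mark a response 'R' when document id, paragraph id and retrieved text all
    match the GoldStandard; otherwise return None, i.e. leave the response for
    the human assessors."""
    if (system.doc_id == gold.doc_id
            and system.paragraph_id == gold.paragraph_id
            and normalise(system.text) == normalise(gold.text)):
        return "R"
    return None

# Toy usage with invented identifiers:
gold = Response("jrc31997L0017-en", "12", "Dishwashers with 10 or more place settings ...")
run = Response("jrc31997L0017-en", "12", "Dishwashers with 10 or more place settings ...")
print(auto_judge(run, gold))  # 'R' -> counted as correct without manual assessment
```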
5.5 Tools and Infrastructure

This year, CELCT has developed a series of infrastructures to help with the management of the ResPubliQA exercise. We had to deal with many processes and requirements:
- First of all, the need to develop a proper and coherent tool for the management of the data produced during the campaign, to store it and make it re-usable, as well as to facilitate the analysis and comparison of results.
- Secondly, the necessity of assisting the different organizing groups in the various tasks of data set creation and of facilitating the process of collection and translation of questions and their assessment.
- Finally, the possibility for the participants to directly access the data, submit their own runs (this also implied some syntax checks of the format) and, later, view the detailed results and statistics.
A series of automatic web interfaces was specifically designed for these purposes, with the aim of facilitating the data processing and, at the same time, showing the users only what is important for the task they had to accomplish. The main characteristics of these interfaces are therefore their flexibility and their focus on the users' requirements.

While designing the interfaces for question collection and translation, one of the first issues to be dealt with was the fact of having many assessors, a large amount of data, and a long process. The tools must therefore ensure an efficient and consistent management of the data, allowing:
1. Editing of the data already entered, at any time.
2. Revision of the data by the users themselves.
3. Consistency propagation, ensuring that modifications automatically re-model the outputs in which they are involved.
4. Statistics and evaluation measures that are calculated and updated in real time.
In particular, ensuring the consistency of the data is a key feature of data management. For example, if a typo is corrected in the Translation Interface, the modification is automatically propagated to the GoldStandard files, the Test Set files, etc.

6. PARTICIPANTS

Out of the 24 groups who had previously registered an interest in the task, a total of 13 groups participated in the ResPubliQA 2010 tasks, in 8 different languages (German, English, Spanish, Basque, French, Italian, Portuguese and Romanian). The list of participating systems and teams, with the references to their reports, is shown in Table 2.

Table 2: Systems and teams with the reference to their reports
bpac: SZTAKI, HUNGARY (Nemeskey)
dict: Dhirubhai Ambani Institute of Information and Communication Technology, INDIA (Sabnani et al.)
elix: University of the Basque Country, SPAIN (Agirre et al.)
icia: RACAI, ROMANIA (Ion et al.)
iles: LIMSI-CNRS, FRANCE (Tannier et al.)
ju_c: Jadavpur University, INDIA (Pakray et al.)
loga: University of Koblenz, GERMANY (Glöckner and Pelzer)
nlel: U. Politecnica de Valencia, SPAIN (Correa et al.)
prib: Priberam, PORTUGAL (-)
uaic: "Al. I. Cuza" University of Iasi, Faculty of Computer Science, ROMANIA (Iftene et al.)
uc3m: Universidad Carlos III de Madrid, SPAIN (Vicente-Díez et al.)
uiir: University of Indonesia, INDONESIA (Toba et al.)
uned: UNED, SPAIN (Rodrigo et al.)

A total of 49 runs were officially submitted, considering both the PS and AS tasks: 42 submissions in the PS task and only 7 in the AS task. It is quite encouraging that, compared to last year, both the number of participating teams and the number of submissions have increased. Table 3 shows the runs submitted for each language pair, as well as the distribution between PS and AS runs.

Table 3: Tasks and corresponding numbers of submitted runs, by source language (questions) and target language (corpus and answer). In brackets, the number of PS and AS runs.
DE-DE: 4 (4,0)
EN-EN: 19 (16,3)
EN-RO: 2 (2,0)
ES-ES: 7 (6,1)
EU-EN: 2 (2,0)
FR-FR: 7 (5,2)
IT-IT: 3 (2,1)
PT-PT: 1 (1,0)
RO-RO: 4 (4,0)
Totals by source language: DE 4 (4,0); EN 21 (18,3); ES 7 (6,1); EU 2 (2,0); FR 7 (5,2); IT 3 (2,1); PT 1 (1,0); RO 4 (4,0)
Totals by target language: DE 4 (4,0); EN 21 (18,3); ES 7 (6,1); FR 7 (5,2); IT 3 (2,1); PT 1 (1,0); RO 6 (6,0); overall 49 (42,7)

As usual, the most popular language was English (with 21 submitted runs), with Spanish and French in second place (7 submissions each). Almost all runs were monolingual; only two participating teams attempted a cross-language task (EU-EN and EN-RO), producing 4 runs.

7. RESULTS

7.1 Overall Results for Paragraph Selection

The use of the same set of questions in all the languages allows, as last year, a general comparison across the different languages. Table 4 shows the c@1 value for all systems.
Systems were able to find answers for more than 70% of questions in all languages (combination row in Table 4) except Portuguese where there was only one participant. Considering all languages, 99% of questions received at least one correct answer by at least one system. This is an indication that the task is feasible for current systems. It also suggests that multi-stream systems might obtain good results. One way of obtaining this improvement could be the inclusion of the validation step to choose among the different systems (streams). Some IR based baselines were developed last year in order to compare the performance of pure IR approaches against more sophisticated QA technologies. These baselines were produced using the Okapi-BM25 ranking function [5] and are described in more detail in [4]. In this edition, the UNED group sent two similar baselines for English and Spanish and these are described in [16]. Therefore, these runs can be used for comparing QA technologies with pure IR systems in this edition. Table 4: c@1 in participating systems in the PS task according to the language System DE EN ES FR IT PT RO Combination 0.75 0.94 0.82 0.74 0.73 0.56 0.70 uiir101 0.73 dict102 0.68 bpac102 0.68 loga102 0.62 loga101 0.59 prib101 0.56 nlel101 0.49 0.65 0.56 0.55 0.63 bpac101 0.65 elix101 0.65 IR baseline (uned) 0.65 0.54 uned102 0.54 uc3m102 0.52 uc3m101 0.51 dict101 0.64 uiir102 0.64 uned101 0.63 elix102 0.62 nlel102 0.59 0.62 0.20 0.55 0.53 ju_c101 0.50 iles102 0.48 0.36 uaic102 0.46 0.24 0.55 uaic101 0.43 0.30 0.52 icia102 0.49 icia101 0.47 elix102euen 0.36 elix101euen 0.33 icia101enro 0.29 icia102enro 0.29 Although we cannot compare these results directly with those of last year, there seems to be a certain improvement in performance. Whereas the best result this year is a little higher than last year’s one (from c@1 of 0.73 in English compared to 0.68) there has been a considerable improvement in the average results, with an increase from 0.39 to 0.54 in c@1 in the monolingual PS task. EUROPARL turned out to be an easier collection than ACQUIS: 84% of all the answers by all systems over EUROPARL were correct whereas only 20% were over ACQUIS. Table 4 shows the proportion of correct answers given by all systems to each different question type. Surprisingly, Definition questions turned out to be more difficult and Reason/Purpose slightly easier than the rest of the question types. These results contradict the performance obtained in past campaigns of QA@CLEF, where a very good performance was usually obtained in Definition questions. However, in ResPubliQA, Definition questions tend to be considerably more complex than those which appeared in earlier campaigns based on newspaper articles. Table 4: Correct answers according to different question type Question type % of correct answers DEFINITION 28.64% FACTOID 46.53% REASON_PURPOSE 53.18% PROCEDURE 41.62% OPINION 42.80% OTHER 44.00% Finally, considering the UNED baselines runs we can see that once again they performed extremely well. For English only three of the seventeen runs where better than the baselines. For Spanish, only one of the five runs was better. 7.2 Results per Language in the Paragraph Selection task Tables 5-12 show the individual results by target language of each participant system at the PS task. Moreover, a combination of systems in each language is also given in these Tables. This combination represents the number of questions correctly answered by at least one system. All the results are ranked by c@1 values. 
These tables contain the following columns: • c@1: official measure at ResPubliQA 2010. • #R: number of questions correctly answered. • #W: number of questions wrongly answered. • #NoA: number of questions left unanswered. • #NoA R: number of questions unanswered where the candidate answer was Right. In this case, the system took the bad decision of leaving the question unanswered. • #NoA W: number of questions unanswered where the candidate answer was Wrong. In this case, the system took a good decision leaving the question unanswered. • #NoA empty: number of questions unanswered where no candidate answer was given. Since all the questions had an answer, these cases were considered as if the candidate answer were wrong for accuracy calculation purposes. Overall general statistics, together with test set questions and adjudicated runs are available at the RespubliQA website http://celct.isti.cnr.it/ResPubliQA/ under Past Campaigns. The best results for German were obtained by the systems that include a validation step. These systems showed a very good performance validating answers (more than 75% of the rejected answers were actually incorrect). This means that these systems are able to improve their performance in the future by trying to answer the questions they left unanswered. Table 5: Results for German in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty combination 0.75 150 50 0 0 0 0 loga102PSdede 0.62 105 59 36 2 29 5 loga101PSdede 0.59 101 65 34 2 27 5 nlel101PSdede 0.49 90 93 17 2 15 0 nlel102PSdede 0.44 88 112 0 0 0 0 The combination of English systems shows that 94% of questions were correctly answered by at least one system, which means that the task is feasible for current technologies. There are still some systems that perform worse than the IR baseline. As already discussed in the last edition, participant should care more about the correct tuning of the IR engine. Most of the systems that left some questions unanswered didn’t provide the candidate answer, so the organizers couldn’t provide feedback about the actual state of validation technologies in English. There is some evidence that more efforts should be applied to the validation step in English for improving overall results as has been shown in German. Table 6: Results for English in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty combination 0.94 188 12 0 0 0 0 uiir101PSenen 0.73 143 54 3 0 3 0 bpac102PSenen 0.68 136 64 0 0 0 0 dict102PSenen 0.68 117 52 31 17 14 0 bpac101PSenen 0.65 129 71 0 0 0 0 elix101PSenen 0.65 130 70 0 0 0 0 nlel101PSenen 0.65 128 68 4 2 2 0 IR baseline (uned) 0.65 129 71 0 0 0 0 dict101PSenen 0.64 127 73 0 0 0 0 uiir102PSenen 0.64 127 73 0 0 0 0 uned101PSenen 0.63 117 66 17 13 4 0 nlel102PSenen 0.62 122 76 2 0 2 0 elix102PSenen 0.62 123 77 0 0 0 0 ju_c101PSenen 0.50 73 52 75 0 0 75 iles102PSenen 0.48 89 95 16 0 0 16 uaic102PSenen 0.46 85 98 17 0 0 17 uaic101PSenen 0.43 78 99 23 0 0 23 elix102PSeuen 0.36 72 128 0 0 0 0 elix101PSeuen 0.33 66 134 0 0 0 0 With respect to Spanish, 80% of questions were correctly answered by at least one system. However, this combination is almost a 50% higher than the best system. Only one system (nlel101PSeses) performed better than the IR baseline. This system was able to reduce the number of incorrect answers while maintaining the same number of correct answers as the baseline. This is what allowed it to obtain a better performance according to c@1. 
Table 7: Results for Spanish in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty combination 0.82 165 35 0 0 0 0 nlel101PSeses 0.56 108 86 6 1 5 0 IR baseline (uned) 0.54 108 92 0 0 0 0 uned101PSeses 0.54 92 73 35 22 13 0 uc3m102PSeses 0.52 104 96 0 0 0 0 uc3m101PSeses 0.51 101 99 0 0 0 0 nlel102PSeses 0.20 39 161 0 0 0 0 Similar results were obtained for French and Romanian as target, where the difference between the combination row and the best system is relevant. Again, the system nlel shows that accurate validation technologies have been developed. Table 8: Results for French in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty combination 0.74 148 52 0 0 0 0 nlel101PSfrfr 0.55 105 86 9 2 7 0 nlel102PSfrfr 0.55 109 88 3 0 3 0 iles102PSfrfr 0.36 62 105 33 0 0 33 uaic101PSfrfr 0.30 54 124 22 0 0 22 uaic102PSfrfr 0.24 47 153 0 0 0 0 Table 9: Results for Romanian in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty combination 0.70 140 60 0 0 0 0 UAIC102PSroro 0.55 95 74 31 0 0 31 UAIC101PSroro 0.52 102 93 5 0 0 5 icia102PSroro 0.49 63 29 108 0 0 108 icia101PSroro 0.47 93 107 0 0 0 0 icia102PSenro 0.29 56 137 7 0 0 7 icia101PSenro 0.29 58 139 3 0 0 3 Table 10: Results for Italian in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty combination 0.73 146 54 0 0 0 0 nlel101PSitit 0.63 124 72 4 2 2 0 nlel102PSitit 0.53 105 94 1 0 1 0 Table 11: Results for Portuguese in the PS task System c@1 #R #W #NoA #NoA #NoA #NoA R W empty prib101PSptpt 0.56 111 88 1 0 0 1 7.3 Results in the Answer Selection Task Tables 12-14 show the results by language of participant systems at the AS task. The results for all the languages are given in Table 12. These tables contain similar information to the PS tables plus the following additional information: - #M: number of questions where the paragraph contained a correct answer, but the answer string given was wrong - #X: number of questions where the answer string given was judged as inexact. All runs were monolingual. Three groups (iles, ju_c and nlel ) submitted seven runs for the Answer Selection (AS) task. Each of these groups submitted one EN run. In addition, iles submitted one FR run and nlel submitted one ES, one FR and one IT run. Thus there were three attempts at EN and two at FR, allowing some comparison. For ES and IT there was only one run each. Table 12: General Results in the AS task System c@1 #R #W #M #X #NoA #NoA #NoA #NoA #NoA #NoA R W M X empty combination 0.30 60 140 0 0 0 0 0 0 0 0 ju_c101ASenen 0.26 31 12 10 8 115 0 40 24 0 75 iles101ASenen 0.09 17 124 6 44 9 0 0 0 0 9 iles101ASfrfr 0.08 14 128 7 36 15 0 0 0 0 15 nlel101ASenen 0.07 10 97 20 6 67 0 0 0 0 67 nlel101ASeses 0.06 12 138 21 1 28 0 0 0 0 28 nlel101ASitit 0.03 6 139 18 7 30 0 0 0 0 30 nlel101ASfrfr 0.02 4 132 13 11 40 0 0 0 0 40 Considering EN first, the best system by c@1 was ju_c with a score of 0.26. Interestingly, while iles scored only 0.09, it had a high number of X answers (44). Thus, iles was identifying the vicinity of answers better than ju_c but was not demarcating them exactly right. Of course, the demarcation in cases of question types like reason is not beyond debate. Finally, the third system nlel scored 0.07. 
Table 13: Results for English in the AS task System c@1 #R #W #M #X #NoA #NoA #NoA #NoA #NoA #NoA R W M X empty combination 0.24 49 151 0 0 0 0 0 0 0 0 ju_c101ASenen 0.26 31 12 10 8 115 0 40 24 0 75 iles101ASenen 0.09 17 124 6 44 9 0 0 0 0 9 nlel101ASenen 0.07 10 97 20 6 67 0 0 0 0 67 Now, turning to FR, iles scored 0.08 and nlel scored 0.02. Notice once again the large number of inexact answers for iles. Table 14: Results for French in the AS task System c@1 #R #W #M #X #NoA #NoA #NoA #NoA #NoA #NoA R W M X empty combination 0.8 17 183 0 0 0 0 0 0 0 0 iles101ASfrfr 0.08 14 128 7 36 15 0 0 0 0 15 nlel101ASfrfr 0.02 4 132 13 11 40 0 0 0 0 40 Notice that these figures are all very low compared to traditional factoid QA where figures of 0.8 can be obtained. We can attribute this to the inclusion of difficult question types which go beyond the factoid concept with its dependence on the Named Entity concept. Recall that the breakdown of questions was 40 factoids and 32 definitions, with 32 each of opinion, procedure, reason-purpose and "other" questions. Thus 64% of questions fell into the latter four "difficult" types. Another consideration is the effect of allowing systems to mark questions as unanswered even though they had in fact answered them. Only in the case of EN and the ju_c run was there any loss of score incurred by not answering. For ju_c, 24 unanswered questions had a missed answer, i.e. ju_c identified the correct paragraph containing the exact answer, but was not able to demarcate it. For all the other EN runs (and indeed all the other runs), unanswered questions had an empty answer, so nothing can be said about how close these other systems were to getting the right answer in the case of unanswered questions. 8. SYSTEM DESCRIPTIONS A summary of the techniques reported by participants is shown in Table 15. Most of the systems that analyze the questions use manually built patterns. Regarding the IR model, BM25 has been applied by almost half of the participants that reported the retrieval model used. The other reported models were the standard ones supplied by Lucene. Table 15: Methods used by participating systems Retrieval Linguistic Unit which Question Analyses Model is indexed done Patterns automatically Syste No Question Manually Analyses acquired N-grams Lemmas Chunks/ patterns phrases m Words Other Stems name Lemmatization, POS tagging and *very* minimal bpac pattern usage Okapi BM25 x x dict X x lemmatization, part of speech elix tagging BM25 x Lucene Boolean Search icia X Engine x iles X x standard lucene model; loga X word senses are indexed x Distance Density N- nlel X gram Model , BM25 x x ju_c X Apache Lucene x x x x prib X x x x x uaic X LUCENE x uc3m X Passage IR x uiir X x uned X BM25 x A summary of Answer Extraction techniques in the AS task is given in Table 16. The most common processing was the use of named entities, numeric and temporal expressions, while some systems relied on syntactic processing by means of chunking, dependency parsing or syntactic transformations. Table 16: Methods used by systems for extracting answers Answer Extraction – Further processing Numerical expressions Temporal expressions Dependency analysis Logic representation Theorem prover transformations Named Entity Recognition Chunking Syntactic n-grams Other None System name bpac x dict x x x x x elix x icia x x iles x x x x loga x x x x nlel x ju_c x prib x x x uaic x x x x uc3m x uiir x uned x x x The validation of answers was applied by 9 of the 13 participants. 
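As discussed in the next paragraph, most of the validation components reported by participants rely on lexical overlap between the question and the candidate paragraph, combined with the decision to answer NoA when confidence is low. Purely as an illustration of that idea, here is a minimal, hypothetical sketch; the tokenisation, the stopword list, the overlap formula and the 0.3 threshold are our own assumptions and are not taken from any participant's system:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "for", "on", "what", "which", "how", "why"}

def tokens(text: str) -> set[str]:
    """Lowercased word tokens minus a small stopword list (illustrative only)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def overlap(question: str, paragraph: str) -> float:
    """Fraction of (non-stopword) question terms that also occur in the paragraph."""
    q, p = tokens(question), tokens(paragraph)
    return len(q & p) / len(q) if q else 0.0

def validate(question: str, paragraph: str, threshold: float = 0.3):
    """Return the candidate paragraph if it passes the overlap check, else 'NoA'.
    Leaving low-confidence questions unanswered is what c@1 rewards (Section 5.1)."""
    return paragraph if overlap(question, paragraph) >= threshold else "NoA"

q = "What is the maximum efficiency index for a ten-place dishwasher?"
good = "Dishwashers with 10 or more place settings shall have an energy efficiency index lower than 0,58 ..."
bad = "The Council reaffirmed the need to adopt common measures on the retention of telecommunications data."
print(validate(q, good) == good)  # True  -> answer returned
print(validate(q, bad))           # 'NoA' -> question left unanswered
```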
According to Table 17, which shows the different validation techniques applied by participants, the most common processing was to measure the lexical overlap between questions and candidate answers (it was performed by 5 participants). On the other hand, more complex techniques such as syntactic similarity or theorem proving were applied by very few participants. These observations differ from those of last year, when participants applied more techniques and performed more complex analyses, such as semantic similarity or the combination of different classifiers. That is, participants at ResPubliQA 2010 relied on naive techniques for performing validation. However, experience from the last years shows that the validation step can improve results if it is performed carefully; otherwise, the effect is the opposite, harming the results.

Table 17: Techniques used for the Answer Validation component (columns: Lexical similarity (term overlapping), Syntactic similarity, Redundancies in the collection, Machine Learning, Theorem proving or similar, Other, No answer validation)
bpac x
dict x
elix x
icia x
iles x x
loga x x x
nlel x
ju_c x x
prib x
uaic x x
uc3m x
uiir x
uned x

9. CONCLUSIONS

An important result demonstrated in 2009 was that a good IR system can be better than a QA system if the IR parameters are carefully tuned to the requirements of the domain. A considerable proportion of participants have already moved towards better IR models, although there are still many systems that do not outperform the IR baselines. While the Paragraph Selection task is just paragraph retrieval, the main difference from pure IR systems is the addition of the decision to leave a question unanswered, that is, the validation step. The best performing systems in German, Spanish and French include accurate validation steps. The PS task allows the inclusion of more complex questions, as well as their evaluation in a simple and natural way. However, when the aim is to extract an exact answer (as in the AS task), it turns out to be very difficult for systems to perform well, except where answers are named entities. This is because NE recognition is a well-studied and largely solved problem. On the other hand, "exact" answer demarcation for more complex queries against documents such as those used in ResPubliQA needs further study by both system designers and evaluation task organizers.

ACKNOWLEDGMENTS

Our thanks are due to all the annotators who took care of the translations of the questions into the different languages and of the evaluation of the respective submitted runs:
- Anna Kampchen, Julia Kramme (University of Hagen, Hagen, Germany), and Stefan Rigo (FBK, Trento, Italy) for the German language
- Iñaki Alegria (University of the Basque Country, Spain) for the Basque language
- Patricia Hernandez Llodra (UNED, Madrid, Spain) for the French language
- Rosário Silva (Linguateca/FCCN, Portugal) for the Portuguese language
Special thanks are also due to Giovanni Moretti (CELCT, Trento, Italy) for the technical support in the management of all the data of the campaign. This work has been partially supported by the Regional Government of Madrid under the Research Network MA2VICMR (S-2009/TIC-1542), the Education Council of the Regional Government of Madrid and the European Social Fund.

REFERENCES

1. Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova. Overview of ResPubliQA 2009: Question Answering Evaluation over European Legislation. In C. Peters, G.
di Nunzio, M. Kurimo, Th. Mandl, D. Mostefa, A. Peñas, G. Roda (Eds.), Multilingual Information Access Evaluation Vol. I Text Retrieval Experiments, Workshop of the Cross- Language Evaluation Forum, CLEF 2009, Corfu, Greece, 30 September - 2 October. Revised Selected Papers. (to be published) 2. Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo. Overview of the Answer Validation Exercise 2007. In C. Peters, V. Jijkoun, Th. Mandl, H. Müller, D.W. Oard, A. Peñas, V. Petras, and D. Santos, (Eds.): Advances in Multilingual and Multimodal Information Retrieval, LNCS 5152, September 2008. 3. Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo. Overview of the Answer Validation Exercise 2006. In C. Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, M. Stempfhuber (Eds.): Evaluation of Multilingual and Multi-modal Information Retrieval, 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20-22, 2006, Revised Selected Papers. 4. Joaquín Pérez, Guillermo Garrido, Álvaro Rodrigo, Lourdes Araujo and Anselmo Peñas. Information Retrieval Baselines for the ResPubliQA Task. CLEF 2009, LNCS 6241. 5. Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo. Overview of the Answer Validation Exercise 2008. In C. Peters, Th. Mandl, V. Petras, A. Peñas, H. Müller, D. Oard, V. Jijkoun, D. Santos (Eds), Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers. 6. Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval (1994), pp. 232-241. 7. Xavier Tannier and Véronique Moriceau. FIDJI @ ResPubliQA 2010. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 8. Hitesh Sabnani and Prasenjit Majumder. Question Answering System: Retrieving relevant passages. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 9. Partha Pakray, Pinaki Bhaskar, Santanu Pal, Dipankar Das, Sivaji Bandyopadhyay and Alexander Gelbukh. JU_CSE_TE: System Description QA@CLEF 2010. Notebook Paper for the CLEF 2010 LABs Workshop, 22- 23 September, Padua, Italy 10. Eneko Agirre, Olatz Ansa, Xabier Arregi, Maddalen Lopez de Lacalle, Arantxa Otegi and Xabier Saralegi. Document Expansion for Cross-Lingual Passage Retrieval. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 11. Hapnes Toba and Mirna Adriani. Contextual Approach for Paragraph Selection in Question Answering Task. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 12. Adrian Iftene, Diana TRANDABAT, Alex Moruz and Maria HUSARCIUC. Question Answering on Romanian, English and French Languages. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 13. David M. Nemeskey. SZTAKI @ ResPubliQA 2010. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 14. María Teresa Vicente-Díez, Julián Moreno-Schneider and Paloma Martínez. Temporal information needs in ResPubliQA: an attempt to improve accuracy. The UC3M Participation at CLEF 2010. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 15. Ingo Glöckner and Björn Pelzer. The LogAnswer Project at ResPubliQA 2010. 
Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 16. Álvaro Rodrigo, Joaquin Perez-Iglesias, Anselmo Peñas, Guillermo Garrido and Lourdes Araujo. A Question Answering System based on Information Retrieval and Validation. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 17. Radu Ion, Alexandru Ceausu, Dan Ştefănescu, Dan Tufis, Elena Irimia and Verginica Barbu Mititelu. Monolingual and Multilingual Question Answering on European Legislation. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy 18. Santiago Correa, Davide Buscaldi and Paolo Rosso. NLEL at RespubliQA 2010. Notebook Paper for the CLEF 2010 LABs Workshop, 22-23 September, Padua, Italy