                         Overview of WiQA 2006
                              Valentin Jijkoun    Maarten de Rijke
                                 ISLA, University of Amsterdam
                                 jijkoun,mdr@science.uva.nl


                                             Abstract
     We describe WiQA 2006, a pilot task aimed at studying question answering using
     Wikipedia. Going beyond traditional factoid questions, the task considered at WiQA
     2006 was, given a source page from Wikipedia, to identify snippets from other
     Wikipedia pages, possibly in languages different from the language of the source
     page, that add new and important information to the source page, and that do so
     without repetition.
         A total of 7 teams took part, submitting 20 runs. Our main findings are twofold:
     (i) while challenging, the tasks considered at WiQA are feasible, as participants
     achieved impressive scores measured in terms of yield, mean reciprocal rank, and
     precision; and (ii) substantially higher scores were achieved on the bilingual task
     than on the monolingual tasks.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database
Management]: Languages—Query Languages

General Terms
Measurement, Performance, Experimentation

Keywords
Question answering, Questions beyond factoids, Wikipedia


1    Introduction
CLEF 2006 featured a pilot on Question Answering Using Wikipedia, or WiQA [3] for short. The
idea to organize a pilot track on QA using Wikipedia builds on several motivations: First, tradi-
tionally, people turn to reference works to get answers to their questions. Wikipedia has become
one of the largest reference works ever, making it a natural target for question answering systems.
Moreover, Wikipedia is a rich mixture of text, link structure, navigational aids, categories, and so on,
making it extremely appealing for text mining and link analysis work. And finally, Wikipedia
is simply a great resource. It is something we want to work with, and contribute to, both by
facilitating access to it, and, as the distinction between readers and authors has become blurred,
by creating tools to support the authoring process.
    In this overview we first provide a description of the tasks considered and of the evaluation and
assessment procedures (Section 2). After that we describe the runs submitted by the participants
(Section 3) and detail the results (Section 4). We end with some preliminary conclusions (Section 5).
2     The Question Answering Task
The WiQA 2006 task deals with access to Wikipedia’s content, where access is considered both
from a reader point of view and from an author point of view.

2.1    Tasks
As our user model we take the following scenario: a reader or author of a given Wikipedia article
(the source page) is interested in collecting information about the topic of the page that is not yet
included in the text, but is relevant and important for the topic, so that it can be used to update
the content of the source article. Although the source page is in a specific language (the source
language), the reader or author would also be interested in finding information in other languages
(the target languages) that he explicitly specifies.
    With this user scenario, the task of an automatic system is to locate information snippets in
Wikipedia which are:
    • outside the given source page,
    • in one of the specified target languages,
    • substantially new w.r.t. the information contained in the source page, and important for
      the topic of the source page; in other words, worth including in (future editions of) the
      page.
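Of the three constraints above, the first two are mechanical checks, while novelty and importance
are precisely what participating systems (and, ultimately, the assessors) have to judge. As a
minimal sketch of the mechanical part only, assuming a hypothetical snippet record of our own
devising (the field names below are not prescribed by the task):

    from dataclasses import dataclass
    from typing import Set

    @dataclass
    class Snippet:
        """A candidate snippet (hypothetical representation)."""
        page_id: str    # Wikipedia article the snippet is taken from
        language: str   # language code of that article, e.g. "en", "nl", "es"
        text: str       # at most two consecutive sentences from the article

    def satisfies_hard_constraints(snippet: Snippet, source_page_id: str,
                                   target_languages: Set[str]) -> bool:
        # The snippet must come from a page other than the source page and be
        # written in one of the requested target languages; novelty and
        # importance cannot be verified mechanically.
        return (snippet.page_id != source_page_id
                and snippet.language in target_languages)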
Participants of the WiQA 2006 pilot could take part in two flavors of the task: a monolingual
one (where the snippets to be returned are in the language of the source page) and a multilingual
one (where the snippets to be returned can be in any of the languages of the Wikipedia corpus
used at WiQA).

2.2    Document Collections
The corpus used at WiQA 2006 consists of XML-ified dumps of Wikipedia in three languages:
Dutch, English, and Spanish. The dumps are based on the XML version of the Wikipedia col-
lections [1] that include the annotation of the structure of the articles, links between articles,
categories, cross-lingual links, etc. For the WiQA 2006 pilot the collections were enriched with
annotations of sentences and classification of pages into named entity classes (person, location,
organization).
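As an illustration, a system might read one of these enriched articles with a standard XML parser
along the following lines. This is only a sketch: the element and attribute names used below (an
article root element, name, sentence, and a class attribute carrying the named entity label) are
assumptions made for illustration, not the actual schema of the WiQA collections.

    import xml.etree.ElementTree as ET

    def load_article(path: str) -> dict:
        # Parse one article file and collect the annotations a WiQA system needs:
        # the title, the named entity class of the page, and the pre-split sentences.
        tree = ET.parse(path)
        root = tree.getroot()
        title = root.findtext("name", default="")
        entity_class = root.get("class", "NONE")   # e.g. PERSON / LOCATION / ORGANIZATION
        sentences = [s.text or "" for s in root.iter("sentence")]
        return {"title": title, "class": entity_class, "sentences": sentences}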

2.3    Topics
For each of the three WiQA 2006 languages (Dutch, English, and Spanish) a set of 50 topics
correctly tagged as PERSON, LOCATION or ORGANIZATION in the XML data collections was
released, together with other topics, announced as optional. These optional topics either did not
fall into these three categories, or were not tagged correctly in the XML collections. The optional
topics could be ignored by systems without penalty. In fact, the submitted runs provided responses
for optional topics as well as for the main topics.
    When selecting Wikipedia articles as topics, we included articles explicitly marked as stubs,
as well as other short and long articles.
    In order to create the topics for the English-Dutch bilingual task, 30 topics were selected
from the English monolingual topic set and 30 topics from the Dutch monolingual topic set. The
bilingual topics were selected so that the corresponding articles are present in Wikipedias for both
languages.
    In addition to the test topics, a set of 80 (English language) development topics was released.
2.4    Evaluation
Given a source page, automatic systems return a list of short snippets, defined as sequences of at
most two sentences from a Wikipedia page. The ranked list of snippets for each topic was manually
assessed using the following binary criteria, largely inspired by the TREC 2003 Novelty track [2]:
   • support: the snippet does indeed come from the specified target Wikipedia article.
   • importance: the information of the snippet is relevant to the topic of the source Wikipedia
     article, is in one of the target languages as specified in the topic, and is already present on
     the page (directly or indirectly) or is interesting and important enough to be included in an
     updated version of the page.
   • novelty: the information content of the snippet is not subsumed by the information on the
     source page.
   • non-repetition: the information content of the snippet is not subsumed by the target snippets
     higher in the ranking for the given topic.
Note that we distinguish between novelty (subsumption by the source page) and non-repetition
(subsumption by higher-ranked snippets) so that the results of the assessment remain reusable
for automatic system evaluation in the future: novelty only takes the source page and the
snippet into account, while non-repetition is defined on a ranked list of snippets.
    One of the purposes of the WiQA pilot task was to experiment with different measures for
evaluating the performance of systems. WiQA 2006 used the following simple principal measure
for assessing the performance of the systems:
   • yield : the average (per topic) number of supported, novel, non-repetitive, important target
     snippets.
We also considered other simple measures:
   • mean reciprocal rank of the first supported, important, novel, non-repeated snippet, and
   • overall precision: the percentage of supported, novel, non-repetitive, important snippets
     among all submitted snippets.
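To make the measures concrete, the following sketch (our own illustration, not the official WiQA
scoring code; the names JudgedSnippet and evaluate_run are ours) derives total yield, average
yield per responding topic, MRR, and precision from the four binary assessments of ranked snippet
lists; it assumes each list has already been truncated to the assessed cut-off and that at least
one topic has a response.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class JudgedSnippet:
        supported: bool
        important: bool
        novel: bool
        repeated: bool

        @property
        def perfect(self) -> bool:
            # A "perfect" snippet is supported, important, novel and not repeated.
            return self.supported and self.important and self.novel and not self.repeated

    def evaluate_run(run: Dict[str, List[JudgedSnippet]]) -> Dict[str, float]:
        # `run` maps a topic id to its ranked, assessed snippet list; topics
        # without any response are ignored, as in the average-yield definition.
        topics = [snippets for snippets in run.values() if snippets]
        total_perfect = sum(s.perfect for snippets in topics for s in snippets)
        total_submitted = sum(len(snippets) for snippets in topics)
        reciprocal_ranks = []
        for snippets in topics:
            rr = 0.0
            for rank, s in enumerate(snippets, start=1):
                if s.perfect:
                    rr = 1.0 / rank
                    break
            reciprocal_ranks.append(rr)
        return {
            "total_yield": float(total_perfect),
            "average_yield": total_perfect / len(topics),
            "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
            "precision": total_perfect / total_submitted,
        }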

2.5    Assessment
To establish the ground truth, an assessment environment was developed by the track organizers.
Assessors were given the following guidelines. For each system and each source article P, the
ordered list of the returned snippets was manually assessed with respect to importance, novelty,
and non-repetition, following the procedure below:
  1. Each snippet was marked as supported or not. To reduce the workload on the assessors,
     this aspect was checked automatically. Hence, unsupported snippets were excluded from the
     subsequent assessment.
  2. Each snippet was marked as important or not, with respect to the topic of the source article.
     A snippet is important if it contains information that a user of Wikipedia would like to
     see in P or that an author would consider worth including in P. Snippets were assessed for
     importance independently of each other and regardless of whether the important information
     was already present in P (in particular, the presence of some information in P does not
     necessarily imply its importance).
  3. Each important snippet was marked as novel or not. It was to be considered novel if the
     important information in the snippet is substantially new with respect to the content of P.
  4. Each important and novel snippet was marked as repeated or non-repeated, with respect to
     the important snippets higher in the ranked list of snippets.
Figure 1: Assessment interface; first three snippets of a system’s response for topic wiqa06-en-39.


Following this procedure, snippets were assessed along four axes (support, importance, novelty,
non-repetition). Assessors were not required to judge novelty and non-repetition of snippets that
were considered not important for the topic of the source article, in order to avoid spending
time on assessing irrelevant information. Assessors provided assessments for the top 20 snippets
of each result list returned. Figure 1 contains a screenshot of the assessment interface.
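To make the cascade explicit, the following sketch (our own formulation, not the code of the
assessment tool) shows how the step-wise judgments translate into the four binary flags used by
the measures of Section 2.4; judgments that were skipped because an earlier step already failed
simply default to negative.

    from typing import Optional

    def cascade(supported: bool,
                important: Optional[bool],
                novel: Optional[bool],
                repeated: Optional[bool]) -> dict:
        # Later judgments are only meaningful when the earlier ones hold:
        # unsupported snippets are filtered out automatically, and novelty and
        # repetition are only judged for important (and novel) snippets.
        if not supported:
            return {"supported": False, "important": False, "novel": False, "repeated": False}
        if not important:
            return {"supported": True, "important": False, "novel": False, "repeated": False}
        if not novel:
            return {"supported": True, "important": True, "novel": False, "repeated": False}
        return {"supported": True, "important": True, "novel": True, "repeated": bool(repeated)}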
    A total of 14,203 snippets had to be assessed; the number of unique snippets assessed was
4,959. Of these, 3,396 were assessed by at least two assessors.
    The results of the assessments for all submitted runs (anonymized) will be made available to
all the participants for further analysis and experiments.

2.6    Submission
For each task (three monolingual and one bilingual), participating teams were allowed to submit
up to three runs. For each topic of a run, the top 20 submitted snippets were manually assessed
as described above.


3     Submitted Runs
Table 1 lists the runs submitted to WiQA 2006: 19 for the monolingual tasks (Dutch: 3, English:
12, Spanish: 4) and 1 for the bilingual task (English-Dutch). In Table 2 we present the aggregate
results of the assessment of the runs submitted to WiQA 2006. Columns 3–7 show the following
aggregate numbers: total number of snippets (with at most 20 snippets considered per response for
                      Table 1: Summary of runs submitted to WiQA 2006
 Group                           Run name          Description
 English monolingual task
 LexiClone Inc.                  lexiclone         Lexical Cloning method
 Universidad Politécnica        rfia-bow-en       simple “bag of words” submission
 de València
 University of Alicante          UA-DLSI-1         Near phrase
                                 UA-DLSI-2         Near phrase temporal
 University of Essex/Limerick dltg061              Limit of ten snippets per topic
                                 dltg062           Limit of twenty snippets per topic
 University of Amsterdam         uams-linkret-en   Cross-links and IR for snippet ranking
                                 uams-link-en      Only cross-links for snippet ranking
                                 uams-ret-en       Only IR for snippet ranking
 University of Wolverhampton WLV-one-old           No coreference, link analysis
                                 WLV-two           Coreference
                                 WLV-one           no coreference, version 2
 Spanish monolingual task
 Universidad Politécnica        rfia-bow-es       simple “bag of words” submission
 de València
 University of Alicante          UA-DLSI-es        Near phrase
 Daedalus consortium             mira-IS-CN-N      InLink sentence retrieval, rank by novelty
                                 mira-IP-CN-CN     InLink passage retrieval,
                                                   combine cosine and novelty in ranking,
                                                   no threshold
 Dutch monolingual task
 University of Amsterdam         uams-linkret-nl   Cross-links and IR for snippet ranking
                                 uams-link-nl      Only cross-links for snippet ranking
                                 uams-ret-nl       Only IR for snippet ranking
 English-Dutch bilingual task
 University of Amsterdam         uams-linkret-ennl Cross-links and IR for snippet ranking


a topic); total number of supported snippets; total number of important supported snippets; total
number of novel and important supported snippets; and the total number of novel and important
supported snippets without repetition.
    The results indicate that the task of detecting important snippets is a hard one: for most
submissions, only 50–60% of the returned snippets are judged as important. The performance of
the systems at detecting novel snippets is substantially higher: between 50% and 80% of the
returned important snippets are judged as novel with respect to the topic article.


4    Results
Table 3 shows the evaluation results for the submitted runs: total yield (for a run, the total number
of “perfect” snippets, i.e., supported, important, novel and not repeated), the average yield per
topic (only topics with at least one response are considered), the mean reciprocal rank of the first
“perfect” snippet and the precision of the systems’ responses.
    Clearly, most systems cope well with the pilot task: up to one third of the returned snippets
are assessed as “perfect” for the English and Spanish monolingual tasks, and up to one half for the
Dutch monolingual and the English-Dutch bilingual tasks. As might be expected, the relative ranking
of the submitted runs differs across evaluation measures: as in many complex tasks,
the best yield (a recall-oriented measure) does not necessarily lead to the best precision and vice
versa.
                    Table 2: Results of the assessment of the submitted runs.
          Run name              Number of topics   Aggregate numbers of snippets
                                with response      (with at most 20 snippets considered per topic):
                                                   total  supp  supp   supp+imp  supp+imp+novel
                                                                +imp   +novel    +non-rep
          English monolingual task: 65 topics
          lexiclone                   38              684    676     179      98   79
          rfia-bow-en                 65              607    607     255    187   173
          UA-DLSI-1                   64              572    571     277    204   191
          UA-DLSI-2                   60              489    488     239    173   161
          dltg061                     65              435    435     226    165   161
          dltg062                     65              682    682     310    223   194
          uams-linkret-en             65              570    570     331    202   191
          uams-link-en                65              615    614     353    232   220
          uams-ret-en                 65              580    580     325    203   193
          WLV-one-old                 61              473    473     263    219   142
          WLV-two                     61              526    526     327    280   135
          WLV-one                     61              473    472     267    221   135
          Spanish monolingual task: 67 topics
          rfia-bow-es                 62              497    497     198    142   113
          UA-DLSI-es                  63              501    501     184    149   111
          mira-IS-CN-N                67              251    251     127      79   69
          mira-IP-CN-CN               67              431    431     155      95   71
          Dutch monolingual task: 60 topics
          uams-linkret-nl             60              425    425     301    228   210
          uams-link-nl                60              455    455     305    236   228
          uams-ret-nl                 60              450    450     271    206   192
          English-Dutch bilingual task: 60 topics
          uams-linkret-ennl           60              564    551     456    342   302


    An interesting aspect of the results is that the performance of the systems differs substantially
across the four tasks. This may be due to the fact that the submissions for the different tasks were
assessed by different assessors (native speakers of the corresponding languages), as well as to
differences in the sizes and structures of the Wikipedias in these languages. It is worth pointing
out that the highest scores were achieved on the English-Dutch bilingual task; this may suggest
that different language versions of Wikipedia do indeed present different material on a given topic.
    Finally, a more detailed analysis of this issue, as well as an analysis of the inter-annotator
agreement, will be presented by the time of the CLEF workshop.


5    Conclusion
We have described the first installment of the WiQA—Question Answering Using Wikipedia—
task. Set up as an attempt to take question answering beyond the traditional factoid format and
to one of the most interesting knowledge sources currently available, WiQA had 8 participants
who submitted a total of 20 runs for 4 tasks. The results of the pilot are very encouraging. While
challenging, the task turned out to be feasible, and several participants managed to achieve
impressive yield, MRR, and precision scores. Surprisingly, the highest scores were achieved on the
bilingual task.
Table 3: Evaluation results for the submitted runs (calculated for top 10 snippets per topic);
highest scores per task are given in boldface.
     Run name             Number of topics Total yield Average yield MRR Precision
                           with response                   per topic
     English monolingual task: 65 topics
     lexiclone                   38             58           1.53        0.31     0.21
     rfia-bow-en                 65            173           2.66        0.48     0.29
     UA-DLSI-1                   64            191           2.98        0.53     0.33
     UA-DLSI-2                   60            158           2.63        0.52     0.32
     dltg061                     65            160           2.46        0.54     0.37
     dltg062                     65            152           2.34        0.50     0.33
     uams-linkret-en             65            188           2.89        0.52     0.33
     uams-link-en                65            220           3.38        0.58     0.36
     uams-ret-en                 65            191           2.94        0.52     0.33
     WLV-one-old                 61            142           2.33        0.58     0.30
     WLV-two                     61            135           2.21       0.59      0.26
     WLV-one                     61            135           2.21        0.58     0.29
     Spanish monolingual task: 67 topics
     rfia-bow-es                 62            113           1.82       0.37      0.23
     UA-DLSI-es                  63            111           1.76        0.36     0.22
     mira-IS-CN-N                67             69           1.03        0.30     0.27
     mira-IP-CN-CN               67             71           1.06        0.29     0.16
     Dutch monolingual task: 60 topics
     uams-linkret-nl             60            210           3.50       0.53      0.49
     uams-link-nl                60            228           3.80       0.53      0.50
     uams-ret-nl                 60            192           3.20        0.45     0.42
     English-Dutch bilingual task: 60 topics
     uams-linkret-ennl           60            302           5.03       0.52      0.54


    As to the future of WiQA, as pointed out before, we aim to take a close look at our assessments,
perhaps add new assessments, and analyze inter-assessor agreement along various dimensions. The
WiQA 2006 pilot has shown that it is possible to set up tractable yet challenging information access
tasks involving the multilingual Wikipedia corpus—but this was only a first step. In the future
we would like to consider additional information access scenarios, all centered around Wikipedia.


6    Acknowledgments
We are very grateful to the following people and organizations for helping us with the assessments:
José Luis Martínez Fernández and César de Pablo from the Daedalus consortium; Silke Scheible
and Bonnie Webber at the University of Edinburgh; Udo Kruschwitz and Richard Sutcliffe at the
University of Essex; and Bouke Huurnink and Maarten de Rijke at the University of Amsterdam.
    Valentin Jijkoun was supported by the Netherlands Organisation for Scientific Research (NWO)
under project numbers 220-80-001, 600.065.120 and 612.000.106. Maarten de Rijke was supported
by NWO under project numbers 017.001.190, 220-80-001, 264-70-050, 354-20-005, 600.065.120,
612-13-001, 612.000.106, 612.066.302, 612.069.006, 640.001.501, 640.002.501, and by the E.U. IST
programme of the 6th FP for RTD under project MultiMATCH contract IST-033104.


References
[1] Ludovic Denoyer and Patrick Gallinari. The Wikipedia XML Corpus. SIGIR Forum, 2006.
[2] Ian Soboroff and Donna Harman. Overview of the TREC 2003 Novelty track. In Proceedings
    of the Twelfth Text REtrieval Conference (TREC 2003), pages 38–53. NIST, 2003.
[3] WiQA: Question Answering Using Wikipedia, 2006. URL: http://ilps.science.uva.nl/WiQA/.