=Paper=
{{Paper
|id=Vol-1175/CLEF2009wn-QACLEF-TurmoEt2009
|storemode=property
|title=Overview of QAST 2009
|pdfUrl=https://ceur-ws.org/Vol-1175/CLEF2009wn-QACLEF-TurmoEt2009.pdf
|volume=Vol-1175
|dblpUrl=https://dblp.org/rec/conf/clef/TurmoCRGMMRB09a
}}
==Overview of QAST 2009==
J. Turmo (1), P.R. Comas (1), S. Rosset (2), O. Galibert (2), N. Moreau (3), D. Mostefa (3), P. Rosso (4) and D. Buscaldi (4)

(1) TALP Research Centre (UPC), Barcelona, Spain. {turmo,pcomas}@lsi.upc.edu
(2) LIMSI, Paris, France. {rosset,olivier.galibert}@limsi.fr
(3) ELDA/ELRA, Paris, France. {moreau,mostefa}@elda.org
(4) NLE Lab. - ELiRF Research Group (UPV), Spain. {prosso,dbuscaldi}@dsic.upv.es

Abstract

This paper describes the experience of QAST 2009, the third edition of a CLEF pilot track aiming to evaluate the task of Question Answering in Speech Transcripts. Four sites submitted results for at least one of the three scenarios (European Parliament debates in English and Spanish, and broadcast news in French). In order to assess the impact of potential errors of automatic speech recognition, manual transcripts and three different ASR outputs were provided for each task. In addition, an original method of question creation was tried in order to obtain spontaneous oral questions, resulting in two sets of questions (spoken and written). Each participant who had chosen a task was asked to submit a run for each condition. The QAST 2009 evaluation framework is described, along with descriptions of the three scenarios and their associated data, the system submissions for this pilot track and the official evaluation results.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Experimentation, Performance, Measurement

Keywords

Question Answering, Spontaneous Speech Transcripts

1 Introduction

Question Answering (QA) technology aims at providing answers to natural language questions. Current QA technology is focused mainly on mining written text sources to extract the answer to written questions from both open-domain and restricted-domain document collections [7, 3].
However, most human interaction occurs through speech, e.g. meetings, seminars, lectures and telephone conversations. All these scenarios provide large amounts of information that could be mined by QA systems. As a consequence, the exploitation of speech sources brings QA a step closer to many real-world applications in which spontaneous oral questions or written questions can be involved. The QAST 2009 track aims at investigating the problem of answering spontaneous oral questions and written questions using audio documents.

Current text-based QA systems tend to use technologies that require text written in accordance with standard norms of written grammar. The syntax of speech is quite different from that of written language, with more local but less constrained relations between phrases, and punctuation, which gives boundary cues in written language, is typically absent. Speech also contains disfluencies, repetitions, restarts and corrections. Moreover, any practical application of search in speech requires the transcriptions to be produced automatically, and Automatic Speech Recognizers (ASR) introduce a number of errors. Therefore, current techniques for text-based QA need substantial adaptation in order to access the information contained in audio documents, and probably to analyse oral questions.

Preliminary research on QA in speech transcriptions was addressed in QAST 2007 and QAST 2008, pilot evaluation tracks at CLEF in which systems attempted to provide answers to written factual and definitional questions by mining speech transcripts of different scenarios [5, 6].

This paper provides an overview of the third QAST pilot evaluation. Section 2 describes the principles of this evaluation track. Sections 3 and 4 present the evaluation framework and the participating systems, respectively. Section 5 reports and discusses the achieved results, followed by some conclusions in Section 6.
2 The QAST 2009 task

The aim of this third year of QAST is to provide a framework in which QA systems can be evaluated in a real scenario, where the answers to both spontaneous oral questions and written questions have to be extracted from speech transcriptions, these transcriptions being manually and automatically generated. There are five main objectives to this evaluation:

• Motivating and driving the design of novel and robust QA architectures for speech transcripts;
• Measuring the loss due to the inaccuracies in state-of-the-art ASR technology;
• Measuring this loss at different ASR performance levels given by the ASR word error rate;
• Measuring the loss when dealing with spontaneous oral questions;
• Motivating the development of monolingual QA systems for languages other than English.

In the 2009 evaluation, as in 2008, an answer is structured as a simple [answer string, document id] pair, where the answer string contains nothing more than the full and exact answer, and the document id is the unique identifier of the document supporting the answer. For the tasks on automatic speech transcripts, the answer string consists of the start time and the end time giving the position of the answer in the signal.

Spontaneous oral question: When did the bombing of Fallujah eee took take place?
Written question: When did the bombing of Fallujah take place?
Manual transcript: (%hesitation) a week ago President the American (%hesitation) occupation forces (%hesitation) m() m() m() marched into Fallujah and they (%hesitation) bombarded (%hesitation) m() murdered and have been persecuting everyone in the city .
Answer: a week ago

Extracted portion of an automatic transcript (CTM file format):
(...)
20041115 1705 1735 EN SAT 1 1081.588 0.050 a 0.9595
20041115 1705 1735 EN SAT 1 1081.638 0.190 week 0.9744
20041115 1705 1735 EN SAT 1 1081.828 0.350 ago 0.9743
20041115 1705 1735 EN SAT 1 1082.338 0.630 President 0.9576
20041115 1705 1735 EN SAT 1 1083.648 0.310 the 0.9732
20041115 1705 1735 EN SAT 1 1084.008 0.710 American 0.9739
20041115 1705 1735 EN SAT 1 1085.078 0.450 occupation 0.9739
20041115 1705 1735 EN SAT 1 1085.528 0.640 forces 0.9741
20041115 1705 1735 EN SAT 1 1086.858 1.730 and 0.9742
20041115 1705 1735 EN SAT 1 1089.098 0.170 we 0.6274
20041115 1705 1735 EN SAT 1 1089.308 0.480 must 0.9571
20041115 1705 1735 EN SAT 1 1089.948 0.300 into 0.9284
20041115 1705 1735 EN SAT 1 1090.368 0.130 for 0.3609
20041115 1705 1735 EN SAT 1 1090.498 0.130 the 0.3609
20041115 1705 1735 EN SAT 1 1090.698 0.240 Chair 0.2233
20041115 1705 1735 EN SAT 1 1091.678 0.600 and 0.9755
20041115 1705 1735 EN SAT 1 1092.798 0.400 they 0.9686
20041115 1705 1735 EN SAT 1 1093.598 0.530 bombarded 0.8314
(...)
Answer: 1019.228 1019.858

Figure 1: Example query and response from manual (top) and automatic (bottom) transcripts.

Figure 1 illustrates this point. Given the manually transcribed spontaneous oral question "When did the bombing of Fallujah eee took take place?", corresponding to the written question "When did the bombing of Fallujah take place?", the figure compares the expected answer in a manual transcript (the text "a week ago") and in an automatic transcript (the time segment 1081.588 1082.178). Note that Fallujah was wrongly recognized as "for the Chair" by the ASR. A system can provide up to 5 ranked answers per question.
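The time-segment answers above can be recovered from the word timings in the CTM output. The following sketch parses CTM-style lines like those in Figure 1 and returns the words covered by an answer segment; the field layout (file id, channel, start time, duration, word, confidence) is assumed from the example, and fields are read from the end of the line so a file id containing spaces, as in the extracted text, does not break parsing.

```python
# Sketch: read CTM-style ASR output and recover the words inside a time segment.
# Assumed field layout (from Figure 1): <file> <channel> <start> <dur> <word> <conf>.
from typing import List, NamedTuple

class Token(NamedTuple):
    start: float       # word start time in seconds
    duration: float    # word duration in seconds
    word: str
    confidence: float  # ASR confidence score

def parse_ctm_line(line: str) -> Token:
    """Parse one CTM line; the last four fields are start, duration, word, confidence."""
    fields = line.split()
    start, dur, word, conf = fields[-4], fields[-3], fields[-2], fields[-1]
    return Token(float(start), float(dur), word, float(conf))

def words_in_segment(tokens: List[Token], seg_start: float, seg_end: float) -> List[str]:
    """Return the words whose full time span lies inside [seg_start, seg_end]."""
    return [t.word for t in tokens
            if t.start >= seg_start and t.start + t.duration <= seg_end]

ctm = [
    "20041115_1705_1735_EN_SAT 1 1081.588 0.050 a 0.9595",
    "20041115_1705_1735_EN_SAT 1 1081.638 0.190 week 0.9744",
    "20041115_1705_1735_EN_SAT 1 1081.828 0.350 ago 0.9743",
    "20041115_1705_1735_EN_SAT 1 1082.338 0.630 President 0.9576",
]
tokens = [parse_ctm_line(line) for line in ctm]
print(words_in_segment(tokens, 1081.588, 1082.178))  # ['a', 'week', 'ago']
```

Run on the figure's segment 1081.588 to 1082.178, the sketch recovers exactly the manual-transcript answer "a week ago".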
A total of six tasks were defined for this third edition of QAST, covering three scenarios: English questions related to European Parliament sessions in English (T1a and T1b), Spanish questions related to European Parliament sessions in Spanish (T2a and T2b), and French questions related to French Broadcast News (T3a and T3b). The complete set of tasks is:

• T1a: QA of English written questions in the manual and automatic transcriptions of European Parliament Plenary sessions in English (EPPS English corpus).
• T1b: QA of manual transcriptions of English spontaneous oral questions in the manual and automatic transcriptions of European Parliament Plenary sessions in English (EPPS English corpus).
• T2a: QA of Spanish written questions in the manual and automatic transcriptions of European Parliament Plenary sessions in Spanish (EPPS Spanish corpus).
• T2b: QA of manual transcriptions of Spanish spontaneous oral questions in the manual and automatic transcriptions of European Parliament Plenary sessions in Spanish (EPPS Spanish corpus).
• T3a: QA of French written questions in the manual and automatic transcriptions of broadcast news for French (ESTER corpus).
• T3b: QA of manual transcriptions of French spontaneous oral questions in the manual and automatic transcriptions of broadcast news for French (ESTER corpus).

3 Evaluation protocol

3.1 Data collections

The QAST 2009 data is derived from three different resources, each one corresponding to a different language (English, Spanish and French):

• English parliament (EPPS EN): The TC-STAR05 EPPS English corpus [4] contains 3 hours of recordings in English corresponding to 6 sessions of the European Parliament. The data was used to evaluate speech recognizers in the TC-STAR project. There are 3 different automatic speech recognition outputs with different word error rates (10.6%, 14% and 24.1%). The manual transcriptions were done by ELDA.
• Spanish parliament (EPPS ES): The TC-STAR05 EPPS Spanish corpus [4] contains three hours of recordings in Spanish corresponding to 6 sessions of the European Parliament. The data was used to evaluate Spanish ASR systems developed in the TC-STAR project. There are 3 different automatic speech recognition outputs with different word error rates (11.5%, 12.7% and 13.7%). The manual transcriptions were done by ELDA.

• French broadcast news (French BN): The test portion of the ESTER corpus [2] contains 10 hours of broadcast news recordings in French, comprising 18 shows from different sources (France Inter, Radio France International, Radio Classique, France Culture, Radio Television du Maroc). There are 3 different automatic speech recognition outputs with different word error rates (11.0%, 23.9% and 35.4%). The manual transcriptions were produced by ELDA.

These three collections are the same as the ones used last year for the QAST 2008 evaluation campaign. European Parliament and Broadcast News data are usually referred to as prepared speech. Although they typically have few interruptions and turn-taking problems when compared to actual spontaneous speech, many of the characteristics of spoken language are still present (hesitations, breath noises, speech errors, false starts, mispronunciations and corrections).

3.2 Questions and answer types

For each of the three languages, two sets of manually transcribed spontaneous oral questions and their respective written questions were created and provided to the participants, the first for development purposes and the second for the evaluation:

• Development sets (released on the 25th of March 2009):
  – EPPS EN: 50 transcribed questions and their respective written questions.
  – EPPS ES: 50 transcribed questions and their respective written questions.
  – French BN: 50 transcribed questions and their respective written questions.
• Evaluation sets (released on the 1st of June 2009):
  – EPPS EN: 100 transcribed questions and their respective written questions.
  – EPPS ES: 100 transcribed questions and their respective written questions.
  – French BN: 100 transcribed questions and their respective written questions.

For each language, both the development and evaluation sets were created from the whole document collection (i.e. the 6 European Parliament sessions for English and Spanish, and the 18 Broadcast News shows for French). In other words, there was no collection split between a development data set and an evaluation data set as was done last year.

As last year, two types of questions were considered: factual questions and definitional ones. The expected answer to a factual question is a named entity. There were five types of factual question this year, each corresponding to a particular category of named entities:

• Person: names of humans, real and fictional, and fictional or real non-human individuals. Ex: Mirjam Killer, John, Jesus, etc.
• Organisation: names of businesses, multinational organizations, political parties, religious groups, etc. Ex: CIA, IBM, but also named entities like Washington when they display the characteristics of an organisation.
• Location: geographical, political or astronomical entities. Ex: California, South of California, Earth, etc.
• Time: a date or a specific moment in time, absolute and relative time expressions. Ex: March 28th, last week, at four o'clock in the morning, etc.
• Measure: measures of length, width, weight, etc., generally a quantity and a unit of measurement. Ex: five kilometers, 20 hertz, etc., but also ages, periods of time, etc.

This is fewer than the 10 categories used for the 2007 and 2008 evaluations. Some categories have not been considered this year because no occurrences were found in the collected set of spontaneous questions (Color, Shape, Language, System, Material). The definition questions are questions such as What is the CDU?
and the answer can be anything. In this example, the answer would be political group. This year, the definition questions are subdivided into three types:

• Person: a question about someone. Q: Who is George Bush? R: The President of the United States of America.
• Organisation: a question about an organisation. Q: What is Cortes? R: Parliament of Spain.
• Other: questions about technology, natural phenomena, etc. Q: What is the name of the system created by AT&T? R: The How can I help you system.

For each language, a number of 'NIL' questions (i.e., questions having no answer in the document collection) have been selected. The distribution of the different types of questions across the three collections is shown in Table 1.

Type          Factual  Definition  NIL
T1 (English)    75%       25%      18%
T2 (Spanish)    55%       45%      23%
T3 (French)     68%       32%      21%

Table 1: Distribution of question types per task: T1 (EPPS EN), T2 (EPPS ES), T3 (French BN).

The question sets are formatted as plain text files, with one question per line (see the QAST 2008 Guidelines1). The procedure used to generate the questions is described in the following section.

3.2.1 Question generation

A novel feature in QAST 2009 was the introduction of spontaneous oral questions. The main issue in the generation of this kind of question was how to obtain spontaneity. The solution adopted was to set up the following procedure for question generation:

1. Passage generation: a set of passages was randomly extracted from the document collection. A single passage was composed of the complete sentences included in a text window of 720 characters.

2. Question generation: human question generators were randomly assigned a number of passages (varying from 2 to 4). They had to read each passage and then formulate one or more questions, based on the passage they had just read, about information not present in it.

3. Question transcription: precise transcriptions of the spontaneous oral questions were made, including hesitations, etc.
Ex: (%hesitation) What (%hesitation) house is the pres() the president elect being elected to?

4. Question filtering: some questions were filtered out from the set of generated questions because their answer types were not allowed or because they did not have an answer in the document collection. The remaining questions were the usable questions.

5. Written question generation: the usable questions were re-written by removing speech disfluencies, correcting the syntax and simplifying the sentence when necessary. Ex: What house does the president run?

6. Question selection: the final sets of development questions and test questions were selected by ELDA from the usable questions.

1 http://www.lsi.upc.edu/~qast

The allowed question types were the following:

• definition: person, organisation, object and other
• factoid: person, location, organisation, time (includes date), measure and language

However, the types "language" for factual questions and "object" for definition questions did not occur among the generated questions.

A preliminary evaluation of the generated questions was carried out in order to determine how many usable questions could be produced by a human reader. The results of this evaluation show that the percentage of usable questions produced by the question generators was between 47% and 58% of the total questions produced, depending on the speakers' knowledge of the task guidelines. These figures show that more questions were produced than the number of questions actually presented to participants in QAST 2009. Most unusable questions were due to the fact that the human question generators often forgot the guidelines while asking their questions. Table 2 shows, for each language, the number of speakers, the number of questions recorded, the resulting usable questions and the average question length in words.

Language  #speakers  #questions recorded  #usable questions  avg. #words
English       12            1096                 616              9.1
French         7             485                 335              7.7
Spanish       11             403                 313              7.1

Table 2: Details of the questions generated for each language.

3.3 Human judgment

As in 2008, the answer files submitted by participants have been manually judged by native-speaking assessors, who considered the correctness and exactness of the returned answers. They also checked that the document labeled with the returned document ID supports the given answer. One assessor evaluated the results, and another assessor manually checked each judgment of the first one. Any doubt about an answer was resolved through discussion. The assessors used the QASTLE2 evaluation tool, developed in Perl at ELDA, to evaluate the systems' results. A simple window-based interface permits easy, simultaneous access to the question, the answer and the document associated with the answer.

2 http://www.elda.org/qastle/

After each judgment, the submission files were modified by the interface, adding a new element in the first column: the answer's evaluation (or judgment). The four possible judgments (also used at TREC [7]) correspond to a number ranging between 0 and 3:

• 0 correct: the answer-string consists of the relevant information (exact answer), and the answer is supported by the returned document.
• 1 incorrect: the answer-string does not contain a correct answer.
• 2 inexact: the answer-string contains a correct answer and the docid supports it, but the string has bits of the answer missing or contains additional text (it is longer than it should be).
• 3 unsupported: the answer-string contains a correct answer, but it is not supported by the docid.

3.4 Measures

The two following metrics (also used in CLEF) were used in the QAST evaluation:

1. Mean Reciprocal Rank (MRR): measures how well the right answer is ranked in the list of 5 possible answers.
2. Accuracy: the fraction of correct answers ranked in the first position in the list of 5 possible answers.
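The two metrics can be sketched in a few lines. The following illustration assumes each question's up-to-five judged answers are given in rank order as the numeric judgments of Section 3.3, with judgment 0 ("correct") the only one counting as a hit:

```python
# Sketch of the QAST metrics, assuming per-question lists of judgments
# (0 = correct, 1 = incorrect, 2 = inexact, 3 = unsupported) in rank order.

def reciprocal_rank(judgments):
    """1/rank of the first correct answer, or 0.0 if none of the answers is correct."""
    for rank, j in enumerate(judgments, start=1):
        if j == 0:
            return 1.0 / rank
    return 0.0

def mrr(all_judgments):
    """Mean Reciprocal Rank over the whole question set."""
    return sum(reciprocal_rank(js) for js in all_judgments) / len(all_judgments)

def accuracy(all_judgments):
    """Fraction of questions whose first-ranked answer is judged correct."""
    return sum(1 for js in all_judgments if js and js[0] == 0) / len(all_judgments)

runs = [
    [0, 1, 1],        # correct at rank 1 -> RR = 1.0
    [1, 0],           # correct at rank 2 -> RR = 0.5
    [1, 2, 3, 1, 1],  # no correct answer -> RR = 0.0
]
print(mrr(runs), accuracy(runs))  # MRR = 0.5, accuracy = 1/3
```

Note that MRR rewards a correct answer anywhere in the ranked list (with decaying credit), while accuracy only credits rank 1, which is why MRR is never lower than accuracy in Tables 5 to 10.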
4 Submitted runs

A total of four groups from four different countries submitted results for one or more of the proposed QAST 2009 tasks. For various reasons (technical, financial, etc.), eight other groups registered but were not able to submit any results. The four participating groups were:

• INAOE, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;
• LIMSI, Laboratoire d'Informatique et de Mécanique des Sciences de l'Ingénieur, France;
• TOK, Tokyo Institute of Technology, Japan;
• UPC, Universitat Politècnica de Catalunya, Spain.

All groups participated in task T1 (EPPS EN), UPC and LIMSI participated in task T2 (EPPS ES), and only LIMSI dealt with task T3 (French BN). Each participant could submit up to 48 runs (2 runs per task and transcription). In order to allow comparisons of system performance at different WER levels, it was mandatory for each task to submit results for all the data: the manual transcriptions and the three ASR outputs (automatic transcriptions). Table 3 shows the number of submitted runs per participant and task. The number of submissions per participant ranged from 8 to 32. The characteristics of the systems used in the submissions are summarized in Table 4. A total of 86 submissions were evaluated, with the distribution across tasks shown in the bottom row of the table.

Participant  T1a  T1b  T2a  T2b  T3a  T3b
INAOE         8    8    -    -    -    -
LIMSI         5    5    5    5    5    5
TOK           4    4    -    -    -    -
UPC           8    8    8    8    -    -
Total        25   25   13   13    5    5

Table 3: Submitted runs per participant and task.

The system characteristics (document enrichment, named entity recognition and classification (NERC), question classification, document/passage retrieval, and factual and definitional answer extraction) are the following:

• inaoe1: enrichment with words; hand-crafted question classification rules based on the question type; Indri passage selection; answer extraction based on NEs; regular expressions for definitional answers.
• inaoe2: same as inaoe1, plus phonetics.
• limsi1: enrichment with words, lemmas, morphologic derivations, synonymic relations and extended NEs; hand-crafted NERC and question classification rules; passage search based on a specific index with POS descriptors; factual answer ranking based on distance and redundancy; hand-crafted rules for known acronyms for definitional answers.
• limsi2: same as limsi1, with answer ranking based on statistical bayesian modelling.
• tok1: words and word classes derived from training data; sentence ranking based on the analogy between the input question and the questions in the training data (question-answer pairs); statistical models for answer extraction.
• upc1: enrichment with words, lemmas and POS; NERC with perceptrons and gazetteers; passage ranking through iterative query relaxation; factual answer ranking based on keyword distance and density, with hand-crafted rules.
• upc2: same as upc1, plus phonetics and approximated phonetic matching.

Table 4: Characteristics of the systems that participated in QAST 2009.

5 Results

The results for the three tasks on manually transcribed data are presented in Tables 5 to 7, according to the question types (factual, definitional and all questions).
                      Factual               Definitional          All
System   Questions    #Correct MRR  Acc     #Correct MRR  Acc     MRR  Acc
INAOE1   Written      44  0.38  26.7%       10  0.31  28.0%       0.36  27%
INAOE1   Spoken       28  0.27  21.3%        7  0.26  24.0%       0.27  22%
INAOE2   Written      42  0.38  28.0%        9  0.30  28.0%       0.36  28%
INAOE2   Spoken       38  0.35  25.3%        9  0.30  28.0%       0.34  26%
LIMSI1   Written      42  0.39  29.3%       11  0.28  20.0%       0.36  27%
LIMSI1   Spoken       39  0.36  25.3%       10  0.24  16.0%       0.33  23%
LIMSI2   Written      32  0.31  22.7%       13  0.36  24.0%       0.32  23%
LIMSI2   Spoken       30  0.26  18.7%       11  0.30  20.0%       0.27  19%
TOK1     Written      11  0.10   6.7%        3  0.03   0.0%       0.08   5%
TOK1     Spoken       11  0.08   4.0%        3  0.03   0.0%       0.06   3%
UPC1     Written      32  0.27  18.7%        8  0.29  28.0%       0.28  21%
UPC1     Spoken       19  0.15   9.3%        2  0.05   4.0%       0.12   8%
UPC2     Written      35  0.31  22.7%        8  0.29  28.0%       0.31  24%
UPC2     Spoken       18  0.15   9.3%        2  0.05   4.0%       0.12   8%

Table 5: Results for task T1, English EPPS, manual transcripts (75 factual questions and 25 definitional ones).

                      Factual               Definitional          All
System   Questions    #Correct MRR  Acc     #Correct MRR  Acc     MRR  Acc
LIMSI1   Written      32  0.56  45.5%       29  0.36  28.6%       0.45  36.0%
LIMSI1   Spoken       32  0.56  45.5%       30  0.37  28.6%       0.45  36.0%
LIMSI2   Written      26  0.41  29.5%       23  0.28  19.6%       0.34  24.0%
LIMSI2   Spoken       26  0.41  29.5%       23  0.28  19.6%       0.34  24.0%
UPC1     Written      16  0.24  15.9%       10  0.16  14.3%       0.20  15.0%
UPC1     Spoken       20  0.34  27.3%        9  0.13  10.7%       0.22  18.0%
UPC2     Written      20  0.29  18.2%       10  0.14  10.7%       0.20  14.0%
UPC2     Spoken       20  0.33  27.3%        9  0.13   8.9%       0.22  17.0%

Table 6: Results for task T2, Spanish EPPS, manual transcripts (44 factual questions and 56 definitional ones).

                      Factual               Definitional          All
System   Questions    #Correct MRR  Acc     #Correct MRR  Acc     MRR  Acc
LIMSI1   Written      38  0.35  23.5%       22  0.47  37.5%       0.39  28.0%
LIMSI1   Spoken       39  0.36  23.5%       20  0.46  37.5%       0.39  28.0%
LIMSI2   Written      38  0.34  22.1%       22  0.47  37.5%       0.38  27.0%
LIMSI2   Spoken       39  0.36  23.5%       20  0.46  37.5%       0.39  28.0%

Table 7: Results for task T3, French Broadcast News, manual transcripts (68 factual questions and 32 definitional ones).
The results for the three tasks on automatically transcribed data are presented in Tables 8 to 10, according to the question types (factual, definitional and all questions).

Seven systems participated in the T1 (English) task on manual transcripts and six on automatic transcripts. On manual transcripts, the accuracy ranged from 28% to 5% for written questions and from 26% to 3% for spoken questions. For five of the systems, we observe a relatively small difference between written and spoken questions (from 2% to 5% loss going from written to spoken questions). The other two systems suffered a significant loss (13% and 16% difference between written and spoken questions).

Three approaches to QA on automatic speech transcripts were used by the systems. LIMSI and UPC on all ASR outputs, and INAOE on ASR A and ASR B, took the ASR output as the only available information. INAOE on ASR C used information extracted from all the ASR outputs, keeping ASR C as primary. This approach could represent an application where multiple ASR outputs from different systems are available. Combining outputs from varied systems is a standard method in speech recognition to obtain a better word error rate [1]; it is interesting to see whether the same kind of method can be used at a more semantic level. The TOK system, on the other hand, used sentence segmentation information from the manual transcripts and applied it to the automatic transcripts. While such segmentation information is not available in the transcriptions given, ASR systems do generate an acoustically motivated segmentation as a step of their processing. The TOK approach could then be considered as using an optimistic approximation of this automatically generated segmentation information. In any case, comparing systems and estimating the impact of WER can only be done on "pure" systems (LIMSI and UPC on all ASR outputs, and INAOE on ASR A and ASR B).
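The paper does not detail how INAOE merged evidence across ASR outputs. Purely as an illustration of combining answers at a "more semantic level", the following sketch merges the ranked answer lists produced from several ASR outputs by summing reciprocal-rank votes, loosely in the spirit of ROVER-style system combination [1]; the voting scheme and answer lists here are hypothetical, not INAOE's actual method:

```python
# Illustrative sketch (not INAOE's method): combine ranked answer lists from
# several ASR outputs by summing 1/rank votes per answer string.
from collections import defaultdict

def combine_answer_lists(lists_of_answers):
    """Each input is a ranked list of answer strings from one ASR output.
    An answer earns 1/rank votes from each list it appears in; the merged
    list keeps the top 5 answers by total vote."""
    scores = defaultdict(float)
    for answers in lists_of_answers:
        for rank, ans in enumerate(answers, start=1):
            scores[ans] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)[:5]

merged = combine_answer_lists([
    ["a week ago", "last week"],      # hypothetical answers from ASR A
    ["a week ago", "yesterday"],      # hypothetical answers from ASR B
    ["for the Chair", "a week ago"],  # hypothetical answers from ASR C
])
print(merged[0])  # 'a week ago' (2.5 votes, ahead of 'for the Chair' with 1.0)
```

The idea is that an answer recoverable from several recognizers' outputs is more likely correct than one appearing only in the noisiest transcript, mirroring the observation that the multi-ASR run beat the same system on the best single ASR.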
                          Factual               Definitional          All
ASR      System  Questions  #Correct MRR  Acc     #Correct MRR  Acc     MRR  Acc
ASR A    INAOE1  Written    35  0.32  24.0%        6  0.21  20.0%       0.30  23.0%
(10.6%)  INAOE1  Spoken     34  0.33  25.3%        6  0.21  20.0%       0.30  24.0%
         INAOE2  Written    35  0.32  22.7%        7  0.22  20.0%       0.29  22.0%
         INAOE2  Spoken     34  0.32  24.0%        7  0.22  20.0%       0.29  23.0%
         LIMSI1  Written    32  0.34  28.0%       10  0.25  20.0%       0.31  26.0%
         LIMSI1  Spoken     30  0.31  25.3%       11  0.29  24.0%       0.30  25.0%
         TOK1    Written    13  0.08   4.0%        3  0.04   0.0%       0.07   3.0%
         TOK1    Spoken     12  0.07   2.7%        4  0.08   4.0%       0.07   3.0%
         UPC1    Written    29  0.27  18.7%        7  0.26  24.0%       0.27  20.0%
         UPC1    Spoken     11  0.08   5.3%        2  0.06   4.0%       0.08   5.0%
         UPC2    Written    30  0.26  18.7%        6  0.24  24.0%       0.26  20.0%
         UPC2    Spoken     12  0.09   5.3%        1  0.04   4.0%       0.08   5.0%
ASR B    INAOE1  Written    23  0.22  16.0%        6  0.21  20.0%       0.22  17.0%
(14.0%)  INAOE1  Spoken     23  0.21  13.3%        7  0.25  24.0%       0.22  16.0%
         INAOE2  Written    24  0.22  16.0%        6  0.21  20.0%       0.22  17.0%
         INAOE2  Spoken     24  0.21  13.3%        7  0.25  24.0%       0.22  16.0%
         LIMSI1  Written    24  0.27  22.7%        8  0.20  16.0%       0.25  21.0%
         LIMSI1  Spoken     24  0.26  21.3%        9  0.24  20.0%       0.25  21.0%
         TOK1    Written     9  0.06   4.0%        3  0.03   0.0%       0.06   3.0%
         TOK1    Spoken     10  0.06   2.7%        3  0.06   4.0%       0.06   3.0%
         UPC1    Written    26  0.24  17.3%        7  0.26  24.0%       0.24  19.0%
         UPC1    Spoken     11  0.08   4.0%        2  0.06   4.0%       0.08   4.0%
         UPC2    Written    29  0.26  20.0%        7  0.25  24.0%       0.26  21.0%
         UPC2    Spoken     12  0.08   4.0%        2  0.05   4.0%       0.07   4.0%
ASR C    INAOE1  Written    29  0.31  26.7%        5  0.20  20.0%       0.28  25.0%
(24.1%)  INAOE1  Spoken     28  0.30  26.7%        5  0.20  20.0%       0.28  25.0%
         INAOE2  Written    29  0.30  25.3%        6  0.21  20.0%       0.28  24.0%
         INAOE2  Spoken     28  0.29  24.0%        6  0.21  20.0%       0.27  23.0%
         LIMSI1  Written    23  0.26  24.0%        8  0.19  12.0%       0.24  21.0%
         LIMSI1  Spoken     24  0.24  21.3%        9  0.23  16.0%       0.24  20.0%
         TOK1    Written    17  0.12   5.3%        5  0.08   4.0%       0.11   5.0%
         TOK1    Spoken     19  0.11   4.0%        5  0.12   8.0%       0.11   5.0%
         UPC1    Written    22  0.21  16.0%        6  0.24  24.0%       0.22  18.0%
         UPC1    Spoken     10  0.08   5.3%        1  0.04   4.0%       0.07   5.0%
         UPC2    Written    26  0.24  17.3%        6  0.24  24.0%       0.24  19.0%
         UPC2    Spoken     11  0.08   4.0%        1  0.04   4.0%       0.07   4.0%

Table 8: Results for task T1, English EPPS, automatic transcripts (75 factual questions and 25 definitional ones).

On the ASR transcripts, the accuracy of the pure systems on the best ASR (10.6% WER) ranged from 26% (written questions) to 5% (spoken questions). Accuracy goes down with increasing word error rate, giving a roughly 5% loss for ASR B and ASR C compared to ASR A. It is interesting to note that the differences between ASR B (WER 14%) and ASR C (WER 24.1%) are negligible. The INAOE multi-ASR approach paid off, giving an overall result better than what was obtained by the same system on the best ASR only. We notice that the impact of written vs spoken questions is similar to that for manual transcriptions, with two systems taking a heavy loss and the others not showing a significant difference.

Four systems (2 from LIMSI and 2 from UPC) participated in the T2 (Spanish) task on manual transcripts and three systems (1 from LIMSI and 2 from UPC) on automatic transcripts. On manual transcripts, the accuracy ranged from 36% (written and spoken questions) to 14% (written questions) and 17% (spoken questions). The differences between written questions and spoken questions are very small (from 0% to 3%). The same kind of behaviour is observed on the automatic transcript tasks, with a loss due to the speech recognition errors and no significant difference between written and spoken questions.
                          Factual               Definitional          All
ASR      System  Questions  #Correct MRR  Acc     #Correct MRR  Acc     MRR  Acc
ASR A    LIMSI1  Written    20  0.37  31.8%       22  0.29  23.2%       0.32  27.0%
(11.5%)  LIMSI1  Spoken     20  0.37  31.8%       21  0.27  21.4%       0.31  26.0%
         UPC1    Written     8  0.15  13.6%        2  0.01   0.0%       0.07   6.0%
         UPC1    Spoken      6  0.14  13.6%        2  0.01   0.0%       0.07   6.0%
         UPC2    Written    12  0.20  18.2%        3  0.02   0.0%       0.10   8.0%
         UPC2    Spoken     12  0.24  22.7%        3  0.03   1.8%       0.12  11.0%
ASR B    LIMSI1  Written    18  0.32  27.3%       19  0.26  23.2%       0.29  25.0%
(12.7%)  LIMSI1  Spoken     18  0.32  27.3%       19  0.26  23.2%       0.29  25.0%
         UPC1    Written    12  0.18  13.6%        2  0.04   3.6%       0.10   8.0%
         UPC1    Spoken     12  0.20  15.9%        1  0.02   1.8%       0.10   8.0%
         UPC2    Written    13  0.20  15.9%        3  0.02   0.0%       0.10   7.0%
         UPC2    Spoken     12  0.20  15.9%        1  0.01   0.0%       0.09   7.0%
ASR C    LIMSI1  Written    18  0.33  29.5%       19  0.24  17.9%       0.28  23.0%
(13.7%)  LIMSI1  Spoken     18  0.33  29.5%       19  0.25  19.6%       0.28  24.0%
         UPC1    Written    12  0.22  20.5%        4  0.05   3.6%       0.13  11.0%
         UPC1    Spoken      8  0.13  11.4%        2  0.03   1.8%       0.07   6.0%
         UPC2    Written    11  0.20  18.2%        4  0.03   1.8%       0.11   9.0%
         UPC2    Spoken     10  0.21  20.5%        3  0.02   0.0%       0.10   9.0%

Table 9: Results for task T2, Spanish EPPS, automatic transcripts (44 factual questions and 56 definitional ones).

                          Factual               Definitional          All
ASR      System  Questions  #Correct MRR  Acc     #Correct MRR  Acc     MRR  Acc
ASR A    LIMSI1  Written    33  0.33  25.0%       19  0.47  37.5%       0.37  29.0%
(11.0%)  LIMSI1  Spoken     32  0.33  25.0%       18  0.45  37.5%       0.37  29.0%
ASR B    LIMSI1  Written    25  0.29  25.0%       15  0.38  31.3%       0.32  27.0%
(23.9%)  LIMSI1  Spoken     25  0.27  22.1%       13  0.35  31.3%       0.30  25.0%
ASR C    LIMSI1  Written    25  0.26  20.6%       13  0.33  28.1%       0.28  23.0%
(35.4%)  LIMSI1  Spoken     24  0.25  19.1%       11  0.31  28.1%       0.27  22.0%

Table 10: Results for task T3, French Broadcast News, automatic transcripts (68 factual questions and 32 definitional ones).

Only two systems (both from LIMSI) participated in the T3 (French) task on manual transcripts and one (from LIMSI) on automatic transcripts. On manual transcripts, the accuracy ranged from 28% (both written and spoken questions) to 27% (written questions). There is no significant difference between spoken and written questions (0% to 1% loss).
The results on automatic transcriptions show very little loss compared to the manual transcriptions, except for the worst ASR.

The overall absolute results were worse this year than last year, which points to a globally harder task. The question development method produces requests which qualitatively seem to be more different from what is found in the documents than questions built after reading the documents. In our opinion this method, while giving a harder problem, puts us closer to a real, usable application.

6 Conclusions

In this paper, the QAST 2009 evaluation has been described. Four groups participated in this track, with a total of 86 submitted runs across 3 main tasks that involved different languages (English, Spanish and French), different word error rates for the automatic transcriptions (from 10.6% to 35.4%) and different question types (written and spoken questions). An original question creation method was successfully tried to generate spontaneous spoken questions. Qualitatively, the questions were harder and more different from the formulations found in the documents than those produced by the traditional method of consulting the documents first. The method used this year gives a harder problem, but we think that it is a more realistic one, putting us closer to a real, usable application.

Acknowledgments

This work has been jointly funded by the Spanish Ministry of Science (TEXTMESS project - TIN2006-15265-C06) and OSEO under the Quaero program. We thank Lori Lamel, Erik Bilinski, Manuel González and Pere Vilarrubia for their help with the organisation and data generation.

References

[1] J. Fiscus. A post-processing system to yield reduced word error rates: Recogniser Output Voting Error Reduction (ROVER). In Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 347-352, Santa Barbara, CA, 1997.

[2] S. Galliano, E. Geoffrois, G. Gravier, J.F. Bonastre, D. Mostefa, and K. Choukri.
Corpus description of the ESTER Evaluation Campaign for the Rich Transcription of French Broadcast News. In Proceedings of LREC'06, pages 315-320, Genoa, 2006.

[3] C. Peters, P. Clough, F.C. Gey, J. Karlgren, B. Magnini, D.W. Oard, M. de Rijke, and M. Stempfhuber, editors. Evaluation of Multilingual and Multi-modal Information Retrieval. Springer-Verlag, 2006.

[4] TC-Star. http://www.tc-star.org, 2004-2008.

[5] J. Turmo, P.R. Comas, C. Ayache, D. Mostefa, S. Rosset, and L. Lamel. Overview of QAST 2007. In C. Peters, V. Jijkoun, Th. Mandl, H. Müller, D.W. Oard, A. Peñas, V. Petras, and D. Santos, editors, 8th Workshop of the Cross Language Evaluation Forum (CLEF 2007), Revised Selected Papers, pages 249-256. LNCS, 2008.

[6] J. Turmo, P.R. Comas, S. Rosset, L. Lamel, N. Moreau, and D. Mostefa. Overview of QAST 2008. In 9th Workshop of the Cross Language Evaluation Forum (CLEF 2008), Revised Selected Papers (to appear). LNCS, 2009.

[7] E.M. Voorhees and L.L. Buckland, editors. The Fifteenth Text REtrieval Conference Proceedings (TREC 2006), 2006.