                         Overview of QAST 2008
 Jordi Turmo1 , Pere Comas1 , Sophie Rosset2 , Lori Lamel2 , Nicolas Moreau3 and Djamel Mostefa3
                       1
                         TALP Research Centre (UPC). Barcelona. Spain
                                 {turmo,pcomas}@lsi.upc.edu
                                      2
                                        LIMSI. Paris. France
                                    {rosset,lamel}@limsi.fr
                                  3
                                    ELDA/ELRA. Paris. France
                                  {moreau,mostefa}@elda.org


                                             Abstract


        This paper describes QAST 2008, the second edition of the CLEF pilot track devoted
     to evaluating Question Answering in Speech Transcripts. Five sites submitted results
     for at least one of the five scenarios (lectures in English, meetings in English,
     broadcast news in French, and European Parliament debates in English and Spanish).
     In order to assess the impact of automatic speech recognition errors, each task
     includes contrastive conditions with manually and automatically produced transcripts.
     The QAST 2008 evaluation framework is described, along with the five scenarios and
     their associated data, the system submissions for this pilot track and the official
     evaluation results.


Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software


General Terms

Experimentation, Performance, Measurement


Keywords

Question Answering, Spontaneous Speech Transcripts



1    Introduction

Question Answering (QA) technology aims at providing relevant answers to natural language
questions. Most Question Answering research has focused on mining document collections con-
taining written texts to answer written questions [3, 6]. Documents can be either open domain
(newspapers, newswire, Wikipedia...) or restricted domain (biomedical papers...), but in general
they share a reasonable writing quality, at least in terms of grammar. In addition to written
sources, a large and growing amount of potentially interesting information appears in spoken
documents, such as broadcast news, speeches, seminars, meetings or telephone conversations. The QAST track
aims at investigating the problem of question answering in such audio documents.

Current text-based QA systems tend to use technologies that require texts to have been written
in accordance with standard norms for written grammar. The syntax of speech is quite different
from that of written language, with more local but less constrained relations between phrases,
and punctuation, which gives boundary cues in written language, is typically absent. Speech also
contains disfluencies, repetitions, restarts and corrections. Moreover, any practical application
of search in speech requires the transcriptions to be produced automatically, and automatic
speech recognizers (ASR) introduce a number of errors. Therefore, current techniques for text-
based QA need substantial adaptation in order to access the information contained in audio
documents. Preliminary research on QA in speech transcriptions was addressed in QAST 2007, a
pilot evaluation track at CLEF 2007 in which systems attempted to provide answers to written
factual questions by mining speech transcripts of seminars and meetings [5].

This paper provides an overview of the second QAST pilot evaluation. Section 2 describes the
principles of this evaluation track. Section 3 presents the evaluation framework and Section 4 the
systems that participated. Section 5 reports and discusses the results achieved, followed by some
conclusions in Section 6.



2     The QAST 2008 task

The objective of this pilot track is to develop a framework in which QA systems can be evaluated
when the answers have to be found in speech transcripts, these transcripts being either produced
manually or automatically. There are five main objectives to this evaluation:


    • Motivating and driving the design of novel and robust QA architectures for speech tran-
      scripts;
    • Measuring the loss due to the inaccuracies in state-of-the-art ASR technology;
    • Measuring this loss at different ASR performance levels given by the ASR word error rate;
    • Comparing the performance of QA systems on different kinds of speech data (prepared
      speech such as broadcast news (BN) or parliamentary hearings vs. spontaneous speech in
      meetings, for instance);
    • Motivating the development of monolingual QA systems for languages other than English.


In the 2008 evaluation, as in the 2007 pilot evaluation, an answer is structured as a simple [answer
string, document id] pair where the answer string contains nothing more than the full and exact
answer, and the document id is the unique identifier of the document supporting the answer. In
2008, for the tasks on automatic speech transcripts, the answer string consisted of the start time
and the end time giving the position of the answer in the signal. Figure 1 illustrates this point,
comparing the expected answer to the question What is the Vlaams Blok? in a manual
transcript (the text criminal organisation) and in an automatic transcript (the time segment
1019.228 1019.858). A system can provide up to 5 ranked answers per question.
Question: What is the Vlaams Blok?

Manual transcript: the Belgian Supreme Court has upheld a previous ruling that declares
the Vlaams Blok a criminal organization and effectively bans it .
Answer: criminal organisation
Extracted portion of an automatic transcript (CTM file format):
(...)
20041115 1705 1735 EN SAT 1 1018.408 0.440 Vlaams 0.9779
20041115 1705 1735 EN SAT 1 1018.848 0.300 Blok 0.8305
20041115 1705 1735 EN SAT 1 1019.168 0.060 a 0.4176
20041115 1705 1735 EN SAT 1 1019.228 0.470 criminal 0.9131
20041115 1705 1735 EN SAT 1 1019.858 0.840 organisation 0.5847
20041115 1705 1735 EN SAT 1 1020.938 0.100 and 0.9747
(...)
Answer: 1019.228 1019.858

Figure 1: Example query What is the Vlaams Blok? and response from manual (top) and auto-
matic (bottom) transcripts.
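The CTM lines in Figure 1 list one recognized word per line, with the word's start time and
duration given in the numeric fields that precede it. As a purely illustrative sketch (not part of
any submitted system; the field layout is assumed from the example above), the following Python
fragment shows how a hypothesized answer span can be converted into the time-based answer
format required for the automatic-transcript tasks:

# Illustrative only: map a word span in a CTM-style transcript to the
# "start-time start-time" answer format of Figure 1. Field positions are
# assumed from the example above (..., start, duration, word, confidence).
def parse_ctm_line(line):
    fields = line.split()
    start, duration = float(fields[-4]), float(fields[-3])
    word, confidence = fields[-2], float(fields[-1])
    return start, duration, word, confidence

def answer_time_span(ctm_lines, first_idx, last_idx):
    words = [parse_ctm_line(l) for l in ctm_lines]
    return words[first_idx][0], words[last_idx][0]

ctm = ["20041115 1705 1735 EN SAT 1 1019.228 0.470 criminal 0.9131",
       "20041115 1705 1735 EN SAT 1 1019.858 0.840 organisation 0.5847"]
print(answer_time_span(ctm, 0, 1))   # (1019.228, 1019.858), as in Figure 1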


A total of ten tasks were defined for this second edition of QAST covering five main task scenarios
and three languages: lectures in English about speech and language processing (T1), meetings in
English about design of television remote controls (T2), French broadcast news (T3) and European
Parliament debates in English (T4) and Spanish (T5). The complete set of tasks is:


    • T1a: QA in manual transcriptions of lectures in English.
    • T1b: QA in automatic transcriptions of lectures in English.
    • T2a: QA in manual transcriptions of meetings in English.
    • T2b: QA in automatic transcriptions of meetings in English.
    • T3a: QA in manual transcriptions of broadcast news for French.
    • T3b: QA in automatic transcriptions of broadcast news for French.
    • T4a: QA in manual transcriptions of European Parliament Plenary sessions in English.
    • T4b: QA in automatic transcriptions of European Parliament Plenary sessions in English.
    • T5a: QA in manual transcriptions of European Parliament Plenary sessions in Spanish.
    • T5b: QA in automatic transcriptions of European Parliament Plenary sessions in Spanish.



3     Evaluation protocol

3.1    Data collections

The data for this second edition of QAST is derived from five different resources, covering sponta-
neous speech, semi-spontaneous speech and prepared speech. The first two are the same as those
used in QAST 2007 [5].
   • The CHIL corpus1 (as used for QAST 2007): The corpus contains about 25 hours of
     speech, mostly spoken by non-native speakers of English, with an estimated ASR Word
     Error Rate (WER) of 20%.
   • The AMI corpus2 (as used for QAST 2007): This corpus contains about 100 hours of
     speech, with an ASR WER of about 38%.
   • French broadcast news: The test portion of the ESTER corpus [1] contains 10 hours
     of broadcast news in French, recorded from different sources (France Inter, Radio France
     International, Radio Classique, France Culture, Radio Television du Maroc). There are 3
     different automatic speech recognition outputs with different error rates (WER = 11.0%,
     23.9% and 35.4%). The manual transcriptions were produced by ELDA.
   • Spanish parliament: The TC-STAR05 EPPS Spanish corpus [4] comprises three hours
     of recordings from the European Parliament in Spanish. The data was used to evaluate
     recognition systems developed in the TC-STAR project. There are 3 different automatic
     speech recognition outputs with different word error rates (11.5%, 12.7% and 13.7%). The
     manual transcriptions were done by ELDA.
   • English parliament: The TC-STAR05 EPPS English corpus [4] contains 3 hours of
     recordings from the European Parliament in English. The data was used to evaluate speech
     recognizers in the TC-STAR project. There are 3 different automatic speech recognition
     outputs with different word error rates (10.6%, 14.0% and 24.1%). The manual transcriptions
     were done by ELDA.


The spoken data cover a broader range of types than in 2007, both in terms of content and speaking
style. The Broadcast News and European Parliament data are less spontaneous than the lecture and
meeting speech as they are typically prepared in advance and are closer in structure to written
texts. While meetings and lectures are representative of spontaneous speech, Broadcast News and
European Parliament sessions are usually referred to as prepared speech. Although they typically
have few interruptions and turn-taking problems when compared to meeting data, many of the
characteristics of spoken language are still present (hesitations, breath noises, speech errors, false
starts, mispronunciations and corrections). One of the reasons for including the additional types
of data was to be closer to the textual data used to assess written QA, and to benefit from the
availability of multiple speech recognizers that have been developed for these languages and tasks
in the context of European or national projects [2, 1, 4].


3.1.1    Questions and answer types


For each of the five scenarios, two sets of questions have been provided to the participants, the
first for development purposes and the second for the evaluation.


   • Development set (11 March 2008) :
         – Lectures: 10 seminars and 50 questions.
         – Meetings: 50 meetings and 50 questions.
         – French broadcast news: 6 shows and 50 questions.
         – English EPPS: 2 sessions and 50 questions.
         – Spanish EPPS: 2 sessions and 50 questions.
  1 http://chil.server.de
  2 http://www.amiproject.org
   • Evaluation set (15 June 2008):
        – Lectures: 15 seminars and 100 questions.
        – Meetings: 120 meetings and 100 questions.
        – French broadcast news: 12 shows and 100 questions.
        – English EPPS: 4 sessions and 100 questions.
        – Spanish EPPS: 4 sessions and 100 questions.


Two types of questions were considered this year: factual questions and definitional ones. For each
corpus (CHIL, AMI, ESTER, EPPS EN, EPPS ES) roughly 70% of the questions are factual, 20%
are definitional, and 10% are NIL (i.e., questions having no answer in the document collection).

The question sets are formatted as plain text files, with one question per line (see the QAST 2008
Guidelines3 ). The factual questions are similar to those used in the 2007 evaluation. The expected
answer to these questions is a Named Entity (person, location, organization, language, system,
method, measure, time, color, shape and material). The definition questions are questions such
as What is the Vlaams Blok? and the answer can be anything. In this example, the answer would
be a criminal organization. The definition questions are subdivided into the following types:


   • Person: question about someone
     Q: Who is George Bush?
     R: The President of the United States of America.
   • Organisation: question about an organisation
     Q: What is Cortes?
     R: Parliament of Spain.
   • Object: question about any kind of object
     Q: What is F-15?
     R: combat aircraft.
   • Other: questions about technology, natural phenomena, etc.
     Q: What is the name of the system created by AT&T?
     R: The How can I help you system.


3.2    Human judgment

As in QAST 2007, the answer files submitted by participants have been manually judged by native-
speaking assessors, who considered the correctness and exactness of the returned answers. They
also checked that the document labeled with the returned docid supports the given answer. One
assessor evaluated the results, and a second assessor manually checked each judgment of the first
one. Any doubts about an answer were resolved through discussion. The assessors used
the QASTLE4 evaluation tool developed in Perl (at ELDA) to evaluate the responses. A simple
window-based interface permits easy, simultaneous access to the question, the answer and the
document associated with the answer.

For T1b, T2b, T3b, T4b and T5b (QA on automatic transcripts), the manual transcriptions were
aligned to the automatic ASR outputs in order to associate times with the answers in the automatic
  3 http://www.lsi.upc.edu/~qast
  4 http://www.elda.org/qastle/
transcripts. The alignments between the automatic and the manual transcriptions were done using
time information. Unfortunately, for some documents time information was not available and
only word alignments were used.
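As an illustration of such a word-level fallback (a minimal sketch of one possible approach, not
the alignment procedure actually used by the organizers), a generic sequence matcher can align
the two word streams:

# Minimal sketch of a word-level alignment between a manual and an automatic
# transcript, usable when no shared time codes are available. This illustrates
# one possible approach, not the alignment procedure used at QAST 2008.
import difflib

def align_words(manual_words, asr_words):
    # Return (manual_index, asr_index) pairs for words matched by the aligner.
    matcher = difflib.SequenceMatcher(a=manual_words, b=asr_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs

manual = "declares the Vlaams Blok a criminal organisation and bans it".split()
asr = "the Vlaams Blok a criminal organisation and".split()
print(align_words(manual, asr))   # maps ASR word positions back to the manual text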

After each judgment the submission files were modified, adding a new element in the first column:
the answer’s evaluation (or judgment). The four possible judgments (also used at TREC [6])
correspond to a number ranging from 0 to 3:


    • 0 correct: the answer-string consists of the relevant information (exact answer), and the
      answer is supported by the returned document.
    • 1 incorrect: the answer-string does not contain a correct answer.
    • 2 inexact: the answer-string contains a correct answer and the docid supports it, but the
      string has parts of the answer missing or contains additional text (it is longer than it should be).
    • 3 unsupported: the answer-string contains a correct answer, but is not supported by the
      docid.


3.3     Measures

The two following metrics (also used in CLEF) were used in the QAST evaluation:


    1. Mean Reciprocal Rank (MRR): the mean, over all questions, of the reciprocal of the rank at
       which the first correct answer appears in the list of up to 5 ranked answers (0 if no correct
       answer is returned).
    2. Accuracy: the fraction of questions for which a correct answer is ranked in the first position
       in the list of 5 possible answers.
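As an illustration (our own sketch, not the official scoring tool), both scores can be computed
directly from the judged, ranked answer lists, using the judgment codes of Section 3.2 (0 = correct):

# Illustrative computation of MRR and accuracy from judged submissions.
# Each question is given as a list of up to 5 judgments, ranked best-first,
# using the codes of Section 3.2 (0 = correct; 1, 2 and 3 do not count as correct).
def mrr_and_accuracy(judged_questions):
    reciprocal_ranks, correct_at_1 = [], 0
    for judgments in judged_questions:
        rank = next((i + 1 for i, j in enumerate(judgments) if j == 0), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
        if rank == 1:
            correct_at_1 += 1
    n = len(judged_questions)
    return sum(reciprocal_ranks) / n, correct_at_1 / n

# Three toy questions: correct at rank 1, correct at rank 3, no correct answer.
print(mrr_and_accuracy([[0, 1, 1], [1, 2, 0, 1, 1], [1, 1, 1, 1, 1]]))
# -> (0.444..., 0.333...)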



4     Submitted runs

A total of five groups from four different countries submitted results for one or more of the proposed
QAST 2008 tasks. For various reasons (technical, financial, etc.), three other groups registered
but were not able to submit any results.

The five participating groups were:


    • CUT, Chemnitz University of Technology, Germany;
    • INAOE, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico;
    • LIMSI, Laboratoire d’Informatique et de Mécanique des Sciences de l’Ingénieur, France;
    • UA, Universidad de Alicante, Spain;
    • UPC, Universitat Politècnica de Catalunya, Spain.


All groups participated in task T4 (English EPPS). Only LIMSI participated in task T3 (French
broadcast news). Table 1 shows the number of submitted runs per participant and task. Each
participant could submit up to 32 submissions (2 runs per task and transcription). The number
of submissions ranged from 2 to 20. The characteristics of the systems used in the submissions
are summarized in Table 2. A total of 49 submissions were evaluated, with the distribution across
tasks shown in the bottom row of Table 1.
               Participant    T1a   T1b        T2a   T2b      T3a   T3b   T4a    T4b     T5a   T5b
               CUT             2     -          -     -        -     -     2       -      -     -
               INAOE           -     -          -     -        -     -     1      2       -     -
               LIMSI           1     1          1     1        2     3     1      3       2     3
               UA              -     -          -     -        -     -     1      3       -     -
               UPC             1     2          1     2        -     -     1      6       1     6
               Total           4     3          2     3        2     3     6      14      3     9


Table 1: Submitted runs per participant and task. T1 (English lectures), T2 (English meetings),
T3 (French BN), T4 (English EPPS), T5 (Spanish EPPS).

 System   Characteristics
 cut1     Enrichment: words, NEs and POS. Question classification: hand-crafted rules.
          Doc./passage retrieval: passage ranking. Factual answer extraction: hand-crafted
          rules with a fallback strategy applied to the 1st passage. Definitional answer
          extraction: hand-crafted fallback strategy. NERC: Stanford NER and rules with
          classification.
 cut2     As cut1, but passage retrieval is based on RSV and factual answers are extracted
          from the top-3 passages.
 inaoe1   Enrichment: words and NEs. Question classification: hand-crafted rules.
          Doc./passage retrieval: Lemur. Factual answer extraction: candidate selection
          based on NEs. Definitional answer extraction: none. NERC: regular expressions.
 inaoe2   As inaoe1, with phonetic enrichment added.
 limsi1   Enrichment: words, lemmas, morphologic derivations, synonyms and extended NEs.
          Question classification: hand-crafted rules. Doc./passage retrieval: ranking based
          on search descriptors. Factual answer extraction: ranking based on distance and
          redundancy. Definitional answer extraction: specific index for known acronyms.
          NERC: hand-crafted rules with stochastic POS.
 limsi2   As limsi1, but factual answer extraction uses a tree-rewriting based distance.
 ua1      Enrichment: words, NEs, POS and n-grams. Question classification: hand-crafted
          rules. Doc./passage retrieval: ranking based on n-grams. Factual answer extraction:
          ranking based on keyword distance and mutual information. Definitional answer
          extraction: none. NERC: hand-crafted rules.
 upc1     Enrichment: words, NEs, lemmas and POS. Question classification: perceptrons.
          Doc./passage retrieval: ranking based on iterative query relaxation. Factual answer
          extraction: ranking based on keyword distance and density. Definitional answer
          extraction: none. NERC: hand-crafted rules, gazetteers and perceptrons.
 upc2     As upc1, with phonetic enrichment and approximate phonetic matching added to
          retrieval.


               Table 2: Characteristics of the systems that participated in QAST 2008.



5       Results

The results for the ten QAST 2008 tasks are presented in Tables 3 to 12, broken down into factual
questions, definitional questions, and all questions.

For manual transcriptions, the accuracy ranges from 45% (LIMSI1 on task T3a) down to 7%
(UPC1 on task T5a). For automatic transcriptions, the accuracy goes from 41% (LIMSI1 on task
T3b and ASR a) to 2% (UPC1 on task T5b and ASR c). Generally speaking, a loss in accuracy
is observed when dealing with automatic transcriptions. Comparing the best accuracy results on
manual and automatic transcriptions, the loss of accuracy ranges from 15% for task T2 down to
4% for tasks T3 and T4. This difference is larger for tasks where the ASR word error rate is
higher.
               System           Factual                       Definitional                  All
                        #Correct   MRR          Acc      #Correct MRR         Acc      MRR      Acc
               cut1       14        0.18        17.9       2         0.09      9.1     0.16     16.0
               cut2       16        0.19        16.7       8         0.26     18.2     0.20     17.0
               limsi1     48        0.53        47.4       4         0.18     18.2     0.45     41.0
               upc1       39        0.44        38.5       4         0.18     18.2     0.38     34.0


Table 3: Results for task T1a, English lectures, manual transcripts (78 factual questions and 22
definitional ones).

             System                  Factual                   Definitional                  All
             ASR 20%         #Correct   MRR      Acc      #Correct   MRR       Acc      MRR      Acc
             limsi1            33        0.34    30.8       3         0.14     13.6     0.30     27.0
             upc1              35        0.39    34.6       4         0.18     18.2     0.34     31.0
             upc2              35        0.37    33.3       4         0.18     18.2     0.33     30.0


Table 4: Results for task T1b, English lectures, ASR transcripts (78 factual questions and 22
definitional ones).

               System           Factual                       Definitional                  All
                        #Correct   MRR          Acc      #Correct MRR         Acc      MRR      Acc
               limsi1     44        0.47        37.8       7         0.22     19.2     0.40     33.0
               upc1       29        0.35        31.1       3         0.12     11.5     0.29     26.0


Table 5: Results for task T2a, English meetings, manual transcripts (74 factual questions and 26
definitional ones).

             System                  Factual                   Definitional                  All
             ASR 38%         #Correct   MRR      Acc      #Correct   MRR       Acc      MRR      Acc
             limsi1            23        0.21    16.2       6         0.18     15.4     0.20     16.0
             upc1              19        0.20    17.6       5         0.19     19.2     0.20     18.0
             upc2              16        0.16    10.8       6         0.23     23.1     0.18     14.0


Table 6: Results for task T2b, English meetings, ASR transcripts (74 factual questions and 26
definitional ones).

               System           Factual                       Definitional                  All
                        #Correct   MRR          Acc      #Correct MRR         Acc      MRR      Acc
               limsi1     45        0.50        45.3       13        0.47     44.0     0.49     45.0
               limsi2     45        0.47        41.3       13        0.46     44.0     0.47     42.0


Table 7: Results for task T3a, French BN, manual transcripts (75 factual questions and 25 defini-
tional ones).

         ASR        System              Factual                     Definitional                  All
                                #Correct   MRR          Acc    #Correct   MRR         Acc    MRR      Acc
         a 11.0%    limsi1        42        0.49        44.0     9         0.33       32.0   0.45     41.0
         b 23.9%    limsi1        29        0.28        22.7     10        0.34       32.0   0.30     25.0
         c 35.4%    limsi1        24        0.24        20.0     7         0.26       24.0   0.24     21.0


Table 8: Results for task T3b, French BN, ASR transcripts (75 factual questions and 25 definitional
ones).
              System           Factual                  Definitional                 All
                       #Correct   MRR     Acc      #Correct MRR        Acc      MRR      Acc
              cut1       12        0.16   16.0        9        0.36    36.0     0.21     21.0
              cut2       12        0.16   16.0       11        0.39    36.0     0.22     21.0
              inaoe1     41        0.43   37.3        6        0.21    20.0     0.38     33.0
              limsi1     44        0.43   33.3       12        0.39    32.0     0.42     33.0
              ua1        32        0.30   21.3        4        0.16    16.0     0.27     20.0
              upc1       38        0.44   40.0        4        0.16    16.0     0.37     34.0


Table 9: Results for task T4a, English EPPS, manual transcripts (75 factual questions and 25
definitional ones).

          ASR     System           Factual                   Definitional                 All
                           #Correct   MRR        Acc    #Correct   MRR        Acc    MRR      Acc
                  inaoe1     32        0.37      33.3      5        0.20      20.0   0.33     30.0
          a       inaoe2     34        0.38      32.0      5        0.20      20.0   0.33     29.0
          10.6%   limsi1     24        0.23      18.7      9        0.31      28.0   0.25     21.0
                  ua1        12        0.09       4.0      4        0.16      16.0   0.10     7.0
                  upc1       18        0.22      20.0      4        0.17      16.7   0.21     19.0
                  upc2       16        0.16      13.3      4        0.17      16.7   0.16     14.1
                  limsi1     22        0.21      16.0      9        0.33      32.0   0.24     20.0
          b       ua1        12        0.11       8.0      4        0.16      16.0   0.12     10.0
          14.0%   upc1       15        0.18      16.0      4        0.16      16.0   0.17     16.0
                  upc2       14        0.16      13.3      4        0.16      16.0   0.16     14.0
                  limsi1     21        0.21      16.0      8        0.30      28.0   0.23     19.0
          c       ua1         9        0.10       8.0      5        0.20      20.0   0.12     11.0
          24.1%   upc1       11        0.11       9.3      5        0.20      20.0   0.14     12.0
                  upc2       11        0.11       8.0      4        0.16      16.0   0.12     10.0


Table 10: Results for task T4b English EPPS, ASR transcripts (75 factual questions and 25
definitional ones).

              System           Factual                  Definitional                 All
                       #Correct   MRR     Acc      #Correct MRR        Acc      MRR      Acc
              limsi1     29        0.32   29.3       13        0.44    36.0     0.35     31.0
              limsi2     29        0.32   29.3       13        0.42    32.0     0.35     30.0
              upc1        9        0.11    9.3        3        0.05     0.0     0.09     7.0


Table 11: Results for task T5a, Spanish EPPS, manual transcripts (75 factual questions and 25
definitional ones).

          ASR     System           Factual                   Definitional                 All
                           #Correct   MRR        Acc    #Correct   MRR        Acc    MRR      Acc
                  limsi1     20        0.25      24.0      8        0.28      24.0   0.26     24.0
          a       upc1        5        0.05       4.0      0        0.00      00.0   0.04     3.0
          11.5%   upc2        5        0.06       5.3      2        0.08       8.0   0.07     6.0
                  limsi1     18        0.20      17.3      9        0.28      24.0   0.22     19.0
          b       upc1        5        0.06       5.3      0        0.00      00.0   0.05     4.0
          12.7%   upc2        5        0.06       5.3      2        0.08       8.0   0.07     6.0
                  limsi1     20        0.24      22.7      8        0.27      24.0   0.25     23.0
          c       upc1        2        0.03       2.7      0        0.00      00.0   0.02     2.0
          13.7%   upc2        3        0.03       2.7      1        0.04       4.0   0.04     3.0


Table 12: Results for task T5b, Spanish EPPS, ASR transcripts (75 factual questions and 25
definitional ones).


Another observation concerns the loss of accuracy when dealing with different word error rates.
Generally speaking, a higher WER results in lower accuracy (e.g. from 30% for T4b ASR a to 20%
for T4b ASR b). Strangely enough, this is not completely true for the T5b task, where results for
ASR c (13.7% WER) are 4% higher than for ASR b (12.7% WER). Since these WERs are rather
close, it is probable that the ASR c errors had a smaller impact on the named entities present in
the questions.



6    Conclusions

In this paper, the QAST 2008 evaluation has been described. Five groups participated in this track
with a total of 49 submitted runs, across ten tasks that involved different types of speech (sponta-
neous or prepared), different languages (English, Spanish and French) and different word error
rates for the automatic transcriptions (from 10.6% to 35.4%). For the tasks where the word error
rate was low enough (around 10%), the loss in accuracy compared to manual transcriptions was
under 5%, suggesting that QA in such documents is potentially feasible. However, even where ASR
performance is reasonably good, there remain outstanding challenges in dealing with spoken lan-
guage and the differences from written language mentioned earlier. The results from the QAST
evaluation indicate that a QA system which performs well on manual transcriptions also performs
reasonably well on high-quality automatic transcriptions. Performance on spoken language has
not yet reached the levels achieved in the main QA track.



Acknowledgments

This work has been jointly funded by the Spanish Ministry of Science (TEXTMESS project) and
OSEO under the Quaero program.



References
[1] S. Galliano, E. Geoffrois, G. Gravier, J.F. Bonastre, D. Mostefa, and K. Choukri. Corpus
    description of the ESTER Evaluation Campaign for the Rich Transcription of French Broadcast
    News. In Proceedings of LREC’06, Genoa, 2006.
[2] G. Gravier, J.F. Bonastre, S. Galliano, E. Geoffrois, K. McTait, and K. Choukri. The
    ESTER evaluation campaign of Rich Transcription of French Broadcast News. In Proceedings
    of LREC’04, Lisbon, 2004.
[3] C. Peters, P. Clough, F.C. Gey, J. Karlgren, B. Magnini, D.W. Oard, M. de Rijke, and
    M. Stempfhuber, editors. Evaluation of Multilingual and Multi-modal Information Retrieval.
    Springer-Verlag, 2006.
[4] TC-Star. http://www.tc-star.org, 2004-2008.
[5] J. Turmo, P.R. Comas, C. Ayache, D. Mostefa, S. Rosset, and L. Lamel. Overview of QAST 2007.
    In C. Peters, V. Jijkoun, Th. Mandl, H. Müller, D.W. Oard, A. Peñas, V. Petras, and D. Santos,
    editors, 8th Workshop of the Cross Language Evaluation Forum (CLEF 2007). Revised Selected
    Papers. LNCS, 2008.
[6] E.M. Voorhees and L.L. Buckland, editors. The Fifteenth Text Retrieval Conference Proceed-
    ings (TREC 2006), 2006.