Overview of QAST 2007

Jordi Turmo(1), Pere Comas(1), Christelle Ayache(2), Djamel Mostefa(2), Sophie Rosset(3) and Lori Lamel(3)
(1) TALP Research Centre (UPC), Barcelona, Spain. {turmo,pcomas}@lsi.upc.edu
(2) ELDA/ELRA, Paris, France. {ayache,mostefa}@elda.org
(3) LIMSI, Paris, France. {rosset,lamel}@limsi.fr

Abstract

This paper describes QAST, a pilot track of CLEF 2007 aimed at evaluating the task of Question Answering in Speech Transcripts. The paper summarizes the evaluation framework, the participating systems and the results achieved. These results show that question answering technology can be useful for dealing with spontaneous speech transcripts, both for manually transcribed speech and for automatically recognized speech. The loss in accuracy observed when moving from manual transcripts to automatic ones indicates that there is room for future research in this area.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Experimentation, Performance, Measurement

Keywords

Question Answering, Spontaneous Speech Transcripts

1 Introduction

The task of Question Answering (QA) consists of providing short, relevant answers to natural language questions. Most QA research has focused on extracting information from text sources, providing the shortest relevant text in response to a question [4, 5]. For example, the correct answer to the question How many groups participate in the CHIL project? is 16, whereas the response to the question Who are the partners in CHIL? is a list of the partners. This simple example illustrates the two main advantages that QA has over current search engines: first, the input is a natural language question rather than a keyword query, and second, the answer provides the desired information content and not a potentially large set of documents or URLs that the user must plow through.

Most current QA systems handle independent questions and produce one answer to each question, extracted from textual data, for both open-domain and restricted-domain tasks. However, a large portion of human interaction involves spontaneous speech, e.g. meetings, seminars, lectures and telephone conversations, and is beyond the capabilities of current text-based factual QA systems. Most recent QA research has been undertaken by natural language processing groups, who have typically applied their techniques to written texts and assume that these texts have a correct syntactic and semantic structure. The grammatical structure of spoken language differs from that of written language, and some of the anchor points used in text processing, such as punctuation, must be inferred and are therefore error prone. Other spoken language phenomena include disfluencies, repetitions, restarts and corrections. When automatic processing is used to create the speech transcripts, an additional challenge is dealing with recognition errors. The lecture and interactive meeting data are particularly difficult due to run-on sentences (where the distance between the beginning of an utterance and its end can be very long) and interruptions. Therefore, current techniques for text-based QA need substantial adaptation in order to access the information contained in audio data.

This paper provides an overview of QAST, a pilot evaluation track at CLEF 2007 on Question Answering in Speech Transcripts.
Section 2 describes the principles of this evaluation track. Sections 3 and 4 present the evaluation framework and the participating systems, respectively. Section 5 reports the results achieved and their main implications. Finally, Section 6 concludes.

2 The QAST task

The objective of this pilot track is to provide a framework in which QA systems can be evaluated when the answers have to be found in spontaneous speech transcripts, both manual and automatic. The evaluation has three main objectives:

• Comparing the performance of the systems on the two types of transcripts.
• Measuring the loss of each system due to the inaccuracies of state-of-the-art ASR technology.
• Motivating and driving the design of novel and robust factual QA architectures for automatic speech transcripts.

In this evaluation, the QA systems have to return answers, found in the audio transcripts, to questions presented in written natural language. The answer is the minimal sequence of words that includes the correct exact answer in the audio stream. For the purposes of this evaluation, instead of pointers into the audio signal, the recognized words covering the location of the exact answer have to be returned. For example, consider the question Which organisation has worked with the University of Karlsruhe on the meeting transcription system? and the following extract of an automatically recognized document:

breath fw and this is , joint work between University of Karlsruhe and coming around so fw all sessions , once you find fw like only stringent custom film canals communicates on on fw tongue initials .

which corresponds to the following exact manual transcript:

uhm this is joint work between the University of Karlsruhe and Carnegie Mellon, so also here in these files you find uh my colleagues and uh Tanja Schultz.

The answer found in the manual transcript is Carnegie Mellon, whereas in the automatic transcript it is coming around. This example illustrates the two principles that guide this track:

• The questions are generated considering the exact information in the audio stream, regardless of how this information is transcribed, because the transcription process is transparent to the user.
• The answer to be extracted is the minimal sequence of words that includes the correct exact answer in the audio stream (i.e., in the manual transcripts).

In the above example, the answer to be extracted from the automatic transcript is coming around, because this word sequence gives the start/end pointers to the correct answer in the audio stream (a sketch of this time-based projection is given at the end of this section).

Four tasks have been defined for QAST:

• T1: QA in manual transcriptions of lectures.
• T2: QA in automatic transcriptions of lectures.
• T3: QA in manual transcriptions of meetings.
• T4: QA in automatic transcriptions of meetings.
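To make the second principle concrete, the following Python sketch shows one possible way to recover the word sequence to be returned for an automatic transcript: given the start and end times of the exact answer in the manual transcript, it collects the recognized words whose time spans overlap that interval. The sketch is purely illustrative and is not part of the official QAST tooling; the word-list format, the function name and the time values are assumptions made for this example.

    # Hypothetical sketch: project an answer span from the manual transcript
    # onto an ASR transcript using word-level time stamps.
    # Each transcript is assumed to be a list of (word, start_time, end_time)
    # tuples; this format is an illustrative assumption, not an official one.

    def project_answer(asr_words, answer_start, answer_end):
        """Return the recognized words whose time spans overlap the
        [answer_start, answer_end] interval of the exact answer."""
        span = [w for (w, s, e) in asr_words
                if e > answer_start and s < answer_end]
        return " ".join(span)

    # Toy data inspired by the Carnegie Mellon / "coming around" example above
    # (all time values are invented for illustration).
    asr = [("between", 12.1, 12.4), ("University", 12.5, 12.9),
           ("of", 12.9, 13.0), ("Karlsruhe", 13.0, 13.6),
           ("and", 13.6, 13.7), ("coming", 13.8, 14.1), ("around", 14.1, 14.5)]

    print(project_answer(asr, 13.8, 14.5))   # -> "coming around"

Under this view, a system working on automatic transcripts is judged on whether the words it returns point to the right region of the audio, not on whether they literally match the reference answer string.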
3 Evaluation protocol

3.1 Data collections

The data for the QAST pilot track consist of two different resources, one for the lecture scenario and the other for the meeting scenario:

• The CHIL corpus (http://chil.server.de): around 25 hours of lectures (around 1 hour per lecture), both manually and automatically transcribed. LIMSI produced the ASR transcriptions, with a word error rate (WER) of around 20% [2], while the manual transcriptions were done by ELDA. In addition, the set of lattices and confidence scores for each lecture has been provided. The domain of the lectures is speech and language processing, and the language is European English (mostly spoken by non-native speakers). The lectures have been provided with simple tags, and the seminars are formatted as plain text files (ISO-8859-1) [3].

• The AMI corpus (http://www.amiproject.org): around 100 hours (168 meetings), both manually and automatically transcribed. The University of Edinburgh produced the ASR transcripts, with a WER of around 38% [1]. The domain of these meetings is the design of a television remote control, and the language is European English. The meetings, like the lectures, have been provided with simple tags and are formatted as plain text files (ISO-8859-1).

3.1.1 Questions and answer types

For each of the two scenarios, two sets of questions were provided to the participants:

• Development set (1 February 2007):
  – Lectures: 10 seminars and 50 questions.
  – Meetings: 50 meetings and 50 questions.
• Evaluation set (18 June 2007):
  – Lectures: 15 seminars and 100 questions.
  – Meetings: 118 meetings and 100 questions.

The question sets are formatted as plain text files, with one question per line, as defined in the Guidelines (http://www.lsi.upc.edu/~qast). All the questions in the QAST task are factual questions whose expected answer is a Named Entity (person, location, organization, language, system, method, measure, time, color, shape or material). No definition questions have been proposed. The two data collections (the CHIL and AMI corpora) were first tagged with Named Entities. Then, an English native speaker created questions for each NE-tagged session, so each answer is a tagged Named Entity.

An answer is basically structured as an [answer-string, document-id] pair, where the answer-string contains nothing more than a complete and exact answer (a Named Entity) and the document-id is the unique identifier of a document that supports the answer. There are no particular restrictions on the length of an answer-string (which is usually very short), but unnecessary pieces of information are penalised, since the answer is then marked as non-exact. Assessors focus mainly on the responsiveness and usefulness of the answers.

3.2 Human judgement

The files submitted by participants have been manually judged by native-speaking assessors, who considered the correctness and exactness of the returned answers. They also checked that the document labelled with the returned docid supports the given answer. One assessor evaluated the results, and a second assessor then manually checked each judgement made by the first one. Any doubts about an answer were resolved through discussion. To evaluate the data, assessors used QASTLE (http://www.elda.org/qastle/), an evaluation tool developed in Perl at ELDA. A simple interface gives easy access to the question, the answer and the document associated with the answer, all in a single window.

For T2 and T4 (QA on automatic transcripts), the manual transcriptions were aligned to the automatic ASR outputs in order to locate the answers in the automatic transcripts. The alignments between the automatic and the manual transcriptions were done using time information for most of the seminars and meetings. Unfortunately, for some AMI meetings time information was not available and only word alignments were used.

After each judgement, the submission files have been modified: a new element, the answer's evaluation (or judgement), appears in the first column. The four possible judgements (also used at TREC [5]) correspond to a number ranging between 0 and 3:

• 0 correct: the answer-string consists of the relevant information (exact answer), and the answer is supported by the returned document.
• 1 incorrect: the answer-string does not contain a correct answer, or the answer is not responsive.
• 2 non-exact: the answer-string contains a correct answer and the docid supports it, but the string has bits of the answer missing or is longer than the required length of the answer.
• 3 unsupported: the answer-string contains a correct answer but the docid does not support it.

3.3 Measures

The following two metrics, also used in CLEF, have been used in the QAST evaluation; a small sketch of their computation is given after the definitions:

1. Mean Reciprocal Rank (MRR): measures, on average, how highly the correct answer (as defined in Section 2) is ranked in the list of up to 5 returned answers.
2. Accuracy: the fraction of questions for which a correct answer is ranked in the first position of the list of up to 5 returned answers.
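As an illustration of how these two measures can be computed from judged runs, the sketch below assumes that each question is associated with the ordered list of judgements of its ranked answers, using the codes defined in Section 3.2 (0 = correct, 1 = incorrect, 2 = non-exact, 3 = unsupported); only judgement 0 counts as correct. The data layout and the names used here are assumptions made for illustration and do not correspond to the official QASTLE output.

    # Hypothetical sketch of the two QAST metrics.
    # `judged` maps each question id to the list of judgements of its ranked
    # answers (at most 5), using the codes of Section 3.2; 0 means correct.

    def mrr(judged):
        """Mean Reciprocal Rank over all questions (0 when no correct answer)."""
        total = 0.0
        for codes in judged.values():
            for rank, code in enumerate(codes, start=1):
                if code == 0:
                    total += 1.0 / rank
                    break
        return total / len(judged)

    def accuracy(judged):
        """Fraction of questions whose first-ranked answer is judged correct."""
        hits = sum(1 for codes in judged.values() if codes and codes[0] == 0)
        return hits / len(judged)

    # Toy run with three questions and invented judgements.
    judged = {"q1": [0, 1, 1], "q2": [1, 0], "q3": [1, 1, 1, 1, 1]}
    print(mrr(judged))       # (1/1 + 1/2 + 0) / 3 = 0.5
    print(accuracy(judged))  # 1/3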
4 Submitted runs

A total of five groups from five different countries submitted results for one or more of the proposed QAST tasks. For various reasons (technical, financial, etc.), three other registered groups were not able to submit any results. The five participating groups are the following:

• CLT, Center for Language Technology, Australia;
• DFKI, Germany;
• LIMSI, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France;
• TOKYO, Tokyo Institute of Technology, Japan;
• UPC, Universitat Politècnica de Catalunya, Spain.

All five groups participated in both the T1 and T2 tasks (CHIL corpus), and three groups participated in both the T3 and T4 tasks (AMI corpus). Participants could submit up to 2 runs per task and up to 5 answers per question. The systems used in the submissions are described in Table 1. In total, 28 submissions were evaluated: 8 submissions from 5 participating sites for T1, 9 submissions from 5 sites for T2, 5 submissions from 3 participants for T3 and 6 submissions from 3 participants for T4. In the end, none of the participants used the lattices provided for task T2.

System | Enrichment | Question classification | Doc/Pass retrieval | Answer extraction | NERC
clt1 | words and NEs | hand-crafted patterns | pass. ranking based on word similarities between pass. and query | candidate ranking based on frequency and NER confidence | hand-crafted patterns, gazetteers and ME models
clt2 | as clt1 | as clt1 | as clt1 | as clt1 | as clt1, but no ME models
dfki1 | words and NEs | hand-crafted synt.-sem. rules | Lucene | candidate ranking based on frequency | gazetteers and statistical models, not tuned
limsi1 | words and NEs | hand-crafted patterns | pass. ranking based on hand-crafted back-off queries | candidate ranking based on frequency, keyword distance and retrieval confidence | hand-crafted patterns
limsi2 | as limsi1 | as limsi1 | cascaded doc/pass. ranking based on search descriptors | as limsi1 | as limsi1
tokyo1 | words | non-linguistic statistical model | pass. retrieval with interpolated doc/pass. statistical models | candidate ranking based on statistical multi-word models | no
tokyo2 | as tokyo1 | as tokyo1 | as tokyo1, with word classes added to the model | as tokyo1 | as tokyo1
upc1 | words, NEs, lemmas and POS | perceptrons | pass. ranking based on iterative query relaxation | candidate ranking based on keyword distance and density | hand-crafted patterns, gazetteers and perceptrons
upc2 | as upc1, plus phonetics | as upc1 | as upc1 | as upc1, plus approximated phonetic matching | as upc1

Table 1: Systems that participated in QAST

5 Results

The results for the four QAST tasks are presented in Tables 2, 3, 4 and 5. Due to problems with some questions (typos, wrong expected answer type), a few questions have been removed from the scoring in tasks T1, T2 and T3. In total, the results have been calculated on the basis of 98 questions for tasks T1 and T2, and 96 questions for T3.
In addition, due to missing word-level time information for some AMI meetings, seven questions have been removed from the scoring of T4; the results for this task have therefore been calculated on the basis of 93 questions.

System     #Questions  #Correct answers  MRR   Accuracy
clt1 t1    98          16                0.09  0.06
clt2 t1    98          16                0.09  0.05
dfki1 t1   98          19                0.17  0.15
limsi1 t1  98          43                0.37  0.32
limsi2 t1  98          56                0.46  0.39
tokyo1 t1  98          32                0.19  0.14
tokyo2 t1  98          34                0.20  0.14
upc1 t1    98          54                0.53  0.51

Table 2: Results for T1 (QA on CHIL manual transcriptions)

System     #Questions  #Correct answers  MRR   Accuracy
clt1 t2    98          13                0.06  0.03
clt2 t2    98          12                0.05  0.02
dfki1 t2   98          9                 0.09  0.09
limsi1 t2  98          28                0.23  0.20
limsi2 t2  98          28                0.24  0.21
tokyo1 t2  98          17                0.12  0.08
tokyo2 t2  98          18                0.12  0.08
upc1 t2    96          37                0.37  0.36
upc2 t2    97          29                0.25  0.24

Table 3: Results for T2 (QA on CHIL automatic transcriptions)

System     #Questions  #Correct answers  MRR          Accuracy
clt1 t3    96          31                0.23         0.16
clt2 t3    96          29                0.25         0.20
limsi1 t3  96          31                0.28         0.25
limsi2 t3  96          40                0.31         0.25
upc1 t3*   95          23 (27)           0.22 (0.26)  0.20 (0.25)

Table 4: Results for T3 (QA on AMI manual transcriptions). *Due to a bug in the output-format script, UPC asked the assessors to re-evaluate their single run for T3; the results in brackets must be regarded as a non-official run.

System     #Questions  #Correct answers  MRR   Accuracy
clt1 t4    93          17                0.10  0.06
clt2 t4    93          19                0.13  0.08
limsi1 t4  93          21                0.19  0.18
limsi2 t4  93          21                0.19  0.17
upc1 t4    91          22                0.22  0.21
upc2 t4    92          17                0.15  0.13

Table 5: Results for T4 (QA on AMI automatic transcriptions)

The results are very encouraging. First, the best accuracy achieved in the tasks involving manual transcripts (0.51 for task T1) is close to the two best results for factual questions in TREC 2006 (0.58 and 0.54), in which monolingual English QA was evaluated. Second, this behaviour is also observed on average: the average accuracy achieved across tasks T1 and T3 is 0.22, which is comparable to the 0.18 achieved in TREC 2006. Although no direct comparison between QAST and TREC is possible, because different data, questions and answer types were used, these observations show that QA technology can be useful for dealing with spontaneous speech transcripts. Finally, the average accuracy values are 0.22 and 0.15 for the tasks involving lectures (T1 and T2, respectively), and 0.21 and 0.14 for those involving meetings (T3 and T4, respectively). These values show that accuracy decreases on average by more than 36% when dealing with automatic transcripts. Reducing this gap has to be taken as a main goal of future research.

6 Conclusion

In this paper, we have described the QAST 2007 (Question Answering in Speech Transcripts) pilot track. Five groups participated in this track, with a total of 28 submitted runs across the four tasks. In general, the results achieved show, first, that QA technology can be useful for dealing with spontaneous speech transcripts and, second, that the loss in accuracy when dealing with automatically transcribed speech is high. These results are very encouraging and suggest that there is room for future research in this area. Future work aims at extending the evaluation framework to languages other than English, to oral questions, and to question types other than factual ones.

Acknowledgments

We are very grateful to Thomas Hain from the University of Edinburgh, who provided us with the AMI transcripts automatically generated by their ASR system.
This work has been jointly funded by the European Commission (CHIL project, IP-506909), the Spanish Ministry of Science (TEXTMESS project) and the LIMSI AI/ASP Ritel grant.

References

[1] T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, J. Vepa, and V. Wan. The AMI system for the transcription of meetings. In Proceedings of ICASSP'07, 2007.

[2] L. Lamel, G. Adda, E. Bilinski, and J.-L. Gauvain. Transcribing lectures and seminars. In Proceedings of Interspeech'05, 2005.

[3] D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. Chu, A. Tyagi, J. Casas, J. Turmo, L. Cristoforetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, L. Bernardin, and C. Rochet. The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. To appear in Language Resources and Evaluation Journal, 2007.

[4] C. Peters, P. Clough, F. C. Gey, J. Karlgren, B. Magnini, D. W. Oard, M. de Rijke, and M. Stempfhuber, editors. Evaluation of Multilingual and Multi-modal Information Retrieval. Springer-Verlag, 2006.

[5] E. M. Voorhees and L. L. Buckland, editors. The Fifteenth Text Retrieval Conference Proceedings (TREC 2006), 2006.