The LIMSI participation to the QAst 2009 track

Guillaume Bernard, Sophie Rosset, Olivier Galibert, Eric Bilinski, Gilles Adda
Spoken Language Processing Group, LIMSI-CNRS, B.P. 133, 91403 Orsay cedex, France
{firstname.lastname}@limsi.fr

Abstract

We present in this paper the three LIMSI question-answering systems on speech transcripts which participated in the QAst 2009 evaluation. These systems are based on a complete and multi-level analysis of both queries and documents. They use an automatically generated search descriptor, and a score based on these descriptors is used to select documents and snippets. Three different methods are tried to extract and score candidate answers, and we present in particular a tree transformation based ranking method. We participated in all the tasks and submitted 30 runs (for 24 sub-tasks). The evaluation results range from 27% to 36% accuracy on manual transcripts, depending on the task, and from 20% to 29% on automatic transcripts.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, speech transcriptions

1 Introduction

The Question Answering on Speech Transcripts track of the QA@CLEF task provides an opportunity to evaluate the specificity of speech transcriptions. In this paper, we present the work carried out on the QA system developed at LIMSI for the QAst evaluation, and we describe in particular an answer re-ranking method used in this system. For the QAst 2009 evaluation [7], three main tasks are defined:
– T1, QA in English European Parliament Plenary sessions
– T2, QA in Spanish European Parliament Plenary sessions
– T3, QA in French Broadcast News

In the previous QAst evaluations [6], the questions were created by the evaluators from the documents. This year, the objective was to have more spontaneous questions. Native speakers were asked to read fragments of documents and then to ask, orally, questions about information related to but not contained in these text fragments. For each task, four versions of the data collection were provided, consisting of one manual transcription and three different automatic transcriptions. Two different sets of questions were provided, one consisting of written questions and the other of manually transcribed semi-spontaneous oral questions. In total, a minimum of 8 runs were expected per task, for a total of 24.

LIMSI participated in all three tasks. Three systems were tested. Their main architecture is identical and they differ only in the answer scoring method:
– Distance-based answer scoring (primary method)
– Answer scoring through Bayesian modeling
– Tree transformation-based answer re-ranking

The first method is used on all three tasks, the second on the T1 and T2 tasks, and the third on the T3 task. Section 2 presents the common architecture and the answer scoring methods. Section 3 is split into three parts: the description of the training and development data (section 3.1), a quick evaluation of the difficulty of the task (section 3.2), and finally the results of the three systems on the development and test data (section 3.3). We compare these results to those obtained in the QAst 2008 evaluation.

2 The LIMSI QA systems

The common architecture is identical to the systems used in the previous evaluations and is fully described in [4].
The same complete and multilevel analysis is carried out on both queries and documents. To do so, the query and the documents (which may come from different modalities – text, manual transcripts, automatic transcripts) are transformed into a common representation. This normalization process converts raw texts to a form where words and numbers are unambiguously delimited, punctuation is separated from words, and the text is split into sentence-like segments. Case and punctuation are reconstructed using a fully cased, punctuated four-gram language model [1] applied to a word graph covering all the possible variants (all possible punctuations permitted between words, all possible word cases). The general objective of this analysis is to find the bits of information that may be of use for search and extraction, called pertinent information chunks. These can be of different categories: named entities, linguistic entities (e.g., verbs, prepositions), or specific entities (e.g., scores). All words that do not fall into such chunks are automatically grouped into chunks via a longest-match strategy. The full analysis comprises some 100 steps and takes roughly 4 ms on a typical user or document sentence. It identifies about 300 different types of entities and is hierarchical, resulting in a set of trees. Both answers and important elements of the questions are expected to be annotated as one of these entities.

The first step of the QA system itself is to build a search descriptor (SD) that contains the important elements of the question and the possible answer types with associated weights. Some elements are marked as critical, which makes them mandatory in future steps, while others are secondary. The element extraction and weighting is based on an empirical classification of the element types into importance levels. Answer types are predicted through rules based on combinations of elements of the question. Documents are selected using this SD. Each document is scored with the geometric mean of the number of occurrences of all the SD elements that appear in it; the documents are sorted by score and the n-best are kept. Snippets are extracted from the documents using fixed-size windows and scored using the geometric mean of the number of occurrences of all the SD elements that appear in the snippet, smoothed by the document score.

2.1 Distance-based answer scoring

In each snippet, all the elements whose type is one of the predicted possible answer types are candidate answers. A score S(r) is associated to each candidate answer r:

$$S(r) = C_d(r)^{\beta} \, C_p(r)^{\delta} \sum_{a \in A_r} \left( w(a) \max_{E_a} \sum_{(e,l) \in E_a} \frac{w(l)}{(1+d(e,a))^{\alpha}} \right)^{1-\gamma} S_p(a)^{\gamma}$$

where:
– w(l) = line weight
– w(a) = answer weight
– d(e,a) = element–answer distance
– E_a = set of SD elements for instance a
– A_r = set of instances of the answer candidate r
– S_p(a) = score of the snippet including a
– C_d(r) = instance count of r in the documents
– C_p(r) = instance count of r in the snippets
– α, β, γ, δ = tuning parameters

2.2 Answer scoring through Bayesian modeling

We tried a preliminary method of answer scoring built upon a Bayesian model of the process of estimating the quality of an answer candidate. This approach relies on multiple elementary models, including element co-occurrence probabilities, the probability of a question element appearing in the context of a correct answer, and the probability of an answer appearing out of context. This is very preliminary work.
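As an illustration of the distance-based scoring of Section 2.1, the following minimal Python sketch computes S(r) for one candidate answer. The data structures (one Occurrence per instance of the candidate, carrying the weights and distances of the nearby SD elements), the function names and the parameter values are simplifications introduced here for exposition, not the actual LIMSI implementation.

from dataclasses import dataclass, field

@dataclass
class Occurrence:
    """One occurrence (instance) of a candidate answer in a retrieved snippet."""
    answer_weight: float          # w(a)
    snippet_score: float          # Sp(a)
    # candidate groupings E_a of SD elements near this occurrence; each grouping is a
    # list of (element_weight, distance_to_answer) pairs, and the score keeps the best one
    element_groupings: list = field(default_factory=list)

def candidate_score(occurrences, doc_count, snippet_count,
                    alpha=1.0, beta=0.1, gamma=0.5, delta=0.1):
    """Distance-based score S(r) of a candidate answer r (simplified sketch)."""
    total = 0.0
    for occ in occurrences:
        # max over the possible element groupings of the distance-weighted element sum
        best = max(
            (sum(w_l / (1.0 + d) ** alpha for (w_l, d) in grouping)
             for grouping in occ.element_groupings),
            default=0.0,
        )
        total += (occ.answer_weight * best) ** (1.0 - gamma) * occ.snippet_score ** gamma
    # redundancy terms: counts of r in the selected documents and snippets
    return (doc_count ** beta) * (snippet_count ** delta) * total

# Toy usage: a candidate seen twice, each time near one grouping of SD elements.
occ1 = Occurrence(1.0, 0.8, [[(2.0, 3), (1.0, 7)]])
occ2 = Occurrence(1.0, 0.5, [[(2.0, 10)]])
print(candidate_score([occ1, occ2], doc_count=4, snippet_count=2))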
2.3 Tree transformation-based answer re-ranking

Our second approach for the T3 task is built upon the results of the primary system. As described in Section 2.1, the method for finding and extracting the best answer to a given question is based on redundancy and on distances between candidate answers and elements of the question. While this approach gives good results, it also has some limitations. Mainly, it does not take into account the structure of the snippet and the relations between the different critical elements detected. Relations between the elements of the text fragments are needed to represent the information stated in the documents and the questions. However, most systems use complex syntactic representations which are not well suited to oral fragments [2]. Some systems [5, 3] show that it is nevertheless possible to identify local syntactic and semantic relations by segmenting the documents into chunks and then detecting the relations between these chunks.

From these conclusions, we defined a re-ranking method which computes a score for each of the answers to a question. This method takes as input the question tagged by the analysis module, the answers found by the answer extraction module, and the best snippets associated with each answer. The analysis trees of the question and the snippets are segmented into chunks, and relations are added between these chunks. For each evaluated answer, the method compares the structure of the question with the snippet of the answer. The system tries to match the structure of the question by moving the chunks of the snippet that contain similar elements. The relations are used in these moves and allow the system to compute the score of the answer. This system uses two sub-modules, the segmentation and annotation module and the relation labelling module. The questions and the snippets are processed through these modules, and then the tree transformation system computes the similarity score of each answer.

2.3.1 Segmentation and annotation module

The definition of the segmentation formalism was guided by its use in the relation labelling module. We consider that verbs play an important role in the structure of a sentence. Therefore, we have defined two types of chunks: verbal chunks (VC) and general chunks (GC). The general chunks can be divided into several subtypes: temporal (TC), spatial (SC) and question markers (QMC). Below is an example of a segmented sentence, "The Ebola virus was identified in 1976":

"[GC] The Ebola virus [/GC] [VC] was identified [/VC] [TC] in 1976 [/TC]."

The segmentation and annotation module uses a Conditional Random Fields (CRF) based approach. Two models were generated: one for the documents and one for the questions. We used the following features: the output of the analysis module of the main architecture and a Part-of-Speech annotation. Two training corpora were used, one for the documents and one for the questions.

2.3.2 Relation labelling module

The aim of the relations is to represent the dependencies between the chunks of the questions and the chunks of the snippets. The relations are oriented and non-exclusive, i.e., there can be multiple relations between the same two chunks. For the moment, five relations are defined, which are described below.

Noun modifier relation: this relation represents the dependency between two chunks containing noun groups, as in the following sentence: "[GC] Steven Spielberg [/GC] [VC] is [/VC] [GC] the man [/GC] [GC] with the glasses [/GC]".
In this example, there is a noun modifier relation between "the man" and "with the glasses".

Verb to member relation: this relation represents the dependency between a verbal chunk and the chunks containing its members. The members of a verb are its subject and its objects. In the following sentence, "[GC] The Ebola virus [/GC] [VC] was identified [/VC] [TC] in 1976 [/TC].", there are two verb to member relations between the verbal chunk "was identified" and the two chunks "in 1976" and "The Ebola virus".

Member to verb relation: this type of relation is the same as the previous one, except that it goes from the member to the verb.

Temporal relation: this relation represents the dependency between a temporal chunk and another chunk. In the following sentence, "[GC] The Ebola virus [/GC] [VC] was identified [/VC] [TC] in 1976 [/TC].", there are two temporal relations between the temporal chunk "in 1976" and the two chunks "was identified" and "The Ebola virus".

Spatial relation: this type of relation is the same as the temporal relation, except that it concerns spatial chunks.

To label the relations between the chunks of the documents and the questions, we use a rule-based system. Each type of relation has an associated rule with the following parameters: the types of the chunks on which the rule applies, the types of chunks which can be in relation with them, the direction of the rule, and the context in which the rule applies. Here is an example of the rule for temporal relations:

temporal relation: {TC} {GC | VC | SC | QMC} {LEFT | RIGHT} {TC}

This rule means that we add a temporal relation between a chunk of TC type and a chunk of GC, VC, SC or QMC type. The target chunk can be either at the left or at the right of the temporal chunk. The relation is not allowed to cross over another temporal chunk.

2.3.3 Text transformation module

As said previously, before trying to transform the snippet into the question, the system finds the similarities between the chunks of the snippet and the chunks of the question. To find these similarities, we use the following information: lemma form, synonyms and morphological derivations. The system defines anchor points between comparable chunks. With these anchor points, the system transforms the snippet into the question by using three types of operations: inserting a chunk, deleting a chunk and substituting a chunk.

These operations are applied in a fixed order. First, the system generates one substitution operation for each anchor point and computes its cost, which depends on two values: the substitution cost and the displacement cost. The substitution cost is computed as the sum of a per-word cost for each important word which is not found in the question. Whether a word is deemed important depends on its type as given by the analysis module; for example, verbs and nouns are important but determiners are not. The per-word cost has been set empirically. The displacement is seen as a sequence of permutations between adjacent chunks, and each permutation has a cost depending on the relation between the two chunks and on their types. The system then finds the sequence of substitution operations with the lowest total cost, which results in a similar structure between the question and the snippet. To finish the transformation, the remaining chunks are deleted and the missing ones from the question are inserted. The sequence of operations with the smallest total cost measures the similarity between the question and the snippet, and by comparing these similarity scores a new ranking of the answers is computed.
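To make the transformation step more concrete, the sketch below computes a transformation cost between a segmented snippet and a segmented question in Python. It is a deliberately simplified illustration: the chunk typing, the stopword list, the anchor matching by shared content words and all the cost constants are assumptions made here for exposition. In particular, the actual system matches chunks using lemmas, synonyms and morphological derivations, uses empirically set per-word costs, makes permutation costs depend on the relations between chunks, and forbids permutations between verbal and general chunks rather than merely penalizing them.

SUB_WORD_COST = 1.0   # per important snippet word with no counterpart in the question
SWAP_COST = 0.5       # permutation of two adjacent, freely movable chunks
VERB_SWAP_COST = 5.0  # permutations involving a verbal chunk (forbidden in the real system)
DEL_COST = 1.0        # deleting a leftover snippet chunk
INS_COST = 1.0        # inserting a question chunk with no counterpart

STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "from", "my", "with", "what"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def find_anchors(snippet, question):
    """Map snippet chunk index -> question chunk index when they share content words."""
    anchors = {}
    for i, (s_text, _) in enumerate(snippet):
        for j, (q_text, _) in enumerate(question):
            if j in anchors.values():
                continue  # each question chunk anchors at most one snippet chunk
            if content_words(s_text) & content_words(q_text):
                anchors[i] = j
                break
    return anchors

def transformation_cost(snippet, question):
    """Cost of transforming the snippet chunk sequence into the question chunk sequence."""
    anchors = find_anchors(snippet, question)
    q_words = set()
    for text, _ in question:
        q_words |= content_words(text)

    cost = 0.0
    # substitution cost: important snippet words not found anywhere in the question
    for i in anchors:
        cost += SUB_WORD_COST * len(content_words(snippet[i][0]) - q_words)

    # displacement cost: adjacent permutations (counted as inversions) needed to put
    # the anchored snippet chunks in question order
    idx = sorted(anchors)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if anchors[idx[a]] > anchors[idx[b]]:
                verbal = "VC" in (snippet[idx[a]][1], snippet[idx[b]][1])
                cost += VERB_SWAP_COST if verbal else SWAP_COST

    # delete unanchored snippet chunks, insert question chunks with no counterpart
    cost += DEL_COST * (len(snippet) - len(anchors))
    cost += INS_COST * (len(question) - len(anchors))
    return cost

question = [("What country", "QMC"), ("is", "VC"), ("Annetta Flanigan", "GC"), ("from", "GC")]
snippet = [("One captive", "GC"), ("is", "VC"), ("Annetta Flanigan", "GC"),
           ("from my constituency", "GC"), ("of Northern Ireland", "GC")]
print(transformation_cost(snippet, question))  # lower cost = structure closer to the question

In the full system, the total cost of the best operation sequence is turned into a similarity score, and these scores are compared across candidate answers to produce the new ranking.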
Figure 1 shows an example of how the transformation works (the relations are not shown for clarity). We evaluate the answer "Northern Ireland" for the question "What country is Annetta Flanigan from?". The snippet of the answer is "One captive is Annetta Flanigan from my constituency of Northern Ireland". As can be seen in the figure, the system finds three anchor points between the chunks of the question and those of the text fragment; the colours show the anchors between the chunks. Using these anchors, the system generates the operations. For this example, a list of transformations could be:
– Moving chunk "Annetta Flanigan" next to chunk "of Northern Ireland"
– Deleting "One captive" and "from my constituency"

FIGURE 1 – Transformation example between the question "What country is Annetta Flanigan from?" and the snippet "One captive is Annetta Flanigan from my constituency of Northern Ireland" containing the answer "Northern Ireland"

We consider that verbs are important to the meaning and the structure of a sentence. Thus, we do not allow a permutation between a verbal chunk and a general chunk; that is why in this example we do not allow a permutation between "Northern Ireland" and "is". Also, we do not insert the preposition "from" because it is not a chunk of the question associated with a critical element of the search descriptor (SD). This sequence of operations allows the system to compute a new score for this answer.

3 Evaluation

3.1 Training and development data

Each main task had two development corpora of 50 questions each: one of manual transcriptions of spontaneous spoken questions and one of written reformulations of these questions. An overview of the different corpora used is given in Table 1; the numbers between parentheses indicate the number of documents in the corpus. As part of the training data, we used the corpus of reformulated questions we developed last year in addition to the official development corpus and the test data from QAst 2008.

Task | Off. Dev. | Supp. Dev.
T1   | 2×50 (6)  | 377
T2   | 2×50 (6)  | 317
T3   | 2×50 (18) | 450

TABLE 1 – The corpora. Off. Dev: the official development data; Supp. Dev: the QAst08 development and test data plus reformulated questions based on the QAst08 development data.

Since the second approach for the T3 task, the re-ranking method, does not yet give better results on the whole development corpus, we decided to use it only on certain types of questions. In Table 2, LIMSI1 is the distance-based method and LIMSI2-T3 is the re-ranking method. We found that the re-ranking method gets better results on questions with many search elements in the search descriptor. The corpus of questions evaluated in this table is the fusion of the official and the supplementary development corpora. We can see that on questions with at least 5 search elements, LIMSI2-T3 gets better results than the LIMSI1 system. Thus, we decided that the re-ranking approach would only be applied to questions with at least 5 search elements.
#E | LIMSI2-T3 (MRR / Acc / #Correct) | LIMSI1 (MRR / Acc / #Correct) | #Questions
1  | 0.62 / 48.6 / 53   | 0.71 / 67.0 / 73   | 109
2  | 0.56 / 42.2 / 73   | 0.66 / 61.8 / 107  | 173
3  | 0.74 / 67.4 / 145  | 0.79 / 77.9 / 166  | 215
4  | 0.72 / 65.5 / 74   | 0.79 / 77.9 / 88   | 113
5  | 0.73 / 65.5 / 38   | 0.71 / 60.3 / 35   | 58
6  | 0.85 / 85.7 / 18   | 0.81 / 76.0 / 16   | 21
7  | 0.60 / 60.0 / 3    | 0.60 / 60.0 / 3    | 5

TABLE 2 – Results on the development data classified by the number of search elements in the search descriptor, #E being the number of search elements and Acc being given in %. LIMSI2-T3 is the re-ranking method and LIMSI1 the distance-based method.

3.2 Task difficulty evaluation

As stated in Section 1, the procedure for building the question corpus changed this year. We tried to evaluate whether the difficulty of the task had changed as a result; mainly, we wanted to compare the development corpora of QAst08 and QAst09.

Therefore, for each question of the two corpora, we evaluated the distance between the elements of the question and the answer in the documents containing the correct answer. For each question, we computed four distance scores: the number of words, the number of nodes of the analysis module, the number of chunks and the number of sentences. Each score is an average of the distance of each element of the question from the answer. This evaluation was made on the corpus of the T3 main task (French corpus). Table 3 shows the results of this evaluation.

Corpus                            | Words (Mean / SD) | Nodes (Mean / SD) | Chunks (Mean / SD)
QAst09 written development corpus | 27 / 52           | 47 / 17           | 10 / 20
QAst09 spoken development corpus  | 28 / 52           | 47 / 22           | 10 / 20
QAst08 development corpus         | 14 / 20           | 13 / 23           | 5 / 7
QAst08 reformulated corpus        | 18 / 26           | 15 / 25           | 5 / 9

TABLE 3 – Mean and standard deviation (SD) of the distance between correct answers and elements of the questions, in various units.

This table shows some differences between the development corpora of QAst09 and QAst08. We see that the mean distance is roughly doubled in the QAst09 development corpus compared to the previous year. While such a difference is significant in absolute terms, we do not think that it by itself fully explains a large difference in task difficulty. We also need to analyse the impact of lexical variations between the questions and the content of the documents.

3.3 Results

3.3.1 General results on manual transcripts

The results for the three tasks on manually transcribed data are presented in Tables 4 to 6, with all the question types evaluated. For each task, two systems were used. There is also a difference between the LIMSI2 system in the T1 and T2 tasks (English and Spanish) and the LIMSI2 system in the T3 task (French). In each case, only the factual answer extraction procedure changes: in LIMSI1, it uses a scoring of all the candidates of appropriate types given the question classification; in LIMSI2 for the T1 and T2 tasks, it uses the Bayesian method described above, and in the T3 task the tree transformation re-ranking method. As stated before, the LIMSI2 system in the T3 task is not applied to all the questions, but only to the questions with many search elements.

System | Questions | MRR  | Acc | Recall
LIMSI1 | Written   | 0.36 | 27% | 53%
LIMSI1 | Spoken    | 0.33 | 23% | 45%
LIMSI2 | Written   | 0.32 | 23% | 45%
LIMSI2 | Spoken    | 0.27 | 19% | 41%

TABLE 4 – Results for task T1, English EPPS, manual transcripts, test 2009 (75 factual questions and 25 definitional ones).

System | Questions | MRR  | Acc   | Recall
LIMSI1 | Written   | 0.45 | 36.0% | 61%
LIMSI1 | Spoken    | 0.45 | 36.0% | 62%
LIMSI2 | Written   | 0.34 | 24.0% | 49%
LIMSI2 | Spoken    | 0.34 | 24.0% | 49%

TABLE 5 – Results for task T2, Spanish EPPS, manual transcripts, test 2009 (44 factual questions and 56 definitional ones).
System | Questions | MRR  | Acc   | Recall
LIMSI1 | Written   | 0.39 | 28.0% | 60%
LIMSI1 | Spoken    | 0.39 | 28.0% | 59%
LIMSI2 | Written   | 0.38 | 27.0% | 60%
LIMSI2 | Spoken    | 0.39 | 28.0% | 59%

TABLE 6 – Results for the T3 task, French Broadcast News, manual transcripts, test 2009 (68 factual questions and 32 definitional ones).

3.3.2 General results on automatic transcripts

The results obtained on the three tasks on automatically transcribed data are presented in Tables 7 to 9. With the automatic transcripts, only the LIMSI1 system is used.

ASR (WER)     | System | Questions | MRR  | Acc   | Recall
ASR_A (10.6%) | LIMSI1 | Written   | 0.31 | 26.0% | 42%
ASR_A (10.6%) | LIMSI1 | Spoken    | 0.30 | 25.0% | 41%
ASR_B (14.0%) | LIMSI1 | Written   | 0.25 | 21.0% | 32%
ASR_B (14.0%) | LIMSI1 | Spoken    | 0.25 | 21.0% | 33%
ASR_C (24.1%) | LIMSI1 | Written   | 0.24 | 21.0% | 31%
ASR_C (24.1%) | LIMSI1 | Spoken    | 0.24 | 20.0% | 33%

TABLE 7 – Results for task T1, English EPPS, automatic transcripts, test 2009 (75 factual questions and 25 definitional ones). The percentage given with each ASR condition is its word error rate.

ASR (WER)     | System | Questions | MRR  | Acc   | Recall
ASR_A (11.5%) | LIMSI1 | Written   | 0.32 | 27.0% | 42%
ASR_A (11.5%) | LIMSI1 | Spoken    | 0.31 | 26.0% | 41%
ASR_B (12.7%) | LIMSI1 | Written   | 0.29 | 25.0% | 37%
ASR_B (12.7%) | LIMSI1 | Spoken    | 0.29 | 25.0% | 37%
ASR_C (13.7%) | LIMSI1 | Written   | 0.28 | 23.0% | 37%
ASR_C (13.7%) | LIMSI1 | Spoken    | 0.28 | 24.0% | 37%

TABLE 8 – Results for task T2, Spanish EPPS, automatic transcripts, test 2009 (44 factual questions and 56 definitional ones). The percentage given with each ASR condition is its word error rate.

ASR (WER)     | System | Questions | MRR  | Acc   | Recall
ASR_A (11.0%) | LIMSI1 | Written   | 0.37 | 29.0% | 52%
ASR_A (11.0%) | LIMSI1 | Spoken    | 0.37 | 29.0% | 50%
ASR_B (23.9%) | LIMSI1 | Written   | 0.32 | 27.0% | 40%
ASR_B (23.9%) | LIMSI1 | Spoken    | 0.30 | 25.0% | 38%
ASR_C (35.4%) | LIMSI1 | Written   | 0.28 | 23.0% | 38%
ASR_C (35.4%) | LIMSI1 | Spoken    | 0.27 | 22.0% | 35%

TABLE 9 – Results for the T3 task, French Broadcast News, automatic transcripts, test 2009 (68 factual questions and 32 definitional ones). The percentage given with each ASR condition is its word error rate.

3.3.3 Analysis of the results

Tables 4 to 6 show a large gap between the recall and the accuracy of our systems. The LIMSI1 system gives a wrong answer on half of the questions for which the correct answer is among the candidate answers, and the situation is worse for the LIMSI2 system on the T1 and T2 tasks. The LIMSI2 system on the T3 task gives almost the same results as the LIMSI1 system, since it is applied only to a small set of questions, as stated previously; a study of the results of this system is presented below. Nevertheless, we can see that there are almost no differences between written and spoken questions. The LIMSI2 system on the T1 and T2 tasks is a preliminary version that gives interesting results, and we intend to improve it. The LIMSI2 system on the T3 task also still needs further work.

The results obtained on the three different automatic speech transcriptions, shown in Tables 7 to 9, are lower than the results on the manual transcriptions.

Table 10 shows the results obtained by LIMSI on each task, together with the best results of all the participants' systems (column Best) for each task. Except on T1 Manual and T1 ASR_A, LIMSI obtains the best results. It should be noted that we were the only participant in the T3 task.

Sub-Task | Question | T1 Acc | T1 Best | T2 Acc | T2 Best | T3 Acc | T3 Best
Manual   | Written  | 27.0%  | 28.0%   | 36.0%  | -       | 28.0%  | -
Manual   | Spoken   | 23.0%  | 26.0%   | 36.0%  | -       | 28.0%  | -
ASR_A    | Written  | 26.0%  | -       | 27.0%  | -       | 29.0%  | -
ASR_A    | Spoken   | 25.0%  | -       | 26.0%  | -       | 29.0%  | -
ASR_B    | Written  | 21.0%  | -       | 25.0%  | -       | 27.0%  | -
ASR_B    | Spoken   | 21.0%  | -       | 25.0%  | -       | 25.0%  | -
ASR_C    | Written  | 21.0%  | 25.0%   | 23.0%  | -       | 23.0%  | -
ASR_C    | Spoken   | 20.0%  | 25.0%   | 24.0%  | -       | 22.0%  | -

TABLE 10 – Results obtained by LIMSI on the QAst 2009 evaluation.

Table 11 shows the results obtained by each system for the manual sub-tasks of the T1 and T2 tasks. The evaluated corpora are the development and test corpora of QAst 2009, on both written and spoken questions, and the development and test corpora of QAst 2008, on written questions.
As stated before, the LIMSI1 system uses the distance-based approach, and the LIMSI2-T1 and LIMSI2-T2 systems the Bayesian approach.

Corpus    | T1 LIMSI1 (MRR / Acc) | T1 LIMSI2-T1 (MRR / Acc) | T2 LIMSI1 (MRR / Acc) | T2 LIMSI2-T2 (MRR / Acc)
W. Test09 | 0.36 / 27%            | 0.32 / 23%               | 0.45 / 36%            | 0.34 / 24%
S. Test09 | 0.33 / 23%            | 0.27 / 19%               | 0.45 / 36%            | 0.34 / 24%
W. Dev09  | 0.37 / 32%            | 0.21 / 10%               | 0.54 / 48%            | 0.37 / 26%
S. Dev09  | 0.39 / 34%            | 0.22 / 10%               | 0.52 / 45%            | 0.42 / 32%
W. Dev08  | 0.80 / 78%            | 0.59 / 50%               | 0.68 / 58%            | 0.57 / 42%
W. Test08 | 0.55 / 52%            | 0.38 / 32%               | 0.62 / 56%            | 0.52 / 44%

TABLE 11 – Results obtained by each system on the manual T1 and T2 sub-tasks (W. = written questions, S. = spoken questions).

Table 12 compares the results on the T3 task between the LIMSI1 system and the LIMSI2-T3 system. We show two sets of results for the LIMSI2 system: those obtained when all the questions of the corpus are re-ranked (LIMSI2-T3), and those obtained when only the questions with 5 or more search elements are re-ranked (LIMSI2-T3-SE). As we can see, there is a huge drop between the QAst08 corpora and the test and development corpora of QAst09. One reason for these results could be the new methodology used to build the question corpora: as stated in section 3.1, the distances between the elements of the question and the answer are greater in the development corpus of QAst09. The greater distance between an answer and its associated question elements does not seem to be the only cause, however; in particular, we expect lexical variations between the question elements and their form in the documents to also play a significant role.

Table 12 also shows that re-ranking only the questions with 5 or more search elements allows the LIMSI2-T3-SE system to get almost the same results as the LIMSI1 system, and that it even gets better results on the development corpus from QAst 2008. While these results are interesting, this approach, as stated before, still needs to be improved.

Corpus    | LIMSI1 (MRR / Acc) | LIMSI2-T3 (MRR / Acc) | LIMSI2-T3-SE (MRR / Acc)
W. Test09 | 0.39 / 28%         | 0.24 / 18%            | 0.38 / 27%
S. Test09 | 0.39 / 28%         | 0.24 / 17%            | 0.39 / 28%
W. Dev09  | 0.44 / 40%         | 0.25 / 16%            | 0.44 / 40%
S. Dev09  | 0.44 / 36%         | 0.26 / 18%            | 0.42 / 34%
W. Dev08  | 0.81 / 76%         | 0.68 / 58%            | 0.85 / 80%
W. Test08 | 0.57 / 50%         | 0.50 / 40%            | 0.57 / 50%

TABLE 12 – Results obtained by each system on the manual T3 task (W. = written questions, S. = spoken questions). LIMSI2-T3 re-ranks all questions; LIMSI2-T3-SE re-ranks only the questions with 5 or more search elements.

We examined the questions with 5 or more search elements which were re-ranked by the LIMSI2-T3 system. Of the ten questions of the written question corpus with that many search elements, six did not have the correct answer among the candidate answers and one was a NIL question. Of the remaining three, one was answered correctly by both systems, one was answered correctly by LIMSI1 but not by LIMSI2-T3, and the correct answer to the last question was not found by either system.

4 Conclusion

In this paper, we presented the LIMSI question-answering systems on speech transcripts which participated in the QAst 2009 evaluation. These systems obtained state-of-the-art results on the different tasks and languages, with an accuracy ranging from 27% for the English data to 36% for the Spanish data. The results of the T1 and T3 systems show a significant loss compared to the 2008 evaluation (6% for T1 and 17% for T3 in accuracy) in spite of the improvements of the systems. This can be explained by the new methodology used to build the question corpora. A deeper analysis is ongoing to understand the origins of this loss.

Acknowledgments

This work has been partially financed by OSEO under the Quaero program.

References

[1] D. Déchelotte, H. Schwenk, G. Adda, and J.-L. Gauvain. Improved machine translation of speech-to-text outputs. Antwerp, Belgium, 2007.
[2] P. Paroubek, A. Vilnat, B. Grau, and C. Ayache. EASY, evaluation of parsers of French: what are the results? In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 2480–2486, Marrakech, Morocco, 2008.
[3] S. Pradhan, W. Ward, K. Hacioglu, J. H. Martin, and D. Jurafsky. Semantic role labeling using different syntactic views. Pages 581–588, Ann Arbor, USA, 2005.
[4] S. Rosset, O. Galibert, G. Bernard, E. Bilinski, and G. Adda. The LIMSI participation to the QAst track. In Working Notes of the CLEF 2008 Workshop, Aarhus, Denmark, September 2008.
[5] T. Sakai, Y. Saito, Y. Ichimura, M. Koyama, T. Kokubu, and T. Manabe. ASKMi: A Japanese question answering system based on semantic role analysis. In Proceedings of RIAO 2004, Avignon, 2004.
[6] J. Turmo, P. Comas, L. Lamel, S. Rosset, N. Moreau, and D. Mostefa. Overview of QAst 2008 – Question answering on speech transcriptions. In CLEF 2008 Workshop, Aarhus, Denmark, 2008.
[7] J. Turmo, P. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, and D. Buscaldi. Overview of QAst 2009 – Question answering on speech transcriptions. In CLEF 2009 Workshop, Corfu, Greece, 2009. To appear.