=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-QACLEF-RossetEt2007
|storemode=property
|title=The LIMSI Participation in the QAst Track
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-QACLEF-RossetEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/RossetGAB07a
}}
==The LIMSI Participation in the QAst Track==
The LIMSI participation to the QAst track

Sophie Rosset, Olivier Galibert, Gilles Adda, Eric Bilinski
Spoken Language Processing Group, LIMSI-CNRS, B.P. 133, 91403 Orsay cedex, France
{firstname.lastname}@limsi.fr

Abstract

In this paper, we present two different question-answering systems on speech transcripts which participated in the QAst 2007 evaluation. These two systems are based on a complete and multi-level analysis of both queries and documents. The first system uses handcrafted rules for small text fragment (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The evaluation results range from 17% to 39% accuracy depending on the task.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Measurement, Performance, Experimentation

Keywords
Question answering, speech transcriptions of meetings and lectures

1 Introduction

In the QA and Information Retrieval domains, progress has been demonstrated via evaluation campaigns for both open and limited domains [1, 2, 3]. In these evaluations, systems are presented with independent questions and should provide one answer extracted from textual data to each question. Recently, there has been growing interest in extracting information from multimedia data such as meetings and lectures. Spoken data is different from textual data in various ways. The grammatical structure of spontaneous speech is quite different from written discourse and includes various types of disfluencies. The lecture and interactive meeting data provided in the QAst evaluation are particularly difficult due to run-on sentences and interruptions.

Most QA systems use a complete and heavy syntactic and semantic analysis of both the question and the documents or snippets returned by a search engine in which the answer has to be found. Such an analysis can't reliably be performed on the data we are interested in. Typical textual QA systems are composed of question analysis, information retrieval and answer extraction components [1, 4]. The answer extraction component is quite complex and involves natural language analysis, pattern matching and sometimes even logical inference [5]. Most of these natural language tools are not designed to handle spoken phenomena.

In this paper, we present the architecture of the two QA systems developed at LIMSI for the QAst evaluation. Our QA systems are part of an interactive and bilingual (English and French) QA system called Ritel [6] which specifically addresses speed issues. The following sections present the document and query pre-processing and the non-contextual analysis which are common to both systems. Section 3 describes the older system (System 1). Section 4 presents the new system (System 2). Section 5 finally presents the results for these two systems on both development and test data.

2 Analysis of documents and queries

Usually, the syntactic/semantic analysis is different for the document and for the query; our approach is to perform the same complete and multilevel analysis on both queries and documents. There are several reasons for this.
First of all, the system has to deal with both transcribed speech (transcriptions of meetings and lectures, user utterances) and text documents, so there should be a common analysis that takes into account the specificities of both data types. Moreover, incorrect analyses due to the lack of context or to limitations of hand-coded rules are likely to happen on both data types, so using the same strategy for document and utterance analysis helps to reduce their negative impact. In order to use the same analysis module for all kinds of data, we transform the query and the documents, which may come from different modalities (text, manual transcripts, automatic transcripts), into a common representation of the sentence, word, etc. This process is the normalization.

2.1 Normalization

Normalization, in our application, is the process by which raw texts are converted to a text form where words and numbers are unambiguously delimited, punctuation is separated from words, and the text is split into sentence-like segments (or as close to sentences as is reasonably possible). Different normalization steps are applied, depending on the kind of input data; these steps are:

1. Separating words and numbers from punctuation.
2. Reconstructing correct case for the words.
3. Adding punctuation.
4. Splitting into sentences at period marks.

In the QAst evaluation, four data types are of interest:

• CHIL lectures [7] with manual transcriptions, where manual punctuation is separated from words. Only the splitting step is needed.
• CHIL lectures with automatic transcriptions [8]. These require adding punctuation and splitting.
• AMI meetings [9] with manual transcriptions. The transcriptions had been textified, with punctuation joined to the words, the first word of each sentence upper-cased, etc. They require all the steps except adding punctuation.
• AMI meetings with automatic transcriptions [10]. Lacking case, they require the last 3 steps.

Reconstructing the case and adding punctuation are done in the same process, based on a fully-cased, punctuated language model [11]. A word graph is built covering all the possible variants (all possible punctuation marks added between words, all possible word cases), and a 4-gram language model is used to select the most probable hypothesis. The language model was estimated on the House of Commons Daily Debates, the final edition of the European Parliament Proceedings and various newspaper archives. The final result, with upper case only on proper nouns and words clearly separated by white-space, is then passed to the non-contextual analysis.
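As a concrete illustration of steps 1 and 4, the following minimal Python sketch separates punctuation from words and splits the resulting token stream at sentence-final marks. It is only an illustration of the idea, not the LIMSI normalization code; the regular expression and the punctuation sets are assumptions, and the language-model-based case and punctuation restoration of steps 2 and 3 is not shown.

```python
import re

PUNCT = r"[.,;:!?%]"

def separate_punctuation(text: str) -> str:
    """Step 1: surround punctuation marks with white-space so they become tokens."""
    text = re.sub(rf"\s*({PUNCT})\s*", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(tokens: list[str]) -> list[list[str]]:
    """Step 4: split the token stream into sentence-like segments at period marks."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "?", "!"}:
            sentences.append(current)
            current = []
    if current:                      # trailing segment without a final period
        sentences.append(current)
    return sentences

if __name__ == "__main__":
    raw = "The error rates are about 15%.They were reported by NIST."
    tokens = separate_punctuation(raw).split()
    for sent in split_sentences(tokens):
        print(" ".join(sent))
```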
2.2 Non-contextual analysis module

The analysis is considered non-contextual because each sentence is processed in isolation. The general objective of this analysis is to find the bits of information that may be of use for search and extraction, which we call pertinent information chunks. These can be of different categories: named entities, linguistic entities (e.g. verbs, prepositions), or specific entities (e.g. scores). All words that do not fall into such chunks are automatically grouped into chunks via a longest-match strategy. Some examples of pertinent information chunks are given in Figure 1. In the following sections, the types of entities handled by the system are described, along with how they are recognized.

_prep in  _org NIST  _NN metadata evaluations  _verb reported  _NN speaker tracking  _score error rates  _aux are  _prep about  _val_score 15 %

Figure 1: Examples of pertinent information chunks from the CHIL data collection.

2.2.1 Definition of entities

Following commonly adopted definitions, the named entities are expressions that denote locations, people, companies, times, and monetary amounts. These entities have commonly known and accepted names. For example, while the country France is a named entity, the capital of France is not. However, our experience is that the information present in the named entities is not sufficient to analyze the wide range of user utterances that can be found in lecture or meeting transcripts. Therefore we defined a set of specific entities in order to collect all observed information expressions contained in a corpus of questions and texts from a variety of sources (proceedings, transcripts of lectures, dialogs, etc.). Figure 2 summarizes the different entity types that are used.

classical named entities:
  pers: Romano Prodi ; Winston Churchill
  prod: Pulp Fiction ; Titanic
  time: third century ; 1998 ; June 30th
  org: European Commission ; NATO
  loc: Cambridge ; England
extended named entities:
  method: HMM ; Gaussian mixture model
  event: the 9th conference on speech communication and technology
  amount: 500 ; two hundred and fifty thousand
  measure: year ; mile ; Hertz
  color: red ; spring green
question markers:
  Qpers: who wrote... ; who directed Titanic
  Qloc: where is IBM
  Qmeasure: what is the weight of the blue spoon headset
linguistic chunks:
  compound: language processing ; information technology
  verb: Roberto Martinez now knows the full size of the task
  adj_comp: the microphones would be similar to ...
  adj_sup: the biggest producer of cocoa of the world

Figure 2: Examples of the main entity types.

2.2.2 Automatic detection of typed entities

The types we need to detect correspond to two levels of analysis: named-entity recognition and chunk-based shallow parsing. Various strategies for named-entity recognition using machine learning techniques have been proposed [12, 13, 14]. In these approaches, statistically pertinent coverage of all defined types and subtypes requires a large number of occurrences, and they therefore rely on the availability of large annotated corpora, which are difficult to build. Rule-based approaches to named-entity recognition (e.g. [15]) rely on morphosyntactic and/or syntactic analysis of the documents. However, in the present work, performing this sort of analysis is not feasible: the speech transcriptions are too noisy to allow for accurate and robust linguistic analysis based on typical rules, and the processing time of most existing linguistic analyzers is not compatible with the high speed we require. We decided to tackle the problem with rules based on regular expressions on words, as in other work [16]: we allow the use of lists for initial detection, and the definition of local contexts and simple categorizations.

The tool used to implement the rule-based automatic annotation system is called Wmatch. This engine matches (and substitutes) regular expressions using words as the base unit instead of characters. This property allows for a more readable syntax than traditional regular expressions and enables the use of classes (lists of words) and macros (sub-expressions in-lined in a larger expression). Wmatch also includes NLP-oriented features such as strategies for prioritizing rule application, recursive substitution modes, word tagging (for tags like noun, verb...), and word categories (number, acronym, proper name...). It has multiple input and output formats, including an XML-based one for interoperability and to allow chaining instances of the tool with different rule sets. Rules are pre-analyzed, optimized in several ways, and stored in a compact format in order to speed up the process. Analysis is multi-pass, and subsequent rule applications operate on the results of previous rule applications, which can be enriched or modified. The full analysis comprises some 50 steps and takes roughly 4 ms on a typical user utterance (or document sentence). The analysis provides 96 different types of entities. Figure 3 shows an example of the analysis on a query (top) and on a transcription (bottom).

<_Qorg> which organization  <_action> provided  <_det> a  <_NN> significant amount  <_prep> of  <_NN> training data  <_punct> ?
<_pro> it  <_verb> 's  <_adv> just  <_prep_comp> sort of  <_det> a  <_NN> very pale  <_color> blue  <_conj> and  <_det> a  <_adj> light-up  <_color> yellow  <_punct> .

Figure 3: Example annotation of a query, "which organization provided a significant amount of training data ?" (top), and of a transcription, "it's just sort of a very pale blue" (bottom).
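Wmatch itself is not shown here, but the core idea of a pattern engine whose base unit is the word rather than the character can be illustrated with the short sketch below. The class names, rules and entity types are illustrative assumptions, not the actual LIMSI rule set, and real Wmatch rules are considerably richer (macros, substitutions, rule priorities, multi-pass application).

```python
# Minimal word-level pattern matcher in the spirit of a word-based regular
# expression engine: patterns are sequences of word tests, and classes are
# named lists of words. All names and rules below are illustrative only.

CLASSES = {
    "org_name": {"NIST", "IBM", "NATO"},
    "unit": {"%", "percent", "Hertz", "mile"},
}

# A rule is (entity_type, pattern); each pattern element is either a literal
# word, "@class_name", or "#NUM" for a numeric token.
RULES = [
    ("org", ["@org_name"]),
    ("val_score", ["#NUM", "@unit"]),
]

def token_matches(element: str, token: str) -> bool:
    if element.startswith("@"):
        return token in CLASSES[element[1:]]
    if element == "#NUM":
        return token.replace(".", "", 1).isdigit()
    return token.lower() == element.lower()

def annotate(tokens: list[str]) -> list[tuple[str, int, int]]:
    """Return (entity_type, start, end) spans, preferring the longest match."""
    spans, i = [], 0
    while i < len(tokens):
        best = None
        for etype, pattern in RULES:
            n = len(pattern)
            if i + n <= len(tokens) and all(
                token_matches(p, t) for p, t in zip(pattern, tokens[i:i + n])
            ):
                if best is None or n > best[2] - best[1]:
                    best = (etype, i, i + n)
        if best:
            spans.append(best)
            i = best[2]
        else:
            i += 1
    return spans

print(annotate("error rates are about 15 %".split()))   # [('val_score', 4, 6)]
```

The longest-match preference at each position mirrors the longest-match grouping strategy described above.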
3 Question-Answering System 1

The question-answering system handles search in documents of any type (news articles, web documents, transcribed broadcast news, etc.). For speed reasons, the documents are all available locally and preprocessed: they are first normalized, and then analyzed with the NCA module. The (type, value) pairs are then managed by a specialized indexer for quick search and retrieval. This somewhat bag-of-typed-words system [6] works in three steps:

1. Document query list creation. Using the entities found in the question, we generate a document query and an ordered list of handcrafted back-off queries. These queries are obtained by relaxing some of the constraints on the presence of the entities, using a relative importance ordering (Named entity > NN > adj_comp > action > subs ...).

2. Snippet retrieval: we submit each query, according to its rank, to the indexation server, and stop as soon as we get document snippets (sentences or small groups of consecutive sentences) back.

3. Answer extraction and selection: the expected answer type has been extracted beforehand from the question, using co-occurrences of question markers, named, non-specific and extended entities (_Qwho → _pers or _pers_def or _org). We therefore select the entities in the snippets that carry the expected answer type. Finally, a clustering of the candidate answers is done, based on frequencies, as sketched below. The most frequent answer wins, and the distribution of the counts gives an idea of the confidence of the system in the answer.
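The frequency-based selection of step 3 can be summarized by the following sketch. It is a simplified reading of the description above with hypothetical data structures (snippets represented as lists of (type, value) pairs), not the actual System 1 implementation.

```python
from collections import Counter

def select_answer(snippets, expected_types):
    """Pick the most frequent candidate of the expected answer type(s) among the
    retrieved snippets; the count distribution serves as a rough confidence."""
    counts = Counter(
        value
        for snippet in snippets                 # each snippet: list of (type, value) pairs
        for etype, value in snippet
        if etype in expected_types
    )
    if not counts:
        return None, 0.0
    answer, count = counts.most_common(1)[0]
    return answer, count / sum(counts.values())  # confidence: share of the winning candidate

snippets = [
    [("pers", "Romano Prodi"), ("org", "European Commission")],
    [("org", "European Commission"), ("org", "NATO")],
]
print(select_answer(snippets, {"org"}))          # ('European Commission', 0.666...)
```

Note that ties between equally frequent candidates are broken arbitrarily here, which anticipates one of the weaknesses discussed in the next section.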
4 Question-Answering System 2

System 1 has three main problems:

• The back-off query lists require a large amount of maintenance work and will never cover all of the combinations of entities which may be found in the questions.
• The answer selection uses only frequencies of occurrence, often ending up with lists of first-rank candidate answers sharing the same score.
• The answering speed directly depends on the number of snippets to retrieve, which may sometimes be very large. Limiting the number of snippets is not easy, as they are not ranked according to pertinence.

A new system, System 2, has been designed to solve these problems. We have kept the three steps described in Section 3, with some major changes. In step 1, instead of instantiating document queries from a large number of preexisting handcrafted rules (about 5000), we generate a research descriptor using a very small set of rules (about 10); this descriptor contains all the needed information about the entities and the answer types, together with weights. In step 2, a score is calculated from the proximity between the research descriptor and the documents and snippets, in order to choose the most relevant ones. In step 3, the answer is selected according to a score which takes into account many different features and tuning parameters, which allows an automatic and efficient adaptation.

4.1 Research descriptor generation

The first step of System 2 is to build a research descriptor (data descriptor record, DDR) which contains the important elements of the question and the possible answer types with associated weights. Some elements are marked as critical, which makes them mandatory in future steps, while others are secondary. The element extraction and weighting is based on an empirical classification of the element types into importance levels. Answer types are predicted through rules based on combinations of elements of the question. Figure 4 shows an example of a DDR.

question: in which company Bart works as a project manager ?
ddr:
  { w=1, critical, pers, Bart },
  { w=1, critical, NN, project manager },
  { w=1, secondary, action, works },
answer_type = {
  { w=1.0, type=orgof }, { w=1.0, type=organisation }, { w=0.3, type=loc },
  { w=0.1, type=acronym }, { w=0.1, type=np },
}

Figure 4: Example of a DDR constructed from the question "in which company Bart works as a project manager"; each element contains a weight w, its importance for future steps, and the pair (type, value); each possible answer type contains a weight w and the type of the answer.

4.2 Documents and snippets selection and scoring

Each document is scored with the geometric mean of the numbers of occurrences of all the DDR elements which appear in it. Using a geometric mean prevents rescaling problems due to some elements being naturally more frequent than others. The documents are sorted by score and the n best ones are kept. The speed of the entire system can be controlled by choosing n, the whole system being in practice io-bound rather than cpu-bound. The selected documents are then loaded, and all the lines within a predefined window (2-10 lines depending on the question type) around the critical elements are kept, creating snippets. Each snippet is scored using the geometric mean of the numbers of occurrences of all the DDR elements which appear in the snippet, smoothed with the document score.

4.3 Answer extraction, scoring and clustering

In each snippet, all the elements whose type is one of the predicted possible answer types are candidate answers. We associate to each candidate answer A a score S(A):

S(A) = \frac{\left[ w(A) \sum_{E} \max_{e=E} \frac{w(E)}{(1+d(e,A))^{\alpha}} \right]^{1-\gamma} \times S_{snip}^{\gamma}}{C_d(A)^{\beta} \, C_s(A)^{\delta}}    (1)

in which:

• d(e, A) is the distance between the candidate A and each element e of the snippet instantiating a search element E of the DDR;
• C_s(A) is the number of occurrences of A in the extracted snippets, and C_d(A) the number of occurrences in the whole document collection;
• S_snip is the extracted snippet score (see 4.2);
• w(A) is the weight of the answer type and w(E) the weight of the element E in the DDR;
• α, β, γ and δ are tuning parameters estimated by systematic trials on the development data, with α, β, γ ∈ [0, 1] and δ ∈ [−1, 1].

An intuitive explanation of the formula is that each element E of the DDR adds to the score of the candidate (the sum over E) proportionally to its weight w(E) and inversely proportionally to its distance to the candidate d(e, A). If multiple instances of the element are found in the snippet, only the best one is kept (the max over e = E). The score is then smoothed with the snippet score S_snip and compensated in part by the candidate frequency in the whole document collection (C_d) and in the snippets (C_s). The scores for identical (type, value) pairs are added together and give the final scoring for all the possible candidate answers.
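A direct implementation of formula (1) could look like the following sketch. The data structures (snippets as lists of (type, value, position) triples) and the default parameter values are placeholders chosen for illustration, not the tuned values or the actual System 2 code.

```python
def score_candidate(candidate, snippet, ddr, s_snip, c_doc, c_snip, w_answer,
                    alpha=0.5, beta=0.5, gamma=0.5, delta=0.0):
    """Score one candidate answer following formula (1).

    candidate: (type, value, position) triple of the candidate answer A
    snippet:   list of (type, value, position) triples found in the snippet
    ddr:       list of (weight, type, value) search elements of the DDR
    s_snip:    score of the snippet the candidate comes from (S_snip)
    c_doc:     occurrences of the candidate in the whole collection (C_d)
    c_snip:    occurrences of the candidate in the extracted snippets (C_s)
    w_answer:  weight w(A) of the predicted answer type the candidate instantiates
    """
    a_pos = candidate[2]
    proximity = 0.0
    for w_e, e_type, e_value in ddr:
        # distance-weighted contributions of every instantiation of this DDR element
        contribs = [
            w_e / (1.0 + abs(pos - a_pos)) ** alpha
            for t, v, pos in snippet
            if t == e_type and v == e_value
        ]
        if contribs:
            proximity += max(contribs)   # keep only the best (closest) instantiation
    numerator = (w_answer * proximity) ** (1.0 - gamma) * s_snip ** gamma
    return numerator / (c_doc ** beta * c_snip ** delta)
```

Scores obtained for candidates sharing the same (type, value) pair would then be summed to produce the final candidate ranking, as described above.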
5 Evaluation

In this section, we present the results obtained on the four tasks. The T1 and T2 tasks were composed of an identical set of 98 questions; the T3 task was composed of a different set of 96 questions, and the T4 task of a subset of 93 questions. Table 1 shows the overall results with the 3 measures used in this evaluation. We submitted two runs, one for each system, for each of the four tasks. As required by the evaluation procedure, a maximum of 5 answers per question was provided.

Task  System  Acc.   MRR   Recall
T1    Sys1    32.6%  0.37  43.8%
      Sys2    39.7%  0.46  57.1%
T2    Sys1    20.4%  0.23  28.5%
      Sys2    21.4%  0.24  28.5%
T3    Sys1    26.0%  0.28  32.2%
      Sys2    26.0%  0.31  41.6%
T4    Sys1    18.3%  0.19  22.6%
      Sys2    17.2%  0.19  22.6%

Table 1: General results. Sys1: System 1; Sys2: System 2; Acc. is the accuracy, MRR the Mean Reciprocal Rank, and Recall the total number of correct answers found in the 5 returned answers.

Globally, we can see that System 2 gets better results than System 1. The improvement in Recall (9-11%) observed on the T1 and T3 tasks for System 2 illustrates that the automatic generation of document/snippet queries greatly improves coverage as compared to handcrafted rules. System 2 did not perform better than System 1 on the T2 task; further analysis is needed to understand why.

The different modules we can evaluate are the analysis module, the passage retrieval and the answer extraction. The passage retrieval is easier to evaluate for System 2 because it is a completely separate module, which is not the case in System 1. Table 2 gives the results for passage retrieval in two conditions: with the number of passages limited to 5, and without limitation. The difference between the Recall on the snippets (how often the answer is present in the selected snippets) and the QA accuracy shows that the extraction and the scoring of the answer have a reasonable margin for improvement. The difference between the snippet Recall and its accuracy (from 26 to 38% in the no-limit condition) illustrates that the snippet scoring can be improved.

       Passage limit = 5        Passage without limit
Task   Acc.   MRR   Recall      Acc.   MRR   Recall
T1     44.9%  0.52  67.3%       44.9%  0.53  71.4%
T2     29.6%  0.36  46.9%       29.6%  0.37  57.0%
T3     30.2%  0.37  47.9%       30.2%  0.38  68.8%
T4     18.3%  0.22  31.2%       18.3%  0.24  51.6%

Table 2: Results for passage retrieval for System 2. "Passage limit = 5": a maximum of 5 passages is kept; "Passage without limit": there is no limit on the number of passages; Acc. is the accuracy, MRR the Mean Reciprocal Rank, and Recall the total number of correct answers found in the returned answers.
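For reference, the three measures reported in Tables 1 and 2 can be computed from ranked answer lists as in the minimal sketch below; it follows the standard definitions of accuracy, mean reciprocal rank and recall at a cutoff, and is not the official QAst scorer (the input format is an assumption).

```python
def evaluate(ranked_answers, gold_sets, max_rank=5):
    """Compute accuracy, MRR and recall for ranked answer lists.

    ranked_answers: one ranked list of returned answers per question
    gold_sets:      one set of correct answers per question
    """
    n = len(ranked_answers)
    accuracy = mrr = recall = 0.0
    for answers, correct in zip(ranked_answers, gold_sets):
        answers = answers[:max_rank]
        if answers and answers[0] in correct:
            accuracy += 1                      # correct answer at rank 1
        for rank, answer in enumerate(answers, start=1):
            if answer in correct:
                mrr += 1.0 / rank              # reciprocal rank of first correct answer
                recall += 1                    # a correct answer appears in the top max_rank
                break
    return accuracy / n, mrr / n, recall / n

runs = [["NATO", "IBM"], ["blue"], ["HMM", "GMM", "SVM"]]
golds = [{"IBM"}, {"blue"}, {"neural network"}]
print(evaluate(runs, golds))   # (0.333..., 0.5, 0.666...)
```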
One of the key uses of the analysis results is routing the question, that is, determining a rough class for the type of the answer (language, location, ...). The results of the routing component are given in Table 3, with details by answer category. Two questions of T1/T2 and three of T3/T4 were not routed.

We observed large differences with the results obtained on the development data, in particular for the method, color and time categories. The analysis module has been built on corpus observations and seems to be too dependent on the development data. That can explain the absence of major differences between System 1 and System 2 for the T1/T2 tasks. Most of the wrongly routed questions have been routed to the generic answer type class. In System 1 this class selects specific entities (method, models, system, language...) over the other entity types for the possible answers. In System 2 no such adaptation to the task has been done, and all possible entity types have equal priority.

                     All   LAN    LOC   MEA   MET   ORG   PER   TIM   SHA   COL   MAT
T1/T2  % Correct     72%   100%   89%   75%   17%   95%   89%   80%    -     -     -
T1/T2  # Questions   98    4      9     28    18    20    9     10     -     -     -
T3/T4  % Correct     80%   100%   93%   83%    -    85%   80%   71%   89%   73%   50%
T3/T4  # Questions   96    2      14    12     -    13    15    14    9     11    6

Table 3: Routing evaluation. All: all questions; LAN: language; LOC: location; MEA: measure; MET: method/system; ORG: organization; PER: person; TIM: time; SHA: shape; COL: colour.

6 Conclusion and future work

We presented the question answering systems used for our participation in the QAst evaluation. Two different systems have been used for this participation. The two main changes between System 1 and System 2 are the replacement of the large set of hand-made rules by the automatic generation of a research descriptor, and the addition of an efficient scoring of the candidate answers. The results show that System 2 outperforms System 1. The main reasons are:

1. Better genericity through the use of a kind of expert system to generate the research descriptors.
2. More pertinent answer scoring using proximities, which allows a smoothing of the results.
3. The presence of various tuning parameters, which enables the adaptation of the system to the various question and document types.

These systems have been evaluated on different data corresponding to different tasks. On the manually transcribed lectures, the best result is 39% accuracy; on manually transcribed meetings, 24% accuracy. No specific effort was made on the automatically transcribed lectures and meetings, so those performances only give an idea of what can be done without trying to handle speech recognition errors. The best result is 18.3% on meetings and 21.3% on lectures. From the analysis presented in the previous section, performance can be improved at every step. For example, the analysis and routing component can be improved in order to better take into account some types of questions, which should improve the answer typing and extraction. The scoring of the snippets and the candidate answers can also be improved. In particular, some tuning parameters (like the weight of the transformations generated in the DDR) have not been optimized yet.

7 Acknowledgments

This work was partially funded by the European Commission under the FP6 Integrated Project IP 506909 CHIL and the LIMSI AI/ASP Ritel grant.
References

[1] E. M. Voorhees, L. P. Buckland (eds.). The Fifteenth Text REtrieval Conference Proceedings (TREC 2006). 2006.
[2] B. Magnini, D. Giampiccolo, P. Forner, C. Ayache, P. Osenova, A. Penas, V. Jijkoun, B. Sacaleanu, P. Rocha, R. Sutcliffe. Overview of the CLEF 2006 Multilingual Question Answering Track. Working Notes for the CLEF 2006 Workshop. 2006.
[3] C. Ayache, B. Grau, A. Vilnat. Evaluation of question-answering systems: The French EQueR-EVALDA Evaluation Campaign. Proceedings of LREC'06, Genoa, Italy. 2006.
[4] S. Harabagiu, D. Moldovan. Question-Answering. In The Oxford Handbook of Computational Linguistics, R. Mitkov (ed.). Oxford University Press. 2003.
[5] S. Harabagiu, A. Hickl. Methods for using textual entailment in open-domain question-answering. Proceedings of COLING'06, Sydney, Australia. July 2006.
[6] B. van Schooten, S. Rosset, O. Galibert, A. Max, R. op den Akker, G. Illouz. Handling speech input in the Ritel QA dialogue system. Proceedings of Interspeech'07, Antwerp, Belgium. August 2007.
[7] CHIL Project. http://chil.server.de
[8] L. Lamel, G. Adda, E. Bilinski, J.-L. Gauvain. Transcribing Lectures and Seminars. Proceedings of InterSpeech'05, Lisbon, Portugal. September 2005.
[9] AMI Project. http://www.amiproject.org
[10] T. Hain, L. Burget, J. Dines, G. Garau, M. Karafiat, M. Lincoln, J. Vepa, V. Wan. The AMI Meeting Transcription System: Progress and Performance. Rich Transcription 2006 Spring (RT06s) Meeting Recognition Evaluation, Bethesda, Maryland, USA. 3 May 2006.
[11] D. Déchelotte, H. Schwenk, G. Adda, J.-L. Gauvain. Improved Machine Translation of Speech-to-Text outputs. Proceedings of Interspeech'07, Antwerp, Belgium. August 2007.
[12] D. M. Bikel, S. Miller, R. Schwartz, R. Weischedel. Nymble: a high-performance learning name-finder. Proceedings of ANLP'97, Washington, USA. 1997.
[13] H. Isozaki, H. Kazawa. Efficient Support Vector Classifiers for Named Entity Recognition. Proceedings of COLING'02, Taipei. 2002.
[14] M. Surdeanu, J. Turmo, E. Comelles. Named Entity Recognition from spontaneous Open-Domain Speech. Proceedings of InterSpeech'05, Lisbon, Portugal. 2005.
[15] F. Wolinski, F. Vichot, B. Dillet. Automatic Processing of Proper Names in Texts. Proceedings of EACL'95, Dublin, Ireland. 1995.
[16] S. Sekine. Definition, dictionaries and tagger of Extended Named Entity hierarchy. Proceedings of LREC'04, Lisbon, Portugal. 2004.