The LIA at QA@CLEF-2006

Laurent Gillard, Laurianne Sitbon, Eric Blaudez, Patrice Bellot and Marc El-Bèze
LIA, University of Avignon
339 ch. des Meinajaries, BP 1228, F-84911 Avignon Cedex 9, France
{laurent.gillard, laurianne.sitbon, eric.blaudez, patrice.bellot, marc.elbeze}@univ-avignon.fr

Abstract

This article presents the first participation of the Laboratoire Informatique d'Avignon (LIA) in the Cross Language Evaluation Forum (CLEF). LIA participated in the monolingual Question Answering (QA) track dedicated to French and in the cross-lingual English-to-French QA track. Two runs were submitted for each track. English questions were first translated and then answered with the monolingual French system. This QA system (QAS) had already participated in the French Technolangue QA campaign (EQueR), but several components remained to be evaluated: a definition question answering module, as well as two modules developed specifically for CLEF, namely the integration of the Lucene search engine and a redundancy-based re-ranking of factoid answer candidates. CLEF-QA provided an opportunity to evaluate these modules. An English version of the QAS was also started, although only the Question Analysis module was adapted to English this year. The generic factoid QAS extracts answer candidates in the form of Named Entities (NE) using keyword density measures. A few factoid answers were also provided by a knowledge base module. Lastly, definition questions were answered by a module based on the detection of frequent appositive informational nuggets appearing near the focus to be defined. The system obtained reasonable results in all runs, except for temporal and list questions, which were not recognized as such and were wrongly handled as simple factoid ones.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Questions beyond factoids, Definition questions, Keyword density metrics, Multilingual question answering.

1 Introduction

Question Answering (QA) systems aim at retrieving precise answers to questions expressed in natural language, rather than lists of documents that may contain an answer. They have been studied intensively since 1999, when the first large-scale QA evaluation campaign was held as a track of the Text REtrieval Conference [10]. Since 2003, the Cross Language Evaluation Forum (CLEF) has studied multilingual QA issues and has provided an evaluation platform for QA in many languages.

This is the first participation of the Laboratoire Informatique d'Avignon (LIA) in CLEF (we had already participated in monolingual English QA at TREC-11 and in monolingual French QA at the EQueR Technolangue campaign [3]). This year, LIA participated in two tracks: monolingual French and English to French. For both submissions, the main system was essentially the same and was inherited from the one built for our EQueR participation [5]. Moreover, CLEF-2006 allowed us to evaluate several additions: an English question analysis module, the integration of Lucene as the search engine, a module to handle definition questions, and a module experimenting with redundancy-based re-ranking of answer candidates. Two runs were submitted for each track; the second run of each pair used the redundancy re-ranking module.

Our QA system (QAS) follows the typical QAS architecture and consists of pipelined main components.
A Question Analysis step (described in Section 2) is performed first, to extract the semantic type(s) of the expected answer(s) and the keywords, but also to decide which subsystem, factoid or definition, must be used. The factoid QAS (Section 3) performs Document Retrieval (Section 3.2) to restrict the amount of data processed by the following components, Passage Retrieval (Section 3.3) to choose the best answering passages from the documents, and finally Answer Extraction to determine the best answer candidate(s) drawn from the previously selected passages. This extraction mainly relies on a density measure (Section 3.4.1) over the keywords appearing around an answer candidate, but may also involve knowledge bases (Section 3.4.2). The definition subsystem (Section 4) uses frequent appositive information nuggets appearing near the focus to define it. We also experimented briefly with redundancy to re-rank answer candidates, but this module turned out to be broken. Lastly, the results of our participation are presented and discussed, along with perspectives for future improvements.

2 Question Analysis

Definition questions are first recognized with a simple pattern matching process, which also extracts the focus of the definition. All questions that do not fit these definition patterns are considered factual questions, including list questions (consequently, list questions were wrongly answered as factoid questions with a single instance; this behaviour must be corrected).

The analysis of factual questions then consists of two main independent steps: question classification and keyword extraction. For questions written in English, keyword extraction is done after the whole question has been translated by Google Translator.1

2.1 Expected Answer Types Classification

The answer nugget expected for a factoid question is assumed to be a Named Entity (NE) from Sekine's hierarchy [8], so questions are classified according to this hierarchy. For French questions, this classification is done with patterns and rules; for English questions only, it is complemented with semantic decision trees (a machine learning approach introduced by [6] and [1]). Semantic decision trees are adapted, as described in [2], to automatically extract generic patterns from the questions of a learning corpus. The CLEF Multinine corpus, composed of 900 questions, was used for learning. Since all its questions are available in both English and French, the French questions were first automatically classified with the French rule-based module, then checked manually, and finally paired with their English translations to form the learning corpus. The learning features are words, part-of-speech tags,2 lemmas, and the number of words.

1 http://www.google.fr/language_tools
2 All the part-of-speech tags and lemmas used in our QAS are obtained with the help of the TreeTagger [9].

Language   OK    Wrong   Unknown
French     86%    2%     12%
English    78%   13%      5%

Table 1: Evaluation of the classification process on the CLEF-2006 questions

Table 1 shows the results of a manual post-evaluation of the classification of the CLEF-2006 questions. Wrong and Unknown tags prevent the downstream components from extracting the answer. In French, only 8 of the 19 questions tagged Unknown were actually classifiable (as Person name or Company name; the others are difficult to map to Sekine's taxonomy). Because of the pipelined architecture, the percentage of OK tags sets an upper bound on the score our QAS can finally achieve. Some of the correct classifications in English are actually more generic than they could be; for example, President is tagged as Person.

After questions have been assigned to classes of the hierarchy, these classes are mapped to the available Named Entities (NE). However, the question hierarchy is much more fine-grained than the set of named entities our NE system is able to recognize.
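To make the rule-based side of this classification concrete, the following is a minimal sketch in Python. The patterns and class labels are simplified illustrations chosen for this example, not the actual rule set or the Sekine classes used by the system.

```python
import re

# Hypothetical, simplified rules in the spirit of the pattern-based classifier
# described above; the real system uses a much larger rule set mapped onto
# Sekine's extended named entity hierarchy.
CLASSIFICATION_RULES = [
    (re.compile(r"^(qui|who)\b", re.I), "PERSON"),
    (re.compile(r"^(où|where)\b", re.I), "LOCATION"),
    (re.compile(r"^(quand|when)\b|\ben quelle année\b", re.I), "DATE"),
    (re.compile(r"^(combien|how many|how much)\b", re.I), "NUMEX"),
    (re.compile(r"\b(quelle? (ville|capitale)|which (city|capital))\b", re.I), "CITY"),
]

def classify_question(question: str) -> str:
    """Return the expected answer type of a question, or UNKNOWN."""
    for pattern, answer_type in CLASSIFICATION_RULES:
        if pattern.search(question):
            return answer_type
    return "UNKNOWN"

if __name__ == "__main__":
    print(classify_question("Qui a inventé le téléphone ?"))   # PERSON
    print(classify_question("Where is the Eiffel Tower?"))     # LOCATION
```

In the full system, the class obtained this way is then mapped onto the NE types the recognizer can actually detect, as discussed above.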
2.2 Keywords Extraction

We call keywords all the words or expressions extracted from a question that are likely to appear near the answer. Keywords can be single words, named entities, or noun phrases; in all cases, their lemmatized forms are used. Keyword extraction is performed on the French questions since, at this step, the English questions have already been translated. Lemmas are obtained with the help of the TreeTagger. The keyword set is composed only of nouns, adjectives, and verbs.

Noun phrases may help in ranking passages when they appear in them. For example, in question Q0178, the expression "world champion" is more significant than "world" and "champion" taken separately. Such expressions are extracted in French with the help of the SxPipe deep parser described in [7]. Named Entities occurring in the question are detected as described in Section 3.1.

3 Answering factoid questions

Factoid questions were answered with the QA system (QAS) built for our EQueR participation. This QAS mainly relies on density measures to select answers, which are sought as Named Entities (NE) paired with the expected answer types. Two new modules were developed for this year's participation: Lucene was integrated as the Document Retrieval (DR) search engine (for EQueR, as for TREC, top-document lists were provided, so this DR step was optional), and we also experimented with redundancy to re-rank candidate answers.

3.1 Named Entity Recognition

Named Entity detection is one of the key elements of our QAS: each answer to be provided must first be located (and bounded) as a semantic information nugget in the form of a Named Entity. Our named entity hierarchy is a subset of Sekine's taxonomy [8]. NE detection is performed with automata, most of them implemented within the GATE platform, and by direct mapping of many gazetteers gathered from the Web.

3.2 Document Retrieval

The Document Retrieval (DR) step aims at identifying documents that are likely to contain an answer to the question posed and thus restricts the search space for the following steps. Indexing and retrieval were done with the Lucene search engine and its default similarity scoring. Each document of the collection is indexed as a whole (without any pre-processing); only lemmas are indexed, after stop-listing based on their TreeTagger part-of-speech. Disjunctive queries are formulated from the question keywords. No query-word relaxation was done for the CLEF experiments. At retrieval time, at most the first 30 documents (an empirically fixed limit) returned by Lucene are considered and passed to the Passage Retrieval component.
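To illustrate this step, here is a small self-contained sketch of disjunctive keyword retrieval with a 30-document cut-off. A toy in-memory index and a plain overlap count stand in for Lucene and its default similarity; the document contents, identifiers, and function names are invented for the example.

```python
from collections import defaultdict

# Toy stand-in for the Lucene indexing/retrieval step described above:
# documents are reduced to lemma lists (here pre-lemmatised by hand), a
# disjunctive query is built from the question keywords, and at most the
# 30 best documents are kept. Lucene's actual similarity is more elaborate;
# the keyword-overlap count below is only an illustrative placeholder.
MAX_DOCS = 30

def build_index(documents: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each lemma to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, lemmas in documents.items():
        for lemma in lemmas:
            index[lemma].add(doc_id)
    return index

def retrieve(index: dict[str, set[str]], keywords: list[str]) -> list[str]:
    """Disjunctive retrieval: rank documents by the number of matched keywords."""
    scores = defaultdict(int)
    for kw in keywords:
        for doc_id in index.get(kw, set()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:MAX_DOCS]

if __name__ == "__main__":
    docs = {"d1": ["champion", "monde", "football"],
            "d2": ["élection", "président", "france"]}
    print(retrieve(build_index(docs), ["champion", "monde"]))  # ['d1']
```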
3.3 Passage Retrieval

Since our first participation in a QA exercise [2], our passage retrieval approach has changed from a conventional cosine-based similarity to a density measure. Our passage retrieval component considers a question as a set of several kinds of items: lemmas, Named Entity tags, and expected answer types. First, a density score s is computed for each occurrence o_w of each item w in a given document d. This score measures how far the question items are from one another in the document, so the process focuses on the areas where the question items are most frequent. It takes into account the number of distinct items |w| in the question, the number of question items |w,d| occurring in the document d, and a distance \mu(o_w) that represents the average number of items from o_w to the other question items in d (in case of multiple occurrences of an item, only the occurrence nearest to o_w is considered). Let s(o_w, d) be the density score of o_w in document d:

    s(o_w, d) = \frac{\log\left[\mu(o_w) + (|w| - |w,d|) \cdot p\right]}{|w|}

where p is an empirically fixed penalty. The score of each sentence S is the maximum density score over the question items it contains:

    s(S, d) = \max_{o_w \in S} s(o_w, d)

Passages are then composed of at most three sentences: the previous sentence (if it exists), the sentence S, and the following one (if it exists). At most the 1000 best passages are then passed to the Answer Extraction component.

3.4 Answer Extraction

3.4.1 Answer extraction by using a density metric

To choose the best answer to a question, another density score is computed, inside the previously selected passages, for each answer candidate, i.e. each Named Entity adequately paired with the expected answer type of the current question. This density score (called compacity) is centred on each candidate and involves the keywords extracted from the question at the question analysis step (let QSet be this set).

The assumption behind the compacity score is that the best answer candidate is closely surrounded by the important words of the question. Any word not appearing in the question may weaken the link between a candidate answer and the question it is supposed to answer. Moreover, in QA, term frequencies are not as useful as in Document Retrieval: an answer word may appear only once, and there is no guarantee that question words will be repeated in the passage, particularly in the sentence containing the answer. A possible improvement would be to incorporate inverse document frequencies or linguistic features for the words that are not in QSet, in order to better account for variations in closeness.

For each answer candidate AC_i, the compacity score is computed as follows:

    compacity(AC_i) = \frac{\sum_{y \in QSet} p_{y_n, AC_i}}{|QSet|}

where y_n is the occurrence of the keyword y nearest to AC_i, and:

    p_{y_n, AC_i} = \frac{|W|}{2R + 1}, \quad R = distance(y_n, AC_i), \quad W = \{ z \in QSet \mid distance(z, AC_i) \le R \}

For the two runs submitted this year, only single-word keywords were considered for inclusion in QSet. Further experiments are planned with compound words, named entities, and noun phrases.

All answer candidates are then ranked by the product of the density score of the passage containing them and their compacity score. The N best-scoring distinct answers are provided as final answers; N was equal to 1 for the CLEF-2006 experiments.
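As an illustration, the following sketch computes the compacity score defined above over a tokenised passage, with distances measured in token positions. It is a simplified reading of the formula (keywords absent from the passage contribute zero), not the exact implementation used in the submitted runs.

```python
# A minimal sketch of the compacity score defined above, over a tokenised
# passage. Positions are token indices and distance is the absolute
# difference of indices; keywords absent from the passage contribute 0.
def compacity(tokens: list[str], candidate_pos: int, qset: set[str]) -> float:
    # Nearest occurrence of each QSet keyword to the candidate.
    nearest = {}
    for pos, tok in enumerate(tokens):
        if tok in qset:
            dist = abs(pos - candidate_pos)
            if tok not in nearest or dist < nearest[tok]:
                nearest[tok] = dist
    score = 0.0
    for y, r in nearest.items():
        # W: keywords whose nearest occurrence lies within radius R of the candidate.
        w = [z for z, dz in nearest.items() if dz <= r]
        score += len(w) / (2 * r + 1)
    return score / len(qset) if qset else 0.0

if __name__ == "__main__":
    passage = "le champion du monde de football est le Brésil".split()
    # Candidate answer "Brésil" at index 8; keywords extracted from the question.
    print(compacity(passage, 8, {"champion", "monde", "football"}))
```

In the system, this value would then be multiplied by the density score of the enclosing passage to produce the final ranking, as described above.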
3.4.2 Answer extraction by using knowledge databases

For some questions, the answers are essentially invariant over time. This is the case, for example, for questions asking about the capital of a country, the author of a book, or famous past events. To answer such questions, one may use static knowledge databases (KDB). We had built such KDBs for our TREC-11 participation [2] and translated them into French equivalents for our EQueR participation (they were particularly tuned on the CLEF-2004 questions). For CLEF-2006, we used the same KDB module unchanged.

This module provides answer patterns, which are also used to assess the reliability of our Passage Retrieval and of the compacity-based extraction. Table 2 shows, for the FR-FR runs from CLEF-2004 to CLEF-2006, the coverage of these databases, the number of passages coming from PR that match a KDB pattern, and the number of right supported answers ("ne" stands for "not evaluated"). The EN-FR results were lower due to translation errors (9 or 7 right answers).

FR-FR runs                     2004    2005    2006
Questions covered (/200)         51      26      24
Passages matching a pattern      48      ne      11
Right answers                 38-40      ne      11

Table 2: Knowledge database coverage

3.5 Re-ranking using a redundancy criterion

One drawback of extracting answers with our compacity measure is that the excessive closeness of a NE (of the adequate expected answer type) to other interesting keywords may lead to a wrong extraction, particularly when the passage contains many occurrences of those keywords (since both the passage density score and the candidate compacity score are then high). To mitigate this flaw, we experimented with a redundancy criterion to re-rank the answer candidates. However, we did not observe significant improvements: the number of wrong answers turned into right ones was offset by the number of right answers turned into wrong ones (12R vs. 15W for FR-FR and 9R vs. 9W for EN-FR). Further analysis showed that this was due to a bug in the weighted-vote mechanism we used.

4 Answering definition questions

In our EQueR participation [5], most definition questions were left unanswered by the processing chain described in the previous sections. Indeed, this chain mainly relies on detecting an entity paired with a question type, but for definition questions such a pairing is far more difficult, as the expected answer can be anything qualifying the focus of the question (even if, when the definition of a person is asked for, the answer sought is often their main occupation, which may constitute a common Named Entity, it is still tricky to recognize all possible occupations or reasons why someone may be "famous"). One can also note that while the TREC-QA campaign requires all vital nuggets to be retrieved, for the EQueR and CLEF exercises retrieving a single vital nugget is sufficient. The problem we wanted to address was therefore to find one of the best definitions available in the corpora.

We thus developed a simple approach, based on appositives and redundancy, to deal with these questions; it was bundled into an independent component, as it was too different from our regular QA chain:

• The focus to define is extracted from the question and is used to filter the corpora, keeping all the sentences that contain it.

• Then, appositive nuggets matching patterns such as <focus , nugget ,>, <nugget , focus ,>, or <definite article ("le" or "la") nugget focus> are sought (where "," denotes a comma).

• The nuggets are divided into two sets: the first and preferred one contains the nuggets that can be mapped, using their TreeTagger part-of-speech tags, onto a minimal noun phrase structure, while the second set contains all the others (it can be seen as a kind of last-chance definition set).

• Both sets are ordered by nugget frequency in the corpora. The "noun phrase" set is additionally ordered by taking into account the frequency of each nugget's head noun, the number of its nouns that are among the most frequent nouns over all retrieved nuggets, and its length. All these frequency measures aim at selecting what is expected to be "the most common informative definition".

• The best nugget of the first set is answered, or the best nugget of the second set if the first is empty, or NIL by default.

This component was also in charge of acronym/expanded-acronym definitions, since the same syntactic punctuation clues can be used for these questions, such as the frequent patterns <acronym ( expanded acronym )> or <expanded acronym ( acronym )>, where "(" and ")" denote parentheses. It performed very well, as all the answers of this year's set were retrieved (even though, due to a bug in the re-matching process between answer and justification, "TDRS"/Q0145 was ultimately not answered, although the correct answer had been found).
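The following sketch illustrates the appositive-based extraction described above, restricted to the two comma-delimited patterns and to plain frequency ranking; the third pattern, the noun-phrase filtering with TreeTagger tags, and the combined frequency criteria are omitted, and the example sentences are invented.

```python
import re
from collections import Counter

# A minimal sketch of the appositive-based definition extraction described
# above: sentences containing the focus are scanned for comma-delimited
# nuggets, and the most frequent nugget is returned. The real component also
# handles the definite-article pattern, distinguishes noun-phrase nuggets,
# and combines several frequency criteria.
def define(focus: str, sentences: list[str]) -> str | None:
    patterns = [
        # <focus , nugget ,>
        re.compile(rf"{re.escape(focus)}\s*,\s*([^,]+),"),
        # <nugget , focus ,>  (nugget taken as the comma-delimited chunk before the focus)
        re.compile(rf",\s*([^,]+),\s*{re.escape(focus)}"),
    ]
    nuggets = Counter()
    for sentence in sentences:
        if focus not in sentence:
            continue
        for pattern in patterns:
            for match in pattern.finditer(sentence):
                nuggets[match.group(1).strip().lower()] += 1
    if not nuggets:
        return None  # would be answered as NIL
    return nuggets.most_common(1)[0][0]

if __name__ == "__main__":
    corpus = [
        "Bill Clinton, président américain, a visité Paris.",
        "Le voyage de Bill Clinton, président américain, s'achève demain.",
    ]
    print(define("Bill Clinton", corpus))  # "président américain"
```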
5 Results

Two runs for monolingual French (FR-FR) and two runs for cross-lingual English to French (EN-FR) were submitted. As required, only one answer per question was provided. For steps such as DR, PR, or re-ranking, a deeper analysis will be carried out once all the participants' correct answers become available after the CLEF workshop. We did, however, evaluate one of the main bottlenecks of our pipelined QAS: the missing pairing between adequate NEs and expected answer types. Because of it, the QAS could not extract more than 131 answers for the factual and temporally restricted FR-FR questions. On the 200 test questions, our best run (FR-FR1) provided 93 right answers (88 + 5 list answers). Table 3 presents the breakdown for our two runs without re-ranking (our second runs, which used re-ranking, were buggy); the + or - marks correspond to answers that we think were misclassified.

runs     Right   Fact.     Def.   Temp.   R NIL   NIL answered   ineXact   Unsupported
FR-FR1    88     56 (+1)    32      0       2         30          7 (-1)        2
EN-FR1    67     40 (+1)    27      0       5         34          7 (-1)        2

Table 3: CLEF-QA 2006 results for our best FR-FR and EN-FR runs

Knowledge databases: The best contribution of the knowledge databases was +11 right answers; however, without the KDB, 5 of these correct answers would also have been found by the "generic" QAS. The net best KDB contribution is therefore +6R.

IneXact answers: Examining all the inexact answers over our 4 runs, 8 of 20 were Date answers and 3 of 20 were Number-of-people answers. For Date, 6 of 8 were correct date answers that merely missed the year. In those cases the year was not present in the justification, but it was the directly preceding "19[0-9][0-9]" year for 3 of them and the year of the current document for the 3 others (with no other year cited). A simple detection rule, with the year of the document as default value, could therefore have helped to answer these questions. Lastly, one other Date question was judged inexact but is actually wrong: Australia will probably never be a European Union member (Q0180). For the Number-of-people questions, all 3 were missed because of incorrect NE boundary detection, 2 of them due to missing approximation markers ("près d[e']"/"around").

Temporally restricted factoid questions: These were handled as simple factoid questions, without any particular effort to justify the date information (we were actually able to extract time constraints from the questions, but not to build the checking module). According to the results provided by the CLEF staff, none of them were correctly answered by our system.
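As an indication of what the missing checking module could look like, here is a minimal, hypothetical sketch that keeps a candidate passage only if a year constraint extracted from the question is compatible with the passage text or, by default, with the document date. It was not part of the submitted system, and all names are illustrative.

```python
import re

# Hypothetical sketch of the missing temporal checking step: given a year
# constraint extracted from the question, a passage is accepted if it cites
# that year, or, when it cites no year at all, if its document date matches.
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")

def satisfies_constraint(passage: str, doc_year: int, required_year: int) -> bool:
    """True if the passage (or, by default, its document date) matches the constraint."""
    years = {int(y) for y in YEAR.findall(passage)}
    if years:
        return required_year in years
    return doc_year == required_year

if __name__ == "__main__":
    print(satisfies_constraint("Le sommet s'est tenu en 1994 à Naples.", 1995, 1994))  # True
    print(satisfies_constraint("Le sommet s'est tenu à Naples.", 1995, 1994))          # False
```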
Definition questions: Results for the definition module were quite good. Of the 41 definition questions that we manually identified (the question "Décrire le World Trade Center. / Describe the World Trade Center."/Q0187 could also be added to this set, but none of our modules would be able to handle it, and even for a human the answer to provide is not clear), our best run (FR-FR) answered 32 Right (non-NIL) and 1 NIL Right ("Linux"/Q0003: the word Linux never appears in the corpora). Among the 5 incorrect answers provided: 2 were due to bugs in the sentence/justification matching process and should have been answered correctly; 1 expected a NIL answer ("T-shirt"/Q0144) and cannot be handled by this approach, which tends to always provide an answer unless the object to define does not appear in the corpora (it lacks a confidence or validation measure to discard very improbable answers); and the last 2 required more complex resolution and were not located at all: the answers were before or after a relative clause and were also separated from the focus by more than one punctuation mark (a dependency tree could have helped to retrieve them). Among the 3 inexact definition answers: "Boris Becker"/Q0090 is hard to define without anaphora resolution, since the string "Boris Becker" never co-occurs in a sentence with "tennis" ("joueur"/"player" or "man") or any "vainqueur"/"winner"; the 2 others also required extracting a subpart of a relative clause, but the corresponding passages were among the top five definitions selected. Results for the EN-FR runs were lower (the best one reached 27 + 1 NIL out of 41) due to translation errors.

List questions: Our QAS wrongly handled the 6 list questions3 of this year's test set as simple factoid questions (and thus returned only one answer). In our best FR-FR run, we provided one right answer for 4 of these questions, and for 3 of them the justification contained all the other required right answers. In our best EN-FR run, 2 right answers were provided, but only one of the justifications contained all the other correct instances.

3 Four questions were misclassified by the CLEF staff and are not actual list questions: Q0133, Q0195, Q099, Q0200.

6 Conclusion and future work

We have presented the Question Answering system used for our first participation in the QA@CLEF-2006 evaluation. We participated in two tracks: monolingual French (FR-FR) and cross-lingual English to French (EN-FR). The results obtained were acceptable, with an accuracy of 46% for FR-FR and 35% for EN-FR.

However, as the brief analysis in the previous sections shows, performance can be improved at every step. For example, redundancy still has to be properly evaluated: it could be redundancy inside the corpora, among the extracted answers, and/or using the Web as a statistical judge. Indeed, we noticed that even our broken criterion allowed us to answer some questions that had previously been answered wrongly. Inadequate Named Entities (NE) were also a bottleneck in the processing chain, and many NE types should still be added (or improved). To compensate for this known weakness, we tried to include a simple reformulation module in our system, notably for the "What/Which (something)" questions. It would have been in charge of verifying answers already extracted by the generic extraction system and, in particular, of completing them when the NE type was unknown. However, our preliminary experiments showed that such question rewriting is tricky.
In our opinion, the questions were (intentionally or not) formulated so that it is difficult to match an answering sentence simply by changing the word order (synonyms might help here). Another problem is that, when the Web is used to match such reformulations, the results obtained are noisy: we observed this for presidents or events (the Olympic Games), as the CLEF corpora date from 1994-1995 while the Web emphasizes the recent period.

With an accuracy of 76%, performance on definition questions was quite good, even if it could still benefit from anaphora resolution. Moreover, we believe that our current methodology is largely language independent, as it involves little language-specific knowledge beyond the detection of appositive markers (a similar approach was used by [4] to answer Spanish definition questions). We are interested in improving it in several ways: first, locating more difficult definitions (such as the ones we missed); then choosing better-quality definitions (by combining better statistics); and finally synthesizing these definitions. For example, "Bill Clinton" was defined as an "American president", "Democrat president", "democrat", or "president", whereas he is best defined as an "American Democrat president". Likewise, "Airbus" was defined as a "European consortium" or an "aeronautic consortium", so defining it as a "European aeronautic consortium" would be even better.

Concerning date and temporally restricted questions, we could not complete the development of a full processing chain, so this remains future work (by the time of the CLEF evaluation, we had only built a module extracting time constraints from the question). We also need to complete our post-processing in order to select or synthesize complete dates (with the difficulties one can expect for relative dates). Lastly, our QAS still lacks confidence scores after each pipelined component; such scores would make it possible to relax some constraints (for example on keywords), to loop back, or simply to decide when a NIL answer must be provided.

References

[1] F. Béchet, A. Nasr, and F. Genet. Tagging unknown proper names using decision trees. In Proceedings of ACL 2000, pages 77-84, Hong Kong, China, 2000.

[2] P. Bellot, E. Crestan, M. El-Bèze, L. Gillard, and C. de Loupy. Coupling named entity recognition, vector-space model and knowledge bases for TREC-11 question-answering track. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), NIST Special Publication 500-251, 2003.

[3] C. Ayache, B. Grau, and A. Vilnat. EQueR: the French evaluation campaign of question answering systems. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 2006.

[4] D. Ferrés, S. Kanaan, E. González, A. Ageno, H. Rodríguez, and J. Turmo. The TALP-QA system for Spanish at CLEF-2005. In C. Peters, editor, Working Notes for the CLEF 2005 Workshop, 2005.

[5] L. Gillard, P. Bellot, and M. El-Bèze. Le LIA à EQueR (campagne Technolangue des systèmes questions-réponses). In Actes de TALN-Recital 2005, volume 2, pages 81-84, Dourdan, France, 2005.

[6] R. Kuhn and R. De Mori. The application of semantic classification trees to natural language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5):449-460, 1995.

[7] B. Sagot and P. Boullier. From raw corpus to word lattices: robust pre-parsing processing with SxPipe. Archives of Control Sciences, 15(4):653-662, 2005.

[8] S. Sekine, K. Sudo, and C. Nobata. Extended named entity hierarchy.
In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 1818-1824, Las Palmas, Canary Islands, 2002.

[9] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the First International Conference on New Methods in Natural Language Processing (NemLap-94), pages 44-49, Manchester, U.K., 1994.

[10] E. M. Voorhees and D. Harman. TREC: Experiment and Evaluation in Information Retrieval, chapter 10, pages 233-257. MIT Press, 2005.