MIRACLE at the Spanish CLEF@QA 2006 track

César de Pablo-Sánchez, Ana González-Ledesma, Antonio Moreno, José Luis Martínez-Fernández, Paloma Martínez

Universidad Carlos III de Madrid
{cesar.pablo,paloma.martinez}@uc3m.es
Universidad Autónoma de Madrid
{ana,sandoval}@maria.lllf.uam.es
DAEDALUS S.A. - Data, Decisions and Language, S.A.
jmartinez@daedalus.es

Abstract

We describe the prototype QA system built by the MIRACLE group, a consortium of three Madrid universities and the spin-off company DAEDALUS. The system is an evolution of last year's system with several improvements in the question analysis and NERC components. We submitted two runs for the Spanish monolingual subtask, with different strategies for using NEs in passage selection and answer ranking. Results show that a recall-oriented approach obtains more accurate results. A detailed error analysis and a preliminary comparison with last year's system are also discussed.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Questions beyond factoids

1 Introduction

In our third participation in the CLEF@QA task, the MIRACLE group submitted two runs for the Spanish monolingual subtask. The runs differ in the way they treat Named Entities and other complex expressions during passage selection and ranking. The core of the system is taken from our last year's effort [1], although most of its modules have been improved. The system has also been partially adapted to the new requirements of the 2006 QA task guidelines. Questions are no longer provided with a question type, so the system must infer it from the question text. New question types are also introduced, in particular lists of factoids. Although our system has been prepared to categorize questions according to these types, not all of them have received the same attention due to time constraints; in particular, lists are treated as simple factual questions. Other novelties in the task include the normalization of timexes, which we have also accomplished partially. These improvements also allow our system to perform a more elaborate treatment of temporally restricted questions. Finally, the guidelines also required short passages supporting the answer and several answers (up to 10), although, in the end, only one (answer, snippet) pair has been officially evaluated.

2 System description

The system is organized in a pipelined architecture with the three classic main modules: Question Analysis, Passage Retrieval and Answer Extraction. These modules use common language analysis services for Spanish. We have also developed a tool that logs final and intermediate results and allows further analysis of the system through a web interface.

2.1 Language analysis

Our language analysis module is composed of the DAEDALUS STILUS [6] analyzer service and some other tools we have developed specifically for this task. STILUS provides tokenization, sentence detection and token analysis for Spanish. The analysis of tokens includes POS tags, lemmas and other morphological features such as number and tense. The analysis is also enriched with semantic information stored in a dictionary of common Named Entities (NE) extracted from several resources.
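To make this representation concrete, the sketch below shows the kind of enriched token record such a layer can produce: morphological information plus entity classes looked up in an NE dictionary. The record layout, class labels and dictionary entries are illustrative assumptions, not the actual STILUS output.

```python
from dataclasses import dataclass, field

# Hypothetical NE dictionary; the real one is compiled from several resources
# and organized according to Sekine's typology.
NE_DICTIONARY = {
    "madrid": ["LOCATION.CITY"],
    "efe": ["ORGANIZATION.COMPANY"],
}

@dataclass
class TokenAnalysis:
    """Illustrative token record: surface form, lemma, POS tag and NE classes."""
    form: str
    lemma: str
    pos: str
    ne_classes: list = field(default_factory=list)

def analyze_token(form: str, lemma: str, pos: str) -> TokenAnalysis:
    # Enrich the morphological analysis with semantic classes from the dictionary.
    return TokenAnalysis(form, lemma, pos, NE_DICTIONARY.get(form.lower(), []))

print(analyze_token("Madrid", "madrid", "NP"))
```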
These NE are organized following Sekine's typology [5]. Thematic and geographical relations for some of these NE are also provided, although we have not used them in the QA system yet. The tool was designed with spell, grammar and style checking in mind, and is therefore very exhaustive in its language coverage and in its analyses, even for infrequent uses. Since it is reused for this task, it does not include a proper POS tagger or chunker like those found in text analysis toolkits more focused on similar tasks. In exchange, it handles large quantities of text very well and is fast, stable and robust.

To adapt the output of STILUS to the processing requirements of the QA task we added some further language analysis modules. A very basic set of rules removes infrequent analyses of some common words based on contextual information. Another set of rules is used to improve the recognition of NE, in particular its recall. This step recognizes and groups tokens based on the previous information and on other orthographic, contextual and structural features. Most semantic information is retained, and some rules may even add new entity class information to the analysis. This module has been revised and improved since last year's submission. Finally, an additional normalization step is performed for Temporal Expressions (TE). This feature is still under development, but our intention is that it will allow the system to apply a shallow temporal reasoning method.

2.2 Question analysis

This module has been extended to deal with the new kinds of questions and with the fact that the question type is no longer given as an explicit input to the systems. The question analysis module is responsible for transforming the question string into a common representation that can be used throughout the system. Our question model, the basis for question representation, is very simple, as we lack complex analysis resources and we also believe that a simpler question model is more adequate for cross-lingual applications. Following earlier work [2], we characterize each question with the following features: question type (QT), expected answer type (EAT), question focus (QF), answer type term, query terms and relevant terms. Query terms are the terms considered for retrieving candidate documents and are used to build queries. Relevant terms, on the other hand, are a broader set of terms that can help to locate an answer but would retrieve noisy documents, as often happens with the answer type term.

Before classification, the question text is analyzed using the language analysis services described above. Question classification is then carried out in two steps: first the question type is assigned, and then the expected answer type. We use a set of handwritten rules to perform question type classification. Four question types are detected: Factoid (F), Temporally Restricted Factoid (T), List (L) and Definition (D). Besides common patterns, the classification uses information specific to each type. For instance, the detection of a TE usually signals a temporally restricted question, and plurals in answer type terms or question words allow us to detect list questions. The expected answer type is also assigned using rules and lists of related words compiled by a linguist. This year we improved the coverage of this classification by adding more patterns and extending the lists.
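As a purely illustrative sketch (not the actual rule set), a handwritten question type classifier of this kind could look as follows. The patterns are simplified examples; the real rules also draw on the NERC and TE analyses described above.

```python
import re

# Simplified, hypothetical surface patterns for Spanish questions.
DEFINITION_PAT = re.compile(r"^¿?\s*(qué|quién)\s+(es|son|era|fue)\b", re.IGNORECASE)
LIST_PAT = re.compile(r"^¿?\s*(cuáles|qué)\s+\w{3,}s\b", re.IGNORECASE)               # plural cue
TEMPORAL_PAT = re.compile(r"\b(en|durante|entre|de)\s+(19|20)\d{2}\b", re.IGNORECASE)  # explicit TE cue

def classify_question_type(question: str) -> str:
    """Assign one of the four question types: Definition (D), List (L),
    Temporally Restricted Factoid (T) or Factoid (F)."""
    if DEFINITION_PAT.search(question):
        return "D"
    if LIST_PAT.search(question):
        return "L"
    if TEMPORAL_PAT.search(question):
        return "T"
    return "F"

print(classify_question_type("¿Qué es el CERN?"))                 # D
print(classify_question_type("¿Quién ganó el Mundial de 1998?"))  # T
```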
F, L and T questions share the rules and the answer type hierarchy used last year [1]. For D questions, the type reflects the object to describe or the action to take (Person, Organization, Other or Acronym expansion) and relies mainly on the information from our NERC module.

2.3 Document retrieval

As last year, we used Xapian [7] for the retrieval stage. Xapian is an open source engine based on the probabilistic retrieval model that includes the Okapi BM25 scoring function [4]. The collection has been indexed using simple terms stemmed with the Snowball stemmer [3] for Spanish, and query terms are processed in the same fashion. Some improvements have been made in the query generation step. Simple terms are combined in the query with the OR operator to improve recall. Complex terms such as NE are decomposed and their components joined with a relaxed AND operator (AND MAYBE) implemented by Xapian. With this operator, documents that contain all the component terms are ranked higher, but documents in which some of the terms do not appear are not removed from the result set.

Documents selected by the retrieval step are analyzed and sentence boundaries are identified. Snippets that contain a number of relevant terms from the question are considered for further processing. The threshold for selecting a snippet is proportional to the number of relevant terms that represent the question, and in particular to whether it contains the entity in focus. So far, we have only considered snippets that span a single sentence.

2.4 Answer extraction

A filter has been developed for every expected answer type in our system. Answer extraction uses this information to select candidate answers from the set of relevant sentences previously identified. Every (candidate answer, snippet) pair is given a score that takes into account factors such as the document score and the frequency and ratio of the relevant terms it contains. This scoring step, which we call the local score, considers only the information present in a single snippet. In a second step, similar candidate answers are conflated and a global score is generated. Rules for temporal expression normalization have been developed this year and are used to conflate dates. Other simple techniques that filter stopwords and compare the remaining tokens are used for the other NE types. These techniques are rather heuristic and certainly do not conflate all the variations of a name, but they have proven useful. When several candidate answers are grouped, the one with the highest score is selected as the representative of the group, together with its snippet. A redundancy score is calculated by counting the number of snippets in a group, up to a maximum of N. The group is then assigned a global score that combines the local score (with weight wls) and the normalized redundancy score (with weight wrs) to produce a confidence between 0 and 1. The values used in our experiments were set manually: N=10, wls=0.9, wrs=0.1.

3 Description of the runs

Two runs were submitted to the Spanish monolingual task. They differ in the way they use multiwords or complex terms. Complex terms include multiword units such as most NE, TE and numerical expressions. As we index the EFE collection using simple terms, it is not clear which strategy works best for selecting and scoring relevant documents and sentences.
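Both runs rely on the query generation step described in Section 2.3. As a reference for the discussion that follows, a minimal sketch using Xapian's Python bindings is shown below: simple terms are joined with OR, and the decomposed components of a complex term are attached with AND MAYBE, so documents containing all the components are boosted without discarding partial matches. The exact nesting of operators, the index path and the assumption that terms are indexed in stemmed, unprefixed form are ours, not a description of the actual MIRACLE implementation.

```python
import xapian

def build_query(simple_terms, ne_components):
    """Sketch of the query generation strategy described in Section 2.3."""
    stem = xapian.Stem("spanish")  # Snowball stemmer for Spanish
    simple = [xapian.Query(stem(t.lower())) for t in simple_terms]
    query = xapian.Query(xapian.Query.OP_OR, simple)
    if ne_components:
        parts = [xapian.Query(stem(t.lower())) for t in ne_components]
        # Any component keeps the document in the result set (OR); documents
        # containing all components receive an extra boost (AND via AND_MAYBE).
        ne_query = xapian.Query(xapian.Query.OP_AND_MAYBE,
                                xapian.Query(xapian.Query.OP_OR, parts),
                                xapian.Query(xapian.Query.OP_AND, parts))
        query = xapian.Query(xapian.Query.OP_OR, query, ne_query)
    return query

# Retrieval with Xapian's default BM25 weighting (hypothetical index path).
db = xapian.Database("efe_index")
enquire = xapian.Enquire(db)
enquire.set_query(build_query(["presidente", "Francia"], ["Jacques", "Chirac"]))
for match in enquire.get_mset(0, 10):
    print(match.docid, match.percent)
```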
Before trying more complex alternatives, we have tested a different simple strategy in each of the runs. Both runs decompose complex proper names in the query into simple terms joined with the AND MAYBE operator to retrieve documents. Run mira061eses selects and scores sentences using the whole multiword. This strategy selects only those sentences in which the expression appears exactly as in the question, and is clearly oriented towards high precision. In contrast, run mira062eses uses the simple terms that compose the multiword. This run favours recall, because it can match, for example, different common ways of referring to the same entity, but it is also problematic because it may select noisy sentences that refer to other entities.

4 Results

Table 1 presents the official results for factual (F), definition (D) and temporally restricted (T) questions for the two Spanish monolingual runs submitted.

Table 1: Evaluation results for MIRACLE's submitted runs

Run          R   X  U  Acc(%)  Acc(F)(%)  Acc(D)(%)  Acc(T)(%)  CWS      K1
mira061eses  36  3  6  18.95   19.86      16.67      0.00       0.05708  -0.3405
mira062eses  40  4  7  21.05   20.55      23.81      0.00       0.07896  -0.3977

Although several questions have temporal restrictions (33 in our interpretation), only two of them were officially assigned this question type, and neither of them was answered correctly.

General performance for factual and definition questions is quite similar for both runs, although mira062eses obtains better accuracy and CWS scores. The interpretation of the results needs further investigation, but it seems that the precision-oriented run (mira061eses) filters out some useful sentences. This is also supported by the higher number of NIL responses in that run. In contrast, the K1 measure, which takes into account the self-score of the system, is better for run mira061eses than for mira062eses. The difference in the scoring of the two runs comes from considering complex terms either as a unit or as separate terms. Although both methods perform rather badly with regard to this metric, the second seems to be worse, as it usually produces higher scores for the same sentence and therefore higher confidence. In any case, inspection of the results suggests that correct and incorrect candidate answers are scored too closely. This is also, in our opinion, the source of the unsupported answer errors, which are probably produced by similar scores before or after answer normalization.

For the questions that we believe have a temporal restriction, we performed an unofficial evaluation. Both runs have similar overall results, with accuracy figures comparable to those of ordinary factoid questions. If we consider some doubtful cases (unsupported answers and some cases where all candidates obtain the same score), this percentage is even higher. It seems that our effort to process temporal information is heading in the right direction, even if it is at an initial stage. This is also supported by the fact that the system obtains good scores for F-TIME questions.

In addition, the 2006 exercise included a test on list questions. The results for our system were very low. Run mira061eses answered only 2 questions right, with a P@10 score of 0.03 (three correct answers). The second run (mira062eses) was even worse, with only one right answer. Although list questions are identified in the question analysis phase, no special processing has been implemented for them due to time constraints, and they are treated as simple factoids.
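As a reference for the score discussion above, the following minimal sketch shows one way the group confidence described in Section 2.4 could be computed. The weighted-sum form is an assumption on our part (the description only states that the local score and the normalized redundancy are combined), using the stated values N=10, wls=0.9 and wrs=0.1.

```python
def group_confidence(local_score, n_snippets, N=10, wls=0.9, wrs=0.1):
    """Combine the best local score in a group with a normalized redundancy
    score (number of supporting snippets, capped at N). Assumes local_score
    is already in [0, 1] and that the combination is a weighted sum; both are
    assumptions of this sketch, not a description of the actual system."""
    redundancy = min(n_snippets, N) / N
    return wls * local_score + wrs * redundancy

# Example: a group with local score 0.6 and 3 supporting snippets receives
# 0.9 * 0.6 + 0.1 * 0.3 = 0.57, so the confidence is dominated by the local
# score, which may help explain why correct and incorrect candidates end up
# with very similar scores.
```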
4.1 Error Analysis

It is important to estimate the causes of the low scores obtained this year by analysing which module of the system is mainly responsible for the errors. Of course, errors in a QA system cannot be attributed to a single module, and this should be borne in mind when judging the following error analysis. With respect to our 2005 error classification, we added two new error types this year to better understand the treatment of questions and answers in our system. The estimated errors for run mira061eses are shown in Table 2, which gives the percentage of each error type with respect to the total number of errors, counting Wrong, ineXact and Unsupported answers (154 questions in the case of mira061eses).

Table 2: Error analysis for run mira061eses

Error type               %
Question classification  7.79
Question analysis        18.18
Document retrieval       38.96
Sentence selection       3.89
Answer extraction        11.03
Answer ranking           20.12

Comparing the error classification of both years (see Table 3), we can observe significant differences. While question analysis and answer extraction (although with additions and modifications) obtained basically the same results in both years, the situation for document retrieval and answer ranking has been reversed. A possible explanation for this year's poor document retrieval results is the increased difficulty of the questions. We should note that this is a preliminary study: last year we had correct assessments for all the questions before the workshop, while this year we are relying only on our own judgments. Definitive results will be presented in the final version of this report.

Table 3: Error comparison between 2005 and 2006 runs (%)

Error type                           2005   2006
Question classification & analysis   25.98  25.97
Document retrieval                   20.81  38.96
Sentence & answer extraction         11.83  14.92
Answer ranking                       40.84  20.12

In addition, careful inspection of the results has allowed us to detect at least two significant bugs. The first affects the classification of some definition questions, such as "¿Qué es el CERN?", that should expand an acronym. The other appears in the extraction of candidates for some NE types.

5 Conclusion and Future Work

In this work we have studied how to handle complex terms or multiwords in the selection and ranking of answers in our Question Answering system for Spanish. Results show that favouring recall, by dividing multiwords and considering their single terms, helps to locate more candidate sentences and improves accuracy. The two runs submitted explore ways of handling multiwords that lie at opposite extremes. Using multiwords as a single unit seems to help to identify NIL answers, which could be interesting in some practical applications. A technique that combines evidence from both approaches could obtain much better results; possible alternatives are merging the two ranked lists or extending the analysis to consider simple coreference at the document level. Further investigation into the correct way to weight the evidence of the terms for ranking and self-scoring is still needed, as the results of the different measures are contradictory. We probably need to reconsider factors such as distance and weight, which we already used last year and disregarded this year. Another issue that needs to be solved is the proper scaling of the confidence score so that it is more informative. Regarding the recognition and normalization of temporal expressions, in contrast, it seems that we are on the right track.
We plan to continue this work, which we expect will allow us to perform some shallow reasoning for TIME questions and temporally restricted questions. Finally, it seems that more effort should be put into other aspects of the system to reach the desired performance. We also have other open lines of work regarding the use of relation extraction patterns for some question types and improving retrieval with the experience acquired in other evaluation tasks.

References

[1] C. de Pablo-Sánchez et al. MIRACLE's 2005 approach to cross-lingual question answering. In Working Notes for the CLEF 2005 Workshop, Vienna, Austria, 2005.

[2] Marius Pasca. Open-Domain Question Answering from Large Text Collections. CSLI Publications, 2003.

[3] Martin Porter. Snowball stemmers and resources website. Online: http://www.snowball.tartarus.org, last visited July 2006.

[4] S. E. Robertson et al. Okapi at TREC-3. In D. K. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), 1995.

[5] Satoshi Sekine. Sekine's extended named entity hierarchy. Online: http://nlp.cs.nyu.edu/ene/, last visited August 2006.

[6] STILUS website. Online: http://www.daedalus.es, last visited July 2006.

[7] Xapian: an open source probabilistic information retrieval library. Online: http://www.xapian.org, last visited July 2006.