Are Passages Enough? The MIRACLE Team Participation at QA@CLEF2009

María Teresa Vicente-Díez, César de Pablo-Sánchez, Paloma Martínez, Julián Moreno Schneider, Marta Garrote Salazar
Universidad Carlos III de Madrid
{tvicente, cdepablo, pmf, jmschnei, mgarrote}@inf.uc3m.es

Abstract

This paper summarizes the participation of the MIRACLE team in the Multilingual Question Answering Track at CLEF 2009. In this campaign, we took part in the monolingual Spanish task at ResPubliQA@CLEF 2009 and submitted two runs. We adapted our QA system, previously evaluated on the EFE and Wikipedia collections, to the new JRC-Acquis collection and the legal domain. We tested the addition of answer filtering and ranking techniques to a base system using passage retrieval, without success. Our run using question analysis and passage retrieval obtained an overall accuracy of 0.32, while the addition of an answer filtering step obtained 0.29. We provide an initial analysis of the results across the different question types and investigate why it is difficult to leverage previous QA techniques. A second focus of our work has been temporal reasoning applied to question answering; we also provide a detailed discussion of this issue in the new collection, together with an analysis of the questions.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

Keywords

Question Answering, Spanish, legal domain, temporal indexing, temporal normalization

1 Introduction

We describe the MIRACLE team participation in the ResPubliQA exercise at the Multilingual Question Answering Track at CLEF 2009. The MIRACLE team is a consortium formed by three universities from Madrid (Universidad Politécnica de Madrid, Universidad Autónoma de Madrid and Universidad Carlos III de Madrid) and DAEDALUS, a small and medium-sized enterprise (SME). We submitted two runs for the Spanish monolingual subtask, which summarize our attempts to adapt our QA system to the new requirements of the task.

This year, the main task departed from previous exercises in an attempt to explore new domains, question types and multilingual experiments. The change of application domain was triggered by the use of the JRC-Acquis document collection [1], which consists of European legislation translated into several EU languages. This raises the problem of dealing with legal language, which features richer terminology and is considerably more complex than the news and encyclopedic language of the EFE and Wikipedia collections. Moreover, new kinds of information needs have to be addressed, which motivated the inclusion of questions asking for objectives, motivations, procedures, etc., in addition to the traditional factual and definition questions.

The new types of questions often require longer answers, and therefore the expected response of the system has again been fixed at the paragraph level. Nevertheless, it should be possible to take advantage of answer selection techniques developed in previous campaigns; this is in fact one of the hypotheses we wanted to test with our participation. Unfortunately, our experiments in this line have not been successful and we have not found configurations that perform substantially better than our baseline. A different aspect of our work has centered on the use of temporal information in the QA process, and we report results for different indexing configurations.
Finally, a global objective was to enlarge the capabilities of the QA system and advance towards an architecture that allows domain adaptation and multilingual processing.

The rest of the paper is structured as follows: Section 2 describes the system architecture, with special attention paid to the novelties introduced this year; Section 3 introduces the submitted runs and analyzes the results; finally, conclusions and future work are presented in Section 4.

2 System Description

The system architecture is similar to that of our previous system [2] and is based on a pipeline which analyzes questions, retrieves documents and performs answer extraction based on linguistic and semantic information. Different strategies can be used depending on the type of the question and the expected type of the answer. The architectural schema is shown in Figure 1. A number of modules have been modified, extended or reorganized in order to meet the requirements of the task and the legal domain. Other modules have been included to carry out new experiments.

Figure 1: MIRACLE 2009 system architecture

The main changes performed in the system are the following:

• Parsers for the new collection were added, as well as support for indexing passages.
• The evaluation procedure was modified to work with passages and a fallback strategy for passages was included.
• New rules were developed for Question Analysis, Question Classification and Answer Filtering in the legal domain, using the development set.
• Query generation was adapted to the domain and the page heuristics for Wikipedia were removed.
• Temporal management was added to normalize temporal expressions and was integrated into the language analysis and indexing routines.
• New functionality was added for mining acronyms offline and adding them to query generation.
• The ranking module was redesigned for modularity.

Indexes

Indexes are very important for QA, since a good retrieval subsystem can considerably improve the final results of the QA system. Due to the change in the document collection, all IR indexes were newly created using Lucene [3] as the retrieval engine. To store the relevant information as appropriately as needed, we designed two different document types or indexing units:

• Document, where all the information related to the title, note and text of a collection file is stored.
• Paragraph, which stores each paragraph, together with the title and the notes, in a separate Lucene document.

Lucene uses a document length normalization term in the retrieval score, which was arguably of no help in the case of paragraph scoring because paragraphs are expected to have more uniform lengths. Both types of indexes, with and without length normalization, were tested. In all our experiments prior to submission, the paragraph (passage) index worked better than the document index. A minimal sketch of this paragraph-level indexing is given below.
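The following sketch illustrates the two design decisions just described: one Lucene document per paragraph, and length normalization switched off. It is our illustration, not the team's actual code; the field names and paths are invented, and the API shown corresponds to a recent Lucene release (roughly 7.x/8.x), whereas the 2009 system used the much older Lucene 2.x API.

import java.nio.file.Paths;
import org.apache.lucene.analysis.es.SpanishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.FSDirectory;

public class ParagraphIndexer {
    public static void main(String[] args) throws Exception {
        // Disable document length normalization: paragraphs have fairly
        // uniform lengths, so the norm mostly adds noise to the score.
        ClassicSimilarity noLengthNorm = new ClassicSimilarity() {
            @Override
            public float lengthNorm(int numTerms) {
                return 1.0f;
            }
        };
        IndexWriterConfig cfg = new IndexWriterConfig(new SpanishAnalyzer());
        cfg.setSimilarity(noLengthNorm);
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("paragraph-index")), cfg)) {
            // One Lucene document per paragraph; title and note are repeated
            // in every paragraph document so that they remain searchable.
            Document d = new Document();
            d.add(new StringField("docid", "jrc31994R1234-es-p7", Field.Store.YES));
            d.add(new TextField("title", "Reglamento (CE) ...", Field.Store.YES));
            d.add(new TextField("text", "El presente Reglamento ...", Field.Store.YES));
            writer.addDocument(d);
        }
    }
}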
In addition, we created different index types with respect to the analysis, characterized by the linguistic analyzer used in each case:

• Simple Index, where the text analyzer is a simple analyzer adapted for Spanish. It performs grammar-based parsing, stems words using a Snowball-generated stemmer, removes stop words, replaces accented characters in the ISO Latin 1 character set and converts text to lower case. All the text is stored in the same field: text.
• Temporal Index, which adds recognition and normalization of time expressions. The normalized time expressions are included in the index. Texts are also stored in the text field.

Finally, the query generation process had to be changed to use the same analyzer that was used to create the index.

The idea of a rule engine was initially considered for classifying question types; since then it has been used not only in the Question Classification module, but also in the Answer Filter, Timex Analyzer and Topic Detection modules [2]. Each rule has a left-hand side that expresses a pattern and a right-hand side specifying the actions to be taken each time the pattern is found. The pattern can refer to lexical, syntactic and/or semantic elements. The change of linguistic domain required some new rules. Below, we present an example of a new rule, developed to handle the extraction of definitions in this year's corpus:

Figure 2: Example of rule for answer extraction

RULE("definition")
EXISTENTIAL QUESTION_TYPE("DEFINITION")
  AND WORD_I(N, OBTAIN_FOCUS())
  AND (WORD_I(N+1, ":")
       OR WORD_I(N+1, "\"") AND WORD_I(N-1, "\"")
       OR WORD_I(N+1, "\"") AND WORD_I(N+2, ":") AND WORD_I(N-1, "\""))
THEN
  ANSWER_EXTRACTION(0, POS_LAST_TOKEN());
END

This rule has been created to detect the topic in definition questions. In most of them, the topic in the answer paragraph appears in quotation marks and/or is followed by a colon. The rule locates the topic of the question and looks for it in the source documents.

Temporal Management

Some authors have defined temporal question answering (TQA) as the specialization of the QA task in which questions have features that denote temporality [4], as well as a means of providing short, focused answers to temporal information needs formulated in natural language [6]. Previous work has already tackled this problem for other languages [7, 8], as well as for Spanish [4, 5]. Temporal questions can be classified into two main categories according to the role of temporality in their resolution:

• Temporally Restricted (TR) questions are those containing a time restriction: "¿Qué resolución fue adoptada por el Consejo el 10 de octubre de 1994?" ("What resolution was adopted by the Council on 10 October 1994?")
• Questions with a Timex Answer (TA) are those whose target is a temporal expression or a date: "¿Cuándo empieza la campaña anual de comercio de cereales?" ("When does the marketing year for cereals begin?")

In this campaign, temporal management preserves the approach taken by the MIRACLE QA system that participated in CLEF 2008 [2]. This decision is based on later complementary work carried out to evaluate the QA system against a baseline system without temporal management capabilities [9]. The experiments showed that the management of additional temporal information can benefit the results both quantitatively and qualitatively, which led us to predict that such strategies could enrich future developments. Several adjustments were made to the integrated recognition, resolution and normalization of temporal expressions to enhance its coverage of the new collections. A sketch of the normalization step appears below.
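To make the recognition and normalization step concrete, the following minimal sketch maps one frequent Spanish date pattern to its ISO 8601 form, which is what the Temporal Index stores in place of the surface expression. It is an illustration under our own assumptions: the real system is rule-driven and covers many more expression types (relative dates, partial dates, durations).

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimexNormalizer {
    private static final Map<String, String> MONTHS = Map.ofEntries(
        Map.entry("enero", "01"), Map.entry("febrero", "02"),
        Map.entry("marzo", "03"), Map.entry("abril", "04"),
        Map.entry("mayo", "05"), Map.entry("junio", "06"),
        Map.entry("julio", "07"), Map.entry("agosto", "08"),
        Map.entry("septiembre", "09"), Map.entry("octubre", "10"),
        Map.entry("noviembre", "11"), Map.entry("diciembre", "12"));

    // Matches full dates such as "10 de octubre de 1994".
    private static final Pattern FULL_DATE = Pattern.compile(
        "(\\d{1,2}) de (\\p{L}+) de (\\d{4})", Pattern.CASE_INSENSITIVE);

    /** Returns "1994-10-10" for "10 de octubre de 1994", or null if unrecognized. */
    public static String normalize(String expression) {
        Matcher m = FULL_DATE.matcher(expression.trim());
        if (m.matches()) {
            String month = MONTHS.get(m.group(2).toLowerCase());
            if (month != null) {
                return String.format("%s-%s-%02d",
                        m.group(3), month, Integer.parseInt(m.group(1)));
            }
        }
        return null; // left to other rules (relative expressions, etc.)
    }
}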
As in the previous version, the creation date of each document is adopted as the reference date needed to resolve the relative expressions it contains. In JRC-Acquis documents this information is provided by the "date.created" attribute.

The question analysis, index generation and answer selection modules were considered the most likely to benefit from the application of temporal management. They have been slightly adapted to the requirements of this year's competition, keeping the essence of their functionality:

• During question analysis, queries, including those with temporal features, are classified, distinguishing between TR and TA queries. If a TA query is detected, the granularity of the expected answer (complete date, only year, month, etc.) is determined.
• The answer selector is involved in two ways: for TA queries, the module must favour a temporal answer, whereas for TR queries it applies extraction rules based on the temporal inference mechanism and demotes the candidates that do not fulfil the temporal restrictions.

As a novelty, this year we created more sophisticated indexes in line with the paragraph retrieval approach of the competition. In some configurations, the normalized resolution of temporal expressions is included in the index instead of the expression itself. The main objective is to assess the behaviour of the QA system with different index configurations, focusing mainly on the temporal queries of the collection.

Acronym mining

Due to the nature of the collection, a large number of questions were expected to ask for the expansion of acronyms, especially those of organizations. Moreover, the recall of the information retrieval step could be improved by including both the acronym and its expansion in the query. We implemented a simple offline procedure to mine acronyms by scanning the collection for a pattern which introduces a new entity and provides its acronym between parentheses (see the sketch after this list). The results are then filtered in order to increase precision. First, only those associations that occur at least twice in the corpus are considered. As parentheses often convey other relations, such as persons and their country of origin, another filter removes countries (Spain) and their codes (ES) from the list. Finally, a few frequent mistakes were removed manually, and acronyms with more than one expansion were also checked. Once the file was cleaned, we indexed the acronyms and their expansions separately so as to be able to search by acronym or by expansion. The index is used in two different places in the QA system:

• Query Generation, where it analyzes the question and adds search terms to the query that is sent to the document collection index.
• Answer Filter, where it analyzes the text extracted from the paragraph to determine whether the paragraph contains the acronym (or the expansion) and, if so, marks the paragraph as a correct answer.
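The mining procedure amounts to a pattern scan followed by the two precision filters described above. The sketch below is a hypothetical reconstruction: the actual pattern, country stoplist and manual clean-up of the MIRACLE system are not reproduced here.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcronymMiner {
    // A capitalized multi-word name followed by an all-caps token in
    // parentheses, e.g. "Banco Central Europeo (BCE)".
    private static final Pattern PAIR = Pattern.compile(
        "((?:\\p{Lu}[\\p{L}.]+\\s+){1,6}\\p{Lu}[\\p{L}.]+)\\s*\\((\\p{Lu}{2,10})\\)");
    // Country codes whose parenthesized use denotes origin, not an acronym.
    private static final Set<String> COUNTRY_CODES = Set.of("ES", "FR", "DE", "IT", "PT");

    public static Map<String, String> mine(Iterable<String> paragraphs) {
        Map<String, Integer> counts = new HashMap<>();
        Map<String, String> expansions = new HashMap<>();
        for (String p : paragraphs) {
            Matcher m = PAIR.matcher(p);
            while (m.find()) {
                String acronym = m.group(2);
                if (COUNTRY_CODES.contains(acronym)) continue; // e.g. "España (ES)"
                counts.merge(acronym, 1, Integer::sum);
                expansions.putIfAbsent(acronym, m.group(1));
            }
        }
        // Precision filter: keep only pairs seen at least twice in the corpus.
        expansions.keySet().removeIf(a -> counts.get(a) < 2);
        return expansions;
    }
}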
Answer Filter and Passage Fallback Strategy

This module, previously called Answer Extractor, processes the result list from the information retrieval module and selects chunks to form candidate answers. In previous years, this module was designed to extract answers selected from the document. In this campaign, the answer must be the complete text of a paragraph; therefore, this year the module works as a filter which removes passages with no answers.

The linguistic rules used last year to perform answer extraction have been adapted, and new rules have been developed to detect acronyms, definitions as expressed in the new corpora, and temporal questions.

The possibility of getting no answer from the answer filter led to the development of a module that simply creates answers from the retrieved documents. This module is called the Passage Fallback Strategy. It takes the documents returned by the information retrieval module and generates an answer from every document. The way the indexes are generated (specifically the paragraph index) makes the functionality of this module possible.

Evaluation module

Evaluation is a paramount part of the development process of the QA system. In order to develop and test the system, the English development set provided by the CLEF organizers was translated into Spanish and a small gold standard with answers was built. Mean Reciprocal Rank (MRR) and Confidence Weighted Score (CWS) were consistently used to compare the outputs of the different configurations with the development gold standard. Periodically, the output and the XML logs of different executions were manually inspected to complete the gold standard and to detect integration problems.
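For reference, we recall how these two development measures are usually computed (our notation, not an official formulation):

\[
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\mathit{rank}_q},
\qquad
\mathrm{CWS} = \frac{1}{n}\sum_{i=1}^{n}\frac{C(i)}{i},
\]

where $\mathit{rank}_q$ is the rank of the first correct answer to question $q$ (the term is taken as 0 if no correct answer is returned), the $n$ answered questions are sorted by decreasing system confidence, and $C(i)$ is the number of questions answered correctly among the $i$ with highest confidence.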
3 Experiments and results

We submitted two runs for the monolingual Spanish task. They correspond to the configurations of the system that yielded the best results during development using the translated question set. Paradoxically, both runs match the simplest configurations that we tested.

• Baseline (mira091eses): The system is based on passage retrieval using the simple index. Question analysis is performed to generate queries and acronym expansion is used.
• Baseline + Answer Filter (mira092eses): Adds answer filtering and the passage fallback strategy after the previous passage retrieval.

A number of additional configurations were also tested, but no consistent improvements over the baseline were found. In fact, most of the additions seemed to produce worse results on our development test. We considered different functions for Answer Ranking and Passage Re-ranking, both those tested in previous participations and some new ones. Different passage length normalization strategies were also applied to the indexes. Finally, a great deal of effort was devoted to the treatment of temporal expressions in question analysis, indexing and extraction; more detailed experiments are presented below.

Figure 3: mira091eses configuration (BL)

Figure 4: mira092eses configuration, including answer filtering (BL+AF)

Evaluation figures are detailed in Table 1. Answer accuracy has been calculated as the ratio of questions answered correctly to the total number of questions. Only the first candidate answer is considered; the remaining candidates are rejected.

Run          Right  Wrong  Unanswered,      Unanswered,      Unanswered,      Overall   Proportion of answers  c@1
                           right candidate  wrong candidate  empty candidate  accuracy  correctly discarded    measure
mira091eses  161    339    0                0                0                0.32      0                      0.32
mira092eses  147    352    0                0                1                0.29      0                      0.29

Table 1: Results for submitted runs

The results on the CLEF 2009 test set lead to conclusions similar to those we reached during development: the baseline system using passage retrieval is hard to beat and, in fact, our second run yields lower accuracy. As in our development experiments, the answers to a number of individual questions change, but the overall effect is not positive.

After the evaluation, and using the larger test set of 500 questions, we carried out a class-based analysis in order to understand the causes behind our unfruitful efforts. We manually annotated the questions and grouped them into six main question types. Contrary to our expectations, the performance of the second submitted run is also worse for factual and definition questions. As we had considered these question types in previous evaluations, we expected better coverage in the Answer Filter and therefore an improvement. Similar behaviour has been observed across answer types for factual questions, the class of TIMEX questions being the only one where the more complex configuration really improves.

Our analysis of the errors shows that further work is needed to cope with the complexities of the domain. For example, questions are in general more complex and include a large amount of domain-specific terminology that our question analysis rules do not handle correctly. The process of finding the focus of the question, which is crucial for question classification, is especially error prone. Answer extraction also needs further adaptation to the domain for factual questions, as the typology of NEs and generalized NEs does not have wide coverage. Problems with definitions are rooted more deeply and probably require the use of different specialized retrieval strategies; this year's evidence, along with previous experiments, suggests that definitions depend heavily on the stylistics of the domain. Finally, the new question types would require further study of techniques that help to classify passages as bearing procedures, objectives, etc.

Question Type   mira091eses (BL)  mira092eses (BL+AF)  TOTAL  BL Accuracy  BL+AF Accuracy
FACTUAL         54                48                   123    0.44         0.39
PROCEDURE       22                15                   76     0.28         0.20
CAUSE           43                44                   102    0.42         0.43
REQUIREMENT     5                 5                    16     0.31         0.31
DEFINITION      16                12                   106    0.16         0.11
OBJECTIVE       21                23                   77     0.27         0.30
ALL             161               147                  500    0.32         0.29
ALL - FACTUAL   107               99                   377    0.28         0.26

Table 2: An analysis of runs by question type

Evaluation of temporal questions

With the aim of evaluating the temporal management capabilities of the QA system, we extracted the temporal questions from the whole test set. 46 out of 500 questions denote temporal information, that is, 9.2% of the total. 24 of them are TR questions, whereas 22 are TA questions (4.8% and 4.4% of the total, respectively). This subset has been studied by evaluating the correctness of the answers returned by two different configurations of the QA system. The results are presented in Table 3.
Name                           Temporal Questions (TR + TA)  Temporally Restricted (TR)  Timex Answer (TA)
BL (mira091eses)               0.43                          0.42                        0.45
BL+AF (mira092eses)            0.48                          0.37                        0.59
DA-BL (run1 configuration)     0.28                          0.21                        0.36
DA-BL-AF (run2 configuration)  0.37                          0.21                        0.54

Table 3: Results for temporal questions in the submitted runs and other configurations

As can be observed, both runs obtain better figures on the set of temporal questions. There is no significant difference between TA and TR queries in the first run, while in the second one the gap reaches 22 points. In our opinion, the second configuration, with answer filtering and answer creation, enhances precision for TA queries, whereas for TR queries the temporal restrictions introduce noise that the system is not able to resolve.

The non-submitted runs have configurations similar to the submitted ones, but adopt different index generation and question analysis strategies. The approach consisted of including normalized temporal expressions in the index, as well as in the question analysis process, aiming to increase recall. We tested the performance over the full question set, but worse results were obtained even when the study is restricted to temporal questions. These results, also presented in Table 3, show no improvement over the submitted runs. The performance difference between TA and TR queries remains stable, since the system responds better to questions without temporal restrictions. The loss of accuracy may be due to the lack of a more sophisticated inference mechanism at retrieval time, capable of reasoning with different granularities in the normalized date format [10]. In addition, we suspect that the answer selection module is not filtering candidate answers properly: the current inference mechanism gives more weight to paragraphs containing dates that match the restrictions in the query, while the rest of the terms lose relevance. Although relative dates have a low frequency in the collection, they are not being correctly resolved, since the reference date, taken from the document creation date, is always set to the same value. The kind of granularity-aware matching we have in mind is sketched below.
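As a first approximation to reasoning over granularities, note that ISO 8601 makes coarser dates prefixes of finer ones, so compatibility can be tested by prefix matching. The helper below is hypothetical, not part of the system:

public class TemporalMatch {
    /**
     * True if a candidate date is compatible with a (possibly coarser)
     * restriction: "1994-10" accepts "1994-10-10" and "1994",
     * but not "1994-11-03".
     */
    public static boolean satisfies(String restriction, String candidate) {
        // ISO 8601 prefixes encode granularity: "1994" < "1994-10" < "1994-10-10".
        return candidate.startsWith(restriction) || restriction.startsWith(candidate);
    }
}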
4 Conclusion and Future Work

From our point of view, the new ResPubliQA exercise challenges QA systems in two main facets of the problem: domain adaptation and multilinguality. This year our efforts focused on the first problem, porting the system and the techniques developed for EFE and Wikipedia to the new legal collection, JRC-Acquis. However, our experiments, exemplified by the submitted runs, show that a system based mainly on passage retrieval performs quite well. The baseline passage retrieval results provided by the organizers [11] also support this. We are carrying out further experiments using the larger test set in order to find out how answer selection could help with ResPubliQA questions, as well as to study the differences between passage retrieval alternatives.

Regarding our focus on temporal reasoning applied to QA, we will explore how the temporal constraints of a question can be integrated at other steps of the process. We expect to compare the effectiveness of temporal reasoning as a constraint for filtering answers and for re-ranking. Finally, further work on the general architecture of the QA system is expected to help in at least three areas: separating domain knowledge from general techniques, adding new languages to the system, and effective evaluation.

Acknowledgements

This work has been partially supported by the Regional Government of Madrid through the Research Network MAVIR (S-0505/TIC/000267) and by the Spanish Ministry of Education through the project BRAVO (TIN2007-67407-C3-01).

References

[1] Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 24-26 May 2006.
[2] Martínez-González, A., de Pablo-Sánchez, C., Polo-Bayo, C., Vicente-Díez, M.T., Martínez-Fernández, P., Martínez-Fernández, J.L. 2008. The MIRACLE Team at the CLEF 2008 Multilingual Question Answering Track. In Proceedings of the 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers. LNCS (to appear).
[3] Apache Lucene project. The Apache Software Foundation. http://lucene.apache.org/, visited 30/07/2009.
[4] Saquete, E. 2005. Resolución de Información Temporal y su Aplicación a la Búsqueda de Respuestas. PhD thesis in Computer Science, Universidad de Alicante.
[5] Saquete, E., Martínez-Barco, P., Muñoz, R., Vicedo, J.L. 2004. Splitting Complex Temporal Questions for Question Answering Systems. In Proceedings of the ACL 2004 Conference, Barcelona, Spain.
[6] de Rijke, M. et al. 2004-2007. Inference for Temporal Question Answering project. OND1302977.
[7] Hartrumpf, S., Leveling, J. 2006. University of Hagen at QA@CLEF 2006: Interpretation and normalization of temporal expressions. In Results of the CLEF 2006 Cross-Language System Evaluation Campaign, Working Notes for the CLEF 2006 Workshop, Alicante, Spain.
[8] Clark, C., Moldovan, D. 2005. Temporally Relevant Answer Selection. In Proceedings of the 2005 International Conference on Intelligence Analysis, May 2005.
[9] Vicente-Díez, M.T., Martínez, P. 2009. Aplicación de técnicas de extracción de información temporal a los sistemas de búsqueda de respuestas. Procesamiento del Lenguaje Natural, No. 42 (March 2009), pp. 25-30.
[10] ISO 8601:2004(E). Data elements and interchange formats – Information interchange – Representation of dates and times. Third edition, 2004.
[11] Pérez, J., Garrido, G., Rodrigo, A., Araujo, L., Peñas, A. 2009. Information Retrieval Baselines for the ResPubliQA task. CLEF 2009 Working Notes.