Developing a Question Answering System for the Romanian-English Track at CLEF 2006

Georgiana Puşcaşu†§\, Adrian Iftene†, Ionuţ Pistol†, Diana Trandabăţ†, Dan Tufiş∗, Alin Ceauşu∗, Dan Ştefănescu∗, Radu Ion∗, Constantin Orăsan§, Iustin Dornescu†, Alex Moruz†, Dan Cristea†

† Faculty of Computer Science, ”Alexandru Ioan Cuza” University, Romania
{adiftene, ipistol, dtrandabat, idornescu, amoruz, dcristea}@infoiasi.ro
∗ Centre for Artificial Intelligence, Romanian Academy, Romania
{tufis, alceausu, danstef, radu}@racai.ro
§ Research Group in Computational Linguistics, University of Wolverhampton, UK
{georgie, C.Orasan}@wlv.ac.uk
\ Department of Software and Computing Systems, University of Alicante, Spain
{georgie}@dlsi.ua.es

Abstract

This paper describes the development of a question answering system for the Romanian-English cross-lingual track organized within the Cross-Language Evaluation Forum (CLEF) 2006 campaign. The development stages of our cross-lingual Question Answering (QA) system are described incrementally throughout the paper, pinpointing at the same time the problems that occurred and the way they were addressed. Our system adheres to the classical architecture for QA systems, starting with question processing, followed, after term translation, by information retrieval and answer extraction. Besides the common QA difficulties, the track posed some specific problems, such as the lack of a reliable translation engine from Romanian to English, and the need to evaluate each module individually for a better insight into the system's failures.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; I.2 [Artificial Intelligence]: I.2.7 Natural Language Processing

General Terms

Measurement, Performance, Experimentation

Keywords

Question Answering, Cross-lingual Question Answering

1 Introduction

Question Answering can be defined as the task which takes a question in natural language and produces one or more ranked answers from a collection of documents. The QA research area truly emerged as a result of the introduction of a monolingual English QA track within the Text REtrieval Conference (TREC). Multilingual, cross-lingual, as well as monolingual QA for languages other than English are addressed at a scientific level by the CLEF evaluation campaigns [7].

This year our team1 of students, PhD students and researchers has taken part for the first time in the QA@CLEF competition. As in every other cross-lingual task, the system input consisted of 200 questions in Romanian, and the output was expected to be the exact English answer. As this was the first time the Romanian-English track was introduced at CLEF, most of the effort was directed towards the development of a fully functional cross-lingual QA system having Romanian as source language, and less towards fine-tuning the system to maximize its results. Following the generic architecture for QA systems [2], our system contains a question analyzer, an information retrieval engine and an answer extraction module, as well as a cross-lingual QA specific module which translates the relevant question terms from Romanian into English. This paper describes the steps taken in the development of our cross-lingual QA system for the Romanian-English track at CLEF 2006, as well as its evaluation.

1 The team that participated in the development of the system included students, PhD students and researchers from the Faculty of Computer Science at the ”Alexandru Ioan Cuza” University, Iasi, Romania (hereafter referred to as UAIC), from the Centre for Artificial Intelligence of the Romanian Academy (RACAI), from the University of Alicante, Spain, and from the University of Wolverhampton, United Kingdom.
The remainder of the paper is structured as follows: Section 2 provides a description of the system and its embedded modules, while Section 3 presents the details of the submitted runs. Section 4 comprises an analysis of the results obtained, and finally, in Section 5, conclusions are drawn and future directions of system development are considered.

2 System Description

2.1 System Overview

Question Answering systems normally adhere to a pipeline architecture that consists of three main stages: question analysis, paragraph retrieval and answer extraction [2]. Specific systems can be seen as instantiations of this general architecture, with particular choices being made concerning the representation and processing used by each component of the overall model.

The first stage is question analysis. The input to this stage is a natural language question and the output is one or more representations of the question to be used in subsequent stages. At this stage most systems identify the semantic type of the entity sought by the question, determine additional constraints on the answer entity, and extract the keywords to be employed at the passage retrieval stage. The next stage, passage retrieval, is typically achieved by employing a conventional IR search engine to select a set of relevant candidate passages from the text collection. At the last stage, answer extraction and ranking, the representation of the question and the representation of the candidate answer-bearing passages are compared and a set of candidate answers is produced, ranked according to how likely they are to constitute the correct answer.

Apart from the typical modules included in a QA system, our system also includes a module which translates terms from Romanian into English, in order to perform the cross-lingual transfer. The system architecture and functionality are illustrated in Figure 1.

Figure 1: System Architecture and Functionality

2.2 NLP Pre-processing

The questions are first morpho-syntactically pre-processed using the Romanian POS tagger developed at RACAI [8]. Afterwards, a pattern-based Named Entity Recognizer is employed to identify Named Entities classified in one of the following classes: Person, Location, Measure and Miscellaneous. Similar pre-processing operations are carried out on the English document collection, after segmenting it at sentence level. We used the same set of tools with a different language model.
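To make the pattern-based recognition step more concrete, the following minimal Python sketch shows one way such a recognizer could be organized. The regular expressions and the gazetteer entries are invented for illustration only; they are not the patterns or resources actually used by our recognizer, which works on both Romanian and English with language-specific resources.

```python
import re

# Illustrative resources only; the actual recognizer relies on richer,
# language-specific patterns and lists for Romanian questions and English documents.
LOCATION_GAZETTEER = {"Romania", "Bucharest", "Moscow", "United States"}
MEASURE_PATTERN = re.compile(r"\b\d+(?:[.,]\d+)?\s*(?:km|kg|metres?|meters?|litres?|%)")
PERSON_PATTERN = re.compile(r"\b(?:Mr|Mrs|Dr|President|Professor)\.?\s+(?:[A-Z][a-z]+\s?)+")
CAPITALISED = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def recognize_entities(sentence: str):
    """Return a list of (entity text, class) pairs found in one sentence."""
    entities = []
    for match in MEASURE_PATTERN.finditer(sentence):
        entities.append((match.group(0), "MEASURE"))
    for match in PERSON_PATTERN.finditer(sentence):
        entities.append((match.group(0).strip(), "PERSON"))
    for match in CAPITALISED.finditer(sentence):
        text = match.group(0)
        if any(text == ent for ent, _ in entities):
            continue  # already labelled by a more specific pattern
        # Gazetteer lookup decides LOCATION; other capitalised sequences
        # fall back to the MISCELLANEOUS class.
        cls = "LOCATION" if text in LOCATION_GAZETTEER else "MISCELLANEOUS"
        entities.append((text, cls))
    return entities


if __name__ == "__main__":
    print(recognize_entities("Dr Anthony Busuttil travelled 300 km from Bucharest."))
```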
2.3 Question Analysis

This stage is mainly concerned with the identification of the semantic type of the entity sought by the question (the expected answer type). In addition, it identifies the question focus, the question type and the set of keywords relevant for the question. To achieve these goals, our question analyzer performs the following steps:

a) NP-chunking, Named Entity extraction, Temporal Expression identification

The input at this stage consists of the pre-processed set of questions. On the basis of the morpho-syntactic annotation provided by our Romanian POS-tagger, we have implemented a rule-based shallow noun phrase identifier. The Named Entity recognizer employed at the pre-processing stage provides us with the set of question Named Entities. Temporal expressions (TEs) are also identified, using the adaptation to Romanian of the TE identifier and normaliser developed by [6].

b) Question focus identification

The question focus is the word or word sequence that defines or disambiguates the question, in the sense that it pinpoints what the question is searching for or what it is about. We considered the question focus to be either the noun determined by the question stem (as in What country), or the head noun of the first question NP if this NP comes before the main verb of the question or if it follows the verb “to be”.

c) Distinguishing the expected answer type

At this stage we identify the category of the entity expected as an answer to the analyzed question. Our system's answer type taxonomy distinguishes the following classes: PERSON, LOCATION, ORGANIZATION, TEMPORAL, NUMERIC, DEFINITION and GENERIC. The assignment of a class to an analyzed question is performed using the question stem and the type of the question focus. The question focus type is detected using the WordNet [1] sub-hierarchies specific to the categories PERSON / LOCATION / ORGANIZATION. We manually attach to each category a set of ILI numbers representing the root nodes of the WordNet sub-trees containing category-specific nouns. Using the ILI numbers, we extract from the Romanian WordNet [9] lists of nouns for each of the three categories. In the case of ambiguous question stems (e.g. What), we search the resulting lists for the head of the question focus, and identify as expected answer type the category of the corresponding list (for example, in the case of the question In which city was Vladislav Listyev murdered?, the question focus is city, a noun which appears in the LOCATION list, therefore the associated expected answer type is LOCATION).

d) Inferring the question type

This year, the QA@CLEF competition has distinguished among four types of questions: factoid, definition, list and temporally restricted questions. As temporal restrictions can constrain any type of question, we proceed by first detecting whether the question is of type factoid, definition or list, and then testing for the existence of temporal restrictions. The question type is identified using two simple rules: if the expected answer type is DEFINITION, then obviously the question type is definition; if the question focus is a plural noun, then the question type is list; otherwise it is factoid. The temporal restrictions are identified using several patterns and the information provided by the TE identifier.

e) Keyword set generation

The set of keywords is automatically generated by listing the important question terms in decreasing order of their relevance. Therefore, the set of keywords comprises: the question focus, the identified NEs and TEs, the remaining noun phrases, and all the non-auxiliary verbs present in the question. This set is then passed on to the Term Translation module, in order to obtain English keywords for paragraph retrieval. A minimal sketch of the answer type and question type heuristics described in steps c) and d) is given below.
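The Python sketch below illustrates the two classification heuristics just described. The small category noun lists are hypothetical stand-ins for the lists extracted from the Romanian WordNet through the ILI numbers, the stem tests are simplified assumptions (DEFINITION detection is omitted), and the plural test is reduced to a boolean flag that, in the real system, comes from the morpho-syntactic annotation.

```python
# Drastically reduced, hypothetical stand-ins for the noun lists extracted from
# the Romanian WordNet sub-hierarchies via the manually selected ILI roots.
CATEGORY_NOUNS = {
    "PERSON": {"president", "actor", "writer", "leader"},
    "LOCATION": {"city", "country", "river", "capital"},
    "ORGANIZATION": {"company", "party", "organization", "university"},
}


def expected_answer_type(stem: str, focus_head: str) -> str:
    """Assign an answer type from the question stem and the focus head noun."""
    stem = stem.lower()
    if stem == "who":
        return "PERSON"
    if stem == "where":
        return "LOCATION"
    if stem == "when":
        return "TEMPORAL"
    if stem in ("how many", "how much"):
        return "NUMERIC"
    # Ambiguous stems such as "what" / "which": look up the focus head
    # in the category-specific noun lists derived from WordNet.
    for category, nouns in CATEGORY_NOUNS.items():
        if focus_head.lower() in nouns:
            return category
    return "GENERIC"


def question_type(answer_type: str, focus_is_plural: bool) -> str:
    """The two simple rules from step d): definition, list, otherwise factoid."""
    if answer_type == "DEFINITION":
        return "definition"
    if focus_is_plural:
        return "list"
    return "factoid"


# Example: "In which city was Vladislav Listyev murdered?"
answer_type = expected_answer_type("which", "city")           # -> "LOCATION"
q_type = question_type(answer_type, focus_is_plural=False)    # -> "factoid"
```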
2.4 Term Translation

In order to achieve term translation we have employed WordNet, a resource available both for English and for Romanian. The set of keywords extracted at the question analysis stage served as input for the term translation stage; therefore, we had to translate both noun phrases and verbs. The noun phrases are translated in a different manner than the verbs.

The NP words are translated individually, by first identifying the Romanian synsets containing the word, then reaching the corresponding English synsets through the ILI numbers, and finally forming the set of all possible translations. From this set, using empirical rules based on word frequency, we choose a maximum of three candidates. If the word to be translated does not appear in the Romanian WordNet, as was quite frequently the case, we search for it in other available dictionaries and preserve the first three translations. If still no translations are found, we consider the word itself as its translation. After each individual word is translated, we employ rules that map Romanian NP syntax onto English syntax. In this manner we obtain several translation equivalents for each NP.

In the case of verbs, we extract for each verb the translation equivalents from WordNet as we did for nouns. Attempts to apply the frequency-based methodology used in the case of nouns failed, because some very general verbs were preferred instead of the correct translation equivalent. To this end, we decided to select the best translation equivalent by considering both the frequency of the verb and that of the nouns which appear in the subject and object positions. To achieve this, the verb-noun co-occurrence data described in [5] has been used to determine which of the verb translation equivalents co-occurs most often with the translated nouns, and that equivalent was selected as the translation. Despite the simplicity of the method, the accuracy of the translation improved dramatically over the raw frequency baseline.
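The sketch below gives a rough picture of this candidate selection logic. The lexicon entries, frequencies and co-occurrence counts are placeholders invented for the example; in the system the translation equivalents come from the aligned Romanian and English WordNets (via ILI numbers) or from back-up dictionaries, and the co-occurrence counts come from the data described in [5].

```python
from typing import Dict, List, Tuple

# Placeholder lexical resources (illustrative values only).
NOUN_TRANSLATIONS: Dict[str, List[str]] = {"oraş": ["city", "town", "burg"]}
VERB_TRANSLATIONS: Dict[str, List[str]] = {"a ucide": ["kill", "murder", "slay"]}
WORD_FREQUENCY: Dict[str, int] = {"city": 900, "town": 700, "burg": 5,
                                  "kill": 800, "murder": 300, "slay": 20}
# Placeholder verb-noun co-occurrence counts: (verb, noun) -> count.
COOCCURRENCE: Dict[Tuple[str, str], int] = {("murder", "journalist"): 40,
                                            ("kill", "journalist"): 25,
                                            ("slay", "journalist"): 1}


def translate_noun(word: str, max_candidates: int = 3) -> List[str]:
    """Keep at most three equivalents, preferring the most frequent ones.
    If no translation is known, the word itself is kept."""
    candidates = NOUN_TRANSLATIONS.get(word, [])
    if not candidates:
        return [word]
    ranked = sorted(candidates, key=lambda w: WORD_FREQUENCY.get(w, 0), reverse=True)
    return ranked[:max_candidates]


def translate_verb(verb: str, translated_nouns: List[str]) -> str:
    """Pick the equivalent that co-occurs most often with the translated nouns
    (subject/object candidates), falling back to raw frequency on ties."""
    candidates = VERB_TRANSLATIONS.get(verb, [verb])

    def score(equivalent: str) -> Tuple[int, int]:
        cooc = sum(COOCCURRENCE.get((equivalent, noun), 0) for noun in translated_nouns)
        return (cooc, WORD_FREQUENCY.get(equivalent, 0))

    return max(candidates, key=score)


nouns = translate_noun("oraş")                       # -> ["city", "town", "burg"]
verb = translate_verb("a ucide", ["journalist"])     # -> "murder"
```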
2.5 Index Creation and Paragraph Retrieval

As described in Section 2.2, the English corpus was initially pre-processed using tokenization, lemmatization, POS-tagging and NE recognition tools. In our runs, we employed an index/search engine based on the Lucene [4] search engine. The document collection was indexed both at document and at paragraph level, using the lemmas of the content words and the NE classes (MEASURE, PERSON, LOCATION, etc.). For a query combining, for instance, the MEASURE class with a set of translation equivalents, the search engine looks for a segment (document/paragraph) that contains a measure and words belonging to the set of translation equivalents. The question type is used to approximate the maximum number of hits to be returned. When no paragraphs are returned for a given query, we employ two strategies: we either increase the granularity of the segments from paragraphs to documents, or we reformulate the query, allowing spelling variations for individual words and variable word distance for phrases.

2.6 Answer Extraction

Two answer extraction modules have been developed, one by UAIC and another one by RACAI. Both modules require as input the expected answer type, the question focus, the set of keywords, and the retrieved snippets together with their POS, lemma and NE annotation, as well as the relevance score returned by Lucene.

The extraction process depends on whether the expected answer type is a Named Entity or not. When the answer type is a Named Entity, the answer extraction module identifies within each retrieved sentence the Named Entities bearing the desired answer type. Therefore, in this case, the answer extraction process greatly depends on the results of the NE recognition module. When the answer type is not a Named Entity, the extraction process mainly relies on the recognition of the question focus, as in this case the syntactic answer patterns built around the focus are crucial. Both modules use the same pattern-based extractor to address DEFINITION questions.

a) Answering questions asking about Named Entities

When the question asks about a Named Entity such as MEASURE, PERSON, LOCATION, ORGANIZATION or DATE, the UAIC answer extractor looks for all the expressions tagged with the desired answer type. If several such expressions exist, we choose the one closest to the focus in terms of word distance, provided the focus is present in the sentence; otherwise we choose the first one occurring in the analyzed sentence. When there is no named entity of the desired type, we generalize the search using synonyms of the focus extracted from WordNet.

The RACAI answer extractor computes a score for each snippet based on whether the snippet contains the focus, on the percentage of keywords or keyword synonyms present in the snippet, and on the snippet/document relevance scores provided by the search engine. The entities having the desired answer type are identified and added to the set of candidate answers. For each candidate answer, another score is computed on the basis of the snippet score and of the distance to the question focus and to the other keywords. The candidate answers having scores above a certain threshold are presented as final answers (this scoring scheme is sketched below, after the remaining answering strategies).

b) Answering questions looking for GENERIC answers

When the expected answer type is not a Named Entity (the question has the GENERIC answer type), the UAIC answer extractor locates the answer within the candidate sentence using syntactic patterns. The syntactic patterns for identifying the answer include the focus noun phrase and the answer noun phrase, which can be connected by other elements such as commas, quotation marks, prepositions or even verbs. These syntactic patterns always include the focus of the question. Therefore, the focus has to be determined by the question analysis module in order to enable the system to identify answers consisting of a noun or verb phrase.

The RACAI answer extractor also employs patterns to select answer candidates. These candidates are then ranked using a score similar to the one employed for answering questions asking about NEs.

c) Answering DEFINITION questions

In the case of DEFINITION questions, the candidate paragraphs are matched against a set of patterns. Each possible definition is extracted and added to a set of candidate answers, together with a score reflecting the quality of the pattern it matched. The noun phrases present in the paragraphs are also investigated to detect those NPs containing the term to be defined surrounded by other words (this operation is motivated by cases like the Atlantis space shuttle, where the correct definition for Atlantis is space shuttle). The selected NPs are added to the set of candidate answers with a low score. Then, the set of candidate answers is ordered according to the score attached to each answer and to the number of other candidate answers it subsumes. The highest ranked candidate answers are presented as final answers.
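As an illustration of the RACAI-style scoring for NE-type questions, the sketch below combines the three snippet-level factors (presence of the focus, keyword coverage, retrieval score) and then scores each candidate entity by its proximity to the focus and to the other keywords. The weights, the proximity formula and the threshold are assumptions made for the example; the paper does not specify them, and the retrieval score is assumed to be normalised to [0, 1].

```python
from typing import List


def snippet_score(tokens: List[str], focus: str, keywords: List[str],
                  retrieval_score: float) -> float:
    """Combine focus presence, keyword coverage and the (normalised)
    search engine relevance score. The weights are illustrative."""
    has_focus = 1.0 if focus in tokens else 0.0
    coverage = sum(1 for k in keywords if k in tokens) / max(len(keywords), 1)
    return 0.4 * has_focus + 0.4 * coverage + 0.2 * retrieval_score


def candidate_score(tokens: List[str], position: int, focus: str,
                    keywords: List[str], snip_score: float) -> float:
    """Score one candidate entity occurring at `position` in the snippet:
    the closer it is to the focus and to the other keywords, the better."""
    anchors = [i for i, tok in enumerate(tokens) if tok == focus or tok in keywords]
    if not anchors:
        return snip_score * 0.5
    avg_distance = sum(abs(position - i) for i in anchors) / len(anchors)
    proximity = 1.0 / (1.0 + avg_distance)      # 1 when adjacent, tends to 0 when far
    return snip_score * proximity


def extract_answers(snippets, focus, keywords, answer_type, threshold=0.1):
    """Return candidate (entity, score) pairs above the threshold, best first.
    Each snippet is (tokens, entities, retrieval_score), where entities is a
    list of (entity text, NE class, token position) triples."""
    candidates = []
    for tokens, entities, retrieval_score in snippets:
        s_score = snippet_score(tokens, focus, keywords, retrieval_score)
        for text, ne_class, position in entities:
            if ne_class != answer_type:
                continue
            candidates.append((text, candidate_score(tokens, position, focus,
                                                     keywords, s_score)))
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [c for c in ranked if c[1] >= threshold]
```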
3 Description of Submitted Runs

We have submitted three different runs, described in detail below:

• UAIC: This run was obtained by parsing and analyzing the questions, translating the keywords, retrieving relevant passages and finding the final answers using the UAIC answer extractor.

• RACAI: This run was also obtained by parsing and analyzing the questions, translating the keywords and retrieving passages, but the answers were extracted using the RACAI answer extractor.

• DIOGENE: Our third run, which unfortunately was not evaluated, was obtained by converting the results of our Question Analysis and Term Translation modules to the format required by the DIOGENE QA system [3], and then providing them as input to the DIOGENE IR and Answer Extraction modules.

Due to the high number of submitted runs having English as target language, only the UAIC and RACAI runs were evaluated. In the remainder of this paper, the RACAI run will be referred to as System 1, and the name System 2 will be used for the UAIC run.

4 Results and Analysis

4.1 General Evaluation Results

The official evaluation results of Systems 1 and 2 are presented in Figure 2. Each submitted answer was evaluated as UNKNOWN, CORRECT, UNSUPPORTED, INCORRECT or INEXACT.

Figure 2: Evaluation results for the two evaluated runs

The high number of answers evaluated as UNKNOWN is due to the fact that we provided ten answers for almost all 200 questions, as ten was the maximum number of answers allowed. However, the official evaluation assessed only the first answer for the majority of the questions (only in the case of list type questions were the first three answers evaluated). As a result, the answers ranked 2 to 10 were labelled as UNKNOWN, indicating that no attempt was made to check their correctness. From our internal evaluation, we have noticed that the correct answers were found among the first ten answers generated by our systems for 35-40% of the questions. This is a promising result, as it shows that the answer extractors work, but a better ranking of the produced answers is required.

Most of the UNSUPPORTED and INEXACT answers are caused by the lack of a perfect match between the answer string and the supporting snippet. The majority of these errors could be solved with the use of a knowledge base, which would allow, for example, the identification of Atlantis as a space shuttle (question 101). Also, our systems sometimes provide a more generic or more specific answer than required, as is the case for the gold answer Shoemaker contrasted with our answer Carolyn Shoemaker, which was marked as UNSUPPORTED. These errors could be corrected by improving the answer extractor with more specific rules regarding the format of the required answer. Another source of UNSUPPORTED and INEXACT answers is represented by the list type questions. We provided one answer per line, even if the question was of the type list. Most of the correct answers were found among the 10 answers provided, but not on the same line. These errors could be solved with a better classification of the returned answers and by grouping them according to the characteristics of the required list. There are also cases when the answer is marked as UNSUPPORTED although it denotes the same entity as the gold answer: the answer appears in the indicated snippet, but its character string does not match the snippet exactly (for example, System 2 identifies the answer United States, but it is marked as UNSUPPORTED because in the related snippet the whitespace between the two words is replaced by a stray character). These snippet changes are introduced at the pre-processing stage.

4.2 Detailed Comparison between the two Evaluated Runs

The two runs/systems chosen for the official evaluation are the ones completely designed and implemented by the two Romanian groups involved, UAIC and RACAI. These two runs were chosen from the set of three submitted not because they provide better scores, but because their source code and the knowledge about their original implementation can be easily accessed.
As a result of this, we expect to be able to significantly improve them in the future. The performance of the two systems is compared below for each answer type separately, as the answer extraction methodology differs from one answer type to another. In the following graphs, R1 designates the normalized percentage of CORRECT answers, U1 of UNSUPPORTED answers, W1 of WRONG answers, X1 of INEXACT answers, and Z1 of unevaluated answers. Different colors correspond to different question types (F=factoid, T=temporally restricted, D=definition, L=list).

Figure 3: System comparison for the PERSON answer type

In the case of the PERSON answer type, System 1 has correctly answered a number of factoid and temporally restricted questions, while System 2 only detected correct answers for a small number of temporally restricted questions. There are many cases when the correct answer is retrieved, but not returned as the first answer. This is the case for question 13, where the gold answer Melvyn Percy appears ranked third and fifth, and for question 80, where the gold answer Anthony Busuttil appears ranked fifth. The improvement of both answer extraction modules and of the NER module employed at pre-processing needs to be seriously addressed for a future CLEF participation.

Figure 4: System comparison for the LOCATION answer type

System 1 also performs better than System 2 for questions with the LOCATION answer type. The best results have been obtained for the list questions asking about LOCATIONs (30% of the LOCATION list questions were correctly answered by System 1, while the LOCATION list questions were evaluated as UNSUPPORTED in the case of System 2). For a number of questions asking for LOCATIONs, due to the snippet pre-processing stage, answers detected correctly by System 2 are marked as UNSUPPORTED because the indicated snippet contains a ” instead of a whitespace. Therefore, we need to ensure that our answer extractor preserves the answer format exactly as it appears in the related snippet.

Figure 5: System comparison for the ORGANIZATION answer type

For questions asking about ORGANIZATIONs, System 2 achieves better results than System 1, although they are still quite low. Both systems fail mainly because our NE Recognizer does not identify Named Entities of the type ORGANIZATION. Therefore, when our search engine looks for an entity of this type, it is obviously impossible for it to retrieve any snippets containing such entities.

Figure 6: System comparison for the DATE expected answer type

By analyzing Figure 6, we deduce that the questions asking about a DATE and being at the same time temporally restricted were the most difficult questions to process. Both systems managed to find correct answers only for the factoid DATE questions; no correct answer was found for temporally restricted DATE questions. In the case of System 1, most of the temporally restricted DATE questions were considered UNSUPPORTED by the provided snippet. We can also see a large number of UNSUPPORTED answers for System 1, possibly due to truncating the extracted paragraphs to comply with the 500-byte limit, thereby leaving the answer in the truncated part. In any case, the answer extractor needs to be improved with respect to the identification of the correct extent of the required answer. System 2 correctly answered fewer factoid DATE questions than System 1.
In the future, we intend to perform a temporal pre-annotation of the corpus in order to facilitate the search for temporal expressions and to be able to provide an answer at the granularity required by the question.

Figure 7: System comparison for the MEASURE answer type

For the MEASURE answer type, our answer extractors mainly fail due to errors in the annotation introduced at the pre-processing stage. For a number of questions, we correctly select the snippets (the gold answers are present in these snippets), but we give wrong answers due to wrong annotation.

Figure 8: System comparison for GENERIC questions

All the questions that did not fit in the previous categories (PERSON, LOCATION, ORGANIZATION, DATE or MEASURE) are classified as GENERIC. For the GENERIC answer type, both systems achieve good results for list questions, but, in the case of factoid questions, most of the returned answers are wrong. Figure 8 shows that System 1 obtains significantly more correct list and definition answers than System 2.

As a general overview, we may say that System 1 outperforms System 2 for most answer types. For the next competition, we will try to exploit the beneficial features of each system, at the same time improving each of the developed modules for increased performance.

5 Conclusions

This paper describes the development stages of our cross-lingual Romanian-English QA system, as well as our participation in the QA@CLEF campaign. We have developed a basic QA system that is able to retrieve answers to Romanian questions from a collection of English documents. Adhering to the generic QA system architecture, our system implements the three essential stages (question analysis, information retrieval and answer extraction), as well as a cross-lingual QA specific module which translates the relevant question terms from Romanian into English.

Our participation in the QA@CLEF competition included three runs. Two runs were obtained by analyzing the questions, translating the keywords, retrieving relevant passages and finding the final answers using two different answer extractors. Our third run, which unfortunately was not evaluated, was provided by the DIOGENE QA system on the basis of the output given by our Question Analysis and Term Translation modules.

The results are poor for both evaluated runs, but we declare ourselves satisfied with the fact that we have managed to develop a fully functional cross-lingual QA system and that we have learned several important lessons for our future participations. A detailed result analysis has revealed a number of major system improvement directions. The term translation module, as a key stage for the performance of any cross-lingual QA system, is our first improvement target. However, the answer extraction module is the one to which we will dedicate most of our attention in order to increase its accuracy. An improved method for ranking the candidate answers will also be among our priorities.

The Romanian team's debut in the CLEF competition has enriched us with the experience of developing the first Romanian-English cross-lingual QA system, at the same time setting the scene for subsequent CLEF participations.

6 Acknowledgements

The authors would like to thank the following members of the UAIC team: Cornel Bârnă, Corina Forăscu, Maria Husarciuc, Mădălina Ioniţă, Ana Masalagiu, Gabriela Mogoş, Gabriel Negară and Marius Răschip, for their help and support at different stages of system development.
Special acknowledgements are addressed to Milen Kouylekov and Bernardo Magnini for offering to process the output of our question processor and term translation module with the IR and answer extraction modules included in DIOGENE, and for providing us with the DIOGENE run.

References

[1] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. The MIT Press, 1998.

[2] Sanda Harabagiu and Dan Moldovan. Question Answering. In Ruslan Mitkov, editor, The Oxford Handbook of Computational Linguistics, chapter 31, pages 560-582. Oxford University Press, 2003.

[3] M. Kouylekov, B. Magnini, M. Negri, and H. Tanev. ITC-irst at TREC-2003: the DIOGENE QA System. In Proceedings of the Twelfth Text REtrieval Conference (TREC-12), 2003.

[4] LUCENE. http://lucene.apache.org/java/docs/.

[5] V. Pekar, M. Krkoska, and S. Staab. Feature Weighting for Co-occurrence-based Classification of Words. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), 2004.

[6] G. Puscasu. A Framework for Temporal Resolution. In Proceedings of the 4th Conference on Language Resources and Evaluation (LREC 2004), 2004.

[7] QA@CLEF. http://clef-qa.itc.it/CLEF-2006.html, 2006.

[8] D. Tufis. Tagging with Combined Language Models and Large Tagsets. In Proceedings of the TELRI International Seminar on ”Text Corpora and Multilingual Lexicography”, 1999.

[9] D. Tufis, D. Cristea, and S. Stamou. BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In D. Tufis, editor, Romanian Journal of Information Science and Technology, Special Issue on BalkaNet. Romanian Academy, 2004.