Finding Answers to Indonesian Questions from English Documents

Sri Hartati Wijono, Indra Budi, Lily Fitria, and Mirna Adriani
Faculty of Computer Science, University of Indonesia
Depok 16424, Indonesia
{shw50, indra, lifi50, mirna}@cs.ui.ac.id

Abstract. This report describes the results of our participation in the Indonesian-English question-answering task of the 2006 Cross-Language Evaluation Forum (CLEF). In this work we translated an Indonesian query set into English using a machine translation tool available on the Internet. Documents relevant to a question are retrieved first, and each relevant document is then divided into passages of five sentences. The answer to the question is extracted from the passage most relevant to the query and is identified by matching annotations between the query and the documents; a linguistic tool is used to annotate the words in the queries and the documents.

1 Introduction

This year we participated in the bilingual question-answering (QA) task of the Cross-Language Evaluation Forum (CLEF) 2006, namely Indonesian-English QA. Finding the correct answer to a question in a collection of documents is a challenging task, and it is the main research topic of the CLEF QA track. Both the question and the documents must be analyzed in order to find the right answer. Several techniques have been used to handle the QA task, such as parsing and tagging the sentences in the question [6], in the documents [4], in paragraphs [8], and in passages [2, 3].

2 The Question Answering Process

The process of finding the answer to a query in the document collection proceeds in a number of stages. First, the original CLEF English queries were translated into Indonesian manually. Next, we classified the Indonesian questions (queries) according to their question type, which we identified from the question word used in each query. The Indonesian questions were then translated back into English using Toggletext (see http://www.toggletext.com), a machine translation tool available on the web. We learned from our previous work [1] that freely available dictionaries on the Internet did not provide sufficiently good translation terms, as their vocabulary was very limited; we hoped to achieve better results using a machine translation approach. The resulting English queries were then used to retrieve relevant documents from the collection by means of an information retrieval (IR) system. The contents of a number of top-ranked documents were split into passages, and the passages were tagged using a linguistic annotation tool to identify the types of the words they contain. Finally, the passages were scored using an algorithm, and the answer to the question was extracted from the passage with the highest score.

2.1 Categorizing the Questions

Each question category, identified by the question word in the question, points to the type of answer that is looked for in the documents. The Indonesian question words used in the categorization are:

- dimana, dimanakah, manakah (where) points to <location>
- apakah nama (what) points to <name>
- siapa, siapakah (who) points to <person>
- berapa (how many) points to <number>
- kapan (when) points to <date>
- organisasi apakah (what organization) points to <organization>
- apakah nama (which) points to <name>
- sebutkan (name) points to <other>

By identifying the question type, we can predict the kind of answer that we need to look for in the documents. Each Indonesian question was tagged using a question tagger that we developed, based on the question word that appears in the question; this approach is similar to those used by Clarke et al. [2], Hull [4], and others [7, 8]. However, we ignored the question tags when running the query through the IR system to retrieve documents.
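As an illustration of this categorization step, the following minimal Python sketch maps an Indonesian question to its expected answer type by keyword lookup. It is only a sketch of the idea: the mapping mirrors the list above, while the function and variable names are our own illustrative choices, not the actual question tagger used in the experiments.

# Keyword-based question classification (illustrative sketch only).
QUESTION_TYPES = {
    "dimanakah": "location", "dimana": "location", "manakah": "location",
    "organisasi apakah": "organization",
    "apakah nama": "name",
    "siapakah": "person", "siapa": "person",
    "berapa": "number",
    "kapan": "date",
    "sebutkan": "other",
}

def classify_question(question):
    """Return the expected answer type for an Indonesian question."""
    q = question.lower()
    # Try longer cues first so that "organisasi apakah" wins over "apakah".
    for cue in sorted(QUESTION_TYPES, key=len, reverse=True):
        if cue in q:
            return QUESTION_TYPES[cue]
    return "other"  # fallback for unrecognized question words

print(classify_question("Dimanakah ibu kota Somalia?"))  # -> location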
2.2 Building Passages

The Indonesian question was translated into English by the machine translation tool, and the resulting English query was run through an information retrieval system to retrieve a list of relevant documents. We used the Lemur information retrieval system (see http://www.lemurproject.org/) to index and retrieve the documents. The contents of the top 10 retrieved documents were split into passages of five sentences each, where the last sentence of a passage is repeated as the first sentence of the next passage; sentence boundaries were identified using a sentence parser. The passages were then indexed by Lemur, and the queries were run against this index to obtain the top 10 passages, which were then scored in order to find the answers to the queries (a sketch of the passage construction is given at the end of Sect. 2.5).

2.3 Tagging the Passage

The passages were then run through an entity tagger to obtain entity annotation tags. The entity tagger identifies words of known entity types and tags them with the corresponding entity type tag, such as person, location, or organization; for example, the word UN is identified as an organization and therefore receives the organization tag. In this work we used the linguistic tagger Gate (see http://www.gate.shef.ac.uk/). Gate analyzes English words and annotates them with tags indicating location, organization, and person, where applicable. The annotation tags are used to find candidate answers matching the type of the question: a word with a location tag is a good candidate answer to a where question, and a word with a person tag is a good candidate answer to a who question.

2.4 Scoring the Passages

Passages were scored based on their probability of answering the question. The scoring rules are as follows (see the second sketch at the end of Sect. 2.5):

1. Give a passage a score of 1 if it contains a tag of the same type as the question tag, and 0 otherwise.
2. Add 1 if a word in the passage matches a word in the query.
3. Add 1 if the number of query words found in the passage is more than half of the total number of query words.

Once the passages had been scored, the 10 highest-scoring passages containing the appropriate tags (e.g., if the question type is person, i.e., the question word is who, the passages must contain a person tag) were taken to the next stage.

2.5 Finding the Answer

The top 10 passages were analyzed to find the best answer. The probability of a word being the answer to the question is taken to be inversely proportional to the number of words in the passage separating the candidate word from a query word. For each word that has the appropriate tag, its distance to a query word found in the passage is computed, and the candidate word with the smallest distance is the final answer to the question. For example:

- Question: What is the capital of Somalia?
- Passage: "Here there is no coordination. Steffan de Mistura – UNICEF representative in the Somali capital, Mogadishu, and head of the anti-cholera team – said far more refugees are crowded together here without proper housing or sanitation than during the Somalia crisis. And many are already sick and exhausted by the long trek from Rwanda."

The distance between the question word capital and Mogadishu is 1, while the distance between capital and Rwanda is 38. Mogadishu therefore becomes the final answer, since its distance to the question word capital is the smallest.
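To make the pipeline concrete, two short Python sketches follow. The first illustrates the passage construction of Sect. 2.2: a document, already split into sentences, is divided into five-sentence passages in which the last sentence of each passage is repeated as the first sentence of the next. The sketch assumes sentence splitting has already been performed; the function name and defaults are illustrative, not taken from our actual implementation.

def build_passages(sentences, size=5):
    """Split a list of sentences into overlapping passages of `size`
    sentences; the last sentence of each passage is repeated as the
    first sentence of the next, so the window advances size - 1."""
    if not sentences:
        return []
    passages = []
    step = size - 1
    for start in range(0, len(sentences), step):
        passages.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break
    return passages

# e.g., nine sentences s1..s9 yield two passages sharing sentence s5:
# build_passages(["s1", "s2", ..., "s9"]) -> ["s1 ... s5", "s5 ... s9"]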
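The second sketch covers the passage scoring of Sect. 2.4 and the distance-based answer selection of Sect. 2.5. It assumes each passage arrives as a token list with a parallel list of entity tags (one tag or None per token), standing in for the Gate annotations; all names are again illustrative choices rather than our actual code.

def score_passage(tokens, tags, query_words, question_type):
    """Score a passage with the three rules of Sect. 2.4.
    query_words: lowercased set of query terms."""
    token_set = {t.lower() for t in tokens}
    matched = token_set & query_words
    score = 1 if question_type in tags else 0  # rule 1: tag matches question type
    if matched:                                # rule 2: a passage word matches the query
        score += 1
    if len(matched) > len(query_words) / 2:    # rule 3: over half the query words found
        score += 1
    return score

def extract_answer(tokens, tags, query_words, question_type):
    """Return the appropriately tagged token closest to any query word (Sect. 2.5).
    Distance is the token-index difference, a proxy for the number of
    separating words."""
    query_positions = [i for i, t in enumerate(tokens) if t.lower() in query_words]
    candidates = [i for i, g in enumerate(tags) if g == question_type]
    if not query_positions or not candidates:
        return None
    best = min(candidates,
               key=lambda i: min(abs(i - q) for q in query_positions))
    return tokens[best]

On the Somalia example above, Mogadishu (tagged as a location, one word away from the query word capital) would be chosen over Rwanda (38 words away).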
3 Experiment

Our work focused on the bilingual task, using Indonesian questions to find answers in English documents. The Indonesian questions were obtained by manually translating the original English questions, and were then translated back into English with the online machine translation tool Toggletext in order to retrieve relevant English documents from the collection. Using the Gate tagger to tag the words in the passages, only 14 correct answers were found (see Table 1); there were also 4 inexact (ambiguous) answers, 13 unsupported answers, and 159 wrong answers.

Table 1. Evaluation of the bilingual QA results

Judgment          Number of answers
R (right)                        14
X (inexact)                       4
U (unsupported)                  13
W (wrong)                       159

Our results show that we need to find ways to improve the effectiveness of our method in finding correct answers, in particular by reducing the number of incorrectly tagged words. We also learned that expanding the translated queries by adding related terms could help, as more relevant documents could then be retrieved for the QA algorithm to work on.

4 Summary

Our performance in this year's QA task still needs improvement. In our recent work we have managed to improve the QA results by applying query expansion and different passage-scoring techniques, and we hope that applying such techniques will result in better performance next year.

References

1. Adriani, M., van Rijsbergen, C.J.: Term Similarity Based Query Expansion for Cross-Language Information Retrieval. In: Proceedings of Research and Advanced Technology for Digital Libraries (ECDL'99), pp. 311-322. Springer, Paris (1999)
2. Clarke, C.L.A., Cormack, G.V., Kisman, D.I.E., Lynam, T.R.: Question Answering by Passage Selection. In: NIST Special Publication: The 9th Text REtrieval Conference (TREC-9) (2000)
3. Clarke, C.L.A., Cormack, G.V., Lynam, T.R.: Exploiting Redundancy in Question Answering. In: Proceedings of ACM SIGIR, New Orleans (2001)
4. Hull, D.: Xerox TREC-8 Question Answering Track Report. In: NIST Special Publication: The 8th Text REtrieval Conference (TREC-8) (1999)
5. Li, X., Croft, W.B.: Evaluating Question-Answering Techniques in Chinese. In: NIST Special Publication: The 10th Text REtrieval Conference (TREC 2001) (2001)
6. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA (1999)
7. Moldovan, D., et al.: Lasso: A Tool for Surfing the Answer Net. In: NIST Special Publication: The 8th Text REtrieval Conference (TREC-8) (1999)
8. Pasca, M., Harabagiu, S.: High Performance Question Answering. In: Proceedings of ACM SIGIR, New Orleans (2001)