=Paper=
{{Paper
|id=Vol-1172/CLEF2006wn-QACLEF-WhittakerEt2006
|storemode=property
|title=CLEF2006 Question Answering Experiments at Tokyo Institute of Technology
|pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-QACLEF-WhittakerEt2006.pdf
|volume=Vol-1172
|dblpUrl=https://dblp.org/rec/conf/clef/WhittakerNCDHF06a
}}
==CLEF2006 Question Answering Experiments at Tokyo Institute of Technology==
E.W.D. Whittaker, J.R. Novak, P. Chatain, P.R. Dixon, M.H. Heie and S. Furui

Dept. of Computer Science, Tokyo Institute of Technology, 2-12-1, Ookayama, Meguro-ku, Tokyo 152-8552 Japan

{edw,novakj,pierre,dixonp,heie,furui}@furui.cs.titech.ac.jp

Abstract

In this paper we present the experiments performed at Tokyo Institute of Technology for the CLEF2006 Multiple Language Question Answering (QA@CLEF) track. Our approach to question answering centres on a non-linguistic, data-driven, statistical classification model that uses the redundancy of the web to find correct answers. Using this approach a system can be trained in a matter of days to perform question answering in each of the target languages we considered: English, French and Spanish. For the cross-language aspect we employed publicly available web-based text translation tools to translate the question from the source into the corresponding target language, and then used the corresponding mono-lingual QA system to find the answers. The hypothesised correct answers were then projected back on to the appropriate closed-domain corpus. Correct and supported answer performance on the mono-lingual tasks was around 14% for both Spanish and French. Performance on the cross-lingual tasks ranged from 5% for Spanish-English to 12% for French-Spanish. Our projection method was shown not to work well: in the worst case, on the French-English task, we lost 84% of our otherwise correct answers. Ignoring the need for correct support information, the exact answer accuracy increased to 29% and 21% correct on the Spanish and French mono-lingual tasks, respectively.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Statistical classification, Cross-language, Spanish, French

1 Introduction

In this paper we describe how we applied our recently developed statistical, data-driven approach to question answering (QA) to the task of multiple language question answering in the CLEF2006 QA evaluation. The approach that we used, described in detail in previous publications [14, 15, 16], uses a noisy-channel formulation of the question answering problem and the redundancy of data on the web to answer questions. Our approach permits the rapid development of factoid QA systems in new languages given the availability of suitable question and answer training examples and a large corpus of text data such as web pages or newspaper text, as described in [17]. Although we had previously developed systems in five different languages using the same method, none of these languages except English were included in the CLEF2006 evaluation. We therefore chose to build French and Spanish systems from scratch and to participate in the French and Spanish mono-lingual tasks and the cross-language combinations of both languages together with English. Using the procedure applied successfully in [17] we developed first-cut French and Spanish systems in a couple of days and used the remaining time before the actual evaluation for system optimisation on previous years' CLEF evaluation questions¹.
Our approach is substantially different to conventional approaches to QA, though it shares elements of other statistical, data-driven approaches to factoid question answering found in the literature [1, 2, 3, 7, 11, 12, 13]. More recently, similar approaches to the answer typing employed in our system have appeared [9], although they still use linguistic notions that our approach eschews in favour of data-driven classifications. While this approach results in a small number of parameters that must be optimised to minimise the effects of data sparsity and to make the search space tractable, we largely remove the need for the numerous ad-hoc weights and heuristics that are an inevitable feature of many rule-based systems.

The software for each language-specific QA system is identical; only the training data differs. All model parameters are determined when the data is loaded at system initialisation; this typically takes only a couple of minutes to compute and the parameters do not change between questions. New data or system settings can therefore be applied easily without the need for time-consuming model re-training. Many contemporary approaches to QA require the specialised skills of native-speaking linguistic experts for the construction of the rules and databases that are often used by other QA systems. In contrast, our method allows us to include all kinds of dependencies in a consistent manner and has the important benefits that it is fully trainable and requires minimal human intervention once sufficient data is collected. This was particularly important in the case of our participation in CLEF this year since, although our French QA system was developed by a native French speaker, our Spanish system was built by a student with a conversational level of Spanish learnt in school.

Our QA systems tend to work well when there are numerous (redundant) sentences that contain the correct answer, which is why a web search engine is used to obtain nominally relevant documents. In particular, it is advantageous for good performance that the correct answer co-occurs more frequently (roughly speaking) with words from the question than other candidate answers of the same answer type do. If this is not the case, the QA system has no other information with which to differentiate the correct answer from competing alternatives of the same answer type. By using a large amount of text data that contains the question words we are essentially replacing query expansion (as performed by most QA systems) with what might be called document expansion: the documents are expanded to match the query rather than the query being expanded to match the documents. Due to the evaluation requirement that support from a fixed document collection be provided for each question, our answers must subsequently be projected on to the appropriate collection. Inevitably this is a lossy operation, as will be discussed in Section 5, and it also means we never attempt to predict "unanswerable" questions by giving "NIL" as an answer.

The rest of this paper is organised as follows: we first present, in Section 2, a summary of the mathematical framework for factoid QA as a classification task that was presented in [15]. We describe the experimental setup specific to the CLEF2006 evaluation in Section 3 and present the results obtained on each task in Section 4. A discussion and conclusions are given in Sections 5 and 6.
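Before moving to the model itself, the following minimal sketch strings together the three stages of the cross-language flow described in this introduction: translate the question, answer it with the mono-lingual web-based system, then project the answers onto the closed-domain collection. The function names `translate`, `answer_monolingual` and `project_to_collection` are placeholders for components that this paper only describes in prose; their interfaces are assumptions made for illustration.

```python
def cross_lingual_qa(question, source_lang, target_lang, collection_index, n_best=10):
    """Hypothetical end-to-end flow: translate, answer, then project.

    `translate`, `answer_monolingual` and `project_to_collection` stand in for
    the web translation tools, the mono-lingual statistical QA system and the
    Lucene-based projection step described in this paper; their exact
    signatures are assumptions of this sketch.
    """
    # 1. Translate the question from the source into the target language
    #    (Babelfish / Google Translate in the actual submissions).
    translated = translate(question, src=source_lang, dst=target_lang)

    # 2. Run the mono-lingual QA system on web data retrieved for the
    #    translated question; it returns a ranked list of candidate answers.
    candidates = answer_monolingual(translated, lang=target_lang, n_best=n_best)

    # 3. Project each hypothesised answer back onto the fixed document
    #    collection to obtain a supporting document and snippet.
    return [project_to_collection(a, translated, collection_index) for a in candidates]
```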
¹ Development of the QA system itself is relatively fast and straightforward; by far the most time-consuming part is the development of robust text download, extraction and text normalisation tools for any given language.

2 QA as statistical classification with non-linguistic features

This section is reproduced verbatim from the paper "TREC2005 Question Answering Experiments at Tokyo Institute of Technology" [14].

It is clear that the answer to a question depends primarily on the question itself but also on many other factors such as the person asking the question, the location of the person, what questions the person has asked before, and so on. Although such factors are clearly relevant in a real-world scenario they are difficult to model and also to test in an off-line mode, for example, in the context of the TREC evaluations. We therefore choose to consider only the dependence of an answer A on the question Q, where each is considered to be a string of $l_A$ words $A = a_1, \ldots, a_{l_A}$ and $l_Q$ words $Q = q_1, \ldots, q_{l_Q}$, respectively. In particular, we hypothesise that the answer A depends on two sets of features $W = \mathcal{W}(Q)$ and $X = \mathcal{X}(Q)$ as follows:

$$P(A \mid Q) = P(A \mid W, X), \qquad (1)$$

where $W = w_1, \ldots, w_{l_W}$ can be thought of as a set of $l_W$ features describing the "question-type" part of Q such as when, why, how, etc. and $X = x_1, \ldots, x_{l_X}$ is a set of $l_X$ features comprising the "information-bearing" part of Q, i.e. what the question is actually about and what it refers to. For example, in the questions Where was Tom Cruise married? and When was Tom Cruise married? the information-bearing component is identical in both cases whereas the question-type component is different.

Finding the best answer $\hat{A}$ involves a search over all A for the one which maximises the probability of the above model:

$$\hat{A} = \arg\max_A P(A \mid W, X). \qquad (2)$$

This is guaranteed to give us the optimal answer in a maximum likelihood sense if the probability distribution is the correct one. We don't know this and it's still difficult to model so we make various modelling assumptions to simplify things. Using Bayes' rule this can be rearranged as

$$\arg\max_A \frac{P(W, X \mid A) \cdot P(A)}{P(W, X)}. \qquad (3)$$

The denominator can be ignored since it is common to all possible answer sequences and does not change. Further, to facilitate modelling we make the assumption that X is conditionally independent of W given A to obtain:

$$\arg\max_A P(X \mid A) \cdot P(W \mid A) \cdot P(A). \qquad (4)$$

Using Bayes' rule, making further conditional independence assumptions and assuming uniform prior probabilities, which therefore do not affect the optimisation criterion, we obtain the final optimisation criterion:

$$\arg\max_A \underbrace{P(A \mid X)}_{\text{retrieval model}} \cdot \underbrace{P(W \mid A)}_{\text{filter model}}. \qquad (5)$$

The P(A | X) model is essentially a language model which models the probability of an answer sequence A given a set of information-bearing features X, similar to the work of [10]. It models the proximity of A to features in X. We call this model the retrieval model and do not examine it further; please refer to [14, 15, 16] for more details.

The P(W | A) model matches an answer A with features in the question-type set W. Roughly speaking, this model relates ways of asking a question with classes of valid answers. For example, it associates dates, or days of the week, with when-type questions. In general, there are many valid and equiprobable A for a given W, so this component can only re-rank candidate answers retrieved by the retrieval model.
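A minimal sketch of how the optimisation in Equation (5) can be carried out over a pool of candidate answers is given below. The `retrieval_model(A, X)` and `filter_model(W, A)` callables are hypothetical stand-ins for the two probability models; log-probabilities are summed so the ranking (and hence the arg max) is unchanged while underflow is avoided.

```python
import math

def rank_answers(candidates, W, X, retrieval_model, filter_model):
    """Rank candidate answers A by P(A | X) * P(W | A), as in Equation (5).

    `retrieval_model(A, X)` and `filter_model(W, A)` are assumed to return
    probabilities; both are placeholders for the models described in the text.
    """
    def log_score(A):
        return math.log(retrieval_model(A, X)) + math.log(filter_model(W, A))

    return sorted(candidates, key=log_score, reverse=True)
```

In practice the candidate pool is generated by the retrieval model itself, and, as noted above, the filter model can only re-rank the candidates it is given.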
If the filter model were perfect and the retrieval model were to assign the correct answer a higher probability than any other answer of the same type, the correct answer should always be ranked first. Conversely, if an incorrect answer in the same class of answers as the correct answer is assigned a higher probability by the retrieval model, we cannot recover from this error. Consequently, we call it the filter model and examine it further in the next section.

2.1 Filter model

The question-type mapping function $\mathcal{W}(Q)$ extracts n-tuples (n = 1, 2, ...) of question-type features from the question Q, such as How, How many and When were. A set of $|V_W| = 2522$ single-word features is extracted based on frequency of occurrence in questions in previous TREC question sets. Some examples include: when, where, who, whose, how, many, high, deep, long, etc.

Modelling the complex relationship between W and A directly is non-trivial. We therefore introduce an intermediate variable representing classes of example questions-and-answers (q-and-a) $c_e$ for $e = 1, \ldots, |C_E|$ drawn from the set $C_E$, and to facilitate modelling we say that W is conditionally independent of $c_e$ given A as follows:

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W, c_e \mid A) \qquad (6)$$

$$\qquad\quad\;\; = \sum_{e=1}^{|C_E|} P(W \mid c_e) \cdot P(c_e \mid A). \qquad (7)$$

Given a set E of example q-and-a $t_j$ for $j = 1, \ldots, |E|$, where $t_j = (q_1^j, \ldots, q_{l_Q^j}^j, a_1^j, \ldots, a_{l_A^j}^j)$, we define a mapping function $f : E \mapsto C_E$ by $f(t_j) = e$. Each class $c_e = (w_1^e, \ldots, w_{l_W^e}^e, a_1^e, \ldots, a_{l_A^e}^e)$ is then obtained by $c_e = \bigcup_{j : f(t_j) = e} \big( \mathcal{W}(t_j) \cup \bigcup_{i=1}^{l_A^j} a_i^j \big)$, so that:

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid w_1^e, \ldots, w_{l_W^e}^e) \cdot P(a_1^e, \ldots, a_{l_A^e}^e \mid A). \qquad (8)$$

Assuming conditional independence of the answer words in class $c_e$ given A, and making the modelling assumption that the jth answer word $a_j^e$ in the example class $c_e$ is dependent only on the jth answer word in A, we obtain:

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e) \cdot \prod_{j=1}^{l_A^e} P(a_j^e \mid a_j). \qquad (9)$$

Since our set of example q-and-a cannot be expected to cover all the possible answers to questions that may be asked, we perform a similar operation to that above to give us the following:

$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e) \prod_{j=1}^{l_A^e} \sum_{a=1}^{|C_A|} P(a_j^e \mid c_a) P(c_a \mid a_j), \qquad (10)$$

where $c_a$ is a concrete class in the set of $|C_A|$ answer classes $C_A$. The independence assumption leads to underestimating the probabilities of multi-word answers, so we take the geometric mean of the length of the answer (not shown in Equation (10)) and normalise P(W | A) accordingly.

The system using the above formulation of the filter model, given by Equation (10), is referred to as model ONE. Systems using the model given by Equation (8) are referred to as model TWO. Only systems based on model ONE were used in the CLEF2006 evaluation systems described in this paper.
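The sketch below spells out the model ONE computation of Equation (10): a sum over example classes, a per-position product over the answer words of each class, an inner sum over answer classes, and a geometric-mean length normalisation. The probability lookups (`p_W_given_ce`, `p_ae_given_ca`, `p_ca_given_a`) and the handling of length mismatches are assumptions made for this sketch; the paper does not specify those details.

```python
def filter_model_one(W, A, example_classes, answer_classes,
                     p_W_given_ce, p_ae_given_ca, p_ca_given_a):
    """Compute P(W | A) under model ONE, Equation (10) (a sketch).

    example_classes: list of example q-and-a classes c_e; each entry is a dict
                     whose "answer_words" field holds that class's answer words.
    answer_classes:  the set C_A of answer classes c_a.
    The three probability lookups are hypothetical callables standing in for
    the distributions estimated from training data.
    """
    total = 0.0
    for c_e in example_classes:
        answer_words = c_e["answer_words"]
        prod = 1.0
        # The jth example answer word is paired with the jth word of A; how
        # length mismatches are resolved is not stated, so this sketch simply
        # truncates to the shorter of the two sequences.
        for a_e_j, a_j in zip(answer_words, A):
            prod *= sum(p_ae_given_ca(a_e_j, c_a) * p_ca_given_a(c_a, a_j)
                        for c_a in answer_classes)
        # Geometric-mean length normalisation (mentioned in the text but not
        # shown in Equation (10)); one plausible reading is used here.
        prod = prod ** (1.0 / max(len(A), 1))
        total += p_W_given_ce(W, c_e) * prod
    return total
```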
For the two mono-lingual tasks (French and Spanish) questions were passed as-is to the appropriate mono-lingual system after minimal query normalisation and upper-casing of all question terms. For the cross-lingual tasks questions were first translated into the target language using web-based text translation tools: Altavista's Babelfish [5] for French-Spanish and Google Translate [6] for all other combinations. The translated question was then normalised and upper-cased and passed to the appropriate mono-lingual system².

² One alternative would have been to use the mono-lingual system of the source language to obtain answers and then translate its answers into the target language. A combination of these two approaches could also have been used to try to minimise the effects of poor automatic translation performance.

For each question input to our QA system the question was passed to Google after removing stop words, and the (up to) top 500 documents were downloaded for each question. For answering a specific question only that question's downloaded data was used. Document processing involved the removal of any document markup, conversion to UTF-8, and the same language-specific normalisation and upper-casing as applied to the questions.

For answering questions in a given language the corpus in which answers were to be located was not used. However, once a set of answers to a question had been obtained, the final step was to project the answers on to the appropriate corpus. Due to a lack of time and resources for a full development of the projection system we relied on using Lucene [4] to determine the document which had the highest ranking when the answer and question were used as a Boolean query. Snippets were likewise determined using Lucene's summary function. If an answer could be found somewhere in the document collection the snippets were further filtered to ensure that the snippet always included the answer string (though possibly none of the question words).

Due to time limitations we chose to implement only the French and Spanish systems using model ONE, given by Equation (10). Although we have implementations for English using both models ONE and TWO, for consistency we used the English system that implemented model ONE only. In the TREC2005 QA evaluation model TWO outperformed model ONE by approximately 50% relative; our aim is therefore to implement model TWO for the Spanish real-time task and in other languages in time for future CLEF evaluations.

4 Results

All tasks were composed of a set of 190 factoid and definition questions and 10 list questions. For all tasks up to 10 answers were given by our QA systems to all questions. In general the top 3 answers for the factoid/definition questions were assessed, and all list answers were assessed for exactness and support. In Table 1 a breakdown is given of the results obtained on the two mono-lingual (French and Spanish) tasks for factoid/definition questions with answers in first place and for all answers to list questions.

                 Factoid/definition questions              List questions
Task    Right        ineXact   Unsupp.   CWS      Right   ineXact   Unsupp.   P@N
S-S     26 (13.7%)   1         29        0.035    3       0         0         0.03
F-F     27 (14.2%)   12        12        0.142    9       2         0         0.09

Table 1: Breakdown of performance on the French and Spanish mono-lingual tasks by type of question and assessment of answer, where Right means exactly correct and supported.
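Returning to the projection step described in Section 3, the sketch below captures the logic as described: the answer string and the question terms form a Boolean query over the collection, the highest-ranking document is taken as support, and a snippet is kept only if it contains the answer string (it need not contain any question word). `search_collection`, `make_snippet` and the `doc_id` field are stand-ins for the Lucene index search and summary functionality; their signatures are assumptions made for this sketch.

```python
def project_answer(answer, question, search_collection, make_snippet):
    """Project a web-derived answer onto the closed-domain collection (a sketch).

    search_collection(terms) is assumed to return collection documents ranked
    against a Boolean query; make_snippet(doc, terms) to return a short summary
    passage. Both stand in for the Lucene calls used in the actual system.
    """
    # Boolean query built from the answer string plus the question terms.
    terms = [answer] + question.split()
    ranked_docs = search_collection(terms)
    if not ranked_docs:
        return None, None  # no supporting document found

    # Keep the first snippet that actually contains the answer string
    # (it may contain none of the question words).
    for doc in ranked_docs:
        snippet = make_snippet(doc, terms)
        if snippet and answer.lower() in snippet.lower():
            return doc.doc_id, snippet

    # Otherwise fall back to the highest-ranking document, without a
    # guaranteed answer-bearing snippet.
    return ranked_docs[0].doc_id, make_snippet(ranked_docs[0], terms)
```

As discussed in Section 5, this projection proved lossy: many answers that were correct on the web could not be matched to a supporting snippet in the closed-domain collection.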
A similar breakdown for the five³ cross-lingual tasks is given in Table 2 for factoid/definition questions with answers in first place and for all answers to list questions.

³ Note that the Spanish-French cross-lingual task was not run in CLEF2006.

                 Factoid/definition questions              List questions
Task    Right        ineXact   Unsupp.   CWS      Right   ineXact   Unsupp.   P@N
E-F     19 (10.0%)   6         8         0.017    4       1         1         0.06
E-S     11 (5.8%)    0         10        0.005    1       0         0         0.01
F-E     7 (3.7%)     10        37        0.003    1       3         15        0.01
S-E     10 (5.3%)    11        34        0.008    0       1         18        0.00
F-S     22 (11.6%)   0         15        0.037    2       0         0         0.02

Table 2: Breakdown of performance on the English, French and Spanish cross-lingual combinations by type of question and assessment of answer, where Right means exactly correct and supported.

5 Discussion

There were four main factors in our submissions to CLEF2006 that were expected to have a large impact on performance: (1) the mis-match in time period between the document collection and the web documents used for answering questions; (2) the use of factoid QA systems to also answer definition and list questions; (3) the effect of the machine translation tools for the cross-language tasks; and (4) the projection method of mapping answers back on to the appropriate document collection.

Since all our QA systems relied on web data to answer questions for all languages, there was an inevitable mis-match between the time period of the documents used for answering questions and the time-frame that was meant to be used, i.e. 1994-1995. Although web documents exist which cover the same period, web search engines typically return more recent documents. However, it turned out that this was not a major problem, although there were inconsistencies for questions such as "¿Quién es el presidente de Letonia?"/"Qui est le président de la Létonie⁴?"/"Who is the president of Latvia?" and "¿Quién es el secretario general de la Interpol?"/"Qui est le secrétaire général d'Interpol?"/"Who is the secretary general of Interpol?", the answers to which have changed in the intervening period.

⁴ "Létonie" should have been written as "Lettonie" in the French mono-lingual test set; it was, however, written correctly in the French-Spanish and French-English test sets.

It was observed during our participation in the TREC2005 evaluations that simply using a factoid QA system to output the top so many answers for list questions was not a very promising approach, even when list questions were used for training. Part of the problem was due to a paucity of list question training examples compared to the number of factoid questions available. Another problem lay in how to determine the threshold for outputting answers: whether simply to output a fixed number of answers each time, or to base it on some function of the answer score. In the CLEF2006 evaluations the problem was further compounded by not knowing in advance which questions would be factoid, definition and list questions. We therefore decided to assume all questions were factoids and output ten answers in all cases. Our poor performance on all list questions for all tasks can be attributed mostly to there being very few list question examples in our training data and very few list question features (such as plurals) used in the filter model. As a consequence, answer typing for list questions was not very effective. For definition questions the independence assumptions made by model ONE result in very poor answer typing unless an answer can be defined in one or two words.
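To make the two answer-output strategies contrasted above for list questions concrete, the sketch below shows both a fixed-N cut-off (the strategy actually used in our runs, with N = 10) and an illustrative score-based cut-off. The relative-threshold value in the second function is purely an assumption for illustration, not a tuned system parameter.

```python
def output_fixed_n(ranked_answers, n=10):
    """Strategy used in the submitted runs: always output the top N answers."""
    return ranked_answers[:n]

def output_by_score(ranked_answers, scores, rel_threshold=0.5):
    """Illustrative alternative: keep answers scoring within a fixed fraction
    of the best score. The threshold value here is an assumption."""
    if not ranked_answers:
        return []
    best = max(scores)
    return [a for a, s in zip(ranked_answers, scores) if s >= rel_threshold * best]
```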
The substantially lower answer accuracies (between 3.7% and 12.0%) obtained on the cross-language tasks, where Babelfish and Google Translate were used for question translation, were generally expected given the well-documented quality of such translation tools. The highest cross-lingual result, obtained for French-Spanish, was deemed unlikely to be due to the use of Babelfish rather than Google Translate; it is instead attributed more to the relative similarity of the two languages (see Section 5.1). In any case, further improvements in machine-translation techniques will almost certainly result in considerable improvements in our cross-language QA performance and in multi-language combination experiments.

We were far more surprised and disappointed by the loss incurred by our projection method, which reduced our set of correct answers by 47% and 31% on the Spanish and French mono-lingual tasks, respectively. If we were to ignore the need for correct support information the performance would increase to 29% and 21% correct on the Spanish and French mono-lingual tasks, respectively. In the worst case, on the French-English task, we lost 84% of our otherwise correct answers; equivalently, we would have obtained an exact answer accuracy of 23% if the support requirement were ignored.

Our previous experience with projection onto the AQUAINT document collection for English-language answers on the TREC2005 QA task, using the algorithm included in the open-source Aranea system [8], had shown fairly consistent losses of around 20%. While the algorithm that we applied in CLEF2006 was far simpler than that employed by Aranea, it did have access to the full document collection for finding documents containing answers, whereas for TREC we relied only on the (up to) top 1000 documents supplied by NIST that were obtained using the PRISE retrieval system. This avoided errors caused by retrieving documents selected using only question features; however, the increased recall of documents containing the answers might have been offset by lower precision.

The Spanish system, like the French system, was developed in a very short period of time. Making further refinements, increasing the amount of training data used, and implementing model TWO are expected to bring accuracy into line with the English system. The possible advantages of applying a refined cross-language approach to mono-lingual tasks, e.g. using the combined results of multiple mono-lingual systems to answer questions in a particular language, are also being investigated. This would provide a means of further exploiting the redundancy of the web, as well as a method to improve results for languages which are still under-represented on the web.

In the next two sections we present brief language-specific discussions of the results that were obtained.

5.1 Spanish

As indicated in Tables 1 and 2, results for the mono-lingual Spanish task were considerably better than those obtained for the cross-lingual tasks (English-Spanish, French-Spanish). The discrepancy between the results for the mono-lingual and cross-language Spanish test sets can be almost entirely explained in terms of the relative accuracy of the automatic translation tools used as an intermediate step in obtaining results for the latter.
Furthermore, the difference between the results for the French-Spanish task and those for the English-Spanish task is almost certainly due to the relative closeness of the language pair, with Spanish and French both being members of the Romance family of languages, rather than to the use of different automatic translation tools. These differences aside, results for all three Spanish tasks exhibited similar characteristics.

The results on all three tasks that included Spanish as a source or target language were by far the best for factoid questions, especially those whose answers could be categorised as names or dates. Of the 26 exactly correct and supported answers obtained for the Spanish mono-lingual task, a total of 20 consisted of proper names or dates (11 dates, 9 proper names). If the 29 correct but unsupported answers are also taken into account, this total rises to 41, which accounts for approximately 75% of all correct answers obtained for this particular task.

Definition questions, in addition to being ambiguous in the evaluation sense, are much more difficult than factoids for our QA system to answer. Yet, despite treating all questions as inherently factoid, some interesting results were obtained for the Spanish mono-lingual task. In particular, these results included 2 exactly correct and supported answers, and 8 correct but unsupported answers, in the definition category. A cursory analysis of the data revealed that each of these correct answers could be construed as the result of a categorisation process, whereby the subject of the question had been classified into a larger category, and this category was then returned as the answer, e.g. "¿Qué es la Quinua?" ("What is Quinua?"), answer: [a] cereal, and "¿Quién fue Alexander Graham Bell?" ("Who was Alexander Graham Bell?"), answer: [an] inventor. The system gives these 'category' words high scores because they often appear in the context of proper nouns, where they are used as definitions or as noun qualifiers. However, because the system uses no explicit linguistic typology or categories, this results in occasional mismatches such as: "¿Quién es Nick Leeson?" ("Who is Nick Leeson?"), answer: Barings. This answer would be categorised as a retrieval error since 'Barings' is a valid answer type for a Who-question but its high co-occurrence with the subject of the question results in an overly high retrieval model score.

5.2 French

For the French mono-lingual task, unsupported answers were not as much of an issue as for the Spanish mono-lingual task, although there were still 12 unsupported answers for the factoid questions, 10 or 11 of which would otherwise have been exact. For the English-French task, there were 8 unsupported answers for factoid questions, almost all of which were also exact. Projection onto French documents, however imperfect, seems to have been less of a problem than for the other languages, though it is unlikely that the differences are significant.

Out of our 27 correct and supported answers on the mono-lingual task, 23 were places, dates or names. For those types of questions the answer types that were returned were usually correct. Questions involving numbers, however, were a serious problem: out of 15 "How many..." questions we got only one correct, and the answer types that were given were not consistent, instead being dates, numbers, names or nouns. The same observation holds for "How old was X when..." questions, which were all answered incorrectly, with varying answer types for the answers given.
With a rule-based or regular-expression-based system it is difficult to make such errors. However, with our probabilistic approach, in which no hard decisions are made and all types of answers are valid but with varying probabilities, it is entirely possible to incur such filter model errors. Although some cases could be trivially remedied with a simple regular expression, this is against our philosophy; instead we feel the problem should be solved through better parameter estimation and better modelling of the training data, rather than through ad-hoc heuristics.

Another interesting observation on the mono-lingual task was that for 19 questions where the first answer was inexact, wrong, or unsupported, we got an exact and supported answer in second place. For answers in third place the number of exact and supported answers was only 3. In most of these cases, the answer types were the same. This is untypical of our results obtained previously on English and Japanese, where there is typically a significant drop in the number of correct answers at each increase in the rank of answers considered.

As with Spanish, automatic translation of the questions into other languages was far from perfect. One common problem was words with several meanings that were translated into French with a correct but contextually wrong sense, thus radically changing the meaning of keywords in the question. For example, in the English question "In which settlement was a mass slaughter of Muslims committed in 1995?", "settlement" was translated as "règlement". Consequently, answers given for this question related to the French legal system rather than to a location. Moreover, it was apparent that Google Translate was far from optimal for translating questions, presumably because source sentences are expected to be in the affirmative. Thus, "What is..." and "Which is..." became "Ce qui...", which our QA system tended to interpret as "Qui...", thus favouring a person or company as the answer type. Similarly, "How old..." often became "Comment vieux..." rather than "Quel âge..." and so was answered as if it were a regular "How..." question.

6 Conclusion

With the results obtained in the CLEF2006 QA evaluation we feel we have proven the language independence and generality of our statistical, data-driven approach. Comparable performance using model ONE has been obtained under evaluation conditions on the three languages of English, French and Spanish, in both this evaluation and TREC2005. In addition, post-evaluation experimentation with Japanese [16] has confirmed the efficacy of the approach for an Asian language as well. While the absolute performance of our QA systems falls short of that obtained by state-of-the-art linguistics-based systems, both the French and Spanish systems were developed only over the two months prior to the evaluation and use an absolute minimum of linguistic knowledge to answer questions, relying instead on the redundancy of the web.

Further work will concentrate on how to answer questions using less redundant data through data-driven query expansion methods, and will also look at removing the independence assumptions made in the formulation of the filter model to improve question and answer typing accuracy. We expect that improvements made on language-specific systems will feed through to improvements in all systems, and we hope to be able to compete in more and different language combinations in future CLEF evaluations.
7 Online demonstration

A demonstration of the system using model ONE, supporting questions in English, Japanese, Chinese, Russian, French, Spanish and Swedish, can be found online at http://www.inferret.com

8 Acknowledgements

This research was supported by the Japanese government's 21st century COE programme: "Framework for Systematization and Application of Large-scale Knowledge Resources".

References

[1] A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.

[2] E. Brill, S. Dumais, and M. Banko. An Analysis of the AskMSR Question-answering System. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002.

[3] A. Echihabi and D. Marcu. A Noisy-Channel Approach to Question Answering. In Proceedings of the 41st Annual Meeting of the ACL, 2003.

[4] O. Gospodnetic and E. Hatcher. Lucene in Action. Manning, 2005.

[5] http://babelfish.altavista.com.

[6] http://translate.google.com.

[7] A. Ittycheriah and S. Roukos. IBM's Statistical Question Answering System - TREC-11. In Proceedings of the TREC 2002 Conference, 2002.

[8] J. Lin and B. Katz. Question Answering from the Web Using Knowledge Annotation and Knowledge Mining Techniques. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM 2003), 2003.

[9] C. Pinchak and D. Lin. A Probabilistic Answer Type Model. In Proceedings of the European Chapter of the ACL (EACL), Trento, Italy, 2006.

[10] J. Ponte and W. Croft. A Language Modeling Approach to Information Retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.

[11] D. Radev, W. Fan, H. Qi, H. Wu, and A. Grewal. Probabilistic Question Answering on the Web. In Proceedings of the 11th International Conference on the World Wide Web (WWW), Hawaii, US, 2002.

[12] D. Ravichandran, E. Hovy, and F. J. Och. Statistical QA - Classifier vs. Re-ranker: What's the Difference? In Proceedings of the ACL Workshop on Multilingual Summarization and Question Answering, 2003.

[13] R. Soricut and E. Brill. Automatic Question Answering: Beyond the Factoid. In Proceedings of the HLT/NAACL 2004 Main Conference, 2004.

[14] E. Whittaker, P. Chatain, S. Furui, and D. Klakow. TREC2005 Question Answering Experiments at Tokyo Institute of Technology. In Proceedings of the 14th Text Retrieval Conference, 2005.

[15] E. Whittaker, S. Furui, and D. Klakow. A Statistical Pattern Recognition Approach to Question Answering using Web Data. In Proceedings of Cyberworlds, 2005.

[16] E. Whittaker, J. Hamonic, and S. Furui. A Unified Approach to Japanese and English Question Answering. In Proceedings of NTCIR-5, 2005.

[17] E. Whittaker, J. Hamonic, D. Yang, T. Klingberg, and S. Furui. Monolingual Web-based Factoid Question Answering in Chinese, Swedish, English and Japanese. In Workshop on Multi-lingual Question Answering (EACL), Trento, Italy, 2006.