<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF2007 Question Answering Experiments at Tokyo Institute of Technology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E.W.D. Whittaker</string-name>
          <email>edw@furui.cs.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J.R. Novak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Heie</string-name>
          <email>heie@furui.cs.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Furui</string-name>
          <email>furui@furui.cs.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, Tokyo Institute of Technology</institution>
          ,
          <addr-line>2-12-1, Ookayama, Meguro-ku, Tokyo 152-8552</addr-line>
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe the experiments carried out at Tokyo Institute of Technology for the CLEF 2007 QAst (Question Answering in speech transcripts) pilot task, as well as our results from the official evaluation. We apply a non-linguistic, data-driven approach to Question Answering (QA), based on a noisy channel model. The system we use for the QAst evaluation comprises an Information Retrieval (IR) module, which uses an LM-based approach to sentence retrieval, and an Answer Extraction (AE) module, which identifies and ranks the exact answer candidates in the retrieved sentences. Our team participated in the CLEF 2007 QAst pilot track, task T1: QA in manual transcriptions of lectures, and task T2: QA in automatic transcriptions of lectures. In the official evaluation our system achieved a best run MRR of 0.20 and a top1 score of 0.14 on task T1, and a best run MRR of 0.12 and a top1 score of 0.08 on task T2, placing us 3rd in a field of 5 teams that submitted results for these tasks. All experiments and evaluations described in this paper were conducted using the CHIL corpus (transcriptions of lectures), which was supplied to all track participants by the QAst track coordinators. ASR lattices were also provided by LIMSI; however, we did not use these during the official evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Question answering</kwd>
        <kwd>Language modeling</kwd>
        <kwd>Speech recognition</kwd>
        <kwd>Spoken document retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper we explain our experimental setup and general approach to automatic Question
Answering (QA), and report our official evaluation results for the CLEF 2007 QAst (Question
Answering in speech transcripts) pilot track. We employed an entirely data-driven, non-linguistic
and largely language independent QA framework for the QAst track, which was similar but not
identical to that which we used in previous QA evaluations such as TREC 2006, CLEF 2006,
NTCIR 2006, etc. This approach, which is detailed in [11, 12, 13], centers on a noisy-channel
model of the QA problem and generally speaking relies on the redundancy of answer data in the
target corpus in order to identify and extract correct answers.</p>
      <p>Our QAst system comprises two major components, an Information Retrieval (IR) module used
to identify and retrieve relevant sentences from a large corpus, and an Answer Extraction module
which is used to identify and rank exact answers in the sentences returned by the IR module.
Our approach, which is data-driven and does not require human-guided interaction except for the
development of a short list of frequent stop words and common question words, makes it possible
to rapidly develop new systems for a wide variety of different languages. Furthermore, performance is roughly comparable even across very disparate languages such as English and Japanese,
and developers need not have more than a perfunctory acquaintance with the language [6, 14] in
order to build and deploy a new system.</p>
      <p>Our data-driven approach differs substantially from conventional rule-based approaches, yet it
does share certain features with other approaches in the literature [1, 2, 3, 4, 8, 9, 10]. Systems
which employ similar answer-typing approaches have lately begun to appear [7]; however, most of these systems still utilize some form of specific linguistic knowledge, in contrast to our all-data-driven, non-linguistic classification approach. Although our approach requires that a small
number of parameters be optimized to minimize the effects of data sparsity, these parameters
are all determined at system initialization time and are invariant across different questions. This
means that new data or system settings can be applied without the need for wearisome model
re-training.</p>
      <p>Due to its data-driven nature our QA system performs best when there are numerous redundant
sentences containing the correct answer and question words. This reliance on data redundancy to
help identify correct answers has seldom been a source of difficulty in past evaluations, however the
QAst pilot track presented a unique challenge due to the relatively small size of the CHIL lectures
target corpus. In other closed domain evaluations with medium-sized corpora we have opted to
utilize web data, however this did not seem entirely appropriate for the QAst track due to the
spoken nature of the data and very small corpus size. In part to help combat the resulting data
sparsity, we employed a new language-modeling based sentence retrieval IR module as a precursor
to the Answer Extraction (AE) stage. This sentence retrieval module acts as an intermediate filter
and helps to eliminate noise usually contained in the larger original documents.</p>
      <p>The rest of the paper is structured as follows. Section 2 describes our QA architecture in detail, Section 3 describes our experimental setup, Section 4 presents our results, and Section 5 gives a brief discussion of the results. Finally, Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>QA Architecture for QAst</title>
      <p>The answer to a question depends primarily on the question itself but also on many other factors
such as the identity and location of the questioner, previous questions, social context and so on.
Although such factors are clearly relevant in many situations, they are difficult to model and also
to test. In our approach to QA we therefore limit ourselves to modeling the most straightforward
dependence, the probability of an answer A given the question Q. In the system used for the
QAst evaluation, we divide the work of identifying answers between two major modules, the
Information Retrieval (IR) module which employs an LM-based approach to sentence retrieval,
and the Answer Extraction (AE) module. We briefly describe the IR module, the AE module and
the Query Expansion process below.</p>
      <sec id="sec-2-1">
        <title>Information Retrieval module</title>
        <p>The general approach to IR for QA is to treat the question as a standard search query, but discard
question-type words such as “what”, “when”, “who”, etc., and possibly also a set of stop words.
We employ a language modeling approach to this problem where an individual LM is estimated for
each document. The documents are then ranked according to the conditional probability P (Q|D),
the probability of generating the query Q given the document D.</p>
        <p>In our system we employ a sentence-based retrieval approach similar to that described in [5], where each document comprises only one sentence. Due to the lack of data for training the sentence-specific LMs, all words are treated as independent and a unigram model is applied,</p>
        <p>P(Q|S) = \prod_{i=1}^{|Q|} P(q_i|S), (1)</p>
        <p>where q_i is the i-th query term in the query Q = (q_1 ... q_{|Q|}) composed of |Q| query terms. Throughout this paper we calculate the probability of a query term q given a sentence S in three different ways: P1(q|S), P2(q|S) and P3(q|S), as explained below.</p>
        <p>We use absolute discounting in order to smooth the otherwise sparse LMs, where the probability of a query term q given a sentence S is calculated as:</p>
        <p>P_1(q|S) = \frac{\max\{tf(q,S) - \delta, 0\}}{l(S)} + \frac{\delta \cdot h(S,\delta)}{l(S)} \cdot P(q|B), (2)</p>
        <p>where tf(q,S) is the term frequency of q in S, l(S) is the length (number of words) of S, δ is the discount parameter, h(S,δ) is the count of how many unique words in S have a term frequency higher than δ, and P(q|B) is the unigram probability of the query term q according to the background collection model. Note that if δ &lt; 1 then h(S,δ) is equal to the number of unique words in S.</p>
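        <p>As a rough illustration of how Eqs. (1) and (2) might be computed, the following Python sketch scores toy sentences with an absolute-discounting sentence LM. The function and variable names are our own, and the background model is simply the unigram distribution over the whole toy collection; this is a sketch of the technique, not the system's actual implementation.</p>
        <preformat>
from collections import Counter

def p1(q, sentence, background, delta=0.5):
    """Absolute-discounting estimate P1(q|S) of Eq. (2)."""
    tf = Counter(sentence)
    l_s = len(sentence)
    # h(S, delta): number of unique words in S with term frequency above delta
    # (equal to the number of unique words when delta is below 1)
    h = sum(1 for c in tf.values() if c > delta)
    discounted = max(tf[q] - delta, 0.0) / l_s
    backoff = delta * h / l_s * background.get(q, 1e-9)
    return discounted + backoff

def score(query, sentence, background, delta=0.5):
    """P(Q|S) of Eq. (1): product of unigram term probabilities."""
    p = 1.0
    for q in query:
        p *= p1(q, sentence, background, delta)
    return p

# Toy usage: rank sentences for a (stop-word-filtered) query.
sentences = [["he", "is", "married", "to", "katie", "holmes"],
             ["the", "lecture", "starts", "at", "nine"]]
all_words = [w for s in sentences for w in s]
background = {w: c / len(all_words) for w, c in Counter(all_words).items()}
query = ["married", "katie"]
ranked = sorted(sentences, key=lambda s: score(query, s, background), reverse=True)
</preformat>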
        <p>A problem with the model presented in [5] is that words relevant to the sentence might not occur in the sentence itself, but in the surrounding text. For example, for the question “Who is Tom Cruise married to?”, the sentence “He is married to Katie Holmes” in an article about Tom Cruise should ideally be assigned a high probability, despite the sentence missing the words “Tom” and “Cruise”. To account for this, we train document LMs, P1(q|D), in the same manner as for P1(q|S) in Eq. (2), and perform a linear interpolation between P1(q|S) and P1(q|D):</p>
        <p>P_2(q|S) = (1 - \alpha) \cdot P_1(q|S) + \alpha \cdot P_1(q|D), (3)</p>
        <p>where 0 ≤ α ≤ 1 is an interpolation parameter.</p>
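        <p>A minimal sketch of the interpolation in Eq. (3), reusing the p1 function from the sketch above and assuming the enclosing document has likewise been reduced to a bag of words; the default value of alpha here is arbitrary, whereas in the system the weights were optimized on the development set:</p>
        <preformat>
def p2(q, sentence, document, background, alpha=0.3, delta=0.5):
    """Eq. (3): interpolate the sentence LM with the LM of its whole document."""
    return ((1 - alpha) * p1(q, sentence, background, delta)
            + alpha * p1(q, document, background, delta))
</preformat>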
      </sec>
      <sec id="sec-2-2">
        <title>Query expansion</title>
        <p>In order to help further improve QA performance we experiment with a global query expansion method in which words are grouped beforehand into a set C = {c_1 ... c_{|C|}} of |C| overlapping classes, and we calculate the unigram class model probability of a query term q given a sentence S as follows:</p>
        <p>P_C(q|S) = \sum_{j=1}^{|C|} P(q|c_j) \cdot P(c_j|S), (4)</p>
        <p>where P(q|c_j) = 1/|c_j| if q ∈ c_j, else P(q|c_j) = 0, and |c_j| is the number of words in c_j. P(c_j|S) can be re-written as a sum over the |V| words in the vocabulary V = {w_1 ... w_{|V|}}:</p>
        <p>P(c_j|S) = \sum_{k=1}^{|V|} P(c_j|w_k) \cdot P(w_k|S), (5)</p>
        <p>where P(c_j|w_k) = 1/N(w_k, C) if w_k ∈ c_j, else P(c_j|w_k) = 0, N(w_k, C) is the number of classes in C in which w_k occurs, and P(w_k|S) is the unigram probability of the word w_k given the sentence S.</p>
        <p>The word LM in Eq. (2) and the class LM in Eq. (4) are combined using linear interpolation:</p>
        <p>P_{int}(q|S) = (1 - \beta) \cdot P_1(q|S) + \beta \cdot P_C(q|S), (6)</p>
        <p>where 0 ≤ β ≤ 1 is an interpolation parameter. Pint(q|D) is calculated in a similar manner, and Eq. (3) is then adjusted to give P3(q|S) as follows:</p>
        <p>P_3(q|S) = (1 - \gamma) \cdot P_{int}(q|S) + \gamma \cdot P_{int}(q|D), (7)</p>
        <p>where 0 ≤ γ ≤ 1 is an interpolation parameter. For all QAst evaluation runs, either P2 or P3 was used.</p>
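        <p>The class model of Eqs. (4)-(7) can be sketched as follows, assuming the overlapping classes are available as plain sets of words. The interpolation weights shown are placeholders (in our system they were tuned on the development set), and the helper names are our own rather than part of the actual implementation.</p>
        <preformat>
def p_class(q, sentence, classes):
    """Eqs. (4)-(5): class model P_C(q|S) over overlapping word classes."""
    total = 0.0
    for c in classes:
        if q not in c:
            continue                        # P(q|c_j) = 0 unless q is in c_j
        p_q_given_c = 1.0 / len(c)          # P(q|c_j) = 1/|c_j|
        p_c_given_s = 0.0
        for w in set(sentence):             # only words w_k with P(w_k|S) above zero matter
            if w in c:                      # P(c_j|w_k) = 0 unless w_k is in c_j
                n_wk = sum(1 for d in classes if w in d)        # N(w_k, C)
                p_w_given_s = sentence.count(w) / len(sentence)
                p_c_given_s += (1.0 / n_wk) * p_w_given_s
        total += p_q_given_c * p_c_given_s
    return total

def p3(q, sentence, document, classes, background,
       beta=0.2, gamma=0.3, delta=0.5):
    """Eqs. (6)-(7): combine word and class LMs, then sentence and document LMs."""
    p_int_s = ((1 - beta) * p1(q, sentence, background, delta)
               + beta * p_class(q, sentence, classes))
    p_int_d = ((1 - beta) * p1(q, document, background, delta)
               + beta * p_class(q, document, classes))
    return (1 - gamma) * p_int_s + gamma * p_int_d
</preformat>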
      </sec>
      <sec id="sec-2-3">
        <title>Answer Extraction</title>
        <p>The AE module models the probability of an answer A given a question Q as:</p>
        <p>P(A|Q) = P(A|W, X), (8)</p>
        <p>where W is a set of features describing the question-type part of Q, such as “when”, “why”, “how”, etc., while X is a set of features describing the information-bearing part of Q, i.e. what the question is about and what it refers to. For example, in the questions “Where was Tom Cruise married?” and “When was Tom Cruise married?”, the information-bearing parts are identical while the question-type parts differ. Finding the best answer Â involves a search over all A for the one which maximizes the probability of the above model:</p>
        <p>\hat{A} = \arg\max_A P(A|W, X). (9)</p>
        <p>Using Bayes’ rule and making various conditional independence and uniform prior distribution assumptions, Eq. (9) can be rearranged to give:</p>
        <p>\hat{A} \approx \arg\max_A P(A|X) \cdot P(W|A), (10)</p>
        <p>where P(A|X) is termed the answer retrieval model and P(W|A) the answer filter model. P(A|X) essentially models the proximity of A to features in X, while P(W|A) can be viewed as a LM that models the probability of the question-type features W given a candidate answer A.</p>
        <p>We do not examine the answer retrieval model and the answer filter model further here; see [15] for details.</p>
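        <p>The decomposition in Eq. (10) amounts to re-ranking candidate answer strings drawn from the retrieved sentences by the product of a retrieval score and a type-filter score. The sketch below is purely illustrative: the proximity heuristic and the hard answer-type lookup are crude stand-ins for the actual answer retrieval and answer filter models described in [15].</p>
        <preformat>
def retrieval_score(candidate_position, info_word_positions):
    """Stand-in for P(A|X): reward candidates close to information-bearing words."""
    if not info_word_positions:
        return 1e-9
    nearest = min(abs(candidate_position - p) for p in info_word_positions)
    return 1.0 / (1.0 + nearest)

def filter_score(candidate, question_type, answer_type_of):
    """Stand-in for P(W|A): high score when the candidate has the expected answer type."""
    return 0.9 if answer_type_of.get(candidate) == question_type else 0.1

def rank_answers(candidates, question_type, answer_type_of):
    """Rank (candidate, position, info_positions) tuples by the product in Eq. (10)."""
    scored = [(c, retrieval_score(pos, info) * filter_score(c, question_type, answer_type_of))
              for c, pos, info in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
</preformat>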
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup for QAst</title>
      <p>We participated in task T1: QA in manual transcriptions of lectures, and task T2: QA in automatic
transcriptions of lectures. For the official evaluation we used the data released for the QAst
evaluation task T1 and task T2. This data comprised a development set and an evaluation set
with characteristics described in Table 1. The development set consisted of manual transcripts
(MAN) and ASR-based transcripts (ASR) for 10 lectures, a set of questions, and a set of answers
for each transcript set. The evaluation set consisted of MAN and ASR for 15 lectures, and a set
of 100 questions. The development and evaluation data did not overlap. All questions were of one
of the following answer types: person, location, organization, language, system/method, measure,
time, color, shape, and material. Word lattices were also made available; however, after preliminary experiments with the development data revealed minor inconsistencies between the lattices and the ASR transcripts, we chose not to use any of the lattices in the actual evaluation. No audio was provided.</p>
      <p>We cleaned the data by automatically removing fillers and pauses, and performed simple text processing of abbreviations and numerical expressions using Perl's Lingua CPAN module to ensure consistency between ASR, MAN, questions and answers. ASR documents were sentence-segmented according to the sentence boundaries provided, and MAN was sentence-segmented using an in-house segmenter developed by one of the authors. Our system is not able to identify whether the answer to a question can be found in the corpus; we therefore chose never to return a “nil” response for any question.</p>
      <p>[Table 1: characteristics of the development (Dev. Set) and evaluation (Eval. Set) data sets.]</p>
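      <p>For illustration only, the kind of filler removal and number normalization performed can be approximated as below. The actual system used Perl's Lingua CPAN module, whereas this Python sketch uses a deliberately tiny, hypothetical filler list and normalization table.</p>
      <preformat>
import re

FILLERS = {"uh", "um", "er", "mm"}                  # assumed filler inventory
NUMBERS = {"one": "1", "two": "2", "nine": "9"}     # tiny illustrative table only

def clean(transcript):
    tokens = []
    for tok in transcript.lower().split():
        tok = re.sub(r"[^a-z0-9'-]", "", tok)       # drop stray punctuation and markers
        if not tok or tok in FILLERS:
            continue
        tokens.append(NUMBERS.get(tok, tok))
    return tokens

print(clean("uh the lecture um starts at nine"))    # ['the', 'lecture', 'starts', 'at', '9']
</preformat>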
      <p>For retrieval purposes we filtered out question-type words and stop words (in total 28 words)
from the questions. Using the remaining words as query terms, we ranked sentences according to
either P2(q|S) or P3(q|S), depending on the run. We optimized weights on the development set
and used these weights for the official evaluation.</p>
      <p>Classes for query expansion were generated based on the overlap in features, computed for each word in the vocabulary from a large text corpus using standard mutual information techniques.</p>
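      <p>The paper does not spell out the clustering procedure, but one way such overlapping classes could be derived from mutual-information features is sketched below: each word is represented by its highest-PMI co-occurring words, and words whose feature sets overlap sufficiently are grouped together. This is an assumption-laden sketch, not the procedure actually used.</p>
      <preformat>
import math
from collections import Counter, defaultdict

def pmi_features(sentences, top_k=20):
    """Represent each word by its top_k co-occurring words, ranked by a rough PMI estimate."""
    word_count, pair_count = Counter(), Counter()
    for s in sentences:
        word_count.update(s)
        for i, w in enumerate(s):
            for v in s[:i] + s[i + 1:]:
                pair_count[(w, v)] += 1
    total = sum(word_count.values())
    feats = defaultdict(dict)
    for (w, v), c in pair_count.items():
        feats[w][v] = math.log(c * total / (word_count[w] * word_count[v]))  # unnormalized PMI
    return {w: set(sorted(f, key=f.get, reverse=True)[:top_k]) for w, f in feats.items()}

def build_classes(features, min_overlap=5):
    """Group each word with all words whose feature sets overlap it; classes may overlap."""
    return [{w}.union(v for v, fv in features.items()
                      if len(fw.intersection(fv)) >= min_overlap)
            for w, fw in features.items()]
</preformat>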
    </sec>
    <sec id="sec-4">
      <title>QAst Evaluation Results</title>
      <p>
        Question sets for both task T1 and task T2 comprised the same 100 factoid questions; however, 2 of these questions were deemed faulty by the coordinators following submission and were removed prior to assessment, resulting in a total of 98 evaluation questions. Our system returned a maximum of 5 answer candidates per question per run. We submitted two runs each for task
T1 and task T2. For both tasks, P2(q|S) was used for the first run and P3(q|S) was used for the
second run. In addition to our group, four other teams participated. Table 2 details the official
best-run results for the entire field for task T1.
      </p>
      <p>[Table 2: official best-run results for task T1 by team (clt1, dfki1, limsi2, tokyo2, upc1).]</p>
      <p>As can be seen in Table 2, our system achieved a best run MRR of 0.20 and was able to correctly answer 34 of 98 questions on the manual data set, placing us third overall. Results for the ASR transcripts were lower, as expected, at 18 correct answers for 98 questions; however, other systems showed similar losses on the ASR data. Table 3 shows a comparison of our group's manual versus ASR results by submission. P2(q|S) was used for runs tokyo1 t1 and tokyo1 t2, while P3(q|S) was used for runs tokyo2 t1 and tokyo2 t2. As can be seen, the query expansion employed by P3(q|S) slightly improved our Top5 scores, but had no effect on Top1 accuracy. There was a performance drop of approximately 44% for results based on the top 5 answers using P3(q|S), and a drop of approximately 43% for results based on the top 1 answer for both P2(q|S) and P3(q|S). Similar drops were reflected in other participants' results, however, and we suspect that this primarily reflects ASR errors.</p>
      <p>[Table 3: manual versus ASR results by submission: Top5(P2), Top5(P3), Top1(P2), Top1(P3), and MRR.]</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Analysis</title>
      <p>Our results from task T1 compare favorably with results from previous CLEF and TREC
evaluations, despite the size and relative lack of redundancy in the target CHIL lectures corpus.
Additional experiments on this corpus, documented in a paper currently pending publication, show that our system is able to correctly select the sentence containing the answer
over 50% of the time, indicating that there is upwards of a 20% performance loss between the
sentence retrieval and answer extraction stages.</p>
      <p>While performance across different answer types was fairly consistent, there was a conspicuous
gap for the time type, where we did not answer any of the related questions correctly. Analysis
of the data indicates that this was caused by multiple factors. There were two time questions
for which there was no appropriate answer in the document corpus. There was also a problem with automatically normalizing complex dates, which the Perl Lingua module did not handle consistently; as our system generally performs better when times and dates are represented as digits, this made it difficult to correctly extract answers such as “nineteen ninety-eight”. Finally,
there was at least one time question for which the question itself did not clearly specify the type.</p>
      <p>Finally, we observed a considerable drop in performance between task T1 and task T2, which
was similarly mirrored in all other participants’ results. We surmise that in our case this was
mainly due to answer typing issues resulting from ASR errors since answer words of the correct
answer type are crucial for good AE performance in our system. This can be explained by the way
the answer filter model (Section 2.3) works: if the answer words in ASR are of the wrong answer
type, then P (W |A) will assign a low probability to the correct answer candidate.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper we have presented our results from the CLEF 2007 QAst pilot track for task T1 and
T2, and described our system and experimental setup for the evaluation. In general our results
compare favorably with past evaluations, and place us in the middle of the field for this evaluation.
We noticed considerable performance drops between the manual transcripts and ASR transcripts,
but because these drops were consistent across submissions and participants we are led to believe
that this is mainly a result of ASR errors. In future evaluations we think it would be preferable both to supply recognition lattices that consistently match the ASR transcripts and to make the actual audio available. Given that the real aim of this track is to find answers to natural-language factoid questions in spoken documents, having access to these resources might provide greater opportunities for teams to directly exploit the source data in more interesting ways.</p>
    </sec>
    <sec id="sec-7">
      <title>Online demonstration</title>
      <p>A demonstration of the system using model ONE, supporting questions in English, Japanese, Chinese, Russian, French, Spanish and Swedish, can be found online at http://www.inferret.com/</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This research was supported in part by the Japanese government’s 21st century COE programme: “Framework for Systematization and Application of Large-scale Knowledge Resources”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mittal</surname>
          </string-name>
          .
          <article-title>Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding</article-title>
          .
          <source>In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , Athens, Greece,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Banko</surname>
          </string-name>
          .
          <article-title>An Analysis of the AskMSR Question-answering System</article-title>
          .
          <source>In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Echihabi</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcu</surname>
          </string-name>
          .
          <article-title>A Noisy-Channel Approach to Question Answering</article-title>
          .
          <source>In Proceedings of the 41st Annual Meeting of the ACL</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ittycheriah</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          .
          <article-title>IBM's Statistical Question Answering System-TREC-11</article-title>
          .
          <source>In Proceedings of the TREC 2002 Conference</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Merkel</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          .
          <article-title>Comparing Improved Language Models for Sentence Retrieval in Question Answering</article-title>
          .
          <source>In Proceedings of CLIN</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Novak</surname>
          </string-name>
          , E. Whittaker,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Imai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          .
          <article-title>NTCIR-6 CLQA Question Answering Experiments at the Tokyo Institute of Technology</article-title>
          .
          <source>In Proceedings of the NTCIR-6 Conference</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pinchak</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>A Probabilistic Answer Type Model</article-title>
          .
          <source>In European Chapter of the ACL</source>
          , Trento, Italy,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Grewal</surname>
          </string-name>
          .
          <article-title>Probabilistic Question Answering on the Web</article-title>
          .
          <source>In Proc. of the 11th international conference on WWW</source>
          , Hawaii, US,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ravichandran</surname>
          </string-name>
          , E. Hovy, and
          <string-name>
            <given-names>F. Josef</given-names>
            <surname>Och</surname>
          </string-name>
          .
          <article-title>Statistical QA - Classifier vs. Re-ranker: What's the difference?</article-title>
          <source>In Proceedings of the ACL Workshop on Multilingual Summarization and Question Answering</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          .
          <article-title>Automatic Question Answering: Beyond the Factoid</article-title>
          .
          <source>In Proceedings of the HLT/NAACL 2004: Main Conference</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chatain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          . TREC2005 Question Answering Experiments at Tokyo Institute of Technology.
          <source>In Proceedings of the 14th Text Retrieval Conference</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          .
          <article-title>A Statistical Pattern Recognition Approach to Question Answering using Web Data</article-title>
          .
          <source>In Proceedings of Cyberworlds</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamonic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          .
          <article-title>A Unified Approach to Japanese and English Question Answering</article-title>
          .
          <source>In Proceedings of NTCIR-5</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. Whittaker, J. Novak, P. Chatain, P. Dixon, M. Heie, and S. Furui. CLEF2006 Question Answering Experiments at Tokyo Institute of Technology. In CLEF 2006, LNCS 4730 proceedings, 2006.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] E. Whittaker, J. Novak, P. Chatain, and S. Furui. TREC 2006 Question Answering Experiments at Tokyo Institute of Technology. In Proceedings of TREC-15, 2006.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>