Robust Question Answering for Speech Transcripts Using Minimal Syntactic Analysis

Pere R. Comas, Jordi Turmo and Mihai Surdeanu
TALP Research Center, Technical University of Catalonia (UPC)
{pcomas,turmo,surdeanu}@lsi.upc.edu

Abstract

This paper describes the participation of the Technical University of Catalonia in the CLEF 2007 Question Answering on Speech Transcripts track. For the processing of manual transcripts we have deployed a robust factual Question Answering system that uses minimal syntactic information. For the handling of automatic transcripts we combine the QA system with a novel Passage Retrieval and Answer Extraction engine, which is based on a sequence alignment algorithm that searches for "sounds like" sequences in the document collection. We have also enriched the NERC with phonetic features to facilitate the recognition of named entities even when they are incorrectly transcribed.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]

General Terms
Experimentation, Performance, Measurement

Keywords
Question Answering, Spoken Document Retrieval, Phonetic Distance

1 Introduction

The CLEF 2007 Question Answering on Speech Transcripts (QAST) track consists of the following four tasks:

T1: Question Answering (QA) using as underlying document collection the manual transcripts of the lectures recorded within the CHIL European Union project (http://chil.server.de).

T2: QA using the automatic transcripts of the CHIL lectures. Word lattices from an automatic speech recognizer (ASR) are provided as an additional input source for systems that prefer to decide internally what the best automatic segmentation is.

T3: QA on the manual transcripts of the meetings that form the corpus collected by the AMI European Union project (http://www.amiproject.org).

T4: QA on the automatic transcripts of the above AMI meetings.

For tasks T1 and T3 we have adapted a QA system and a Named Entity Recognizer and Classifier (NERC) that we previously developed for the processing of manual speech transcripts [9, 10]. Both systems obtained good performance in previous evaluations even though they require minimal syntactic analysis of the underlying documents (only part-of-speech tagging) and minimal additional annotation (punctuation signs are optional).

For the handling of automatic transcripts (tasks T2 and T4) we implemented two significant system changes: (a) for Passage Retrieval (PR) and Answer Extraction (AE) we designed a novel keyword matching engine that relies on phonetic similarity, instead of string matching, to overcome the errors introduced by the ASR; and (b) we enriched the NERC with phonetic features to facilitate the recognition of named entities even when they are incorrectly transcribed. Even though the resulting QA system does not outperform the initial QA system in tasks T2 and T4, we believe these design choices are a good longer-term research direction because they can address ASR-specific phenomena.

The paper is organized as follows. Section 2 overviews the architecture of the QA system. Section 3 describes the NERC improvements for both manual and automatic transcripts. Section 4 details the novel keyword matching algorithm we designed for automatic transcripts. Section 5 contains the results of the empirical evaluation, and Section 6 concludes the paper.
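To make the intuition behind change (a) concrete, the following sketch matches a query keyword against a transcript word by edit distance over a crude phonetic encoding rather than by exact string comparison. It is only an illustration under assumed names and thresholds (edit_distance, phonetic_encode, sounds_like, threshold=0.3 are hypothetical); the actual algorithm used by our system is described in Section 4.

    # Illustrative sketch (not the system's actual matcher): approximate keyword
    # matching by phonetic similarity instead of exact string match, in order to
    # tolerate ASR transcription errors.

    def edit_distance(a, b):
        """Standard Levenshtein distance between two sequences."""
        m, n = len(a), len(b)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # substitution
            prev = curr
        return prev[n]

    def phonetic_encode(word):
        """Toy phonetic normalization; a real system would use phoneme
        transcriptions, e.g., from a pronouncing dictionary."""
        word = word.lower()
        for src, tgt in [("ph", "f"), ("ck", "k"), ("gh", "g")]:
            word = word.replace(src, tgt)
        out = []
        for ch in word:              # collapse doubled letters
            if not out or out[-1] != ch:
                out.append(ch)
        return "".join(out)

    def sounds_like(keyword, transcript_word, threshold=0.3):
        """Accept a match when the normalized phonetic edit distance is small."""
        a, b = phonetic_encode(keyword), phonetic_encode(transcript_word)
        if not a or not b:
            return False
        return edit_distance(a, b) / max(len(a), len(b)) <= threshold

    print(sounds_like("wireless", "whireless"))  # True: small phonetic distance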
2 Overview of the System Architecture

The architecture of our QA system follows a commonly used schema, which splits the process into three phases that are performed sequentially: Question Processing (QP), Passage Retrieval (PR), and Answer Extraction (AE). In the next sub-section we describe the implementation of the three components for the system that processes manual transcripts. We conclude this section with the changes required for the handling of automatic transcripts.

2.1 QA System for Manual Transcripts

For the processing of manual transcripts we used an improved version of the system introduced in [9]. We describe it briefly below.

Question Processing. The main goal of this component is to detect the type of the expected answer (e.g., the name of a location, an organization, etc.). We currently recognize the 53 open-domain answer types from [7] and an additional 3 types that are specific to the corpora used in this evaluation (i.e., system/method, shape, and material). The answer types are extracted using a multi-class Perceptron classifier and a rich set of lexical, semantic (i.e., distributional similarity) and syntactic (part-of-speech (POS) tags and syntactic chunks) features. This classifier obtains an accuracy of 88.5% on the corpus of [7]. Additionally, the QP component extracts and ranks relevant keywords from the question (e.g., a noun is ranked as more important than a verb, and stop words are skipped). Since questions are typed text in all QAST scenarios, we used the same QP component for both manual and automatic transcripts.

Passage Retrieval. The goal of this component is to retrieve a set of relevant passages from the document collection, given the previously extracted question keywords. The PR algorithm uses a query relaxation procedure that iteratively adjusts the number of keywords used for retrieval and their proximity until the quality of the recovered information is satisfactory (see [9]). In each iteration a Document Retrieval application (Lucene, http://jakarta.apache.org/lucene) fetches the documents relevant for the current query, and a subsequent passage construction module builds passages as segments in which two consecutive keyword occurrences are separated by at most t words. Figure 1 shows an example of passage construction for a simple query and one sample sentence. This algorithm uses limited syntax (only POS tags), which makes it very robust for speech transcripts.

[Figure 1: Example of passage construction for the keywords "relevant", "documents", and "process" over the sample sentence "documents must be separated into relevant documents and irrelevant documents by manual process, which ...". Consecutive keyword occurrences separated by at most t words fall into the same passage; occurrences farther apart than t words do not.]
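The passage construction step illustrated in Figure 1 can be sketched as follows. This is a hypothetical reconstruction from the description above, not the actual implementation; the function name and the choice of t are illustrative.

    # Minimal sketch of passage construction: consecutive query keyword
    # occurrences separated by at most t words are grouped into one passage.

    def build_passages(tokens, keywords, t):
        """tokens: the document as a list of words; keywords: query keywords;
        t: maximum word distance between consecutive keyword occurrences."""
        keyword_set = {k.lower() for k in keywords}
        positions = [i for i, w in enumerate(tokens) if w.lower() in keyword_set]

        passages = []
        start = prev = None
        for pos in positions:
            if start is None:
                start = prev = pos
            elif pos - prev <= t:
                prev = pos                       # within distance t: extend passage
            else:
                passages.append(tokens[start:prev + 1])
                start = prev = pos               # too far away: start a new passage
        if start is not None:
            passages.append(tokens[start:prev + 1])
        return passages

    # Example in the spirit of Figure 1, with an illustrative t = 3:
    sentence = ("documents must be separated into relevant documents and "
                "irrelevant documents by manual process , which").split()
    print(build_passages(sentence, ["relevant", "documents", "process"], t=3))
    # -> [['documents'],
    #     ['relevant', 'documents', 'and', 'irrelevant', 'documents',
    #      'by', 'manual', 'process']]

With this setting, the first occurrence of "documents" is too far from the next keyword occurrence and ends up in its own segment, while the remaining keyword occurrences are grouped into a single passage.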