         Robust Question Answering for Speech Transcripts
                Using Minimal Syntactic Analysis
                          Pere R. Comas, Jordi Turmo and Mihai Surdeanu
                                      TALP Research Center
                              Technical University of Catalonia (UPC)
                              {pcomas,turmo,surdeanu}@lsi.upc.edu


                                                Abstract
           This paper describes the participation of the Technical University of Catalonia in the
       CLEF 2007 Question Answering on Speech Transcripts track. For the processing of manual
       transcripts we have deployed a robust factual Question Answering (QA) system that uses minimal
       syntactic information. For the handling of automatic transcripts we combine the QA system with a
       novel Passage Retrieval and Answer Extraction engine, based on a sequence alignment
       algorithm that searches for “sounds like” sequences in the document collection. We have also
       enriched the Named Entity Recognizer and Classifier (NERC) with phonetic features to facilitate
       the recognition of named entities even when they are incorrectly transcribed.


Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]

General Terms
Experimentation, Performance, Measurement

Keywords
Question Answering, Spoken Document Retrieval, Phonetic Distance


1      Introduction
The CLEF 2007 Question Answering on Speech Transcripts (QAST) track consists of the following
four tasks:

T1: Question Answering (QA) using the manual transcripts of the lectures recorded within
     the CHIL European Union project1 as the underlying document collection.
T2: QA using the automatic transcripts of the CHIL lectures. Word lattices from an automatic
     speech recognizer (ASR) are provided as an additional input source for systems that prefer
     to decide internally what the best automatic segmentation is.
T3: QA using the manual transcripts of the meetings that form the corpus collected by the AMI
     European Union project2.
T4: QA using the automatic transcripts of the above AMI meetings.
    1 http://chil.server.de
    2 http://www.amiproject.org
    For tasks T1 and T3 we have adapted a QA system and a Named Entity Recognizer and Classifier
(NERC) that we previously developed for the processing of manual speech transcripts [9, 10].
Both systems obtained good performance in previous evaluations, even though they require
only minimal syntactic analysis of the underlying documents (only part-of-speech tagging) and minimal
additional annotation (punctuation signs are optional). For the handling of automatic transcripts
(tasks T2 and T4) we implemented two significant system changes: (a) for Passage Retrieval
(PR) and Answer Extraction (AE) we designed a novel keyword matching engine that relies on
phonetic similarity, instead of exact string matching, to overcome the errors introduced by the ASR;
and (b) we enriched the NERC with phonetic features to facilitate the recognition of named
entities even when they are incorrectly transcribed. Even though the resulting QA system does
not outperform the initial QA system on tasks T2 and T4, we believe these design choices are a
good longer-term research direction because they can address ASR-specific phenomena.
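    To make the phonetic matching idea concrete, the following is a minimal Python sketch, not
the actual algorithm of Section 4: it assumes question and transcript words have already been
mapped to phoneme strings by some grapheme-to-phoneme module, and it scores them with a plain
Levenshtein distance; the threshold and the example phoneme strings are illustrative assumptions.

# Minimal sketch of "sounds like" keyword matching. Assumes words are already
# converted to phoneme strings; the plain Levenshtein distance and max_dist
# threshold are illustrative assumptions, not the alignment of Section 4.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def sounds_like(query_phones, transcript_phones, max_dist=1):
    """Accept a transcript word whose phonetic form is close to the query's."""
    return edit_distance(query_phones, transcript_phones) <= max_dist

# A hypothetical ASR error: "Prague" /prag/ transcribed as "prog" /prOg/.
print(sounds_like("prag", "prOg"))  # True: one substitution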
    The paper is organized as follows. Section 2 overviews the architecture of the QA system.
Section 3 describes the NERC improvements, for both manual and automatic transcripts. Section 4
details the novel keyword matching algorithm we designed for automatic transcripts. Section 5
contains the results of the empirical evaluation and Section 6 concludes the paper.


2      Overview of the System Architecture
The architecture of our QA system follows a commonly used schema, which splits the process into
three phases that are performed sequentially: Question Processing (QP), Passage Retrieval (PR),
and Answer Extraction (AE). In the next sub-section we describe the implementation of the three
components for the system that processes manual transcripts. We conclude this section with the
changes required for the handling of automatic transcripts.

2.1     QA System for Manual Transcripts
For the processing of manual transcripts we used an improved version of the system introduced
in [9]. We describe it briefly below.

Question Processing. The main goal of this component is to detect the type of the expected
    answer (e.g., the name of a location, an organization, etc.). We currently recognize the 53 open-
    domain answer types from [7] and three additional types that are specific to the corpora
    used in this evaluation (i.e., system/method, shape, and material). The answer types are
    extracted using a multi-class Perceptron classifier and a rich set of lexical, semantic (i.e.,
    distributional similarity), and syntactic (part-of-speech (POS) tags and syntactic chunks)
    features. This classifier obtains an accuracy of 88.5% on the corpus of [7]. Additionally,
    the QP component extracts and ranks the relevant keywords of the question (e.g., a noun is
    ranked as more important than a verb, and stop words are skipped). Since questions are typed
    text in all QAST scenarios, we used the same QP component for both manual and automatic
    transcripts.
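
    As a rough illustration of this keyword ranking heuristic, consider the minimal Python sketch
below; the stop-word list, the Penn-style POS tags, and the numeric weights are illustrative
assumptions, since the exact ranking scheme is not reproduced here.

# Sketch of question keyword extraction and ranking. The stop words, POS
# tags, and weights are assumptions; the paper only states that, e.g.,
# nouns outrank verbs and stop words are skipped.

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "what", "which", "who"}
POS_WEIGHT = {"NNP": 4, "NN": 3, "VB": 2, "JJ": 1}  # assumed importance order

def rank_keywords(tagged_question):
    """tagged_question: list of (token, POS) pairs produced by a POS tagger."""
    scored = [(tok, POS_WEIGHT[pos]) for tok, pos in tagged_question
              if tok.lower() not in STOP_WORDS and pos in POS_WEIGHT]
    # Most important keywords first, so later stages can drop weak ones.
    return [tok for tok, _ in sorted(scored, key=lambda kw: -kw[1])]

print(rank_keywords([("Which", "WDT"), ("system", "NN"),
                     ("recognizes", "VB"), ("named", "JJ"),
                     ("entities", "NN")]))
# -> ['system', 'entities', 'recognizes', 'named']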

Passage Retrieval. The goal of this component is to retrieve a set of relevant passages from the
     document collection, given the previously extracted question keywords. The PR algorithm
     uses a query relaxation procedure that iteratively adjusts the number of keywords used for
     retrieval and their proximity until the quality of the recovered information is satisfactory
     (see [9]). In each iteration, a Document Retrieval application3 fetches the documents relevant
     to the current query, and a subsequent passage construction module builds passages as
     segments where two consecutive keyword occurrences are separated by at most t words.
     Figure 1 shows an example of passage construction for a simple query and one sample
     sentence. This algorithm uses limited syntax (only POS tags), which makes it very robust
     for speech transcripts.
    3 Lucene - http://jakarta.apache.org/lucene
Figure 1: Example of passage construction. Query keywords: relevant, documents, process. In the
sentence “documents must be separated into relevant documents and irrelevant documents by manual
process, which . . . ”, the constructed passage spans “relevant documents and irrelevant documents
by manual process”: every two consecutive keyword occurrences inside it are at most t words apart,
whereas the initial “documents” lies more than t words from the next keyword and is left outside
the passage.
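
    The passage construction step illustrated in Figure 1 can be sketched in a few lines of Python.
The sketch below is a simplified reconstruction under stated assumptions: it merges keyword
occurrences that are at most t words apart into one span, and the names and the handling of
isolated matches are ours, not the paper's.

# Simplified sketch of passage construction: keyword occurrences at most
# t words apart are merged into one passage. Isolated matches (like the
# first "documents" in Figure 1) would be filtered in practice; omitted here.

def build_passages(tokens, keywords, t):
    """Return (start, end) token spans covering nearby keyword occurrences."""
    hits = [i for i, tok in enumerate(tokens) if tok.lower() in keywords]
    spans = []
    for i in hits:
        if spans and i - spans[-1][1] <= t:
            spans[-1][1] = i        # within distance t: extend current passage
        else:
            spans.append([i, i])    # farther than t: start a new passage
    return [tuple(span) for span in spans]

tokens = ("documents must be separated into relevant documents and "
          "irrelevant documents by manual process , which").split()
print(build_passages(tokens, {"relevant", "documents", "process"}, t=3))
# -> [(0, 0), (5, 12)]; the second span is the passage shown in Figure 1

The query relaxation loop of [9] would then vary the keyword set and the threshold t across
iterations until passages of sufficient quality are retrieved.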