DFKI-LT at QAST 2007: Adapting QA components to mine answers in speech transcripts

Günter Neumann and Rui Wang
LT Lab, DFKI, Saarbrücken
neumann|wang.rui@dfki.de

Abstract

The paper describes QAst-v1, a robust question answering system for answering factoid questions in manual and automatic transcriptions of speech. Our system is an adaptation of the text-based crosslingual open-domain QA system that we used for the CLEF main tasks. In particular, we assume that good answer candidates for factoid questions are named entities which are type-compatible with the expected answer type of the question. The main features of QAst-v1 are: the use of preemptive off-line annotation of speech transcripts with sentence boundaries, chunk structures, and named entities (NEs); the construction of a full-text search index over words and all found NEs; the use of a robust Wh-analysis component to determine shallow dependency structures, named entities, and the expected answer type (EAT); EAT-driven retrieval of sentences and answer candidates; and the use of redundancy as an indicator of good answer candidates. The main focus of our effort was the technical realization of a first QAST research prototype that reuses as many of our existing QA components as possible. The results of the QAST 2007 evaluation of the system were as follows: for subtask T1 (question answering in manual transcriptions of lectures) we achieved an overall accuracy (ACC) of 15% and a mean reciprocal rank (MRR) of 0.17; for subtask T2 (question answering in automatic transcriptions of lectures) we obtained 9% (ACC) and 0.09 (MRR).

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.2.3 [Database Management]: Languages—Query Languages; I.2 [Artificial Intelligence]: I.2.7 Natural Language Processing

General Terms

Measurement, Performance, Experimentation

Keywords

QAst pilot task, question answering, speech transcripts from lectures

1 Introduction

The focus of the new Question Answering on Speech Transcripts (QAst) track within CLEF 2007 is on extracting answers to written factoid questions from manual and automatic transcripts of recordings of spoken lectures and meetings. Although the basic functionality of a QAst-based system is similar to that of a textual QA system, the nature of the different scenarios and answer sources poses new challenges. The answer sources for CLEF- and TREC-like systems are usually text documents such as news articles or articles from Wikipedia. In general, an article in such a corpus describes a single topic in a linguistically and stylistically well-formed short text that has been created through a number of revision loops. In this sense, such an article can be considered to have been created off-line for the prospective reader. By contrast, transcripts of lectures or meetings are live records of spontaneous speech produced incrementally (or on-line) in human-human interaction. Here, revisions of utterances (error corrections or refinements) take place explicitly and immediately, or not at all. Thus, speech transcripts also encode the properties of incremental language production, such as word repetitions, error corrections, refinements, and interruptions. Consequently, transcripts are less well-formed, stylistically polished, and fluent than written texts. Furthermore, in the case of automatic transcripts, recognition errors and language gaps caused by the automatic speech recognition system make things no easier for a QAst-based system (see also the Background tab of the QAst CLEF 2007 web page at http://www.lsi.upc.edu/~qast/index.html).

QA on speech transcripts therefore demands a high degree of robustness and flexibility from the QA components and their architecture. Nevertheless, the component architecture of a QAst-based system is similar to that of a textual QA system and consists of the following core functionality: NL question analysis, retrieval of relevant snippets from speech transcripts, answer extraction, and answer selection. Therefore, we decided to develop our initial prototype QAst-v1 following the same underlying design principles that we used for our textual QA system and by adapting some of its core components, cf. [3, 5].

2 System Overview

[Figure 1: The architecture of QAst-v1.]

The current architecture is shown in Figure 1. In an off-line phase we first generate an inverted index for the speech corpora such that each sentence is treated as a single document and indexed by its word forms and named entities. In the question answering phase, a list of NL questions is passed to the system. Each question is analyzed by the named entity recognizer and by the question analysis component. The main output is a question object which represents the expected answer type (EAT) of the question and its relevant keywords. For example, the EAT of the question "Where is Southern Methodist University?" is location, and the relevant keyword is the phrase "Southern Methodist University". From the question object an IR-query expression is created in order to access the indexed document space. The IR-query for the example question is {+neTypes:location AND +"southern methodist university"}, which can be read as "select only documents (in our case, only sentences) which contain at least one location entity and the phrase southern methodist university"; see Section 5 for details. In the answer extraction step, all location names found are considered answer candidates, and the most frequent candidates are selected as answers to the question; e.g., "Dallas" and "Texas" are found as possible answers in the manual transcript of the lecture corpus. For each question, a list of its N best answers is returned.
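To make the data flow concrete, here is a minimal Python sketch of the question object and of the IR-query construction for the example above. The names QuestionObject and build_ir_query are illustrative assumptions, not part of QAst-v1; the actual system builds its queries against a Lucene index as described in Sections 4 and 5.

    from dataclasses import dataclass

    @dataclass
    class QuestionObject:
        eat: str            # expected answer type, e.g. "location"
        keywords: list      # relevant keywords/phrases extracted from the question

    def build_ir_query(q: QuestionObject) -> str:
        # Mirror the example query {+neTypes:location AND +"southern methodist university"}:
        # require at least one entity of the expected answer type plus every keyword phrase.
        phrases = " ".join(f'+"{k.lower()}"' for k in q.keywords)
        return f"+neTypes:{q.eat} AND {phrases}"

    q = QuestionObject(eat="location", keywords=["Southern Methodist University"])
    print(build_ir_query(q))   # +neTypes:location AND +"southern methodist university"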
In the next sections, we describe the core components in more detail, starting with named entity recognition because it is used by all other components.

3 Named Entity Recognition

[Figure 2: The internal structure of our hybrid named entity recognizer.]

Named entity recognition (NER) plays a central role in a factoid QA architecture: named entities are the answers to factoid questions and as such define the range of expected answer types. In the QAst 2007 pilot task the following answer types are used: person, location, organization, time, language, system/method, measure, colour, shape, and material. These answer types correspond directly to the types of named entities that should be covered by NER. A number of NER components already exist, but they differ in the types they cover. For that reason, we developed a hybrid NER approach in which we combine three different NER components, see also Figure 2:

• LingPipe, developed by Alias-i and available at http://www.alias-i.com/lingpipe/. It mainly covers person, location, and organization names for English, as well as co-references between pronouns and the corresponding named entities. It implements a supervised statistical approach to NER.

• The Opennlp tools, available at http://opennlp.sourceforge.net/. Its name finder is also based on a supervised statistical approach and mainly covers seven NE types for English, viz. person, location, organization, date, time, money, and percentage.

• BiQueNER, developed by our group. It is based on the co-training algorithm for named entities developed by [1]. We use BiQueNER to handle all NE types, especially language, system/method, measure, colour, shape, and material.

LingPipe and Opennlp require large sets of annotated training material, which we did not have at hand for speech transcripts. Hence we directly used the language models both tools come with, which were created for written news texts. BiQueNER embeds a very flexible gazetteer sub-component, which we used to create NE lists for the covered types by means of a simple word-frequency analysis of the corpora and of resources from Wikipedia. Performing a training step with BiQueNER turned out to be unfeasible for the moment, because BiQueNER requires the corpus to be preprocessed with a POS tagger and a chunker. We used the Opennlp tools for this, but the result of this preprocessing step on the CHIL corpora is currently inadequate when the available Opennlp models are used. Linguistically annotated speech transcripts are therefore urgently needed.

All three NERs run in parallel on the input text, and we construct NE-specific indexes for each individual recognizer. The individual results are combined via the IR-query construction process (see Section 5) and the answer extraction process (see Section 6). In this way, conflicting cases are also handled, such as different NE readings (e.g., the same instance is typed as person by LingPipe and as organization by BiQueNER) and (implicitly) partial or overlapping annotations (e.g., the name "Rui Wang" is recognized as a single person name by LingPipe but as two different person names by BiQueNER).
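The following sketch illustrates, under simplifying assumptions, how the readings of several recognizers can be kept side by side instead of being merged destructively. The toy recognizers and the helper combine_annotations are hypothetical stand-ins for LingPipe, the Opennlp name finder, and BiQueNER; in the actual system the per-recognizer results are stored in separate NE-specific indexes.

    def combine_annotations(sentence, recognizers):
        # Run every recognizer on the sentence and keep each reading, tagged with its
        # source. Conflicting or overlapping readings are not resolved at this point;
        # they are reconciled later during IR-query construction and answer extraction.
        annotations = []
        for name, recognize in recognizers.items():
            for text, ne_type, start, end in recognize(sentence):
                annotations.append({"source": name, "text": text,
                                    "type": ne_type, "span": (start, end)})
        return annotations

    # Toy recognizers reproducing the overlapping "Rui Wang" readings mentioned above.
    recognizers = {
        "lingpipe": lambda s: [("Rui Wang", "person", 0, 8)],
        "biquener": lambda s: [("Rui", "person", 0, 3), ("Wang", "person", 4, 8)],
    }
    print(combine_annotations("Rui Wang gave a lecture.", recognizers))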
4 Document Preprocessing

Based on a number of experiments carried out during the development of our textual open-domain QA technology, we developed the hypothesis that a structural analysis of unstructured documents, oriented towards the information needs of questions, supports the retrieval of small, relevant textual units through informative IR-queries. However, since we cannot foresee all the different users' interests or questions, a challenging research question is: how detailed can the structural analysis be made without forcing a "straitjacket" of one particular interpretation onto the unstructured source? Thus, there is a tradeoff between off-line and on-line document annotation. Questions and answers are related in that questions influence the information geometry and hence the information view and access, cf. [4]. Furthermore, a sentence-oriented preprocessing that determines only sentence boundaries, named entities (NEs), and their co-references turned out to be a useful level of off-line annotation for written texts, at least for the CLEF kind of factual questions. For that reason we decided to apply the same off-line preprocessing approach to the QAst collections. In particular, the following steps are performed:

• Extraction of lines of words from the automatic speech transcripts, so that the manual and the automatic transcripts are in the same format.

• Identification of sentence boundaries using the sentence splitter of the Opennlp tools, which is based on maximum entropy modeling. We currently use the language model the sentence splitter comes with, which is optimized for written texts.

• Annotation of the sentences with the recognized named entities.

The preprocessed documents are further processed by the IR engine Lucene, cf. [2]. We use Lucene in such a way that, for all extracted named entities and content words, Lucene provides indexes which point directly to the corresponding sentences. In the case of named entities, type-based indexes are created which support the specification of type constraints in an IR-query. In this way we can query the IR component not only with keywords extracted from the questions, but also with NE types corresponding to their expected answer types. An example makes this clear: for the question "Where is Southern Methodist University located?", besides creating an IR-query containing the keywords {+"Southern Methodist University", +located}, we can also supply the expected answer type location by querying an additional field neTypes: {+text:"Southern Methodist University", +text:located, +neTypes:location}. This not only narrows the amount of data to be analyzed during answer extraction, but also guarantees that each retrieved sentence contains at least one answer candidate.
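As a schematic illustration of this indexing layout (and only of the layout, not of the actual Lucene configuration), the following plain-Python sketch indexes each sentence as its own document with a text field and a neTypes field; the class SentenceIndex and its methods are assumptions made for the example.

    from collections import defaultdict

    class SentenceIndex:
        # Each sentence is indexed as its own "document" with a text field (word
        # forms) and a neTypes field (types of the named entities found in it).
        def __init__(self):
            self.text_postings = defaultdict(set)     # word    -> sentence ids
            self.netype_postings = defaultdict(set)   # NE type -> sentence ids
            self.sentences = []

        def add(self, sentence, entities):
            sid = len(self.sentences)
            self.sentences.append((sentence, entities))
            for word in sentence.lower().split():
                self.text_postings[word].add(sid)
            for _, ne_type in entities:
                self.netype_postings[ne_type].add(sid)

        def query(self, words, ne_type):
            # Conjunctive query: every keyword must occur in the sentence and the
            # sentence must contain at least one entity of the expected answer type.
            hits = set(self.netype_postings[ne_type])
            for word in words:
                hits &= self.text_postings[word.lower()]
            return [self.sentences[sid] for sid in sorted(hits)]

    index = SentenceIndex()
    index.add("Southern Methodist University is located in Dallas Texas",
              [("Southern Methodist University", "organization"),
               ("Dallas", "location"), ("Texas", "location")])
    print(index.query(["Southern", "Methodist", "University", "located"], "location"))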
5 Question Processing and Sentence Retrieval

In the current QAst 2007 task setting, natural language questions are specified in written form. For this reason we were able to integrate the question parser of our textual QA system into QAst-v1. The question parser computes for each question a syntactic dependency tree (which also contains the recognized named entities) as well as semantic information such as the question type, the expected answer type, and the question focus, cf. [3] for details. The semantic information is determined on the basis of syntactic constraints applied to the relevant NP and VP phrases of the dependency tree, and by taking into account information from two small structured vocabularies. These basically map linguistic entities to values of the question type (e.g., trigger phrases like "name of", "type of", "abbreviation of") or map lexical elements to expected answer types (e.g., "town", "person", "president").

In a second step, the result of the question parser is mapped to an ordered set of alternative IR-queries, following the same approach as in our textual QA system, cf. [3]. The alternative IR-queries differ in their degree of specialization. For example, for the question "Where is Southern Methodist University located?" we construct {+text:"Southern Methodist University", +text:located, +neTypes:location} and {text:Southern text:Methodist text:University, text:located, +neTypes:location}. The latter relaxes the requirement that the phrase "Southern Methodist University" and the keyword "located" must both be present in a sentence for it to serve as a relevant snippet from which answer candidates are extracted. The different IR-queries are passed to Lucene in the order of their specificity, such that if a higher-ranked IR-query does not return any sentence, the next, less specific IR-query is tried. The alternative IR-queries are thus processed in the manner of a decision list. Lucene returns a list of all matching sentences; each returned sentence contains at least one named entity which is an instance of the EAT and overlaps with the set of keywords, depending on the specificity of the IR-query. The set of retrieved sentences is ordered according to Lucene's similarity measure applied to each sentence and the IR-query.
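A minimal sketch of this decision-list behaviour is given below. The function retrieve and the toy search callable are illustrative assumptions that stand in for the Lucene calls used by the actual system.

    def retrieve(ir_queries, search):
        # ir_queries: alternative query strings ordered from most to least specific.
        # search: callable that evaluates one query string and returns the matching
        # sentences (a stand-in for the Lucene search call).
        for query in ir_queries:
            hits = search(query)
            if hits:                  # stop at the first query that returns sentences
                return query, hits
        return None, []               # nothing matched; the answer will be NIL

    queries = [
        '+text:"southern methodist university" +text:located +neTypes:location',
        'text:southern text:methodist text:university text:located +neTypes:location',
    ]
    # Toy search function: only the relaxed second query matches in this example.
    hits_by_query = {queries[1]: ["southern methodist university is in dallas texas"]}
    print(retrieve(queries, lambda q: hits_by_query.get(q, [])))

Relaxing the query in this way trades precision for recall: the phrase and keyword constraints are dropped only when the more specific query returns nothing.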
6 Answer Extraction

In the current version of QAst-v1, answers are taken to be instances of the expected answer type (EAT). The EAT corresponds to one of the named entity types covered by our system. Thus we consider each named entity in the retrieved sentences an answer candidate if and only if its type is the same as the EAT. For each answer candidate we compute its frequency relative to the set of retrieved sentences and order the candidates accordingly. Finally, we filter out ill-formed answer candidates consisting of material like "the", "muh", or "uhm", using a set of manually specified rules. If no EAT-compatible named entity exists, the empty answer NIL is returned.
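Assuming the same toy sentence representation as in the earlier sketches (a list of (text, type) entity pairs per sentence), the following sketch illustrates this selection step; the small FILLERS set merely stands in for the manually specified filter rules.

    from collections import Counter

    FILLERS = {"the", "muh", "uhm"}   # illustrative subset of the manual filter rules

    def extract_answers(retrieved_sentences, eat, n_best=5):
        # Count EAT-typed entities over the retrieved sentences, drop filler-like
        # candidates, and return the N most frequent answers (or NIL if none exist).
        counts = Counter()
        for _sentence, entities in retrieved_sentences:
            for text, ne_type in entities:
                if ne_type == eat and text.lower() not in FILLERS:
                    counts[text] += 1
        if not counts:
            return ["NIL"]
        return [answer for answer, _ in counts.most_common(n_best)]

    sentences = [
        ("southern methodist university is in dallas texas",
         [("Dallas", "location"), ("Texas", "location")]),
        ("dallas is a city in texas", [("Dallas", "location"), ("Texas", "location")]),
    ]
    print(extract_answers(sentences, "location"))   # ['Dallas', 'Texas']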
7 Results and Discussion

We took part in the following tasks:

• T1: Question answering in manual transcriptions of lectures;
• T2: Question answering in automatic transcriptions of lectures.

In both cases the CHIL corpus was used, adapted by the organizers for the QAst 2007 track. It consists of around 25 hours of lectures (around 1 hour per lecture), both manually and automatically transcribed. The language is European English, mostly spoken by non-native speakers. We submitted one run per task; the table below shows the results we obtained:

Run        Task   Questions returned (#)   Correct answers (#)   MRR    Accuracy
dfki1_t1   T1     98                       19                    0.17   0.15
dfki1_t2   T2     98                       9                     0.09   0.09

Here MRR is the mean reciprocal rank, which measures, averaged over all questions, how highly the first correct answer is ranked in the list of 5 returned answers. Accuracy is the fraction of questions for which a correct answer is ranked in the first position of that list.

The currently low number of correct answers has two main error sources. On the one hand, the coverage and quality of the named entity recognizers are low. This is probably due to the fact that we used language models created from written texts. One possible remedy is to improve the corpus preprocessing step, especially the sentence splitter and the repair of disfluencies such as word repetitions. Another possible source of improvement is the development of an annotated training corpus of speech transcripts for named entities. Both activities clearly demand further research and resources. On the other hand, the performance of the answer extraction process strongly depends on the coverage and quality of the question analysis tool. We will improve this by extending the current coverage of the English Wh-grammar, especially by extending the mapping of general verbs and nouns to the corresponding expected answer types and by exploiting strategies that validate the semantic type consistency between the relevant nouns and verbs of a question.

Acknowledgement

The work presented here has been partially supported by a research grant from the German Federal Ministry of Education, Science, Research and Technology (BMBF) to the DFKI project HyLaP (FKZ: 01 IW F02) and by the EU-funded project QALL-ME (FP6 IST-033860).

References

[1] M. Collins and Y. Singer. Unsupervised models for named entity classification, 1999.
[2] E. Hatcher and O. Gospodnetic. Lucene in Action. Manning Publications Co., Greenwich, CT, USA, 2004.
[3] G. Neumann and S. Sacaleanu. Experiments on robust NL question interpretation and multi-layered document annotation for a cross-language question/answering system. In CLEF 2004, volume 3491 of LNCS, pages 411-422. Springer-Verlag, 2005.
[4] C. J. van Rijsbergen. The Geometry of Information Retrieval. Cambridge University Press, 2004.
[5] B. Sacaleanu and G. Neumann. DFKI-LT at the CLEF 2006 Multiple Language Question Answering track. In Working Notes of CLEF 2006, Alicante, Spain, August 2006.