
University of Ottawa's participation in the CL-SR task at CLEF 2006

Muath Alzghool

alzghool@site.uottawa.ca

Diana Inkpen

diana@site.uottawa.ca

School of Information Technology and Engineering, University of Ottawa

This paper presents the second participation of the University of Ottawa group in CLEF, in the Cross-Language Speech Retrieval (CL-SR) task. We present the results of the submitted runs for the English collection and, very briefly, for the Czech collection, followed by many additional experiments. We used two Information Retrieval systems in our experiments, SMART and Terrier, which were tested with many different weighting schemes for indexing the documents and the queries and with several query expansion techniques (including a new method based on log-likelihood scores for collocations). Our experiments showed that query expansion methods do not help much for this collection. We tested whether the new Automatic Speech Recognition transcripts improve the retrieval results; we also tested combinations of different automatic transcripts (with different estimated word error rates). The retrieval results did not improve, probably because the speech recognition errors affected the words that are important for retrieval, even in the newer ASR2006 transcripts. By using different system settings, we improved on our submitted result for the required run (English queries, title and description) on automatic transcripts plus automatic keywords. We present cross-language experiments, where the queries are automatically translated by combining the results of several online machine translation tools. Our experiments showed that high-quality automatic translations (for French) led to results comparable with monolingual English, while the performance decreased for the other languages. Experiments on indexing the manual summaries and keywords gave the best retrieval results.

1 Introduction

This paper presents the second participation of the University of Ottawa group in CLEF, in the Cross-Language Speech Retrieval (CL-SR) track. We briefly describe the task [10]. Then we present our systems, followed by results for the submitted runs for the English collection and, very briefly, for the Czech collection. We present results for many additional runs for the English collection. We experiment with many possible weighting schemes for indexing the documents and the queries, and with several query expansion techniques. We test different speech recognition transcripts to see whether the word error rate has an impact on the retrieval performance. We describe cross-language experiments, where the queries are automatically translated from French, Spanish, German, and Czech into English by combining the results of several online machine translation (MT) tools. At the end we present the best results, obtained when summaries and manual keywords were indexed.

The CLEF-2006 CL-SR collection includes 8104 English segments, and 105 topics (queries). Relevance judgments were provided for 63 training topics, and later for 33 test topics. In each document (segment), there are six fields that can be used for the official runs: ASRTEXT2003A, ASRTEXT2004A, ASRTEXT2006A, ASRTEXT2006B, AUTOKEYWORD2004A1, and AUTOKEYWORD2004A2. The first four fields are transcripts produced using Automatic Speech Recognition (ASR) systems developed by the IBM T. J. Watson Research Center in three successive years 2003, 2004, and 2006, with different estimated mean word error rates of 44%, 38%, and 25% respectively.

Among the 8104 segments covered by the test collection, only 7377 segments have the ASRTEXT2006A field. The content of the ASRTEXT2006B field is identical to the ASRTEXT2006A field if the 2006 system produced ASR output for the segment, and identical to the ASRTEXT2004A field otherwise. Moreover, only 7034 segments have the ASRTEXT2003A field. The AUTOKEYWORD2004A1 and AUTOKEYWORD2004A2 fields contain sets of thesaurus keywords that were assigned automatically by two different k-Nearest Neighbor (kNN) classifiers, using only words from the ASRTEXT2004A field of the segment. Among the 8104 segments covered by the test collection, 8071 and 8090 segments have the AUTOKEYWORD2004A1 and AUTOKEYWORD2004A2 fields, respectively.
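The fallback rule that defines ASRTEXT2006B can be written down in a few lines. The sketch below is only an illustration: it assumes each segment is available as a dictionary keyed by field name, which is a hypothetical in-memory representation and not the collection's actual file format.

```python
from typing import Mapping, Optional


def asrtext2006b(segment: Mapping[str, Optional[str]]) -> Optional[str]:
    """Return the 2006 transcript when present, else fall back to the 2004 one.

    `segment` is a hypothetical dict of field name -> text (assumption).
    """
    text_2006 = segment.get("ASRTEXT2006A")
    if text_2006:  # the 2006 system produced output for this segment
        return text_2006
    return segment.get("ASRTEXT2004A")
```

With 7377 of the 8104 segments carrying ASRTEXT2006A, this rule falls back to the 2004 transcript for the remaining 8104 - 7377 = 727 segments.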

There is also a Czech collection for this year’s CL-SR track; the document collection consists of ASR transcripts for 354 interviews in Czech, together with some manually assigned metadata and some automatically generated metadata, and 115 search topics in two languages (Czech and English). The task for this collection is to return a ranked list of time stamps marking the beginning of sections that are relevant to a topic.

2 System Overview

The University of Ottawa Cross-Language Information Retrieval (IR) systems were built with off-the-shelf components. For translating the queries from French, Spanish, German, and Czech into English, several free online machine translation tools were used. Their output was merged in order to allow for variety in lexical choices. All the translations of a title made the title of the translated query; the same was done for the description and narrative fields. For the retrieval part, the SMART [2, 9] IR system and the Terrier [1, 6] IR system were tested with many different weighting schemes for indexing the collection and the queries.
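The merging of translations can be sketched as a field-by-field concatenation. This is a minimal illustration, not our actual tooling: the dictionary layout and the names `FIELDS` and `merge_translations` are assumptions made for the example.

```python
from typing import Dict, List

# Topic fields that are merged independently of one another.
FIELDS = ("title", "description", "narrative")


def merge_translations(translations: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate the translations of each topic field into one query field.

    `translations` holds one dict per MT tool (hypothetical layout); a tool
    that produced nothing for a field is simply skipped.
    """
    return {
        field: " ".join(t[field].strip() for t in translations if t.get(field))
        for field in FIELDS
    }
```

Because nothing is deduplicated, a word chosen by several MT tools appears several times in the merged field, so it receives a higher query term frequency than a word chosen by only one tool.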

For translating the topics into English we used several online MT tools. The idea behind using multiple translations is that they might provide more variety of words and phrases, therefore improving the retrieval performance. The seven online MT systems that we used for translating from Spanish, French, and German were:

1. http://www.google.com/language_tools?hl=en
2. http://www.babelfish.altavista.com
3. http://freetranslation.com
4. http://www.wordlingo.com/en/products_services/wordlingo_translator.html
5. http://www.systranet.com/systran/net
6. http://www.online-translator.com/srvurl.asp?lang=en
7. http://www.freetranslation.paralink.com

For translating the Czech-language topics into English we were able to find only one online MT system: http://intertran.tranexp.com/Translate/result.shtml.

We combined the outputs of the MT systems by simply concatenating all the translations. All seven translations of a title made the title of the translated query; the same was done for the description and narrative fields. We used the combined topics for all the cross-language experiments reported in this paper.

3 Retrieval

We used two systems in our participation: SMART and Terrier. SMART was originally developed at Cornell University in the 1960s. It is based on the vector space model of information retrieval [2] and generates weighted term vectors for the document collection. SMART preprocesses the documents by tokenizing the text into words, removing common words that appear on its stop-list, and stemming the remaining words to derive a set of terms. When the IR server executes a user query, the query terms are also converted into weighted term vectors. The vector inner-product similarity is then used to rank documents in decreasing order of their similarity to the user query. The newest version of SMART (version 11) offers many state-of-the-art options for weighting the terms in the vectors. Each term-weighting scheme is described as a combination of term frequency, collection frequency, and length normalization components [8].

In this paper we employ the notation used in SMART to describe the combined schemes: xxx.xxx. The first three characters refer to the weighting scheme used to index the document collection and the last three characters refer to the weighting scheme used to index the query fields. In SMART, we mainly used the lnn.ntn weighting scheme, which performed very well in the CLEF 2005 CL-SR task [4]; lnn.ntn means that lnn was used for documents and ntn for queries, according to the following formulas:

$$\mathrm{weight}_{lnn} = \ln(tf) + 1.0$$

$$\mathrm{weight}_{ntn} = tf \times \log \frac{N}{n_t}$$

where $tf$ denotes the term frequency of a term $t$ in the document or query, $N$ denotes the number of documents in the collection, and $n_t$ denotes the number of documents in which the term $t$ occurs.
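To make the lnn.ntn scheme concrete, the following sketch computes both kinds of weights and the inner-product score for already tokenized, stopped, and stemmed text. It is an illustration and not SMART's code: the function names are hypothetical, and the natural logarithm is assumed for the query-side log, since the formula above does not fix the base.

```python
import math
from collections import Counter
from typing import Dict, List


def lnn_weights(tokens: List[str]) -> Dict[str, float]:
    """Document side (lnn): ln(tf) + 1.0, no idf, no length normalization."""
    return {t: math.log(tf) + 1.0 for t, tf in Counter(tokens).items()}


def ntn_weights(tokens: List[str], doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Query side (ntn): tf * log(N / n_t).

    Terms that never occur in the collection are dropped (assumption),
    because n_t = 0 would make the ratio undefined.
    """
    return {
        t: tf * math.log(n_docs / doc_freq[t])
        for t, tf in Counter(tokens).items()
        if doc_freq.get(t, 0) > 0
    }


def inner_product(doc_w: Dict[str, float], query_w: Dict[str, float]) -> float:
    """Similarity used for ranking: sum of weight products over shared terms."""
    return sum(w * doc_w[t] for t, w in query_w.items() if t in doc_w)
```

Since neither side is length-normalized, long segments that repeat a query term are not penalized; the logarithm on the document side only dampens the effect of repetition.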

We have also used a query expansion mechanism with SMART, which follows the idea of extracting related words for each word in the topics using the Ngram Statistics Package (NSP) [7]. We extracted the top 6412 pairs of related words based on log-likelihood ratios (high collocation scores in the corpus of ASR transcripts), using a window size of 10 words. We chose log-likelihood scores because they are known to work well even when the text corpus is small. For each word in the topics, we added the related words according to this list. We call this approach to relevance feedback SMARTnsp.
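The expansion step itself amounts to a lookup in the list of related pairs. The sketch below assumes the pairs have already been extracted and scored (for example with NSP) and are given as plain word tuples; treating each pair as symmetric and the function names used here are assumptions of the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_related(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Index the related-word pairs in both directions (symmetry is assumed)."""
    related: Dict[str, Set[str]] = defaultdict(set)
    for w1, w2 in pairs:
        related[w1].add(w2)
        related[w2].add(w1)
    return related


def expand_query(query_terms: List[str], related: Dict[str, Set[str]]) -> List[str]:
    """Append, for each original query word, its related words from the list."""
    expanded = list(query_terms)
    seen = set(query_terms)
    for term in query_terms:
        # Only original query words are looked up, so expansion is one level deep.
        for extra in sorted(related.get(term, ())):
            if extra not in seen:
                seen.add(extra)
                expanded.append(extra)
    return expanded
```

For instance, with the illustrative pairs ("concentration", "camp") and ("camp", "labor"), the query ["camp"] becomes ["camp", "concentration", "labor"].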

Terrier was originally developed at the University of Glasgow. It is based on Divergence from Randomness (DFR) models, where IR is seen as a probabilistic process [1, 6]. We experimented with the In(exp)C2 weighting model, one of Terrier’s DFR-based document weighting models. Using the In(exp)C2 model, the relevance score of a document d for a query q is given by the formula:

$$sim(d, q) = \sum_{t \in q} qtf \cdot w(t, d)$$

where
- $qtf$ is the frequency of term t in the query q,
- $w(t, d)$ is the relevance score of a document d for the query term t, given by:

$$w(t, d) = \frac{F + 1}{n_t \times (tfn_e + 1)} \times \left( tfn_e \times \log_2 \frac{N + 1}{n_e + 0.5} \right)$$

where
- $F$ is the term frequency of t in the whole collection.
- $N$ is the number of documents in the whole collection.
- $n_t$ is the document frequency of t.
- $n_e$ is given by

$$n_e = N \times \left( 1 - \left( 1 - \frac{n_t}{N} \right)^{F} \right)$$

- $tfn_e$ is the normalized within-document frequency of the term t in the document d. It is given by the normalization 2 [1, 3]:

$$tfn_e = tf \times \log_e \left( 1 + c \times \frac{avg\_l}{l} \right)$$

where
- $c$ is a parameter; for the submitted run, we fix this parameter to 1.
- $tf$ is the within-document frequency of the term t in the document d.
- $l$ is the document length and $avg\_l$ is the average document length in the whole collection.
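The In(exp)C2 score can be computed directly from the definitions above. The sketch below transcribes those formulas and is not Terrier's own implementation; the function names and the dictionary-based statistics are assumptions made for the example.

```python
import math
from typing import Dict


def tfn_e(tf: float, doc_len: float, avg_len: float, c: float = 1.0) -> float:
    """Normalization 2: tf * log_e(1 + c * avg_l / l)."""
    return tf * math.log(1.0 + c * avg_len / doc_len)


def inexp_c2_weight(tf: float, doc_len: float, avg_len: float,
                    F: int, n_t: int, N: int, c: float = 1.0) -> float:
    """w(t, d) for one term: F = collection frequency, n_t = document frequency."""
    n_e = N * (1.0 - (1.0 - n_t / N) ** F)
    tfn = tfn_e(tf, doc_len, avg_len, c)
    return (F + 1.0) / (n_t * (tfn + 1.0)) * (tfn * math.log2((N + 1.0) / (n_e + 0.5)))


def score(query_tf: Dict[str, int], doc_tf: Dict[str, int], doc_len: float,
          avg_len: float, coll_tf: Dict[str, int], doc_freq: Dict[str, int],
          N: int, c: float = 1.0) -> float:
    """sim(d, q): sum of qtf * w(t, d) over query terms present in the document."""
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        total += qtf * inexp_c2_weight(tf, doc_len, avg_len,
                                       coll_tf[term], doc_freq[term], N, c)
    return total
```

For a document of average length and c = 1, the normalized frequency reduces to tf × ln 2, so c mainly controls how strongly longer-than-average documents are discounted.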

We estimated the parameter c of the normalization 2 formula by running experiments on the training data, to find the best value of c for each combination of topic fields. We obtained the following values: c=0.75 for queries using the Title only, c=1 for queries using the Title and Description fields, and c=1 for queries using the Title, Description, and Narrative fields. In each case we selected the c value that gave the best MAP score on the training data.
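The selection of c amounts to a small grid search scored by MAP. The sketch below shows one way to do this, assuming uninterpolated average precision and a hypothetical `retrieve` callable that returns the ranked lists obtained with a given c; none of these names come from Terrier or from our evaluation scripts.

```python
from typing import Callable, Dict, Iterable, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision of one ranked list."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Mean of the per-topic average precision over all judged topics."""
    return sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)


def pick_c(candidates: Iterable[float],
           retrieve: Callable[[float], Dict[str, List[str]]],
           qrels: Dict[str, Set[str]]) -> float:
    """Return the candidate c whose training run has the highest MAP."""
    return max(candidates, key=lambda c: mean_average_precision(retrieve(c), qrels))
```

Running `pick_c` once per topic-field combination (T, TD, TDN) on the training topics would yield one c value per combination.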

We have also used a query expansion mechanism in Terrier, which follows the idea of measuring divergence from randomness. In our experiments, we applied the Kullback-Leibler (KL) model for query expansion [4, 10]. It is one of the Terrier DFR-based term weighting models. Using the KL model, the weight of a term t in the top-ranked documents is given by:

$$w(t) = P_x \times \log_2 \frac{P_x}{P_c}$$

where

$$P_x = \frac{tfx}{lx} \quad \text{and} \quad P_c = \frac{F}{token_c}$$

- $tfx$ is the frequency of the query term in the top-ranked documents.
- $lx$ is the sum of the lengths of the top-ranked documents.
- $F$ is the term frequency of the query term in the whole collection.
- $token_c$ is the total number of tokens in the whole collection.
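The KL term weight can be computed as follows. The sketch transcribes the formula above and ranks candidate expansion terms by it; the function names and the cut-off `k` are assumptions of the example, not Terrier settings.

```python
import math
from typing import Dict, List


def kl_weight(tf_x: int, l_x: int, F: int, token_c: int) -> float:
    """w(t) = P_x * log2(P_x / P_c), with P_x = tf_x / l_x and P_c = F / token_c."""
    if tf_x == 0:
        return 0.0  # a term absent from the top-ranked documents carries no weight
    p_x = tf_x / l_x
    p_c = F / token_c
    return p_x * math.log2(p_x / p_c)


def top_expansion_terms(top_tf: Dict[str, int], l_x: int,
                        coll_tf: Dict[str, int], token_c: int, k: int = 10) -> List[str]:
    """Rank the terms of the top-ranked documents by KL weight and keep the best k."""
    scored = {t: kl_weight(tf, l_x, coll_tf[t], token_c) for t, tf in top_tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

A term that is exactly as frequent in the top-ranked documents as in the whole collection gets weight zero, so only terms that are unusually frequent in the feedback documents are promoted.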

4 Experimental Results

4.1 Submitted runs

In the rest of the paper we focus only on the English CL-SR collection.

Comparison of systems and query expansion methods

Table 3 presents results for the best weighting schemes: for SMART we chose lnn.ntn and for Terrier we chose the In(exp)C2 weighting model, because they achieved the best results on the training data. We present results with and without relevance feedback.

According to Table 3, we note that:
• Relevance feedback helps to improve the retrieval results in Terrier for TDN, TD, and T for the training data; the improvement was high for TD and T, but not for TDN. For the test data there is a small improvement.
• NSP relevance feedback with SMART does not help to improve the retrieval for the training data (except for TDN), but it helps for the test data (small improvement).
• SMART results are better than Terrier results for the test data, but not for the training data.

In order to find the best ASR transcripts to use for indexing the segments, we compared the retrieval results when using the ASR transcripts from the years 2003, 2004, and 2006 or combinations of them. We also wanted to find out whether adding the automatic keywords helps to improve the retrieval results. The results of the experiments using Terrier and SMART are shown in Table 4 and Table 5, respectively.

We note from the experimental results that:
• Using Terrier, the best field is ASRTEXT2006B, which contains 7377 transcripts produced by the ASR system in 2006 and 727 transcripts produced by the ASR system in 2004; the improvement over using only the ASRTEXT2004A field is very small. On the other hand, the best ASR field using SMART is ASRTEXT2004A.
• Combining two ASRTEXT fields does not help to improve the retrieval.
• Using Terrier, adding the automatic keywords to ASRTEXT2004A improved the retrieval for the training data but not for the test data. For SMART it helps for both the training and the test data.
• In general, adding the automatic keywords helps. Adding them to ASRTEXT2003A or ASRTEXT2006B improved the retrieval results for the training and test data.
• For the required submission run (English TD), the maximum MAP score on the training data was obtained by the combination of ASRTEXT2004A and ASRTEXT2006A plus autokeywords, using Terrier (0.0952) or SMART (0.0932); on the test data, the combination of ASRTEXT2004A and autokeywords using SMART obtained the highest value, 0.0725, which is higher than the value we report in Table 1 for the submitted run.

Segment fields compared in Tables 4 and 5:
• ASRTEXT2003A
• ASRTEXT2004A
• ASRTEXT2006A
• ASRTEXT2006B
• ASRTEXT2003A + ASRTEXT2004A
• ASRTEXT2004A + ASRTEXT2006A
• ASRTEXT2004A + ASRTEXT2006B
• ASRTEXT2003A + AUTOKEYWORD2004A1, A2
• ASRTEXT2004A + AUTOKEYWORD2004A1, A2
• ASRTEXT2006B + AUTOKEYWORD2004A1, A2
• ASRTEXT2004A + ASRTEXT2006A + AUTOKEYWORD2004A1, A2
• ASRTEXT2004A + ASRTEXT2006B + AUTOKEYWORD2004A1, A2

4.4 Cross-language experiments

Table 6 presents results for the combined translation produced by the seven online MT tools, from French, Spanish, and German into English, for comparison with monolingual English experiments (the first line in the table). All the results in the table are from SMART using the lnn.ntn weighting scheme.

Since the result of combined translation for each language was better than when using individual translations from each MT tool on the CLEF 2005 CL-SR data [4], we used combined translations in our experiments.

4.5 Manual summaries and keywords

Conclusion

We experimented with two different systems, Terrier and SMART, with various weighting schemes for indexing the document and query terms. We proposed a new approach for query expansion that uses collocations with high log-likelihood ratios. Used with SMART, the method obtained a small improvement on test data (probably not significant). The KL relevance feedback method produced only small improvements with Terrier on test data. So, query expansion methods do not seem to help for this collection.

The improvements of mean word error rates in the ASR transcripts (of ASRTEXT2006A relative to ASRTEXT2004A) did not improve the retrieval results. Also, combining different ASR transcripts (with different error rates) did not seem to help.

For some experiments Terrier was better than SMART, for others it was not; therefore we cannot clearly prefer one IR system over the other for this collection.

The idea of using multiple translations proved to be good. More variety in the translations would be beneficial. The online MT systems that we used are rule-based systems. Adding translations by statistical MT tools might help, since they could produce radically different translations.

On the manual data, the best MAP score we obtained is around 29% for the English test topics. On automatically transcribed data, the best result is around 7.6% MAP. Since the improvement in the ASR word error rate does not improve the retrieval results, as shown by the experiments in Section 4.3, we think that the gap relative to the manual summaries arises because the summaries use different words to represent the content of the segments. In future work we plan to investigate methods of removing or correcting some of the speech recognition errors in the ASR contents and to use speech lattices for indexing.

References

1. G. Amati and C. J. van Rijsbergen: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS), 20(4):357-389, October 2002.

2. C. Buckley, G. Salton, and J. Allan: Automatic retrieval with locality information using SMART. In Proceedings of the First Text REtrieval Conference (TREC-1), pages 59-72. NIST Special Publication 500-207, March 1993.

3. C. Carpineto, R. de Mori, G. Romano, and B. Bigi: An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems (TOIS), 19(1):1-27, January 2001.

4. D. Inkpen, M. Alzghool, and A. Islam: Using various indexing schemes and multiple translations in the CL-SR task at CLEF 2005. In Proceedings of CLEF 2005, Lecture Notes in Computer Science 4022, Springer-Verlag, 2006.

5. D. W. Oard, D. Soergel, D. Doermann, X. Huang, G. C. Murray, J. Wang, B. Ramabhadran, M. Franz, and S. Gustman: Building an Information Retrieval Test Collection for Spontaneous Conversational Speech. In Proceedings of SIGIR, 2004.

6. I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson: Terrier Information Retrieval Platform. In Proceedings of the 27th European Conference on Information Retrieval (ECIR 05), 2005. http://ir.dcs.gla.ac.uk/wiki/Terrier

7. T. Pedersen and S. Banerjee: The design, implementation and use of the ngram statistics package. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, 2003.

8. G. Salton: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, 1989.

9. G. Salton and C. Buckley: Term-weighting approaches in automatic retrieval. Information Processing and Management, 24(5):513-523, 1988.

10. R. W. White, D. W. Oard, G. J. F. Jones, D. Soergel, and X. Huang: Overview of the CLEF-2005 Cross-Language Speech Retrieval Track. In Proceedings of CLEF 2005, Lecture Notes in Computer Science 4022, Springer-Verlag, 2006.