
Model Fusion Experiments for the Cross Language Speech Retrieval Task at CLEF 2007

Muath Alzghool

alzghool@site.uottawa.ca

Diana Inkpen

diana@site.uottawa.ca

School of Information Technology and Engineering, University of Ottawa

This paper presents the participation of the University of Ottawa group in the Cross-Language Speech Retrieval (CL-SR) task at CLEF 2007. We present the results of the submitted runs for the English collection. We used two Information Retrieval systems in our experiments, SMART and Terrier, with two query expansion techniques: one based on a thesaurus and one based on blind relevance feedback. We propose two novel data fusion methods for merging the results of several models (retrieval schemes available in SMART and Terrier). Our experiments show that combining query expansion and data fusion improves retrieval performance. We also present cross-language experiments, in which the queries are automatically translated by combining the results of several online machine translation tools. Experiments on indexing the manual summaries and keywords gave the best retrieval results.

Keywords: Data Fusion, Retrieval Models, Query Expansion

1 Introduction

This paper presents the third participation of the University of Ottawa group in the Cross-Language Speech Retrieval (CL-SR) track at CLEF 2007. We present our systems, followed by results for the submitted runs and for many additional runs on the English collection. We experimented with many possible weighting schemes for indexing the documents and the queries, and with several query expansion techniques. Several researchers have explored the idea of combining the results of different retrieval strategies, different document representations, and different query representations. The motivation is that each technique retrieves a different set of relevant documents, so combining the results could produce a better result than any individual technique. We propose new data fusion techniques for combining the results of different Information Retrieval (IR) schemes. We applied these techniques in monolingual settings and in cross-language settings, where the queries are automatically translated from French and Spanish into English by combining the results of several online machine translation (MT) tools. Finally, we present our best results, obtained when manual summaries and manual keywords were indexed.

2 System Description

The University of Ottawa Cross-Language Information Retrieval systems were built with off-the-shelf components. For the retrieval part, the SMART [ 3, 11 ] IR system and the Terrier [ 2, 10 ] IR system were tested with many different weighting schemes for indexing the collection and the queries.

SMART was originally developed at Cornell University in the 1960s. SMART is based on the vector space model of information retrieval. We used the following weighting schemes: nnn.ntn, ntn.ntn, lnn.ntn, ann.ntn, ltn.ntn, atn.ntn, ntn.nnn, nnc.ntc, ntc.ntc, ntc.nnc, lnc.ntc, anc.ntc, ltc.ntc, and atc.ntc [3, 11]. The lnn.ntn scheme performed very well in the CLEF CL-SR tracks in 2005 and 2006 [6, 1].
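As an illustration of what a scheme name such as lnn.ntn encodes, the sketch below follows the usual SMART convention, in which the first triple describes the document weights and the second the query weights (term frequency, collection frequency, normalization). It is a minimal sketch, not the SMART implementation: the function names, the toy documents, and the use of natural logarithms are assumptions made for this example.

```python
import math
from collections import Counter


def doc_weights_lnn(tokens):
    """Document side 'lnn': 1 + ln(tf), no idf factor, no normalization."""
    return {t: 1.0 + math.log(tf) for t, tf in Counter(tokens).items()}


def query_weights_ntn(tokens, doc_freq, num_docs):
    """Query side 'ntn': raw tf times idf = ln(N / df), no normalization."""
    weights = {}
    for t, tf in Counter(tokens).items():
        df = doc_freq.get(t, 0)
        if df > 0:  # terms unseen in the collection contribute nothing
            weights[t] = tf * math.log(num_docs / df)
    return weights


def similarity(doc_w, query_w):
    """Inner product over the terms shared by document and query."""
    return sum(w * doc_w[t] for t, w in query_w.items() if t in doc_w)


# Hypothetical two-document collection, for illustration only.
docs = {
    "d1": ["death", "march", "march", "camp"],
    "d2": ["camp", "liberation"],
}
doc_freq = Counter(t for toks in docs.values() for t in set(toks))
query_w = query_weights_ntn(["death", "march"], doc_freq, len(docs))
scores = {d: similarity(doc_weights_lnn(toks), query_w) for d, toks in docs.items()}
```

In this toy collection only d1 shares terms with the query, so it is the only document with a non-zero score.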

Terrier was originally developed at the University of Glasgow. It is based on Divergence from Randomness (DFR) models, in which IR is seen as a probabilistic process [2, 10]. We experimented with the In(exp)C2 weighting model, one of Terrier's DFR-based document weighting models.

For translating the queries from French and Spanish into English, several free online machine translation tools were used. The idea behind using multiple translations is that they might provide more variety of words and phrases, therefore improving the retrieval performance. Seven online MT systems [ 6 ] were used for translating from Spanish and from French into English. We combined the outputs of the MT systems by simply concatenating all the translations. All seven translations of a title made the title of the translated query; the same was done for the description and narrative fields. We used the combined topics for all the cross-language experiments reported in this paper.
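The combination step described above amounts to a field-by-field concatenation. The following is a minimal sketch under that reading; the dictionary layout, the function name `combine_translations`, and the sample translations are hypothetical and only serve to show the idea.

```python
FIELDS = ("title", "description", "narrative")


def combine_translations(translations):
    """Concatenate, field by field, the outputs of several MT systems.

    `translations` is a list of dicts, one per MT system, each mapping a
    topic field name to that system's English translation of the field.
    """
    combined = {}
    for field in FIELDS:
        parts = [t[field].strip() for t in translations if t.get(field)]
        combined[field] = " ".join(parts)
    return combined


# Hypothetical outputs of two MT systems for one topic (illustration only).
mt_outputs = [
    {"title": "death marches", "description": "stories about forced marches", "narrative": ""},
    {"title": "marches of death", "description": "accounts of forced marches", "narrative": ""},
]
combined_topic = combine_translations(mt_outputs)
```

Because the translations are simply concatenated, words on which the MT systems agree are repeated in the combined topic and therefore receive a higher term frequency in the query.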

We used two query expansion methods. The first one is based on the Shoah Visual History Foundation thesaurus provided with the MALACH collection. Our method adds two items and their alternatives (synonyms) from the thesaurus, based on the similarity between the thesaurus terms and the title field of each topic. More specifically, to select two items from the thesaurus, we used SMART with the title of each topic as the query and the thesaurus terms as documents, using the weighting scheme lnn.ntn. After computing the similarity, the top two thesaurus terms were added to the topic, and all the alternative terms for these two terms were also added. For example, in topic 3005 the title is "Death marches", and the most similar terms from the thesaurus are "death marches" and "deaths during forced marches"; the alternative terms for these terms are "death march" and "Todesmärsche".

Table 1 shows two entries from the thesaurus. Each entry contains six types of fields. Four are descriptive: name (a unique numeric code for the entry), label (a phrase or word that represents the entry), alt-label (an alternative phrase or synonym for the entry), and usage (the usage or definition of the entry). The other two are relations, is-a and of-type, which contain the numeric code of the entry involved in the relation.

The second query expansion method extracts the most informative terms from the top-returned documents and uses them as the expanded query terms. In this expansion process, 12 terms from the top 15 returned documents were added to the topic, based on the Bose-Einstein 1 model (Bo1) [4, 10]. We placed a restriction on the new terms: their document frequency must be less than the maximum document frequency of the terms in the title of the topic. The aim of this restriction is to avoid adding more general terms to the topic. Any term that satisfies this restriction becomes part of the new topic.
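The document-frequency restriction on feedback terms can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the candidate scores are assumed to come from a Bo1 implementation such as Terrier's, the function name and the sample numbers are hypothetical, and applying the restriction after taking the top 12 candidates is one possible reading of the description.

```python
def select_expansion_terms(candidates, title_terms, doc_freq, max_terms=12):
    """Pick feedback terms that are less frequent than the title terms.

    candidates  -- list of (term, score) pairs from the feedback model
    title_terms -- terms in the topic title
    doc_freq    -- term -> document frequency in the collection
    """
    title_dfs = [doc_freq[t] for t in title_terms if t in doc_freq]
    if not title_dfs:
        return []  # no reference frequency, so nothing can be admitted
    df_limit = max(title_dfs)
    # Assumption: take the top-scoring candidates first, then filter them.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:max_terms]
    return [term for term, _ in ranked
            if term in doc_freq and doc_freq[term] < df_limit]


# Hypothetical document frequencies and feedback scores (illustration only).
doc_freq = {"death": 900, "marches": 300, "people": 5000, "evacuation": 120, "column": 250}
candidates = [("people", 2.1), ("evacuation", 1.7), ("column", 0.9)]
expansion = select_expansion_terms(candidates, ["death", "marches"], doc_freq)
```

Here the general term "people" is rejected because it occurs in more documents than the most frequent title term, while "evacuation" and "column" are kept.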
We also upweighted the title terms, giving them five times the weight of the other terms in the topic.

For the data fusion part, we propose two methods that use the sum of normalized weighted similarity scores of 15 different IR schemes, as shown in the following formulas:

$$\mathrm{Fusion1} = \sum_{i \in \text{IR schemes}} \left[ W_r^4(i) + W_{MAP}^3(i) \right] \cdot \mathrm{NormSim}_i \tag{1}$$

$$\mathrm{Fusion2} = \sum_{i \in \text{IR schemes}} W_r^4(i) \cdot W_{MAP}^3(i) \cdot \mathrm{NormSim}_i \tag{2}$$

where $W_r(i)$ and $W_{MAP}(i)$ are experimentally determined weights based on the recall (the number of relevant documents retrieved) and precision (MAP score) values for each IR scheme, computed on the training data. For example, suppose that two retrieval runs r1 and r2 give 0.3 and 0.2 (respectively) as MAP scores on the training data. We normalize these scores by dividing them by the maximum MAP value, so $W_{MAP}(r1)$ is 1 and $W_{MAP}(r2)$ is approximately 0.67. We then raise these weights to the power 3, so that one weight stays 1 and the other decreases. We chose power 3 for the MAP score and power 4 for recall because the MAP is more important than the recall. The intent is that multiplying the similarity values by the weights and summing over all the runs improves the performance of the combined run.

$\mathrm{NormSim}_i$ is the normalized similarity for each IR scheme. We normalized by dividing each similarity by the maximum similarity in the run. The normalization is necessary because different weighting schemes generate different ranges of similarity values, so a normalization method should be applied to each run.

Our method differs from the work of Fox and Shaw in 1994 [5] and Lee in 1995 [7], who combined the results by summing the similarity scores without giving any weight to each run. In our work we weight each run according to its precision and recall on the training data.
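The two fusion formulas can be sketched in a few lines of code. This is a minimal illustration rather than the implementation used for the submitted runs: the function names and sample data are hypothetical, and normalizing the recall counts by their maximum (by analogy with the MAP example) is an assumption.

```python
def normalize_run(run):
    """Divide every similarity in a run by the run's maximum similarity."""
    top = max(run.values())
    if top <= 0:
        return {doc: 0.0 for doc in run}
    return {doc: sim / top for doc, sim in run.items()}


def scheme_weights(train_recall, train_map, recall_power=4, map_power=3):
    """Max-normalize the training recall and MAP values, then raise to a power."""
    max_r = max(train_recall.values())
    max_m = max(train_map.values())
    w_r = {s: (v / max_r) ** recall_power for s, v in train_recall.items()}
    w_map = {s: (v / max_m) ** map_power for s, v in train_map.items()}
    return w_r, w_map


def fuse(runs, w_r, w_map, method="fusion1"):
    """Weighted sum of normalized similarities over all IR schemes."""
    fused = {}
    for scheme, run in runs.items():
        if method == "fusion1":
            weight = w_r[scheme] + w_map[scheme]   # formula (1)
        elif method == "fusion2":
            weight = w_r[scheme] * w_map[scheme]   # formula (2)
        else:
            raise ValueError("method must be 'fusion1' or 'fusion2'")
        for doc, sim in normalize_run(run).items():
            fused[doc] = fused.get(doc, 0.0) + weight * sim
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical training statistics and retrieval runs (illustration only).
train_map = {"r1": 0.3, "r2": 0.2}
train_recall = {"r1": 1500, "r2": 1200}   # relevant documents retrieved
runs = {
    "r1": {"d1": 10.0, "d2": 5.0},
    "r2": {"d2": 0.8, "d3": 0.4},
}
w_r, w_map = scheme_weights(train_recall, train_map)
ranking1 = [doc for doc, _ in fuse(runs, w_r, w_map, "fusion1")]
ranking2 = [doc for doc, _ in fuse(runs, w_r, w_map, "fusion2")]
```

A document missing from a run simply contributes nothing for that scheme, so documents retrieved by several well-performing schemes rise in the fused ranking.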

3 Experimental Results

3.1 Submitted Runs

3.2 Comparison of Systems and Query Expansion Methods

To compare the query expansion methods against a base run without query expansion, we selected the base run with the weighting scheme lnn.ntn, the topic fields title and description, and the document fields ASRTEXT2004A, AUTOKEYWORD2004A1, and AUTOKEYWORD2004A2. We used the two query expansion techniques, one based on the thesaurus and the other on blind relevance feedback (denoted Bo1 in Table 3). We present the results (MAP scores) with and without query expansion, and with the combination of both query expansion methods, on the test and training topics. According to Table 3, both methods improve the retrieval results, but the improvement is not significant on either the training or the test data. The combination of the two methods improves the MAP score on the training data (not significantly), but not on the test data.

3.3 Experiments using Data Fusion

We applied the data fusion methods described in Section 2 to 14 runs produced by SMART and one run produced by Terrier; all runs were produced using a combination of the two query expansion methods described in Section 2. Performance results for each single run and for the fused runs are presented in Table 4, in which the % change is given with respect to the run providing the better effectiveness in each combination on the training data. The Manual English column represents the results when only the manual keywords and the manual summaries were used for indexing the documents, using English topics. The Auto-English column represents the results when the automatic fields of the documents (ASRTEXT2004A, AUTOKEYWORD2004A1, and AUTOKEYWORD2004A2) were indexed, using English topics. For the cross-language experiments, the results are given in the columns Auto-French and Auto-Spanish.

Data fusion helps to improve the performance (MAP score) on the test data. The best improvement using data fusion (Fusion1) was on the French cross-language experiments, with 21.7%, which is statistically significant, while on the monolingual experiments the improvement was only 6.5%, which is not significant. There is also an improvement in the number of relevant documents retrieved (recall) for all the experiments except Auto-French on the test data, as shown in Table 5. We computed these improvements relative to the results of the best single-model run, as measured on the training data. This supports our claim that data fusion improves recall by bringing in documents that were not retrieved by every run. On the training data, the Fusion2 method gives better results than Fusion1 in all cases except Manual English, but on the test data Fusion1 is better than Fusion2. In general, data fusion seems to help: weighting schemes that obtain good results on the training data do not always perform well on the test data, but combining models allows the best-performing weighting schemes to be taken into consideration.

The retrieval results for the translations from French were very close to the monolingual English results, especially on the training data, but on the test data they were significantly worse. For Spanish, the results were significantly worse than the monolingual ones on the training data, but not on the test data.

Experiments on manual keywords and manual summaries showed large improvements: the MAP score jumped from 0.0855 to 0.2761 on the test data.

4 Conclusion

We experimented with two different systems, Terrier and SMART, combining various weighting schemes for indexing the document and query terms. We proposed two approaches for query expansion, one based on the thesaurus and another based on blind relevance feedback. The combination of the query expansion methods obtained a small improvement on the training and test data (not statistically significant according to a Wilcoxon signed-rank test).

Our focus this year was on data fusion: we proposed two methods to combine different weighting schemes from different systems, based on a weighted summation of normalized similarity measures. The weight for each scheme was based on its relative precision and recall on the training data. Data fusion improves retrieval significantly for some experiments (Auto-French) and not significantly for others (Manual English).

The idea of using multiple translations proved to be good. More variety in the translations would be beneficial. The online MT systems that we used are rule-based systems. Adding translations by statistical MT tools might help, since they could produce radically different translations.

Combining query expansion methods and data fusion improved retrieval significantly compared to the median and average of all required runs submitted by the teams that participated in the track.

In future work we plan to investigate more methods of data fusion, to remove or correct some of the speech recognition errors in the ASR content words, and to use speech lattices for indexing.

References

1. Alzghool M. and Inkpen D.: Experiments for the Cross Language Speech Retrieval Task at CLEF 2006. In Proceedings of CLEF 2006, Lecture Notes in Computer Science 4730, Springer-Verlag, 2007, pp. 778-785.

2. Amati, G. and van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, Vol. 20, No. 4, October (2002) 357-389.

3. Buckley, Salton, and Allan J.: Automatic retrieval with locality information using SMART. In Text REtrieval Conference (TREC-1), March (1993) 59-72.

4. Carpineto, de Mori, Romano G., and Bigi: An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems (TOIS), Vol. 19, No. 1, January (2001) 1-27.

5. Fox, E.A. and Shaw, J.A. (1994). Combination of multiple searches. Proceedings of the Third Text REtrieval Conference (TREC-3). National Institute of Standards and Technology Special Publication 500-215.

6. Inkpen D., Alzghool M., and Islam A.: Using various indexing schemes and multiple translations in the CL-SR task at CLEF 2005. In Accessing Multilingual Information Repositories, 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, 21-23 September, (2005).

7. Lee, J.H. (1995). Combining multiple evidence from different properties of weighting schemes. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 180-188.

8. Oard D.W., Soergel, Doermann, Huang, Murray G.C., Wang, Ramabhadran, Franz, and Gustman: Building an Information Retrieval Test Collection for Spontaneous Conversational Speech. In Proceedings of SIGIR, (2004).

9. Oard D.W., J., Jones G.J.F., Pecina, et al.: Overview of the CLEF 2007 cross-language speech retrieval track. In Working Notes of the CLEF-2007 Evaluation, Budapest, Hungary, (2007).

10. Ounis, Amati, Plachouras, He, Macdonald, and Johnson: Terrier Information Retrieval Platform. In 27th European Conference on Information Retrieval (ECIR 05), (2005). http://ir.dcs.gla.ac.uk/wiki/Terrier

11. Salton and Buckley: Term-weighting approaches in automatic retrieval. Information Processing and Management, Vol. 24, No. 5, (1988) 513-523.