=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-CLSR-AlzghoolEt2007
|storemode=property
|title=Model Fusion Experiments for the Cross Language Speech Retrieval Task at CLEF 2007
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-CLSR-AlzghoolEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/AlzghoolI07a
}}
==Model Fusion Experiments for the Cross Language Speech Retrieval Task at CLEF 2007==
Muath Alzghool and Diana Inkpen
School of Information Technology and Engineering, University of Ottawa
{alzghool, diana}@site.uottawa.ca

Abstract

This paper presents the participation of the University of Ottawa group in the Cross-Language Speech Retrieval (CL-SR) task at CLEF 2007. We present the results of the submitted runs for the English collection. We used two Information Retrieval systems in our experiments, SMART and Terrier, with two query expansion techniques: one based on a thesaurus and one based on blind relevance feedback. We propose two novel data fusion methods for merging the results of several models (retrieval schemes available in SMART and Terrier). Our experiments showed that the combination of query expansion methods and data fusion methods helps to improve retrieval performance. We also present cross-language experiments, where the queries are automatically translated by combining the results of several online machine translation tools. Experiments on indexing the manual summaries and keywords gave the best retrieval results.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms
Measurement, Performance, Experimentation

Keywords
Data Fusion, Retrieval Models, Query Expansion

1 Introduction

This paper presents the third participation of the University of Ottawa group in the Cross-Language Speech Retrieval (CL-SR) track at CLEF 2007. We describe our systems, followed by the results of the submitted runs for the English collection, together with results for many additional runs on the same collection. We experimented with many possible weighting schemes for indexing the documents and the queries, and with several query expansion techniques.

Several researchers have explored the idea of combining the results of different retrieval strategies, different document representations, and different query representations; the motivation is that each technique retrieves a different set of relevant documents, so combining the results can produce a better result than any of the individual techniques. We propose new data fusion techniques for combining the results of different Information Retrieval (IR) schemes. We applied our data fusion techniques to monolingual settings and to cross-language settings, where the queries are automatically translated from French and Spanish into English by combining the results of several online machine translation (MT) tools. At the end we present the best results, obtained when manual summaries and manual keywords were indexed.

2 System Description

The University of Ottawa Cross-Language Information Retrieval systems were built with off-the-shelf components. For the retrieval part, the SMART [3, 11] IR system and the Terrier [2, 10] IR system were tested with many different weighting schemes for indexing the collection and the queries.

SMART was originally developed at Cornell University in the 1960s and is based on the vector space model of information retrieval. We used the nnn.ntn, ntn.ntn, lnn.ntn, ann.ntn, ltn.ntn, atn.ntn, ntn.nnn, nnc.ntc, ntc.ntc, ntc.nnc, lnc.ntc, anc.ntc, ltc.ntc, and atc.ntc weighting schemes [3, 11]; lnn.ntn performed very well at CLEF CL-SR 2005 and 2006 [6, 1].
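In SMART's ddd.qqq notation, the first triple names the document weighting (term-frequency, collection-frequency, and normalization components) and the second names the query weighting. As an illustration only, the sketch below scores a document against a query under the lnn.ntn scheme, assuming the usual SMART conventions (l = 1 + ln(tf), n = no idf and no normalization, t = idf = ln(N/df)); the toy corpus statistics and helper names are our own, not taken from the paper.

```python
import math
from collections import Counter

def lnn_weight(tf):
    # "l": logarithmic term frequency; "n", "n": no idf, no length normalization
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def ntn_weight(tf, df, n_docs):
    # "n": raw term frequency; "t": idf = ln(N/df); "n": no normalization
    return tf * math.log(n_docs / df) if tf > 0 and df > 0 else 0.0

def lnn_ntn_score(doc_tokens, query_tokens, doc_freq, n_docs):
    """Inner product of lnn document weights and ntn query weights."""
    doc_tf, query_tf = Counter(doc_tokens), Counter(query_tokens)
    return sum(
        lnn_weight(doc_tf[term]) * ntn_weight(qtf, doc_freq.get(term, 0), n_docs)
        for term, qtf in query_tf.items()
    )

# Toy example (illustrative numbers, not collection statistics from the paper):
doc = "death march survivors describe the death march".split()
query = "death marches".split()
print(lnn_ntn_score(doc, query, doc_freq={"death": 120, "marches": 15}, n_docs=8000))
```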
Terrier, originally developed at the University of Glasgow, is based on Divergence from Randomness (DFR) models, where IR is seen as a probabilistic process [2, 10]. We experimented with the In(exp)C2 weighting model, one of Terrier's DFR-based document weighting models.

For translating the queries from French and Spanish into English, several free online machine translation tools were used. The idea behind using multiple translations is that they might provide a greater variety of words and phrases, thereby improving the retrieval performance. Seven online MT systems [6] were used for translating from Spanish and from French into English. We combined the outputs of the MT systems by simply concatenating all the translations: all seven translations of a title made up the title of the translated query, and the same was done for the description and narrative fields. We used the combined topics for all the cross-language experiments reported in this paper.

We used two query expansion methods. The first one is based on the Shoah Visual History Foundation thesaurus provided with the MALACH collection; our method adds two items and their alternatives (synonyms) from the thesaurus, based on the similarity between the thesaurus terms and the title field of each topic. More specifically, to select two items from the thesaurus, we used SMART with the title of each topic as the query and the thesaurus terms as documents, using the weighting scheme lnn.ntn. After computing the similarity, the top two thesaurus terms were added to the topic; for these terms, all the alternative terms were also added to the topic. For example, in topic 3005 the title is "Death marches", and the most similar terms from the thesaurus are "death marches" and "deaths during forced marches"; the alternative terms for these terms are "death march" and "Todesmärsche". Table 1 shows two entries from the thesaurus. Each entry contains six types of fields: name (a unique numeric code for each entry), label (a phrase or word which represents the entry), alt-label (the alternative phrase or synonym for the entry), and usage (the usage or definition of the entry); there are two more relations in the thesaurus, is-a and of-type, which contain the numeric code of the entry involved in the relation.

The second query expansion method extracts the most informative terms from the top-returned documents as the expanded query terms. In this expansion process, 12 terms from the top 15 returned documents were added to the topic, based on the Bose-Einstein 1 model (Bo1) [4, 10]. We put a restriction on the new terms: their document frequency must be less than the maximum document frequency in the title of the topic. The aim of this restriction is to avoid adding overly general terms to the topic; any term that satisfies it becomes part of the new topic. We also weighted the title terms five times higher than the other terms in the topic (a small illustrative sketch of both expansion steps is given after Table 1 below).

Table 1. The top two entries from the thesaurus that are similar to the topic title "Death marches".

name | label | alt-label | is-a | of-type | usage
9125 | death marches | death march; Todesmärsche | 15445 | 5289 | Forced marches of prisoners over long distances, under heavy guard and extremely harsh conditions. (The term was probably coined by concentration camp prisoners.)
15460 | deaths during forced marches | (not listed) | 15445 | 4109 | The daily experience of individuals with death during forced marches that was not the result of executions, punishments, arbitrary killings or suicides.
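To make the two expansion steps concrete, here is a minimal sketch, not the actual SMART/Terrier pipeline, of (a) selecting the two most similar thesaurus entries for a topic title and appending their alternative labels, and (b) the document-frequency filter applied to candidate Bo1 feedback terms. The `score` function, the dictionary layout, and all names are illustrative assumptions.

```python
def expand_with_thesaurus(title_tokens, thesaurus, score, top_k=2):
    """thesaurus: list of dicts with 'label' and 'alt_labels' fields (as in Table 1).
    score(title_tokens, label_tokens) stands in for the lnn.ntn similarity used in
    the paper; any reasonable lexical similarity works for this sketch."""
    ranked = sorted(thesaurus,
                    key=lambda entry: score(title_tokens, entry["label"].split()),
                    reverse=True)
    expansion = []
    for entry in ranked[:top_k]:
        expansion.append(entry["label"])
        expansion.extend(entry.get("alt_labels", []))   # add the synonyms as well
    return expansion

def filter_feedback_terms(candidate_terms, doc_freq, title_tokens):
    """Keep only Bo1 candidate terms whose document frequency is below the maximum
    document frequency of the title terms (the restriction described above)."""
    max_title_df = max(doc_freq.get(t, 0) for t in title_tokens)
    return [t for t in candidate_terms if doc_freq.get(t, 0) < max_title_df]

# For topic 3005 ("Death marches") the thesaurus step would contribute terms such as
# "death marches", "deaths during forced marches", "death march", and "Todesmärsche".
```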
For the data fusion part, we proposed two methods that use the sum of normalized weighted similarity scores of 15 different IR schemes, as shown in the following formulas:

Fusion1 = Σ_{i ∈ IR schemes} [W_r(i)^4 + W_MAP(i)^3] · NormSim_i        (1)

Fusion2 = Σ_{i ∈ IR schemes} W_r(i)^4 · W_MAP(i)^3 · NormSim_i        (2)

where W_r(i) and W_MAP(i) are experimentally determined weights based on the recall (the number of relevant documents retrieved) and precision (MAP score) values of each IR scheme, computed on the training data. For example, suppose that two retrieval runs r1 and r2 obtain MAP scores of 0.3 and 0.2, respectively, on the training data; we normalize these scores by dividing them by the maximum MAP value, so W_MAP(r1) is 1 and W_MAP(r2) is 0.66. We then raise these weights to the power 3, so that the best weight stays 1 and the other one decreases; we chose power 3 for the MAP score and power 4 for the recall, because MAP is more important than recall. We expect that when we multiply the similarity values by these weights and take the summation over all the runs, the performance of the combined run will improve. NormSim_i is the normalized similarity for each IR scheme; we normalized by dividing each similarity by the maximum similarity in the run. The normalization is necessary because different weighting schemes generate different ranges of similarity values, so a normalization method should be applied to each run. Our method differs from the work of Fox and Shaw in 1994 [5] and Lee in 1995 [7]: they combined the results by summing the similarity scores without assigning a weight to each run, whereas we weight each run according to its precision and recall on the training data.
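The following sketch illustrates one possible implementation of the two fusion formulas, assuming each run is represented as a mapping from document ids to similarity scores and that the recall and MAP values have already been measured on the training data; the function and variable names are ours, not taken from the authors' code.

```python
from collections import defaultdict

def normalized_power_weights(values, power):
    """Divide each value by the maximum and raise it to the given power
    (power 4 for recall, power 3 for MAP, as in formulas (1) and (2))."""
    top = max(values.values())
    return {run: (v / top) ** power for run, v in values.items()}

def fuse(runs, recall, map_score, method="fusion1"):
    """runs: {run_name: {doc_id: similarity}}; recall, map_score: training-data values."""
    w_r = normalized_power_weights(recall, 4)
    w_map = normalized_power_weights(map_score, 3)
    fused = defaultdict(float)
    for name, scores in runs.items():
        weight = w_r[name] + w_map[name] if method == "fusion1" else w_r[name] * w_map[name]
        max_sim = max(scores.values())  # per-run normalization of similarity scores
        for doc_id, sim in scores.items():
            fused[doc_id] += weight * (sim / max_sim)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# The example from the text: MAP scores 0.3 and 0.2 normalize to 1.0 and 0.66,
# and cubing keeps the stronger run at 1.0 while shrinking the weaker one to about 0.3.
```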
3 Experimental Results

3.1 Submitted Runs

Table 2 shows the results of the submitted runs on the test data (33 queries). The evaluation measures we report are computed with the standard trec_eval script (version 8): MAP (Mean Average Precision) and Recall. The column named Fields indicates which fields of the topic were indexed: T for title only, TD for title + description, TDN for title + description + narrative. For each run we include an additional description of the experimental settings and of the document fields that were indexed; [8, 9] give more information about the training and test data. For the uoEnTDtManF1 and uoEnTDtQExF1 runs we used the Fusion1 formula for data fusion, and for uoEnTDtQExF2, uoFrTDtF2, and uoEsTDtF2 we used the Fusion2 formula. We used blind relevance feedback and query expansion from the thesaurus for the uoEnTDtManF1, uoEnTDtQExF1, and uoEnTDtQExF2 runs; we did not use any query expansion techniques for uoFrTDtF2 and uoEsTDtF2. Our required run, English TD, obtained a MAP score of 0.0855. Compared to the median and average of all runs submitted by all the teams that participated in the track (0.0673 and 0.0785) [9], our result was significantly better (based on a two-tailed Wilcoxon Signed-Rank Test for paired samples at p < 0.05 across the 33 evaluation topics), with relative improvements of 21% and 8%. There is a small improvement using Fusion1 (uoEnTDtQExF1) over Fusion2 (uoEnTDtQExF2), but it is not significant.

Table 2. Results of the five submitted runs, for topics in English, French, and Spanish. The required run (English, title + description) is in bold.

Run | MAP | Recall | Fields | Description
uoEnTDtManF1 | 0.2761 | 1832 | TD | English: Fusion 1, query expansion methods, fields: MANUALKEYWORD + SUMMARY
uoEnTDtQExF1 | 0.0855 | 1333 | TD | English: Fusion 1, query expansion methods, fields: ASRTEXT2004A + AUTOKEYWORD2004A1, A2
uoEnTDtQExF2 | 0.0841 | 1336 | TD | English: Fusion 2, query expansion methods, fields: ASRTEXT2004A + AUTOKEYWORD2004A1, A2
uoFrTDtF2 | 0.0603 | 1098 | TD | French: Fusion 2, fields: ASRTEXT2004A + AUTOKEYWORD2004A1, A2
uoEsTDtF2 | 0.0619 | 1171 | TD | Spanish: Fusion 2, fields: ASRTEXT2004A + AUTOKEYWORD2004A1, A2

3.2 Comparison of Systems and Query Expansion Methods

In order to compare the different methods of query expansion against a base run without query expansion, we selected as base run the weighting scheme lnn.ntn, with topic fields title and description, and document fields ASRTEXT2004A, AUTOKEYWORD2004A1, and AUTOKEYWORD2004A2. We used the two query expansion techniques, one based on the thesaurus and the other on blind relevance feedback (denoted Bo1 in Table 3). We present the results (MAP scores) with and without query expansion, and with the combination of both query expansion methods, on the test and training topics. According to Table 3, both methods help to improve the retrieval results, but the improvement is not significant on either the training or the test data; the combination of the two methods improves the MAP score on the training data (not significantly), but not on the test data.

Table 3. Results (MAP scores) for Terrier and SMART, with or without relevance feedback, for English topics (using the TD query fields).

# | System | Training | Test
1 | lnn.ntn | 0.0906 | 0.0725
2 | lnn.ntn + thesaurus | 0.0941 | 0.0730
3 | lnn.ntn + Bo1 | 0.0954 | 0.0811
4 | lnn.ntn + thesaurus + Bo1 | 0.0969 | 0.0799

3.3 Experiments using Data Fusion

We applied the data fusion methods described in Section 2 to 14 runs produced by SMART and one run produced by Terrier; all runs were produced using the combination of the two query expansion methods described in Section 2. Performance results for each single run and for the fused runs are presented in Table 4, in which the % change is given with respect to the run with the best effectiveness on the training data in each column. The Manual English column gives the results when only the manual keywords and the manual summaries were indexed, using English topics; the Auto-English column gives the results when the automatic fields (ASRTEXT2004A and AUTOKEYWORD2004A1, A2) were indexed, using English topics. For the cross-language experiments, the results are given in the Auto-French and Auto-Spanish columns.

Data fusion helps to improve the performance (MAP score) on the test data. The best improvement using data fusion (Fusion1) was on the French cross-language experiments, with 21.7%, which is statistically significant, while on the monolingual experiments the improvement was only 6.5%, which is not significant. There is also an improvement in the number of relevant documents retrieved (recall) for all the experiments except Auto-French on the test data, as shown in Table 5. We computed these improvements relative to the results of the best single-model run, as measured on the training data. This supports our claim that data fusion improves recall by bringing in new documents that were not retrieved by all the runs. On the training data, the Fusion2 method gives better results than Fusion1 in all cases except Manual English, but on the test data Fusion1 is better than Fusion2.
In general, data fusion seems to help, because the performance on the test data is not always good for weighting schemes that obtain good results on the training data, but combining models allows the best-performing weighting schemes to be taken into consideration. The retrieval results for the translations from French were very close to the monolingual English results, especially on the training data, but on the test data the difference was significantly worse. For Spanish, the difference was significantly worse on the training data, but not on the test data. Experiments on manual keywords and manual summaries showed large improvements: the MAP score jumped from 0.0855 to 0.2761 on the test data.

Table 4. Results (MAP scores) for 15 weighting schemes using SMART and Terrier (the In(exp)C2 model), and the results for the two fusion methods. In bold are the best scores for the 15 single runs on the training data and the corresponding results on the test data. Underlined are the results of the submitted runs. Each cell gives the Training / Test score.

Weighting scheme | Manual English | Auto-English | Auto-French | Auto-Spanish
nnc.ntc | 0.2546 / 0.2293 | 0.0888 / 0.0819 | 0.0792 / 0.0550 | 0.0593 / 0.0614
ntc.ntc | 0.2592 / 0.2332 | 0.0892 / 0.0794 | 0.0841 / 0.0519 | 0.0663 / 0.0545
lnc.ntc | 0.2710 / 0.2363 | 0.0898 / 0.0791 | 0.0858 / 0.0576 | 0.0652 / 0.0604
ntc.nnc | 0.2344 / 0.2172 | 0.0858 / 0.0769 | 0.0745 / 0.0466 | 0.0585 / 0.0620
anc.ntc | 0.2759 / 0.2343 | 0.0723 / 0.0623 | 0.0664 / 0.0376 | 0.0518 / 0.0398
ltc.ntc | 0.2639 / 0.2273 | 0.0794 / 0.0623 | 0.0754 / 0.0449 | 0.0596 / 0.0428
atc.ntc | 0.2606 / 0.2184 | 0.0592 / 0.0477 | 0.0525 / 0.0287 | 0.0437 / 0.0304
nnn.ntn | 0.2476 / 0.2228 | 0.0900 / 0.0852 | 0.0799 / 0.0503 | 0.0599 / 0.0610
ntn.ntn | 0.2738 / 0.2369 | 0.0933 / 0.0795 | 0.0843 / 0.0507 | 0.0691 / 0.0578
lnn.ntn | 0.2858 / 0.2450 | 0.0969 / 0.0799 | 0.0905 / 0.0566 | 0.0701 / 0.0589
ntn.nnn | 0.2476 / 0.2228 | 0.0900 / 0.0852 | 0.0799 / 0.0503 | 0.0599 / 0.0610
ann.ntn | 0.2903 / 0.2441 | 0.0750 / 0.0670 | 0.0743 / 0.0380 | 0.0570 / 0.0383
ltn.ntn | 0.2870 / 0.2435 | 0.0799 / 0.0655 | 0.0871 / 0.0522 | 0.0701 / 0.0501
atn.ntn | 0.2843 / 0.2364 | 0.0620 / 0.0546 | 0.0722 / 0.0347 | 0.0586 / 0.0355
In(exp)C2 | 0.3177 / 0.2737 | 0.0885 / 0.0744 | 0.0908 / 0.0487 | 0.0747 / 0.0614
Fusion 1 | 0.3208 / 0.2761 | 0.0969 / 0.0855 | 0.0912 / 0.0622 | 0.0731 / 0.0682
% change | 1.0% / 0.9% | 0.0% / 6.5% | 0.4% / 21.7% | -2.2% / 10.0%
Fusion 2 | 0.3182 / 0.2741 | 0.0975 / 0.0842 | 0.0942 / 0.0602 | 0.0752 / 0.0619
% change | 0.2% / 0.1% | 0.6% / 5.1% | 3.6% / 19.1% | 0.7% / 0.8%

4 Conclusion

We experimented with two different systems, SMART and Terrier, combining various weighting schemes for indexing the document and query terms. We proposed two approaches for query expansion, one based on the thesaurus and one based on blind relevance feedback. The combination of the query expansion methods obtained a small improvement on the training and test data (not statistically significant according to a Wilcoxon signed-rank test). Our focus this year was on data fusion: we proposed two methods that combine different weighting schemes from different systems, based on a weighted summation of normalized similarity measures, where the weight for each scheme is based on its relative precision and recall on the training data. Data fusion improves the retrieval significantly for some experiments (Auto-French) and not significantly for others (Manual English). The idea of using multiple translations proved to be good, and more variety in the translations would be beneficial. The online MT systems that we used are rule-based systems; adding translations produced by statistical MT tools might help, since they could produce radically different translations.
Combining query expansion methods and data fusion helped to improve the retrieval significantly compared to the median and average of all required runs submitted by all the teams that participated in the track. In future work we plan to investigate more methods of data fusion, to remove or correct some of the speech recognition errors in the ASR content words, and to use speech lattices for indexing.

Table 5. Results (number of relevant documents retrieved) for 15 weighting schemes using Terrier and SMART, and the results for the two fusion methods. In bold are the best scores for the 15 single runs on the training data and the corresponding results on the test data; underlined are the submitted runs. Each cell gives the Training / Test value.

Weighting scheme | Manual English | Auto-English | Auto-French | Auto-Spanish
nnc.ntc | 2371 / 1827 | 1726 / 1306 | 1687 / 1122 | 1562 / 1178
ntc.ntc | 2402 / 1857 | 1675 / 1278 | 1589 / 1074 | 1466 / 1155
lnc.ntc | 2402 / 1840 | 1649 / 1301 | 1628 / 1111 | 1532 / 1196
ntc.nnc | 2354 / 1810 | 1709 / 1287 | 1662 / 1121 | 1564 / 1182
anc.ntc | 2405 / 1858 | 1567 / 1192 | 1482 / 1036 | 1360 / 1074
ltc.ntc | 2401 / 1864 | 1571 / 1211 | 1455 / 1046 | 1384 / 1097
atc.ntc | 2387 / 1858 | 1435 / 1081 | 1361 / 945 | 1255 / 1011
nnn.ntn | 2370 / 1823 | 1740 / 1321 | 1748 / 1158 | 1643 / 1190
ntn.ntn | 2432 / 1863 | 1709 / 1314 | 1627 / 1093 | 1502 / 1174
lnn.ntn | 2414 / 1846 | 1681 / 1325 | 1652 / 1130 | 1546 / 1194
ntn.nnn | 2370 / 1823 | 1740 / 1321 | 1748 / 1158 | 1643 / 1190
ann.ntn | 2427 / 1859 | 1577 / 1198 | 1473 / 1027 | 1365 / 1060
ltn.ntn | 2433 / 1876 | 1582 / 1215 | 1478 / 1070 | 1408 / 1134
atn.ntn | 2442 / 1859 | 1455 / 1101 | 1390 / 975 | 1297 / 1037
In(exp)C2 | 2638 / 1823 | 1624 / 1286 | 1676 / 1061 | 1631 / 1172
Fusion 1 | 2645 / 1832 | 1745 / 1334 | 1759 / 1147 | 1645 / 1219
% change | 0.3% / 0.5% | 0.3% / 1.0% | 0.6% / -1.0% | 0.1% / 2.4%
Fusion 2 | 2647 / 1823 | 1727 / 1337 | 1736 / 1098 | 1631 / 1172
% change | 0.3% / 0.0% | 0.8% / 1.2% | -0.7% / -5.5% | -0.7% / -1.5%

References

1. Alzghool, M., Inkpen, D.: Experiments for the Cross Language Speech Retrieval Task at CLEF 2006. In: Proceedings of CLEF 2006, Lecture Notes in Computer Science 4730, Springer-Verlag (2007) 778-785.
2. Amati, G., van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, Vol. 20, No. 4 (2002) 357-389.
3. Buckley, C., Salton, G., Allan, J.: Automatic retrieval with locality information using SMART. In: Proceedings of the First Text REtrieval Conference (TREC-1) (1993) 59-72.
4. Carpineto, C., de Mori, R., Romano, G., Bigi, B.: An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems (TOIS), Vol. 19, No. 1 (2001) 1-27.
5. Fox, E.A., Shaw, J.A.: Combination of multiple searches. In: Proceedings of the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-215 (1994).
6. Inkpen, D., Alzghool, M., Islam, A.: Using various indexing schemes and multiple translations in the CL-SR task at CLEF 2005. In: Accessing Multilingual Information Repositories, 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria (2005).
7. Lee, J.H.: Combining multiple evidence from different properties of weighting schemes. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1995) 180-188.
8. Oard, D.W., Soergel, D., Doermann, D., Huang, X., Murray, G.C., Wang, J., Ramabhadran, B., Franz, M., Gustman, S.: Building an Information Retrieval Test Collection for Spontaneous Conversational Speech. In: Proceedings of SIGIR (2004).
9. Oard, D.W., Jones, G.J.F., Pecina, P., et al.: Overview of the CLEF 2007 Cross-Language Speech Retrieval Track. In: Working Notes of the CLEF 2007 Evaluation, Budapest, Hungary (2007).
10. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Johnson, D.: Terrier Information Retrieval Platform. In: Proceedings of the 27th European Conference on Information Retrieval (ECIR 2005) (2005). http://ir.dcs.gla.ac.uk/wiki/Terrier
11. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management, Vol. 24, No. 5 (1988) 513-523.