<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Pitt at CLEF05: Data Fusion for Spoken Document Retrieval</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daqing</forename><surname>He</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Information Sciences</orgName>
								<orgName type="institution">University of Pittsburgh IS Building</orgName>
								<address>
									<addrLine>135 N. Bellefield Avenue</addrLine>
									<postCode>15260</postCode>
									<settlement>Pittsburgh</settlement>
									<region>PA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jaewook</forename><surname>Ahn</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Information Sciences</orgName>
								<orgName type="institution">University of Pittsburgh IS Building</orgName>
								<address>
									<addrLine>135 N. Bellefield Avenue</addrLine>
									<postCode>15260</postCode>
									<settlement>Pittsburgh</settlement>
									<region>PA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Pitt at CLEF05: Data Fusion for Spoken Document Retrieval</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D1548B9EBDB07402AA5D47472668E5B5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T00:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval</term>
					<term>Measurement</term>
					<term>Performance</term>
					<term>Experimentation</term>
					<term>Spoken Document Retrieval</term>
					<term>Data Fusion</term>
					<term>Evidence Combination</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes an experimental investigation of data fusion techniques for spoken document retrieval. The effectiveness of retrieval based solely on the outputs of automatic speech recognition (ASR) is limited by the recognition errors introduced in the ASR process. This is especially true for retrieval on the Malach test collection, whose ASR outputs have an average word error rate (WER) of 35%. We therefore explored data fusion techniques for integrating the manually generated metadata information, which is provided for every Malach document, with the ASR outputs. We concentrated our effort on post-search data fusion techniques, in which multiple retrieval results obtained from automatically generated outputs or from human metadata are combined. Our initial studies indicated that a simple unweighted combination method (i.e., CombMNZ) that had been demonstrated to be useful in written text retrieval generated a significant 38% relative decrease in retrieval effectiveness (measured by Mean Average Precision) for our task, compared with a simple retrieval baseline in which all manual metadata and ASR outputs are put together. This motivated us to explore a more elaborate weighted data fusion model, in which a weight is associated with each retrieval result and can be specified by the users in advance. We also explored multiple iterations of data fusion in our weighted fusion model, and obtained further improvement at the 2nd iteration. In total, our best data fusion run obtained a 31% relative improvement over the simple fusion baseline, and a 4% relative improvement over the manual-only baseline, which is a significant difference.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Spoken documents are becoming more and more popular in people's information seeking activities, along with advances in information technologies, especially storage and network communication technologies. However, compared with the body of work on text-based documents, far less has been achieved in retrieving spoken documents. Recent remarkable advances in the automatic recognition of spontaneous conversational speech make it all the more urgent to study effective spoken document retrieval techniques. This is why we participated in the CLEF Spoken Document Retrieval (SDR) track; our goal is to leverage technologies developed for text retrieval in retrieving spoken documents.</p><p>Retrieving spoken documents from the Malach test collection, a test collection developed by the University of Maryland for retrieving spontaneous conversational speech <ref type="bibr" target="#b6">[8]</ref>, poses some interesting challenges. Before the Malach collection, there were several spoken document collections, whose documents were mostly news broadcast stories, help desk telephone calls, and political speeches <ref type="bibr" target="#b0">[1]</ref>. The documents in the Malach collection, however, are interviews with Holocaust survivors, who talk about their personal stories. Because of the genre, the topics, and the emotion involved, the average Word Error Rate (WER) of the machine transcripts of the documents, an indicator of the quality of the ASR output, is around 35%. This makes searching based on these ASR outputs very difficult.</p><p>Our major interest in this year's experiments, however, lies in the several forms of human-generated metadata associated with the spoken documents. 
For each document, there is a list of person names mentioned in the document, a set of human-assigned thesaurus keywords, and a brief 2-3 sentence summary written by the human catalogers during their cataloging process.</p><p>We view ASR outputs and human-generated metadata as two types of information that are complementary to each other in the retrieval process. On the one hand, ASR outputs provide full and detailed information about the content of the documents, which often cannot be fully covered by human-generated data. On the other hand, human-generated metadata provide focused, human-processed, high-quality information that can be relied on for retrieval accuracy. If we can develop a reliable retrieval method that combines both kinds of information in such a way that their complementary features are fully exploited, the achieved retrieval effectiveness should be greatly superior to that of either one alone. This is the goal of our studies in this year's CLEF-SDR experiments, and two derived research questions are:</p><p>1. How can the manual metadata and ASR outputs in the Malach collection be integrated to improve the retrieval effectiveness of the final run?</p><p>2. What parameters can we utilize to make the data fusion techniques more effective for our task?</p><p>In the rest of this report, we first review some existing data fusion methods in Section 2, discuss the experiment settings in detail in Section 3, and then describe the fusion techniques we developed for the CLEF-SDR experiments in Section 4. Finally, we discuss some further studies in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Data Fusion</head><p>In the literature, techniques for combining multiple queries, document representations, or retrieval results are called "data fusion" <ref type="bibr" target="#b5">[7]</ref>. It has been an active topic in text retrieval, and many techniques have been developed for applying fusion in various retrieval applications. Belkin et al. <ref type="bibr" target="#b1">[2]</ref> studied a pre-search data fusion approach by progressively combining Boolean query formulations. Lee <ref type="bibr" target="#b5">[7]</ref> provided an analysis of multiple post-search data fusion methods using TREC3 ad hoc retrieval data. Data fusion has also been applied in cross-language information retrieval <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b2">3]</ref>, recommender systems <ref type="bibr" target="#b7">[9]</ref>, and many other areas.</p><p>In post-search data fusion approaches, to properly merge retrieval results, which are commonly ranked lists of documents, the score associated with each document has to be normalized within its list. An often-used normalization scheme (see Equation (<ref type="formula">1</ref>)) utilizes the maximum and minimum scores of a ranklist (i.e., MaxScore and MinScore) in the normalization process <ref type="bibr" target="#b5">[7]</ref>.</p><formula xml:id="formula_0">NormalizedScore = (UnnormalizedScore − MinScore) / (MaxScore − MinScore) (1)</formula><p>Fox and Shaw <ref type="bibr" target="#b4">[5]</ref> developed several fusion methods for combining multiple sources of evidence, which they named CombMIN, CombMAX, CombSUM, CombANZ, and CombMNZ (their definitions are shown in Table <ref type="table" target="#tab_0">1</ref>). Lee <ref type="bibr" target="#b5">[7]</ref> studied these methods and established that CombMNZ is the best among them in retrieving TREC ad hoc data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Malach Test Collection</head><p>The Malach Test Collection was developed by the University of Maryland as part of their effort in the Malach project <ref type="bibr" target="#b6">[8]</ref>. The collection contains about 7800 segments from 300 interviews of Holocaust survivors. 
All the segments were constructed manually by catalogers. Each segment contains two automatic speech recognition outputs from the ASR systems developed by IBM in 2003 and 2004, respectively (see Figure <ref type="figure" target="#fig_0">1</ref> for an example of the output of the ASR 2004 system). The WERs of the two outputs are about 40% and 35%, respectively. In addition, there are automatically generated thesaurus terms from a system developed at the University of Maryland. Each segment also contains a set of human-generated data, including the person names mentioned in the segment, an average of 5 thesaurus labels, and a 3-sentence summary (see Figure <ref type="figure">2</ref> for an example).</p><p>There are 63 search topics in total, 38 of which were available for training, while 25 were held out as testing topics. Each topic follows the TREC style, with a title, a description, and a narrative (see Figure <ref type="figure" target="#fig_1">3</ref>). The topics are available in English, Spanish, Czech, German, and French; however, we used only the English topics in our studies.</p></div>
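The score normalization of Equation (1) and the Fox and Shaw combining functions of Table 1 can be sketched in Python as follows. This is a minimal illustration under toy data, not the code used in the experiments; the run contents and document IDs are invented.

```python
def normalize(scores):
    """Min-max normalize one ranklist's scores (Equation 1)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a constant-score list
    return {doc: (s - lo) / span for doc, s in scores.items()}

def comb(ranklists, method="CombMNZ"):
    """Fuse ranklists with one of the Fox-and-Shaw rules (Table 1).
    A document absent from a list simply contributes no score."""
    pooled = {}
    for ranklist in map(normalize, ranklists):
        for doc, s in ranklist.items():
            pooled.setdefault(doc, []).append(s)
    fused = {}
    for doc, scores in pooled.items():
        n = sum(1 for s in scores if s > 0)  # nonzero scores of the doc
        total = sum(scores)                  # CombSUM
        fused[doc] = {"CombMIN": min(scores),
                      "CombMAX": max(scores),
                      "CombSUM": total,
                      "CombANZ": total / n if n else 0.0,
                      "CombMNZ": total * n}[method]
    return fused

run1 = {"d1": 10.0, "d2": 5.0, "d3": 0.0}  # toy retrieval scores
run2 = {"d1": 2.0, "d3": 1.0}
fused = comb([run1, run2], "CombMNZ")
```

Under CombMNZ, d1, which receives a nonzero score in both lists, is boosted over d2 and d3, which each score in only one list; this is the reward for multiple evidence that distinguishes CombMNZ from CombSUM.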
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Indri search engine</head><p>Our search engine was Indri 1.0, a collaborative effort between the University of Massachusetts and Carnegie Mellon University [6]. Its retrieval model combines a language model with an inference network. We chose it not only for its state-of-the-art retrieval effectiveness, but also because it is an open source system into which we could easily integrate our modifications. Its powerful query syntax is another attraction, since we want to specify which index fields a retrieval should be based on for our studies of manual-metadata-only and automatic-data-only searches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Measures</head><p>To study the retrieval results in as wide a range of scenarios as possible, instead of choosing a single measure we employed a set of evaluation measures, each of which reveals some aspect of the retrieval effectiveness of the search results:</p><p>• Mean average precision (MAP) gives a rank-sensitive view of precision over a ranklist. Since the ranks of the relevant documents are taken into account, this measure gives a reasonable overview of the quality of the ranklist for a given retrieval topic.</p><p>• Precision at 10 (P10) is a useful measure of how many relevant documents appear in the first result screen, which is often the only screen of results viewed by a user.</p><p>• R-Precision (R-PREC) emphasizes precision while avoiding the artificial cut-off effect imposed by a pre-defined cut-off point such as the one in P10. The "R" varies according to the number of relevant documents of a given topic.</p><p>• Average recall at the top 1000 returned documents indicates the quality of the ranklist from the perspective of recall.</p></div>
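The four measures above can be sketched for a single topic as follows; MAP is then the mean of average precision over all topics. This is an illustrative sketch with an invented ranklist and relevance judgments, not the official evaluation code.

```python
def average_precision(ranked, relevant):
    """Average precision for one topic: precision is sampled at the
    rank of each relevant document, then averaged over all relevant."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=10):
    """P10 when k=10: fraction of the first k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the topic's number of relevant docs."""
    r = len(relevant)
    return sum(1 for d in ranked[:r] if d in relevant) / r if r else 0.0

def recall_at(ranked, relevant, k=1000):
    """Fraction of the relevant documents found in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4"]   # toy system ranklist, best first
relevant = {"d1", "d3"}             # toy relevance judgments
```

Note how R-Precision adapts its cut-off to the topic (here R = 2), whereas P10 uses the fixed cut-off of 10 regardless of how many relevant documents exist.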
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Baselines</head><p>We established three baselines for evaluating our methods. The first two represent the scenario in which no data fusion is performed. We selected a run on ASRTEXT 2004 as the baseline for searching on ASR output, since ASRTEXT 2004 is the better of the two ASR outputs. This baseline is referred to as "asr04", and is treated as the lower baseline. Ideally, we would use the search on manual transcripts as the upper baseline. Since the Malach collection does not provide manual transcripts, we used all the manually generated data in the segments as a surrogate. It acts as an approximate upper-bound baseline, and is referred to as the "manual-only" baseline. We did not apply blind relevance feedback (BRF) to the "manual-only" run, since our explorations using Indri's BRF function on the "manual-only" baseline generated inferior results. The third baseline is a search on all manual and ASR outputs put together as if they were different parts of the same document. This represents the simplest data fusion method, and is referred to as the "simple-fusion" baseline.</p><p>The search effectiveness of the three baselines is presented in Table <ref type="table" target="#tab_2">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Data Fusion with CombMNZ</head><p>The first data fusion method studied in our CLEF-SDR experiments was our implementation of the CombMNZ method, since Lee demonstrated its superiority over the other methods <ref type="bibr" target="#b5">[7]</ref>. The run based on CombMNZ merged the results of three retrieval runs: the "manual-only" baseline, the "asr04" baseline, and a retrieval run on the automatically assigned thesaurus keywords called "AUTOKEYWORD2004A1" (we call this run "autowa1"). Table <ref type="table" target="#tab_5">3</ref> shows the results of the "CombMNZ" run and of the three runs it was based on. Compared with the lower "asr04" baseline, this combined run achieved significant improvements as measured by MAP, P10, and especially Avg-Recall (P &lt; 0.05 in paired T-tests). 
However, it shows significant decreases in MAP, R-PREC, and P10 when compared with the two higher baselines, the "manual-only" baseline and the "simple-fusion" baseline (P &lt; 0.05 in paired T-tests). The only improvement it achieved over the two higher baselines is in Avg-Recall. This means that the "CombMNZ" run does return more relevant documents than the two higher baselines, but it ranks them badly.</p><p>A close examination of the retrieval runs in Table <ref type="table" target="#tab_5">3</ref> shows that the retrieval effectiveness of the "manual-only" run is much higher than that of the two automatic runs. For example, the MAP of "manual-only" is about 200% higher than that of "asr04", the better of the two automatic runs. Therefore, it makes no sense to assume that their contributions to the final fused ranklist are the same, which is the assumption of the CombMNZ model. We need a data fusion model that takes these differences in retrieval effectiveness into account.  </p></div>
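The paired T-tests used throughout these comparisons operate on per-topic differences in a measure (for example, average precision). A minimal sketch of the test statistic, with invented per-topic values:

```python
# Sketch of a paired T-test between two runs: the statistic is computed
# over per-topic differences in a measure such as average precision.
# All numbers are hypothetical; degrees of freedom = n - 1.
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """t statistic for paired samples xs, ys (one value per topic)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

run_a = [0.42, 0.31, 0.55, 0.28, 0.47]   # per-topic AP, hypothetical run A
run_b = [0.35, 0.25, 0.49, 0.30, 0.38]   # per-topic AP, hypothetical run B
t = paired_t(run_a, run_b)
```

The resulting t value is then compared against the critical value for n − 1 degrees of freedom to decide significance at P &lt; 0.05; in practice a library routine such as SciPy's ttest_rel returns the P-value directly.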
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Weighted CombMNZ</head><p>The failure of CombMNZ on our data fusion task motivated us to explore a weighted scheme for data fusion based on CombMNZ. A natural place to insert a weight in CombMNZ is to assign a weight of belief to each retrieval run, whenever such evidence or belief can be obtained. In our weighted CombMNZ model (called the WCombMNZ model), this belief is used in the calculation of the final combined score for a document (see Equation <ref type="formula">2</ref>).</p><formula xml:id="formula_1">WCombMNZ_i = ( Σ_{j=1..n} w_j × NormalizedScore_{i,j} ) × n (2)</formula><p>where w_j is a predefined weight associated with a search result to be combined, n is the number of nonzero scores of document i, and NormalizedScore_{i,j} is calculated using Equation <ref type="formula">1</ref>. Various methods can be used to obtain the weight w_j for a given ranklist j. In this year's experiments, we first used the retrieval effectiveness of the pre-combined runs on the 38 training topics as the weights (the details of the pre-fused runs on the training topics are in Table <ref type="table" target="#tab_6">4</ref>). We thus have four different "WCombMNZ" runs (see Table <ref type="table" target="#tab_7">5</ref>), and their retrieval effectiveness under the four measures is given in Table <ref type="table" target="#tab_8">6</ref>. All four WCombMNZ runs are significantly higher than the non-weighted "CombMNZ" run (paired T-test with P &lt; 0.05) when MAP, R-PREC, or P10 is the measure. However, they are still significantly lower than the "manual-only" run under the same measures.</p><p>Since the weights of "WCombMNZ-1" generated the best MAP, R-PREC, and P10 results, we used those weights to help us explore the effect of different combinations of the weights. 
As the difference in retrieval effectiveness between the "manual-only" run and the two automatic runs is much larger than that between the two automatic runs, we first explored changing the ratio between the weight of the "manual-only" run and the weights of the "asr04" and "autowa1" runs. The ratios we tested were 2:1 (that is, the weight for the "manual-only" run is 2, and the weights for the two automatic runs were both set to 1 in the WCombMNZ model), 5:1, 10:1, 15:1, and up to 1000:1 (the results are presented in Table <ref type="table" target="#tab_9">7</ref>). We then changed the weight ratio between "asr04" and "autowa1" to 2:1, which is closer to the weight ratio in "WCombMNZ-1", and varied the weight ratio of the three runs from 4:2:1 and 6:2:1 up to 100:2:1. As shown in Table <ref type="table" target="#tab_9">7</ref>, the ratio between the weight of the "manual-only" run and those of the two automatic runs is the dominant factor affecting retrieval performance, and once the ratio between the manual run and the automatic runs is larger than 10, there is little difference in retrieval effectiveness under any of the measures. However, still none of the fused runs achieves better MAP, R-PREC, or P10 than "manual-only", although they are much closer to the performance of "manual-only" than the two automatic runs; at the same time, many of them achieved significantly better Avg-Recall than the "manual-only" run.</p></div>
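Equation (2) differs from plain CombMNZ only in the per-run weights w_j. A minimal sketch follows, assuming the input ranklists are already normalized with Equation (1) and the weights are given externally (for example, training-topic MAP values, or a fixed ratio such as 10:1); the run contents are invented stand-ins.

```python
def wcombmnz(runs, weights):
    """Weighted CombMNZ (Equation 2).
    runs:    list of {doc: normalized_score} ranklists
    weights: one predefined weight w_j per run"""
    weighted_sum, nonzero = {}, {}
    for w, run in zip(weights, runs):
        for doc, score in run.items():
            weighted_sum[doc] = weighted_sum.get(doc, 0.0) + w * score
            if score > 0:
                nonzero[doc] = nonzero.get(doc, 0) + 1
    # weighted CombSUM times the document's number of nonzero scores
    return {doc: s * nonzero.get(doc, 0) for doc, s in weighted_sum.items()}

manual = {"d1": 1.0, "d2": 0.6}              # stand-in for "manual-only"
asr    = {"d1": 0.3, "d3": 1.0}              # stand-in for "asr04"
fused  = wcombmnz([manual, asr], [10.0, 1.0])   # weight ratio 10:1
```

With a 10:1 ratio the fused ranking is dominated by the manual run, which matches the behavior observed in Table 7 once the manual-to-automatic ratio grows large. Setting all weights to 1 makes the model fall back to plain CombMNZ.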
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Multiple Iterations of Data Fusion</head><p>One exploration within the data fusion framework is "do multiple iterations of data fusion make sense?" To answer this, we conducted several experiments with the WCombMNZ model. A total of five retrieval runs were used in the second iteration of data fusion. We kept the "manual-only" run, since it is the best run so far, and we used the four runs "WCombMNZ-1" to "WCombMNZ-4" listed in Table <ref type="table" target="#tab_8">6</ref>. We used a similar scheme to vary the weight ratios among the runs, and we also set all the weights to 1 so that the WCombMNZ model falls back to CombMNZ, allowing us to study CombMNZ too. The ratio "2-1" in Table <ref type="table" target="#tab_10">8</ref> means that the weight for "manual-only" is 2 and that for the other four runs is 1.</p><p>As shown in Table <ref type="table" target="#tab_10">8</ref>, the "manual-only" run, in our current retrieval setting, still deserves more weight than the other runs, and the best retrieval results are achieved with a ratio around 10:1. Statistical tests (paired T-tests) between the results of the 2nd-round fusion runs and those of the "manual-only" run show that all the 2-iteration fusion runs generated significant improvements in average recall, but only the runs with ratios above 10:1 generated significant improvements in P10, and only the runs with ratios 10:1 and 15:1 generated significant improvements in MAP; no significant improvement was achieved in R-PREC. We then generated various 3rd-round fusion runs using a similar scheme, combining the "manual-only" run and the four 2nd-round runs with ratios 5:1, 10:1, 15:1, and 20:1. The results are shown in Table <ref type="table" target="#tab_10">8</ref>. None of the 3rd-round runs generated statistically significant improvement over the 2nd-round runs. 
It seems that iterating more than twice does not justify the extra cost compared with the 2nd-round fusion runs.</p><p>Our multiple-iteration experiments tell us that it is usually difficult to obtain a better fused result than the best pre-fusion run when the retrieval effectiveness of the pre-fusion runs differs greatly; the fusion of "manual-only" with the automatic runs is an example of such a case. However, significant improvement over the best pre-fusion run could be achieved via multiple iterations of fusion: we achieved significant improvement over the best pre-fusion run in two iterations. Of course, we need further experiments to test the general effectiveness of multiple-iteration fusion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions and Future Work</head><p>In this paper, we have described an experimental investigation of data fusion techniques for spoken document retrieval. Because of the characteristics of the documents in the Malach test collection, retrieval based solely on the outputs of automatic speech recognition (ASR) falls well below retrieval on manually generated data. To overcome this problem, we explored data fusion techniques for integrating the manually generated metadata with the ASR outputs. We concentrated on the post-search fusion approach, and explored the weighted CombMNZ model with different weight ratios and multiple iterations. Our initial results indicate that a simple unweighted combination method that had been demonstrated to be useful in written text retrieval <ref type="bibr" target="#b5">[7]</ref> generated a significant 38% relative decrease in retrieval effectiveness (Mean Average Precision) for our task, compared with a simple retrieval baseline in which all manual metadata and ASR outputs are put together. 
Only with the more elaborate weighted combination scheme did we obtain a significant 31% relative improvement over the simple-fusion baseline, and a 4% relative improvement over the manual-only baseline, which is a significant difference.</p><p>Our future work includes further experiments on the general effectiveness of multiple-iteration fusion. Another direction is to explore the use of WCombMNZ in other retrieval tasks, where multiple retrieval results can be obtained from one retrieval engine or even from different engines. A third study we want to pursue is to answer the question of how little human-generated data is needed, when combined with the ASR output, to achieve retrieval effectiveness comparable to a retrieval on manual transcripts.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An example of ASR output (2004 version) and automatically generated thesaurus terms (autowa1 version)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: An example of the search topic in Malach Collection</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>WCombMNZ- 1</head><label>1</label><figDesc>use the MAP values of the three runs on training topics as the weights WCombMNZ-2 use the R-PREC values of the three runs on training topics as the weights WCombMNZ-3 use the P10 values of the three runs on training topics as the weights WCombMNZ-4 use the Avg-Recall values of the three runs on training topics as the weights</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Combining functions proposed by Fox and Shaw</figDesc><table><row><cell>CombMIN</cell><cell>minimum of all scores of a document</cell></row><row><cell>CombMAX</cell><cell>maximum of all scores of a document</cell></row><row><cell>CombSUM</cell><cell>summation of all scores of a document</cell></row><row><cell>CombANZ</cell><cell>CombSUM ÷ number of nonzero scores of a document</cell></row><row><cell>CombMNZ</cell><cell>CombSUM × number of nonzero scores of a document</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc></figDesc><table><row><cell>runs</cell><cell>MAP</cell><cell>R-PREC</cell><cell>P10</cell><cell>Avg-Recall</cell></row><row><cell cols="2">manual-only 0.2312</cell><cell>0.2836</cell><cell>0.4603</cell><cell>0.5813</cell></row><row><cell>asr04</cell><cell>0.0693</cell><cell>0.1139</cell><cell>0.2111</cell><cell>0.3595</cell></row><row><cell cols="2">simple-fusion 0.1842</cell><cell>0.1985</cell><cell>0.3635</cell><cell>0.5847</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 :</head><label>2</label><figDesc>Retrieval effectiveness of the three baselines</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 3 :</head><label>3</label><figDesc>Retrieval effectiveness of the CombMNZ run</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 4 :</head><label>4</label><figDesc>Retrieval effectiveness of individual runs on the 38 training topics</figDesc><table><row><cell>runs</cell><cell>MAP</cell><cell>R-PREC</cell><cell>P10</cell><cell>Avg-Recall</cell></row><row><cell cols="2">manual-only 0.1494</cell><cell>0.1823</cell><cell>0.3237</cell><cell>0.4221</cell></row><row><cell>asr04</cell><cell>0.0525</cell><cell>0.0754</cell><cell>0.1447</cell><cell>0.2788</cell></row><row><cell>autowa1</cell><cell>0.0239</cell><cell>0.0460</cell><cell>0.0816</cell><cell>0.2832</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 5 :</head><label>5</label><figDesc>Combining functions proposed by Fox and Shaw</figDesc><table><row><cell>runs</cell><cell>MAP</cell><cell>R-PREC</cell><cell>P10</cell><cell>Avg-Recall</cell></row><row><cell>manual-only</cell><cell>0.2312</cell><cell>0.2836</cell><cell>0.4603</cell><cell>0.5813</cell></row><row><cell>CombMNZ</cell><cell>0.1127</cell><cell>0.1173</cell><cell>0.3079</cell><cell>0.6182</cell></row><row><cell cols="2">WCombMNZ-1 0.2137</cell><cell>0.2589</cell><cell>0.4460</cell><cell>0.6008</cell></row><row><cell cols="2">WCombMNZ-2 0.1987</cell><cell>0.2431</cell><cell>0.4206</cell><cell>0.6254</cell></row><row><cell cols="2">WCombMNZ-3 0.1967</cell><cell>0.2416</cell><cell>0.4190</cell><cell>0.6253</cell></row><row><cell cols="2">WCombMNZ-4 0.1783</cell><cell>0.2188</cell><cell>0.3778</cell><cell>0.6215</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 6 :</head><label>6</label><figDesc>Retrieval effectiveness of the first 4 WCombMNZ runs on all 63 topics</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 7 :</head><label>7</label><figDesc>Exploring the weight ratios in the WCombMNZ model</figDesc><table><row><cell>runs</cell><cell>MAP</cell><cell>R-PREC</cell><cell>P10</cell><cell>Avg-Recall</cell></row><row><cell>ratio 2-1</cell><cell>0.1884</cell><cell>0.2315</cell><cell>0.4032</cell><cell>0.6236</cell></row><row><cell>ratio 5-1</cell><cell>0.2088</cell><cell>0.2590</cell><cell>0.4254</cell><cell>0.6259</cell></row><row><cell>ratio 10-1</cell><cell>0.2133</cell><cell>0.2598</cell><cell>0.4302</cell><cell>0.6240</cell></row><row><cell>ratio 15-1</cell><cell>0.2132</cell><cell>0.2590</cell><cell>0.4381</cell><cell>0.6211</cell></row><row><cell>ratio 20-1</cell><cell>0.2140</cell><cell>0.2581</cell><cell>0.4413</cell><cell>0.6202</cell></row><row><cell>ratio 25-1</cell><cell>0.2131</cell><cell>0.2581</cell><cell>0.4444</cell><cell>0.6129</cell></row><row><cell>ratio 50-1</cell><cell>0.2141</cell><cell>0.2577</cell><cell>0.4429</cell><cell>0.6133</cell></row><row><cell>ratio 100-1</cell><cell>0.2140</cell><cell>0.2583</cell><cell>0.4460</cell><cell>0.6087</cell></row><row><cell>ratio 1000-1</cell><cell>0.2132</cell><cell>0.2593</cell><cell>0.4460</cell><cell>0.5995</cell></row><row><cell>ratio 4-2-1</cell><cell>0.1937</cell><cell>0.2354</cell><cell>0.3735</cell><cell>0.6246</cell></row><row><cell>ratio 6-2-1</cell><cell>0.2047</cell><cell>0.2523</cell><cell>0.4190</cell><cell>0.6253</cell></row><row><cell>ratio 10-2-1</cell><cell>0.2110</cell><cell>0.2591</cell><cell>0.4238</cell><cell>0.6236</cell></row><row><cell>ratio 20-2-1</cell><cell>0.2138</cell><cell>0.2599</cell><cell>0.4333</cell><cell>0.6208</cell></row><row><cell>ratio 
30-2-1</cell><cell>0.2144</cell><cell>0.2591</cell><cell>0.4365</cell><cell>0.6207</cell></row><row><cell>ratio 50-2-1</cell><cell>0.2141</cell><cell>0.2574</cell><cell>0.4444</cell><cell>0.6172</cell></row><row><cell cols="2">ratio 100-2-1 0.2140</cell><cell>0.2575</cell><cell>0.4444</cell><cell>0.6089</cell></row><row><cell>runs</cell><cell>MAP</cell><cell>R-PREC</cell><cell>P10</cell><cell>Avg-Recall</cell></row><row><cell>manual-only</cell><cell>0.2312</cell><cell>0.2836</cell><cell>0.4603</cell><cell>0.5813</cell></row><row><cell>2nd-ratio 1-1</cell><cell>0.2119</cell><cell>0.2670</cell><cell>0.4413</cell><cell>0.6241</cell></row><row><cell>2nd-ratio 2-1</cell><cell>0.2295</cell><cell>0.2720</cell><cell>0.4540</cell><cell>0.6228</cell></row><row><cell>2nd-ratio 5-1</cell><cell>0.2397</cell><cell>0.2806</cell><cell>0.4778</cell><cell>0.6200</cell></row><row><cell cols="2">2nd-ratio 10-1 0.2409</cell><cell>0.2860</cell><cell>0.4937</cell><cell>0.6188</cell></row><row><cell cols="2">2nd-ratio 15-1 0.2400</cell><cell>0.2853</cell><cell>0.4968</cell><cell>0.6157</cell></row><row><cell>2nd-ratio 20-1</cell><cell>0.2388</cell><cell>0.2856</cell><cell>0.4810</cell><cell>0.6142</cell></row><row><cell>3rd-ratio 1-1</cell><cell>0.2403</cell><cell>0.2876</cell><cell>0.5016</cell><cell>0.6143</cell></row><row><cell>3rd-ratio 2-1</cell><cell>0.2393</cell><cell>0.2865</cell><cell>0.4857</cell><cell>0.6142</cell></row><row><cell>3rd-ratio 1-2</cell><cell>0.2409</cell><cell>0.2866</cell><cell>0.4984</cell><cell>0.6157</cell></row><row><cell>3rd-ratio 1-5</cell><cell>0.2407</cell><cell>0.2867</cell><cell>0.4937</cell><cell>0.6159</cell></row><row><cell>3rd-ratio 1-10</cell><cell>0.2408</cell><cell>0.2869</cell><cell>0.4921</cell><cell>0.6168</cell></row><row><cell>3rd-ratio 1-15</cell><cell>0.2408</cell><cell>0.2864</cell><cell>0.4921</cell><cell>0.6168</cell></row><row><cell>3rd-ratio 
1-30</cell><cell>0.2404</cell><cell>0.2869</cell><cell>0.4921</cell><cell>0.6171</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>Table 8 :</head><label>8</label><figDesc>Exploring multiple iterations in data fusion</figDesc><table /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgment</head><p>The authors would like to thank Doug Oard, Gareth Jones and Ryen White for their tireless efforts to coordinate CLEF-SDR.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://www.dcs.shef.ac.uk/spandh/projects/swag/" />
		<title level="m">EU-US working group on spoken-word audio collections</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">The effect of multiple query representations on information retrieval system performance</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Belkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cool</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Callan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM-SIGIR&apos;93</title>
				<meeting>ACM-SIGIR&apos;93</meeting>
		<imprint>
			<date type="published" when="1993">1993</date>
			<biblScope unit="page" from="339" to="346" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Cross-language retrieval experiments at CLEF 2002</title>
		<author>
			<persName><forename type="first">Aitao</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF 2002</title>
				<meeting>CLEF 2002</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="28" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">CLIR Experiments at Maryland for TREC 2002: Evidence Combination for Arabic-English Retrieval</title>
		<author>
			<persName><forename type="first">Kareem</forename><surname>Darwish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Douglas</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of TREC 2002</title>
				<meeting>TREC 2002</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="703" to="710" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Combination of multiple searches</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Shaw</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Text REtrieval Conference (TREC-2)</title>
				<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="243" to="252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Analyses of multiple evidence combination</title>
		<author>
			<persName><forename type="first">Joon</forename><forename type="middle">Ho</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of SIGIR&apos;97</title>
				<meeting>SIGIR&apos;97</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="267" to="276" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Building an information retrieval test collection for spontaneous conversational speech</title>
		<author>
			<persName><forename type="first">Douglas</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dagobert</forename><surname>Soergel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Doermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaoli</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">Craig</forename><surname>Murray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianqiang</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bhuvana</forename><surname>Ramabhadran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Franz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><surname>Gustman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of SIGIR&apos;04</title>
				<meeting>SIGIR&apos;04</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Combination of evidence in recommendation systems characterized by distance functions</title>
		<author>
			<persName><forename type="first">Luis</forename><forename type="middle">M</forename><surname>Rocha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2002 World Congress on Computational Intelligence, FUZZ-IEEE&apos;02</title>
				<meeting>the 2002 World Congress on Computational Intelligence, FUZZ-IEEE&apos;02</meeting>
		<imprint>
			<publisher>IEEE Press</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="203" to="208" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
