
Chemnitz at VideoCLEF 2009: Experiments and Observations on Treating Classification as IR Task

Jens Kürsten and Maximilian Eibl
Chemnitz University of Technology, Faculty of Computer Science, Dept. Computer Science and Media
09107 Chemnitz, Germany
[ jens.kuersten


This paper describes the participation of the Chemnitz University of Technology in the VideoCLEF 2009 classification task. Our motivation lies in the task's close relation to our research project sachsMedia. In our second participation in the task we experimented with treating the task as an IR problem and used the Xtrieval framework [3] to run our experiments. We proposed an automatic threshold estimation to limit the number of documents per label, since too many returned documents hurt the overall correct classification rate. Although the experimental setup was enhanced this year and the data sets were changed, we found that the IR approach still works quite well. Our query expansion approach performed better than the baseline experiments in terms of mean average precision. We also showed that combining the ASR transcriptions and the archival metadata improves the classification performance, unless query expansion is used in the retrieval phase.

Automatic Speech Transcripts, Video Classification

1 Introduction and Motivation

This article describes the system and configuration that we used for our participation in the VideoCLEF classification task. The task was to categorize dual-language video into 46 given classes based on provided ASR transcripts [5] and additional archival metadata. In a mandatory experiment only the ASR transcripts of the videos had to be used as the source for classification. Furthermore, each of the given video documents can have no label, one label, or multiple labels. Hence the task can be characterized as a real-world scenario in the field of automatic classification.

Our participation in the task is motivated by its close relation to our research project sachsMedia. The main goals of the project are twofold. The first main objective is the automatic extraction of low-level features from audio and video for automated annotation of poorly described material in archives. On the other hand, sachsMedia aims to support local TV stations in Saxony in replacing analog distribution technology with innovative digital distribution services. A special problem of the broadcasters is the accessibility of their archives for end users. Although we are currently developing algorithms for automatic extraction of low-level metadata, the VideoCLEF classification task is a direct use case within our project.

The remainder of the article is organized as follows. In section 2 we briefly review existing approaches and describe the system architecture and its main configuration. In section 3 we present the results of preliminary and officially submitted experiments and interpret the results. A summary of our observations and experiences is given in section 4. The final section concludes the experiments with respect to our expectations and gives an outlook on future work.

2 System Architecture and Configuration

Since the classification task was an enhanced modification of last year's VideoCLEF classification task [4], we give a brief review of previously used approaches. There were mainly two distinct ways to approach the classification task: (a) collecting training data from external sources like general Web content or Wikipedia to train a text classifier, or (b) treating the problem as an information retrieval task. Villena and Lana [8] combined both ideas by obtaining training data from Wikipedia and assigning the class labels to the indexed training data. The metadata from the video documents were used as queries on the training corpus, and the dominant label of the retrieved documents was assigned as class label. Newman and Jones [6] as well as Perea-Ortega et al. [7] approached the problem purely as an IR task and achieved similarly strong performance. Kürsten et al. [2] and He et al. [1] tried to solve the problem with state-of-the-art classifiers like k-NN and SVM. Both used Wikipedia articles to train their classifiers.

2.1 Resources

Given the impressions from last year's evaluation and the huge success of the IR approaches, as well as the extension of the task to a larger number of class labels and more documents, we decided to treat the problem as an IR task. Hence we used the Xtrieval framework [3] to create an index on the provided metadata. This index was composed of three fields: one with the ASR output, one with the archival metadata, and a third one containing both. To process the tokens, a language-specific stopword list² and the Dutch stemmer from the Snowball project³ were applied. We used the class labels to query our video document index. The Lucene⁴ retrieval core with the default vector-based IR model was utilized within our framework. In the retrieval phase we used an English thesaurus⁵ in combination with the Google AJAX language API⁶ for query expansion purposes.
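The three-field index and the use of class labels as queries can be illustrated with a minimal sketch. This is not the Xtrieval or Lucene API: the stopword set is a tiny illustrative subset, stemming is omitted, and the tf-idf sum is a simplified stand-in for Lucene's vector-space scoring. All function names and the sample videos are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative subset of Dutch stopwords (assumption, not the Snowball list).
STOPWORDS = {"de", "het", "een", "en", "van"}


def analyze(text):
    """Lowercase, split on whitespace and drop stopwords (stemming omitted)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_index(videos):
    """Build three fields per video: ASR output, archival metadata, and both."""
    index = {"asr": {}, "meta": {}, "both": {}}
    for vid, src in videos.items():
        index["asr"][vid] = Counter(analyze(src["asr"]))
        index["meta"][vid] = Counter(analyze(src["meta"]))
        index["both"][vid] = index["asr"][vid] + index["meta"][vid]
    return index


def search(index, field, label):
    """Rank the documents of one field for a class label (simplified tf-idf)."""
    docs = index[field]
    scores = {}
    for term in analyze(label):
        df = sum(1 for tf in docs.values() if term in tf)
        if df == 0:
            continue
        idf = math.log(1 + len(docs) / df)
        for vid, tf in docs.items():
            if term in tf:
                scores[vid] = scores.get(vid, 0.0) + tf[term] * idf
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy collection.
videos = {
    "v1": {"asr": "het orkest speelt muziek", "meta": "muziek concert"},
    "v2": {"asr": "nieuws over politiek", "meta": "politiek debat"},
    "v3": {"asr": "muziek", "meta": "natuur"},
}
index = build_index(videos)
ranking_both = search(index, "both", "muziek")
ranking_meta = search(index, "meta", "muziek")
```

Querying the combined field returns both videos that mention the label term, while the metadata-only field returns just the one whose archival description contains it, which is the kind of source-dependent difference the SF parameter is meant to expose.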

2.2 System Configuration and Parameters

The following list briefly explains some of our system parameters and their values for the experimental evaluation.

Query Expansion (QE): The most frequent term from the top-5 documents was used to reformulate the original query.

Thesaurus Term Query Expansion (TT): Thesaurus term query expansion was used for those queries that returned fewer than two documents (even after QE).

Multi-label Limit (DpL): DpL denotes the maximum number of assigned documents per class label; it was used to manually set a threshold for the document cut-off in the result sets.

Source Field (SF): The metadata source was varied to determine which source is most reliable and whether combining the sources improves the classification.

² http://snowball.tartarus.org/algorithms/dutch/stop.txt
³ http://snowball.tartarus.org/algorithms/dutch/stemmer.html
⁴ http://lucene.apache.org
⁵ http://de.openoffice.org/spellcheck/about-spellcheck-detail.html#thesaurus
⁶ http://code.google.com/apis/ajaxlanguage/documentation
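The two query expansion parameters in the list above could be sketched as follows. The paper does not specify how ties between equally frequent terms are resolved or whether terms already in the query are excluded; both choices below are assumptions, and the thesaurus is a hypothetical in-memory dictionary rather than the thesaurus and translation service used in the experiments.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=5):
    """QE: append the most frequent term of the top_k documents to the query.

    ranked_docs is a list of token lists, best-ranked document first.
    Assumptions: terms already in the query are skipped, and ties go to
    the term that was seen first.
    """
    counts = Counter()
    for tokens in ranked_docs[:top_k]:
        counts.update(t for t in tokens if t not in query_terms)
    if not counts:
        return list(query_terms)
    best_term, _ = counts.most_common(1)[0]
    return list(query_terms) + [best_term]


def thesaurus_fallback(query_terms, result_count, thesaurus, min_results=2):
    """TT: add thesaurus terms only if fewer than min_results documents came back."""
    if result_count >= min_results:
        return list(query_terms)
    extra = [syn for term in query_terms for syn in thesaurus.get(term, [])]
    return list(query_terms) + extra


# Hypothetical ranked result list for the label "muziek".
ranked_docs = [
    ["muziek", "concert", "orkest"],
    ["concert", "muziek"],
    ["orkest", "concert"],
]
expanded = expand_query(["muziek"], ranked_docs)
```

Here "concert" occurs three times in the top documents and is therefore the term added to the query.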

Because the document cut-off level is difficult to determine a priori, we calculated the following threshold for each query. The threshold $T_{DpL}$ is based on the scores of the documents retrieved per class label. $RSV_{avg}$ denotes the average score and $RSV_{max}$ the maximum score of the retrieved documents. $Num_{docs}$ stands for the total number of documents retrieved for a specific class label.

$$T_{DpL} = RSV_{avg} + 2 \cdot \frac{RSV_{max} - RSV_{avg}}{Num_{docs}}$$
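A minimal sketch of the threshold computation for a single class-label query is given below. The text does not state whether the comparison against the threshold is inclusive, so the `>=` in `apply_cutoff` is an assumption, as are the function names and the sample scores.

```python
from statistics import mean


def dpl_threshold(scores):
    """Return the threshold for the retrieval scores (RSVs) of one query."""
    if not scores:
        raise ValueError("at least one retrieved document is required")
    rsv_avg = mean(scores)
    rsv_max = max(scores)
    return rsv_avg + 2 * (rsv_max - rsv_avg) / len(scores)


def apply_cutoff(ranked):
    """Keep the (doc_id, score) pairs whose score reaches the threshold.

    Assumption: a document with a score equal to the threshold is kept.
    """
    threshold = dpl_threshold([score for _, score in ranked])
    return [(doc_id, score) for doc_id, score in ranked if score >= threshold]


# Hypothetical result list for one class label.
ranked = [("v1", 8.0), ("v2", 7.0), ("v3", 2.0), ("v4", 1.0)]
kept = apply_cutoff(ranked)
```

For these scores the average is 4.5 and the maximum is 8.0, so the threshold is 4.5 + 2 * 3.5 / 4 = 6.25 and only the two top-scoring documents receive the label. Because the second term shrinks as the number of retrieved documents grows, long result lists are cut closer to the average score.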

3 Experiments and Results

In this section we report results that were obtained by running various system configurations on the provided training data. In table 1, columns 2-5 refer to specific system parameters that were introduced in section 2.2. Please note that use of the threshold formula is denoted with x in column DpL, which means that the number of assigned documents can be different for each class label.

Regarding the evaluation of the task we had a problem with calculating the measures. We report two values for MAP due to a peculiarity in our Xtrieval framework, which allows the system to return two documents with identical RSV. The trec_eval tool seems to penalize this behavior by randomly reordering the result set. Thus the MAP values reported by trec_eval and by our framework (labeled MAP* in the following tables) show marginal variations. Unfortunately we were neither able to correct the behavior of our system nor could we find out when or why the trec_eval tool reorders our result sets. Thus, in agreement with the task organizers, we decided to report both MAP values for our experiments.
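Why tied scores can lead to different MAP values is easy to see with a small sketch of average precision. It does not reproduce how trec_eval orders tied documents; it only shows that two orderings of the same tied pair give different values. The document ids and the relevance set are hypothetical.

```python
def average_precision(ranked_ids, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# v1 and v2 are assumed to have identical RSV, so either order is a valid ranking.
relevant = {"v2"}
ap_relevant_first = average_precision(["v2", "v1", "v3"], relevant)
ap_relevant_second = average_precision(["v1", "v2", "v3"], relevant)
```

With the relevant document first the average precision is 1.0, with the tied non-relevant document first it drops to 0.5, so any difference in tie ordering between two evaluation tools shows up directly in MAP.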

3.1 Experiments on the Training Data

To evaluate classification performance we report the total number of assigned labels (SumL), the ratio of correctly assigned labels (CR), the averaged recall (AR) over all class labels, and mean average precision (MAP). Table 1 is divided into three sections with respect to the metadata sources used. In the five rightmost columns the best values for each section of the table are emphasized in bold, and the best value over all sections is marked bold and italic.

The following observations can be made by analyzing the experimental results. No matter which metadata source was used, the experiment without a limit on the number of assigned documents per label had the best performance in terms of AR and MAP (see IDs cut2, cut5 and cut11). The drawback of those runs is that they have very low correct classification rates (CR) of about 3% for the ASR data and about 8% when using archival metadata alone or in combination with ASR data. In contrast, the experiments without any form of query expansion (see IDs cut1, cut4 and cut10) had the highest correct classification rates (CR), from 33% up to 47%. However, this is mostly a result of limiting the output to one document per label, which also leads to lower performance in terms of AR and MAP.

Numerous experiments with either manual or automatic thresholds to limit the assigned documents per label were conducted. The results show that it is possible to improve CR substantially and almost sustain the best MAP values (compare cut5 to cut9 and cut11 to cut15). Nevertheless, for those runs the AR was significantly lower.
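The count-based measures named at the start of this section (SumL, CR and AR) can be computed as in the following sketch. The exact averaging used in the official evaluation is not spelled out in the text, so two assumptions are made here: CR is the share of correct assignments among all assignments, and AR averages recall only over labels that have at least one relevant document. The labels and document ids are hypothetical.

```python
def evaluate(assigned, truth):
    """Return (SumL, CR, AR) for label -> set-of-document-ids mappings."""
    # SumL: total number of label assignments made by the system.
    sum_l = sum(len(docs) for docs in assigned.values())
    # CR: fraction of those assignments that are correct.
    correct = sum(
        len(docs & truth.get(label, set())) for label, docs in assigned.items()
    )
    cr = correct / sum_l if sum_l else 0.0
    # AR: recall per label, averaged over labels with relevant documents.
    recalls = [
        len(assigned.get(label, set()) & docs) / len(docs)
        for label, docs in truth.items()
        if docs
    ]
    ar = sum(recalls) / len(recalls) if recalls else 0.0
    return sum_l, cr, ar


# Hypothetical system output and ground truth.
assigned = {"music": {"v1", "v2"}, "politics": {"v3"}}
truth = {"music": {"v1"}, "politics": {"v3", "v4"}}
sum_l, cr, ar = evaluate(assigned, truth)
```

In this toy case three labels are assigned, two of them correctly (CR = 2/3), and recall is 1.0 for "music" and 0.5 for "politics" (AR = 0.75). Assigning more documents per label can only raise AR but tends to lower CR, which is the trade-off visible in the runs discussed above.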

3.2 Experiments on the Test Data

In this section we report the experimental results on the evaluation data set. Please note that we ran all configurations from section 3.1 again, because we wanted to find out whether our observations on the training data also hold on the test data set. Experiments that were submitted for official evaluation by the organizers of the task are denoted with *. Again, in table 2 columns 2-5 contain parameters of our system, which are briefly explained in section 2.2. The performance of the experiments is reported with respect to the overall sum of assigned labels (SumL), the average ratio of correct classifications (CR), average recall (AR), and mean average precision (MAP). As in section 3.1, table 2 is divided into three sections with respect to the metadata sources used. In the five rightmost columns the best values for each section of the table are emphasized in bold, and the best value over all sections is marked bold and italic.

In general we see similar behavior on both the training and the test data set. For all data sources used, the best correct classification rate (CR) is achieved without using any form of query expansion (see IDs cut1, cut4 and cut10). The best overall CR was achieved by using only archival metadata in the retrieval phase. Since the archival metadata consists of intellectual annotations, this is a very straightforward finding. Another obvious observation is that the best overall results in terms of MAP and AR were also achieved on the archival metadata. Nevertheless, the gap to the best results when combining ASR output with archival metadata is very small (compare cut5 to cut11).

Regarding our proposed automatic threshold calculation for limiting the number of assigned documents per label, the results are twofold. On the one hand, there is a slight improvement in terms of MAP and AR compared to low manually fixed thresholds between 1 and 3 assigned documents per label. On the other hand, the overall correct classification rate (CR) decreases by the same magnitude as MAP and AR increase, which is another very straightforward finding.

The interpretation of our experimental results led us to the conclusion that using MAP for evaluating a multi-label classification task is questionable. The main reason, in our view, is that MAP does not take into account the overall correct classification rate CR. Consider the two best performing experiments using archival metadata and ASR transcriptions in table 1 or table 2 (see IDs cut10 and cut15). The difference in terms of MAP is about 6% or 12%, but the gain in terms of CR is about 293% or 337%, respectively. In our opinion, in a real-world scenario where the assignment of class labels to video documents should be completely automatic, it would be essential to take into account the overall ratio of correctly assigned labels. Our proposal for future evaluations is to combine measures that take into account the position of the correctly assigned labels in a result set (like MAP or averaged R-Precision) with the micro or macro correct classification rate.
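The micro and macro variants of the correct classification rate mentioned above are not defined in the text. One common reading, given here as an assumption, uses the set of class labels $L$, the set $A_l$ of documents assigned to label $l$, and the subset $C_l$ of those assignments that are correct, with every $A_l$ non-empty for the macro variant:

```latex
CR_{micro} = \frac{\sum_{l \in L} |C_l|}{\sum_{l \in L} |A_l|},
\qquad
CR_{macro} = \frac{1}{|L|} \sum_{l \in L} \frac{|C_l|}{|A_l|}
```

The micro variant weights every single assignment equally, so labels with many assigned documents dominate; the macro variant weights every label equally, regardless of how many documents it received.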

4 Result Analysis - Summary

The following list provides a short summary of our observations and findings from the participation in the VideoCLEF classification task in 2009.

Classification as an IR task: Based on the experiences from last year, we conclude that treating the given task as a traditional IR task with some modifications is a quite successful approach.

Query Expansion: Both types of query expansion improved the results in terms of MAP and AR but had very low correct classification rates (CR).

Metadata Sources: Combining ASR output and archival metadata improves MAP and AR when no query expansion is used. For those experiments where query expansion was used, there is no gain in terms of MAP and AR when comparing archival metadata runs to experiments that used both data sources.

Label Limits: We compared an automatically calculated threshold to low manually set thresholds and found that the automatic threshold works better in terms of MAP and AR.

Evaluation Measure: In our opinion, using MAP as the evaluation measure for a multi-label classification task is questionable. We would prefer a measure that takes into account both correct classification rate and averaged recall.

5 Conclusion and Future Work

This year we used the Xtrieval framework for the VideoCLEF classification task. Our experimental evaluation confirms the observations from last year, where approaches treating the task as an IR problem were most successful. We proposed an automatic threshold to limit the number of assigned documents per class label in order to keep correct classification rates high. This seems to be the main issue that could be worked on in the future. A manual limit on assigned documents per label is not an appropriate solution to a comparable real-world problem, where possibly tens or hundreds of thousands of video documents should be labeled with maybe hundreds of different topic labels. Furthermore, one could evaluate different retrieval models or combine the results from those models to gain better overall performance. Finally, it should be evaluated whether assigning field boosts to the metadata sources could improve performance in the combined retrieval setting.

Acknowledgments

We would like to thank the VideoCLEF organizers and the Netherlands Institute of Sound and Vision (Beeld & Geluid) for providing the data sources for the task.

This work was accomplished in conjunction with the project sachsMedia, which is funded by the Entrepreneurial Regions program of the German Federal Ministry of Education and Research.

References

[1] Jiyin He, Xu Zhang, Wouter Weerkamp, and Martha Larson. The University of Amsterdam at VideoCLEF 2008. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark, 2008.

[2] Jens Kürsten, Daniel Richter, and Maximilian Eibl. VideoCLEF 2008: ASR Classification based on Wikipedia Categories. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark, 2008.

[3] Jens Kürsten, Thomas Wilhelm, and Maximilian Eibl. Extensible Retrieval and Evaluation Framework: Xtrieval. LWA 2008: Lernen - Wissen - Adaption, Würzburg, October 2008, Workshop Proceedings, 2008.

[4] Martha Larson, Eamonn Newman, and Gareth Jones. Overview of VideoCLEF 2008: Automatic Generation of Topic-based Feeds for Dual Language Audio-Visual Content. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark, 2008.

[5] Martha Larson, Eamonn Newman, and Gareth Jones. Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In Francesca Borri, Alessandro Nardi, and Carol Peters, editors, Working Notes of CLEF 2009, September 2009.

[6] Eamonn Newman and Gareth J. F. Jones. DCU at VideoCLEF 2008. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark, 2008.

[7] José Perea-Ortega, Arturo Montejo-Ráez, and M. Teresa Martín-Valdivia. SINAI at VideoCLEF 2008. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark, 2008.

[8] Julio Villena-Román and Sara Lana-Serrano. MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark, 2008.