
VideoCLEF 2008: ASR Classification based on Wikipedia Categories

Jens Kürsten, Daniel Richter and Maximilian Eibl
Chemnitz University of Technology, Faculty of Computer Science, Dept. Computer Science and Media
09107 Chemnitz, Germany
[ jens.kuersten

This article describes our participation in the VideoCLEF track of the CLEF campaign 2008. We designed and implemented a prototype for the classification of the video ASR data. Our approach was to treat the task as a text classification problem. We used terms from Wikipedia categories as training data for our text classifiers. For the text classification, the Naive-Bayes and kNN classifiers from the WEKA toolkit were used. We submitted experiments for classification tasks 1 and 2. For the translation of the feeds to English (translation task), Google's AJAX Language API was used. The evaluation of the classification task showed poor results for our experiments, with a precision between 10 and 15 percent. These values did not meet our expectations. Interestingly, we could not improve the quality of the classification by using the provided metadata. However, the translation of the RSS Feeds was good.

Keywords: Automatic Speech Transcripts, Video Classification

Introduction

is given in section 4. The final section concludes our experiments with respect to our expectations and gives an outlook on future work.

2 System Architecture

The general architecture of the system we used is illustrated in figure 1. Besides the given input data (archival metadata, ASR transcripts and RSS items), we used a snapshot of the English and the Dutch Wikipedia as external training data. We extracted terms related to the given categories by applying a category mapping. These extracted terms were later used as training data for our text classifiers. In the following subsections we describe the components and operational steps of our system.

[Figure 1: General system architecture. Inputs: Wikipedia categories, archival metadata, MPEG-7 files and RSS items. Components: term extraction, metadata extraction, ASR output extraction, word processor (tokenizer, stopword removal, stemmer), WEKA toolkit (training term dictionary from Wikipedia, test term dictionary from ASR transcripts, classifier building, category classifier, testing), RSS item reader, translation module, feed writer and RSS feed generator. Output: labeled RSS Feeds.]

The training of the classifier consists of three essential steps that are explained in the subsections below. First, a fixed number of terms was extracted for each of the 10 categories by using the JWPL library [4]. These terms were then used to train two classifiers of the WEKA toolkit, namely the Naive-Bayes and the kNN (with k = 4) classifiers. In the last step of the training, the classifiers were stored so that they remained available for the later classification step.

2.1.1 Wikipedia Term Extraction

Before the terms were extracted, we needed to specify a mapping between the two source language categories and the available Wikipedia categories. The specified categories formed the starting points for the Wikipedia term extraction procedure. The final mapping is presented in table 1. To create a training set, we extracted a specified number (TMAX) of unique terms from both Wikipedia snapshots by using the JWPL library (footnote 3). This maximum number of terms is one of the most important parameters of the complete system. We conducted several experiments with different values for TMAX, varying from 3000 to 10000 (see section Evaluation for more details). Since the extraction of the terms is very time consuming due to the large size of Wikipedia, we also stored the training term dictionaries (TRTD) for the categories and for different values of the parameter TMAX. Each training term dictionary consists of a simple term list with term occurrence frequencies.

Another important parameter of the system, and also of the creation of the TRTDs, is the depth (D) to which we descend in the Wikipedia link structure. The maximum size of each TRTD directly depends on the parameter D, because we could extract a sufficient number of unique terms only when we descended to a certain depth in the linking structure of the Wikipedia category tree.
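The interplay of D and TMAX can be illustrated with a minimal sketch. It is not the JWPL-based implementation described above: the function name build_trtd, the two dictionaries that stand in for the Wikipedia category graph, and the rule of admitting new terms only while fewer than TMAX unique terms have been collected are assumptions made for illustration.

```python
from collections import Counter, deque


def build_trtd(root, subcategories, page_terms, depth, tmax):
    """Collect term frequencies for one category by breadth-first descent.

    subcategories: dict mapping a category name to its child category names
    page_terms:    dict mapping a category name to the terms of its pages
    depth:         maximum number of levels to descend (parameter D)
    tmax:          maximum number of unique terms to collect (parameter TMAX)
    All four inputs are hypothetical stand-ins for a Wikipedia snapshot.
    """
    trtd = Counter()
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, level = queue.popleft()
        for term in page_terms.get(category, []):
            # Known terms are always counted; new terms only below the cap.
            if term in trtd or len(trtd) < tmax:
                trtd[term] += 1
        if level < depth:
            for child in subcategories.get(category, []):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, level + 1))
    return trtd


# Toy category graph (invented data, not taken from Wikipedia).
toy_subcategories = {"Music": ["Jazz", "Opera"], "Jazz": ["Bebop"]}
toy_page_terms = {
    "Music": ["sound", "melody"],
    "Jazz": ["swing", "melody"],
    "Opera": ["aria"],
    "Bebop": ["tempo"],
}

shallow = build_trtd("Music", toy_subcategories, toy_page_terms, depth=0, tmax=10)
deep = build_trtd("Music", toy_subcategories, toy_page_terms, depth=2, tmax=10)
```

In this toy graph the shallow descent yields fewer unique terms than the deep one, which mirrors the observation that a sufficient number of terms is only reachable from a certain depth onwards.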

2.1.3 Word Processing

Before the extracted terms were added to the TRTD, they were processed by our word processor (WP). The word processor simply applied a language-specific stopword list and reduced each term to its root with the help of the Snowball stemmers (footnote 4) for English and Dutch.

2.1.4 TRTD Balancing

After our first experiments with the creation of the TRTDs for all 10 categories, we discovered that the TRTDs were unbalanced with respect to the number of unique terms. This is because the categories have different total numbers of sub-categories, and these in turn contain different numbers of terms. To prevent some categories from getting a large weight because of a high TMAX that could never be reached by a category with a smaller number of pages, we decided to implement two different thresholds to balance the TRTDs in terms of their size. The first strategy was simply to use the number of terms of the smallest category as TMAX, but it turned out that this creates bad classifications when TMAX and D are small. So we decided to use the mean of the numbers of terms of all 10 categories, which means that some categories might have too few terms, but in general the TRTDs are balanced.
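The two balancing strategies can be sketched as follows. The function names and the toy data are hypothetical, and truncating each dictionary to its most frequent terms is an assumption, since the text does not state which terms are kept when a TRTD exceeds the common limit.

```python
from collections import Counter
from statistics import mean


def common_tmax(term_counts, strategy="mean"):
    """Derive one TMAX for all categories from their unique-term counts."""
    if strategy == "min":
        # First strategy: size of the smallest category.
        return min(term_counts.values())
    if strategy == "mean":
        # Second strategy: mean size over all categories.
        return int(mean(term_counts.values()))
    raise ValueError(f"unknown strategy: {strategy}")


def balance(trtds, strategy="mean"):
    """Cut every TRTD down to the common TMAX (assumed: keep frequent terms)."""
    tmax = common_tmax({c: len(t) for c, t in trtds.items()}, strategy)
    return {c: Counter(dict(t.most_common(tmax))) for c, t in trtds.items()}


# Invented example with one large and one small category.
toy_trtds = {
    "science": Counter(a=5, b=3, c=2, d=1),
    "paintings": Counter(x=4, y=1),
}
balanced_min = balance(toy_trtds, "min")
balanced_mean = balance(toy_trtds, "mean")
```

With the mean strategy the small category stays below the common limit, which corresponds to the remark that some categories might have too few terms.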

3 http://www.ukp.tu-darmstadt.de/software/jwpl
4 http://snowball.tartarus.org

2.1.5 TRTD Discrimination

For a better discrimination of the categories we implemented a training term duplication threshold (WT). This threshold is used to delete terms from the TRTDs that occur in at least WT categories. We assumed that this might help during the classification step. Our idea is that a natural term distribution as found in Wikipedia cannot be categorized very well. By implementing this assumption we hoped to improve the precision of the classification.
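A minimal sketch of the WT threshold is given below. The function name and the toy dictionaries are hypothetical; only the rule itself (delete every term that occurs in at least WT category TRTDs) is taken from the text.

```python
from collections import Counter


def remove_shared_terms(trtds, wt):
    """Drop every term that occurs in at least `wt` category dictionaries."""
    category_count = Counter()
    for terms in trtds.values():
        # Count each term once per category, regardless of its frequency.
        category_count.update(set(terms))
    return {
        category: {t: f for t, f in terms.items() if category_count[t] < wt}
        for category, terms in trtds.items()
    }


# Invented term dictionaries: term -> occurrence frequency.
toy_trtds = {
    "music": {"melody": 5, "history": 1, "art": 2},
    "history": {"war": 4, "history": 7, "art": 1},
    "visual arts": {"art": 9, "canvas": 3},
}
strict = remove_shared_terms(toy_trtds, wt=2)
lenient = remove_shared_terms(toy_trtds, wt=3)
```

With wt=2 only terms unique to one category survive, while wt=3 keeps terms that are shared by two categories.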

Another parameter that might be useful for the discrimination of the TRTDs is the frequency-based selection (FS) of the terms. As mentioned before, we selected a maximum number of terms (TMAX) for each category. Different strategies are possible here, because the TRTDs most likely contain many more terms than we want to extract. We implemented two options for the selection of the terms. The first is simply to use the terms with the highest occurrence frequency; the second is to take the average term occurrence frequency and to extract 0.5 times TMAX terms each from above and below this average.
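The two selection options can be sketched as follows. The function name is hypothetical, and for the second option the sketch assumes that the terms closest to the average are taken on both sides, which the text does not specify.

```python
from statistics import mean


def select_terms(trtd, tmax, fs):
    """Select at most `tmax` terms from a term -> frequency dictionary.

    fs == 0: terms with the highest occurrence frequency
    fs == 1: tmax // 2 terms just above and tmax // 2 terms just below
             the average frequency (assumed reading of option two)
    """
    # Sort by descending frequency; ties are broken alphabetically.
    ranked = sorted(trtd.items(), key=lambda item: (-item[1], item[0]))
    if fs == 0:
        return [term for term, _ in ranked[:tmax]]
    average = mean(trtd.values())
    above = [term for term, freq in ranked if freq >= average]
    below = [term for term, freq in ranked if freq < average]
    half = tmax // 2
    # The tail of `above` and the head of `below` are closest to the average.
    return above[max(len(above) - half, 0):] + below[:half]


# Invented frequencies; their average is about 5.17.
toy_trtd = {"a": 10, "b": 8, "c": 6, "d": 4, "e": 2, "f": 1}
high_frequency = select_terms(toy_trtd, tmax=2, fs=0)
mid_frequency = select_terms(toy_trtd, tmax=2, fs=1)
```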

2.1.6 TRTD Term Statistics

descending in the link structure (D) is increased. In the English Wikipedia the category science contained the largest number of terms, followed by history, music and visual arts. The smallest number of terms could be extracted for the category paintings. In our opinion the statistics allow us to draw the following conclusions for the parameter sets. With WT=2, i.e. when all terms that occur in at least two category TRTDs were removed, we could create the most balanced TRTDs. All parameter sets with WT>3 create TRTDs with more realistic term distributions.

For the term statistics of the Dutch Wikipedia one could draw similar conclusions, but there are some differences. The most important difference is the smaller number of entries in the Dutch Wikipedia, which generally results in smaller TRTDs. The distribution of the specified categories is also a little different. There are no outliers like science or paintings, which follows from the smaller number of pages. For the Dutch Wikipedia the category dans produced the smallest TRTD.

2.1.7 Training Setup

In the first step of the classifier training process we loaded the relevant TRTD for each category. Thereafter, we fed the instances of the TRTD into the Naive-Bayes and kNN classifiers. Finally, the classifiers were stored, because we wanted to keep them for further evaluations of different parameter sets for the complete system.

For the preparation of the classification it was necessary to parse the ASR transcripts and to extract the textual information. We also parsed the metadata that could be used for classification task 2. We used the same procedure for the creation of the test term dictionaries (TSTD) as for the creation of the TRTDs: first the word processor removes stopwords, and then it stems all terms to their root. For the TSTDs we also applied a parameter (VT) for the removal of duplicate terms. We hoped this would help in discriminating the ASR transcripts.

In the classification process the stored classifiers were reloaded into memory. They were then used to classify the contents of the TSTD for each video. The results of the classification are 10 probabilities for the membership of the video in each of the 10 specified categories. These probabilities sum to 1 for each term of the TSTD. This was repeated for all terms in the TSTD in order to get a final classification. For classification task 2 we also used the terms from the metadata files for classification.
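The per-term classification can be illustrated with a short sketch. The text does not state how the per-term probabilities are combined into one result per video, so averaging them is an assumption, and term_distribution is a hypothetical stand-in for a stored WEKA classifier that returns one probability per category for a term.

```python
def classify_video(terms, term_distribution, categories):
    """Average the per-term category probabilities of one video.

    terms:             terms of the video's TSTD
    term_distribution: callable returning {category: probability} for a term
                       (placeholder for a trained classifier)
    categories:        list of category names
    """
    totals = {category: 0.0 for category in categories}
    for term in terms:
        distribution = term_distribution(term)
        for category in categories:
            totals[category] += distribution.get(category, 0.0)
    count = len(terms)
    return {c: (totals[c] / count if count else 0.0) for c in categories}


# Invented two-category example with a lookup table as "classifier".
toy_categories = ["music", "history"]
toy_lookup = {
    "melody": {"music": 0.9, "history": 0.1},
    "war": {"music": 0.2, "history": 0.8},
}
uniform = {c: 1.0 / len(toy_categories) for c in toy_categories}
video_scores = classify_video(
    ["melody", "war"], lambda term: toy_lookup.get(term, uniform), toy_categories
)
```

Because every per-term distribution sums to 1, the averaged scores of a video also sum to 1.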

As the next step we normalized the returned classifications, i.e. each of the 10 specified categories was normalized to find the final classification of the videos. The normalization is defined as the sum of the arithmetic mean and the standard deviation of each category. This sum was used as the final classification threshold (CT) for each corresponding category.
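The threshold computation can be sketched as follows. The sketch assumes that mean and standard deviation are taken per category over the scores of all videos and that the population standard deviation is meant; neither detail is given in the text, and the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def classification_thresholds(scores):
    """Compute CT = mean + standard deviation for every category.

    scores: dict video_id -> {category: score}; every video is assumed
            to carry a score for the same set of categories.
    """
    categories = list(next(iter(scores.values())))
    thresholds = {}
    for category in categories:
        values = [video[category] for video in scores.values()]
        # Assumption: population standard deviation over all videos.
        thresholds[category] = mean(values) + pstdev(values)
    return thresholds


# Invented scores for two videos and two categories.
toy_scores = {
    "video1": {"music": 0.2, "history": 0.8},
    "video2": {"music": 0.6, "history": 0.4},
}
ct = classification_thresholds(toy_scores)
```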

In the last step the final classification was created. To this end, we iteratively decreased a predefined score (S), which starts out larger than CT, until at least one of the ten CT values is larger than S. Finally, we compared the resulting S with the CT values of all 10 categories and assigned the corresponding classification to the video.

2.4 RSS Feed Creation

The RSS Feeds were created continuously during the last step of the classification: the RSS item for each video was added to the corresponding category RSS Feeds.

2.5 RSS Feed Translation

The translation of the RSS Feeds was conducted when all categories were complete. For the translation we used Google's AJAX Language API (footnote 5), which is the actual translation component of the Xtrieval framework [1]. The translation was technically limited to a maximum of 100 characters per request. Therefore we split the feed contents into sentences and translated these. Thereafter we rebuilt the RSS Feed in the target language.
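The splitting step can be sketched as follows. The sketch does not call Google's API: translate is a hypothetical callable that stands in for the translation component, the sentence boundary rule is a simple regular expression, and breaking overlong sentences at word boundaries is an assumption that goes beyond the text.

```python
import re

MAX_CHARS = 100  # per-request limit reported in the text


def split_for_translation(text, max_chars=MAX_CHARS):
    """Split text into sentences and cut overlong sentences at spaces."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    for sentence in sentences:
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            # A single word longer than the limit is passed through unsplit.
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = word
        if current:
            chunks.append(current)
    return chunks


def translate_feed_text(text, translate):
    """Translate chunk by chunk and rebuild the text."""
    return " ".join(translate(chunk) for chunk in split_for_translation(text))


# Usage with a dummy "translator" that only upper-cases its input.
toy_text = "Dit is een zin. Dit is nog een zin!"
toy_chunks = split_for_translation(toy_text)
toy_result = translate_feed_text(toy_text, str.upper)
```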

3 Evaluation

This section provides experimental results on the development and test sets. First, we describe the determination of the parameters by using the development set, and finally we present the setup of the complete system for the experiments on the test set.

3.1 Parameter Tuning with Development Data

We used the development data to tune the parameter set of our system for the experiments on the test data. The system has six important parameters:

- Depth of Wikipedia Category Extraction (D)
- Frequency-based Selection of Training Terms (FS); 0 for high frequency terms and 1 for mid frequency terms
- Maximum Number of Training Terms (TMAX)
- Training Term Duplicate Deletion (WT); e.g. 5 for deletion of terms that appear in at least 5 categories
- Test Term Duplicate Deletion (VT); e.g. 5 for deletion of terms that appear in at least 5 video ASR transcripts
- Classifiers (C); we used both Naive-Bayes and kNN (k = 4) for all experiments

We derived two possibly useful parameter sets from table 3. First, for large TRTDs with TMAX > 3000 the parameter set (0;3;2;5) seemed to be promising. For smaller TRTDs with TMAX ≤ 3000 the parameter set (1;2;2;2) could be useful. Unfortunately, we tested the configuration with TMAX = 1000 only after the submission deadline.

5 http://code.google.com/apis/ajaxlanguage/documentation

3.2 Experimental Setup and Results

We submitted two experiments for each of the two classification tasks. The results of the evaluation are presented in table 4. The results were not very good and did not meet our expectations and observations on the development data. Interestingly, using the metadata in classification task 2 did not improve the classification performance in either case.

Additionally, we submitted a translation of the RSS Feeds. The translation was evaluated by three assessors in terms of fluency (1-5) and adequacy (1-5). The higher the score, the better the quality of the translation. The results are summarized in table 5.

4 Result Analysis - Summary

The following items conclude our observations of the experimental evaluation:

Classification task 1: The quality of the video classification was not as good as expected, both in terms of precision and in terms of recall.

Classification task 2: Surprisingly, the quality of the video classification could not be improved by utilizing the given metadata. The reason for that might be the small impact of the metadata in comparison to the large size of the TRTD we used.

Translation task: The translation of the RSS Feeds was quite good, but there is also room for improvement, especially in terms of fluency.

5 Conclusion and Future Work

The experiments showed that the classification of dual-language video based on ASR transcripts is quite a hard task. Nevertheless, we presented an idea to tackle the problem. There are, however, a number of points where the system can be improved. The two most important problems are the size of the training data on the one hand and the balance of the categories on the other hand. We consider omitting the TRTD balancing step and shrinking the TRTD size in further experiments. Another option might be to weight the TRTD based on an approximated distribution of the categories in the video collection, because this could be a good indicator of how to find the correct classes for a given video.

References

[1] Jens Kürsten, Thomas Wilhelm, and Maximilian Eibl. Extensible retrieval and evaluation framework: Xtrieval. LWA 2008: Lernen - Wissen - Adaption, Würzburg, October 2008, Workshop Proceedings, October 2008, to appear.

[2] Martha Larson, Eamonn Newman, and Gareth Jones. Overview of VideoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content. CLEF 2008: Workshop Notes, September 2008.

[3] Ian Witten and Eibe Frank. Data mining: practical machine learning tools and techniques. Elsevier, Morgan Kaufmann, Amsterdam, 2nd edition, 2005.

[4] Torsten Zesch, Christof Müller, and Iryna Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), May 2008.