<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VideoCLEF 2008: ASR Classification based on Wikipedia Categories</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Jens Kursten, Daniel Richter and Maximilian Eibl Chemnitz University of Technology Faculty of Computer Science, Dept.</institution>
          <addr-line>Computer Science and Media 09107 Chemnitz, Germany [ jens.kuersten</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article describes our participation in the VideoCLEF track of the CLEF campaign 2008. We designed and implemented a prototype for the classification of the video ASR data. Our approach was to regard the task as a text classification problem. We used terms from Wikipedia categories as training data for our text classifiers. For the text classification the Naive-Bayes and kNN classifiers from the WEKA toolkit were used. We submitted experiments for classification tasks 1 and 2. For the translation of the feeds to English (translation task) Google's AJAX language API was used. The evaluation of the classification task showed poor results for our experiments, with a precision between 10 and 15 percent. These values did not meet our expectations. Interestingly, we could not improve the quality of the classification by using the provided metadata. The translation of the RSS Feeds, however, was good.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Speech Transcripts</kwd>
        <kwd>Video Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>[...] is given in section 4. The final section concludes our experiments with respect to our expectations and gives
an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>System Architecture</title>
      <p>The general architecture of the system we used is illustrated in figure 1. Besides the given input data (archival
metadata, ASR transcripts and RSS items) we used a snapshot of the English and the Dutch Wikipedia as
external training data. We extracted terms related to the given categories by applying a category mapping.
These extracted terms were later used as training data for our text classifiers. In the following subsections
we describe the components and operational steps of our system.</p>
      <p>[Figure 1: General architecture of the system. Labels recoverable from the diagram: Wikipedia Categories; Archival Metadata (MPEG-7 Files); RSS Items; Labeled RSS Feeds; Term Extraction; Metadata Extraction; ASR Output Extraction; Word Processor (Tokenizer, Stopword Removal, Stemmer); RSS Item Reader; Translation Module; Feed Writer; RSS Feed Generator; Weka Toolkit (Training: Training Term Dictionary (Wikipedia), Build Classifier; Testing: Category Classifier, Test Term Dictionary (ASR Transcripts)).]</p>
      <p>The training of the classifier consists of three essential steps, which are explained in the subsections below.
First, a fixed number of terms was extracted for each of the 10 categories by using the JWPL library [4].
These terms were then used to train two classifiers of the WEKA toolkit, namely the Naive-Bayes and the
kNN (with k = 4) classifier. In the last step of the training the classifiers were stored so that they
remained available for the later classification step.</p>
      <p>Wikipedia Term Extraction</p>
      <p>Before the terms could be extracted, we needed to specify a mapping between the categories of the two source
languages and the available Wikipedia categories. The specified categories formed the starting points for the
Wikipedia term extraction procedure. The final mapping is presented in table 1.
To create a training set we extracted a specified number (TMAX) of unique terms from both Wikipedia
snapshots by using the JWPL library (footnote 3). This maximum number of terms is one of the most important
parameters of the complete system. We conducted several experiments with different values for TMAX,
varying from 3000 to 10000 (see section Evaluation for more details). Since the extraction of the terms is very
time consuming due to the large size of Wikipedia, we also stored the training term dictionaries (TRTD)
for the categories and for different values of the parameter TMAX. The training term dictionaries consist
of a simple term list with term occurrence frequencies.</p>
      <p>Another important parameter of the system, and also of the creation of the TRTDs, is the depth (D) to which
we descend in the Wikipedia link structure. The maximum size of each TRTD directly depends on the
parameter D, because only by descending to a certain depth in the link structure of the Wikipedia
category tree could we extract a sufficient number of unique terms.</p>
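      <p>The interplay of D and TMAX can be illustrated with a minimal sketch. It is not the JWPL-based extraction used in our system: the category graph, the per-category term lists and the function name collect_terms are hypothetical stand-ins, and a breadth-first descent is assumed.</p>
      <preformat><![CDATA[
```python
from collections import Counter, deque


def collect_terms(graph, page_terms, root, max_depth, tmax):
    """Collect term frequencies below `root`, descending at most `max_depth` levels.

    graph:      category -> list of sub-categories (hypothetical structure)
    page_terms: category -> list of terms found on its pages (hypothetical)
    The descent stops as soon as at least `tmax` unique terms were gathered,
    so the result can hold more than `tmax` terms; trimming is a later step.
    """
    terms = Counter()
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(terms) < tmax:
        category, depth = queue.popleft()
        terms.update(page_terms.get(category, []))
        if depth < max_depth:
            for sub in graph.get(category, []):
                if sub not in seen:  # categories can be linked more than once
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return terms


graph = {"music": ["jazz", "opera"], "jazz": ["bebop"]}
page_terms = {
    "music": ["song", "band"],
    "jazz": ["sax", "band"],
    "opera": ["aria"],
    "bebop": ["tempo"],
}
shallow = collect_terms(graph, page_terms, "music", max_depth=0, tmax=100)
deeper = collect_terms(graph, page_terms, "music", max_depth=1, tmax=100)
print(len(shallow), len(deeper))
```
]]></preformat>
      <p>In this toy graph a larger depth yields more unique terms, which mirrors the dependency between D and the achievable TRTD size described above.</p>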
      <p>Word Processing</p>
      <p>Before the extracted terms were added to the TRTD, they were processed by our word processor (WP). The
word processor simply applied a language-specific stopword list and reduced each term to its root with the
help of the Snowball stemmers (footnote 4) for English and Dutch.</p>
      <p>TRTD Balancing</p>
      <p>After our first experiments with the creation of the TRTDs for all 10 categories we discovered that the
TRTDs were unbalanced with respect to the number of unique terms. This is due to the fact that the
categories have different total numbers of sub-categories, and these in turn contain different numbers of terms.
To prevent some categories from getting a large weight because of a high TMAX that could never be reached
by a category with a smaller number of pages, we implemented two different thresholds to balance
the TRTDs in terms of their size. The first strategy was simply to use the number of terms of the smallest
category as TMAX, but it turned out that this creates bad classifications when TMAX and D are small. We
therefore decided to use the mean of the term counts of all 10 categories, which means that some categories might
have too few terms, but in general the TRTDs are balanced.</p>
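      <p>The two balancing thresholds can be sketched as follows. This is only an illustration under assumptions: the TRTDs are modelled as Python Counter objects, the names balanced_tmax and balance are hypothetical, and cutting a dictionary down to its most frequent terms is one possible way to enforce the limit.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def balanced_tmax(trtds, strategy="mean"):
    """Return the size limit: smallest TRTD ("min") or mean TRTD size ("mean")."""
    sizes = [len(terms) for terms in trtds.values()]
    if strategy == "min":
        return min(sizes)
    return sum(sizes) // len(sizes)


def balance(trtds, strategy="mean"):
    """Cut every TRTD down to the limit, keeping its most frequent terms.

    Dictionaries that are already smaller than the limit stay unchanged,
    so with the mean strategy some categories keep fewer terms than others.
    """
    limit = balanced_tmax(trtds, strategy)
    return {
        category: Counter(dict(terms.most_common(limit)))
        for category, terms in trtds.items()
    }


trtds = {
    "science": Counter(a=5, b=3, c=2, d=1, e=1, f=1),
    "music": Counter(x=4, y=2, z=1),
    "dance": Counter(p=2, q=1, r=1),
}
balanced = balance(trtds, "mean")
print({category: len(terms) for category, terms in balanced.items()})
```
]]></preformat>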
      <sec id="sec-2-1">
        <title>Footnotes: 3 http://www.ukp.tu-darmstadt.de/software/jwpl; 4 http://snowball.tartarus.org</title>
        <p>TRTD Discrimination</p>
        <p>For a better discrimination of the categories we implemented a training term duplication threshold (WT).
This threshold is used to delete terms from the TRTDs that occur in at least WT categories. We assumed
that this might help during the classification step. Our assumption was that the natural term distribution
found in Wikipedia cannot be categorized very well. By implementing this assumption we hoped to
improve the precision of the classification.</p>
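        <p>A minimal sketch of the duplication threshold WT is given below. The dictionary layout and the function name remove_shared_terms are assumptions made for illustration only.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def remove_shared_terms(trtds, wt):
    """Drop every term that occurs in at least `wt` category dictionaries."""
    category_count = Counter()
    for terms in trtds.values():
        category_count.update(set(terms))  # count each term once per category
    return {
        category: Counter(
            {term: freq for term, freq in terms.items() if category_count[term] < wt}
        )
        for category, terms in trtds.items()
    }


trtds = {
    "music": Counter(band=3, song=2, year=1),
    "history": Counter(war=4, year=2, band=1),
    "science": Counter(atom=5, year=1),
}
strict = remove_shared_terms(trtds, wt=2)   # removes "band" and "year"
lenient = remove_shared_terms(trtds, wt=3)  # removes only "year"
print(sorted(strict["music"]), sorted(lenient["music"]))
```
]]></preformat>
        <p>With WT=2 every term that is shared by two or more categories disappears, so each remaining term points to exactly one category.</p>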
        <p>Another parameter that might be useful for the discrimination of the TRTDs is the frequency-based
selection (FS) of the terms. As mentioned before, we selected a maximum number of terms (TMAX) for each
category. Different strategies are possible here, because the TRTDs most likely contain many more terms
than we want to extract. We implemented two options for the selection of the terms. The first is simply to
use the terms with the highest occurrence frequency. The second is to take the average term occurrence frequency
and to extract 0.5 times TMAX of the terms above and 0.5 times TMAX of the terms below this average.</p>
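        <p>The two selection options can be sketched as follows. The sketch rests on assumptions that the description above leaves open: for FS = 1 the terms closest to the average frequency are taken on either side, terms with exactly the average frequency count as "above", and the name select_terms is hypothetical.</p>
        <preformat><![CDATA[
```python
def select_terms(term_freqs, tmax, fs):
    """Select up to `tmax` terms from a term -> frequency mapping.

    fs == 0: the `tmax` most frequent terms.
    fs == 1: tmax // 2 terms just above and tmax // 2 terms just below
             the mean occurrence frequency (mid-frequency terms).
    """
    if not term_freqs:
        return []
    # Highest frequency first; ties are ordered alphabetically for determinism.
    ranked = sorted(term_freqs.items(), key=lambda item: (-item[1], item[0]))
    if fs == 0:
        return [term for term, _ in ranked[:tmax]]
    mean = sum(term_freqs.values()) / len(term_freqs)
    above = [term for term, freq in ranked if freq >= mean]
    below = [term for term, freq in ranked if freq < mean]
    half = tmax // 2
    if half == 0:
        return []
    # The tail of `above` and the head of `below` lie closest to the mean.
    return above[-half:] + below[:half]


freqs = {"a": 10, "b": 8, "c": 6, "d": 4, "e": 2, "f": 1, "g": 1}
print(select_terms(freqs, tmax=4, fs=0))
print(select_terms(freqs, tmax=4, fs=1))
```
]]></preformat>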
        <p>TRTD Term Statistics</p>
        <p>[...] descending in the link structure (D) is increased. In the English Wikipedia the category science contained
the largest number of terms, followed by history, music and visual arts. The smallest number of terms could
be extracted for the category paintings. In our opinion the statistics allow the following conclusions
for the parameter sets. With WT=2, i.e. when all terms that occur in at least two
category TRTDs were removed, we could create the most balanced TRTDs. All parameter sets with WT&gt;3 create TRTDs
with more realistic term distributions.</p>
        <p>For the term statistics of the Dutch Wikipedia one can draw similar conclusions, but there are some
differences. The most important difference is the smaller number of entries in the Dutch Wikipedia, which
generally results in smaller TRTDs. The distribution of the specified categories is also slightly different. There
are no outliers like science or paintings, which is a consequence of the smaller number of pages. For
the Dutch Wikipedia the category dans produced the smallest TRTD.</p>
        <p>Training Setup</p>
        <p>In the first step of the classifier training process we loaded the relevant TRTD for each category. Thereafter,
we fed the instances of the TRTD into the Naive-Bayes and kNN classifiers. Finally, the classifiers were
stored, because we wanted them to remain available for further evaluations of different parameter sets for the complete
system.</p>
        <p>For the preparation of the classification it was necessary to parse the ASR transcripts and to extract the
textual information. We also parsed the metadata that could be used for classification task 2. We used
the same procedure for the creation of the test term dictionaries (TSTD) as for the creation
of the TRTDs. First the word processor removes stopwords and then it stems all terms to their root. For
the TSTDs we also applied a parameter (VT) for the removal of duplicate terms. We hoped this would help
in discriminating the ASR transcripts.</p>
        <p>In the classification process the stored classifiers were reloaded into memory. They were then used to classify
the contents of the TSTD for each video. For each term of the TSTD, the result of the classification is a set of 10
probabilities for the membership of the video in each of the 10 specified categories. These probabilities sum to 1 for each
term. This was repeated for all terms in the TSTD in order to get a final classification. For
classification task 2 we also used the terms from the metadata files for classification.</p>
        <p>As the next step we normalized the returned classifications, i.e. each of the 10 specified categories was
normalized to find the final classification of the videos. The normalization is defined as the sum of the
arithmetic mean and the standard deviation of each category. This sum was used as the final classification
threshold (CT) for the corresponding category.</p>
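        <p>The threshold computation can be sketched as follows. Two points are assumptions: the mean and the standard deviation are taken over the per-term probabilities of one video, and the population standard deviation is used. The function name classification_thresholds is hypothetical.</p>
        <preformat><![CDATA[
```python
from statistics import mean, pstdev


def classification_thresholds(term_probabilities):
    """Compute CT = mean + standard deviation for every category.

    term_probabilities: one dict per term of a TSTD, mapping each
    category to the probability returned by the classifier.
    """
    if not term_probabilities:
        return {}
    thresholds = {}
    for category in term_probabilities[0]:
        values = [probs[category] for probs in term_probabilities]
        thresholds[category] = mean(values) + pstdev(values)
    return thresholds


# Two terms, two categories; each row sums to 1.
per_term = [
    {"music": 0.8, "history": 0.2},
    {"music": 0.4, "history": 0.6},
]
ct = classification_thresholds(per_term)
print(ct)
```
]]></preformat>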
        <p>In the last step the final classification was created. To this end we iteratively decreased a predefined score
(S), which initially is larger than every CT, until at least one of the ten CT values was larger than S. Finally, we
compared the resulting S with the CT values of all 10 categories and assigned the corresponding classification to
the video.</p>
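        <p>One possible reading of this last step is sketched below. The start value, the step width and the rule that every category with CT above the final S is assigned are assumptions, as is the function name assign_categories.</p>
        <preformat><![CDATA[
```python
def assign_categories(thresholds, start=2.0, step=0.05):
    """Lower the score S until at least one CT exceeds it, then assign.

    thresholds: category -> CT value
    Returns the sorted list of categories whose CT is larger than the final S.
    """
    if start <= max(thresholds.values()):
        raise ValueError("start score must exceed every threshold")
    steps = 0
    score = start
    # `score > 0` guards against an endless loop when all CT values are zero.
    while score > 0 and not any(ct > score for ct in thresholds.values()):
        steps += 1
        score = start - steps * step  # recomputed to avoid accumulated rounding
    return sorted(category for category, ct in thresholds.items() if ct > score)


single = assign_categories({"music": 0.42, "history": 0.25, "science": 0.38}, start=1.0)
double = assign_categories({"music": 0.42, "science": 0.41, "history": 0.10}, start=1.0)
print(single, double)
```
]]></preformat>
        <p>Under this reading a video receives more than one label only when several CT values fall within the same step below the largest one.</p>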
        <sec id="sec-2-1-1">
          <title>RSS Feed Creation</title>
          <p>The RSS Feeds were created continuously during the last step of the classification: the RSS item
for each video was added to the RSS Feeds of the categories assigned to it.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>RSS Feed Translation</title>
          <p>The translation of the RSS Feeds was conducted when all categories were complete. For the translation we
used Google's AJAX Language API (footnote 5), which is the current translation component of the Xtrieval framework
[1]. The translation was technically limited to a maximum of 100 characters per request. Therefore
we split the Feed contents into sentences and translated these. Thereafter we rebuilt the RSS Feed in the
target language.</p>
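          <p>The splitting step can be sketched as follows. No call to the actual translation service is shown: translate is a hypothetical callable supplied by the caller, the sentence boundary pattern is a simplification, and breaking over-long sentences at word boundaries is an assumption.</p>
          <preformat><![CDATA[
```python
import re

MAX_CHARS = 100  # per-request limit reported in the text


def split_for_translation(text, max_chars=MAX_CHARS):
    """Split text into sentences and break over-long sentences at spaces.

    A single word longer than `max_chars` is kept as one over-long chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for sentence in sentences:
        current = ""
        for word in sentence.split():
            candidate = word if not current else current + " " + word
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def translate_text(text, translate, max_chars=MAX_CHARS):
    """Translate chunk by chunk with the caller-supplied `translate` function."""
    return " ".join(translate(chunk) for chunk in split_for_translation(text, max_chars))


text = "Hello world. This is a test! Short?"
print(split_for_translation(text))
print(translate_text(text, str.upper))  # str.upper stands in for a translator
```
]]></preformat>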
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>This section provides experimental results on the development and test sets. First, we describe the
determination of the parameters by using the development set, and finally we present the setup of the complete
system for the experiments on the test set.</p>
      <sec id="sec-3-1">
        <title>Parameter Tuning with Development Data</title>
        <p>We used the development data to tune the parameter set of our system for the experiments on the
test data. The system has six important parameters:</p>
        <sec id="sec-3-1-1">
          <title>Footnote 5: http://code.google.com/apis/ajaxlanguage/documentation</title>
          <p>Depth of Wikipedia Category Extraction (D)</p>
          <p>Frequency-based Selection of Training Terms (FS); 0 for high frequency terms and 1 for mid frequency
terms</p>
          <p>Maximum Number of Training Terms (TMAX)</p>
          <p>Training Term Duplicate Deletion (WT); e.g. 5 for deletion of terms that appear in at least 5 categories</p>
          <p>Test Term Duplicate Deletion (VT); e.g. 5 for deletion of terms that appear in at least 5 video ASR
transcripts</p>
          <p>Classifiers (C); we used both Naive-Bayes and kNN (k = 4) for all experiments</p>
          <p>We derived two possibly useful parameter sets from table 3. First, for large TRTDs with TMAX &gt; 3000 the
parameter set (0;3;2;5) seemed to be promising. For smaller TRTDs with TMAX ≤ 3000 the parameter set
(1;2;2;2) could be useful. Unfortunately, we tested the configuration with TMAX = 1000 only after the
submission deadline.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Experimental Setup and Results</title>
        <p>We submitted two experiments for each of the two classification tasks. The results of the evaluation are
presented in table 4.
The results were not very good and met neither our expectations nor our observations on the development data.
Interestingly, using the metadata in classification task 2 did not improve the classification performance in
either case.</p>
        <p>Additionally, we submitted a translation of the RSS Feeds. The translation was evaluated by three assessors in
terms of fluency (1-5) and adequacy (1-5). The higher the score, the better the quality of the translation.
The results are summarized in table 5.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Result Analysis - Summary</title>
      <p>The following items conclude our observations of the experimental evaluation:</p>
      <p>Classification task 1: The quality of the video classification was not as good as expected, both in terms
of precision and in terms of recall.</p>
      <p>Classification task 2: Surprisingly, the quality of the video classification could not be improved by
utilizing the given metadata. The reason for that might be the small impact of the metadata in
comparison to the large size of the TRTD we used.</p>
      <p>Translation task: The translation of the RSS Feeds was quite good, but there is still room for
improvement, especially in terms of fluency.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>The experiments showed that the classification of dual-language video based on ASR transcripts is quite
a hard task. Nevertheless, we presented an idea to tackle the problem. There are, however, a number of points where
the system can be improved. The two most important problems are the size of the training data on the one hand
and the balance of the categories on the other hand. We consider omitting the TRTD balancing step and
shrinking the TRTD size in further experiments. Another option might be to weight the TRTD based on an
approximated distribution of the categories in the video collection, because this could be a good indicator of
how to find the correct classes for a given video.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jens</given-names>
            <surname>Kürsten</surname>
          </string-name>
          , Thomas Wilhelm, and
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Eibl</surname>
          </string-name>
          .
          <article-title>Extensible retrieval and evaluation framework: Xtrieval</article-title>
          . LWA 2008: Lernen - Wissen - Adaption, Würzburg,
          <year>October 2008</year>
          , Workshop Proceedings,
          <year>October 2008</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          , Eamonn Newman, and
          <string-name>
            <given-names>Gareth</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of VideoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content</article-title>
          .
          <source>CLEF 2008: Workshop Notes</source>
          ,
          <year>September 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eibe</given-names>
            <surname>Frank</surname>
          </string-name>
          .
          <article-title>Data mining: practical machine learning tools and techniques</article-title>
          . Elsevier, Morgan Kaufmann, Amsterdam, 2nd edition,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Zesch</surname>
          </string-name>
          , Christof Muller, and Iryna Gurevych.
          <article-title>Extracting lexical semantic knowledge from wikipedia and wiktionary</article-title>
          .
          <source>Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)</source>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>