<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Chemnitz at VideoCLEF 2009: Experiments and Observations on Treating Classification as IR Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Jens Kursten and Maximilian Eibl Chemnitz University of Technology Faculty of Computer Science, Dept.</institution>
          <addr-line>Computer Science and Media 09107 Chemnitz, Germany [ jens.kuersten</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <abstract>
        <p>This paper describes the participation of the Chemnitz University of Technology in the VideoCLEF 2009 classification task. Our motivation lies in the task's close relation to our research project sachsMedia. In our second participation in the task we treated classification as an IR problem and used the Xtrieval framework [3] to run our experiments. We proposed an automatic threshold estimation to limit the number of documents per label, since too many returned documents hurt the overall correct classification rate. Although the experimental setup was enhanced this year and the data sets were changed, we found that the IR approach still works quite well. Our query expansion approach performed better than the baseline experiments in terms of mean average precision. We also showed that combining the ASR transcriptions and the archival metadata improves the classification performance, unless query expansion is used in the retrieval phase.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Speech Transcripts</kwd>
        <kwd>Video Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and Motivation</title>
      <p>
        This article describes the system and configuration used for our participation in the VideoCLEF
classification task. The task was to categorize dual-language video into 46 given classes based on provided
ASR transcripts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and additional archival metadata. In a mandatory experiment only the ASR transcripts
of the videos had to be used as source for classification. Furthermore, each of the given video documents can
have none, one or even multiple labels. Hence the task can be characterized as a real-world scenario in the
field of automatic classification.
      </p>
      <p>Our participation in the task is motivated by its close relation to our research project sachsMedia.
The main goals of the project are twofold. The first objective is the automatic extraction of low-level
features from audio and video for automated annotation of poorly described material in archives. The
second is to support local TV stations in Saxony in replacing analog distribution technology
with innovative digital distribution services. A special problem of the broadcasters is the accessibility of their
archives for end users. Although we are currently developing algorithms for automatic extraction of low-level
metadata, the VideoCLEF classification task is a direct use case within our project. The remainder of the
article is organized as follows. In section 2 we briefly review existing approaches and describe the system
architecture and its main configuration. In section 3 we present the results of preliminary and officially
submitted experiments and interpret them. A summary of our observations and experiences is given in
section 4. The final section concludes the experiments with respect to our expectations and gives an outlook
on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>System Architecture and Configuration</title>
      <p>
        Since the classification task was an enhanced modification of last year's VideoCLEF classification task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
we give a brief review of previously used approaches. There were mainly two distinct ways to approach the
classification task: (a) collecting training data from external sources like general Web content or Wikipedia
to train a text classifier or (b) treating the problem as an information retrieval task. Villena-Román and Lana-Serrano [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] combined
both ideas by obtaining training data from Wikipedia and assigning the class labels to the indexed training
data. The metadata from the video documents were used as a query on the training corpus and the dominant
label of the retrieved documents was assigned as class label. Newman and Jones [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as well as Perea-Ortega
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] approached the problem purely as an IR task and achieved similarly strong performance. Kürsten et
al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and He et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] tried to solve the problem with state-of-the-art classifiers like k-NN and SVM. Both
used Wikipedia articles to train their classifiers.
      </p>
      <sec id="sec-2-1">
        <title>Resources</title>
        <p>
          Given the impressions from last year's evaluation and the huge success of the IR approaches as well as
the enhancement of the task to a larger number of class labels and more documents, we decided to treat
the problem as an IR task. Hence we used the Xtrieval framework [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to create an index on the provided
metadata. This index was composed of three fields, one with the ASR output, one with the archival metadata
and a third one containing both. To process the tokens, a language-specific stopword list<sup>2</sup> and the Dutch
stemmer from the Snowball project<sup>3</sup> were applied. We used the class labels to query our video document
index. The Lucene<sup>4</sup> retrieval core with the default vector-based IR model was utilized within our framework.
In the retrieval phase we used an English thesaurus<sup>5</sup> in combination with the Google AJAX language API<sup>6</sup>
for query expansion purposes.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>System Configuration and Parameters</title>
        <p>The following list briefly explains some of our system parameters and their values for the experimental
evaluation.</p>
        <p>Query Expansion (QE): The most frequent term from the top-5 documents was used to reformulate
the original query.</p>
        <p>Thesaurus Term Query Expansion (TT): Thesaurus term query expansion was used for those queries
that returned fewer than two documents (even after QE).</p>
        <p>Multi-label Limit (DpL): DpL denotes the maximum number of assigned documents per class label and
it was used to manually set a threshold for the document cut-off in the result sets.</p>
        <p>Source Field (SF): The metadata source was varied to indicate which source is most reliable and
whether combining the sources improves the classification or not.</p>
        <p><sup>2</sup> http://snowball.tartarus.org/algorithms/dutch/stop.txt
<sup>3</sup> http://snowball.tartarus.org/algorithms/dutch/stemmer.html
<sup>4</sup> http://lucene.apache.org
<sup>5</sup> http://de.openoffice.org/spellcheck/about-spellcheck-detail.html#thesaurus
<sup>6</sup> http://code.google.com/apis/ajaxlanguage/documentation</p>
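        <p>The Query Expansion (QE) parameter from the list above can be illustrated with a minimal Python sketch. It assumes that the documents are available as lists of already stemmed and stopword-filtered tokens and that the expansion term is simply appended to the original query. The function name expand_query, the exclusion of terms already in the query and the alphabetical tie-breaking are hypothetical choices for illustration and are not taken from the Xtrieval framework.</p>
        <preformat>
```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=5):
    """Append the most frequent term of the top_k documents to the query.

    query_terms: list of query terms (hypothetical input format).
    ranked_docs: list of token lists, best-ranked document first.
    """
    counts = Counter()
    for tokens in ranked_docs[:top_k]:
        counts.update(tokens)
    # Assumption: terms that are already part of the query are not candidates.
    for term in query_terms:
        counts.pop(term, None)
    if not counts:
        return list(query_terms)
    # Assumption: ties are broken alphabetically to keep runs reproducible.
    best = min(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + [best]
```
        </preformat>
        <p>Only the first five documents of the ranking contribute to the term counts, so a term that is frequent further down the result list does not influence the reformulated query.</p>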
        <p>Because the document cut-off level is hard to determine a priori, we calculated the following threshold
for each query. The threshold T<sub>DpL</sub> is based on the scores of the retrieved documents per class label.
RSV<sub>avg</sub> denotes the average score and RSV<sub>max</sub> the maximum score of the documents retrieved. Num<sub>docs</sub>
stands for the total number of documents retrieved for a specific class label.</p>
        <p><disp-formula><tex-math>T_{DpL} = RSV_{avg} + 2 \cdot \frac{RSV_{max} - RSV_{avg}}{Num_{docs}}</tex-math></disp-formula></p>
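        <p>As a minimal sketch, the threshold can be computed from the retrieval scores of one class label as follows. The sketch reads the formula as TDpL = RSVavg + 2 (RSVmax - RSVavg) / Numdocs. The function names and the rule that documents scoring at or above the threshold are kept are assumptions made for illustration.</p>
        <preformat>
```python
def dpl_threshold(scores):
    """Return T_DpL for the retrieval scores (RSVs) of one class label."""
    num_docs = len(scores)
    if num_docs == 0:
        raise ValueError("no documents retrieved for this label")
    rsv_avg = sum(scores) / num_docs
    rsv_max = max(scores)
    return rsv_avg + 2 * (rsv_max - rsv_avg) / num_docs


def apply_threshold(scored_docs):
    """Keep documents whose score reaches the threshold (assumed cut-off rule).

    scored_docs: list of (doc_id, score) pairs for one class label.
    """
    threshold = dpl_threshold([score for _, score in scored_docs])
    return [doc_id for doc_id, score in scored_docs if score >= threshold]
```
        </preformat>
        <p>Under this reading the threshold equals the maximum score when two documents are retrieved and moves toward the average score as the number of retrieved documents grows.</p>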
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In this section we report results that were obtained by running various system configurations on the provided
training data. In table 1 columns 2-5 refer to specific system parameters that were introduced in section 2.2.
Please note that the utilization of the threshold formula is denoted with x in column DpL, which means that
the number of assigned documents can be different for each class label.</p>
      <p>Regarding the evaluation of the task we had a problem with calculating the measures. We report two
values for MAP due to a peculiarity in our Xtrieval framework, which allows the system to return two
documents with identical RSV. The trec_eval tool seems to penalize this behavior by randomly reordering
the result set. Thus the MAP values reported by trec_eval and by our framework (labeled MAP* in the following
tables) differ marginally. Unfortunately we were neither able to correct the behavior of our system
nor could we find out when or why the trec_eval tool reorders our result sets. Thus, in agreement with the
task organizers, we decided to report both MAP values for our experiments.</p>
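      <p>The effect of tied scores on average precision can be reproduced with a small Python sketch. The document identifiers and relevance judgments below are invented for illustration. The sketch only shows that two orderings of documents with identical RSV can yield different average precision; it does not model how the evaluation tool actually orders ties.</p>
      <preformat>
```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list of document ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical result set in which d2 and d3 share the same score.
relevant = {"d1", "d3"}
order_a = ["d1", "d2", "d3"]  # tie resolved as d2 before d3
order_b = ["d1", "d3", "d2"]  # tie resolved as d3 before d2

ap_a = average_precision(order_a, relevant)
ap_b = average_precision(order_b, relevant)
```
      </preformat>
      <p>In this example the first ordering gives an average precision of about 0.83 and the second gives 1.0, although both orderings are consistent with the same scores.</p>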
      <sec id="sec-3-1">
        <title>Experiments on the Training Data</title>
        <p>For evaluation of the classification performance we report the total number of assigned labels (SumL), the ratio of
correctly assigned labels (CR), the recall averaged over all class labels (AR) and mean average precision (MAP).
Table 1 is divided into three sections with respect to the metadata sources used. In the five
rightmost columns the best values for each section of the table are set in bold and the best value over
all sections is marked bold and italic.</p>
        <p>The following observations can be made by analyzing the experimental results. No matter which metadata
source was used, the experiment without limitation of the class labels per document had the best performance
in terms of AR and MAP (see IDs cut2, cut5 and cut11). The drawback of those runs is that they have
very low correct classification rates (CR) of about 3% for the ASR data and about 8% when using archival
metadata alone or in combination with ASR data. In contrast, the experiments without any form
of query expansion (see IDs cut1, cut4 and cut10) had the highest correct classification rates (CR), from
33% up to 47%. However, this is mainly a result of limiting the assignment to one document per label, which also
leads to lower performance in terms of AR and MAP. Numerous experiments with either manual or automatic
thresholds to limit the assigned documents per label were conducted. The results show that it is possible
to improve CR substantially and almost sustain the best MAP values (compare cut5 to cut9 and cut11 to
cut15). Nevertheless, for those runs the AR was significantly lower.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experiments on the Test Data</title>
        <p>In this section we report the experimental results on the evaluation data set. Please note that we ran all
configurations from section 3.1 again, because we wanted to find out whether our observations on the training data
are also valid on the test data set. Experiments that were submitted for official evaluation by the organizers
of the task are denoted with *. Again, in table 2 columns 2-5 contain parameters of our system, which are
briefly explained in section 2.2. The performance of the experiments is reported with respect to the overall sum
of assigned labels (SumL), the average ratio of correct classifications (CR) as well as average recall (AR) and
mean average precision (MAP). As in section 3.1, table 2 is divided into three sections with
respect to the metadata sources used. In the five rightmost columns the best values for each section of the
table are set in bold and the best value over all sections is marked bold and italic.</p>
        <p>In general we see similar behavior on both the training and the test data set. For all data sources used,
the best correct classification rate (CR) is achieved without using any form of query expansion (see IDs
cut1, cut4 and cut10). The best overall CR was achieved by using only archival metadata in the retrieval
phase. Since the archival metadata consists of intellectual annotations, this is a very straightforward finding.
Another obvious observation is that the best overall results in terms of MAP and AR were also achieved on
the archival metadata. Nevertheless, the gap to the best results when combining ASR output with archival
metadata is very small (compare cut5 to cut11). Regarding our proposed automatic threshold calculation
for limiting the number of assigned documents per label, the results are twofold. On the one hand there
is a slight improvement in terms of MAP and AR compared to low manually fixed thresholds between 1 and
3 assigned documents per label. On the other hand the overall correct classification rate (CR) decreases by
about the same magnitude as MAP and AR increase, which is another very straightforward finding.</p>
        <p>The interpretation of our experimental results led us to the conclusion that using MAP for evaluating a
multi-label classification task is somewhat questionable. The main reason in our point of view is that MAP
does not take into account the overall correct classification rate CR. Let us take a look at the two best
performing experiments using archival metadata and ASR transcriptions in either table 1 or 2 (see IDs cut10
and cut15). The difference in terms of MAP is about 6% or 12%, but the gain in terms of CR is about
293% or 337% respectively. In our opinion, in a real-world scenario where the assignment of class labels to video
documents should be completely automatic, it would be essential to take into account the overall ratio of
correctly assigned labels. Our proposal for future evaluations is to combine measures that take into account
the position of the correctly assigned labels in a result set (like MAP or averaged R-Precision) with the micro
or macro correct classification rate.</p>
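        <p>The micro and macro correct classification rate named in the proposal can be sketched in Python as follows. The sketch assumes that CR is the share of assigned labels that are correct, that the micro variant pools all assignments, and that the macro variant averages the per-label rates over labels with at least one assignment. These definitions and all names are assumptions for illustration, since the exact formulas are not given in this paper.</p>
        <preformat>
```python
def correct_rates(assigned, truth):
    """Return (micro_cr, macro_cr) for a multi-label assignment.

    assigned: dict mapping class label to the set of assigned document ids.
    truth:    dict mapping class label to the set of correct document ids.
    """
    total_assigned = 0
    total_correct = 0
    per_label = []
    for label, docs in assigned.items():
        if not docs:
            continue  # assumption: labels without assignments are skipped
        correct = len(docs.intersection(truth.get(label, set())))
        total_assigned += len(docs)
        total_correct += correct
        per_label.append(correct / len(docs))
    if total_assigned == 0:
        return 0.0, 0.0
    micro = total_correct / total_assigned
    macro = sum(per_label) / len(per_label)
    return micro, macro
```
        </preformat>
        <p>The two variants diverge when a few labels receive many documents: a label with many wrong assignments lowers the micro rate more strongly than the macro rate.</p>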
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Result Analysis - Summary</title>
      <p>The following list provides a short summary of our observations and findings from the participation in the
VideoCLEF classification task in 2009.</p>
      <p>Classification as an IR task: According to the experiences from last year, we conclude that treating
the given task as a traditional IR task with some modifications is a quite successful approach.</p>
      <p>Query Expansion: Both types of query expansion improved the results in terms of MAP and AR but
had very low correct classification rates CR.</p>
      <p>Metadata Sources: Combining both ASR output and archival metadata improves MAP and AR when
no query expansion is used. For those experiments where query expansion was used there is no gain in
terms of MAP and AR when comparing archival metadata runs to experiments that used both data sources.</p>
      <p>Label Limits: We compared an automatically calculated threshold to low manually set thresholds and
found that the automatic threshold works better in terms of MAP and AR.</p>
      <p>Evaluation Measure: In our opinion using MAP as evaluation measure for a multi-label classification
task is questionable. We would prefer a measure that takes into account both correct classification rate
and averaged recall.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>This year we used the Xtrieval framework for the VideoCLEF classification task. In our experimental
evaluation we can confirm the observations from last year, where approaches treating the task as an IR problem were
most successful. We proposed an automatic threshold to limit the number of assigned documents per class
label in order to keep correct classification rates high. This seems to be the main issue that could be worked on in the
future. A manual limitation of assigned documents per label is not an appropriate solution to a comparable
real-world problem, where possibly tens or hundreds of thousands of video documents should be labeled with
maybe hundreds of different topic labels. Furthermore, one could evaluate different retrieval models
or combine the results from those models to gain a better overall performance. Finally, it should
be evaluated whether assigning field boosts to the metadata sources could improve performance in the combined
retrieval setting.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the VideoCLEF organizers and the Netherlands Institute of Sound and Vision (Beeld
&amp; Geluid) for providing the data sources for the task.</p>
      <p>This work was accomplished in conjunction with the project sachsMedia, which is funded by the
Entrepreneurial Regions program of the German Federal Ministry of Education and Research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jiyin</given-names>
            <surname>He</surname>
          </string-name>
          , Xu Zhang, Wouter Weerkamp, and
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>The University of Amsterdam at VideoCLEF 2008</article-title>
          .
          <source>Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jens</given-names>
            <surname>Kürsten</surname>
          </string-name>
          , Daniel Richter, and Maximilian Eibl.
          <article-title>VideoCLEF 2008: ASR Classification based on Wikipedia Categories</article-title>
          .
          <source>Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jens</given-names>
            <surname>Kürsten</surname>
          </string-name>
          , Thomas Wilhelm, and
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Eibl</surname>
          </string-name>
          .
          <article-title>Extensible Retrieval and Evaluation Framework: Xtrieval</article-title>
          .
          <source>LWA 2008: Lernen - Wissen - Adaption, Workshop Proceedings</source>
          , Würzburg, October
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          , Eamonn Newman, and
          <string-name>
            <given-names>Gareth</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of VideoCLEF 2008: Automatic Generation of Topic-based Feeds for Dual Language Audio-Visual Content</article-title>
          .
          <source>Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Martha</given-names>
            <surname>Larson</surname>
          </string-name>
          , Eamonn Newman, and
          <string-name>
            <given-names>Gareth</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment</article-title>
          . In Francesca Borri, Alessandro Nardi, and Carol Peters, editors,
          <source>Working Notes of CLEF 2009</source>
          , September
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Eamonn</given-names>
            <surname>Newman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gareth J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>DCU at VideoClef 2008</article-title>
          .
          <source>Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>José M.</given-names>
            <surname>Perea-Ortega</surname>
          </string-name>
          , Arturo Montejo-Ráez, and
          <string-name>
            <given-names>M. Teresa</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          .
          <article-title>SINAI at VideoCLEF 2008</article-title>
          .
          <source>Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Julio</given-names>
            <surname>Villena-Román</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sara</given-names>
            <surname>Lana-Serrano</surname>
          </string-name>
          .
          <article-title>MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts</article-title>
          .
          <source>Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>