<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TEXT MINING AS SUPPORT FOR SEMANTIC VIDEO INDEXING AND ANALYSIS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Nemrava</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtěch Svátek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Buitelaar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Declerck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI Saarbrücken</institution>
          ,
          <addr-line>Stuhlsatzenhausweg 3, 66123 Saarbrücken-D</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Economics, Prague</institution>
          ,
          <addr-line>W. Churchill Sq. 4, 130 68 Prague-CZ</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our work in the field of semantic multimedia annotation and indexing, exploiting the analysis of complementary textual resources. We describe the advantages of complementary sources of information as a support for annotation and test whether these data can be used for automatic annotation and event detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Paul Buitelaar, Thierry Declerck</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>In this paper we present our work on using
complementary textual resources in video analysis. For the
selected domain (soccer, in our case) this concerns various
textual sources: structured data (match tables with teams,
player names, goals scored, substitutions, etc.) and
semi-structured textual web data (minute-by-minute match
reports, i.e. unstructured text accompanied by temporal
information). Events and entities detected in these sources
are marked up with semantic classes derived from a soccer
ontology by means of information extraction tools. Since the
target audience comes from various research areas, this text
focuses on the potential uses of this approach rather than on
its technical details. Temporal alignment of primary video
data (soccer match videos) with semantically organized events
and entities from the textual and structured complementary
resources can serve as an indicator for video segment
extraction and semantic classification; e.g. the occurrence of
a 'Header' event in the complementary resources is used to
train a classifier and later to classify the corresponding
video segment accordingly. This information can then be used
for semantic indexing and retrieval of events in soccer
videos, but also for the targeted extraction of audio/visual
(A/V) features (motion, audio pitch, field line, close-up). We
denote such extraction of A/V features based on textual
evidence "cross-media feature extraction".</p>
      <p>
        Considerable research effort has been devoted to semantic
annotation and indexing in the sports domain. Some of this
work, such as [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], also uses complementary resources, though
not to the extent that we do. For further related work see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. RESOURCES COMPLEMENTARY TO A/V</title>
    </sec>
    <sec id="sec-4">
      <title>STREAMS</title>
      <p>
        The exploitation of related (complementary) textual
resources, especially when these are endowed with temporal
references, can greatly increase the quality of video analysis,
indexing and retrieval. Admittedly, the number of domains with
freely available detailed temporal descriptions is limited, but
where such information exists on a large scale it can be used
very effectively. Multiple parallel descriptions of one event
further increase the coverage and eliminate false events. Good
examples can be found in the sports domain. Current research in
sports video analysis focuses on event recognition and
classification based on the extraction of low-level features
and, when based solely on these low-level features, is limited
to a very small number of event types, e.g. 'scoring-event' [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Complementary resources, on the other hand, can
serve as a valuable source for more fine-grained event
recognition and classification. We distinguish two kinds of
information sources according to their direct vs. indirect
connection to the video material. Primary complementary
resources comprise information that is directly attached to the
media, namely overlay texts, the audio track and spoken
commentaries. Secondary complementary resources comprise
information that is independent of the media itself but related
to its content; it must first be identified and processed.
      </p>
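      <p>The distinction can be made concrete with a small sketch; the
Python class names below are illustrative assumptions, not part of
the architecture described in this paper.</p>
      <preformat>
# Hypothetical sketch of the primary/secondary resource distinction.
from dataclasses import dataclass
from enum import Enum

class ResourceKind(Enum):
    PRIMARY = "directly attached to the media"     # overlay text, audio, commentary
    SECONDARY = "independent but content-related"  # match reports, tables

@dataclass
class ComplementaryResource:
    kind: ResourceKind
    uri: str
    has_temporal_refs: bool  # temporal references enable alignment with video

report = ComplementaryResource(ResourceKind.SECONDARY,
                               "http://example.org/minute-by-minute", True)
print(report.kind.value)  # independent but content-related
      </preformat>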
    </sec>
    <sec id="sec-5">
      <title>3. COMPLEMENTARY TEXTUAL RESOURCES</title>
    </sec>
    <sec id="sec-6">
      <title>AND VIDEO INDEXING</title>
      <p>
        Major sports events, such as the FIFA Soccer World Cup
Tournament held in Germany in 2006, provide a wide range of
textual resources, ranging from semi-structured data in the
form of tables on web sites to textual summaries and other
match reports. The video material was analyzed independently
of the research described here, see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The results of that analysis are taken as
input for our research and consist of a video segmentation in
which each second is described by a set of feature detectors,
namely Crowd detection, Speech-Band Audio Activity, On-Screen
Graphics, Motion activity measure and Field Line orientation.
      </p>
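      <p>A minimal sketch of how this per-second representation might
look follows; the function name and dictionary layout are assumed
for illustration, while the detector names come from the analysis
cited above.</p>
      <preformat>
# Each second of video is described by one value per generic detector.
DETECTORS = ["crowd", "speech_band_audio", "on_screen_graphics",
             "motion_activity", "field_line_orientation"]

def segment_features(second, raw_scores):
    """Assemble the feature vector for one second of video;
    raw_scores maps (second, detector_name) to a detector value."""
    return {name: raw_scores.get((second, name), 0.0) for name in DETECTORS}

raw = {(61, "crowd"): 0.8, (61, "motion_activity"): 0.4}
print(segment_features(61, raw))
      </preformat>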
      <p>
        The dataset for ontology-based information extraction and
ontology learning from text (the SmartWeb corpus [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) consists of a soccer ontology, a corpus of
semi-structured and textual match reports, and a knowledge base
of automatically extracted events and entities.
      </p>
      <p>
        Minute-by-minute reports are usually published on soccer
web sites and enable people to 'watch' the match in textual
form on the web. Processing several such reports in parallel
increases the coverage of events and eliminates false
positives. We therefore rely on the following six sources in
this case: ARD, bild.de, LigaLive (all in German), and
Guardian, DW-World, DFB.de (all in English); we apply the
SProUT tool [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to them. This effort
resulted in an interactive non-linear event browsing demo
presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The next section describes experiments
with event detection based on the general A/V detectors.
      </p>
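      <p>A sketch of how parallel reports can be combined is shown
below: an event is kept only if enough sources report it in the
same minute, which raises coverage while filtering false positives.
The voting threshold and the data layout are assumptions, not the
exact procedure used with SProUT.</p>
      <preformat>
# Hypothetical merge of parallel minute-by-minute reports by voting.
from collections import defaultdict

def merge_reports(mentions, min_sources=2):
    """mentions: iterable of (source, minute, event_type) triples.
    Returns (minute, event_type) pairs confirmed by enough sources."""
    votes = defaultdict(set)
    for source, minute, event_type in mentions:
        votes[(minute, event_type)].add(source)
    return sorted(key for key, sources in votes.items()
                  if len(sources) >= min_sources)

mentions = [("ARD", 23, "Header"), ("Guardian", 23, "Header"),
            ("bild.de", 57, "Goal")]
print(merge_reports(mentions))  # [(23, 'Header')] -- lone 'Goal' is dropped
      </preformat>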
    </sec>
    <sec id="sec-7">
      <title>4. CROSS-MEDIA FEATURE EXTRACTION</title>
      <p>
        The aim of the semantic annotation is to allow
(semi-)automatic detection of events in the video based on
previously learned examples. The aim of this experiment was to
test whether the general detectors can serve as a sufficient
source of information. For the experiment we used two manually
annotated soccer match videos, one as the training set and the
other as the test set. We created additional derived features
describing the previous and the next values of the detectors in
the same time range as the event instance itself, giving us a
better chance of capturing the behavior of a detector over
time. We used decision trees as the machine learning algorithm
and built a binary classifier for each of the observed events.
The task of each classifier was to decide whether a particular
segment is or is not an instance of the event. By our
observation, the detectors we used are too generic for
fine-grained event detection, but they can help detect a
certain event within a given (usually one-minute-long) time
range where the event was identified in the text. The table
below shows that different detectors are important for
different event types. This potentially allows detecting
instances of event types by observing only those detectors that
are discriminative for them (an assumption also exploited by
the decision tree algorithm). The letters P, C, N stand for the
Previous, Current or Next value of the detector for a
particular event type. More details can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>[Table: Results of the cross-media feature selection]</p>
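      <p>The experiment can be sketched as follows: for every segment
the Previous/Current/Next detector values are concatenated into one
feature vector, and a binary decision tree is trained per event
type. The sketch below uses scikit-learn and random stand-in data as
assumptions; it is not the authors' actual toolchain.</p>
      <preformat>
# Hypothetical sketch of the P/C/N features and per-event binary trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pcn_features(X):
    """X: (n_segments, n_detectors) array of detector values.
    Returns Previous, Current and Next detector values, concatenated,
    so each segment also carries its temporal context."""
    prev_ = np.vstack([X[:1], X[:-1]])  # previous segment (edge repeated)
    next_ = np.vstack([X[1:], X[-1:]])  # next segment (edge repeated)
    return np.hstack([prev_, X, next_])

rng = np.random.default_rng(0)
X_train = rng.random((600, 5))     # 600 segments x 5 generic detectors
y_train = rng.integers(0, 2, 600)  # 1 = segment overlaps a 'Header' event

clf = DecisionTreeClassifier(max_depth=5)  # one binary classifier per event
clf.fit(pcn_features(X_train), y_train)
print(clf.score(pcn_features(X_train), y_train))
      </preformat>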
    </sec>
    <sec id="sec-8">
      <title>5. CONCLUSIONS AND FUTURE WORK</title>
      <p>
        We presented an approach to using resources that are
complementary to A/V streams, such as videos of football
matches, for the semantic indexing of such streams. We further
presented an experiment with event detection based on general
A/V detectors supported by textual annotation. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] we showed that event detection based on
general detectors can act quite satisfactorily as a binary
classifier, but that it performs significantly worse when
trained to distinguish more classes. Using classifiers similar
to those we have tested, together with complementary
minute-by-minute textual information (providing rough,
minute-based estimates of where a particular event occurred),
can help refine video indexing and retrieval. The potential of
this work lies not only in annotation for the indexing and
retrieval of multimedia, but also in providing feedback to the
video learning algorithm; we therefore also see a role for it
in OCR and other video analysis areas where text has to be
dealt with.
      </p>
    </sec>
    <sec id="sec-9">
      <title>6. ACKNOWLEDGEMENTS</title>
      <p>This research was supported by the European Commission
under contract FP6-027026 for the K-Space project. We
thank D. Sadlier and Noel O’Connor (DCU, Ireland) for
providing the A/V data and analysis results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bertini</surname>
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Automatic annotation and semantic retrieval of video sequences using multimedia ontologies</article-title>
          .
          <source>MULTIMEDIA '06, ACM</source>
          , New York, NY,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Castano</surname>
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Ontology Dynamics with Multimedia Information: The BOEMIE Evolution Methodology</article-title>
          .
          <source>In Proc. of International Workshop on Ontology Dynamics (IWOD) ESWC 2007 Workshop</source>
          , Innsbruck, Austria,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Drozdzynski</surname>
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>Shallow Processing with Unification and Typed Feature Structures - Foundations and Applications</article-title>
          . In KI 1/
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Lanagan</surname>
            <given-names>J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Smeaton</surname>
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>SportsAnno: What do you think?</article-title>
          .
          <source>RIAO 2007 - Large-Scale Semantic Access to Content</source>
          , Pittsburgh, PA, USA, 30 May - 1 June
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Nemrava</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Text Mining Support for Semantic Indexing and Analysis of A/V Streams</article-title>
          .
          <source>OntoImage Workshop at LREC 2008</source>
          , Marrakech, Morocco, May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Nemrava</surname>
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>An Architecture for Mining Resources Complementary to Audio-Visual Streams</article-title>
          .
          <source>In: Proc. of the KAMC workshop at SAMT07</source>
          , Italy, Dec.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Oberle</surname>
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>DOLCE ergo SUMO: On Foundational and Domain Models in SWIntO (SmartWeb Integrated Ontology)</article-title>
          .
          <source>Journal of Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>5</volume>
          (
          <year>2007</year>
          )
          <fpage>156</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Sadlier</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Event Detection in Field Sports Video using Audio-Visual Features and a Support Vector Machine</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          , Oct.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Xu</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chua</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The fusion of audio-visual features and external knowledge for event detection in team sports video</article-title>
          .
          <source>In Proceedings of the 6th ACM SIGMM Workshop on Multimedia Information Retrieval</source>
          ,
          <year>2004</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>