<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaMill: Video Search using a Thesaurus of 500 Machine Learned Concepts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cees G.M. Snoek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcel Worring</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bouke Huurnink</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan C. van Gemert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Koen E.A. van de Sande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis C. Koelma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ork de Rooij</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam</institution>
          ,
          <addr-line>Kruislaan 403, 1098 SJ Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this technical demonstration we showcase the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a thesaurus of 500 automatically detected semantic concepts. To handle such a large thesaurus in retrieval, an engine is developed that automatically selects a set of relevant concepts based on the textual query and user-specified example images. The result set can be browsed easily to obtain the final result for the query.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic indexing</kwd>
        <kwd>Video retrieval</kwd>
        <kwd>Information visualization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>II. THE MEDIAMILL 2006 SYSTEM</title>
      <p>The data flow of the MediaMill 2006 system is depicted in
Fig. 1. We will now highlight its components in more detail.</p>
      <p>[Fig. 1. Data flow of the MediaMill 2006 system: a topic (text) and examples (images) are processed by topic analysis and image classification, which select a set of concepts from the concept thesaurus to produce a ranked list.]</p>
    </sec>
    <sec id="sec-2">
      <title>I. INTRODUCTION</title>
      <p>Most commercial video search engines such as Google,
Blinkx, and YouTube provide access to their repositories based
on text as this is still the easiest way for a user to describe
an information need. The indices of these search engines are
based on the filename, surrounding text, social tagging, or
a transcript. This results in disappointing performance when
the visual content is not reflected in the associated text. In
addition, when the videos originate from non-English speaking
countries, such as China or the Netherlands, querying the
content becomes even harder, as automatic speech recognition
results are much poorer. Additional visual analysis yields
more robustness. Thus, in video retrieval a recent trend is
to learn a lexicon of semantic concepts from multimedia
examples and to employ these as entry points in querying the
collection.</p>
      <p>
        Last year we presented the MediaMill 2005 video search
engine [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] using a 101 concept lexicon [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] evaluated in the
TRECVID benchmark [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For our current system we made
a jump to a thesaurus of 500 concepts. The items range from
a pure format, such as a detected split screen, to a style, such
as an interview, an object, such as a horse, or an event, such
as an airplane take-off. Each of these brings an understanding
of the current content. The elements in such a thesaurus offer
users a semantic entry to video by allowing them to query on
the presence or absence of content elements. For a user, however,
selecting the right concepts from the large thesaurus is difficult.
We therefore developed a suggestion engine that analyzes the
textual topic, and possible image examples given by the user,
to automatically derive the most relevant concept detectors for
querying the video archive (see Fig. 1 and Fig. 2).
      </p>
      <sec id="sec-2-a">
        <title>A. Semantic Indexing</title>
        <p>
          For semantic indexing we proposed the semantic pathfinder; for details see [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. First, it extracts features from the visual [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], textual, and auditory modalities. The architecture exploits supervised machine learning to automatically label segments with semantic concepts. In the first step, learning is based on the content features only. In the second step, the video is analyzed based on its style properties. Finally, semantic concepts are analyzed in context, with the potential to boost index results further. The resulting thesaurus of 500 semantic concepts, covering setting, objects, and people, is learned from the LSCOM annotations [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and the 101 concepts used in our 2005 engine [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
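      <p>The per-concept selection underlying the pathfinder idea can be sketched as follows; the function names, dictionary layout, and the use of average precision on held-out data are our illustrative assumptions, not the implementation of [4].</p>
      <preformat>
```python
# Hypothetical sketch: each concept keeps the score from whichever
# analysis step (content, style, or context) validates best for it.

def pathfinder_select(step_scores, validation_ap):
    """step_scores: {step: {concept: score}} on test data.
    validation_ap: {step: {concept: average precision}} on held-out data.
    Returns, per concept, (best step, its score)."""
    result = {}
    for c in step_scores["content"]:
        best_step = max(("content", "style", "context"),
                        key=lambda s: validation_ap[s][c])
        result[c] = (best_step, step_scores[best_step][c])
    return result
```
      </preformat>
      <p>In this sketch a concept such as an interview would end up using its style-analysis score, while a purely visual concept would stay with content analysis.</p>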
      <sec id="sec-2-b">
        <title>B. Topic Analysis</title>
        <p>
          We map the richness and subjectivity of semantics in user queries to the concept detectors available in our thesaurus. To derive the most relevant concepts for a given user topic, we first assign syntactic categories to groups of words in the input text using a chunking algorithm. We then assign a grammatical classification to each word using a part-of-speech tagger. From there, we look up each noun chunk in WordNet [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. When a match is found, those words are eliminated from further lookups. Then we look up any remaining nouns in WordNet. The result is a set of WordNet words related to the input text. Now that both the concepts in the text and the multimedia concept detectors are related to WordNet, we can compute the semantic distance between the textual concepts and the multimedia concepts. We use Resnik's algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which calculates the similarity of a concept to each of the WordNet nouns from the query text. Based on the combined scores we rank each multimedia concept detector in order of expected utility.
        </p>
      </sec>
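      <p>As a minimal sketch of this matching step, the following computes Resnik similarity (the information content of the most informative common ancestor) over a tiny hand-made taxonomy; the toy hierarchy, frequencies, and ranking helper are our assumptions, whereas the actual system uses WordNet with corpus-derived counts.</p>
      <preformat>
```python
import math

# Toy taxonomy and counts (illustrative only; a real system would use
# WordNet synsets and corpus frequencies).
parent = {"horse": "animal", "dog": "animal", "animal": "entity",
          "car": "vehicle", "vehicle": "entity", "entity": None}
freq = {"horse": 2, "dog": 3, "animal": 10, "car": 4, "vehicle": 8, "entity": 30}
TOTAL = 30

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = parent[c]
    return out

def ic(c):
    # Information content: -log p(concept).
    return -math.log(freq[c] / TOTAL)

def resnik(c1, c2):
    # Similarity = IC of the most informative common ancestor.
    common = set(ancestors(c1)).intersection(ancestors(c2))
    return max(ic(c) for c in common)

def rank_detectors(query_nouns, detectors):
    # Rank concept detectors by their best similarity to any query noun.
    return sorted(detectors,
                  key=lambda d: max(resnik(d, n) for n in query_nouns),
                  reverse=True)
```
      </preformat>
      <p>For a query noun such as "horse", a detector for "dog" (sharing the ancestor "animal") outranks one for "car", which only shares the root.</p>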
      <sec id="sec-2-1">
        <title>C. Image Classification</title>
        <p>
          Concept suggestion based on query image analysis first
extracts visual features [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Based on the features we predict
for each image a concept using pre-learned visual-only models.
Rather than selecting the concept with the maximal score (often
the most robust but least informative concepts, e.g. people,
face, outdoor), we select the model that maximizes the
probability of observing this image given the concept. To compute
this probability, Bayes' theorem is applied using training-set
statistics. Hence, we prioritize less frequent, but discriminative,
concepts with reasonable probability scores over frequent, but
less discriminative, concepts with high probability scores.
        </p>
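          <p>A minimal sketch of this selection rule: for a fixed image, P(image given concept) is proportional to the posterior P(concept given image) divided by the prior P(concept), so ranking by that ratio prefers rare but discriminative concepts. The example scores below are our illustrative assumptions.</p>
          <preformat>
```python
# Sketch (our illustration, not the system's code): pick the concept
# maximizing P(image | concept). By Bayes' theorem this is proportional
# to P(concept | image) / P(concept), since P(image) is constant.

def suggest_concept(posteriors, priors):
    """posteriors: P(concept | image) per concept.
    priors: training-set frequency P(concept) per concept."""
    return max(posteriors, key=lambda c: posteriors[c] / priors[c])

# A frequent concept with a high posterior loses to a rarer concept
# with a reasonable posterior:
print(suggest_concept({"people": 0.9, "horse": 0.6},
                      {"people": 0.8, "horse": 0.05}))
```
          </preformat>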
      </sec>
      <sec id="sec-2-2">
        <title>D. Rank Combination</title>
        <p>
          We offer users several possibilities to combine the various
ranked lists. They can employ standard combination methods
such as min, max, sum, and product [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In addition, they may
specify that some concepts are more important than others by
adding weights to individual concepts.
        </p>
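          <p>The combination step can be sketched as below; the data layout and the per-concept weighting scheme are our assumptions, with the min, max, sum, and product operators as in [9].</p>
          <preformat>
```python
# Illustrative sketch of weighted fusion of per-concept ranked lists.

def combine(ranked_scores, weights, method="sum"):
    """ranked_scores: {concept: {shot: score}}; weights: {concept: weight}.
    Returns shots ranked by the combined, weighted score."""
    shots = set()
    for scores in ranked_scores.values():
        shots.update(scores)
    combined = {}
    for s in shots:
        vals = [weights[c] * ranked_scores[c].get(s, 0.0)
                for c in ranked_scores]
        if method == "sum":
            combined[s] = sum(vals)
        elif method == "product":
            p = 1.0
            for v in vals:
                p *= v
            combined[s] = p
        elif method == "min":
            combined[s] = min(vals)
        else:  # "max"
            combined[s] = max(vals)
    return sorted(combined, key=combined.get, reverse=True)
```
          </preformat>
          <p>Raising the weight of a concept pulls shots that score well on that concept toward the top of the fused list.</p>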
      </sec>
      <sec id="sec-2-3">
        <title>E. Browsing the Result</title>
        <p>The result of concept suggestion, the subsequent concept
queries, and their combination yields a ranked list of shots.
To aid human interpretation in exploring this result, the
CrossBrowser visualizes the ranked list (vertical axis) versus the
time (horizontal axis) of the program containing the shot. The
two dimensions are projected onto a sphere to allow easy
navigation. It also enhances focus of attention on the most
important elements. Remaining elements are still visible, but
much darker (see Fig. 2).</p>
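        <p>The paper does not specify the projection; purely as an illustration of mapping (rank, time) grid offsets onto a sphere, so that the focus shot faces the viewer and more distant shots recede, one could write:</p>
        <preformat>
```python
import math

# Illustrative geometry only (our assumption, not the CrossBrowser's
# actual projection): signed grid offsets from the focus shot become
# latitude (rank) and longitude (time) on a unit sphere.

def sphere_position(d_rank, d_time, spread=0.25):
    """d_rank, d_time: signed grid offsets from the focus shot."""
    lat = max(-math.pi / 2, min(math.pi / 2, d_rank * spread))
    lon = max(-math.pi / 2, min(math.pi / 2, d_time * spread))
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```
        </preformat>
        <p>The focus shot at offset (0, 0) maps to the point facing the viewer, while neighbors curve away toward the sphere's horizon.</p>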
      </sec>
    </sec>
    <sec id="sec-3">
      <title>III. DEMONSTRATION</title>
      <p>We demonstrate semantic exploration of news video
archives with the MediaMill system. We will show how a
thesaurus of 500 concepts can be exploited for effective access
to video at a semantic level. In addition, we will exhibit
novel browsers that present retrieval results using advanced
visualizations. Taken together, the search engine provides users
with semantic access to news video archives.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>C.G.M.</given-names> <surname>Snoek</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Worring</surname></string-name>,
          <string-name><given-names>J.C.</given-names> <surname>van Gemert</surname></string-name>,
          <string-name><given-names>J.-M.</given-names> <surname>Geusebroek</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Koelma</surname></string-name>,
          <string-name><given-names>G.P.</given-names> <surname>Nguyen</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>de Rooij</surname></string-name>, and
          <string-name><given-names>F.</given-names> <surname>Seinstra</surname></string-name>, “
          <article-title>MediaMill: Exploring news video archives based on learned semantics</article-title>
          ,” in
          <source>Proceedings of the ACM International Conference on Multimedia</source>
          , Singapore,
          <year>November 2005</year>
          , pp.
          <fpage>225</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>C.G.M.</given-names> <surname>Snoek</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Worring</surname></string-name>,
          <string-name><given-names>J.C.</given-names> <surname>van Gemert</surname></string-name>,
          <string-name><given-names>J.-M.</given-names> <surname>Geusebroek</surname></string-name>, and
          <string-name><given-names>A.W.M.</given-names> <surname>Smeulders</surname></string-name>, “
          <article-title>The challenge problem for automated detection of 101 semantic concepts in multimedia</article-title>
          ,” in
          <source>Proceedings of the ACM International Conference on Multimedia</source>
          , Santa Barbara, USA,
          <year>October 2006</year>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>A.</given-names> <surname>Smeaton</surname></string-name>, “
          <article-title>Large scale evaluations of multimedia information retrieval: The TRECVid experience</article-title>
          ,” in CIVR, ser.
          <source>LNCS</source>
          , vol.
          <volume>3569</volume>
          . Springer-Verlag,
          <year>2005</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>C.G.M.</given-names> <surname>Snoek</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Worring</surname></string-name>,
          <string-name><given-names>J.-M.</given-names> <surname>Geusebroek</surname></string-name>,
          <string-name><given-names>D.C.</given-names> <surname>Koelma</surname></string-name>,
          <string-name><given-names>F.J.</given-names> <surname>Seinstra</surname></string-name>, and
          <string-name><given-names>A.W.M.</given-names> <surname>Smeulders</surname></string-name>, “
          <article-title>The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing</article-title>
          ,”
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1678</fpage>
          -
          <lpage>1689</lpage>
          ,
          <year>October 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>J.C.</given-names> <surname>van Gemert</surname></string-name>,
          <string-name><given-names>J.-M.</given-names> <surname>Geusebroek</surname></string-name>,
          <string-name><given-names>C.J.</given-names> <surname>Veenman</surname></string-name>,
          <string-name><given-names>C.G.M.</given-names> <surname>Snoek</surname></string-name>, and
          <string-name><given-names>A.W.M.</given-names> <surname>Smeulders</surname></string-name>, “
          <article-title>Robust scene categorization by learning image statistics in context</article-title>
          ,” in
          <source>International Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR'06</source>
          , New York, USA,
          <year>June 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Naphade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tesic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hauptmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Curtis</surname>
          </string-name>
          , “
          <article-title>Large-scale concept ontology for multimedia</article-title>
          ,”
          <source>IEEE Multimedia</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>G.A.</given-names> <surname>Miller</surname></string-name>, “
          <article-title>WordNet: A lexical database for English</article-title>
          ,”
          <source>Communications of the ACM</source>
          , vol.
          <volume>38</volume>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>P.</given-names> <surname>Resnik</surname></string-name>, “
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          ,” in
          <source>IJCAI</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , “
          <article-title>Analysis of multiple evidence combination</article-title>
          ,”
          <source>in Proceedings of ACM SIGIR</source>
          ,
          <year>1997</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>