 MediaMill: Video Search using a Thesaurus of 500
            Machine Learned Concepts
Cees G.M. Snoek, Marcel Worring, Bouke Huurnink, Jan C. van Gemert,
                               Koen E.A. van de Sande, Dennis C. Koelma, and Ork de Rooij



Abstract— In this technical demonstration we showcase the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a thesaurus of 500 automatically detected semantic concepts. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concepts based on the textual query and user-specified example images. The result set can be browsed easily to obtain the final result for the query.

Index Terms— Semantic indexing, video retrieval, information visualization.

I. INTRODUCTION

Most commercial video search engines, such as Google, Blinkx, and YouTube, provide access to their repositories based on text, as this is still the easiest way for a user to describe an information need. The indices of these search engines are based on the filename, surrounding text, social tagging, or a transcript. This results in disappointing performance when the visual content is not reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China or the Netherlands, querying the content becomes even harder, as automatic speech recognition results are so much poorer. Additional visual analysis yields more robustness. Thus, a recent trend in video retrieval is to learn a lexicon of semantic concepts from multimedia examples and to employ these as entry points in querying the collection.

Last year we presented the MediaMill 2005 video search engine [1], using a 101-concept lexicon [2] evaluated in the TRECVID benchmark [3]. For our current system we made a jump to a thesaurus of 500 concepts. The items vary from pure format, like a detected split screen, to style, like an interview, to an object, like a horse, to an event, like an airplane take-off. Any one of those brings an understanding of the current content. The elements in such a thesaurus offer users a semantic entry to video by allowing them to query on the presence or absence of content elements. For a user, however, selecting the right topic from the large thesaurus is difficult. We therefore developed a suggestion engine that analyzes the textual topic, and possible image examples given by the user, to automatically derive the most relevant concept detectors for querying the video archive (see Fig. 1 and Fig. 2).

This research is sponsored by the BSIK MultimediaN project. The authors are with the Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands (e-mail: info@mediamill.nl, http://www.mediamill.nl).

II. THE MEDIAMILL 2006 SYSTEM

The data flow of the MediaMill 2006 system is depicted in Fig. 1. We will now highlight its components in more detail.

Fig. 1. Overview of the different processing steps in the MediaMill semantic video search engine.

A. Semantic Indexing

For semantic indexing we proposed the semantic pathfinder; for details see [4]. First, it extracts features from the visual [5], textual, and auditory modality. The architecture exploits supervised machine learning to automatically label segments with semantic concepts. In the first step, learning is on the content features only. In the second step, the video is analyzed based on its style properties. Finally, semantic concepts are analyzed in context, with the potential to boost index results further. The resulting thesaurus of 500 semantic concepts, covering setting, objects, and people, is learned based on the LSCOM annotations [6] and the 101 concepts used in our 2005 engine [2].

B. Topic Analysis

We map the richness and subjectivity of semantics in user queries to the concept detectors available in our thesaurus. To derive the most relevant concepts for a given user topic, we first assign syntactic categories to groups of words in the input text using a chunking algorithm. We then assign a grammatical classification to each word by using a part-of-speech tagger. From there, we look up each noun chunk in WordNet [7]. When a match has been found, those words are eliminated from further lookups. Then we look up any remaining nouns in WordNet. The result is a number of WordNet words related to the input text. Now that both the concepts in the text and the multimedia concept detectors are related to WordNet, we can compare them directly.
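The chunk-and-lookup procedure of the topic analysis can be sketched in miniature. The sketch below is an illustrative reconstruction, not the system's code: the part-of-speech tagger and WordNet are replaced by hand-made toy tables (THESAURUS and NOUNS are invented for illustration), and matching is plain set overlap rather than WordNet lookup.

```python
# Toy "thesaurus": concept detector name -> lexical entries
# (a stand-in for the WordNet synsets linked to each detector).
THESAURUS = {
    "soccer": {"soccer", "football", "goal", "match"},
    "grass": {"grass", "field", "pitch"},
    "aircraft": {"airplane", "aircraft", "takeoff"},
}

# Toy part-of-speech lexicon: in the real system a POS tagger marks the nouns.
NOUNS = {"soccer", "goal", "match", "grass", "pitch", "airplane", "takeoff", "shots"}

def suggest_concepts(query: str) -> list[str]:
    """Return thesaurus concepts whose lexical entries match nouns in the query."""
    words = [w.strip(".,").lower() for w in query.split()]
    nouns = [w for w in words if w in NOUNS]        # stand-in for POS tagging
    hits = {}
    for concept, entries in THESAURUS.items():
        overlap = len(entries.intersection(nouns))  # stand-in for WordNet matching
        if overlap:
            hits[concept] = overlap
    # Rank the suggested detectors by how strongly they match the query.
    return sorted(hits, key=hits.get, reverse=True)

print(suggest_concepts("Shots of a goal being made in a soccer match on a grass pitch"))
# → ['soccer', 'grass']
```

The example query mirrors the soccer topic of Fig. 2, for which the system suggests soccer and grass as the most relevant concepts.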
Fig. 2. On the left, an example of a query for shots of a goal being made in a soccer match, using both text and image examples, yielding soccer and grass as the most relevant concepts. Results of the query are visualized in the CrossBrowser on the right.
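Resnik's measure [8] scores two concepts by the information content (negative log probability) of their most specific shared hypernym. A minimal sketch, assuming a toy IS-A taxonomy and made-up concept probabilities; the real system uses the WordNet hierarchy and corpus statistics:

```python
import math

# Toy IS-A taxonomy: node -> parent (None for the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "horse": "animal",
    "dog": "animal",
    "artifact": "entity",
    "vehicle": "artifact",
}

# Toy concept probabilities (derived from corpus counts in practice).
PROB = {"entity": 1.0, "animal": 0.2, "horse": 0.05, "dog": 0.1,
        "artifact": 0.3, "vehicle": 0.15}

def ancestors(c):
    """Return the set containing c and all its hypernyms up to the root."""
    out = set()
    while c is not None:
        out.add(c)
        c = PARENT[c]
    return out

def resnik_similarity(a, b):
    """Information content (-log p) of the most specific common subsumer."""
    common = ancestors(a) & ancestors(b)
    return max(-math.log(PROB[c]) for c in common)

# Two animals share the informative subsumer "animal", so their score is high;
# horse and vehicle only share the uninformative root "entity", so it is zero.
print(resnik_similarity("horse", "dog") > resnik_similarity("horse", "vehicle"))
```

Each concept detector can then be ranked by its best similarity to the WordNet nouns extracted from the query text.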



We compute the semantic distance between the textual concepts and the multimedia concepts using Resnik's algorithm [8], which calculates the similarity of a concept to each of the WordNet nouns from the query text. Based on the combined scores, we rank each multimedia concept detector in order of expected utility.

C. Image Classification

Concept suggestion based on query image analysis first extracts visual features [5]. Based on the features, we predict for each image a concept using pre-learned visual-only models. Rather than selecting the concept with the maximal score (these are often the most robust but also least informative ones, e.g. people, face, outdoor), we select the model that maximizes the probability of observing this image given the concept. To compute this probability, Bayes' theorem is applied using training set statistics. Hence, we prioritize less frequent, but discriminative, concepts with reasonable probability scores over frequent, but less discriminative, concepts with high probability scores.

D. Rank Combination

We offer users several possibilities to combine the various ranked lists. They can employ standard combination methods such as min, max, sum, and product [9]. In addition, they may specify that some concepts are more important than others by adding weights to individual concepts.

E. Browsing the Result

The result of concept suggestion, the subsequent concept queries, and their combination yields a ranked list of shots. To aid human interpretation in exploring this result, the CrossBrowser visualizes the ranked list (vertical axis) versus the time (horizontal axis) of the program containing the shot. The two dimensions are projected onto a sphere to allow easy navigation. It also enhances focus of attention on the most important elements. Remaining elements are still visible, but much darker (see Fig. 2).

III. DEMONSTRATION

We demonstrate semantic exploration of news video archives with the MediaMill system. We will show how a thesaurus of 500 concepts can be exploited for effective access to video at a semantic level. In addition, we will exhibit novel browsers that present retrieval results using advanced visualizations. Taken together, the search engine provides users with semantic access to news video archives.

REFERENCES

[1] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, D.C. Koelma, G.P. Nguyen, O. de Rooij, and F. Seinstra, "MediaMill: Exploring news video archives based on learned semantics," in Proceedings of the ACM International Conference on Multimedia, Singapore, November 2005, pp. 225–226.
[2] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders, "The challenge problem for automated detection of 101 semantic concepts in multimedia," in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, October 2006, pp. 421–430.
[3] A. Smeaton, "Large scale evaluations of multimedia information retrieval: The TRECVid experience," in CIVR, ser. LNCS, vol. 3569, Springer-Verlag, 2005, pp. 19–27.
[4] C.G.M. Snoek, M. Worring, J.-M. Geusebroek, D.C. Koelma, F.J. Seinstra, and A.W.M. Smeulders, "The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1678–1689, October 2006.
[5] J.C. van Gemert, J.-M. Geusebroek, C.J. Veenman, C.G.M. Snoek, and A.W.M. Smeulders, "Robust scene categorization by learning image statistics in context," in International Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR'06, New York, USA, June 2006.
[6] M. Naphade, J. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE MultiMedia, vol. 13, no. 3, pp. 86–91, 2006.
[7] G.A. Miller, "WordNet: A lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[8] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in IJCAI, 1995.
[9] J. Lee, "Analysis of multiple evidence combination," in Proceedings of ACM SIGIR, 1997, pp. 267–276.