=Paper=
{{Paper
|id=Vol-233/paper-23
|storemode=property
|title=MediaMill: Video Search using a Thesaurus of 500 Machine Learned Concepts
|pdfUrl=https://ceur-ws.org/Vol-233/p47.pdf
|volume=Vol-233
|dblpUrl=https://dblp.org/rec/conf/samt/SnoekWHGSKR06
}}
==MediaMill: Video Search using a Thesaurus of 500 Machine Learned Concepts==
Cees G.M. Snoek, Marcel Worring, Bouke Huurnink, Jan C. van Gemert, Koen E.A. van de Sande, Dennis C. Koelma, and Ork de Rooij
This research is sponsored by the BSIK MultimediaN project. The authors are with the Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands (e-mail: info@mediamill.nl, http://www.mediamill.nl).

Abstract— In this technical demonstration we showcase the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a thesaurus of 500 automatically detected semantic concepts. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concepts based on the textual query and user-specified example images. The result set can be browsed easily to obtain the final result for the query.

Index Terms— Semantic indexing, video retrieval, information visualization.

I. INTRODUCTION
Most commercial video search engines, such as Google, Blinkx, and YouTube, provide access to their repositories based on text, as this is still the easiest way for a user to describe an information need. The indices of these search engines are based on the filename, surrounding text, social tagging, or a transcript. This results in disappointing performance when the visual content is not reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China or the Netherlands, querying the content becomes even harder, as automatic speech recognition results are so much poorer. Additional visual analysis yields more robustness. Thus, a recent trend in video retrieval is to learn a lexicon of semantic concepts from multimedia examples and to employ these as entry points in querying the collection.

Last year we presented the MediaMill 2005 video search engine [1], using a 101-concept lexicon [2] evaluated in the TRECVID benchmark [3]. For our current system we made a jump to a thesaurus of 500 concepts. The items vary from pure format, like a detected split screen, to style, like an interview, to objects, like a horse, and events, like an airplane take-off. Any one of those brings an understanding of the current content. The elements in such a thesaurus offer users a semantic entry to video by allowing them to query on the presence or absence of content elements. For a user, however, selecting the right concept from the large thesaurus is difficult. We therefore developed a suggestion engine that analyzes the textual topic, and possible image examples given by the user, to automatically derive the most relevant concept detectors for querying the video archive (see Fig. 1 and Fig. 2).

II. THE MEDIAMILL 2006 SYSTEM

The data flow of the MediaMill 2006 system is depicted in Fig. 1. We will now highlight its components in more detail.

Fig. 1. Overview of the different processing steps in the MediaMill semantic video search engine. [Flow: the video archive is indexed against the concept thesaurus (semantic indexing); a topic given as text (topic analysis) and as example images (image classification) is turned into concept queries; rank combination merges the resulting ranked lists for browsing.]

A. Semantic Indexing

For semantic indexing we proposed the semantic pathfinder; for details see [4]. First, it extracts features from the visual [5], textual, and auditory modality. The architecture exploits supervised machine learning to automatically label segments with semantic concepts. In the first step, learning is on the content features only. In the second step, the video is analyzed based on its style properties. Finally, semantic concepts are analyzed in context, with the potential to boost index results further. The resulting thesaurus of 500 semantic concepts, covering setting, objects, and people, is learned based on the LSCOM annotations [6] and the 101 concepts used in our 2005 engine [2].
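To make the staged learning concrete, the sketch below trains one concept in each analysis step and keeps the step that scores best on held-out shots, in the spirit of the pathfinder's per-concept path selection [4]. The function name, the scikit-learn SVMs, and the simple validation split are illustrative assumptions, not the actual MediaMill implementation.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

def train_semantic_pathfinder(features_per_step, labels):
    """Learn one concept. features_per_step maps an analysis step name
    ('content', 'style', 'context') to an (n_shots, n_features) array."""
    best_step, best_model, best_ap = None, None, -1.0
    for step, x in features_per_step.items():
        x_tr, x_va, y_tr, y_va = train_test_split(
            x, labels, test_size=0.3, random_state=0)
        model = SVC(probability=True).fit(x_tr, y_tr)
        # Judge this analysis step by average precision on held-out shots.
        ap = average_precision_score(y_va, model.predict_proba(x_va)[:, 1])
        if ap > best_ap:  # keep the step that suits this concept best
            best_step, best_model, best_ap = step, model, ap
    return best_step, best_model
```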
B. Topic Analysis

We map the richness and subjectivity of semantics in user queries to concept detectors available in our thesaurus. To derive the most relevant concepts for a given user topic, we first assign syntactic categories to groups of words in the input text using a chunking algorithm. We then assign a grammatical classification to each word by using a part-of-speech tagger. From there, we look up each noun chunk in WordNet [7]. When a match has been found, those words are eliminated from further lookups. Then we look up any remaining nouns in WordNet. The result is a number of WordNet words related to the input text. Now that both the concepts in the text and the multimedia concept detectors are related to WordNet, we can compute the semantic distance between the textual concepts and the multimedia concepts. We use Resnik's algorithm [8], which calculates the similarity of a concept to each of the WordNet nouns from the query text. Based on the combined scores we rank each multimedia concept detector in order of expected utility.
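As an illustration, the sketch below ranks detectors by Resnik similarity [8] between their WordNet noun senses and the query nouns, using NLTK's WordNet interface. Taking the best match as the combined score, and NLTK itself, are assumptions made for the example; the paper does not prescribe this exact combination.

```python
from nltk.corpus import wordnet as wn, wordnet_ic  # needs the NLTK 'wordnet'
                                                   # and 'wordnet_ic' corpora

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content statistics

def suggest_concepts(query_nouns, detector_names):
    """Rank concept detector names by Resnik similarity to the query nouns."""
    scores = {}
    for name in detector_names:  # multi-word names need underscores in WordNet
        detector_synsets = wn.synsets(name, pos=wn.NOUN)
        if not detector_synsets:
            continue  # detector has no noun sense in WordNet
        scores[name] = max(
            (q.res_similarity(d, brown_ic)
             for noun in query_nouns
             for q in wn.synsets(noun, pos=wn.NOUN)
             for d in detector_synsets),
            default=0.0)
    return sorted(scores, key=scores.get, reverse=True)
```

For the soccer topic of Fig. 2, for example, query nouns like 'goal', 'soccer', and 'match' should push the soccer detector to the top of the list.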
Fig. 2. On the left an example of a query for shots of a goal being made in a soccer match, using both text and image examples, yielding soccer and grass as the most relevant concepts. Results of the query are visualized in the CrossBrowser on the right.

C. Image Classification

Concept suggestion based on query image analysis first extracts visual features [5]. Based on the features, we predict for each image a concept using pre-learned visual-only models. Rather than selecting the concept with the maximal score (such concepts are often the most robust, but also the least informative, e.g. people, face, outdoor), we select the model that maximizes the probability of observing this image given the concept. To compute this, Bayes' theorem is applied using training set statistics. Hence, we prioritize less frequent, but discriminative, concepts with reasonable probability scores over frequent, but less discriminative, concepts with high probability scores.
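The Bayes step reduces to dividing each detector's posterior by the concept's training-set prior, since p(image | concept) is proportional to p(concept | image) / p(concept) for a fixed image. A minimal sketch, with illustrative names and numbers:

```python
import numpy as np

def suggest_concept(posteriors, priors, names):
    """posteriors[i] = detector score p(concept_i | image); priors[i] =
    training-set frequency p(concept_i). The image term p(image) is the
    same for every concept, so it can be dropped from the argmax."""
    likelihood = np.asarray(posteriors) / np.asarray(priors)
    return names[int(np.argmax(likelihood))]

# A frequent concept such as 'people' (prior 0.40, posterior 0.50) scores
# 0.50 / 0.40 = 1.25, and loses to a rarer, more informative concept such as
# 'horse' (prior 0.01, posterior 0.20), which scores 0.20 / 0.01 = 20.
```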
D. Rank Combination

We offer users several possibilities to combine the various ranked lists. They can employ standard combination methods such as min, max, sum, and product [9]. In addition, they may specify that some concepts are more important than others by adding weights to individual concepts.
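A minimal sketch of this combination step, assuming per-concept score lists over the same shots; applying the weights as multiplicative factors is our own choice for the example:

```python
import numpy as np

def combine_rankings(scores, method='sum', weights=None):
    """scores: (n_concepts, n_shots) array of per-concept shot scores."""
    s = np.asarray(scores, dtype=float)
    if weights is not None:
        s = s * np.asarray(weights, dtype=float)[:, None]  # user importance
    combined = {'min': np.min, 'max': np.max,
                'sum': np.sum, 'prod': np.prod}[method](s, axis=0)
    return np.argsort(-combined)  # shot indices, best first
```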
E. Browsing the Result

The result of concept suggestion, the subsequent concept queries, and their combination yields a ranked list of shots. To aid human interpretation in exploring this result, the CrossBrowser visualizes the ranked list (vertical axis) versus the time (horizontal axis) of the program containing the shot. The two dimensions are projected onto a sphere to allow easy navigation. It also enhances focus of attention on the most important elements. Remaining elements are still visible, but much darker (see Fig. 2).
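One way to picture the projection: normalized rank and program time become the polar and azimuthal angles of points on a sphere, so top-ranked shots cluster near the pole facing the viewer. The paper does not specify the exact mapping, so the sketch below is purely illustrative.

```python
import numpy as np

def sphere_layout(ranks, times, radius=1.0):
    """Map rank (0 = best, normalized to 0..1) and program time (0..1)
    to 3-D points on a sphere for the browser to render."""
    theta = np.asarray(ranks) * np.pi      # rank -> polar angle from the pole
    phi = np.asarray(times) * 2 * np.pi    # time -> azimuth around the sphere
    return np.stack([radius * np.sin(theta) * np.cos(phi),
                     radius * np.sin(theta) * np.sin(phi),
                     radius * np.cos(theta)], axis=-1)
```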
III. DEMONSTRATION

We demonstrate semantic exploration of news video archives with the MediaMill system. We will show how a thesaurus of 500 concepts can be exploited for effective access to video at a semantic level. In addition, we will exhibit novel browsers that present retrieval results using advanced visualizations. Taken together, the search engine provides users with semantic access to news video archives.

REFERENCES

[1] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, D. Koelma, G.P. Nguyen, O. de Rooij, and F. Seinstra, “MediaMill: Exploring news video archives based on learned semantics,” in Proceedings of the ACM International Conference on Multimedia, Singapore, November 2005, pp. 225–226.
[2] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, October 2006, pp. 421–430.
[3] A. Smeaton, “Large scale evaluations of multimedia information retrieval: The TRECVid experience,” in CIVR, ser. LNCS, vol. 3569. Springer-Verlag, 2005, pp. 19–27.
[4] C.G.M. Snoek, M. Worring, J.-M. Geusebroek, D.C. Koelma, F.J. Seinstra, and A.W.M. Smeulders, “The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1678–1689, October 2006.
[5] J.C. van Gemert, J.-M. Geusebroek, C.J. Veenman, C.G.M. Snoek, and A.W.M. Smeulders, “Robust scene categorization by learning image statistics in context,” in International Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR’06, New York, USA, June 2006.
[6] M. Naphade, J. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, “Large-scale concept ontology for multimedia,” IEEE Multimedia, vol. 13, no. 3, pp. 86–91, 2006.
[7] G.A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, pp. 39–41, 1995.
[8] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in IJCAI, 1995.
[9] J. Lee, “Analysis of multiple evidence combination,” in Proceedings of ACM SIGIR, 1997, pp. 267–276.