MediaMill: Video Search using a Thesaurus of 500 Machine Learned Concepts

Cees G.M. Snoek, Marcel Worring, Bouke Huurnink, Jan C. van Gemert, Koen E.A. van de Sande, Dennis C. Koelma, and Ork de Rooij

This research is sponsored by the BSIK MultimediaN project. The authors are with the Intelligent Systems Lab Amsterdam, Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands (e-mail: info@mediamill.nl, http://www.mediamill.nl).

Abstract— In this technical demonstration we showcase the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a thesaurus of 500 automatically detected semantic concepts. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concepts based on the textual query and user-specified example images. The result set can be browsed easily to obtain the final result for the query.

Index Terms— Semantic indexing, video retrieval, information visualization.

I. INTRODUCTION

Most commercial video search engines, such as Google, Blinkx, and YouTube, provide access to their repositories based on text, as this is still the easiest way for a user to describe an information need. The indices of these search engines are based on the filename, surrounding text, social tagging, or a transcript. This results in disappointing performance when the visual content is not reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China or the Netherlands, querying the content becomes even harder, as automatic speech recognition results are much poorer. Additional visual analysis yields more robustness. Thus, a recent trend in video retrieval is to learn a lexicon of semantic concepts from multimedia examples and to employ these as entry points in querying the collection.

Last year we presented the MediaMill 2005 video search engine [1], which used a lexicon of 101 concepts [2] evaluated in the TRECVID benchmark [3]. For our current system we made the jump to a thesaurus of 500 concepts. The items vary from a pure format, such as a detected split screen, to a style, such as an interview, an object, such as a horse, or an event, such as an airplane take-off. Each of them contributes to an understanding of the current content. The elements in such a thesaurus offer users a semantic entry to video by allowing them to query on the presence or absence of content elements. For a user, however, selecting the right concepts from the large thesaurus is difficult. We therefore developed a suggestion engine that analyzes the textual topic, and possible image examples given by the user, to automatically derive the most relevant concept detectors for querying the video archive (see Fig. 1 and Fig. 2).

II. THE MEDIAMILL 2006 SYSTEM

The data flow of the MediaMill 2006 system is depicted in Fig. 1. We will now highlight its components in more detail.

[Figure: data-flow diagram. Semantic indexing turns the video archive into a concept thesaurus; topic (text) analysis and image classification of user-supplied examples each produce a concept query; the resulting set of concepts yields ranked lists that are merged by rank combination and explored by browsing.]
Fig. 1. Overview of the different processing steps in the MediaMill semantic video search engine.

A. Semantic Indexing

For semantic indexing we proposed the semantic pathfinder; for details see [4]. First, it extracts features from the visual [5], textual, and auditory modality. The architecture exploits supervised machine learning to automatically label video segments with semantic concepts. In the first step, learning is based on the content features only. In the second step, the video is analyzed based on its style properties. Finally, semantic concepts are analyzed in context, with the potential to boost index results further. The resulting thesaurus of 500 semantic concepts, covering setting, objects, and people, is learned from the LSCOM annotations [6] and the 101 concepts used in our 2005 engine [2].

B. Topic Analysis

We map the richness and subjectivity of semantics in user queries to the concept detectors available in our thesaurus. To derive the most relevant concepts for a given user topic, we first assign syntactic categories to groups of words in the input text using a chunking algorithm, and then assign a grammatical classification to each word using a part-of-speech tagger. From there, we look up each noun chunk in WordNet [7]; when a match is found, those words are eliminated from further lookups. We then look up any remaining nouns in WordNet. The result is a set of WordNet words related to the input text. Now that both the concepts in the text and the multimedia concept detectors are related to WordNet, we can compute the semantic distance between the textual concepts and the multimedia concepts. We use Resnik's algorithm [8], which calculates the similarity of a concept to each of the WordNet nouns from the query text. Based on the combined scores, we rank the multimedia concept detectors in order of expected utility.
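To make this step concrete, the following is a minimal sketch of such a detector ranking using NLTK's WordNet interface and Resnik similarity. It is an illustration under our own assumptions, not the MediaMill implementation: the detector list is a hypothetical slice of the thesaurus, the Brown corpus supplies the information-content statistics, the chunking step is omitted for brevity, and taking the best per-detector similarity stands in for the combined score described above.

    # Sketch of WordNet-based concept suggestion (illustrative, not the actual
    # MediaMill code). Requires NLTK data: 'punkt', 'averaged_perceptron_tagger',
    # 'wordnet', 'wordnet_ic'.
    import nltk
    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content statistics

    # Hypothetical subset of the 500-concept thesaurus.
    DETECTORS = ['soccer', 'grass', 'horse', 'interview', 'aircraft']

    def query_noun_synsets(text):
        """Return WordNet noun synsets for the nouns in the query text."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))  # part-of-speech tagging
        nouns = [word for word, tag in tagged if tag.startswith('NN')]
        return [s for n in nouns for s in wn.synsets(n, pos=wn.NOUN)]

    def rank_detectors(text):
        """Rank detectors by their best Resnik similarity to any query noun."""
        query = query_noun_synsets(text)
        scores = {}
        for name in DETECTORS:
            sims = [s.res_similarity(q, brown_ic)
                    for s in wn.synsets(name, pos=wn.NOUN) for q in query]
            scores[name] = max(sims, default=0.0)
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(rank_detectors('shots of a goal being made in a soccer match'))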
To presence or absence of content elements. For a user, however, derive the most relevant concepts for a given user topic, we selecting the right topic from the large thesaurus is difficult. first assign syntactic categories to groups of words in the input We therefore developed a suggestion engine that analyzes the text using a chunking algorithm. We then assign a grammatical textual topic, and possible image examples given by the user, classification to each word by using a part-of-speech tagger. to automatically derive the most relevant concept detectors for From there, looking up each noun chunk in WordNet [7]. querying the video archive (see Fig. 1 and Fig. 2). When a match has been found those words are eliminated from further lookups. Then we look up any remaining nouns This research is sponsored by the BSIK MultimediaN project. The authors in WordNet. The result is a number of WordNet words related are with the Intelligent Systems Lab Amsterdam, Informatics Institute, Uni- versity of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands to the input text. Now that both the concepts in the text and the (e-mail: info@mediamill.nl, http://www.mediamill.nl). multimedia concept detectors are related to WordNet, we can Fig. 2. On the left an example of a query for shots of a goal being made in a soccer match, using both text and image examples, yielding soccer and grass as most relevant concepts. Result of the query are visualized in the CrossBrowser on the right. compute the semantic distance between the textual concepts important elements. Remaining elements are still visible, but and the multimedia concepts. We use Resnik’s algorithm [8] much darker (see Fig. 2). which calculates the similarity of a concept to each of the III. D EMONSTRATION WordNet nouns from the query text. Based on the combined scores we rank each multimedia concept detector in order of We demonstrate semantic exploration of news video expected utility. archives with the MediaMill system. We will show how a thesaurus of 500 concepts can be exploited for effective access C. Image Classification to video at a semantic level. In addition, we will exhibit novel browsers that present retrieval results using advanced Concept suggestion based on query image analysis first visualizations. Taken together, the search engine provides users extracts visual features [5]. Based on the features we predict with semantic access to news video archives. for each image a concept using pre-learned visual-only models. Rather than selecting the concept with maximal score –which R EFERENCES are often the most robust but also least informative ones, e.g. [1] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, people, face, outdoor – we select the model that maximizes D.Koelma, G.P. Nguyen, O. de Rooij, and F. Seinstra, “MediaMill: Exploring news video archives based on learned semantics,” Singapore, the probability of observing this image given the concept. To November 2005, pp. 225–226. compute, Bayes’ theorem is applied using training set statis- [2] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and tics. Hence, we prioritize less frequent, but discriminative, A.W.M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in Proceedings of the ACM concepts with reasonable probability scores over frequent, but International Conference on Multimedia, Santa Barbara, USA, October less discriminative, concepts with high probability scores. 2006, pp. 421–430. [3] A. 
D. Rank Combination

We offer users several possibilities to combine the various ranked lists. They can employ standard combination methods such as min, max, sum, and product [9]. In addition, they may specify that some concepts are more important than others by assigning weights to individual concepts.

E. Browsing the Result

Concept suggestion, the subsequent concept queries, and their combination yield a ranked list of shots. To aid human interpretation in exploring this result, the CrossBrowser visualizes the ranked list (vertical axis) against the time (horizontal axis) of the program containing each shot. The two dimensions are projected onto a sphere to allow easy navigation. The projection also enhances the focus of attention on the most important elements; the remaining elements are still visible, but much darker (see Fig. 2).

III. DEMONSTRATION

We demonstrate semantic exploration of news video archives with the MediaMill system. We will show how a thesaurus of 500 concepts can be exploited for effective access to video at a semantic level. In addition, we will exhibit novel browsers that present retrieval results using advanced visualizations. Taken together, the search engine provides users with semantic access to news video archives.

REFERENCES

[1] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, D. Koelma, G.P. Nguyen, O. de Rooij, and F. Seinstra, “MediaMill: Exploring news video archives based on learned semantics,” in Proceedings of the ACM International Conference on Multimedia, Singapore, November 2005, pp. 225–226.
[2] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in Proceedings of the ACM International Conference on Multimedia, Santa Barbara, USA, October 2006, pp. 421–430.
[3] A. Smeaton, “Large scale evaluations of multimedia information retrieval: The TRECVid experience,” in CIVR, ser. LNCS, vol. 3569, Springer-Verlag, 2005, pp. 19–27.
[4] C.G.M. Snoek, M. Worring, J.-M. Geusebroek, D.C. Koelma, F.J. Seinstra, and A.W.M. Smeulders, “The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1678–1689, October 2006.
[5] J.C. van Gemert, J.-M. Geusebroek, C.J. Veenman, C.G.M. Snoek, and A.W.M. Smeulders, “Robust scene categorization by learning image statistics in context,” in International Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR’06, New York, USA, June 2006.
[6] M. Naphade, J. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, “Large-scale concept ontology for multimedia,” IEEE Multimedia, vol. 13, no. 3, pp. 86–91, 2006.
[7] G.A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, pp. 39–41, 1995.
[8] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in IJCAI, 1995.
[9] J. Lee, “Analysis of multiple evidence combination,” in Proceedings of ACM SIGIR, 1997, pp. 267–276.