Integrating Social Tagging and Document Annotation for Content-Based Search in Multimedia Data

Harald Sack, Jörg Waitelonis
Institut für Informatik, Friedrich-Schiller-Universität Jena, Germany
sack@minet.uni-jena.de, joerg@minet.uni-jena.de

ABSTRACT
Collaborative tagging systems have become rather popular for annotating any kind of resource, ranging from electronic documents to real-world objects. In current tagging systems, resources as a whole are annotated with and referenced by user-defined tags. For multimedia data, e.g. video data, single scenes can be identified and annotated by using MPEG-7 metadata. We propose a collaborative tagging system that is combined with an automated annotation system for synchronized multimedia presentations. MPEG-7 metadata are used to annotate single scenes with user-compiled tagging information in combination with metadata provided directly by the author or by other annotation systems. Thus, we propose a system that is able to search within multimedia data and that can further be extended to search within any kind of (partial) document to achieve a more tightly focused and personalized search.

1. INTRODUCTION
Online social networking enables collaboration relationships and allows these relationships to be exploited for automated information distribution and classification. In particular, collaborative tagging systems (CTS) have become increasingly popular for annotating any kind of electronic document (e.g. web pages, images, videos) or even real-world objects (e.g. books, consumer goods, people). In a CTS the users assign freely chosen terms (i.e. tags) to specific resources with the purpose of referencing those resources later on with the help of the assigned tags.

By also considering other users' tags, the serendipitous discovery of new, previously unknown resources becomes possible via so-called tag browsing, i.e. all resources that are annotated with the same tag(s) as a given resource are referenced. For an overview of CTS see [8, 6]. Current CTS usually consider the resources being tagged as a whole. Thus, a tag-based search produces a hit list that contains entire resources, although the tags describing these resources might refer only to specific parts of those resources. In the case of electronic documents, e.g. HTML-encoded documents, single parts or fractions of the document can only be referenced if the document author – and not the document reader – has provided anchors within the document for the identification of those document parts. In the case of multimedia data, e.g. recorded video, specific document parts – i.e. single video scenes – can be identified and annotated by using MPEG-7 metadata.

We propose the combination of a CTS with an automated annotation system for synchronized multimedia presentations that is able to annotate single parts of multimedia data with user-defined tags. We have developed a system for the automated annotation of synchronized multimedia documents that is focused on lecture recordings. The video recording of the lecturer is synchronized with a recorded desktop presentation [11], which serves as a basis for the automated creation of MPEG-7 metadata and enables content-based annotation of single scenes within the video recording. This MPEG-7 annotation is enriched with user-defined tags to enable a personalized search that can be performed on a large multimedia database as well as within a single multimedia file.

The paper is structured as follows: Section 2 gives a short overview of related work concerning video annotation systems and CTS. Section 3 illustrates our approach, which combines tagging information with MPEG-7 metadata, and shows how to apply this combined information for content-based search within multimedia data.
Section 4 concludes the paper with an outlook on how to apply our concept of partial document tagging to the processing of large text documents.

2. RELATED WORK
In this section we give a short overview of current video annotation systems and CTS. The service that we are focusing on in this paper combines collaborative tagging and traditional video annotation. MPEG-7 [4, 9] is an XML-based markup language for the description and annotation of multimedia data. We have developed an MPEG-7-based annotation service that is focused on the automated annotation of lecture video recordings. The recorded video is synchronized with a desktop presentation given by the lecturer. The textual content of this presentation is used to annotate single sections of the video with weighted descriptors. A keyword-based search can be performed on the annotated video recordings, resulting in a list of video sections related to the search term (see [11] for a more detailed description). Repp and Meinel have proposed a similar video annotation system that applies speech recognition to annotate each part of the video data [10]. In a similar way, Hauptmann et al. extracted textual annotation from recorded video by applying OCR and speech recognition [7]. A drawback of the just mentioned video annotation systems is that the annotation is conducted in a centralized way, either by the author or producer of the video or by an independent automated system. The user of the video data does not have the possibility to add his own annotations and to make them available for the system's search facilities. Furthermore, the reliability of speech recognition itself depends on training data, and it is difficult to identify context and semantically connected content. Other video annotation tools [1, 3, 12] enable personalized search facilities, but without simultaneously providing a platform that is able to use annotations from different users in a collaborative way.

CTS enable personalized annotation of resources that can be utilized collaboratively by all users. YouTube [2] is a rather popular system for the collaborative annotation of video data. But YouTube only allows the annotation of the video data as a whole and not the annotation of single parts of a video document. The majority of the video clips available on YouTube are rather short, and most of the time those clips only cover a single subject. Thus, for YouTube it is probably not necessary to provide a possibility for partial document annotation. Our system is focused on lecture recordings, where most lectures cover a variety of different topics. By providing partial document annotation facilities, the user is able to annotate single video scenes that are related to a specific topic according to his own interests. By also considering those annotations that have been provided by other users, the system enables the discovery of related (similar) video scenes by tag browsing.

3. INTEGRATING COLLABORATIVE TAGGING INFORMATION AND MPEG-7

3.1 MPEG-7 Encoding
This section describes how MPEG-7 metadata can be used to maintain collaborative tagging information. MPEG-7 is an XML-based markup language for the description of multimedia metadata. Besides various standard metadata information, MPEG-7 enables the identification and annotation of distinct spatial and temporal segments within multimedia data. For our purpose, the description of the temporal decomposition of video data is essential. Thereby, MPEG-7 allows the identification and annotation of overlapping temporal segments, which is a prerequisite for storing collaborative tagging information that is provided by different users.

Video segments can be annotated with various information by utilizing the <TemporalDecomposition> element of the MPEG-7 metadata description scheme. Each video segment is identified and annotated with the <VideoSegment> element (see Fig. 1). Within each <VideoSegment>, the elements <MediaTimePoint> and <MediaDuration> specify the segment's temporal location within the video stream (see Fig. 2). For textual annotation, MPEG-7 provides the elements <FreeTextAnnotation>, <KeywordAnnotation>, and <StructuredAnnotation>. The information connected to these elements can be utilized for a keyword-based search within the video data, facilitating fine-grained access.

Figure 1: Simplified MPEG-7 basic elements.

Figure 2: Simplified <VideoSegment> element.
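The following abbreviated fragment sketches such a temporal decomposition with a single annotated video segment, using the example values of Fig. 2; namespace declarations and the enclosing MPEG-7 wrapper elements are omitted, and the nesting follows the MPEG-7 multimedia description schemes rather than reproducing the figure verbatim:

    <TemporalDecomposition>
      <VideoSegment id="seg1">
        <!-- textual annotation of the segment -->
        <TextAnnotation>
          <FreeTextAnnotation>billy the cat is catching a mouse</FreeTextAnnotation>
          <KeywordAnnotation>
            <Keyword>cat</Keyword>
            <Keyword>mouse</Keyword>
          </KeywordAnnotation>
        </TextAnnotation>
        <!-- temporal location of the segment within the video stream -->
        <MediaTime>
          <MediaTimePoint>T00:05:05:0F25</MediaTimePoint>
          <MediaDuration>PT00H00M31S0N25F</MediaDuration>
        </MediaTime>
      </VideoSegment>
    </TemporalDecomposition>

A keyword-based query for, e.g., "mouse" can thus be answered with the exact media time of the matching segment instead of with the video as a whole.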
For the integration of collaborative tagging information into the MPEG-7 metadata description schema, an obvious approach would be to use one of these textual annotation elements directly with each video segment. But for each set of tags, additional user-dependent information has to be stored to facilitate a personalized search. Collaborative tagging information can be encoded as a tuple

({tagset}, username, date, [rating]),

where a set of tags is supplemented by user, date, and auxiliary (optional) rating information. Therefore, instead of a plain keyword list, we use an annotation element that allows a video segment to be annotated with user-specific textual information, including a rating indicator (see Fig. 3). The tagset denotes the set of all tags that a distinct user has employed to annotate a video segment. It is represented as a comma-separated list of tags within the annotation element. The date of the last modification of the tagset is encoded in a dedicated date element, and the user identification is encoded in an element that is derived from the MPEG-7 agent type. Furthermore, an optional rating indicator can be included to enable the ranking of video content. Thus, this annotation element provides the possibility to store all necessary collaborative tagging information. It is embedded inside the textual annotation of a video segment, and several of these elements – each representing the annotations of a different user – can be combined within a single video segment.

Figure 3: Simplified per-user tag annotation element, showing a tagset (tag1, tag2, tag3), the annotating user (Harald Sack), the date of the last modification, and a rating value.
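One possible layout of such a segment carrying tag annotations from two different users is sketched below. The mapping of the rating onto the relevance attribute and the placement of the user and date information are exemplary only and do not prescribe the concrete elements of the description scheme; the tag and time values are likewise invented for illustration:

    <VideoSegment id="seg17">
      <!-- first user's tuple: tagset, optional rating, user identification, date -->
      <TextAnnotation relevance="0.8">
        <FreeTextAnnotation>semweb, rdf, exam</FreeTextAnnotation>
        <!-- user identification (an element derived from the MPEG-7 agent type) -->
        <!-- date of the last modification of the tagset -->
      </TextAnnotation>
      <!-- second user's tuple for the same segment -->
      <TextAnnotation relevance="0.3">
        <FreeTextAnnotation>exercise, homework</FreeTextAnnotation>
      </TextAnnotation>
      <MediaTime>
        <MediaTimePoint>T00:12:40:0F25</MediaTimePoint>
        <MediaDuration>PT00H01M05S0N25F</MediaDuration>
      </MediaTime>
    </VideoSegment>

A personalized search then restricts matching to the annotation elements of the querying user, whereas a general search considers all of them together with the author-provided annotations.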
3.2 Browser-Based User Interface
For the collaborative tagging of video segments, the design of an efficient user interface is mandatory. Thus, we define three distinct areas in the browser's user interface: the video display area (1), the tag display area (2), and the tag/segment definition area (3) (see Fig. 4 for an overview of the user interface). The tag display is organized as a tag cloud (2). The single tags are ordered alphabetically, while their font size indicates additional information that can refer to frequency of usage or tag rating (according to the relevance indicator). We consider different display modes: either personal or popular tags can be displayed, while a static view includes all tags for the entire video, in contrast to a dynamic view that refers to tags used at a distinct point in time within the video. By pointing at a tag with the mouse, a list of video segments annotated with that tag is displayed in a separate window (4). There, the video segments are represented by a miniature screenshot and by their starting time and end time. The user can select a particular video segment from the list for playback.

On the other hand, the user has to get an overview of all (non-disjunctive) segments that have already been annotated in the video. This information is displayed within a coordinate system with the x-axis representing the timeline and the y-axis representing overlapping sequences (5). By pointing at a video sequence within the coordinate system, all tags referring to that segment are displayed.

Besides user annotation, we also consider annotations provided by the author of a video resource. These annotations can include structural information (cut points) as well as semantic information (tags, headings, comments). The interface provides the possibility to use the annotation given by the author as a default starting point for user-dependent annotation. Alternatively, the video can be pre-cut at fixed time intervals that can be fine-tuned by the user. For selecting a new video sequence to be annotated, the user is able to mark starting time and end time simply by clicking dedicated buttons in the video display during playback and/or by adjusting those cut points in a separate timeline display (6). After selecting a video sequence, the user is able to add his tags in a separate tag definition window (7). For faster processing it is possible to place tags just at a specific point in time during video playback without denoting an entire segment. Then, the starting point and end point of the sequence being annotated with that tag are chosen using predefined or author-given cut points. To highlight the most important parts of a video, a rating index is displayed along a separate timeline (8).

Figure 4: User interface combining collaborative tagging and MPEG-7 annotation.

3.3 Searching Tagged MPEG-7 Metadata
CTS enable different ways of searching the system's resources.

Personalized Search. By utilizing his own set of tags, the user is able to perform a search based on his personal information needs. These tags can be descriptive or functional by nature, i.e. they either describe a resource in general – and are thus also useful for other users – or they draw the focus to specific, personally relevant aspects of a resource and can be used to extend a general search according to personal information needs. For example, the user might tag several sequences of a lecture video that are relevant for an examination with the tag exam.

General Search. By considering the (descriptive) tags of all users in combination with the original MPEG-7 annotations of the resource's author, a general keyword-based search can be performed.

Tag Browsing. Here, we refer to the retrieval of all resources that are annotated with the same tags as a specific resource under current consideration. Especially those resources become important that have been annotated with the same tags, but by other users. In that way the user is able to discover new resources that are considered to be similar to the original resource.

Social Networking. Additionally, in CTS the inherent social network of users can be considered. To participate in a CTS the user has to register, which often includes the delivery of a personal profile. Thus, a social network can be defined connecting users that are considered to be similar according to their profiles. On the other hand, users that have annotated the same resource (probably even with the same tags) can be considered to be similar. Thus, by browsing resources that have been annotated by similar users, new relevant resources can be discovered.
4. CONCLUSIONS AND OUTLOOK
We have shown how to integrate collaborative tagging information within an MPEG-7 framework to facilitate a search function on multimedia data that is able to deliver the distinct parts of interest within a multimedia document. In contrast to current CTS, our approach allows the annotation of partial documents, which is important especially for time-dependent media such as video data. A prototype of the proposed system for collaborative video scene tagging and retrieval is currently under development.

The concept of collaboratively annotating partial video documents can be extended to other types of media, e.g. to large text documents (textbooks). There, the users (document readers) should have the possibility to annotate distinct sections of the text document and to benefit from these annotations in a personal or collaborative way. The identification of distinct sections within any type of document can be realized with the help of the document object model (DOM) [5]. The DOM representation of a document is a rooted graph (document tree), where different sections (at different levels of the document's hierarchy) are represented by nodes that can be linked with user annotations. Thus, with the collaborative annotation of partial documents, a more focused and personalized search can be achieved for any type of document.
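For example, a tagset attached to a distinct section of an (X)HTML textbook could be stored by addressing the corresponding DOM node, e.g. via an XPath expression. The annotation and tagset elements in the following sketch are purely illustrative and not part of any standard referenced above:

    <annotation>
      <!-- DOM node of the annotated section, addressed by an XPath expression -->
      <target document="http://example.org/textbook.html"
              node="/html/body/div[3]/p[4]"/>
      <!-- per-user tagset with date and optional rating, analogous to the video segment case -->
      <tagset user="jdoe" date="2006-09-01" rating="0.9">exam, normalization, joins</tagset>
    </annotation>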
5. REFERENCES
[1] Ricoh MovieTool.
[2] YouTube – video sharing and tagging system, http://www.youtube.com/.
[3] D. Bargeron, A. Gupta, J. Grudin, and E. Sanocki. Annotations for streaming video on the web: System design and usage studies. Computer Networks, 31(11-16), 1999.
[4] S. F. Chang, T. Sikora, and A. Puri. Overview of the MPEG-7 Standard. IEEE Trans. Circuits and Systems for Video Technology, 11(6):688–695, 2001.
[5] Document Object Model (DOM) Level 1 Specification, http://www.w3.org/TR/REC-DOM-Level-1/.
[6] S. Golder and B. A. Huberman. Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2):198–208, 2006.
[7] A. G. Hauptmann, R. Jin, and T. D. Ng. Multi-modal information retrieval from broadcast video using OCR and speech recognition. In JCDL'02: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 160–161, 2002.
[8] C. Marlow, M. Naaman, D. Boyd, and M. Davis. Position Paper, Tagging, Taxonomy, Flickr, Article, ToRead. In Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, May 2006.
[9] National Institute of Standards and Technology. NIST MPEG-7 Validation Service and MPEG-7 XML-schema specifications, http://m7itb.nist.gov/M7Validation.html.
[10] S. Repp and C. Meinel. Semantic indexing for recorded educational lecture videos. In 4th Annual IEEE Int. Conference on Pervasive Computing and Communications Workshops (PERCOMW'06), 2006.
[11] H. Sack and J. Waitelonis. Automated annotations of synchronized multimedia presentations. In Proceedings of the ESWC 2006 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation, CEUR Workshop Proceedings, June 2006.
[12] J. R. Smith and B. Lugeon. A visual annotation tool for multimedia content description. In Proc. SPIE Photonics East, Internet Multimedia Management Systems, pages 160–161, 2000.