Integrating Social Tagging and Document Annotation for Content-Based Search in Multimedia Data

Harald Sack, Jörg Waitelonis
Institut für Informatik, Friedrich-Schiller-Universität Jena, Germany
sack@minet.uni-jena.de, joerg@minet.uni-jena.de

ABSTRACT
Collaborative tagging systems have become rather popular for annotating any kind of resource, ranging from electronic documents to real-world objects. In current tagging systems, resources as a whole are annotated with and referenced by user-defined tags. For multimedia data, e.g. video data, single scenes can be identified and annotated by using MPEG-7 metadata. We propose a collaborative tagging system that is combined with an automated annotation system for synchronized multimedia presentations. MPEG-7 metadata are used to annotate single scenes with user-compiled tagging information in combination with metadata provided directly by the author or by other annotation systems. Thus, we propose a system that is able to search within multimedia data and that can further be extended to search within any kind of (partial) document to achieve a more tightly focused and personalized search.

1. INTRODUCTION
Online social networking enables collaboration relationships and allows these relationships to be exploited for automated information distribution and classification. In particular, collaborative tagging systems (CTS) have become increasingly popular for annotating any kind of electronic document (e.g. web pages, images, videos) or even real-world objects (e.g. books, consumer goods, people). In a CTS the users assign freely chosen terms (i.e. tags) to specific resources with the purpose of referencing those resources later on with the help of the assigned tags.

By also considering other users' tags, the serendipitous discovery of new, previously unknown resources becomes possible via so-called tag browsing, i.e. all resources that are annotated with the same tag(s) as a given resource are referenced. For an overview of CTS see [8, 6]. Current CTS usually consider the resources being tagged as a whole. Thus, a tag-based search produces a hit list that contains entire resources, although the tags describing these resources might refer only to specific parts of those resources. In the case of electronic documents, e.g. HTML-encoded documents, single parts or fractions of the document can only be referenced if the document author – and not the document reader – has provided anchors within the document for the identification of those document parts. In the case of multimedia data, e.g. recorded video, specific document parts – i.e. single video scenes – can be identified and annotated by using MPEG-7 metadata.

We propose the combination of a CTS with an automated annotation system for synchronized multimedia presentations that is able to annotate single parts of multimedia data with user-defined tags. We have developed a system for the automated annotation of synchronized multimedia documents that is focused on lecture recordings. The video recording of the lecturer is synchronized with a recorded desktop presentation [11], which serves as a basis for the automated creation of MPEG-7 metadata and enables content-based annotation of single scenes within the video recording. This MPEG-7 annotation is enriched with user-defined tags to enable a personalized search that can be performed on a large multimedia database as well as within a single multimedia file.

The paper is structured as follows: Section 2 gives a short overview of related work concerning video annotation systems and CTS. Section 3 illustrates our approach, which combines tagging information with MPEG-7 metadata, and shows how to apply this combined information for content-based search within multimedia data.
Section 4 concludes the paper with an outlook on how to apply our concept of partial document tagging to the processing of large text documents.

2. RELATED WORK
In this section we give a short overview of current video annotation systems and CTS. The service that we are focusing on in this paper combines collaborative tagging and traditional video annotation. MPEG-7 [4, 9] is an XML-based markup language for the description and annotation of multimedia data. We have developed an MPEG-7-based annotation service that is focused on the automated annotation of lecture video recordings. The recorded video is synchronized with a desktop presentation given by the lecturer. The textual content of this presentation is used to annotate single sections of the video with weighted descriptors. A keyword-based search can be performed on the annotated video recordings, resulting in a list of video sections related to the search term (see [11] for a more detailed description). Repp and Meinel have proposed a similar video annotation system that applies speech recognition to annotate each part of the video data [10]. In a similar way, Hauptmann et al. extracted textual annotation from recorded video by applying OCR and speech recognition [7]. A drawback of the just mentioned video annotation systems is that the annotation is conducted in a centralized way, either by the author or producer of the video or by an independent automated system. The user of the video data does not have the possibility to add his own annotations and to make them available for the system's search facilities. Furthermore, the reliability of speech recognition itself depends on training data, and it is difficult to identify context and semantically connected content. Other video annotation tools [1, 3, 12] enable personalized search facilities, but without simultaneously providing a platform that is able to use annotations from different users in a collaborative way.

CTS enable personalized annotation of resources that can be utilized collaboratively by all users. YouTube [2] is a rather popular system for the collaborative annotation of video data. But YouTube only allows the annotation of the video data as a whole and not the annotation of single parts of a video document. The majority of the video clips available on YouTube are rather short, and most of the time those clips only cover a single subject. Thus, for YouTube it is probably not necessary to provide a possibility for partial document annotation. Our system is focused on lecture recordings, where most lectures cover a variety of different topics. By providing partial document annotation facilities, the user is able to annotate single video scenes that are related to a specific topic according to his own interests. By also considering those annotations that have been provided by other users, the system enables the discovery of related (similar) video scenes by tag browsing.

3. INTEGRATING COLLABORATIVE TAGGING INFORMATION AND MPEG-7

3.1 MPEG-7 Encoding
This section describes how MPEG-7 metadata can be used to maintain collaborative tagging information. MPEG-7 is an XML-based markup language for the description of multimedia metadata. Besides various standard metadata information, MPEG-7 enables the identification and annotation of distinct spatial and temporal segments within multimedia data. For our purpose, the description of the temporal decomposition of video data is essential. Thereby, MPEG-7 allows the identification and annotation of overlapping temporal segments, which is a prerequisite for storing collaborative tagging information that is provided by different users.

Video segments can be annotated with various information by utilizing the <TemporalDecomposition> element of the MPEG-7 metadata description scheme. Each video segment is identified and annotated with the <VideoSegment> element (see Fig. 1). Within each <VideoSegment>, the elements <MediaTimePoint> and <MediaDuration> specify the segment's temporal location within the video stream (see Fig. 2). For textual annotation, MPEG-7 provides the elements <FreeTextAnnotation>, <KeywordAnnotation>, and <StructuredAnnotation>. The information connected to these elements can be utilized for a keyword-based search within the video data, facilitating fine-grained access.

Figure 1: Simplified MPEG-7 basic elements.

Figure 2: Simplified <VideoSegment> element.
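The following abbreviated fragment sketches such a temporal decomposition with a single annotated video segment, using the example values of Fig. 2; namespace declarations and the enclosing MPEG-7 wrapper elements are omitted, and the nesting follows the MPEG-7 multimedia description schemes rather than reproducing the figure verbatim:

    <TemporalDecomposition>
      <VideoSegment id="seg1">
        <!-- textual annotation of the segment -->
        <TextAnnotation>
          <FreeTextAnnotation>billy the cat is catching a mouse</FreeTextAnnotation>
          <KeywordAnnotation>
            <Keyword>cat</Keyword>
            <Keyword>mouse</Keyword>
          </KeywordAnnotation>
        </TextAnnotation>
        <!-- temporal location of the segment within the video stream -->
        <MediaTime>
          <MediaTimePoint>T00:05:05:0F25</MediaTimePoint>
          <MediaDuration>PT00H00M31S0N25F</MediaDuration>
        </MediaTime>
      </VideoSegment>
    </TemporalDecomposition>

A keyword-based query for, e.g., "mouse" can thus be answered with the exact media time of the matching segment instead of with the video as a whole.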
For the integration of collaborative tagging information into the MPEG-7 metadata description schema, an obvious approach would be to use one of these textual annotation elements directly with each video segment. But for each set of tags, additional user-dependent information has to be stored to facilitate a personalized search. Collaborative tagging information can be encoded as a tuple

({tagset}, username, date, [rating]),

where a set of tags is supplemented by user, date, and auxiliary (optional) rating information. Therefore, instead of a plain keyword list, we use an annotation element that allows a video segment to be annotated with user-specific textual information, including a rating indicator (see Fig. 3). The tagset denotes the set of all tags that a distinct user has employed to annotate a video segment. It is represented as a comma-separated list of tags within the annotation element. The date of the last modification of the tagset is encoded in a dedicated date element, and the user identification is encoded in an element that is derived from the MPEG-7 agent type. Furthermore, an optional rating indicator can be included to enable the ranking of video content. Thus, this annotation element provides the possibility to store all necessary collaborative tagging information. It is embedded inside the textual annotation of a video segment, and several of these elements – each representing the annotations of a different user – can be combined within a single video segment.

Figure 3: Simplified per-user tag annotation element, showing a tagset (tag1, tag2, tag3), the annotating user (Harald Sack), the date of the last modification, and a rating value.
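One possible layout of such a segment carrying tag annotations from two different users is sketched below. The mapping of the rating onto the relevance attribute and the placement of the user and date information are exemplary only and do not prescribe the concrete elements of the description scheme; the tag and time values are likewise invented for illustration:

    <VideoSegment id="seg17">
      <!-- first user's tuple: tagset, optional rating, user identification, date -->
      <TextAnnotation relevance="0.8">
        <FreeTextAnnotation>semweb, rdf, exam</FreeTextAnnotation>
        <!-- user identification (an element derived from the MPEG-7 agent type) -->
        <!-- date of the last modification of the tagset -->
      </TextAnnotation>
      <!-- second user's tuple for the same segment -->
      <TextAnnotation relevance="0.3">
        <FreeTextAnnotation>exercise, homework</FreeTextAnnotation>
      </TextAnnotation>
      <MediaTime>
        <MediaTimePoint>T00:12:40:0F25</MediaTimePoint>
        <MediaDuration>PT00H01M05S0N25F</MediaDuration>
      </MediaTime>
    </VideoSegment>

A personalized search then restricts matching to the annotation elements of the querying user, whereas a general search considers all of them together with the author-provided annotations.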
3.2 Browser-Based User Interface
For the collaborative tagging of video segments, the design of an efficient user interface is mandatory. Thus, we define three distinct areas in the browser's user interface: the video display area (1), the tag display area (2), and the tag/segment definition area (3) (see Fig. 4 for an overview of the user interface). The tag display is organized as a tag cloud (2). The single tags are ordered alphabetically, while their font size indicates additional information that can refer to frequency of usage or tag rating (according to the relevance indicator). We consider different display modes: either personal or popular tags can be displayed, while a static view includes all tags for the entire video, in contrast to a dynamic view that refers to tags used at a distinct point in time within the video. By pointing at a tag with the mouse, a list of video segments annotated with that tag is displayed in a separate window (4). There, the video segments are represented by a miniature screenshot and by their starting time and end time. The user can select a particular video segment from the list for playback.

On the other hand, the user has to get an overview of all (non-disjunctive) segments that have already been annotated in the video. This information is displayed within a coordinate system with the x-axis representing the timeline and the y-axis representing overlapping sequences (5). By pointing at a video sequence within the coordinate system, all tags referring to that segment are displayed.

Besides user annotation, we also consider annotations provided by the author of a video resource. These annotations can include structural information (cut points) as well as semantic information (tags, headings, comments). The interface provides the possibility to use the annotation given by the author as a default starting point for user-dependent annotation. Alternatively, the video can be pre-cut at fixed time intervals that can be fine-tuned by the user. For selecting a new video sequence to be annotated, the user is able to mark starting time and end time simply by clicking dedicated buttons in the video display during playback and/or by adjusting those cut points in a separate timeline display (6). After selecting a video sequence, the user is able to add his tags in a separate tag definition window (7). For faster processing it is possible to place tags just at a specific point in time during video playback without denoting an entire segment. Then, the starting point and end point of the sequence being annotated with that tag are chosen using predefined or author-given cut points. To highlight the most important parts of a video, a rating index is displayed along a separate timeline (8).

Figure 4: User interface combining collaborative tagging and MPEG-7 annotation.

3.3 Searching Tagged MPEG-7 Metadata
CTS enable different ways of searching the system's resources.

Personalized Search. By utilizing his own set of tags, the user is able to perform a search based on his personal information needs. These tags can be descriptive or functional by nature, i.e. they either describe a resource in general – and are thus also useful for other users – or they draw the focus to specific, personally relevant aspects of a resource and can be used to extend a general search according to personal information needs. For example, the user might tag several sequences of a lecture video that are relevant for an examination with the tag exam.

General Search. By considering the (descriptive) tags of all users in combination with the original MPEG-7 annotations of the resource's author, a general keyword-based search can be performed.

Tag Browsing. Here, we refer to the retrieval of all resources that are annotated with the same tags as a specific resource under current consideration. Especially those resources become important that have been annotated with the same tags, but by other users. In that way the user is able to discover new resources that are considered to be similar to the original resource.

Social Networking. Additionally, in CTS the inherent social network of users can be considered. To participate in a CTS the user has to register, which often includes the delivery of a personal profile. Thus, a social network can be defined connecting users that are considered to be similar according to their profiles. On the other hand, users that have annotated the same resource (probably even with the same tags) can be considered to be similar. Thus, by browsing resources that have been annotated by similar users, new relevant resources can be discovered.
4. CONCLUSIONS AND OUTLOOK
We have shown how to integrate collaborative tagging information within an MPEG-7 framework to facilitate a search function on multimedia data that is able to deliver the distinct parts of interest within a multimedia document. In contrast to current CTS, our approach allows the annotation of partial documents, which is important especially for time-dependent media such as video data. A prototype of the proposed system for collaborative video scene tagging and retrieval is currently under development.

The concept of collaboratively annotating partial video documents can be extended to other types of media, e.g. to large text documents (textbooks). There, the users (document readers) should have the possibility to annotate distinct sections of the text document and to benefit from these annotations in a personal or collaborative way. The identification of distinct sections within any type of document can be realized with the help of the document object model (DOM) [5]. The DOM representation of a document is a rooted graph (document tree), where different sections (at different levels of the document's hierarchy) are represented by nodes that can be linked with user annotations. Thus, with the collaborative annotation of partial documents, a more focused and personalized search can be achieved for any type of document.
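For example, a tagset attached to a distinct section of an (X)HTML textbook could be stored by addressing the corresponding DOM node, e.g. via an XPath expression. The annotation and tagset elements in the following sketch are purely illustrative and not part of any standard referenced above:

    <annotation>
      <!-- DOM node of the annotated section, addressed by an XPath expression -->
      <target document="http://example.org/textbook.html"
              node="/html/body/div[3]/p[4]"/>
      <!-- per-user tagset with date and optional rating, analogous to the video segment case -->
      <tagset user="jdoe" date="2006-09-01" rating="0.9">exam, normalization, joins</tagset>
    </annotation>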
5. REFERENCES
[1] Ricoh MovieTool.
[2] YouTube – video sharing and tagging system, http://www.youtube.com/.
[3] D. Bargeron, A. Gupta, J. Grudin, and E. Sanocki. Annotations for streaming video on the web: System design and usage studies. Computer Networks, 31(11-16), 1999.
[4] S. F. Chang, T. Sikora, and A. Puri. Overview of the MPEG-7 Standard. IEEE Trans. Circuits and Systems for Video Technology, 11(6):688–695, 2001.
[5] Document Object Model (DOM) Level 1 Specification, http://www.w3.org/TR/REC-DOM-Level-1/.
[6] S. Golder and B. A. Huberman. Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2):198–208, 2006.
[7] A. G. Hauptmann, R. Jin, and T. D. Ng. Multi-modal information retrieval from broadcast video using OCR and speech recognition. In JCDL'02: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 160–161, 2002.
[8] C. Marlow, M. Naaman, D. Boyd, and M. Davis. Position Paper, Tagging, Taxonomy, Flickr, Article, ToRead. In Collaborative Web Tagging Workshop at WWW2006, Edinburgh, Scotland, May 2006.
[9] National Institute of Standards and Technology. NIST MPEG-7 Validation Service and MPEG-7 XML-schema specifications, http://m7itb.nist.gov/M7Validation.html.
[10] S. Repp and C. Meinel. Semantic indexing for recorded educational lecture videos. In 4th Annual IEEE Int. Conference on Pervasive Computing and Communications Workshops (PERCOMW'06), 2006.
[11] H. Sack and J. Waitelonis. Automated annotations of synchronized multimedia presentations. In Proceedings of the ESWC 2006 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation, CEUR Workshop Proceedings, June 2006.
[12] J. R. Smith and B. Lugeon. A visual annotation tool for multimedia content description. In Proc. SPIE Photonics East, Internet Multimedia Management Systems, pages 160–161, 2000.