Multimedia Annotations and the Semantic Web

         Jacco van Ossenbruggen1 , Giorgos Stamou2 and Jeff Z. Pan3
 1
   Centrum voor Wiskunde en Informatica, Kruislaan 413, NL-1098 SJ Amsterdam,
                                 The Netherlands
2
  Department of Electrical and Computer Engineering, National Technical University
                        of Athens, Zographou 15780, Greece
3
  School of Computer Science, The University of Manchester, Manchester, M13 9PL,
                                        UK


      Abstract. Multimedia and the Semantic Web: in theory, it is a perfect
      match. The Semantic Web, on the one hand, provides a stack of lan-
      guages and technologies for annotating Web resources, enabling machine
      processing of metadata describing semantics of web content. Multime-
      dia applications, on the other hand, require metadata descriptions of
      their media items to facilitate search and retrieval, intelligent process-
      ing and effective presentation of multimedia information. This need for
      multimedia metadata was recognized by the media industry long ago.
      Semantic Web technologies, however, still play a very minor role within
      multimedia applications and most approaches employ non-RDF based
      techniques. This paper describes a number of current approaches to mul-
      timedia metadata and provides an inventory of the open issues to achieve
      a practical integration of multimedia metadata into the Semantic Web.


1    Introduction

Those who deal with multimedia, from professional archivists to amateur photog-
raphers, are faced with daunting problems when it comes to storing, annotating
and retrieving media items from multimedia repositories, whether via the Web
or otherwise. Although the standardization activities of ISO (and other) com-
munities (MPEG-7, MPEG-21, Dublin Core etc) [5, 10–12, 17, 22] have provided
standards for describing content, these standards have not been widely used,
mainly for the following reasons. Firstly, it is difficult, time-consuming, and thus
very expensive to manually annotate multimedia content. Secondly, many organi-
zations feel that the complexity of many standards makes multimedia annotation
unnecessarily difficult. Thirdly, there is little incentive for organizations to pro-
vide, for example, MPEG-7 metadata because there are insufficient applications
that would benefit from its use.
    We believe that these problems could be solved by merging and aligning exist-
ing practices in multimedia industry with the current technological development
of the Semantic Web. First, such integration would give metadata providers
immediate payoff because they could directly benefit from the Semantic Web
software that is (publicly) available. Second, it would enable the deployment
                Fig. 1. Abstraction levels of multimedia annotation


of more intelligent applications to reason over multimedia metadata in a way
that is currently not possible because current multimedia metadata standards
are usually (XML) syntax-oriented, and thus lack a formal semantics. Third,
the “open world” approach of the Semantic Web would simplify the integration
of multiple vocabularies from different communities. Finally, it could provide
small, simple but extensible vocabularies. These vocabularies should be suitable
for private use (e.g. simple annotation of online photo albums à la Flickr) but
at the same time be sufficiently flexible to be extended for more complex and
professional annotation tasks.


2   Multimedia semantic annotation

The information conveyed by a multimedia document can be formalized, rep-
resented, analyzed and processed in three different levels of abstraction: the
subsymbolic, the symbolic and the logical (see Figure 1).
    The subsymbolic level of abstraction covers the raw multimedia information
represented in well known formats for video, image, audio, text, metadata, etc.
Note that these are typically binary formats, typically optimized for compression
and streaming delivery. They are not necessarily well-suited for further process-
ing that uses, for example, the internal structure or specific features of the media
stream. To address this issue, one can introduce a level of abstraction, the mid-
dle layer in Figure 1, which provides this information. This is the approach of
MPEG-7, which allows one to use the output of feature detectors, (multi-cue)
segmentation algorithms, etc. to provide a structural layer on top of the binary
media stream. Note that information on this level is typically serialized in XML.
    The problem with this XML-based, structural layer is that the semantics of
the information encoded in the XML is specified only in the specification of, for
example, the MPEG-7 standard and needs to be hard wired into the code by
the programmer of the MPEG-7 application software. It also makes it hard to
re-use this data in environments that are not based on MPEG-7, or to integrate
non-MPEG metadata in an MPEG-7 application.
    To address this, one could simply replace the middle layer by another one that
is open and has formal, machine processable semantics. This, however, would not
take advantage of existing XML-based metadata, and, more importantly, ignore
the advantages of an XML-based structural layer (more on that later). So, rather
than replacing the middle layer, a solution is to add a third layer that provides
the semantics for the middle layer.
    These semantics are mappings between the structured information sources
and a formal knowledge representation of the domain, for example in OWL. In
this layer, the implicit knowledge of the multimedia document description can
be made explicit and reasoned upon, for example to derive new knowledge not
explicitly present in the middle layer.
    Several standards have been proposed and used in the literature for the repre-
sentation of multimedia document descriptions (Dublin Core, MPEG-7, MPEG-
21 etc), mainly operating in the middle layer of Figure 1. The stack of RDF-based
languages and technology provided by the W3C community are well-suited to
the formal, semantic descriptions of the terms used in the middle layer. However,
since they often lack the structural advantages of the XML-based approach, a
combination of the above standards seems to be the most promising way for
multimedia document description in the near future [8, 9, 13, 14, 20, 23, 24].


3     Open issues

To realize such an integrated scenario, several open issues need to be addressed.

Interoperability and tool support The main problem we see is that the Semantic
Web technologies do not interoperate with existing approaches in the multime-
dia production chain. In the longer term, integration of Semantic Web tech-
nologies in the major multimedia tool is essential. In the shorter term, we need
to show how RDF-based software can take advantage of popular existing, non-
RDF metadata such as ID3 tags in MP3 music files4 , EXIF metadata added to
JPEG images by digital cameras5 , informal tagging of images (with terms from
so called ’folksonomies’) etc.
    The problem mentioned above of aligning Semantic Web-based approaches
with MPEG-7 is also a major issue, for a thorough comparison of and a list
4
    As is done by the content handlers of Kowari, http://kowari.org/
5
    As is done by JpegRDF, http://sourceforge.net/projects/jpegrdf
of the open issues in integrating the MPEG-7 and Semantic Web approaches,
see [15, 26].

Linking media data with metadata Since metadata is just data about other data,
the link between the metadata and the target media item is of crucial importance.
On the Semantic Web, the link between the two is simple: all you need is the
URI of the media item which you use as the value of the rdf:about attribute
somewhere in your metadata:
     <rdf:Description rdf:about=’http://www.example.com/mymovie.mpg’
       dc:title=’My First DVD Soccer Movie’/>
This approach, however, makes a number of assumptions that do not always
hold in the multimedia domain.
    First of all, it assumes that the thing being annotated can be addressed by
a commonly agreed upon URI scheme. While this may be a safe assumption
for HTML and XML-based resources, this is not the case in multimedia. The
example above works because the dc:title is an annotation that applies to the
entire resource. Annotations that apply only to a part of the resource are much
harder. Imagine you would like to provide annotations for the 7th scene on
that DVD, or for a specific sequence of frames, a specific region in a frame, a
specific object (the ball, a specific player), a part of the sound track (the audience
singing), etc, etc. Standardizing URIs for such targets is not trivial. For example,
when sticking with the current URL schemes, it requires standardizing a powerful
fragment identifier6 syntax for all common multimedia MIME types used.
    Second, the example above assumes that the link can be embedded within the
metadata. A disadvantage of this approach is that it is geared to 1-to-1 relations
and it becomes harder to model n-to-m relations [16]. On a more practical level, it
becomes harder to associate existing annotations with other media items, since
this requires modification of the original RDF. This might be unwanted from
a maintenance perspective or downright impossible if the person creating the
new link has no write access on the metadata. From the (pre-Web) hypertext
literature (see [25] for an overview) we know that links between two pieces of
information can be embedded at the source (as is the case in, for example,
HTML), in the target, or in an independent location (often called a link base).
All three solutions have different characteristics when it comes to flexibility and
complexity, so the key issue is to know what solution to use in what context.
    Third, the example assumes that the URI unambiguously identifies the tar-
get. However, in many multimedia resources the URI of the digital artifact is
used also for the physical object it represents. For example the URI of an image
of a painting is also used to for the painting itself. Vocabularies such as the
VRA Core 3.0 [27] make the distinction explicit by distinguish metadata records
describing the “work” from records that are about the “image”. How to link
“work” records to associated “image” records remains, however, unspecified.
6
    Informally, fragment identifiers are what comes after the ’#’ in a URI. So in http:
    //example.com/index.html#section1, section1 is the fragment identifier. Note
    that the syntax and semantics of fragment identifiers depend on the MIME type of
    the resource
Vocabularies for multimedia annotations While it is true that the Semantic Web
allows everyone to create his or her own vocabulary, sharing and reusing infor-
mation benefits from having only a few widely used vocabularies for a specific
purpose, and having these vocabularies widely available in a Semantic Web com-
patible format. Good examples of such vocabularies are, however, still hard to
find7 .
    Part of the problem is that many vocabularies for multimedia predate the
Semantic Web. Another explanation is that describing the content of audiovisual
material in general requires a large vocabulary basically covering the entire (vi-
sual) world around us. Developing such vocabularies is a long and costly process,
and organizations that have invested large sums of money in creating such vo-
cabularies are often not willing to make the results publicly available on a royalty
free basis. Well known examples (see [7] for a more extensive overview) include
Getty’s Art and Architecture Thesaurus [6], that needs to be licensed and is
not available in RDF. Another example is Mark Davis’s MediaStreams iconic
ontology [4] developed in the early nineties, also predating the Semantic Web.
In addition, many national audiovisual archives (e.g. INA in France and Beeld
en Geluid in The Netherlands) have developed in house vocabularies to describe
and index the large quantities of audiovisual material they need to archive, and
these vocabularies have also been developed long before the Semantic Web took
shape.

Uncertainty in multimedia annotations Several issues of multimedia information
systems are often subject to uncertainty and imprecision. The representation of
multimedia annotations, the automatic extraction of this annotation, the re-
trieval of multimedia documents etc are processes that involve uncertainty and
inconsistency in several levels. For example, the extraction (automatic or man-
ual) of the key entities that semantically describe the multimedia document is
always a matter of degree. Moreover, the visual characteristics of an object (i.e.
its color) possess usually imprecise information with its accuracy being a mat-
ter of the measurement process, though most of times is not really important
(usually in the retrieval process only linguistic terms like ”red” are needed). For
the above reasons, theories and methods covering the framework of uncertainty
(fuzzy logic, probabilistic reasoning, evidence theory, neural networks etc) are
very important and sometimes crucial in multimedia information systems. Al-
though several papers have been published in this area [1,21], the issue remains
open for further research.

Datatypes The output of many multimedia feature detectors is described by
complex datatypes. MPEG-7, for example, uses XML Schema to describe mul-
timedia objects, such as video, audio and images, as instances of XML schema
datatypes [2].
7
    During the first Workshop on Multimedia and the Semantic Web at ESWC 2005 on
    Crete, the participants agreed to collect publicly available multimedia ontologies on
    a central website, http://www.acemedia.org/aceMedia/reference/resource/.
    In the Semantic Web standards, such as RDF and OWL, datatypes are de-
fined in a more formal way [3]. More specifically, a datatype (such as boolean)
is characterized by a lexical space (such as {T,F,1,0}), a value space (such as
{true, f alse}) and a lexical-to-value mapping (such as {T7→ true, F7→ f alse,
17→ true, 07→ f alse}). Although RDF and OWL only allow some built-in XML
Schema simple types, OWL-Eu [19] has been designed to support user-defined
XML Schema simple types based on restriction and union. Furthermore, OWL-
E [18] (the n-ary extension of OWL-Eu) supports user-defined datatype predi-
cates.
    An obvious issue here is that MPEG-7 also requires the structuring sup-
port of XML Schema complex types, which are not compatible with the above
RDF/OWL datatype model. The main issue is, however, whether it is proper
to introduce the structuring support into datatypes, or simply use concept lan-
guages provided by OWL to represent the structure of multimedia objects. An-
other issue is that even XML Schema complex types are not enough for MPEG-7,
which extends XML Schema datatypes with array and matrix datatypes (among
others), with both fixed size and parameterized size.


4   Conclusions

In this paper we analyzed the open problems for enabling multimedia metadata
on the Semantic Web. Obviously, solving these problems will require effort from
both the multimedia and the Semantic Web communities. We feel, however, that
the Semantic Web community has a special obligation to prove that their models
and techniques can add sufficient added value to convince the multimedia content
owners to go beyond the current, XML-based, approaches. From this perspective,
this paper is “a call to arms” to the Semantic Web community to address the
following issues.
     First, we should show how our RDF-based environments can interoperate
with current, non-RDF metadata practices in the multimedia field, for example
by developing RDF tools that can handle embedded metadata in MP3 and JPEG
files, or build upon XML-based approaches such as MPEG-7.
     Second, we should collect and publish example multimedia vocabularies using
Semantic Web languages, and show that annotating multimedia data with these
vocabularies is not only practical but also provides more useful functionality
than that provided by current multimedia metadata tools.
     Third, we should be able to flexibly attach metadata to media resources.
To be able to link metadata to the appropriate part of the target media item,
fragment identifier schemes need to be standardized and widely implemented
for a wide variety of commonly used media types. We also need a standard
way of describing the link between a piece of metadata and its target media
item independently of the media item and the meta data. In addition, we need a
common way to discriminate between, and relate the metadata about, a physical
object and the metadata about a (digital) representation of that object.
    Fourth, we need to extend our languages and tools to be able to formally
express the uncertainty and imprecision inherent to many statements about mul-
timedia data.
    Last but not least, we need to extend the Semantic Web datatype formalism
to deal with the often complex media types that are required to express state-
ments about specific multimedia features. We need to harmonize the structure
information, specified by XML Schema complex types, about multimedia objects
in the symbolic level and the semantic descriptions, represented by concept and
datatype constraints, in the logical level.
    Even when all of the above issues have been solved, multimedia annotation
will remain a difficult, time consuming and expensive process. The question is
whether we can develop the required standards in a way that reduces, and not
adds, to the complexity of the task, and develop the tools and applications with
an added value that makes multimedia annotation payoff in practice.


5   Acknowledgments

We wish to thank our colleagues Joost Geurts, Frank Nack and Lynda Hardman
for their contributions to this work. Part of this work was funded by the European
Knowledge Web and Dutch MultimediaN and NWO NASH projects.


References

 1. Special issue on management of uncertainty and imprecision in multimedia infor-
    mation systems. International Journal of Uncertainty, Fuzziness and Knowledge-
    based Systems,, Volume 11, February 2003.
 2. P. V. Biron and A. Malhotra. Extensible Markup Language (XML) Schema Part 2:
    Datatypes – W3C Recommendation 02 May 2001. Technical report, World Wide
    Web Consortium, 2001. http://www.w3.org/TR/xmlschema-2/.
 3. J. J. Carroll and J. Z. Pan. XML Schema Datatypes in RDF and OWL. Technical
    report, W3C Semantic Web Best Practices and Development Group, Nov 2004.
    Editors’ Draft, http://www.w3.org/2001/sw/BestPractices/XSCH/xsch-sw/.
 4. M. Davis. Readings in Human-Computer Interaction: Toward the Year 2000, chap-
    ter Media Streams: An Iconic Visual Language for Video Representation., pages
    854–866. Morgan Kaufmann Publishers, Inc., 1995.
 5. Dublin     Core     Community.         Dublin   Core    Element     Set,   Ver-
    sion    1.1,   2003.         ISO   Standard   15836-2003    (February    2003),
    http://www.niso.org/international/SC4/n515.pdf; NISO Standard Z39.85-
    2001     (September      2001),   http://www.niso.org/standards/resources/Z39-
    85.pdf;    CEN      Workshop     Agreement   CWA      13874    (March    2000),
    http://www.cenorm.be/isss/cwa download area/cwa13874.pdf.
 6. Getty Research Institute.          Art & Architecture Thesaurus (Online).
    http://www.getty.edu/research/tools/vocabulary/aat/, 2000. Version 2.0.
 7. J. Geurts, J. van Ossenbruggen, and L. Hardman. Requirements for practical
    multimedia annotation. In Workshop on Multimedia and the Semantic Web, pages
    4–11, May 2005.
 8. J. Hunter. Adding Multimedia to the Semantic Web — Building an MPEG-7
    Ontology. In International Semantic Web Working Symposium (SWWS), Stanford
    University, California, USA, July 30 - August 1, 2001.
 9. J. Hunter, J. Drennan, and S. Little. Realizing the hydrogen economy through
    semantic web technologies. IEEE Intelligent Systems Journal, January 2004.
10. ISO/IEC.      Overview of the MPEG-7 Standard (version 6.0).             ISO/IEC
    JTC1/SC29/WG11/N4980, Pattaya, December 2001.
11. ISO/IEC. Text of ISO/IEC 15938-5/FDIS Information Technology - Multimedia
    Content Description Interface - Part 5: Multimedia Description Schemes. ISO/IEC
    JTC 1/SC 29/WG 11/N4242, Singapore, September 2001.
12. ISO/IEC. MPEG-21 Overview v.5. ISO/IEC JTC1/SC29/WG11/N5231, Shang-
    hai, October 2002.
13. A. Jaimes and J. R. Smith. Semi-automatic, data-driven construction of multime-
    dia ontologies. Proc. IEEE Intl. Conf. on Multimedia and Expo (ICME), March
    2003.
14. S. Little and J. Hunter. Rules-b-example - a novel approach to semantic in-
    dexing and querying of images. In 3rd International Semantic Web Conference
    (ISWC2004), November 2004.
15. F. Nack, J. van Ossenbruggen, and L. Hardman. That Obscure Object of Desire:
    Multimedia Metadata on the Web (Part II). IEEE Multimedia, 12(1):54–63, Jan-
    uary – March 2005. based on http://ftp.cwi.nl/CWIreports/INS//INS-E0309.pdf.
16. Natasha Noy and Alan Rector. Defining N-ary Relations on the Semantic Web:
    Use With Individuals. Work in progress. W3C Working Drafts are available at
    http://www.w3.org/TR, 21 July 2004.
17. NewsML. The NewsML Home Page. http://www.newsml.org/, 2000.
18. J. Z. Pan. Description Logics: Reasoning Support for the Semantic Web. PhD
    thesis, School of Computer Science, The University of Manchester, 2004.
19. J. Z. Pan and I. Horrocks. OWL-Eu: Adding Customised Datatypes into OWL.
    In Proc. of Second European Semantic Web Conference (ESWC 2005), 2005.
20. G. Stamou and S. Kollias (eds). Multimedia Content and the Semantic Web:
    Methods, Standards and Tools. John Wiley & Sons Ltd, 2005.
21. G. Stoilos, G. Stamou, V. Tzouvaras, J. Pan, and I. Horrocks. A fuzzy description
    logic for multimedia knowledge representation. Multimedia and the Semantic Web
    Workshop, European Semantic Web Conference,, pages pages 12–19, May 29-June
    1 2005.
22. The TV-Anytime Forum. The TV-Anytime Forum Home Page. http://www.tv-
    anytime.org/.
23. R. Troncy. Integrating Structure and Semantics into Audio-visual Documents.
    In Second International Semantic Web Conference (ISWC2003), pages 566 – 581,
    Sanibel Island, Florida, USA, October 20-23, 2003. Springer-Verlag Heidelberg.
24. C. Tsinaraki, P. Polydoros, and S. Christodoulakis. Interoperability support for
    ontology-based video retrieval applications. In Proceedings of Third International
    Conference on Image and Video Retrieval (CIVR),, pages 582–591, July 21-23
    2004.
25. J. van Ossenbruggen, L. Hardman, and L. Rutledge. Hypermedia and the Semantic
    Web: A Research Agenda. Journal of Digital Information, 3(1), August 2002.
26. J. van Ossenbruggen, F. Nack, and L. Hardman. That Obscure Object of De-
    sire: Multimedia Metadata on the Web (Part I). IEEE Multimedia, 11(4):38–
    48, October – December 2004. based on http://ftp.cwi.nl/CWIreports/INS//INS-
    E0308.pdf.
27. Visual Resources Association. Visual Resources Association Website.