Connecting Foundational Ontologies with MPEG-7 Ontologies for Multimodal QA

Massimo Romanelli, Daniel Sonntag, and Norbert Reithinger

Abstract: In the SmartWeb project [1] we aim at developing a context-aware, mobile, and multimodal interface to the Semantic Web. In order to reach this goal we provide an integrated ontological framework offering coverage for deep semantic content, including an ontological representation of multimedia based on the MPEG-7 standard¹. A discourse ontology covers concepts for multimodal interaction by means of an extension of the W3C standard EMMA². For realizing multimodal/multimedia dialog applications, we link the deep semantic level with the media-specific semantic level to operationalize multimedia information in the system. Through the link between the multimedia representation and the semantics of specific domains we approach the Semantic Gap.

Index Terms: Multimedia systems, Knowledge representation, Multimodal ontologies, ISO standards.

I. INTRODUCTION

Working with multimodal, multimedia dialog applications with question answering (QA) functionality assumes the presence of a knowledge model that ensures an appropriate representation of the different levels of description. Ontologies provide the instruments for realizing a well-modeled knowledge base with specific concepts for different domains. For related work, see e.g. [2]-[5].

Within the scope of the SmartWeb project³ we realized a multi-domain ontology in which a media ontology based on MPEG-7 supports meta-data descriptions for audio-visual multimedia content, and a discourse ontology based on the W3C standard EMMA covers multimodal annotation. In our approach we assign conceptual ontological labels according to the ontological framework (figure 1) either to complete multimedia documents or to entities identified therein. We employ an abstract foundational ontology as a means to facilitate domain ontology integration (combined integrity, modeling consistency, and interoperability between the domain ontologies). The ontological infrastructure of SmartWeb, the SWIntO (SmartWeb Integrated Ontology), is based on an upper model ontology realized by merging well-chosen concepts from two established foundational ontologies, DOLCE [7] and SUMO [8], into the SmartWeb foundational ontology SmartSUMO [9]. Domain-specific knowledge such as sport events, navigation, or webcams is defined in dedicated ontologies modeled as sub-ontologies of SmartSUMO. Semantic integration takes place for heterogeneous information sources: extraction results from semi-structured data such as tabular structures, which are stored in an ontological knowledge base [10], and hand-annotated multimedia instances such as images of football teams. In addition, Semantic Web Services deliver MPEG-7 annotated city maps with points of interest.

Fig. 1. SmartWeb's Ontological Framework for Multimodal QA

This research was funded by the German Federal Ministry for Education and Research under grant number 01IMD01A.
M. Romanelli, D. Sonntag, and N. Reithinger are with DFKI GmbH – German Research Center for Artificial Intelligence, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany ({romanell,sonntag,bert}@dfki.de).
¹ http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
² http://www.w3.org/TR/emma/
³ SmartWeb aims to realize a mobile and multimodal interface to Semantic Web Services and ontological knowledge bases [6]. The project moves through three scenarios: handheld, car, and motorbike. In the handheld scenario the user is able to pose multimodal closed- and open-domain questions using speech and gesture. The system reacts with a concise answer and the possibility to browse pictures, videos, or additional text information found on the Web or in Semantic Web sources (http://www.smartweb-project.de/).

II. THE DISCONTO AND SMARTMEDIA ONTOLOGIES

The SWIntO supplements QA-specific knowledge in a discourse ontology (DiscOnto) and represents multimodal information in a media ontology (SmartMedia). The DiscOnto provides concepts for dialogical interaction with the user and with the Semantic Web sub-system. It models multimodal dialog management using SWEMMA, the SmartWeb extension of EMMA, dialog acts, lexical rules for syntactic-semantic mapping, HCI concepts (a pattern language for interaction design), and semantic question/answer types. Concepts for QA functionality are realized with the discourse:Query concept specifying emma:interpretation. It models the user query to the system in the form of a partially filled ontology instance. The discourse:Result concept references the information the user is asking for [11].
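To make the notion of a partially filled ontology instance more concrete, the following minimal sketch builds such a query as RDF triples with rdflib. Only discourse:Query, discourse:Result, and emma:interpretation are taken from the description above; the namespace URIs and the sport-event concept and slot names (FootballWorldCup, year, winner) are illustrative assumptions, not the actual SWIntO vocabulary.

```python
# Illustrative sketch only (not the actual SmartWeb code): the namespace URIs
# and the sport-event slot names are assumptions; discourse:Query,
# discourse:Result, and emma:interpretation are the concepts named in the text.
from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF

DISC  = Namespace("http://example.org/disconto#")    # hypothetical namespace URI
EMMA  = Namespace("http://example.org/swemma#")      # hypothetical namespace URI
SPORT = Namespace("http://example.org/sportevent#")  # hypothetical namespace URI

g = Graph()
for prefix, ns in [("discourse", DISC), ("emma", EMMA), ("sportevent", SPORT)]:
    g.bind(prefix, ns)

# "Who won the FIFA World Cup in 1990?" as a partially filled ontology instance.
query, event, focus = BNode(), BNode(), BNode()
g.add((query, RDF.type, DISC.Query))
g.add((query, EMMA.interpretation, event))          # interpretation treated as a property here
g.add((event, RDF.type, SPORT.FootballWorldCup))    # hypothetical domain concept
g.add((event, SPORT.year, Literal(1990)))           # slot filled from the question
g.add((event, SPORT.winner, focus))                 # open slot = the question focus
g.add((focus, RDF.type, DISC.Result))               # the information the user asks for

print(g.serialize(format="turtle"))
```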
In order to search and browse multimedia content efficiently, SWIntO specifies a multimedia sub-ontology called SmartMedia. SmartMedia is an ontology for semantic annotations based on MPEG-7. It offers an extensive set of audio-visual descriptions for the semantics of multimedia [12]. Basically, the SmartMedia ontology takes over the MPEG-7 multimedia content description and multimedia content management parts (see [13] for details on the description schemes in MPEG-7) and enriches them to account for the integration with domain-specific ontologies. A relevant contribution of MPEG-7 to SmartMedia is the representation of multimedia decomposition in space, time, and frequency, as in the case of the general mpeg7:SegmentDecomposition concept. In addition we use file format and coding parameters (mpeg7:MediaFormat, mpeg7:MediaProfile, etc.), as illustrated in the sketch below.
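The sketch is a minimal illustration of how such a decomposition might look as instance data: a still image is decomposed into two spatial segments and annotated with format information. The class names mirror the mpeg7: concepts mentioned above, while the namespace URI and the lower-case property names are assumptions rather than the actual SmartMedia vocabulary.

```python
# Sketch of an MPEG-7-style spatial decomposition as RDF instance data.
# Namespace URI and the lower-case property names (spatialDecomposition,
# segment, mediaFormat, fileFormat) are assumptions; the classes mirror the
# mpeg7: concepts mentioned in the text.
from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF

MPEG7 = Namespace("http://example.org/smartmedia/mpeg7#")  # hypothetical namespace URI

g = Graph()
g.bind("mpeg7", MPEG7)

image, decomp, seg_a, seg_b, fmt = (BNode() for _ in range(5))

g.add((image, RDF.type, MPEG7.StillRegion))          # the image as a whole
g.add((image, MPEG7.spatialDecomposition, decomp))   # decomposition in space
g.add((decomp, RDF.type, MPEG7.SegmentDecomposition))
g.add((decomp, MPEG7.segment, seg_a))                # two spatial segments
g.add((decomp, MPEG7.segment, seg_b))
g.add((seg_a, RDF.type, MPEG7.StillRegion))
g.add((seg_b, RDF.type, MPEG7.StillRegion))

g.add((image, MPEG7.mediaFormat, fmt))               # file format / coding parameters
g.add((fmt, RDF.type, MPEG7.MediaFormat))
g.add((fmt, MPEG7.fileFormat, Literal("JPEG")))

print(g.serialize(format="turtle"))
```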
Fig. 2. The SWIntO - DiscOnto - SmartMedia Connection

III. CLOSING THE SEMANTIC GAP

In order to close the Semantic Gap arising from the different levels of media representation, namely the surface level referring to the properties of the realized media, as in SmartMedia, and the deep semantic representation of these objects, the smartmedia:aboutDomainInstance property with range smartdolce:entity has been added to the top level class smartmedia:ContentOrSegment (see fig. 2). In this way the link to the upper model ontology is inherited by all segments of a media instance decomposition, so that we can guarantee a deep semantic representation of the SmartMedia instances referencing the specific media object or the segments making up its decomposition. Through the discourse:hasMedia property with range smartmedia:ContentOrSegment, located in the smartdolce:entity top level class and inherited by each concept in the ontology, we realize a pointer back to the SmartMedia ontology.

This type of representation is useful for pointing-gesture interpretation and co-reference resolution. A map obtained from the Web Services to be displayed on the screen shows selectable objects (e.g. restaurants, hotels); the map is represented in terms of an mpeg7:StillRegion instance, decomposed into different mpeg7:StillRegion instances for each object segment of the image. The MPEG-7 instances are linked to a domain-specific instance, i.e., the deep semantic description of the picture (in this case the smartsumo:Map) or of a segment of the picture (e.g., navigation:ChineseRestaurant), as in the sketch below.
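The following minimal sketch, under the same assumptions about namespace URIs as before, shows how this two-way linkage could be expressed for the map example: the segment showing a restaurant is linked to a navigation:ChineseRestaurant instance via smartmedia:aboutDomainInstance, and the domain instance points back to its media realization via discourse:hasMedia.

```python
# Sketch of the two-way link between media segments and domain instances.
# Namespace URIs are hypothetical; the class and property names
# (mpeg7:StillRegion, smartmedia:aboutDomainInstance, discourse:hasMedia,
# smartsumo:Map, navigation:ChineseRestaurant) are those used in the text.
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF

MPEG7  = Namespace("http://example.org/smartmedia/mpeg7#")
SMEDIA = Namespace("http://example.org/smartmedia#")
DISC   = Namespace("http://example.org/disconto#")
SUMO   = Namespace("http://example.org/smartsumo#")
NAV    = Namespace("http://example.org/navigation#")

g = Graph()
for prefix, ns in [("mpeg7", MPEG7), ("smartmedia", SMEDIA), ("discourse", DISC),
                   ("smartsumo", SUMO), ("navigation", NAV)]:
    g.bind(prefix, ns)

map_region, restaurant_region = BNode(), BNode()   # whole map image and one segment
city_map, restaurant = BNode(), BNode()            # the corresponding domain instances

g.add((map_region, RDF.type, MPEG7.StillRegion))
g.add((restaurant_region, RDF.type, MPEG7.StillRegion))
g.add((city_map, RDF.type, SUMO.Map))
g.add((restaurant, RDF.type, NAV.ChineseRestaurant))

# Surface to deep semantics: each segment points to the domain instance it depicts.
g.add((map_region, SMEDIA.aboutDomainInstance, city_map))
g.add((restaurant_region, SMEDIA.aboutDomainInstance, restaurant))

# Deep to surface semantics: the domain instance points back to its media realization.
g.add((restaurant, DISC.hasMedia, restaurant_region))

# A pointing gesture on restaurant_region can now be resolved to the restaurant
# instance, e.g. to ground "What's the phone number here?".
referred = g.value(subject=restaurant_region, predicate=SMEDIA.aboutDomainInstance)
assert referred == restaurant
```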
In this way the user can refer to the restaurant by touching it on the displayed map. A multimodal fusion component can then directly process the referenced navigation:ChineseRestaurant instance while performing linguistic co-reference resolution, e.g. for the follow-up question "What's the phone number here?".

IV. CONCLUSION

We presented the connection of our foundational ontology with an MPEG-7 ontology for multimodal QA in the context of the SmartWeb project. The foundational ontology is based on two upper model ontologies and offers coverage for deep semantic ontologies in different domains. To capture multimedia low-level semantics we adopted an MPEG-7 based ontology that we connected to our domain-specific concepts by means of relations in the top level classes of SWIntO and SmartMedia. This work enables the system to use multimedia in a multimodal context, as in the case of mixed gesture and speech interpretation, where every object visible on the screen must have a comprehensive ontological representation in order to be identifiable at the discourse level.

ACKNOWLEDGMENT

We would like to thank our partners in SmartWeb. The responsibility for this article lies with the authors.

REFERENCES

[1] Wahlster, W.: SmartWeb: Mobile Applications of the Semantic Web. In: P. Dadam and M. Reichert, editors, GI Jahrestagung 2004, Springer, 2004.
[2] Reyle, U., Saric, J.: Ontology Driven Information Extraction. In Proc. of the 19th Twente Workshop on Language Technology, 2001.
[3] Lopez, V., Motta, E.: Ontology-driven Question Answering in AquaLog. In Proc. of the 9th International Conference on Applications of Natural Language to Information Systems (NLDB), 2004.
[4] Niekrasz, J., Purver, M.: A Multimodal Discourse Ontology for Meeting Understanding. In Bourlard, H. and Bengio, S., editors, Proc. of MLMI'05, LNCS.
[5] Nirenburg, S., Raskin, V.: Ontological Semantics. MIT Press, 2004.
[6] Reithinger, N., Bergweiler, S., Engel, R., Herzog, G., Pfleger, N., Romanelli, M., Sonntag, D.: A Look Under the Hood - Design and Development of the First SmartWeb System Demonstrator. In Proc. of ICMI 2005, Trento, 2005.
[7] Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L.: Sweetening Ontologies with DOLCE. In Proc. of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02), volume 2473 of Lecture Notes in Computer Science, Sigüenza, Spain, 2002.
[8] Niles, I., Pease, A.: Towards a Standard Upper Ontology. In Proc. of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), C. Welty and B. Smith, editors, Ogunquit, Maine, 2001.
[9] Cimiano, P., Eberhart, A., Hitzler, P., Oberle, D., Staab, S., Studer, R.: The SmartWeb Foundational Ontology. Technical report, Institute for Applied Informatics and Formal Description Methods (AIFB), University of Karlsruhe, SmartWeb Project, Karlsruhe, Germany, 2004.
[10] Buitelaar, P., Cimiano, P., Racioppa, S., Siegel, M.: Ontology-based Information Extraction with SOBA. In Proc. of the 5th Conference on Language Resources and Evaluation (LREC 2006).
[11] Sonntag, D., Romanelli, M.: A Multimodal Result Ontology for Integrated Semantic Web Dialogue Applications. In Proc. of the 5th Conference on Language Resources and Evaluation (LREC 2006).
[12] Benitez, A., Rising, H., Jorgensen, C., Leonardi, R., Bugatti, A., Hasida, K., Mehrotra, R., Tekalp, A., Ekin, A., Walker, T.: Semantics of Multimedia in MPEG-7. In Proc. of the IEEE International Conference on Image Processing (ICIP), 2002.
[13] Hunter, J.: Adding Multimedia to the Semantic Web - Building an MPEG-7 Ontology. In Proc. of the International Semantic Web Working Symposium (SWWS), 2001.