Dealing Efficiently with Ontology-Enhanced Linked Data for Multimedia

Oliver Gries¹, Ralf Möller², Anahita Nafissi³, Maurice Rosenfeld⁴, Kamil Sokolski⁵, and Sebastian Wandelt⁶

¹ Hamburg University of Technology, Hamburg, Germany
² University of Lübeck, Lübeck, Germany, moeller@ifis.uni-luebeck.de
³ Amirkabir University of Technology, Tehran, Iran
⁴ Lufthansa Industry Solutions, Hamburg, Germany
⁵ Nordex Energy GmbH, Hamburg, Germany
⁶ Humboldt University of Berlin, Berlin, Germany

Abstract. In order to provide automatic ontology-based multimedia annotation for producing linked data, scalable high-level media interpretation processes on (video) streams are required. In this paper we briefly describe an abductive media interpretation agent and, based on a Multimedia Content Ontology, we introduce partitioning techniques for huge sets of time-related annotation assertions such that interpretation as well as retrieval processes refer to manageable sets of metadata.

Keywords: linked data, multimedia interpretation, stream processing

1 Introduction

A large amount of multimedia content is available on the Web, and appropriate multimedia documents can hardly be found systematically using keyword-based search. Therefore, the field of Linked Data has emerged [1]. Linked data are also called rich semantic media in the literature [2]. These research fields investigate the derivation and management of symbolic descriptions for multimedia content. Symbolic descriptions are anchored at various parts of a multimedia object, and they can be used to link various (parts of) multimedia objects. Hence the term linked data has emerged. Symbolic content descriptions approximate human-level interpretations of media content and can therefore be used for systematic document retrieval based on high-level topic-based queries. Retrieval based on linked data can be enhanced if retrieval processes are based on ontologies, namely a domain ontology and a general ontology for describing document structure and content (see, e.g., [3]).

To some extent linked data can be automatically derived using existing data-driven media analysis systems. However, there still exists a gap between, for instance, low-level image/video analysis and high-level image/video interpretation, not to mention human-level understanding. Thus, analysis-level results obtained from state-of-the-art tools have to be augmented with more abstract symbolic descriptions. This is accomplished in an automatic process which we call media interpretation. Recent research in the area of ontology-based media interpretation has shown enormous advances, and we assume that media interpretation processes can safely generate linked data, to be used in ontology-oriented media retrieval processes. For brevity, linked-data generation is also called (automatic) annotation in this paper, and we focus on videos as multimedia objects in order to be as concrete as possible.

The sheer amount of assertions for appropriately describing the content of large media objects makes media interpretation as well as annotation-based retrieval increasingly difficult. In this paper we advance the state of the art in several areas by:
1. Proposing a description language for video annotations that supports scalable high-level reasoning about video content (interpretation as well as retrieval).
2. Explaining ontology-based reasoning techniques for an annotation agent, which is used to compute high-level interpretations of videos.
3. Showing how to support decomposition-based scalability for reasoning in the context of long streams of video content.

There already exist various proposals for annotation languages (see, e.g., early approaches based on MPEG-7 [4, 5] or newer ones dedicated to knowledge management [6]). However, none of these languages has been developed with scalable stream-based reasoning w.r.t. an ontology (rather than mere data retrieval) in mind. Reasoning is used for media interpretation, which is a service employed for computer-aided semantic annotation of multimedia (see the EU project CASAM, http://www.casam-project.eu/). The CASAM Multimedia Content Ontology introduced in this paper (called MCO for short) is an extension and modification of a previous multimedia ontology described in [7].

Scalability is a significant issue in at least two respects. If we talk about the interpretation of a video document, then, on the one hand, there is the time dimension to be considered. On the other hand, another dimension is the interpretation depth. As we have argued above, we can assume that interpretation is based on explicit (symbolic) low-level information for each perceptive unit. A perceptive unit, such as a video shot, is called "segment" in the MCO. The aim of interpretation is to compute high-level information for a segment given the knowledge acquired so far. Thus, in our annotation language we need to be able to represent time information as well as to support the ability to draw conclusions on higher levels of interpretation. Hence, the notion of a segment has to be appropriately defined using an ontology, and assertions representing interpretation results at various levels of detail have to be attached to segments using an appropriate annotation language.

It can easily be seen that this kind of two-dimensional streaming scenario, with multiple streams for multiple modalities, yields a significant growth of assertions over time. Although our low-level annotation language is based on a description logic for which efficient typical-case reasoning systems are known, we need to exploit new partitioning techniques to break down the data descriptions used for interpretation into smaller pieces to be handled over time. This is even more important if low-level results become available for time frames in an asynchronous way (maybe with substantial time delays according to the intricacies of certain tools for different modalities). In order to improve scalability, we identify and use locality in the video stream. Given an annotation of a video stream, we use split operations to compute so-called "islands," which are sufficient for reasoning with respect to the current state of annotation. Running interpretation tasks on separate islands (instead of the whole set of assertions) improves performance significantly.

The remaining part of the paper is structured as follows. Section 2 presents our motivation for choosing the description logic ALH−fR+(D) as an ontology language for the definition of the annotation language MCO. In Section 3, we introduce the Multimedia Content Ontology in detail and explain how ALH−fR+(D) is used to represent the content of multimedia documents. In Section 4, we discuss scalability issues and present solutions for partitioning large sets of assertions into manageable "islands" or "chunks" such that interpretation processes run efficiently. Several examples demonstrate the effectiveness of the proposed techniques. We conclude in Section 5.
Due to space constraints, retrieval is not covered in this paper (we refer to [8] for details).

2 Representation of Multimedia Content

In order to describe multimedia documents in terms of annotations (stored as metadata), the Moving Picture Experts Group (MPEG) has specified the ISO standard Multimedia Content Description Interface, also denoted as MPEG-7 [4]. In this framework, XML descriptions of multimedia data are associated with content, with the objective to allow for efficient search and retrieval of multimedia documents. The MPEG-7 schema language provides for restrictions on valid media descriptions, for which XML query languages are defined. In the context of linked data, however, an inherent problem of XML query languages comes into play, namely the lack of facilities for querying the name of the relation of which a certain annotation tuple is an element; in this respect, RDF-based representations are beneficial. Proposals for using RDF in the context of MPEG-7-like representations have been discussed in the literature as well (e.g., [5]), and retrieval languages such as SPARQL can be used to find media objects based on RDF content descriptions. RDF query answering with respect to ontologies means that data (tuples, or triples to be more precise) that can be inferred w.r.t. the ontology are implicitly added to what is given explicitly in the RDF annotation. Efficient query engines might not materialize implicit tuples, though. In any case, given the implicit tuples (the deductive closure), more media objects are likely to be found if ontologies come into play for query answering. Note that w.r.t. an ontology, a set of RDF triples can also become inconsistent. Inconsistencies can be detected with reasoning engines, but this is not relevant for this paper (although inconsistencies also restrict the set of possible annotations in the same way as an XML schema restricts the set of valid annotations).

Ontology languages such as RDFS or OWL 2 have a formal semantics, a feature that is beneficial for formally defining decision problems and checking the correctness of corresponding decision procedures (aka inference algorithms). However, specific ontologies (aka knowledge bases) specified using an expressive ontology language reveal more of the "semantics" of media documents than mere RDF triples do. Ontologies achieve this by adding (lots of) implicit content description tuples to the annotations given explicitly as part of the documents' annotations. Thus, we have "semantics" in the sense of the formal semantics of a representation language and "semantics" in the sense of implicit tuples added to the explicitly given ones. Many papers in the Semantic Web literature amalgamate these two kinds of "semantics", suggesting that semantics in the sense of content descriptions comes for free when using formal representation languages. For the latter notion of "semantics," we prefer the name content description in order not to confuse the reader. Content descriptions do not come for free but must be derived using media interpretation processes, which require dedicated knowledge bases for interpretation knowledge [9].
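To make the effect of query answering w.r.t. an ontology concrete, the following minimal sketch (ours, not part of CASAM) uses the Python library rdflib and an invented example vocabulary: a SPARQL query over the explicitly given triples misses a media object, while the same query finds it once the subclass-entailed triple has been materialized.

# Minimal sketch (ours): explicit RDF annotations vs. a naive deductive
# closure step w.r.t. a small ontology. Vocabulary and data are invented.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/mco#")
g = Graph()

# Ontology: VideoContent is a subclass of MultimediaContent.
g.add((EX.VideoContent, RDFS.subClassOf, EX.MultimediaContent))
# Explicit annotation: vc1 is typed only as VideoContent.
g.add((EX.vc1, RDF.type, EX.VideoContent))
g.add((EX.vc1, EX.hasDocID, Literal("ID57")))

QUERY = """
PREFIX ex: <http://example.org/mco#>
SELECT ?x WHERE { ?x a ex:MultimediaContent . }
"""

print(len(list(g.query(QUERY))))   # 0: the implicit type is not materialized

# Naive, single-step materialization of the RDFS subclass rule.
inferred = []
for s, _, c in g.triples((None, RDF.type, None)):
    for _, _, super_c in g.triples((c, RDFS.subClassOf, None)):
        inferred.append((s, RDF.type, super_c))
for triple in inferred:
    g.add(triple)

print(len(list(g.query(QUERY))))   # 1: vc1 is now found as MultimediaContent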
In MPEG-7, a multimedia document is related to its modality-specific content, composed of, e.g., video, audio, or text, and each of these parts consists of a set of segments specifying "regions" of modality-specific data. First of all, we believe it is necessary to be able to specify more general resp. more specific concepts and roles (e.g., that video content is also of type multimedia content) in order to build up a taxonomy. Further, it is important to be able to specify concept disjointness. In order to represent relations (e.g., from content to segments), roles can be specified whose domain and range usually are constrained to a specific concept (e.g., the role hasMediaDecomposition is constrained to only relate instances of multimedia content to multimedia segments) and which are possibly functional or transitive. In addition, we propose that for modality-specific concepts the range of roles is further restricted to modality-specific concepts (e.g., audio content is only allowed to be related to audio segments). Finally, for representing multimedia content it is usually necessary to be able to specify concrete domains such as integers or strings.

We argue that this expressivity is sufficient for the representation and interpretation of multimedia content for a large range of problems. For example, we propose to abandon existential restrictions on the right-hand side of inclusion axioms, since we believe that it is not required to constrain multimedia content descriptions to consist of "anonymous" individuals of a specific type (which cannot be retrieved explicitly [10]). The respective DL is denoted ALH−fR+(D) (restricted attributive concept language with role hierarchies, functional roles, transitive roles, and concrete domains). We made several experiments with the DL reasoner RacerPro [11] strongly indicating that reasoning with ALH−fR+(D) is efficient.

We now briefly introduce the description logic (DL) nomenclature. A DL signature is a tuple S = (CN, RN, AN, IN), where CN = {A1, ..., An} is the set of concept names (we also use A for concept names in the sequel) and RN = {R1, ..., Rm} is the set of role names. Further, AN is a set of concrete domain attributes (i.e., roles whose range is a concrete domain). The signature also contains a component IN denoting a set of individuals. A DL knowledge base OS = (T, A), defined with respect to a signature S, is comprised of a terminological component T (called Tbox) and an assertional component A (called Abox). In the following we just write O if the signature is clear from the context.

An ALH−fR+(D) Tbox is a set of axioms of the form A1 ⊑ A2 and R1 ⊑ R2 (atomic subsumption), A1 ⊑ ¬A2 (disjointness), ∃R.⊤ ⊑ A and ⊤ ⊑ ∀R.A (domain and range restrictions on roles), ⊤ ⊑ (≤ 1 R) (functional roles), Trans(R) (transitive roles), and A1 ⊑ ∀R.A2 (local range restrictions on roles). An Abox A is a set of concept assertions A(a) and role assertions R(a, b), where A is a concept name, R is a role name, and a, b represent individuals. Aboxes can also contain equality assertions (a = b) and inequality assertions (a ≠ b) as well as attribute assertions of the form Attr(a, val), where Attr is an attribute and val is either a string or an integer (with the obvious denotation). For a detailed introduction to the incorporation of concrete domains into DLs as well as to the semantics of concepts and roles and the satisfiability conditions for axioms and assertions, we refer to [12] and [13], respectively. Standard DL decision problems are also formally defined in [13] (e.g., computing the concept and role hierarchies, as well as concept-based and conjunctive instance retrieval).
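As an illustration of this syntax, consider the following small knowledge base, which combines several of the axiom and assertion types listed above; the concept and role names anticipate the MCO introduced in Section 3, and the particular selection of axioms is ours and only meant as an example:

T = { VideoContent ⊑ MultimediaContent,  AudioContent ⊑ MultimediaContent,
      VideoContent ⊑ ¬AudioContent,
      ⊤ ⊑ ∀hasMediaDecomposition.MultimediaSegment,
      AudioContent ⊑ ∀hasMediaDecomposition.AudioSegment,
      Trans(nextTextSegment) }

A = { AudioContent(ac1),  AudioSegment(as1),  hasMediaDecomposition(ac1, as1),
      AudioLocator(al1),  hasSegmentLocator(as1, al1),
      hasStart(al1, "00:43"),  hasEnd(al1, "00:52") }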
3 The Multimedia Content Ontology

In this section, the CASAM Multimedia Content Ontology is presented to an extent that the solution to scalability problems can be understood. The full MCO can be found at http://www.sts.tu-harburg.de/casam/mco.owl. In contrast to approaches transforming the complete MPEG-7 standard to RDFS [5] or OWL [14], our approach is to use only those parts of MPEG-7 that describe a general structure for multimedia documents. The main objective is to effectively exploit quantitative and qualitative time information in order to relate co-occurring observations. Co-occurrences are detected either within the same modality or between different modalities with respect to the video shots. In the following, we focus on axioms relating concept and role names required for these capabilities.

3.1 Concept Hierarchy and Role Hierarchy

Fig. 1 shows the concept hierarchy (on the left) and the role hierarchy (on the right) of the multimedia ontology.

[Fig. 1. Concept hierarchy (lhs) and role hierarchy (rhs) structure of the CASAM Multimedia Content Ontology]

A complete multimedia document is represented by the concept MultimediaDocument. Only one instance of type MultimediaDocument should be specified for a document to be annotated. Individuals which are instances of (subconcepts of) MultimediaContent represent the different modalities of the video. The concept VideoContent represents the video modality and holds all video segments. In the same way, AudioContent holds all segments from the audio modality. TextContent represents text paragraphs associated with certain segments or with the whole video. Auxiliary text documents that are related to the whole annotated video are represented by the subconcept AuxiliaryContent. During the annotation process, a user can make free-text annotations, which describe the whole multimedia document or a single segment (shot) of the video. These free-text annotations are represented by GlobalUserAnnotationContent resp. LocalUserAnnotationContent. As speech recognized in the video is transformed into text, the concept SpeechRecognitionContent is also subsumed by TextContent.

To represent parts of the content, MultimediaContent instances can be decomposed into MultimediaSegment instances. TextSegment refers to words in the text modality. The concept SegmentLocator is used to specify the start and end of segments. The concrete values of start and end represent temporal position information for audio and video, or denote character positions for text. BoundingBox is used to determine the position of a recognized object in a video frame. All concepts within the same hierarchy level are disjoint.

The role hasLogicalDecomposition decomposes the whole media document into its different parts by relating instances of type MultimediaDocument with modality-specific content description individuals (that is, instances of the concept MultimediaContent). An individual of type MultimediaContent is associated with its segments by the role hasMediaDecomposition. To relate an individual of type MultimediaSegment with its locators, the role hasSegmentLocator is used. The roles nextTextContent and nextTextSegment are used to specify the order of the text paragraphs resp. words. Both are transitive roles. Subroles of correlatesWith can be used to represent associations between content descriptors. A TextContent instance can be related to a VideoSegment using the role belongsTo. We also use a small subset of the Allen relations [15] to relate video segments, or, more precisely, the locators associated with video segments.
Note that we do not require reasoning on Allen relations since the corresponding relations are generated based on quantitative data. While o (overlaps) describes an intersection between audio and video locators, m (meets) describes the alignment of two video or two audio segments. Please note that we compute (qualitative) relations such as o using (quantitative) information about locator objects. Quantitative information is given in terms of restrictions on values for the attributes hasStart and hasEnd (see Section 3.3).

The role depicts is used to establish a mapping from individuals of the Multimedia Content Ontology to observations from the domain ontology that were extracted by analysis modules. In a similar way as depicts, hasInterpretation provides a mapping to individuals that were generated as part of interpretations of observations. To represent the aggregating characteristic of high-level interpretations, the role associatedWith is used to relate high-level interpretations with other interpretations or directly with observations.

3.2 Range Restrictions

Range restrictions on roles constrain the corresponding role fillers to be of a specific type. For example,

⊤ ⊑ ∀hasMediaDecomposition.MultimediaSegment

defines the range restriction on the role hasMediaDecomposition such that the role filler is constrained to be of type MultimediaSegment. Local range restrictions constrain the range of roles further when the role is applied to a specific concept. The local range restriction

AudioContent ⊑ ∀hasMediaDecomposition.AudioSegment

specifies that the range of the role hasMediaDecomposition associated with the concept AudioContent is further restricted to AudioSegment.

3.3 Attribute Values

The attributes hasStart and hasEnd are used to specify time information of video or audio segments. For example,

AudioSegment(as1), AudioLocator(al1), hasSegmentLocator(as1, al1),
hasStart(al1, "00:43"), hasEnd(al1, "00:52")

defines the start and end times of an AudioSegment, here as1, by specifying concrete values for its corresponding AudioLocator al1. Integer values are used to specify character positions to identify words in larger text strings. Also regarding the text modality, the attribute hasConcreteValue is used to associate strings with instances of specific types such as CityName.

Given quantitative information about start and end times, qualitative relations between locator instances are computed by the media interpretation agent. Of the 13 possible qualitative relations defined by Allen [15], we explicitly represent o (overlaps), d (during), and m (meets) between segment locators. As shown in the next section, the main motivation for switching from a quantitative temporal representation to a qualitative one is to achieve scalability. Given the MI Agent (see Section 4.1 for more details), qualitative relations allow the agent to partition the interpretation Abox(es), which always grow(s) over time.
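As an illustration of this step, the following is a minimal Python sketch (ours, not the CASAM implementation; the function names and the seconds-based timestamp handling are our own assumptions) of how the qualitative relations o, d, and m can be derived from the quantitative hasStart/hasEnd values of two locators:

# Illustrative sketch (not CASAM code): deriving the qualitative Allen
# relations o (overlaps), d (during), and m (meets) used in the MCO
# from quantitative hasStart/hasEnd values of two locators.

def to_seconds(ts: str) -> int:
    """Convert a timestamp such as "00:43" or "00:00:43" to seconds."""
    seconds = 0
    for part in ts.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def qualitative_relations(start1, end1, start2, end2):
    """Return the subset of {o, d, m} holding between interval 1 and interval 2."""
    s1, e1 = to_seconds(start1), to_seconds(end1)
    s2, e2 = to_seconds(start2), to_seconds(end2)
    relations = []
    if e1 == s2:                      # interval 1 meets interval 2
        relations.append("m")
    if s1 < s2 < e1 < e2:             # interval 1 overlaps interval 2
        relations.append("o")
    if s2 < s1 and e1 < e2:           # interval 1 lies during interval 2
        relations.append("d")
    return relations

# Example (cf. Fig. 2 below): the audio locator al1 lies during the video locator vl1.
print(qualitative_relations("00:00:43", "00:00:49", "00:00:40", "00:00:50"))  # ['d']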
4 Scalable Video Interpretation

As we have seen, for improving shot-based video annotation, interpretations are computed for co-occurrences of locator individuals according to temporal information. In the course of the video interpretations, Aboxes grow significantly. If, e.g., a particular video segment is focused on because a new assertion referring to this video segment comes in, only very few other assertions are relevant. Large parts of the Aboxes (e.g., for temporally far away parts of the video) need not be processed. In this section we formalize the subdivision of large Aboxes into meaningful parts (partitions) such that reasoning problems are handled in the same way as with the large Abox. Reasoning on the small partitions, also called island reasoning, is known to improve reasoning performance significantly [16]. We start with the introduction of some important aspects of the media interpretation agent.

4.1 The Multimedia Interpretation Agent

In [9], an agent for Multimedia Interpretation (MI Agent) was introduced. It uses a probabilistic interpretation engine which, among other techniques, is based upon abduction. The idea is to generate explanations for observations in the form of hypothesized Abox assertions. Given the added assertions and a set of rules as part of the agent's knowledge base, the observations are then entailed. The agent computes assertions that "support" the observations.

The MI Agent receives percepts in the form of assertions that represent the ongoing video analysis and annotation process. The assertions are received by the MI Agent in a streaming way in small bunches, which we formalize as sets Γ here. Each Γ is added to the Abox that the agent maintains. Subsequently, a set of forward-chaining rules is applied. The general form of these rules is

Q1(Y1), ..., Qn(Yn) → P(X)

where Q1, ..., Qn, P denote concept or role names and underlined letters denote (possible) tuples of all-quantified variables, with the condition that each variable appearing in P(X) also appears in at least one Qi(Yi). In order to be able to apply a rule, appropriate individuals have to be substituted for the variables. Conclusions P(i) are then added to the Abox. For the conclusions P(i) the MI Agent seeks further explanations using an abduction process. The main idea is to backward-chain a set of rules of the form introduced above. Due to space restrictions, this process cannot be explained in detail, and we refer to [9]. In any case, if the Aboxes get larger and larger, performance will degrade if no specific techniques are employed.

Example 1. A car is shown in a video shot, represented by an assertion Car(c1), and there is the sound of a door slam, represented by an assertion DoorSlam(ds1). The car and the door slam are associated with video and audio segments, respectively. Those, in turn, are associated with locator objects. Now assume that the car and the door slam co-occur, i.e., the locator object of the audio segment is located during the one of the video segment. Figure 2 depicts a complete scenario for the example. Using relations between time points, one might use rules to define a during relation as a view based on the quantitative temporal information for the locator objects. However, using relations between time points, in principle every locator might be associated with every other locator, and thus the agent can hardly partition the large Abox into smaller parts.

[Fig. 2. A multimedia content structure with co-occurring domain ontology individuals: md1 (MultimediaDocument, hasDocID "ID57") is decomposed into vc1 (VideoContent) and ac1 (AudioContent); vs1 with locator vl1 (00:00:40–00:00:50) depicts c1: Car, as1 with locator al1 (00:00:43–00:00:49) depicts ds1: DoorSlam, al1 is d (during) vl1, and the derived role assertion CarDoorSlam(c1, ds1) is marked.]
Therefore, we have designed the agent in such a way that it adds qualitative relations such as overlaps (o), during (d), and meets (m) in order to make certain temporal information explicit that is hidden in the quantitative locator time specifications. The motivation for the agent to switch to the more verbose qualitative representation is that the input Abox becomes partitionable. Qualitative temporal relations are used in forward-chaining rules to compute assertions that are then explained by the agent (see above). For instance, based on the forward-chaining rule

∀x, xl, y, yl, w, z : VideoSegment(x), hasSegmentLocator(x, xl), VideoLocator(xl),
AudioSegment(y), hasSegmentLocator(y, yl), AudioLocator(yl), d(yl, xl),
depicts(x, w), depicts(y, z), Car(w), DoorSlam(z) → CarDoorSlam(w, z)

the role assertion CarDoorSlam(c1, ds1) (marked with an ellipse in Figure 2) is generated and added to the Abox. This new assertion is seen as a specific observation that requires an explanation [9]. Possible explanations are, e.g., car entry or car exit events, which might be represented using assertions CarEntry(i1) or CarExit(i2), where i1 and i2 are new individuals. Both individuals are associated with the car and the door slam individuals (role associatedWith, see above). Inevitably, in the course of explanation generation, the Abox grows again significantly. This leads to very large Aboxes (imagine the annotation of a two-hour movie), and the application of forward-chaining rules (as well as the abduction process) becomes very inefficient, since complex joins over huge relations can hardly be avoided in order to check whether rules are applicable (and to compute the bindings for variables). Pretty soon, the video description Abox no longer fits into main memory. In the following, we present a proposal to overcome the problem of Aboxes becoming too large.
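To make the rule application step concrete, the following minimal sketch (ours, a drastic simplification of the MI Agent, which actually uses a probabilistic abduction engine [9]) matches the CarDoorSlam rule against a small Abox represented as plain Python sets; the naive nested joins also hint at why rule application over one monolithic Abox becomes expensive.

# Illustrative sketch (our own simplification, not the CASAM engine):
# applying the CarDoorSlam forward-chaining rule of Section 4.1 to an
# Abox represented as sets of concept and role assertions.

concept_assertions = {
    ("VideoSegment", "vs1"), ("AudioSegment", "as1"),
    ("VideoLocator", "vl1"), ("AudioLocator", "al1"),
    ("Car", "c1"), ("DoorSlam", "ds1"),
}
role_assertions = {
    ("hasSegmentLocator", "vs1", "vl1"), ("hasSegmentLocator", "as1", "al1"),
    ("d", "al1", "vl1"),                       # qualitative relation added by the agent
    ("depicts", "vs1", "c1"), ("depicts", "as1", "ds1"),
}

def instances(concept):
    return {i for (c, i) in concept_assertions if c == concept}

def fillers(role, subject):
    return {o for (r, s, o) in role_assertions if r == role and s == subject}

def apply_car_door_slam_rule():
    """Return the CarDoorSlam role assertions derivable by the rule."""
    derived = set()
    for x in instances("VideoSegment"):
        for y in instances("AudioSegment"):
            for xl in fillers("hasSegmentLocator", x) & instances("VideoLocator"):
                for yl in fillers("hasSegmentLocator", y) & instances("AudioLocator"):
                    if ("d", yl, xl) in role_assertions:
                        for w in fillers("depicts", x) & instances("Car"):
                            for z in fillers("depicts", y) & instances("DoorSlam"):
                                derived.add(("CarDoorSlam", w, z))
    return derived

print(apply_car_door_slam_rule())   # {('CarDoorSlam', 'c1', 'ds1')} -- an observation to be explained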
4.2 Island Reasoning

As stated before, the input can be considered as a stream. The information content derived from the stream is collected over time and stored together with the interpretations in an Abox, or in multiple Aboxes if more than one interpretation is possible. These Aboxes are put on an agenda A maintained by the agent. The more knowledge is gathered, the larger those Aboxes become and the longer it takes to complete all computations, such as applying the forward-chaining rules or arranging the interpretation process itself. Current state-of-the-art DL reasoning systems cannot deal with this amount of information any more, because they rely on in-memory structures. To overcome this problem, island-based reasoning for ALCHI ontologies has been proposed as a solution in [16]. In the meantime, the island approach has been extended to SHIQ(D) by a more fine-grained syntactical analysis. Since SHIQ(D) is a more expressive description logic than ALH−fR+(D), the mechanism is also applicable to our annotation language. The underlying idea is that only a small subset of the concept and role assertions, called an island, is necessary to perform instance checking for a particular given individual i and a given (complex) concept C. The approach chosen here is to identify role assertions which can be used during the application of a tableau algorithm for instance checking [13] (note that (T, A) ⊨ C(i) can be reduced to checking whether (T ∪ A ∪ {¬C(i)}) is unsatisfiable via a tableau algorithm). First, the ontology is transformed into a normal form, called the shallow normal form. For the details of the transformation please refer to [16].

Given the shallow normal form, a so-called ∀-info structure for an ontology O is used to determine which concepts are (worst-case) propagated over role assertions in an Abox. This helps to define a notion of separability. The following definition of O-separability is used to determine the importance of role assertions in a given Abox A.

Definition 1. Given an ontology O = (T, A), a role assertion R(a, b) is called O-separable if we have INC(O) iff INC(⟨T, A2⟩), where

A2 = A \ {R(a, b)} ∪ {R(a, b′), R(a′, b)} ∪ {C(b′) | C(b) ∈ A} ∪ {C(a′) | C(a) ∈ A},

such that a′ and b′ are fresh individual names and INC(O) denotes that the ontology O is inconsistent.

Figure 3 shows a graphical representation of a role assertion that matches the definition of O-separability.

[Fig. 3. O-separability of a role R: R(a, b) is replaced by R(a, b′) and R(a′, b) with fresh individuals a′ and b′.]

Informally speaking, the idea is that O-separable assertions will never be used to propagate "complex and new information" via role assertions. The extraction of islands for instance checking in an ontology O, given an individual i, is now straightforward. A graph search can be used that starts from the individual i and follows each non-O-separable role assertion in the original Abox, until at most O-separable role assertions are left. All visited assertions are then worst-case relevant for the reasoning process. Regarding the proposed MCO, it is important to note that implicit information due to value restrictions ∀R.A holding for an individual i prevents the separation of a role assertion R(i, j) if A(j) is not explicitly specified in the respective Abox.

Example 1 (cont.). Applying the definition of O-separability to the Abox depicted in Figure 2, islands are computed as shown in Figure 4. Instead of applying all possible substitutions, the forward-chaining rule only needs to be applied to the island with the locators vl1 and al1 in order to add CarDoorSlam(c1, ds1). This enables parallel processing for abduction and retrieval scenarios. However, given the local range restriction for AudioContent, if as1 were not explicitly specified as an AudioSegment but only as a MultimediaSegment, the definition of O-separability would be violated for hasMediaDecomposition(ac1, as1), so that the respective island would be larger than before.

This general Abox modularization approach has proven to work well with regard to scalability issues. For more details, in particular regarding a theoretical and practical underpinning of the island approach, we refer the reader to [17].

[Fig. 4. Multimedia content structure divided into islands: O-separable role assertions are split by introducing fresh copies (e.g., md1′, vc1′, ac1′, vs1′, as1′); the island containing vs1, as1, vl1, al1, c1, and ds1 is the one carrying CarDoorSlam(c1, ds1).]
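The island extraction described above amounts to a plain graph traversal. The following minimal sketch (ours, not the CASAM implementation) illustrates the idea; O-separability is assumed to be precomputed and passed in as a predicate rather than being derived from the ∀-info structure.

# Illustrative sketch (ours): extracting the island around an individual by
# following non-O-separable role assertions; whether an assertion is
# O-separable is assumed to be known (e.g., from the ∀-info structure).
from collections import deque

def extract_island(individual, role_assertions, separable):
    """role_assertions: set of (role, subject, object) triples.
    separable: function returning True if a role assertion is O-separable.
    Returns the individuals and role assertions of the island around `individual`."""
    island_individuals = {individual}
    island_assertions = set()
    queue = deque([individual])
    while queue:
        current = queue.popleft()
        for assertion in role_assertions:
            role, subj, obj = assertion
            if current not in (subj, obj) or assertion in island_assertions:
                continue
            island_assertions.add(assertion)      # visited, hence worst-case relevant
            if not separable(assertion):          # only non-separable edges are followed
                neighbour = obj if current == subj else subj
                if neighbour not in island_individuals:
                    island_individuals.add(neighbour)
                    queue.append(neighbour)
    return island_individuals, island_assertions

For the Abox of Fig. 2, treating the decomposition role assertions as O-separable (as suggested by Fig. 4) and starting from vs1 yields exactly the island that suffices to derive CarDoorSlam(c1, ds1).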
5 Conclusion

Taking MPEG-7 and [7] into consideration, a multimedia content ontology has been introduced that is represented with the DL ALH−fR+(D). The expressiveness of this logic has proven to be sufficient for the MCO arising from the scenarios considered in the CASAM project. As the MCO covers most of the relevant concepts, roles, and attributes to be expected in video streaming scenarios, ALH−fR+(D) can safely be assumed to be sufficient for similar multimedia interpretation scenarios. Based on time information in Aboxes corresponding to this MCO, a multimedia agent performs stream-based forward chaining and abductive backward chaining in order to obtain interpretation possibilities. Partitioning techniques ensure that interpretation Aboxes can be decomposed into manageable parts such that even large videos can be handled (Aboxes can be swapped to secondary memory).

Some initial experiments were performed to see how the approach behaves in the CASAM context. The results are very promising: almost all role assertions were O-separable after qualitative assertions had been added to the Aboxes, so that quantitative information is no longer required. Thus, switching from a quantitative to a qualitative representation provides practical benefits for the agent. Our work complements other work on stream reasoning, e.g., on efficiently maintaining materialized views as described in [18]. We show that in some cases the views based on quantitative information can be avoided.

References

1. Castano, S., Espinosa, S., Ferrara, A., Karkaletsis, V., Kaya, A., Möller, R., Montanelli, S., Petasis, G., Wessel, M.: Multimedia interpretation for dynamic ontology evolution. Journal of Logic and Computation 19(5) (2008) 859–897
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked data – the story so far. International Journal on Semantic Web and Information Systems (IJSWIS) 5(3) (2009) 1–22
3. Espinosa, S., Kaya, A., Möller, R.: The BOEMIE semantic browser: A semantic application exploiting rich semantic metadata. In: Proceedings of the Applications of Semantic Technologies Workshop (AST-2009), Lübeck, Germany (2009)
4. ISO/IEC 15938-5 FCD: Multimedia content description interface (MPEG-7) (2002)
5. Hunter, J.: Adding multimedia to the Semantic Web: Building an MPEG-7 ontology. In: Proc. of the 1st Semantic Web Working Symposium, Stanford University, California, USA (2001) 261–283
6. Staab, S., Franz, T., Görlitz, O., Saathoff, C., Schenk, S., Sizov, S.: Lifecycle knowledge management: Getting the semantics across in X-Media. In: Foundations of Intelligent Systems, 15th International Symposium, ISMIS 2006, Bari, Italy, September 2006. LNCS, Springer (2006) 1–10
7. Dasiopoulou, S., Dalakleidi, T., Tzouvaras, V., Kompatsiaris, Y.: D3.4 – Multimedia ontologies. BOEMIE project deliverable, National Technical University of Athens (2008)
8. Wandelt, S., Möller, R.: Updatable island reasoning over ALCHI ontologies. In: Conference on Knowledge Engineering and Ontology Development (KEOD). CEUR Workshop Proceedings, Vol. 477 (2009)
9. Gries, O., Möller, R., Nafissi, A., Rosenfeld, M., Sokolski, K., Wessel, M.: A probabilistic abduction engine for media interpretation. In Alferes, J., Hitzler, P., Lukasiewicz, T., eds.: Proc. International Conference on Web Reasoning and Rule Systems (RR-2010) (2010)
10. Haarslev, V., Möller, R.: On the scalability of description logic instance retrieval. Journal of Automated Reasoning 41(2) (2008) 99–142
11. Haarslev, V., Möller, R.: Racer: A core inference engine for the Semantic Web. In: Proc. of the 2nd International Workshop on Evaluation of Ontology-based Tools, located at the 2nd International Semantic Web Conference (ISWC) (2003)
12. Baader, F., Hanschke, P.: A scheme for integrating concrete domains into concept languages.
In: Proc. of the International Joint Conference on Artificial Intelligence (IJCAI-91) (1991)
13. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F., eds.: The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press (2003)
14. Garcia, R., Celma, O.: Semantic integration and retrieval of multimedia metadata. In: Proc. of the 4th International Semantic Web Conference (ISWC), Galway, Ireland (2005)
15. Allen, J.F.: Maintaining knowledge about temporal intervals. Commun. ACM 26(11) (1983) 832–843
16. Wandelt, S., Möller, R.: Island reasoning for ALCHI ontologies. In: Proceedings of the 5th International Conference on Formal Ontology in Information Systems (FOIS 2008), IOS Press (2008)
17. Wandelt, S., Möller, R.: Towards Abox modularization of semi-expressive description logics. Journal of Applied Ontology 7(2) (2012) 133–167
18. Barbieri, D., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Incremental reasoning on streams and rich background knowledge. In Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T., eds.: The Semantic Web: Research and Applications. Volume 6088 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg (2010) 1–15