MOWIS: A system for building Multimedia Ontologies from
              Web Information Sources

                       Vincenzo Moscato, Antonio Penta, Fabio Persia, Antonio Picariello
                                                           University of Naples
                                                 Dipartimento di Informatica e Sistemistica
                                                      via Claudio 21, 80125, Naples
                                       {vmoscato,a.penta,fabio.persia,picus}@unina.it

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

ABSTRACT
Defining ontologies within the multimedia domain remains a challenging task, due to the complexity of multimedia data and of the associated knowledge. In this paper, we propose: i) a novel multimedia ontology model that combines both low-level descriptors and high-level semantic concepts; ii) an automatic construction of ontologies using the Flickr web services, which provide images, tags, keywords and sometimes useful annotations describing both the content of an image and personal information of interest. Finally, we describe an example of automatic ontology construction in a specific domain.

1.     INTRODUCTION
   Nowadays, many repositories containing both multimedia data and the related annotations or metadata are publicly available on the web. Such information may be used for the automatic generation of multimedia knowledge, which is particularly suitable for a variety of applications, such as information retrieval, browsing, data mining and so on.
   It is well known in the literature that, despite the large number of papers produced about multimedia databases and knowledge representation, there is not yet an accepted solution to the problem of how to represent, organize and manage multimedia data and the related semantics by means of a formal framework. Usually, a multimedia database is described by means of "flat" metadata, most of the time using a predefined set of metadata (as in the MPEG standard), or sometimes using short annotations in natural language: such structures are substantially inadequate to support complete retrieval by content of image documents.
   It is the authors' opinion that there is still a great deal of work to do with respect to the intensional aspects of a multimedia ontology:

     • what a multimedia ontology is: is it a taxonomy, or a semantic network of metadata (tags, annotations)?
     • does a multimedia ontology support concrete data: what is the role of raw data (image, video, audio data), if any?
     • what multimedia semantics is: how to define and capture the semantics of multimedia data?
     • how to build extensional ontologies: once a suitable formal framework has been defined, can we automatically build the defined multimedia ontologies?

   Throughout the rest of the paper, we will try to give an answer to all the previously cited aspects; in particular, the original contribution of this work is: (i) to propose a novel multimedia ontology framework, in particular related to the image domain; (ii) to propose a technique for building ontologies that operates on large corpora of human-annotated repositories, namely the Flickr [7] database, considering both low-level image processing strategies and the keywords and annotations produced by humans when they store their data.
   We provide an algorithm for creating an image ontology in a specific domain by gathering together all this different information. We then provide an example of automatic construction of an image ontology and a discussion of the encountered problems and the provided solutions. We conclude that the framework is promising and sufficiently scalable to different domains.
   The remainder of the paper is organized as follows. Section 2 outlines the related work on multimedia ontologies. Section 3 describes the process for building an Image Ontology. Section 4 details the system architecture with some implementation issues, and a case study for our process is shown in Section 5. Section 6 reports some discussion and conclusions.

2.    RELATED WORKS
   In the last few years, several papers have been presented about multimedia systems based on knowledge models, image ontologies, and fuzzy extensions of ontology theories.
   In almost all these works, multimedia ontologies are effectively used to perform semantic annotation of the media content by manually associating the terms of the ontology with the individual elements of the image or video domains [12], thus demonstrating that the use of ontologies can enhance classification precision and image retrieval performance.
   Instead of creating a new ontology from scratch, other approaches [3] extend WordNet to image-specific concepts, using an annotated image corpus as an intermediate step to compute similarity between example images and images in the image collection.
   To solve uncertain reasoning problems, the theory of fuzzy ontologies is presented in several works as an extension of ontologies with crisp concepts, as in [6], which presents a complete fuzzy framework for ontologies. In [8], the authors introduce a description logic framework for the interpretation of image contents.
   Multimedia semantics papers based on MPEG-7 [9] are very interesting. The MPEG-7 framework consists of Descriptors (Ds) and Description Schemes (DSs) that represent features for multimedia, and more complex structures grouping Ds and DSs, respectively.
   In particular, the MPEG-7 standard includes tags that describe visual features (e.g. color), audio features (e.g. timbre), structure (e.g. moving regions and video segments), semantics (e.g. objects and events), management (e.g. creator and format), collection organization (e.g. collections and models), summaries (e.g. hierarchies of key frames) and even user preferences (e.g. for search) of multimedia. In this way the standard includes descriptions of low-level media-specific features that can often be automatically extracted from the different media types.
   Unfortunately, MPEG-7 is not currently suitable for describing top-level multimedia features, because: i) its XML Schema-based nature prevents the effective manipulation of descriptions and its use of URNs is cumbersome for the web; ii) it is not open to the web standards for representing knowledge.
   Other efforts have also been made to translate the semantics of the standard into knowledge representation languages [11]. All these methods perform a one-to-one translation of MPEG-7 types into OWL concepts and properties.
   Finally, a very interesting work reported in [1] has been proposed in order to define a multimedia ontology. The authors try to define a novel multimedia ontology that takes into account the semantics of the MPEG-7 standard. They start from some patterns derived from the foundational ontology DOLCE [10]. In particular, they use two design patterns: Descriptions & Situations (D & S) and Ontology of Information Objects (OIO). The obtained ontology already covers a very large part of the standard, while their modeling approach aims to offer even more possibilities for multimedia annotation than MPEG-7, since it is truly interoperable with existing web ontologies. This approach fits interoperability purposes, but some constraints on the image semantics are introduced.

3.     BUILDING AN IMAGE ONTOLOGY

3.1     An Image Ontology Model
   An ontology is usually referred to as an "explicit specification of a conceptualization" which is, in turn, "the objects, concepts, and other entities that are presumed to exist in some area of interest and the relationships that hold among them".
   Stressing its conceptual nature, an ontology may be considered as a theory used to represent relevant notions about domain modeling, a "domain" being classified in terms of concepts, relationships and constraints on them.
   Let us consider the image domain: as usual in a given media, we detect symbols, objects and concepts; in a certain image we have a region of pixels (symbol) related to a portion of multimedia data; this region is an instance (object) of a certain concept.
   In other words, we can detect concepts but we are not able to disambiguate among the instances without some specific knowledge.
   A simplified version of the described vision process will consider only two main levels: Low and High. In fact, the knowledge associated to an image can be easily described at two different levels of analysis:

     • Low level - the low-level descriptions of raw images;
     • High level - general or domain-specific image content concepts that may or may not be derivable from low-level ones.

   It is the authors' opinion that an image ontology has to take into account these specific characteristics, as captured by the following definition:

   DEFINITION 1 (IMAGE ONTOLOGY). An Image Ontology is a directed and labeled graph (V, E, ρ), where:

   1. V is a finite set of nodes that can be of different kinds:
         • low-level nodes (V_l), corresponding to an image with the related properties:
              – content (e.g. texture, shape, color, objects, etc.) or more enhanced features;
              – metadata (e.g. author, title, description, tags, etc.);
         • high-level nodes (V_h), corresponding to general concepts, domain-specific concepts, or image content concepts (that could be associated to low-level nodes);
   2. E is a subset of (V × V);
   3. ρ is a function that associates to each pair of nodes in E a label ρ_s indicating the kind of relationship between the two nodes, and its reliability degree ρ_r ∈ [0, 1]: ρ : E → ⟨ρ_s, ρ_r⟩.

   Depending on the type of relationship in our model, we distinguish among:

     • similarity relationship: relates two low-level nodes (images) as a function of their similarity degree, exploiting classical image matching algorithms based on low-level features (e.g. color, texture, shape, etc.);
     • representativeness relationship: relates a high-level node to the low-level nodes whose content features best represent the associated concept, by means of clustering or classification algorithms that determine the probability that an image is a valid representative of the concept;
     • semantic relationship: relates two high-level nodes (examples are relationships such as hypernym/hyponym, holonym/meronym and synonym, retrievable from lexical databases).

3.2    The Image Ontology building
   The purpose of the image ontology building process (figure 1) is to construct the graph representing the image ontology by a supervised approach.

               Figure 1: Image Ontology Building Process

   The process is made of:
   1. a definition of an initial taxonomy containing few high-level nodes (related to the main concepts of a specific domain),
   2. an extraction of useful information (images and annotations related to the taxonomy concepts) from several annotated web repositories,

   3. a content-based analysis on the raw data and a semantic processing on the related textual annotations,

   4. the ontology construction.

3.2.1     Taxonomy definition
   Our image ontology building process is domain-oriented. Thus, during this step, it is necessary to define an initial taxonomy containing the hierarchy of relevant concepts of the considered domain, which is represented by a subset of high-level nodes.

3.2.2     Information extraction
   The main objective of this task is to fetch images and the related textual annotations from web repositories, corresponding to the leaf high-level nodes of the image ontology, and to extract some useful low- and high-level information. Dedicated communication APIs or web services are exploited to obtain the requested information.
   In this paper we used Flickr as the web image repository.

3.2.3     Content-Based analysis
   The goal of this task is to obtain a low-level description of images in terms of content features, using classical Computer Vision techniques.
   We decided to use the salient points technique, based on the Animate Vision paradigm [2], which exploits color, texture and shape information associated with those regions of the image that are relevant to human attention (Focus of Attention), in order to obtain a compact characterization (namely the Information Path) that can also be used to evaluate the similarity between images, and for indexing purposes. An information path IP = ⟨F_s(p_s; τ_s), h_b(F_s), Σ_{F_s}⟩ can be seen as a particular data structure that contains, for each region F_s(p_s; τ_s) surrounding a given salient point (where p_s is the center of the region and τ_s is the observation time spent by a human to detect the point), the color features in terms of the HSV histogram h_b(F_s), and the texture and shape features in terms of the wavelet covariance signatures Σ_{F_s} (see [2] for more details). Finally, dedicated supervised classification algorithms are exploited to determine content features [2].

3.2.4     Semantic processing
   In this task the main objective is to discover textual labels that better reflect image semantics, using NLP techniques and topic detection algorithms on the textual annotations coming from the considered image repositories. As far as Flickr is concerned, images usually have three main pieces of attached information: i) a title, ii) a content description and iii) a set of keywords, namely tags.
   Titles in the majority of cases contain text that summarizes the content of the images, while in other cases they consist of automatically generated text that is not useful in the indexing process. Descriptions are very short and usually are not posted for retrieval purposes: they typically contain sentences concerning the context of the picture, or the user's opinion. Finally, tags are simple keywords that users are asked to submit (although they may not insert any tag) and that should describe the context of the image (e.g. among the tags for a picture of an "elephant in an African landscape", you will probably see the words 'elephant', 'Africa' and 'landscape').
   The simple use of tags does not improve the efficiency of indexing and searching contents. The absence of restrictions on the vocabulary from which tags are chosen can easily lead to the presence of synonyms (multiple tags for the same concept), homonyms (same tag used with different meanings), and polysemies (same tag with multiple related meanings). Inaccurate or irrelevant tags also result from the so-called 'meta-noise', e.g. lack of stemming (normalization of word inflections), and from the heterogeneity of users and contexts: hence an effective use of the tags requires them to be stemmed, disambiguated, and appropriately selected.
   To these purposes, information coming from tags can be usefully analyzed in combination with titles and descriptions by suitable NLP techniques that overcome the linguistic and semantic heterogeneity of such information, in order to extract a set of relevant keywords which more effectively represent image content.
   In particular, the semantic processing applied to the textual data attached to a given image can be decomposed into a set of sequential sub-tasks [13]: meta-noise and named entity filtering, linguistic normalization, part-of-speech tagging, tokenization, word sense disambiguation and topic extraction. Thus, the result of the semantic processing task is a set of labels (topics) extracted from the set of tags, title and descriptions, each with an associated confidence value that represents the relative importance of the label with respect to the other ones in the annotations.

3.2.5     Ontology building
   As previously discussed, the obtained knowledge in terms of images, low-level characteristics and labels is then merged and translated into a graph representing the image ontology.
   In particular, in a first step, all images whose relevant labels are associated with a high confidence value to the high-level nodes corresponding to the taxonomy leaves are represented by dedicated low-level nodes; in addition, pairs of image nodes whose similarity (computed by means of the Information Path Matching algorithm [2]) is greater than a threshold are linked by an edge having the related similarity measure as reliability degree. In the next step, these images are clustered by using a Balanced Expectation Maximization algorithm [2] applied in the feature spaces defined by the Information Path descriptors, in order to determine for the high-level nodes the set of images that best represent the related concepts. Dedicated edges (representativeness relationships) link such nodes with the representatives of each cluster. Finally, by means of a Learning Tag Relevance algorithm [4], topics that are more relevant with respect to the content of images belonging to the same cluster (winner topics) are promoted to be image ontology high-level nodes. In particular, the tag relevance σ of a generic tag τ of the most significant image (centroid) of cluster C is computed by the following formula:

\[
\sigma(\tau, C) = \sum_{i=1}^{m} \left| \mathit{idf}(\tau) \cdot \frac{\mathit{tf}(\tau, i) \cdot (a + 1)}{\mathit{tf}(\tau) + a \cdot \left(1 - b + b \cdot \frac{U_i}{\bar{U}}\right)} \right| \qquad (1)
\]

where: tf(τ, i) is the term frequency of topic τ with respect to the topics of all images belonging to C; U_i and Ū are the number of topics of the i-th image of C and the average number of tags related to all images belonging to C, respectively; idf(τ) is the inverse document frequency of τ in C. The winner topics, whose relevance is greater than a threshold, are finally inserted as high-level nodes in the ontology and linked, on the one hand, to the image node that corresponds to the cluster centroid and, on the other hand, to those nodes whose semantic distance (i.e. Wu/Palmer) is minimum with respect to the current topic. If possible, the new ontology edge is labeled with the type of semantic relationship (e.g. hypernym/hyponym, holonym/meronym, etc.).
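   As an illustration of how winner topics could be selected with formula (1), the following Python fragment is a minimal sketch and not the implementation used in MOWIS. Several points are assumptions made only for this example: the function names tag_relevance and winner_topics are hypothetical; tf(τ, i) is taken as the number of occurrences of the topic among the topics of the i-th image; the tf term printed in the denominator of (1) is read as that same per-image frequency; idf is computed in a smoothed form, log(1 + m/df); and the constants a and b, which the paper does not specify, are given illustrative values.

```python
from math import log


def tag_relevance(topic, cluster_topics, a=1.2, b=0.75):
    """Relevance of `topic` in a cluster, following the shape of formula (1).

    cluster_topics: one list of topic labels per image of the cluster C.
    a, b: tuning constants (ASSUMPTION: illustrative values, not from the paper).
    """
    m = len(cluster_topics)
    if m == 0:
        return 0.0
    # Number of images of C whose topics contain the given topic.
    doc_freq = sum(1 for topics in cluster_topics if topic in topics)
    if doc_freq == 0:
        return 0.0
    # ASSUMPTION: smoothed idf, so a topic shared by all images keeps a positive weight.
    idf = log(1.0 + m / doc_freq)
    # Average number of topics per image (U-bar in formula (1)).
    avg_len = sum(len(topics) for topics in cluster_topics) / m
    score = 0.0
    for topics in cluster_topics:
        # ASSUMPTION: tf(tau, i) is the count of the topic in the i-th image.
        tf = topics.count(topic)
        length_norm = 1.0 - b + b * len(topics) / avg_len
        score += abs(idf * tf * (a + 1.0) / (tf + a * length_norm))
    return score


def winner_topics(cluster_topics, threshold):
    """Return the topics whose relevance exceeds the threshold, with their scores."""
    candidates = {t for topics in cluster_topics for t in topics}
    scores = {t: tag_relevance(t, cluster_topics) for t in candidates}
    return {t: s for t, s in scores.items() if s > threshold}


# Toy cluster of three images with their extracted topics (invented data).
cluster = [
    ["sea", "capri", "boat"],
    ["sea", "capri"],
    ["capri", "villa", "garden", "sea"],
]
print(sorted(winner_topics(cluster, threshold=2.0)))  # ['capri', 'sea']
```

   In this toy cluster the topics shared by all images pass the threshold, while topics that occur in a single image do not; with a different idf definition or different values of a and b the ranking may change.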
   Thus, the image ontology can be generated incrementally, in correspondence with fetch operations from the Flickr repository.

4.    THE SYSTEM ARCHITECTURE
   The system architecture that supports the image ontology building process is shown in figure 2. Through a dedicated graphical interface, the user generates an OWL file coding the initial taxonomy containing the relevant concepts of the considered domain. This file is then the input of the Information Fetching module, which downloads images and the related annotations from the Multimedia Repository, using as search keywords the concepts related to the leaf nodes of the taxonomy and some filters on users.

                  Figure 2: System Architecture

   A Storage Engine module receives such information and stores image annotations (title, description, author, tags, labels, etc.) in a dedicated RDF Database, and raw data together with low-level characteristics in an Image Database. Each image is then identified in these databases by a URI (Uniform Resource Identifier).
   Finally, the Information Extraction and Information Processor modules analyze both the high-level information stored in the RDF database and the low-level information contained in the Image database, in order to generate or update, by means of the Ontology Manager and the Clustering Manager and according to the described process, a graph which represents the final multimedia ontology.
   As far as implementation issues are concerned, we note that: (i) the initial taxonomy is generated by a Java desktop application that uses the Protégé API; (ii) Flickr has been chosen as the multimedia repository; (iii) the Information Fetching module has been implemented as a Java application that exploits the Flickr API; (iv) the RDF and Image Databases have been realized by the Sesame and PostgreSQL DBMSs, respectively; (v) the Information Fetching and Indexing packages have been implemented as dedicated Java packages.

5. A CASE STUDY
   This section describes a case study for our image ontology building process. In particular, we have built an ontology pertinent to Capri, a wonderful Italian island off the Sorrentine Peninsula, on the south side of the Gulf of Naples. A set of experts on the natural and cultural attractions of Capri provided, as initial taxonomy, a graph containing the most relevant concepts in terms of high-level nodes for the considered domain.
   We used Flickr [7] as the multimedia repository of annotated images. Flickr is one of the most popular web-based tagging systems, which allow human participants to annotate a particular resource, such as web pages, blogs or images, with a freely chosen set of keywords, or tags, together with a short description of the content.
   This kind of system has been recently termed folksonomy [5], i.e. a folk taxonomy of important and emerging concepts within user groups. The dynamic nature of these repositories assures the richness of the annotations; in addition, they are quite accurate, because they are produced by humans who want to share their images and the experience they have had, using tags and an annotation process.
   The Flickr repository has been queried using as search keywords the logical AND between the concepts reported in the leaf nodes of the taxonomy and the one corresponding to the root node, and exploiting some filters on user ids, in order to retrieve images really belonging to the domain. Each retrieved image undergoes a content-based analysis to determine the low-level description, i.e. the IP (Information Path) and content features. Moreover, in a first step we estimated the similarity existing between each pair of different images by comparing their IPs by means of the Information Path Matching algorithm [2].
   All images belonging to the same concept are then clustered into different groups, each containing images that are more similar among themselves. As clustering procedure we used the BEM algorithm [2], which is recursively invoked to dynamically determine the best-fitting clusters without knowing a priori the number of clusters (which is usually proportional to the number of images related to the current concept). Then, for each cluster, we selected as representative image the one closest to all the other images of the cluster, and a suitable representation probability is associated to each representative image on the basis of the minimum and average distances.
   The process is iterated for each taxonomy leaf concept and the ontology is incrementally built: images belonging to different topics can be linked on the basis of their similarity values, allowing the multimedia knowledge to be merged into a unique graph. Thus, the more relevant tags are propagated in the ontology and linked to the other nodes.
   We report in figure 3 a complete step-by-step example of the generation of the Capri ontology.

6. CONCLUSION
   In this paper we have addressed the problem of building a multimedia ontology in an automatic way using annotated image repositories. Our work differs from the previous papers presented in the literature for several reasons. First, we propose a notion of multimedia ontology, described by means of a graph and particularly suitable for managing the different levels of semantics of images. In addition, we obtain a dynamic generation of image ontologies using tags and annotations already produced by users in their social web networks.
   Future work will be devoted to producing experimental results to evaluate the effectiveness of the produced ontologies with respect to other approaches by means of different criteria: class match measure, density measure, semantic similarity measure, betweenness measure.
                                                Figure 3: Building of the Capri Ontology


7.   REFERENCES
 [1] R. Arndt, R. Troncy, S. Staab, and L. Hardman. Adding formal semantics to MPEG-7: Designing a well-founded multimedia ontology for the web. Technical report, University of Koblenz, 2007.
 [2] G. Boccignone, A. Chianese, V. Moscato, and A. Picariello. Context-sensitive queries for image retrieval in digital libraries. Journal of Intelligent Information Systems, 31(1), 2008.
 [3] Y. Chang and H. Chen. Approaches of using a word-image ontology and an annotated image corpus as intermedia for cross-language image retrieval. In Proceedings of Cross-Language Evaluation Forum, 2006.
 [4] S. Golder and A. Hubemann. Usage patterns of collaborative tagging systems. Information Science, 2006.
 [5] L. Kennedy, M. Naaman, S. Ahern, R. Nair, and T. Rattenbury. How Flickr helps us make sense of the world: context and content in community-contributed media collections. ACM Multimedia, 2007.
 [6] C. Lee, Z. Jian, and L. Huang. A fuzzy ontology and its application to news summarization. IEEE Transactions on Systems, Man and Cybernetics, 35:859–880, 2005.
 [7] K. Lerman and L. Jones. Social browsing on Flickr. CoRR, abs/cs/0612047, 2006.
 [8] R. Maller and B. Neumann. Ontology-based reasoning techniques for multimedia interpretation and retrieval. In Springer, editor, Semantic Multimedia and Ontologies, pages 55–98. Springer London, 2008.
 [9] B. Manjunath, P. Salembier, and T. Sikora. Introduction to MPEG-7: Multimedia Content Description Interface. 2002.
[10] C. Masolo et al. The WonderWeb library of foundational ontologies (WFOL). Technical report, WonderWeb Deliverable 17, 2002.
[11] J. V. Ossenbruggen, F. Nack, and L. Hardman. That obscure object of desire: Multimedia metadata on the web, part 2. IEEE Multimedia, 12:54–63, 2005.
[12] G. Stamou et al. Multimedia annotations on the semantic web. IEEE Multimedia, 13:86–90, 2006.
[13] D. Trieschnigg and W. Kraaij. TNO hierarchical topic detection report at TDT 2004. Proceedings of Corpus Linguistics 2005, 99(7):1–8, 2004.