MOWIS: A System for Building Multimedia Ontologies from Web Information Sources

Vincenzo Moscato, Antonio Penta, Fabio Persia, Antonio Picariello
University of Naples, Dipartimento di Informatica e Sistemistica
via Claudio 21, 80125, Naples
{vmoscato,a.penta,fabio.persia,picus}@unina.it

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

ABSTRACT

Defining ontologies within the multimedia domain remains a challenging task, due to the complexity of multimedia data and of the associated knowledge. In this paper, we propose: i) a novel multimedia ontology model that combines low-level descriptors with high-level semantic concepts; ii) an automatic construction of ontologies using the Flickr web services, which provide images, tags, keywords and, sometimes, useful annotations describing both the content of an image and personal information of interest. Finally, we describe an example of automatic ontology construction in a specific domain.

1. INTRODUCTION

Nowadays, many repositories containing both multimedia data and the related annotations or metadata are publicly available on the web. This kind of information may be used for the automatic generation of multimedia knowledge, which is particularly suitable for a variety of applications, such as information retrieval, browsing, data mining and so on.

It is well known in the literature that, despite the large number of papers produced about multimedia databases and knowledge representation, there is not yet an accepted solution to the problem of how to represent, organize and manage multimedia data and the related semantics by means of a formal framework.

Usually, a multimedia database is described by means of "flat" metadata, most of the time using a predefined set of metadata (as in the MPEG standard), or sometimes using short annotations in natural language: such structures are substantially inadequate to support complete retrieval by content of image documents.

It is the authors' opinion that there is still a great deal of work to do with respect to the intensional aspects of a multimedia ontology:

• what a multimedia ontology is: is it a taxonomy, or a semantic network of metadata (tags, annotations)?

• does a multimedia ontology support concrete data: what is the role of raw data (image, video, audio data), if any?

• what multimedia semantics is: how to define and capture the semantics of multimedia data?

• how to build extensional ontologies: once a suitable formal framework has been defined, can we automatically build the defined multimedia ontologies?

Throughout the rest of the paper, we will try to give an answer to all the aspects cited above; in particular, the original contribution of this work is: (i) to propose a novel multimedia ontology framework, in particular related to the image domain; (ii) to propose a technique for building ontologies that operates on large corpora of human-annotated repositories, namely the Flickr [7] database, considering both low-level image processing strategies and the keywords and annotations produced by humans when they store their data.

We provide an algorithm for creating an image ontology in a specific domain by gathering together all this different information. We then provide an example of automatic construction of an image ontology and a discussion of the problems encountered and the solutions provided. We conclude that the framework is promising and sufficiently scalable to different domains.

The remainder of the paper is organized as follows. Section 2 outlines related work on multimedia ontologies. Section 3 describes the process for building an Image Ontology. Section 4 details the system architecture with some implementation issues, and a case study for our process is shown in Section 5. Section 6 reports some discussion and conclusions.

2. RELATED WORKS

In the last few years, several papers have been presented about multimedia systems based on knowledge models, image ontologies and fuzzy extensions of ontology theories.

In almost all of these works, multimedia ontologies are effectively used to perform semantic annotation of the media content by manually associating the terms of the ontology with the individual elements of the image or video domains [12], thus demonstrating that the use of ontologies can enhance classification precision and image retrieval performance.

Instead of creating a new ontology from scratch, other approaches [3] extend WordNet to image-specific concepts, using an annotated image corpus as an intermediate step to compute the similarity between example images and images in the image collection.

To address uncertain reasoning problems, the theory of fuzzy ontologies is presented in several works as an extension of ontologies with crisp concepts, as in [6], which presents a complete fuzzy framework for ontologies. In [8], instead, the authors introduce a description logic framework for the interpretation of image contents.

Papers on multimedia semantics based on MPEG-7 [9] are also of great interest. The MPEG-7 framework consists of Descriptors (Ds) and Descriptor Schemes (DSs) that represent, respectively, features for multimedia and more complex structures grouping Ds and DSs. In particular, the MPEG-7 standard includes tags that describe visual features (e.g. color), audio features (e.g. timbre), structure (e.g. moving regions and video segments), semantics (e.g. objects and events), management (e.g. creator and format), collection organization (e.g. collections and models), summaries (e.g. hierarchies of key frames) and even user preferences (e.g. for search) of multimedia.
In this way the standard includes descriptions of low-level media-specific features that can often be automatically extracted from the different media types. Unfortunately, MPEG-7 is not currently suitable for describing top-level multimedia features, because: i) its XML Schema-based nature prevents the effective manipulation of descriptions, and its use of URNs is cumbersome for the web; ii) it is not open to the web standards for representing knowledge.

Other efforts have also been made to translate the semantics of the standard into knowledge representation languages [11]. All these methods perform a one-to-one translation of MPEG-7 types into OWL concepts and properties.

Finally, a very interesting work aimed at defining a multimedia ontology is reported in [1]. The authors try to define a novel multimedia ontology that takes into account the semantics of the MPEG-7 standard. They started from some patterns derived from the foundational ontology DOLCE [10]. In particular, they used two design patterns: Descriptions & Situations (D & S) and Ontology of Information Objects (OIO). The resulting ontology already covers a very large part of the standard, while their modeling approach aims to offer even more possibilities for multimedia annotation than MPEG-7, since it is truly interoperable with existing web ontologies. This approach fits interoperability purposes, but it introduces some constraints on the image semantics.

3. BUILDING AN IMAGE ONTOLOGY

3.1 An Image Ontology Model

An ontology is usually referred to as an "explicit specification of a conceptualization", which is, in turn, "the objects, concepts, and other entities that are presumed to exist in some area of interest and the relationships that hold among them".

Stressing its conceptual nature, an ontology may be considered as a theory used to represent relevant notions about domain modeling, a "domain" being classified in terms of concepts, relationships and constraints on them.

Let us consider the image domain: as usual in a given medium, we detect symbols, objects and concepts; in a certain image we have a region of pixels (symbol) related to a portion of multimedia data; this region is an instance (object) of a certain concept.

In other words, we can detect concepts, but we are not able to disambiguate among the instances without some specific knowledge.

A simplified version of the described vision process considers only two main levels: Low and High. In fact, the knowledge associated with an image can be easily described at two different levels of analysis:

• Low level: the low-level descriptions of raw images;

• High level: general or domain-specific image content concepts that may or may not be derivable from the low-level ones.

It is the authors' opinion that an image ontology has to take into account these specific characteristics, as captured by the following definition:

DEFINITION 1 (IMAGE ONTOLOGY). An Image Ontology is a directed and labeled graph (V, E, ρ), where:

1. V is a finite set of nodes that can be of different kinds:

   • low-level nodes (V_l), corresponding to an image with the related properties:

     – content (e.g. texture, shape, color, objects, etc.) or more enhanced features;

     – metadata (e.g. author, title, description, tags, etc.);

   • high-level nodes (V_h), corresponding to general concepts, domain-specific concepts, or image content concepts (that could be associated with low-level nodes);

2. E is a subset of (V × V);

3. ρ is a function that associates with each pair of nodes a label ρ_s indicating the kind of relationship between the two nodes, and its reliability degree ρ_r ∈ [0, 1]: ρ : E → ⟨ρ_s, ρ_r⟩.

Depending on the type of relationship, in our model we distinguish among:

• similarity relationship: relates two low-level nodes (images) as a function of their similarity degree, exploiting classical image matching algorithms based on low-level features (e.g. color, texture, shape, etc.);

• representativeness relationship: relates a high-level node to the low-level nodes containing the content features that best represent the associated concept, by means of clustering or classification algorithms that determine the probability that an image is a valid representative of the concept;

• semantic relationship: relates two high-level nodes (examples are relationships such as hypernym/hyponym, holonym/meronym and synonym, retrievable from lexical databases).

3.2 The Image Ontology building

The purpose of the image ontology building process (figure 1) is to construct the graph representing the image ontology by a supervised approach.

Figure 1: Image Ontology Building Process

The process is made of:

1. a definition of an initial taxonomy containing a few high-level nodes (related to the main concepts of a specific domain),

2. an extraction of useful information (images and annotations related to the taxonomy concepts) from several annotated web repositories,

3. a content-based analysis of the raw data and a semantic processing of the related textual annotations,

4. the ontology construction.

3.2.1 Taxonomy definition

Our image ontology building process is domain-oriented. Thus, during this step, it is necessary to define an initial taxonomy containing the hierarchy of relevant concepts of the considered domain, which is represented by a subset of the high-level nodes.

3.2.2 Information extraction

The main objective of this task is to fetch from web repositories the images, and the related textual annotations, that correspond to the leaf high-level nodes of the image ontology, and to extract some useful low-level and high-level information. Dedicated communication APIs or web services are exploited to obtain the requested information.

In this paper we used Flickr as the web image repository.

3.2.3 Content-Based analysis

The goal of this task is to obtain a low-level description of images in terms of content features, using classical Computer Vision techniques.

We decided to use the salient points technique, based on the Animate Vision paradigm [2], which exploits color, texture and shape information associated with those regions of the image that are relevant to human attention (Focus of Attention), in order to obtain a compact characterization (namely the Information Path) that can also be used to evaluate the similarity between images and for indexing purposes.

An information path IP = ⟨F_s(p_s; τ_s), h_b(F_s), Σ_{F_s}⟩ can be seen as a particular data structure that contains, for each region F_s(p_s; τ_s) surrounding a given salient point (where p_s is the center of the region and τ_s is the observation time spent by a human to detect the point), the color features in terms of the HSV histogram h_b(F_s), and the texture and shape features in terms of the wavelet covariance signatures Σ_{F_s} (see [2] for more details).

Finally, dedicated supervised classification algorithms are exploited to determine content features [2].

3.2.4 Semantic processing

In this task the main objective is to discover the textual labels that best reflect image semantics, using NLP techniques and topic detection algorithms on the textual annotations coming from the considered image repositories. As far as Flickr is concerned, images usually have three main pieces of attached information: i) a title, ii) a content description and iii) a set of keywords, namely tags.

Titles in the majority of cases contain text that summarizes the content of the image, while in other cases they consist of automatically generated text that is not useful in the indexing process. Descriptions are very short and usually are not posted for retrieval purposes: they typically contain sentences concerning the context of the picture, or the user's opinion. Finally, tags are simple keywords that users are asked to submit (although they may not insert any tag) and that should describe the context of the image (e.g. among the tags for a picture of an "elephant in an African landscape", you will probably see the words 'elephant', 'Africa' and 'landscape').

The simple use of tags does not improve the efficiency of indexing and searching contents. The absence of restrictions on the vocabulary from which tags are chosen can easily lead to the presence of synonyms (multiple tags for the same concept), homonyms (the same tag used with different meanings), and polysemies (the same tag with multiple related meanings). Inaccurate or irrelevant tags also result from the so-called 'meta-noise', e.g. lack of stemming (normalization of word inflections), and from the heterogeneity of users and contexts: hence an effective use of the tags requires them to be stemmed, disambiguated, and suitably selected.

To this end, information coming from tags can be usefully analyzed in combination with titles and descriptions by suitable NLP techniques that overcome the linguistic and semantic heterogeneity of such information, in order to extract a set of relevant keywords that more effectively represent the image content.

In particular, the semantic processing applied to the textual data attached to a given image can be decomposed into a set of sequential sub-tasks [13]: meta-noise and named entity filtering, linguistic normalization, part of speech tagging, tokenization, word sense disambiguation and topic extraction. Thus, the result of the semantic processing task is a set of labels (topics) extracted from the set of tags, titles and descriptions, each with an associated confidence value that represents the relative importance of the label with respect to the other ones in the annotations.

3.2.5 Ontology building

As previously discussed, the knowledge obtained in terms of images, low-level characteristics and labels is then merged and translated into a graph representing the image ontology.

In particular, in a first step, all images whose relevant labels are associated with a high confidence value to the high-level nodes corresponding to the taxonomy leaves are represented by dedicated low-level nodes; in addition, pairs of image nodes whose similarity (computed by means of the Information Path Matching algorithm [2]) is greater than a threshold are linked by an edge having the related similarity measure as its reliability degree.

In the next step, these images are clustered using a Balanced Expectation Maximization (BEM) algorithm [2] applied in the feature spaces defined by the Information Path descriptors, in order to determine, for the high-level nodes, the set of images that best represent the related concepts. Dedicated edges (representativeness relationships) link such nodes with the representatives of each cluster.

Finally, by means of a Learning Tag Relevance algorithm [4], the topics that are most relevant with respect to the content of the images belonging to the same cluster (winner topics) are promoted to high-level nodes of the image ontology. In particular, the tag relevance σ of a generic tag τ of the most significant image (centroid) of cluster C is computed by the following formula:

$$\sigma(\tau, C) = \left| \mathrm{idf}(\tau) \cdot \sum_{i=1}^{m} \frac{\mathrm{tf}(\tau, i) \cdot (a + 1)}{\mathrm{tf}(\tau) + a \cdot \left(1 - b + b \cdot \frac{U_i}{U}\right)} \right| \qquad (1)$$

where: tf(τ, i) is the term frequency of topic τ with respect to the topics of all images belonging to C; U_i and U are, respectively, the number of topics of the i-th image of C and the average number of tags related to all images belonging to C; idf(τ) is the inverse document frequency of τ in C. The winner topics, i.e. those whose relevance is greater than a threshold, are finally inserted as high-level nodes in the ontology and linked, on the one hand, to the image node that corresponds to the cluster centroid and, on the other hand, to those nodes whose semantic distance (i.e. Wu/Palmer) from the current topic is minimum. If possible, the new ontology edge is labeled with the type of semantic relationship (e.g. hypernym/hyponym, holonym/meronym, etc.).

Thus, the image ontology can be generated in an incremental way, in correspondence with the pick-up operations from the Flickr repository.

4. THE SYSTEM ARCHITECTURE

The system architecture that supports the image ontology building process is shown in figure 2. Through a dedicated graphical interface, the user generates an OWL file encoding the initial taxonomy, which contains the relevant concepts of the considered domain. This file is then the input of the Information Fetching module, which downloads images and the related annotations from the Multimedia Repository, using as search keywords the concepts related to the leaf nodes of the taxonomy, together with some filters on users.

Figure 2: System Architecture

A Storage Engine module receives this information and stores the image annotations (title, description, author, tags, labels, etc.) in a dedicated RDF Database, and the raw data together with the low-level characteristics in an Image Database. Each image is then identified in these databases by a URI (Uniform Resource Identifier).

Finally, the Information Extraction and Information Processor modules analyze both the high-level information stored in the RDF database and the low-level information contained in the Image database, in order to generate or update, by means of the Ontology Manager and the Clustering Manager and according to the described process, a graph that represents the final multimedia ontology.

As far as implementation issues are concerned, we note that: (i) the initial taxonomy is generated by a Java desktop application that uses the Protégé API; (ii) Flickr has been chosen as the multimedia repository; (iii) the Information Fetching module has been implemented as a Java application that exploits the Flickr API; (iv) the RDF and Image Databases have been realized with Sesame and the PostgreSQL DBMS, respectively; (v) the Information Fetching and Indexing packages have been implemented as dedicated Java packages.

5. A CASE STUDY

This section describes a case study for our image ontology building process. In particular, we have built an ontology pertinent to Capri, a wonderful Italian island off the Sorrentine Peninsula, on the south side of the Gulf of Naples. A set of experts on the natural and cultural attractions of Capri provided, as the initial taxonomy, a graph containing the most relevant concepts in terms of high-level nodes for the considered domain. We used Flickr [7] as the multimedia repository of annotated images. Flickr is one of the most popular web-based tagging systems, which allow human participants to annotate a particular resource, such as web pages, blogs or images, with a freely chosen set of keywords, or tags, together with a short description of the content. This kind of system has recently been termed a folksonomy [5], i.e. a folk taxonomy of important and emerging concepts within user groups. The dynamic nature of these repositories ensures the richness of the annotations; in addition, the annotations are quite accurate, because they are produced by humans who want to share their images and the experience they have had, using tags and an annotation process.

The Flickr repository has been queried using as search keywords the logical AND between the concepts reported in the leaf nodes of the taxonomy and the one corresponding to the root node, and exploiting some filters on user ids, in order to retrieve images that really belong to the domain. Each retrieved image undergoes a content-based analysis to determine its low-level description, i.e. the IP (Information Path) and content features. Moreover, in a first step we estimated the similarity between each pair of different images by comparing their IPs by means of the Information Path Matching algorithm [2].

All images belonging to the same concept are then clustered into different groups, each containing images that are more similar to one another. As the clustering procedure we used the BEM algorithm [2], which is recursively invoked to dynamically determine the best-fitting clusters without knowing a priori the number of clusters (which is usually proportional to the number of images related to the current concept). Then, for each cluster, we selected as the representative image the one closest to all the other images of the cluster, and a suitable representation probability is associated with each representative image on the basis of minimum and average distances.

The process is iterated for each taxonomy leaf concept and the ontology is incrementally built: images belonging to different topics can be linked on the basis of their similarity values, which allows the multimedia knowledge to be merged into a single graph. Thus, the most relevant tags are propagated in the ontology and linked to the other nodes.

We report in figure 3 a complete step-by-step example of the generation of the Capri ontology.

Figure 3: Building of the Capri Ontology

6. CONCLUSION

In this paper we have addressed the problem of building a multimedia ontology in an automatic way using annotated image repositories. Our work differs from the previous papers presented in the literature for several reasons. First, we propose a notion of multimedia ontology, described by means of a graph and particularly suitable for managing the different levels of semantics of images. In addition, we obtain a dynamic generation of image ontologies using tags and annotations already produced by users in their social web networks.

Future work will be devoted to producing experimental results to evaluate the effectiveness of the produced ontologies with respect to other approaches by means of different criteria: class match measure, density measure, semantic similarity measure, betweenness measure.

7. REFERENCES

[1] R. Arndt, R. Troncy, S. Staab, and L. Hardman. Adding formal semantics to MPEG-7: Designing a well-founded multimedia ontology for the web. Technical report, University of Koblenz, 2007.
[2] G. Boccignone, A. Chianese, V. Moscato, and A. Picariello. Context-sensitive queries for image retrieval in digital libraries. Journal of Intelligent Information Systems, 31(1), 2008.
[3] Y. Chang and H. Chen. Approaches of using a word-image ontology and an annotated image corpus as intermedia for cross-language image retrieval. In Proceedings of Cross-Language Evaluation Forum, 2006.
[4] S. Golder and A. Hubemann. Usage patterns of collaborative tagging systems. Information Science, 2006.
[5] L. Kennedy, M. Naaman, S. Ahern, R. Nair, and T. Rattenbury. How Flickr helps us make sense of the world: context and content in community-contributed media collections. ACM Multimedia, 2007.
[6] C. Lee, Z. Jian, and L. Huang. A fuzzy ontology and its application to news summarization. IEEE Transactions on Systems, Man and Cybernetics, 35:859–880, 2005.
[7] K. Lerman and L. Jones. Social browsing on Flickr. CoRR, abs/cs/0612047, 2006.
[8] R. Maller and B. Neumann. Ontology-based reasoning techniques for multimedia interpretation and retrieval. In Springer, editor, Semantic Multimedia and Ontologies, pages 55–98. Springer London, 2008.
[9] B. Manjunath, P. Salembier, and T. Sikora. Introduction to MPEG-7: Multimedia Content Description Interface. 2002.
[10] C. Masolo et al. The WonderWeb library of foundational ontologies (WFOL). Technical report, WonderWeb Deliverable 17, 2002.
[11] J. V. Ossenbruggen, F. Nack, and L. Hardman. That obscure object of desire: Multimedia metadata on the web, part 2. IEEE Multimedia, 12:54–63, 2005.
[12] G. Stamou et al. Multimedia annotations on the semantic web. Multimedia, IEEE, 13:86–90, 2006.
[13] D. Trieschnigg and W. Kraaij. TNO hierarchical topic detection report at TDT 2004. Proceedings of Corpus Linguistics 2005, 99(7):1–8, 2004.
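
As an illustration of Definition 1 (Section 3.1) and of the first step of the ontology building task (Section 3.2.5), the following is a minimal Python sketch of a directed, labeled graph (V, E, ρ) in which image nodes are linked by similarity edges whenever their similarity exceeds a threshold. It is not the MOWIS implementation, which the paper describes as Java-based: the names `Node`, `ImageOntology` and `link_similar_images`, the high-to-low direction chosen for representativeness edges, the toy similarity values and the 0.5 threshold are all assumptions made for the example, and the similarity function merely stands in for the Information Path Matching algorithm of [2].

```python
from dataclasses import dataclass, field
from itertools import combinations

# Allowed (source level, target level) per relationship kind. The direction
# high -> low for "representativeness" is an assumption of this sketch.
ALLOWED = {
    "similarity": ("low", "low"),
    "representativeness": ("high", "low"),
    "semantic": ("high", "high"),
}


@dataclass(frozen=True)
class Node:
    name: str
    level: str  # "low" (an image) or "high" (a concept)


@dataclass
class ImageOntology:
    nodes: set = field(default_factory=set)
    # rho: maps an edge (src, dst) to (relationship label, reliability degree)
    edges: dict = field(default_factory=dict)

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, src, dst, label, reliability):
        if src not in self.nodes or dst not in self.nodes:
            raise ValueError("both endpoints must belong to V")
        if ALLOWED.get(label) != (src.level, dst.level):
            raise ValueError(f"'{label}' cannot link {src.level} -> {dst.level}")
        if not 0.0 <= reliability <= 1.0:
            raise ValueError("reliability degree must lie in [0, 1]")
        self.edges[(src, dst)] = (label, reliability)


def link_similar_images(onto, similarity, threshold):
    """Link every pair of image nodes whose similarity exceeds the threshold."""
    images = sorted((n for n in onto.nodes if n.level == "low"),
                    key=lambda n: n.name)
    for a, b in combinations(images, 2):
        s = similarity(a, b)
        if s > threshold:
            # Similarity is symmetric, so both directed edges are stored.
            onto.add_edge(a, b, "similarity", s)
            onto.add_edge(b, a, "similarity", s)


# Toy usage with hypothetical similarity values.
onto = ImageOntology()
capri = Node("Capri", "high")
img1, img2, img3 = (Node(f"img{i}", "low") for i in (1, 2, 3))
for n in (capri, img1, img2, img3):
    onto.add_node(n)

toy = {("img1", "img2"): 0.9, ("img1", "img3"): 0.2, ("img2", "img3"): 0.4}
link_similar_images(onto, lambda a, b: toy[(a.name, b.name)], threshold=0.5)
onto.add_edge(capri, img1, "representativeness", 0.8)
```

Storing ρ as a dictionary keyed by node pairs keeps E and ρ in a single structure, and the level check rejects edges that do not match one of the three relationship types of the model.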