<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology Based Semantic Image Interpretation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ivan Donadello</string-name>
          <email>donadello@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISI, University of Trento</institution>
          ,
          <addr-line>Via Sommarive 9, I-38123, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, I-38123, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic image interpretation (SII) leverages Semantic Web ontologies for generating a mathematical structure that describes the content of images. Current SII algorithms consider the ontologies only in a late phase of the SII process, to enrich these structures. In this research proposal we study a well-founded framework that combines logical knowledge with low-level image features in the early phase of SII. The image content is represented as a partial model of an ontology. Each element of the partial model is grounded to a set of segments of the image. Moreover, we propose an approximate algorithm that searches for the most plausible partial model. The comparison of our method with a knowledge-blind baseline shows that the use of ontologies significantly improves the results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Semantic image interpretation (SII) is the task of generating a semantically rich
structure that describes the content of an image [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This structure is both human and
machine understandable and can be encoded using the Semantic Web (SW) language
RDF. RDF brings two advantages: it enables the enrichment of the semantic content of
images with SW resources, and an RDF-based description of images
enables content-based image retrieval via query languages such as SPARQL.
      </p>
      <p>
        The main challenge in SII is bridging the so-called semantic gap [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is the
complex correlation between low-level image features and high-level semantic
concepts. High-level knowledge plays a key role in bridging the semantic gap [
        <xref ref-type="bibr" rid="ref17 ref18">17,18</xref>
        ]. This
knowledge can be found in the ontologies provided by the SW.
      </p>
      <p>
        Most of the current approaches to SII exploit ontologies at a later stage, when some
hypotheses (a geometric description of the objects and their spatial relations) about the
image content have already been formulated by a bottom-up approach (see for instance
[
        <xref ref-type="bibr" rid="ref1 ref11 ref12 ref13 ref15 ref17 ref3 ref6">13,15,17,11,12,3,6,1</xref>
        ]). In these cases background knowledge is exploited to check the
consistency of the output and/or to infer new facts. These works either do not consider the
uncertainty coming from the low-level image analysis or require a manually crafted set of
DL rules defining what is abducible.
      </p>
      <p>
        In this research proposal we study a general framework for SII that allows the
integration of ontologies with low-level image features. The framework takes as input
the ontology and exploits it throughout the process of image interpretation. The output is a
description of the content of an image in terms of a (most plausible) partial logical model
of the ontology [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Instead of lifting low-level features up into a logical form using
concrete domains (as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) we proceed in the opposite direction, by compiling the
background knowledge down into low-level features. This allows more flexible
inference in processing numeric information and lets us use simpler, and more efficient, logical
reasoners for the semantic part. The partial model is generated by using optimisation
methods (e.g. clustering) that integrate numeric and logical information. Our
contribution is a formal framework for SII that integrates low-level features and logical axioms.
Moreover, we developed an early prototype and evaluated it, with promising results,
on the task of detecting complex objects starting from the presence of their parts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. (I thank my advisor Luciano Serafini for his precious help, suggestions and patience.)
      </p>
    </sec>
    <sec id="sec-2">
      <title>Theoretical framework</title>
      <p>
        The proposed framework takes as input a labelled picture, that is, a picture partitioned
into segments (regions of pixels) by a semantic segmentation algorithm [
        <xref ref-type="bibr" rid="ref4 ref7">4,7</xref>
        ]. Each
segment has a set of weighted labels that represent the level of confidence of the
semantic segmentation. Labels are taken from the signature Σ, the alphabet of
the ontology. A labelled picture is a pair P = ⟨S, L⟩ where S = {s₁, …, sₙ} is a set of
segments of the picture P, and L is a function that associates to each segment s ∈ S a
set L(s) of weighted labels ⟨l, w⟩, with l ∈ Σ and w ∈ (0, 1].
      </p>
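      <p>The pair P = ⟨S, L⟩ maps directly onto a small data structure. Below is a minimal Python sketch (our own illustration, not the authors' implementation); the Segment class, its bounding-box representation and the example labels are assumptions made only for this example:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """A region of pixels, reduced here to an id and a bounding box."""
    sid: str
    bbox: tuple  # (x_min, y_min, x_max, y_max)

# P = (S, L): a set of segments S and a labelling function L that attaches
# to each segment a set of weighted labels (l, w) with w in (0, 1].
segments = [Segment("s1", (10, 10, 40, 60)), Segment("s2", (42, 12, 70, 58))]
L = {
    "s1": {("Leg", 0.9), ("Arm", 0.3)},
    "s2": {("Leg", 0.8)},
}

def labels(segment):
    """L(s): the weighted labels attached to segment s."""
    return L[segment.sid]
```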
      <p>
        In this research proposal we study a method for discovering new objects (e.g.,
composite objects) and relations between objects by exploiting low-level image features
and a Description Logic (DL) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] ontology. The ontology has the classical signature
Σ = C ⊎ R ⊎ I of symbols for concepts, relations and individuals respectively.
We adopt the standard definitions for syntax and semantics of DL³. An ontology O on
Σ is a set of DL axioms. An interpretation of a DL signature Σ is a pair I = ⟨ΔI, ·I⟩,
where ΔI is a non-empty set and ·I is a function that interprets the symbols of Σ in
ΔI. I is a model of an ontology O if it satisfies all the axioms in O. The axioms of
the ontology are constraints on the states of the world. A picture, however, provides
only a partial view of the state of the world; indeed, it could show a person with only
one (visible) leg. Therefore, the content of a picture is not isomorphic to a model, as a
model could contain objects not appearing in the picture (the invisible leg). The content
of a picture should instead be represented as a partial model⁴.</p>
      <p>Definition 1 (Partial model). Let I and I′ be two interpretations of the signatures
Σ and Σ′ respectively, with Σ ⊆ Σ′. I′ is an extension of I, or equivalently I′ extends
I, if ΔI ⊆ ΔI′, aI = aI′, CI = CI′ ∩ ΔI and RI = RI′ ∩ (ΔI × ΔI), for all a ∈ I,
C ∈ C and R ∈ R. Ip is a partial model for an ontology O, in symbols Ip ⊨p O, if
there is a model I of O that extends Ip.</p>
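      <p>On finite interpretations the extension conditions of Definition 1 are directly checkable. A minimal sketch, assuming interpretations are represented as plain Python dicts (a representation we choose only for illustration):</p>

```python
def extends(I, I2):
    """Check that interpretation I2 extends I (Definition 1).

    An interpretation is a dict with keys:
      'domain'    : the set Delta of individuals;
      'consts'    : individual name mapped to an element of Delta;
      'concepts'  : concept name mapped to its extension (a set);
      'relations' : relation name mapped to a set of pairs.
    """
    D, D2 = I["domain"], I2["domain"]
    if not D.issubset(D2):
        return False
    # every individual name denotes the same element in both interpretations
    if any(I2["consts"].get(a) != d for a, d in I["consts"].items()):
        return False
    # each concept extension, restricted to Delta, must coincide
    if any(I2["concepts"].get(C, set()) & D != ext
           for C, ext in I["concepts"].items()):
        return False
    # each relation extension, restricted to Delta x Delta, must coincide
    return all({(d, e) for d, e in I2["relations"].get(R, set())
                if d in D and e in D} == ext
               for R, ext in I["relations"].items())
```

      <p>Under this check, Ip ⊨p O holds exactly when some model of O extends Ip.</p>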
      <p>
        In this framework the use of DL ontologies is twofold: first, they are a terminological
source for labelled pictures; second, the DL inference services are exploited to check
whether an interpretation is a partial model and thus to infer new facts. The semantic
interpretation of a picture is a partial model plus an alignment, called grounding, of every
element of Ip with the segments of the picture.
³ In this paper we use the SHIQ DL.
⁴ This intuition was introduced in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; our formalization, however, is slightly different.
      </p>
      <sec id="sec-2-1">
        <title>Definition 2 (Semantically interpreted picture)</title>
        <p>Given an ontology O with signature Σ
and a labelled picture P = ⟨S, L⟩, a semantically interpreted picture is a triple
S = ⟨P, Ip, G⟩O where:
– Ip = ⟨ΔIp, ·Ip⟩ is a partial model of O;
– G ⊆ ΔIp × S is a left-total and surjective relation called the grounding relation:
if ⟨d, s⟩ ∈ G then there exists an l ∈ L(s) such that:
1. if l ∈ C then d ∈ lIp;
2. if l ∈ I then d = lIp;
3. if l ∈ R then ⟨d, d′⟩ ∈ lIp or ⟨d′, d⟩ ∈ lIp for some d′ ∈ ΔIp.</p>
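        <p>The structural conditions on G (left-totality and surjectivity) can be sketched in a few lines; here the relation is assumed, for illustration only, to be a set of ⟨d, s⟩ pairs over plain identifiers:</p>

```python
def is_grounding(G, domain, segments):
    """Structural side of Definition 2: G must be left-total (every element
    of the domain is grounded) and surjective (every segment is used)."""
    left_total = {d for d, _ in G} == set(domain)
    surjective = {s for _, s in G} == set(segments)
    return left_total and surjective

def grounding_of(G, d):
    """G(d): the set of segments grounding the element d."""
    return {s for e, s in G if e == d}
```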
        <sec id="sec-2-1-2">
          <p>The grounding of every d ∈ ΔIp, denoted by G(d), is the set {s ∈ S | ⟨d, s⟩ ∈ G}.</p>
          <p>There are many possible explanations of the picture content, thus there are many
partial models describing a picture via a grounding relation. We define a cost function 𝒞
that assigns a cost to a partial model based on its adherence to the image content:
the higher the adherence, the lower the cost. The most plausible partial model Ip* is the
partial model that minimizes 𝒞, in symbols:</p>
          <p>Ip* = argmin_{Ip ⊨p O, G ⊆ ΔIp × S} 𝒞(⟨P, Ip, G⟩O)   (1)</p>
          <p>The definition of 𝒞 has to take into account the low-level features of the segments and the
high-level semantic features of the partial model derivable from the ontology. Intuitively, the
cost function measures the semantic gap between the two types of features.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Definition 3 (Semantic image interpretation problem)</title>
        <p>Given an ontology O, a
labelled picture P and a cost function 𝒞, the semantic image interpretation problem is the
construction of a semantically interpreted picture S = ⟨P, Ip, G⟩O that minimizes 𝒞.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>In this proposal we restrict ourselves to the recognition of complex objects from their parts. For
example, given a labelled picture where only some parts of a man (the legs, one arm and
the head) and of a horse (the legs, the muzzle and the tail) are labelled, we want to infer
the presence of some logical individuals with their classes (man and horse respectively).
These individuals are linked with their parts through the partOf relation. This can be
seen as a clustering problem and we specify the cost function in terms of clustering
optimisation. The parts (simple objects) are the input of the clustering problem, whereas
a single cluster contains the parts of a composite object. The parts to cluster
are the individuals d ∈ ΔIp with the following features:
– a set of low-level image features extracted from G(d), the grounding of d;
– a set of semantic features corresponding to the most specific concepts, extracted from
the set {C ∈ C | d ∈ CIp}, assigned to d by Ip.</p>
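      <p>As an illustration of the two feature kinds, the sketch below computes a centroid from bounding boxes and filters a concept set down to its most specific members. The bounding-box representation and the `subsumptions` pair-set are assumptions made for this example, not the paper's data model:</p>

```python
def centroid(bboxes):
    """Numeric feature: centroid of a part's grounded segments, here
    approximated by averaging the bounding-box centres."""
    xs = [(b[0] + b[2]) / 2 for b in bboxes]
    ys = [(b[1] + b[3]) / 2 for b in bboxes]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def most_specific(concepts, subsumptions):
    """Semantic feature: drop every concept that strictly subsumes another
    held concept; subsumptions is a set of (sub, super) pairs from O."""
    return {C for C in concepts
            if not any((D, C) in subsumptions and D != C for D in concepts)}
```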
      <p>We use the centroid of G(d) as a numeric feature, but the approach can be generalised
to other features. Clustering algorithms are based on some distance between the input
elements defined in terms of their features. Let δG(d, d′) be the Euclidean distance between the
centroids of G(d) and G(d′), δOs(d, d′) a semantic distance between simple objects and
δOc(d, d′) a semantic distance between a simple object and its corresponding composite
object. We define the cost function as the quality measure of the clustering:</p>
      <p>𝒞(⟨P, Ip, G⟩O) = ( Σ_{d,d′ ∈ (∃hasPart.⊤)Ip} 1/δG(d, d′) ) +
Σ_{⟨d′,d⟩,⟨d″,d⟩ ∈ partOfIp} (δG(d′, d″) + δOs(d′, d″)) +
Σ_{⟨d′,d⟩ ∈ partOfIp} (δG(d′, d) + δOc(d′, d)).</p>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the first component of the above equation measures the centroid distance
between the composite objects (inter-cluster distance). The second component estimates
the distance between the elements of each single cluster (intra-cluster distance).
      </p>
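      <p>A hedged sketch of this quality measure (our reading of the equation above, not the authors' code): the inter-cluster term charges nearby composites via reciprocal centroid distances, while the intra-cluster term sums geometric plus semantic distances among siblings and between each part and its whole. The dict-based inputs and the distance callbacks are illustrative assumptions:</p>

```python
import math

def dist(p, q):
    """Euclidean distance between two centroids."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cost(composites, parts_of, pos, d_os, d_oc):
    """Clustering quality measure.

    composites : list of composite-object ids
    parts_of   : dict mapping a composite to the list of its part ids
    pos        : dict mapping each object id to its centroid (x, y)
    d_os, d_oc : semantic distances (sibling pairs; part and whole)
    """
    # inter-cluster term: far-apart composites contribute little cost
    inter = sum(1.0 / dist(pos[d], pos[e])
                for i, d in enumerate(composites)
                for e in composites[i + 1:])
    # intra-cluster term: distances inside each cluster
    intra = 0.0
    for whole, parts in parts_of.items():
        for i, p in enumerate(parts):
            for q in parts[i + 1:]:                       # sibling pairs
                intra += dist(pos[p], pos[q]) + d_os(p, q)
            intra += dist(pos[p], pos[whole]) + d_oc(p, whole)  # part-whole
    return inter + intra
```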
      <p>
        Minimising the above equation analytically is rather complex, thus we developed
an iterative algorithm that at each loop groups the parts of a composite object,
approximating the cost function. If the grouping is not a partial model, the algorithm
enters the next loop and selects another clustering. In the first step our algorithm
generates an initial partial model Ip from P = ⟨S, L⟩, where ΔIp contains an element
ds for every segment s ∈ S and any concept C in the labelled picture is interpreted as
CIp = {ds | C ∈ L(s)}. The grounding G is the set of pairs ⟨ds, s⟩. Then, the algorithm
enters a loop where a non-parametric clustering procedure [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] clusters the input
elements d ∈ ΔIp by using their numeric and semantic features according to δG and
δOs. Each cluster cl corresponds to a composite object dcl which is introduced in ΔIp
and is connected via the hasPart relation to the elements of cl. We predict the type
of this new individual via abductive reasoning: the type is the ontology concept that
shares the maximum number of parts with the elements of the cluster. For example, if
we cluster some elements of type Tail, Muzzle and Arm, an abduced ontology concept
will be Horse. These new facts are introduced in Ip and the algorithm checks whether Ip is a
partial model of O by using a DL reasoner (Pellet [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]). If so, the algorithm returns Ip;
otherwise it extends the input elements with a set of consistency features that encode
information about the inconsistency of Ip. These features tend to separate (resp. join)
the segments that have been joined (resp. separated) in the previous clustering. The
cluster of our example is inconsistent because a horse does not have arms. Then the
algorithm returns to the beginning of the loop.
      </p>
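      <p>The abduction step of the loop can be sketched as a simple overlap count. Here `part_model`, mapping each composite concept to its expected part concepts, and the `consistent` callback standing in for the DL reasoner are both illustrative assumptions, not the paper's implementation:</p>

```python
def abduce_type(cluster_labels, part_model):
    """Abduce the composite concept sharing the most parts with a cluster.
    part_model maps each composite concept to its expected part concepts."""
    return max(part_model,
               key=lambda C: len(part_model[C] & set(cluster_labels)))

def interpret(clusters, part_model, consistent):
    """One round of the loop: type each cluster by abduction, then keep the
    result only if the consistency check (a stand-in for the DL reasoner)
    accepts it; otherwise signal that another iteration is needed."""
    typing = {cid: abduce_type(lbls, part_model)
              for cid, lbls in clusters.items()}
    return typing if consistent(typing) else None
```

      <p>On the paper's example, a cluster {Tail, Muzzle, Arm} abduces Horse, which the consistency check would then reject.</p>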
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        To evaluate our approach we created, by using LabelMe [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a dataset of 204 labelled
pictures. For each picture we manually annotated simple objects, composite objects and
their part-whole relations⁵. We also created a simple ontology⁶ with a basic
formalisation of meronymy in the domains of houses, trees, people, and street vehicles. We built
a ground truth by associating every labelled picture P with its partial model,
encoded in an ABox AP. The partial model returned by our algorithm is encoded in the
ABox A′P; in order to compare A′P with AP we define the following two measures.
      </p>
      <p>Grouping (GRP): this measure expresses how good our algorithm is at grouping
parts of the same composite object. We define precision, recall and F1 measure on the
set of siblings (the parts of the same composite object): sibl(A) = {⟨d, d′⟩ | ∃d″ :
partOf(d, d″), partOf(d′, d″) ∈ A}. Thus:
precGRP(P) = |sibl(A′P) ∩ sibl(AP)| / |sibl(A′P)|
recGRP(P) = |sibl(A′P) ∩ sibl(AP)| / |sibl(AP)|</p>
      <p>Complex-object prediction (COP): this measure expresses how good our
algorithm is at predicting the type of the composite object. We define precision, recall and
F1 measure on the types of the composite object each part is assigned to: ptype(A) =
{⟨d, C⟩ | ∃d′ : {partOf(d, d′), C(d′)} ⊆ A}. Thus:
precCOP(P) = |ptype(A′P) ∩ ptype(AP)| / |ptype(A′P)|
recCOP(P) = |ptype(A′P) ∩ ptype(AP)| / |ptype(AP)|
To measure how the semantics improves the recognition of composite objects from their
parts we implemented a baseline that clusters without semantic features, see Table 1.
We can see that the explicit use of semantic knowledge via semantic distance, abductive
and deductive reasoning improves on the baseline that relies only on numeric features.</p>
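      <p>Both measures reduce to set-based precision and recall over tuples extracted from the ABoxes. A minimal sketch, with ABoxes represented as dicts of partOf facts (an encoding we choose for illustration):</p>

```python
def sibl(abox):
    """sibl(A): unordered pairs of parts that share a whole (partOf facts)."""
    by_whole = {}
    for part, whole in abox["partOf"]:
        by_whole.setdefault(whole, []).append(part)
    return {tuple(sorted((d, e))) for parts in by_whole.values()
            for d in parts for e in parts if d != e}

def prec_rec(predicted, gold):
    """Set-based precision and recall, used for both GRP and COP."""
    tp = len(predicted & gold)
    return tp / len(predicted), tp / len(gold)
```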
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We proposed a well-founded and general framework for SII that integrates the symbolic
information of an ontology with the low-level numeric features of a picture. An image
is interpreted as a (most plausible) partial model of an ontology, which allows querying
its semantic content. We applied the framework to the specific task of recognizing
composite objects from their parts. The evaluation shows good results, and the injection
of semantic knowledge improves the performance with respect to a semantically-blind
baseline. As future work, we want to extend our evaluation by using more low-level
features, by studying other relations and by using a semantic segmentation algorithm as
the source of labelled pictures.
⁵ An example of labelled picture is available at http://bit.ly/1DXZxic
⁶ The ontology is available at http://bit.ly/1AruGh0</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atif</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hudelot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Explanatory reasoning for image understanding using formal concept analysis and description logics</article-title>
          .
          <source>Systems, Man, and Cybernetics: Systems, IEEE Transactions on 44(5)</source>
          ,
          <fpage>552</fpage>
          -
          <lpage>570</lpage>
          (May
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F</given-names>
          </string-name>
          . (eds.):
          <article-title>The Description Logic Handbook: Theory, Implementation, and Applications</article-title>
          . Cambridge University Press, New York, NY, USA (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bannour</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hudelot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards ontologies for image interpretation and annotation</article-title>
          . In: Martinez,
          <string-name>
            <surname>J.M.</surname>
          </string-name>
          (ed.) 9th
          <source>International Workshop on Content-Based Multimedia Indexing</source>
          ,
          <string-name>
            <surname>CBMI</surname>
          </string-name>
          <year>2011</year>
          , Madrid, Spain, June 13-15,
          <year>2011</year>
          . pp.
          <fpage>211</fpage>
          -
          <lpage>216</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carreira</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caseiro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sminchisescu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semantic segmentation with secondorder pooling</article-title>
          . In: Fitzgibbon,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Lazebnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Schmid</surname>
          </string-name>
          , C. (eds.) Computer Vision - ECCV
          <year>2012</year>
          . LNCS, Springer Berlin Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Donadello</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serafini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Mixing low-level and semantic features for image interpretation</article-title>
          . In: Agapito,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Bronstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.M.</given-names>
            ,
            <surname>Rother</surname>
          </string-name>
          , C. (eds.) Computer Vision - ECCV
          <source>2014 Workshops. LNCS</source>
          , Springer International Publishing (
          <year>2014</year>
          ), best paper award.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Espinosa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Möller, R.:
          <article-title>Logical formalization of multimedia interpretation</article-title>
          . In: Paliouras,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Spyropoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Tsatsaronis</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . (eds.)
          <source>Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, Lecture Notes in Computer Science</source>
          , vol.
          <volume>6050</volume>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>133</lpage>
          . Springer Berlin Heidelberg (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gould</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , Y.:
          <article-title>Superpixel graph label transfer with learned distance metric</article-title>
          . In: Fleet,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Pajdla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Schiele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Tuytelaars</surname>
          </string-name>
          , T. (eds.) Computer Vision - ECCV
          <source>2014. Lecture Notes in Computer Science</source>
          , Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hudelot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maillot</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thonnat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Symbol grounding for semantic image interpretation: From image data to semantics</article-title>
          .
          <source>In: Proc. of the 10th IEEE Intl. Conf. on Computer Vision Workshops. ICCVW '05</source>
          , IEEE Computer Society (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>D.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drake</surname>
            ,
            <given-names>B.L.</given-names>
          </string-name>
          :
          <article-title>A decision criterion for the optimal number of clusters in hierarchical clustering</article-title>
          .
          <source>Journal of Global Optimization</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          ),
          <fpage>91</fpage>
          -
          <lpage>111</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kohonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>The self-organizing map</article-title>
          .
          <source>Proc. of the IEEE</source>
          <volume>78</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1464</fpage>
          -
          <lpage>1480</lpage>
          (
          <year>Sep 1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Möller, R.:
          <article-title>On scene interpretation with description logics</article-title>
          .
          <source>Image and Vision Computing</source>
          <volume>26</volume>
          (
          <issue>1</issue>
          ),
          <fpage>82</fpage>
          -
          <lpage>101</lpage>
          (
          <year>2008</year>
          ), cognitive Vision-Special Issue
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Peraldi</surname>
            ,
            <given-names>I.S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Möller</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Formalizing multimedia interpretation based on abduction over description logic aboxes</article-title>
          .
          <source>In: Proc. of the 22nd Intl. Workshop on Description Logics (DL</source>
          <year>2009</year>
          ).
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>477</volume>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Reiter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mackworth</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>A logical framework for depiction and image interpretation</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <fpage>125</fpage>
          -
          <lpage>155</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
          </string-name>
          , W.T.:
          <article-title>Labelme: A database and webbased tool for image annotation</article-title>
          .
          <source>Int. J. Comput. Vision</source>
          <volume>77</volume>
          (
          <issue>1-3</issue>
          ),
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          (May
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Schroder</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>On the logics of image interpretation: model-construction in a formal knowledge-representation framework</article-title>
          .
          <source>In: Image Processing</source>
          ,
          <year>1996</year>
          . Proceedings.,
          <source>Int. Conf. on. vol. 1</source>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>788</lpage>
          (
          <year>Sep 1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sirin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Pellet: A practical OWL-DL reasoner</article-title>
          .
          <source>Web Semant</source>
          .
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>51</fpage>
          -
          <lpage>53</lpage>
          (
          <year>Jun 2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Town</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Ontological inference for image and video analysis</article-title>
          .
          <source>Mach. Vision Appl</source>
          .
          <volume>17</volume>
          (
          <issue>2</issue>
          ),
          <fpage>94</fpage>
          -
          <lpage>115</lpage>
          (
          <year>Apr 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Yuille</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Frontiers in computer vision: NSF white paper</article-title>
          (
          <year>November 2010</year>
          ), http://www.frontiersincomputervision.com/WhitePaperInvite.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>