=Paper=
{{Paper
|id=Vol-1485/paper4
|storemode=property
|title=Ontology Based Semantic Image Interpretation
|pdfUrl=https://ceur-ws.org/Vol-1485/paper4.pdf
|volume=Vol-1485
|dblpUrl=https://dblp.org/rec/conf/aiia/Donadello15
}}
==Ontology Based Semantic Image Interpretation==
Ivan Donadello (1,2) *

(1) Fondazione Bruno Kessler, Via Sommarive 18, I-38123, Trento, Italy
(2) DISI, University of Trento, Via Sommarive 9, I-38123, Trento, Italy
donadello@fbk.eu

Abstract. Semantic image interpretation (SII) leverages Semantic Web ontologies for generating a mathematical structure that describes the content of images. SII algorithms consider the ontologies only in a late phase of the SII process to enrich these structures. In this research proposal we study a well-founded framework that combines logical knowledge with low-level image features in the early phase of SII. The image content is represented with a partial model of an ontology. Each element of the partial model is grounded to a set of segments of the image. Moreover, we propose an approximate algorithm that searches for the most plausible partial model. The comparison of our method with a knowledge-blind baseline shows that the use of ontologies significantly improves the results.

1 Introduction

Semantic image interpretation (SII) is the task of generating a semantically rich structure that describes the content of an image [8]. This structure is both human and machine understandable and can be encoded by using the Semantic Web (SW) language RDF. The first advantage is that RDF enables the enrichment of the semantic content of images with SW resources; the second is that an RDF based description of images enables content-based image retrieval via query languages like SPARQL.

The main challenge in SII is bridging the so-called semantic gap [3], which is the complex correlation between low-level image features and high-level semantic concepts. High-level knowledge plays a key role in bridging the semantic gap [17,18]. This knowledge can be found in the ontologies provided by the SW.
Most of the current approaches to SII exploit ontologies at a later stage, when some hypotheses about the image content (a geometric description of the objects and their spatial relations) have already been formulated by a bottom-up approach (see for instance [13,15,17,11,12,3,6,1]). In these cases background knowledge is exploited to check the consistency of the output and/or to infer new facts. These works either do not consider the uncertainty coming from the low-level image analysis or require a set of DL rules, which need to be manually crafted, for defining what is abducible.

In this research proposal we study a general framework for SII that allows the integration of ontologies with low-level image features. The framework takes as input the ontology and exploits it in the process of image interpretation. The output is a description of the content of an image in terms of a (most plausible) partial logical model of the ontology [15]. Instead of lifting low-level features into a logical form using concrete domains (as in [11]), we proceed in the opposite direction, by compiling down the background knowledge into low-level features. This allows more flexible inference in processing numeric information, and lets us use simpler, and more efficient, logical reasoners for the semantic part. The partial model is generated by using optimisation methods (e.g. clustering) that integrate numeric and logical information. Our contribution is a formal framework for SII that integrates low-level features and logical axioms. Moreover, we developed an early prototype and evaluated it, with promising results, on the task of detecting complex objects starting from the presence of their parts [5].

* I thank my advisor Luciano Serafini for his precious help, suggestions and patience.

2 Theoretical framework

The proposed framework takes as input a labelled picture, that is, a picture partitioned into segments (regions of pixels) by a semantic segmentation algorithm [4,7].
Each segment has a set of weighted labels that represent the level of confidence of the semantic segmentation. Labels are taken from the signature Σ, the alphabet of the ontology. A labelled picture is a pair P = ⟨S, L⟩ where S = {s1, ..., sn} is a set of segments of the picture P, and L is a function that associates to each segment s ∈ S a set L(s) of weighted labels ⟨l, w⟩ ∈ Σ × (0, 1].

In this research proposal we study a method for discovering new objects (e.g., composite objects) and relations between objects by exploiting low-level image features and a Description Logic (DL) [2] ontology. The ontology has the classical signature Σ = Σ_C ⊎ Σ_R ⊎ Σ_I of symbols for concepts, relations and individuals respectively. We adopt the standard definitions for the syntax and semantics of DL³. An ontology O on Σ is a set of DL axioms. An interpretation of a DL signature Σ is a pair I = ⟨Δ^I, ·^I⟩, where Δ^I is a non-empty set and ·^I is a function that interprets the symbols of Σ in Δ^I. I is a model of an ontology O if it satisfies all the axioms in O.

The axioms of the ontology are constraints on the states of the world. A picture, however, provides only a partial view of the state of the world: it could, for example, show a person with only one (visible) leg. Therefore, the content of a picture is not isomorphic to a model, as a model could contain objects not appearing in the picture (the invisible leg). The content of a picture should instead be represented as a partial model⁴.

Definition 1 (Partial model). Let I and I′ be two interpretations of the signatures Σ and Σ′ respectively, with Σ ⊆ Σ′; I′ is an extension of I (or, equivalently, I′ extends I) if Δ^I ⊆ Δ^{I′}, a^I = a^{I′}, C^I = C^{I′} ∩ Δ^I and R^I = R^{I′} ∩ (Δ^I × Δ^I), for all a ∈ Σ_I, C ∈ Σ_C and R ∈ Σ_R. I_p is a partial model of an ontology O, in symbols I_p ⊨_p O, if there is a model I of O that extends I_p.
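To make the labelled-picture input above concrete, the pair P = ⟨S, L⟩ can be represented directly as plain data. The following sketch uses invented segment identifiers, labels and weights; the real input would come from a semantic segmentation algorithm.

```python
# A labelled picture P = <S, L>: segments S and a labelling function L that
# maps each segment to a set of weighted labels <l, w>, with l drawn from the
# ontology signature and 0 < w <= 1 (the segmentation confidence).
# All names and values below are illustrative, not from the paper's dataset.
SIGMA_C = {"Leg", "Muzzle", "Tail", "Horse"}   # concept symbols in Sigma_C

segments = ["s1", "s2", "s3"]                  # S: segment identifiers
labelling = {                                  # L: segment -> {(label, weight)}
    "s1": {("Leg", 0.9)},
    "s2": {("Muzzle", 0.8), ("Tail", 0.4)},    # ambiguous segment, two labels
    "s3": {("Tail", 0.7)},
}

def check_labelled_picture(S, L, sigma):
    """Validate P = <S, L>: L is total on S, every label belongs to the
    signature, and every weight lies in (0, 1]."""
    return all(
        s in L and all(l in sigma and 0.0 < w <= 1.0 for l, w in L[s])
        for s in S
    )
```

A segment may carry several competing labels, as `s2` does here; the interpretation step decides which reading to keep.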
In this framework the use of DL ontologies is twofold: first, they are a terminological source for labelled pictures; second, the DL inference services are exploited to check whether an interpretation is a partial model and thus to infer new facts. The semantic interpretation of a picture is a partial model plus an alignment, called grounding, of every element of Δ^{I_p} with the segments of the picture.

³ In this paper we use the SHIQ DL.
⁴ This intuition was introduced in [15]; our formalization, however, is slightly different.

Definition 2 (Semantically interpreted picture). Given an ontology O with signature Σ and a labelled picture P = ⟨S, L⟩, a semantically interpreted picture is a triple S = (P, I_p, G)_O where:

– I_p = ⟨Δ^{I_p}, ·^{I_p}⟩ is a partial model of O;
– G ⊆ Δ^{I_p} × S is a left-total and surjective relation, called the grounding relation: if ⟨d, s⟩ ∈ G then there exists an l ∈ L(s) such that:
  1. if l ∈ Σ_C then d ∈ l^{I_p};
  2. if l ∈ Σ_I then d = l^{I_p};
  3. if l ∈ Σ_R then ⟨d, d′⟩ ∈ l^{I_p} or ⟨d′, d⟩ ∈ l^{I_p} for some d′ ∈ Δ^{I_p}.

The grounding of every d ∈ Δ^{I_p}, denoted by G(d), is the set {s ∈ S | ⟨d, s⟩ ∈ G}.

There are many possible explanations of the picture content, thus there are many partial models describing a picture via a grounding relation. We define a cost function S that assigns a cost to a partial model based on its adherence to the image content: the higher the adherence, the lower the cost. The most plausible partial model I_p* is the partial model that minimizes S, in symbols:

  I_p* = argmin_{I_p ⊨_p O, G ⊆ Δ^{I_p} × S} S(P, I_p, G)_O    (1)

The definition of S has to take into account low-level features of the segments and high-level semantic features of the partial model derivable from the ontology. Intuitively, the cost function measures the semantic gap between the two types of features.

Definition 3 (Semantic image interpretation problem).
Given an ontology O, a labelled picture P and a cost function S, the semantic image interpretation problem is the construction of a semantically interpreted picture S = (P, I_p, G)_O that minimizes S.

3 Method

In this proposal we restrict ourselves to the recognition of complex objects from their parts. For example, given a labelled picture where only some parts of a man (the legs, one arm and the head) and of a horse (the legs, the muzzle and the tail) are labelled, we want to infer the presence of some logical individuals with their classes (man and horse respectively). These individuals are linked with their parts through the partOf relation. This can be seen as a clustering problem, and we specify the cost function in terms of clustering optimisation. The parts (simple objects) are the input of the clustering problem, whereas a single cluster contains the parts of a composite object. The parts to cluster are the individuals d ∈ Δ^{I_p} with the following features:

– a set of low-level image features extracted from G(d), the grounding of d;
– a set of semantic features corresponding to the most specific concepts extracted from the set {C ∈ Σ_C | d ∈ C^{I_p}} assigned to d by I_p.

We use the centroid of G(d) as a numeric feature, but the approach can be generalised to other features. Clustering algorithms are based on some distance between the input elements defined in terms of their features. Let δ_G(d, d′) be the Euclidean distance between the centroids of G(d) and G(d′), δ_O^s(d, d′) a semantic distance between simple objects, and δ_O^c(d, d′) a semantic distance between a simple object and its corresponding composite object. We define the cost function as the quality measure of the clustering:

  S(⟨P, I_p, G⟩_O) = α Σ_{d,d′ ∈ (∃hasPart.⊤)^{I_p}} δ_G(d, d′)
                   + (1−α) [ Σ_{⟨d′,d⟩ ∈ partOf^{I_p}, ⟨d″,d⟩ ∈ partOf^{I_p}} (δ_G(d′, d″) + δ_O^s(d′, d″))
                           + Σ_{⟨d′,d⟩ ∈ partOf^{I_p}} (δ_G(d′, d) + δ_O^c(d′, d)) ]

Following [9], the first component of the above equation measures the centroid distance between the composite objects (inter-cluster distance). The second component estimates the distance between the elements of each single cluster (intra-cluster distance).

Minimising the above equation analytically is rather complex, thus we developed an iterative algorithm that at each loop groups the parts of a composite object, approximating the cost function. If the grouping is not a partial model, the algorithm enters the next loop and selects another clustering. In the first step our algorithm generates an initial partial model I_p from P = ⟨S, L⟩, where Δ^{I_p} contains an element d_s for every segment s ∈ S and any concept C in the labelled picture is interpreted as C^{I_p} = {d_s | C ∈ L(s)}. The grounding G is the set of pairs ⟨d_s, s⟩. Then the algorithm enters a loop where a non-parametric clustering procedure [10] clusters the input elements d ∈ Δ^{I_p} by using their numeric and semantic features according to δ_G and δ_O^s. Each cluster cl corresponds to a composite object d_cl, which is introduced in I_p and connected via the hasPart relation to the elements of cl. We predict the type of this new individual via abductive reasoning: the type is the ontology concept that shares the maximum number of parts with the elements of the cluster. For example, if we cluster some elements of type Tail, Muzzle and Arm, an abduced ontology concept will be Horse. These new facts are introduced in I_p and the algorithm checks whether I_p is a partial model of O by using a DL reasoner (Pellet [16]). If so, the algorithm returns I_p; otherwise it extends the input elements with a set of consistency features that encode information about the inconsistency of I_p. These features tend to separate (resp. join) the segments that have been joined (resp. separated) in the previous clustering.
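The main loop described above can be sketched in miniature. Everything in this sketch is a simplifying assumption: the toy `ONTOLOGY` table, the threshold-based single-link grouping (standing in for the non-parametric clustering procedure), and the `is_partial_model` check (standing in for the Pellet reasoner) are invented for illustration only.

```python
import math
from itertools import combinations

# Toy part-whole knowledge: composite concept -> admissible part concepts.
# A hypothetical stand-in for the DL ontology O.
ONTOLOGY = {
    "Horse": {"Tail", "Muzzle", "Leg"},
    "Man": {"Arm", "Head", "Leg"},
}

def delta_g(p, q):
    """Numeric distance: Euclidean distance between segment centroids."""
    return math.dist(p["centroid"], q["centroid"])

def delta_s(p, q):
    """Crude semantic distance: 0 if some composite admits both parts, else 1."""
    ok = any(p["concept"] in parts and q["concept"] in parts
             for parts in ONTOLOGY.values())
    return 0.0 if ok else 1.0

def cluster_parts(parts, threshold=2.0):
    """Single-link grouping: merge clusters whose closest parts are near
    in the combined numeric + semantic distance."""
    clusters = [[p] for p in parts]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if any(delta_g(p, q) + delta_s(p, q) < threshold
                   for p in clusters[a] for q in clusters[b]):
                clusters[a].extend(clusters.pop(b))
                merged = True
                break
    return clusters

def abduce_type(cluster):
    """Abduce the composite concept sharing the most parts with the cluster."""
    concepts = {p["concept"] for p in cluster}
    return max(ONTOLOGY, key=lambda c: len(ONTOLOGY[c] & concepts))

def is_partial_model(cluster, composite):
    """Toy consistency check: every clustered part must be admissible
    for the abduced composite type (a horse has no arms)."""
    return all(p["concept"] in ONTOLOGY[composite] for p in cluster)

def interpret(parts):
    """Cluster, abduce composite types, and re-cluster on inconsistency."""
    for _ in range(10):                      # bounded number of repair loops
        clusters = cluster_parts(parts)
        composites = [abduce_type(c) for c in clusters]
        bad = [i for i, c in enumerate(clusters)
               if not is_partial_model(c, composites[i])]
        if not bad:
            return list(zip(composites, clusters))
        # Toy "consistency features": spread the offending parts apart so
        # the next clustering round separates them.
        for i in bad:
            for k, p in enumerate(clusters[i]):
                p["centroid"] = (p["centroid"][0] + 10.0 * k,
                                 p["centroid"][1])
    return None
```

With six parts (a tight Tail/Muzzle/Leg group and a distant Arm/Head/Leg group), the loop abduces one Horse and one Man in a single pass; an inconsistent grouping would instead trigger the repair branch and another round of clustering.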
The cluster of our example is inconsistent, because a horse does not have arms. The algorithm then returns to the beginning of the loop.

4 Evaluation

To evaluate our approach we created, by using LabelMe [14], a dataset of 204 labelled pictures. For each picture we manually annotated simple objects, composite objects and their part-whole relations⁵. We also created a simple ontology⁶ with a basic formalisation of meronymy in the domains of houses, trees, people, and street vehicles. We built a ground truth by associating every labelled picture P with its partial model, encoded in an ABox A_P. The partial model returned by our algorithm is encoded in the ABox A*_P; in order to compare A_P with A*_P we define the following two measures.

Table 1. Performance of the proposed algorithm for SII and comparison with the baseline. The reported data are the average of the three measures on each single picture.

             prec_GRP  rec_GRP  F1_GRP  prec_COP  rec_COP  F1_COP
  SII          0.61      0.89    0.67     0.73      0.75    0.74
  Baseline     0.45      0.71    0.48     0.66      0.69    0.66

Grouping (GRP): this measure expresses how good our algorithm is at grouping parts of the same composite object. We define precision, recall and F1 on the set of siblings (the parts of the same composite object):

  sibl(A) = {⟨d, d′⟩ | ∃d″ : partOf(d, d″), partOf(d′, d″) ∈ A}

Thus:

  prec_GRP(P) = |sibl(A_P) ∩ sibl(A*_P)| / |sibl(A*_P)|
  rec_GRP(P) = |sibl(A_P) ∩ sibl(A*_P)| / |sibl(A_P)|

Complex-object prediction (COP): this measure expresses how good our algorithm is at predicting the type of the composite object. We define precision, recall and F1 on the types of the composite object each part is assigned to:

  ptype(A) = {⟨d, C⟩ | ∃d′ : {partOf(d, d′), C(d′)} ⊆ A}
Thus:

  prec_COP(P) = |ptype(A_P) ∩ ptype(A*_P)| / |ptype(A*_P)|
  rec_COP(P) = |ptype(A_P) ∩ ptype(A*_P)| / |ptype(A_P)|

To measure how much the semantics improves the recognition of composite objects from their parts, we implemented a baseline that clusters without semantic features; see Table 1. The explicit use of semantic knowledge, via the semantic distance and via abductive and deductive reasoning, improves on the baseline that relies only on numeric features.

5 Conclusions

We proposed a well-founded and general framework for SII that integrates the symbolic information of an ontology with the low-level numeric features of a picture. An image is interpreted as a (most plausible) partial model of an ontology, which allows querying its semantic content. We applied the framework to the specific task of recognizing composite objects from their parts. The evaluation shows good results: the injection of semantic knowledge improves the performance with respect to a semantically-blind baseline. As future work, we want to extend our evaluation by using more low-level features, by studying other relations, and by using a semantic segmentation algorithm as the source of labelled pictures.

⁵ An example of labelled picture is available at http://bit.ly/1DXZxic
⁶ The ontology is available at http://bit.ly/1AruGh0

References

1. Atif, J., Hudelot, C., Bloch, I.: Explanatory reasoning for image understanding using formal concept analysis and description logics. IEEE Transactions on Systems, Man, and Cybernetics: Systems 44(5), 552–570 (May 2014)
2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, USA (2003)
3. Bannour, H., Hudelot, C.: Towards ontologies for image interpretation and annotation. In: Martinez, J.M. (ed.) 9th International Workshop on Content-Based Multimedia Indexing, CBMI 2011, Madrid, Spain, June 13-15, 2011. pp. 211–216. IEEE (2011)
4. Carreira, J., Caseiro, R., Batista, J., Sminchisescu, C.: Semantic segmentation with second-order pooling. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision – ECCV 2012. LNCS, Springer Berlin Heidelberg (2012)
5. Donadello, I., Serafini, L.: Mixing low-level and semantic features for image interpretation. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) Computer Vision – ECCV 2014 Workshops. LNCS, Springer International Publishing (2014). Best paper award
6. Espinosa, S., Kaya, A., Möller, R.: Logical formalization of multimedia interpretation. In: Paliouras, G., Spyropoulos, C., Tsatsaronis, G. (eds.) Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, Lecture Notes in Computer Science, vol. 6050, pp. 110–133. Springer Berlin Heidelberg (2011)
7. Gould, S., Zhao, J., He, X., Zhang, Y.: Superpixel graph label transfer with learned distance metric. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. Lecture Notes in Computer Science, Springer International Publishing (2014)
8. Hudelot, C., Maillot, N., Thonnat, M.: Symbol grounding for semantic image interpretation: From image data to semantics. In: Proc. of the 10th IEEE Intl. Conf. on Computer Vision Workshops. ICCVW ’05, IEEE Computer Society (2005)
9. Jung, Y., Park, H., Du, D.Z., Drake, B.L.: A decision criterion for the optimal number of clusters in hierarchical clustering. Journal of Global Optimization 25(1), 91–111 (2003)
10. Kohonen, T.: The self-organizing map. Proc. of the IEEE 78(9), 1464–1480 (Sep 1990)
11. Neumann, B., Möller, R.: On scene interpretation with description logics. Image and Vision Computing 26(1), 82–101 (2008). Cognitive Vision Special Issue
12. Peraldi, I.S.E., Kaya, A., Möller, R.: Formalizing multimedia interpretation based on abduction over description logic ABoxes. In: Proc. of the 22nd Intl. Workshop on Description Logics (DL 2009). CEUR Workshop Proceedings, vol. 477. CEUR-WS.org (2009)
13. Reiter, R., Mackworth, A.K.: A logical framework for depiction and image interpretation. Artificial Intelligence 41(2), 125–155 (1989)
14. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision 77(1-3), 157–173 (May 2008)
15. Schroder, C., Neumann, B.: On the logics of image interpretation: model-construction in a formal knowledge-representation framework. In: Proceedings of the Int. Conf. on Image Processing 1996. vol. 1, pp. 785–788 (Sep 1996)
16. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. Web Semant. 5(2), 51–53 (Jun 2007)
17. Town, C.: Ontological inference for image and video analysis. Mach. Vision Appl. 17(2), 94–115 (Apr 2006)
18. Yuille, A., Oliva, A.: Frontiers in computer vision: NSF white paper (November 2010), http://www.frontiersincomputervision.com/WhitePaperInvite.pdf