=Paper=
{{Paper
|id=Vol-2145/p03
|storemode=property
|title=Ontology Based Image Recognition: A Review
|pdfUrl=https://ceur-ws.org/Vol-2145/p03.pdf
|volume=Vol-2145
|authors=Sandeepak Bhandari,Audrius Kulikajevas
}}
==Ontology Based Image Recognition: A Review==
Ontology Based Image Recognition: A Review

Sandeepak Bhandari, Department of Software Engineering, Kaunas University of Technology, Kaunas, Lithuania, Sandeepak525@gmail.com

Audrius Kulikajevas, Department of Multimedia Engineering, Kaunas University of Technology, Kaunas, Lithuania, akulikajevas@gmail.com

Copyright held by the author(s).

Abstract — Due to the lack of domain knowledge about the semantics of an image, the image retrieval rate is usually unsatisfying. To improve this, image labels must be provided by the author of the dataset. Applying ontology in digital image recognition extracts relevant information such as timeline, features and visualization that helps understand and interpret the image, so that we can focus on the most relevant information.

Keywords — Ontology, features, image labeling, classification, semantic gap, cognitive vision.

I. INTRODUCTION

Computer vision has lately seen a resurgence in popularity with the increased interest in machine learning, specifically neural networks. Despite this, the majority of work is restricted to specific domain knowledge such as specific object instance recognition, face recognition, etc., which makes these types of computer systems lack the flexibility and adaptability needed for object recognition in other domains. The term cognitive vision [1] has been introduced to encapsulate an attempt to achieve more robust and adaptive computer vision systems. Cognitive vision systems can infer spatial relations, such as that an ocean is often near a beach, or that a lake is beside a grassland [2]. With this paper we try to evaluate previous research done on object recognition tasks with the addition of semantics in the form of domain ontologies. This paper is organized as follows: Section II provides a short introduction to related work. Section III provides an in-depth analysis of Semantic Web applications in the field of object recognition and tasks related to it, such as segmentation. Section IV describes possible applications of the Semantic Web in domain specific object recognition tasks. Finally, Section V provides our conclusions on the effectiveness of using ontology in the field of computer vision.

II. RELATED WORK

There has been a variety of recent studies in the field of cognitive computing. With the rise of machine learning and neural networks it is worth re-evaluating the benefits of applying ontologies to existing object classification and visual recognition tasks. One of the most important tasks when it comes to object recognition is the creation of an image labeling system for the identification of ground truths. Some of the possibilities to create the ground truths involve image annotation by keywords, free text annotations, or annotations based on ontologies [3], which allow a hierarchical structure to be added to a collection of keywords in order to produce a taxonomy. Furthermore, ontologies can be used for activity recognition from video feeds [4] [5], allowing cognitive vision systems to semantically identify activities from low-level events. Whereas traditional object recognition methods rely on detecting an individual object as a whole, ontology based methods are capable of detecting objects based on their individual component compositions [6], making ontology a powerful tool for the future of computer vision and machine learning.

Anne-Marie Tousch et al. [7] have made contributions in the survey of semantic techniques used for image annotation. In the paper, the authors analyze the nature of the semantics used to describe an image, where they point out three main levels of description. The first level relates objective and concrete object descriptions to abstract ones: for example, a crying person is an objective description, while inferred pain would be a subjective one without deeper knowledge of the semantic context. The second level compares generic versus specific objects, also referred to as individual instances in object recognition, i.e. a bridge vs. the Golden Gate Bridge. The third and final semantics level is split into four facets: time, localization, events and objects. Another contribution to the discussion was regarding semantic analysis, an explanation of the semantic gap, and observations on the discrepancy between the human ability to recognize an enormous number of objects almost instantly and the image recognition possible by machines.

They clarify the term semantics, in the context in which they (and by extension we) use it in their paper, as an image description in natural language, with semantic analysis referring to any kind of transcription of an image into linguistic expression. Furthermore, they describe several approaches for semantic analysis using unstructured vocabularies:

1. Direct methods using plain representation of data and plain statistical methods.
2. Linguistic methods based on the use of an intermediate vocabulary between raw numerical data and high-level semantics.
3. Compositional methods where parts of an image are identified before the whole image or its parts are annotated.
4. Structural methods where a geometry of parts is used.
5. Hierarchical compositional methods where a hierarchy of parts is constructed for recognition.
6. Communicating methods where information is shared between categories.
7. Hierarchical methods that search for hierarchical relationships between categories.
8. Multilabel methods assigning several global labels simultaneously to an image.

Finally, the authors touch on semantic image analysis using structured vocabularies, where they distinguish the two main relationship types commonly found in ontologies: the Is-A-Part-Of and Is-A relationships, where the latter signifies inheritance and the former signifies composition, such as being part of a bigger model. However, according to the authors such relationships are not descriptive enough, and they suggest a finer organization of methods by how semantic relations are introduced into the system:

1. Linguistic methods where the semantic structure is used at the level of vocabulary, independently from the image, e.g. to expand the vocabulary.
2. Compositional methods that use meronyms, e.g. components.
3. Communication methods that use semantic relations to share information between concepts, be it for feature extraction or for classification.
4. Hierarchical methods that use Is-A relations to improve categorization and/or to allow classification at different semantic levels.

There have been other contributions to the discussion of automatic image annotation techniques applying semantics [8], where different approaches to automatic image annotation are reviewed: 1) generative model based image annotation, 2) discriminative model based image annotation, 3) graph model based image annotation. Due to the rapid advancement of digital technology in the last few years, there has been an increasingly large number of images available on the Web, making manual annotation of images an impossible task. In the rest of this paper, however, we will be focusing on applying semantics to each step of image recognition individually, along with applications of semantics to narrow domains.

III. SEMANTIC WEB IN OBJECT RECOGNITION

Image retrieval is the key task to be solved in the science of computer vision. While more classical approaches had to an extent worked in the past for simple object recognition, more general approaches are required in modern times. In this section we present past research on the application of ontology to individual steps of image recognition to improve image retrieval performance. We also present TABLE I for a comparison between such methods.

A. Semantics application in image segmentation

One of the main tasks in any image recognition software is the image segmentation step. During this step, an image is segmented into viable detection Regions of Interest (ROIs) to optimize the following recognition steps. However, image segmentation is a very challenging, albeit necessary, task for any kind of image recognition algorithm; therefore any optimization of it is a welcome addition to any cognitive vision system. In this section of the paper we will focus on the application of semantic web technologies to optimize the results of segmentation algorithms.

The quality of recognition highly depends on the ability to segment any given frame. Previously, classical algorithms such as watershed have been used to achieve image segmentation. However, such algorithms lacked the efficiency, precision and robustness needed in real world scenarios, where occlusion, motion and illumination play key roles in the scene. There has been some research in adapting convolutional neural networks to the segmentation task, in techniques such as Mask R-CNN [9]. However, because CNNs highly depend on the domain they have been trained on, they have trouble detecting changes even in the domains they are familiar with. In the paper [10] a method was proposed which involves simultaneous image segmentation and detection of simple objects, partially imitating how human vision works. The initial region labeling is performed by matching the regions' low-level descriptors with concepts stored in an ontological knowledge base. This allows the proposed technique to associate each region with a fuzzy set of candidate labels. Afterwards, a merging process is performed based on new similarity measurements and merging criteria that are defined at the semantic level with the use of fuzzy set operations. Furthermore, this approach is invariant to the chosen region growing algorithm and can be applied to any of them with certain modifications; to demonstrate this, the authors apply semantics to watershed and RSST segmentation, experimentally showing that semantic watershed achieved 90% accuracy and semantic RSST 88%, compared to 82% for the classical RSST approach. Other experiments have shown a similar increase in accuracy of 7-8%. To achieve these improvements, the authors propose adjusting the merging processes as well as the termination criteria of the classical region growing algorithms. What is more, a novel ontological representation for context is introduced, combining fuzzy theory and fuzzy algebra with characteristics derived from the semantic web, such as reification. The membership degrees of the labels assigned to regions by the semantic segmentation are then re-estimated by a context-based membership degree readjustment algorithm, which utilizes ontological knowledge to optimize the membership degrees of the detected concepts for each region in the scene. While the contextualization and initial region labeling steps are domain specific and require the domain of the images to be provided as input, the rest of the approach is domain independent. In another paper [11], the authors have shown that applying semantics to a convolutional neural network for the task of image segmentation can greatly increase performance. The authors experimentally proved that a neural network trained end-to-end, pixel-to-pixel on semantic segmentation can achieve the best asymptotic and absolute performance results without the downsides of other methods, such as patch-wise training, which lacks the efficiency of convolutional training, or the needed inclusion of superpixels such as the ones used in [12], where they are used to generate semantic object parts. However, the superpixels used in the latter additionally give a bridge between low-level and high-level features by incorporating semantic knowledge, allowing the labels of individual segmented regions to be inferred.
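The fuzzy labeling and merging steps above can be illustrated with a minimal sketch (hypothetical concepts and membership degrees; this is not the exact similarity measure or merging criterion of [10]): each region carries a fuzzy set of candidate labels, and two adjacent regions are merged when their label sets are semantically similar enough, the merged region taking the fuzzy union of the labels.

```python
# Illustrative only: fuzzy candidate labels per region (concept -> degree).
def semantic_similarity(a, b):
    """Height of the fuzzy intersection: max over shared concepts of min degree."""
    shared = set(a) & set(b)
    return max((min(a[c], b[c]) for c in shared), default=0.0)

def merge(a, b):
    """Fuzzy union of two candidate-label sets."""
    return {c: max(a.get(c, 0.0), b.get(c, 0.0)) for c in set(a) | set(b)}

region1 = {"sea": 0.8, "sky": 0.3}   # hypothetical adjacent regions
region2 = {"sea": 0.7, "sand": 0.4}

THRESHOLD = 0.5
if semantic_similarity(region1, region2) >= THRESHOLD:
    region1 = merge(region1, region2)  # the two regions are now labeled as one
```

A real pipeline would iterate this over all neighboring region pairs and stop once no similarity exceeds the threshold, which corresponds to the adjusted termination criteria discussed above.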
B. Semantics application to image labeling

Convolutional neural networks have shown great performance in their ability to correctly detect objects and actions in the domains they are familiar with. However, in order for a network to interpret the images it sees correctly, it needs a vast amount of data samples to compare against. There already exist databases of labeled images, such as ImageNet, that can provide useful training data, although with the rapid development of social media, automatic techniques capable of effectively understanding and labeling the media are required. Content aware systems are capable of indexing, searching, retrieving, filtering and recommending multimedia from the vast quantity of media posted on social media. However, such unconstrained data has a very high complexity of objects, events and interactions in consumer videos. Such unconstrained domains create numerous problems for previously available video analysis algorithms, such as the ones capable of recognizing human activity. What is more, home videos, being the most prevalent genre, suffer from increased feature extraction problems due to poor lighting conditions, occlusion, clutter in the scene, shaking and/or low-resolution cameras, and various other background noise. In the reviewed paper [13] the authors try to address these problems by introducing a new attribute-learning framework that learns a unified semilatent attribute space. Latent attributes are used to represent all shared aspects of the data which are not explicitly included in users' sparse and incomplete annotations. Latent attributes serve as complementary annotations to user specified attributes and are discovered by the model through joint learning of the semilatent attribute space. This gives the authors a mechanism for semantic feature reduction from the raw data in multiple modalities to a unified lower dimensional semantic attribute space. These semilatent attributes are used to bridge the semantic gap with reduced dependence on the completeness of the attribute ontology and the accuracy of the training attribute labels. The described method has given the authors the flexibility to learn a full semantic attribute space of the video feed irrespective of how well defined and complete the user given data about it is. Furthermore, they have managed to improve multitask and N-shot learning by leveraging latent attributes, gone beyond existing zero-shot learning approaches by exploiting latent attributes, and leveraged attributes in conjunction with multimodal data to improve cross-media understanding, enabling new tasks such as explicitly learning which modalities attributes appear in. Finally, the proposed method is applicable to large multimedia data sets, as it is expressed in a significantly more scalable way than previously available techniques, making the technique invariant to the length of the given input video or the density of available features in it.

Other research [14] supports the notion that multimedia resources "in the wild" are growing at a staggering rate and that the rapidly increasing number of multimedia resources has brought an urgent need to develop intelligent methods to organize and process them. In this paper, the Semantic Link Network model is used for organizing multimedia resources. The Semantic Link Network (SLN) is designed to establish associated relations among various resources (e.g., Web pages or documents in a digital library), aiming at extending a loosely connected network with no semantics (e.g., the Web) into an association-rich network. Since the theory of cognitive science considers that associated relations can make a resource more comprehensible to users, the motivation of SLN is to organize the associated resources loosely distributed on the Web to effectively support intelligent Web activities such as browsing, knowledge discovery and publishing. The tags and surrounding texts of multimedia resources are used to represent the semantic content, and the relatedness between tags and surrounding texts is implemented in the Semantic Link Network model. Data sets including about 100 thousand images with social tags from Flickr are used to evaluate the proposed method. Two data mining tasks, clustering and searching, are performed by the proposed framework, which shows its effectiveness and robustness.

C. Semantic web application to recognize objects based on individual parts

The semantic web (ontologies) gives us the powerful ability to infer certain attributes about an object based on domain knowledge of the object's individual components. This allows us, instead of recognizing a specific object instance or its class, to recognize the individual components available in the scene and, based on the known Is-A/Is-Part-Of relationships, infer what kind of objects can be created from those components. One novel way of applying ontologies to the task of object recognition was the application of object semantics based on what the object is being used for [15]. In the case of that paper, the semantics were used for tool recognition, where the type of tool is inferred from its functionality in relation to the human hand. In the work, the authors assert that objects do not change functionality based on changes to their details; a cup having multiple handles, for example, would still be considered a cup. Instead, they focus on object parts and their combinations to assign a function to a tool. This approach gives the advantage that the system is not trained on arbitrarily different individual objects and will instead detect the parts that contribute to the fundamental tool functionality. The proposed method consists of three main stages: preprocessing, object signature extraction and object similarity calculations. Object signature extraction is subdivided into two steps: part signature extraction and pose signature extraction. During part signature extraction a support vector machine (SVM) is used to find the characteristic descriptions of a given object. Pose signatures describe how parts are attached to each other, which provides the information on how the parts are rotated with respect to one another and the locations at which parts are connected to one another. Once the features are extracted, a function analyzer algorithm is run which allows objects to be compared and functional meanings to be assigned to them, thus achieving these tasks: recognizing the object, generalizing between different objects with different numbers of parts, assigning multiple functions to the same object, and providing the ability to find another use for an object. The authors have shown that their method, unlike deep convolutional neural networks, does not require extensive training sets and can generalize from very few training samples.
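The part-based inference described in this subsection can be sketched with a toy Is-Part-Of knowledge base (hypothetical objects and parts; the actual method of [15] extracts part and pose signatures with an SVM rather than using symbolic part names): object hypotheses are ranked by the fraction of their known parts observed in the scene.

```python
# Toy Is-Part-Of knowledge base: object -> the parts it is composed of.
HAS_PART = {
    "cup":    {"container", "handle"},
    "hammer": {"head", "handle"},
    "chair":  {"seat", "legs", "backrest"},
}

def infer_objects(detected_parts):
    """Rank object hypotheses by the fraction of their parts present."""
    scores = {}
    for obj, parts in HAS_PART.items():
        overlap = parts & detected_parts
        if overlap:
            scores[obj] = len(overlap) / len(parts)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(infer_objects({"container", "handle"}))
# → [('cup', 1.0), ('hammer', 0.5)]
```

Note how a cup with several handles would still score as a cup here: the hypothesis depends only on which functional parts are present, not on the exact object instance, which mirrors the argument above.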
TABLE I. ONTOLOGY BASED IMAGE RETRIEVAL METHODS

1. Keyword based image retrieval (text based, field based, structure based). Pros: utilizes diverse keywords; uses one or more image properties; keywords describe the image data. Cons: cannot describe an image completely and semantically.
2. Low-level feature: color (histogram and moments, dominant color, color cluster, etc.). Pros: color similarity based retrieval; color coherence vector based retrieval. Cons: color alone cannot describe the full image content.
3. Low-level feature: shape (Fourier transform, curvature scale, template matching, etc.). Pros: template matching method. Cons: shape alone cannot describe the full image content.
4. Low-level feature: texture (wavelet transform, edge statistics, Gabor filters, statistical based, etc.). Pros: color and texture similarity based retrieval; color coherence vector based retrieval. Cons: color and texture alone cannot describe the full image content.
5. Scale Invariant Feature Transform (SIFT). Pros: predicting amino acid changes in protein structure (science application); features for image retrieval. Cons: subsequent work concluded that better descriptors were needed.
6. Speeded Up Robust Features (SURF). Pros: novel scale and rotation invariant feature description; CBIR visual attention model. Cons: not fully intended to attempt filling the visual semantic gap.
7. Ontology based methods. Pros: CBIR with high-level semantics; ontology based image retrieval. Cons: a basic ontology with limited visual features for cognitive vision.

Fig. 1. Ontology-connected image retrieval process (diagram: ontology formation through domain knowledge acquisition into a visual concept ontology knowledge base; a concept extraction engine over the image database; the user query and image samples matched through a similarity measure to reduce the semantic gap and produce the retrieval results).

The context description of Fig. 1 can be framed using an ontology built from the image concepts discussed above. Applying Description Logics (DL), the knowledge representation can be formed. DAML (DARPA Agent Markup Language) and OIL (Ontology Inference Layer), which are available alongside OWL (Web Ontology Language), are used for this purpose. Rules describing the relations between image features in the ontology can also be defined using DL. Once the concept ontology is framed (for instance a spatial ontology), the similarity matching of a user query against the extracted image features is evaluated through the ontology hierarchy, which brings the user query closer to the images in the database. A few tools have been developed, notably "OntoVis", which performs three tasks: domain knowledge acquisition, ontology driven visual acquisition and image instance management. The advantage of using a visual concept ontology is to fill, as much as possible, the semantic gap between low-level and high-level concept knowledge extraction and retrieval.

IV. SEMANTIC WEB IN DOMAIN SPECIFIC TASKS

A. Application in robotics

Mobile robots are one of the key applications that benefit from object recognition technologies. More and more robots are entering human living and working environments, where, to operate successfully, they are faced with a multitude of real world challenges, such as having to handle many objects located in different places. To overcome these challenges a solution for robot efficiency needs to be found, both in terms of computational efficiency and in reducing the number of detected false positives. The addition of ontology has been shown to be beneficial in object recognition tasks in methods such as RoboEarth [16], where the addition of such a semantic mapping layer was experimentally shown to decrease computational time by checking only against the 10 most promising object annotations in large object databases. RoboEarth describes a methodology the authors name "action recipes", which semantically describe what actions need to be performed to complete a specific task, such as mapping the environment or determining the location of an object. For example, the "ObjectSearch" action recipe, based on prior partial room knowledge and its landmarks, is capable of inferring potential locations from where the desired object might be detected.

B. Application in Geographic Information Science

GEOBIA (Geographic Object-Based Image Analysis) [17] is not only a hot topic of current remote sensing and geographical research; it is believed to be a paradigm in remote sensing and Geographic Information Science (GIScience). It aims to develop automated methods for partitioning remote sensing (RS) imagery into meaningful image objects, and to assess their characteristics through spatial, spectral, textural and temporal features, thus generating new geographic information in a GIS-ready format.

GEOBIA [18] represents the most innovative new trend for processing remote sensing images that has appeared during the last decade. However, its application is mainly based on expert knowledge, which consequently highlights important scientific issues with respect to the robustness of the methods applied in GEOBIA. In this paper, the authors argue that GEOBIA would benefit from another technical enhancement involving knowledge representation techniques such as ontologies. The authors summarize the main applications of ontologies in GEOBIA, especially for data discovery, automatic image interpretation, data interoperability, workflow management and data publication. Among the broad spectrum of applications for ontologies, they mention systems engineering, interoperability and communication. Because the method is a part of systems engineering, GEOBIA experts follow a series of analytical procedures to develop a system designed to produce geographic information. These procedures principally involve (i) data discovery and (ii) data processing and analysis, i.e., image interpretation.

C. Application to sports events

In this paper [19], the authors present an ontology-based information extraction and retrieval system and its application in the soccer domain. In general, the authors deal with three issues in semantic search, namely usability, scalability and retrieval performance. They propose a keyword-based semantic retrieval approach. The performance of the system is improved considerably using domain-specific information extraction, inferencing and rules. Scalability is achieved by adapting a semantic indexing approach and representing the whole world as small independent models. The system is implemented using state-of-the-art Semantic Web technologies and its performance is evaluated against traditional systems as well as query expansion methods. Furthermore, a detailed evaluation is provided to observe the performance gain due to domain-specific information extraction and inferencing.

The authors presented a novel semantic retrieval framework and its application in the soccer domain, which includes all the aspects of the Semantic Web, namely ontology development, information extraction, ontology population, inferencing, semantic rules, semantic indexing and retrieval. When these technologies are combined with the comfort of a keyword based search interface, the authors obtain a user-friendly, high performance and scalable semantic retrieval system. The evaluation results show that this approach can easily outperform both the traditional approach and the query expansion methods. Moreover, the authors observed that the system can answer complex semantic queries without requiring formal queries such as SPARQL, and that it can get close to the performance of SPARQL, which is the best that can be achieved with semantic querying. Finally, the authors show how structural ambiguities can be resolved easily using semantic indexing.
V. CONCLUSIONS

In this paper we presented the importance and usefulness of applying ontology to image recognition tasks. We have reviewed different ontology based techniques, compared them to more classical approaches such as SIFT and SURF, and provided a list of benefits and possible drawbacks of using such techniques. With our paper we have concluded that applying semantics can greatly improve not only the overall performance of object recognition but also the performance and quality of the individual tasks required for object recognition, such as image segmentation. Moreover, we have found that ontology can be used to substantially reduce the semantic gap, i.e. the difference between the understanding of images by humans and the interpretation of images by machines, allowing for better automation in training neural networks, as dataset preparation can be offloaded to a machine instead of being done by hand. Finally, we have discussed the concept of the semantic web, based on which the ontology can be formed.

REFERENCES

[1] P. Auer et al., "A Research Roadmap of Cognitive Vision," IST Proj. IST-2001-35454, 2005.
[2] J. P. Schober, T. Hermes, and O. Herzog, "Content-based image retrieval by ontology-based object recognition," KI-2004 Work. Appl. Descr. Logics, 2004.
[3] A. Hanbury, "A survey of methods for image annotation," J. Vis. Lang. Comput., vol. 19, no. 5, pp. 617–627, 2008.
[4] D. Tahmoush and C. Bonial, "Applying Attributes to Improve Human Activity Recognition," Appl. Imag. Pattern Recognit. Work. (AIPR), 2015 IEEE, 2015.
[5] U. Akdemir, P. Turaga, and R. Chellappa, "An ontology based approach for activity recognition from video," Proceeding 16th ACM Int. Conf. Multimed., pp. 709–712, 2008.
[6] S. Tongphu, B. Suntisrivaraporn, B. Uyyanonvara, and M. N. Dailey, "Ontology-based object recognition of car sides," 2012 9th Int. Conf. Electr. Eng. Comput. Telecommun. Inf. Technol. ECTI-CON 2012, 2012.
[7] A. M. Tousch, S. Herbin, and J. Y. Audibert, "Semantic hierarchies for image annotation: A survey," Pattern Recognit., vol. 45, no. 1, pp. 333–345, 2012.
[8] D. Zhang, M. M. Islam, and G. Lu, "A review on automatic image annotation techniques," Pattern Recognit., vol. 45, no. 1, pp. 346–362, 2012.
[9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," 2017.
[10] T. Athanasiadis, P. Mylonas, Y. Avrithis, and S. Kollias, "Semantic image segmentation and object labeling," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 3, pp. 298–311, 2007.
[11] E. Shelhamer, J. Long, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017.
[12] M. Zand, S. Doraisamy, A. A. Halin, and M. R. Mustaffa, "Ontology-Based Semantic Image Segmentation Using Mixture Models and Multiple CRFs," IEEE Trans. Image Process., vol. 25, no. 7, pp. 3233–3248, 2016.
[13] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong, "Learning multimodal latent attributes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 303–316, 2014.
[14] U. Manzoor and M. A. Balubaid, "Semantic Image Retrieval: An Ontology Based Approach," vol. 4, no. 4, pp. 1–8, 2015.
[15] M. Schoeler and F. Worgotter, "Bootstrapping the Semantics of Tools: Affordance Analysis of Real World Objects on a Per-part Basis," IEEE Trans. Cogn. Dev. Syst., vol. 8, no. 2, pp. 84–98, 2016.
[16] L. Riazuelo et al., "RoboEarth Semantic Mapping: A Cloud Enabled Knowledge-Based Approach," IEEE Trans. Autom. Sci. Eng., vol. 12, no. 2, pp. 432–443, 2015.
[17] H. Y. Gu, H. T. Li, L. Yan, and X. J. Lu, "A framework for Geographic Object-Based Image Analysis (GEOBIA) based on geographic ontology," Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. - ISPRS Arch., vol. 40, no. 7W4, pp. 27–33, 2015.
[18] D. Arvor, L. Durieux, S. Andrés, and M. A. Laporte, "Advances in Geographic Object-Based Image Analysis with ontologies: A review of main contributions and limitations from a remote sensing perspective," ISPRS J. Photogramm. Remote Sens., vol. 82, pp. 125–137, 2013.
[19] S. Kara, Ö. Alan, O. Sabuncu, S. Akpınar, N. K. Cicekli, and F. N. Alpaslan, "An ontology-based retrieval system using semantic indexing," Inf. Syst., vol. 37, no. 4, pp. 294–305, 2011.
[20] M. Wróbel, J. T. Starczewski, and C. Napoli, "Handwriting recognition with extraction of letter fragments," International Conference on Artificial Intelligence and Soft Computing, pp. 183–192, 2017.
[21] J. T. Starczewski, S. Pabiasz, N. Vladymyrska, A. Marvuglia, C. Napoli, and M. Woźniak, "Self organizing maps for 3D face understanding," International Conference on Artificial Intelligence and Soft Computing, pp. 210–217, 2017.