Vineeta Singh et al. MAICS 2017, pp. 85–91

Image understanding - a brief review of scene classification and recognition

Vineeta Singh, Deeptha Girish, and Anca Ralescu
EECS Department, ML 0030
University of Cincinnati
Cincinnati, OH 45221, USA
singhvi@mail.uc.edu, girishde@mail.uc.edu, Anca.Ralescu@uc.edu

Abstract

With over 40 years of history, image understanding, in particular scene classification and recognition, remains central to machine vision. With an abundance of image and video databases, it is necessary to be able to sort and retrieve images and videos in a way that is both efficient and effective. This is possible only if the categories of images and/or their context are known to a user. Hence, the ability to classify and recognize scenes accurately is of utmost importance. This paper presents a brief survey of the advances in scene recognition and classification algorithms.

Depending on its goal, image understanding (IU) can be defined in many different ways. However, in general, IU means describing the image content, the objects in it, the location of and relations between objects, and, most recently, the events in an image. In (Ralescu 1995) IU is equated with producing a verbal description of the image content. Scene analysis (as part of IU) and categorization is a highly useful ability of humans, who are able to categorize complex natural scenes containing animals or vehicles very quickly (Thorpe, Fize, and Marlot 1996), with little or no attention (Li et al. 2003). When a scene is presented to humans, they are able to identify it within a short period of exposure (< 100 ms). How humans perform all of these tasks the way they do is yet to be fully understood. To date, the classic text by Marr (Marr 1982) remains one of the main sources for understanding the human vision system.

Many researchers have tried to build this incredible capability of the human vision system into their algorithms for image processing, scene understanding and recognition. In the presence of a wealth of literature on this and related subjects, surveys of the field, even limited ones such as the present one (due to space constraints), are bound to be very useful by reviewing the methods for scene recognition and classification.

Perhaps the first issue to consider is the concept of scene as a technical concept that captures the natural one. According to Xiao et al. (Xiao et al. 2010), a scene is a place in which a human can act, or a place to which a human being could navigate. Therefore, scene recognition and scene classification algorithms must delve into understanding the semantic context of the scene. According to how a scene is recognized in an image, scene recognition algorithms can be broadly divided into two categories:

• Scene recognition based on object detection.
• Scene recognition using low-level image features.
Scene recognition using object recognition (SR-OR)

Using object recognition for scene classification is a straightforward and intuitive approach, and it can assist in distinguishing very complex scenes which might otherwise prove difficult to separate using standard low-level features.

In the paper by Li-Jia Li et al. (Li et al. 2010) the authors note that "robust low-level image features have been proven to be effective representations for scene classification; but pixels, or even local image patches, carry little semantic meanings. For high level visual tasks, such low-level image representations are potentially not enough." To combat this drawback of local features, they propose a high-level image representation, called the Object Bank (OB), in which an image is represented by integrating its responses to various object detectors. These object detectors or filters are blind to the testing dataset or visual task. Using the OB representation, superior performance on high-level visual recognition tasks can be achieved with simple regularized logistic regression. Their algorithm uses the object detectors of Felzenszwalb et al. (Felzenszwalb et al. 2010), as well as the geometric context classifiers (stuff detectors) of Hoiem et al. (Hoiem, Efros, and Hebert 2005), for pre-training the object detectors.

OB offers a rich set of object features, while presenting a challenge: a curse of dimensionality due to the presence of multiple classes of objects within a single image, which yields feature vectors of very high dimension. The performance of the system plateaus when the number of object detection filters is too high; according to the authors, the system performance is best when the number of object filters is moderate.
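For illustration only (this is a sketch of the general idea, not the authors' code), an Object Bank-style descriptor can be built by max-pooling detector responses over several image scales and training a regularized logistic regression on top. The `detectors` list, the scales, and the classifier hyperparameters below are all assumptions made for the example.

```python
# Sketch of an Object Bank-style descriptor (illustrative, not the original code).
# `detectors` is assumed to be a list of callables, each mapping a grayscale
# image (2-D array) to a 2-D detection-score map, e.g., pre-trained detectors.
import numpy as np
from scipy.ndimage import zoom
from sklearn.linear_model import LogisticRegression

def object_bank_descriptor(image, detectors, scales=(1.0, 0.5, 0.25)):
    """Concatenate max-pooled detector responses collected over several scales."""
    feats = []
    for detect in detectors:
        for s in scales:
            response = detect(zoom(image, s))   # 2-D score map at this scale
            feats.append(response.max())        # spatial max pooling
    return np.asarray(feats)

def train_scene_classifier(X, y):
    """Simple regularized logistic regression over OB-style feature rows."""
    return LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
```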
Scene recognition using low-level image features (SR-LLF)

Many of the papers in scene recognition are built around the question: can we recognize the context of a scene without having first recognized the objects that are present? There are several reasons for avoiding object recognition for the purpose of scene recognition. While there are many robust OR algorithms, using SR-OR can be problematic because the OR portion of the algorithm is treated as a black box, and therefore OR errors propagate to the SR stage. OR also faces problems due to lighting conditions and occlusion. To avoid this, many studies use low-level features for scene understanding.

The challenge in SR-LLF is to find low-level features in the image that can be used successfully to infer its semantic context. Among the many features that can be extracted from an image for the purpose of scene recognition, texture, orientation, and color have been used extensively in the literature, implemented with different data sets and different classifiers.

In (Renninger and Malik 2004) an algorithm which mimics the human ability to identify scenes with limited exposure is presented. The algorithm is based on a simple texture analysis of the image, which can provide a useful cue for rapid scene identification. The relevant features within a texture are the first-order statistics of textons, which determine the strength of texture discrimination. This idea is derived from Julesz's work (Julesz 1981), (Julesz 1986) (for a discussion of Julesz's work see (Marr 1977)). According to Julesz, textons are the elements in the image that govern our perception of texture. They are calculated by convolving the image with certain filter banks. The texton-based model learns the local texture features which correspond to various scene categories; this is done by filtering a set of 250 training images and then learning the prototypical distributions. The number of occurrences of each feature within a particular image is stored as a histogram, creating a holistic texture descriptor for the image. To learn the most prevalent features, they use k-means clustering to find the 100 prototypical responses. When identifying a new test image, its histogram is matched against stored examples. It is concluded that early scene identification can be explained with a simple texture recognition model. This model leads to identifications and confusions similar to those of a human subject.
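A minimal sketch of such a texton pipeline, under our own assumptions (a small Gaussian-derivative filter bank and illustrative parameter values, not the filters used in the paper), is given below: per-pixel filter responses are clustered with k-means into textons, and each image is summarized by its normalized texton histogram.

```python
# Texton-style sketch (our illustration): per-pixel filter-bank responses,
# k-means "textons", and a per-image texton histogram.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans

def filter_responses(image, sigmas=(1, 2, 4)):
    """Per-pixel responses to a small Gaussian-derivative filter bank."""
    maps = []
    for s in sigmas:
        for order in [(0, 1), (1, 0), (2, 0), (0, 2)]:
            maps.append(gaussian_filter(image.astype(float), s, order=order))
    return np.stack(maps, axis=-1).reshape(-1, len(sigmas) * 4)

def learn_textons(training_images, k=100):
    """Cluster pooled filter responses into k prototypical responses (textons)."""
    X = np.vstack([filter_responses(im) for im in training_images])
    return KMeans(n_clusters=k, n_init=10).fit(X)

def texton_histogram(image, textons):
    """Normalized histogram of texton assignments for one image."""
    labels = textons.predict(filter_responses(image))
    hist = np.bincount(labels, minlength=textons.n_clusters)
    return hist / hist.sum()

# A new image is then identified by matching its histogram against stored
# training histograms, e.g., with a nearest-neighbour comparison.
```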
The same objective, i.e., understanding human perception of scenes, is pursued in (Gorkani and Picard 1994). The paper investigates a measure of dominant perceived orientation developed to match the output of a human study involving 40 subjects. These global multi-scale orientation features were used to detect vacation photos belonging to "city/suburb". The authors state that orientation is an important feature for texture recognition and discrimination. The algorithm finds the local orientation and its strength at each pixel of the image. The implementation extracts orientation information over multiple scales using a steerable pyramid (Rock 1990) and then combines the orientations from these different scales and decides which one is dominant perceptually. The reported results show that the orientation features achieve agreement with the human classification in 91 of the 98 scenes (i.e., approximately 93%).

The paper by Guérin-Dugué et al. (Guérin-Dugué and Oliva 2000) uses an approach similar to Gorkani and Picard (Gorkani and Picard 1994), but extends it with more categories and introduces the selection of the optimal scale for this categorization task. They use local dominant orientation (LDO) features for classifying real-world scenes into four categories (outdoor urban scenes, indoor scenes, closed landscapes and open landscapes). Instead of using the LDO features directly, they propose a compact coding into a few features by Fourier series decomposition, and introduce a spatial scale parameter to optimize the categorization. For each scale and spatial location the dominant orientation and its strength are estimated. The best discrimination ratios were obtained with a representation at a median spatial scale or when combining two different scales.

The paper (Csurka et al. 2004) presents a bag of keypoints approach to visual categorization. The procedure starts with the detection and description of image patches. Then a vocabulary of image descriptors is created by applying a vector quantization algorithm. SIFT descriptors (Lowe 1999) are used as features for this algorithm. This is followed by constructing a bag of keypoints, which counts the number of patches assigned to each cluster. Finally, a multi-class classifier (SVM) is implemented, treating the bag of keypoints as the feature vector and thus determining which category the image belongs to.
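The bag-of-keypoints pipeline just described can be sketched in a few lines. This is an illustration under assumptions (OpenCV's SIFT implementation as the patch descriptor, a 1000-word vocabulary, a linear SVM), not the exact configuration used by Csurka et al.

```python
# Sketch of a bag-of-keypoints pipeline: local descriptors, k-means vocabulary,
# per-image visual-word histogram, and an SVM on top (parameters assumed).
import numpy as np
import cv2
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()   # requires an OpenCV build that provides SIFT

def local_descriptors(gray_image):
    """128-D SIFT descriptors for one 8-bit grayscale image."""
    _, desc = sift.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(images, k=1000):
    """Vector-quantize pooled descriptors into k visual words."""
    all_desc = np.vstack([local_descriptors(im) for im in images])
    return KMeans(n_clusters=k, n_init=4).fit(all_desc)

def bag_of_keypoints(image, vocabulary):
    """Normalized histogram counting patches assigned to each visual word."""
    words = vocabulary.predict(local_descriptors(image))
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)

# Multi-class categorization over the histograms, e.g.:
# clf = SVC(kernel="linear").fit(train_histograms, train_labels)
```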
It is clear that such counts, or histograms, suggest that scene recognition and analysis could benefit from probabilistic approaches. Indeed, some algorithms use probability models to describe the scene based on the extracted features.

The paper A Bayesian Hierarchical Model for Learning Natural Scene Categories (Fei-Fei and Perona 2005) uses low-level texture features as image descriptors. Each patch of the input image is represented using a code word (similar to the bag of keypoints approach). The code word is taken from a codebook, a large vocabulary of code words, obtained from 650 training examples from 13 categories (with around 50 images for each category). In this framework, the local regions are first clustered into intermediate themes and then into categories. The learning algorithm for the model that best represents the distribution of code words over scenes is a modified Latent Dirichlet Allocation model (Blei, Ng, and Jordan 2003). Unlike traditional scene models, where there is a hard assignment of an image to one theme, the algorithm produces a collection of themes that could be associated with an image.

In (Singhal, Luo, and Zhu 2003) a probabilistic approach is used for content detection within the scene. The labels generated are very similar to scene labels. The authors present a holistic approach to determine the scene content, based on a set of individual material detection algorithms as well as probabilistic spatial context models. Material detection is the problem of identifying key semantic objects such as sky, grass, foliage, water, and snow in images. In order to detect materials, the algorithm combines low-level features with unique region analysis and inputs this to a classifier to obtain individual material belief maps. To avoid misclassification of materials in images, they devise a spatial-context-aware material detection system which constrains the beliefs to conform to the probabilistic spatial context models.

The bag of keypoints model (Sivic and Zisserman 2009) corresponds to a histogram of the number of occurrences of particular image patterns in a given image. Most papers mentioned above use this concept in some form. It is adapted from the bag of words model in natural language processing.

In (Lazebnik, Schmid, and Ponce 2006) the authors argue that, in spite of impressive levels of performance, the bag of features model represents the image as an orderless collection of local features, thereby disregarding all information about their spatial layout. To overcome this, they devise a method for recognizing scene categories based on approximate global geometric correspondence. They compute a spatial pyramid by partitioning the image into increasingly fine sub-regions and computing histograms of the local features found in each sub-region. The spatial pyramid is an extension of the orderless bag of features model of image representation, which is improved upon by the introduction of a kernel-based recognition method. This method works by computing a rough geometric correspondence on a global scale using an approximation technique adapted from the pyramid matching scheme of (Grauman and Darrell 2007). The method involves repeatedly subdividing the image and computing histograms of local features at increasingly fine resolutions. The spatial pyramid approach can be thought of as an alternative formulation of a locally orderless image, where a fixed hierarchy of rectangular windows is defined. The spatial pyramid framework is based on the idea that the best results will be achieved when multiple resolutions are combined in a principled way. The features calculated are subdivided into weak features (oriented edge points) and strong features (SIFT descriptors). K-means clustering is performed on a random subset of patches from the training set to form a visual vocabulary. Multi-class classification is done with the support vector machine (SVM), trained using the one-versus-all rule.
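The spatial pyramid construction can be sketched directly from per-patch visual-word assignments. The sketch below uses the standard level weighting from the pyramid match scheme; the keypoint locations, word ids, and the choice of two pyramid levels are assumptions for illustration.

```python
# Sketch of a spatial pyramid over visual words: weighted word histograms over
# a 1x1, 2x2 and 4x4 grid, concatenated into one vector (illustrative only).
import numpy as np

def spatial_pyramid(points, words, image_size, vocab_size, levels=2):
    """points: (x, y) keypoint coordinates; words: their visual-word ids in
    [0, vocab_size). Returns the concatenated, weighted cell histograms."""
    width, height = image_size
    points = np.asarray(points, dtype=float)
    words = np.asarray(words)
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Standard pyramid-match weights: coarse levels count less than fine ones.
        weight = 1.0 / 2 ** levels if level == 0 else 1.0 / 2 ** (levels - level + 1)
        col = np.minimum((points[:, 0] / width * cells).astype(int), cells - 1)
        row = np.minimum((points[:, 1] / height * cells).astype(int), cells - 1)
        for r in range(cells):
            for c in range(cells):
                mask = (row == r) & (col == c)
                hist = np.bincount(words[mask], minlength=vocab_size)
                feats.append(weight * hist)
    return np.concatenate(feats)
```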
Though fewer algorithms use color-based features, in certain cases this descriptor is very powerful in discriminating scenes. Color descriptors can be used for scene and object recognition (Van De Sande, Gevers, and Snoek 2010) in order to increase illumination invariance and discriminative power. Theoretical and experimental results show that invariance to light intensity changes and light color changes affects category recognition. Various color descriptors were analyzed and evaluated: descriptors based on histograms, color moments, moment invariants, and color SIFT were compared, and it was concluded that SIFT-based descriptors performed considerably better than histogram- and moment-based descriptors.

Indoor-Outdoor classification

In (Szummer and Picard 1998) the authors show that high-level scene properties can be inferred from classification of low-level features, specifically for the indoor-outdoor scene retrieval problem. Their algorithm extracts three types of features: 1) histograms in the Ohta color space, 2) multi-resolution simultaneous autoregressive (MSAR) model parameters, and 3) coefficients of a shift-invariant DCT. They show that performance is improved by computing features on sub-blocks, classifying these sub-blocks, and then combining the results by stacking.

The paper by Serrano et al. (Serrano, Savakis, and Luo 2004) uses simplified low-level features to predict the semantic category of scenes. These are integrated probabilistically using a Bayesian network to give a final indoor/outdoor classification. Low-dimensional color and wavelet texture features are used to classify scenes using the support vector machine (SVM). The wavelet texture features are used here instead of the popular MSAR texture features to reduce the computational complexity.
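Both indoor-outdoor pipelines above start from simple per-sub-block color features. A minimal sketch of Ohta-space sub-block histograms follows; the 4 × 4 blocking and the bin counts are assumptions made for the example, not the values used in either paper.

```python
# Sketch of sub-block color features in the Ohta color space (illustrative).
import numpy as np

def ohta_channels(rgb):
    """Ohta channels I1=(R+G+B)/3, I2=R-B, I3=(2G-R-B)/2 for an RGB array."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    return (r + g + b) / 3.0, r - b, (2 * g - r - b) / 2.0

def block_histograms(rgb, grid=(4, 4), bins=32):
    """Concatenated per-block, per-channel histograms over a coarse grid."""
    h, w, _ = rgb.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = rgb[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            for chan, rng in zip(ohta_channels(block),
                                 [(0, 255), (-255, 255), (-255, 255)]):
                hist, _ = np.histogram(chan, bins=bins, range=rng)
                feats.append(hist / chan.size)
    return np.concatenate(feats)

# Each sub-block can then be classified independently and the per-block
# decisions combined, e.g., by a second-stage (stacked) classifier.
```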
Other approaches

Various other approaches exist in the scene recognition literature, as reviewed below.

Semantic Typicality

The concepts of typicality and prototype have made a significant impact in cognitive science. See for example the work pioneered by Eleanor Rosch and her collaborators, (Rosch 1973), (Rosch and Mervis 1975), (Rosch et al. 1976). In computer vision, (Vogel and Schiele 2004) introduces an interesting concept of semantic typicality for categorizing real-world natural scenes. The proposed typicality measure is used to grade the similarity of an image with respect to a scene category. Typicality is defined as a measure of the uncertainty of the annotation judgment. This is an important concept because many natural scenes are ambiguous, and the categorization accuracy sometimes reflects the opinion of the particular person who performed the annotation. Therefore, the authors believe that attention should be directed to modeling the typicality of a particular scene after manual annotation. The semantic typicality measure is used to find the similarity of natural real-world scenes with respect to six scene categories: coasts, rivers/lakes, forests, plains, mountains and sky/clouds.

The typicality-based approach is evaluated on an image database of 700 natural scenes. The attribute score is a representation which is predictive of typicality. Typicality is a function of frequency of occurrence; that is, the items deemed most typical have attributes that are very common to the category. Local semantic concepts act as scene category attributes. They are calculated from sub-regions which are represented by a combined 84-bin linear histogram in the HSI color space and a 72-bin edge direction histogram. Classification is done by a k-nearest neighbor classifier. The categorization experiment was carried out using manually annotated images from the database. By analyzing the semantic similarities and dissimilarities of the aforementioned categories, a set of nine local semantic concepts emerged as being most discriminant: sky, water, grass, trunks, foliage, fields, rocks, flowers, and sand. The local semantic concepts were extracted on a 10 × 10 grid of image sub-regions, and the frequency of occurrence in a particular image was represented by a concept occurrence vector. For each category, a category prototype is defined as the most typical example for that category, constructed as the mean over the concept occurrence vectors of the category members. The image typicality was measured by computing the Mahalanobis distance between the image's concept occurrence vector and the prototypical representation in order to classify the image as a particular scene.
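The prototype and typicality computation can be sketched compactly once the concept occurrence vectors are available. The sketch below estimates one covariance per category; this is an assumption for illustration (a pooled covariance would also be a reasonable choice), not necessarily the exact estimator used in the paper.

```python
# Sketch: category prototypes from concept-occurrence vectors, and
# classification of a new image by Mahalanobis distance to each prototype.
import numpy as np

def category_prototype(occurrence_vectors):
    """Mean vector and (pseudo-)inverse covariance of a category's members."""
    X = np.asarray(occurrence_vectors, dtype=float)
    return X.mean(axis=0), np.linalg.pinv(np.cov(X, rowvar=False))

def mahalanobis(x, prototype, inv_cov):
    d = np.asarray(x, dtype=float) - prototype
    return float(np.sqrt(d @ inv_cov @ d))

def classify(x, prototypes):
    """`prototypes` maps a category name to (mean, inverse covariance)."""
    return min(prototypes, key=lambda c: mahalanobis(x, *prototypes[c]))
```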
Configural Recognition

The goal of (Lipson, Grimson, and Sinha 1997) is to classify scenes based on their content. Most of the available solutions for scene recognition rely on color histograms and local texture statistics. The authors state that these features cannot capture a scene's global configuration. To overcome this, they present a novel approach, which they call configural recognition, for encoding scene class structure in images. The configural recognition scheme encodes class models as a set of salient low-resolution image regions and salient qualitative relations between the regions. An example of such a qualitative relationship is: given three regions, a blue region (A), a white region (B) and a gray region (C), snow-capped mountain scenes always have region A above region B, which is above region C.

The class models are described using seven types of relative relationships between image patches, each taking the values less than, greater than, or equal to. The relationships encoded are the relative color between image regions, the relative luminance between the patches, the spatial relationships (relative horizontal and vertical descriptions) and the relative size of the patches. Based on this, each region in the image is grouped into directional equivalence classes, such as above and below.

The generated models act as deformable templates. When compared with the image, a model can be deformed by moving the patches around so that the model best matches the image in terms of relative luminance and photometric attributes. An improvement to this system would be, instead of hand-crafting the models, an automated process that takes a set of example images and generates a set of templates describing the relevant relationships within the example set.

A fuzzy part-based model was described in (Miyajima and Ralescu 1993), and fuzzy sets have also been widely and effectively used for spatial descriptors in an image. A very powerful formal model, based on fuzzy sets, for the description of spatial relations in an image was introduced in (Miyajima and Ralescu 1994) and further extended by (Bloch 1999). A comparison of the fuzzy approaches for the description of directional relative position between objects in an image can be found in (Bloch and Ralescu 2003), and a review of these approaches can be found in (Bloch 2005). Furthermore, more recently, fuzzy spatial relations were integrated in deformable models and applied to MRI images (Colliot, Camara, and Bloch 2006).

In (Ralescu and Baldwin 1987) a new approach for concept learning from examples and counter-examples, with applications to a vision learning system, later extended to a general concept learning problem (Ralescu and Baldwin 1989), was developed. It makes use of Conceptual Structures (Sowa 1983) for knowledge representation, and support logic programming (Baldwin 1986) for inference. Examples of a concept (e.g., a 'car') are used to construct a memory aggregate (MA), which, rather than averaging all examples, keeps track of various probability distributions of the object features. Counter-examples, i.e., descriptions that are very similar to a concept but fail to be instances of that concept, are used in a similar manner to construct a counter-example memory aggregate (CMA). Matching between conceptual structures describing an object candidate and the MA and CMA produces supports for and against the recognition of a concept. The result is therefore qualified by a support pair, whose values (1, 1) mean complete recognition, (0, 0) complete rejection, and (0, 1) total uncertainty.

Deformable part based models

In (Pandey and Lazebnik 2011), the authors comment that weakly supervised discovery of common visual structure in highly variable and cluttered images presents a major problem in recognition. In order to address this problem, the authors propose using deformable part-based models (DPM) with latent SVM training. For scene recognition, deformable part-based models capture recurring visual elements and salient objects. The DPM represents an object by a low-resolution root filter and a set of high-resolution part filters in a flexible spatial configuration. The image is represented by a variation of histogram of oriented gradients (HOG) features, which are used to classify scenes using a linear SVM.

Covariance descriptor

The paper (Yang et al. 2016) proposes a supervised collaborative kernel coding method based on the covariance descriptor (covd) for scene-level geographic image classification. The covariance descriptor is a covariance matrix of different features such as color, spatial location, and gradient; it is rotation and scale invariant, but it lies in a Riemannian (i.e., non-Euclidean) space, and therefore the traditional computational and mathematical models used in Euclidean space cannot be applied directly.

The major contribution of this paper is a supervised kernel coding model that transforms covd into a discriminative feature representation and obtains a corresponding linear classifier. The method can be seen as a three-step process. The first step is to extract the covd features from the geographical scene image. In the second step, supervised collaborative kernel coding, involving dictionary coefficients in the coding representation phase and the linear classification phase, is performed. Lastly, in the classification stage, a label vector is derived based on the dictionary coefficients and the learned linear classifier. A novel objective function is proposed to combine the collaborative kernel coding phase and the classification phase. This method gives satisfying performance on a high-resolution aerial image dataset, proving to be an efficient method for scene-level geographic image classification.
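To make the covariance descriptor concrete, the sketch below computes a region covariance from a few simple per-pixel features (position, intensity, gradient magnitudes). The log-Euclidean vectorization at the end is one common way of handling the Riemannian geometry of covariance matrices; it is our assumption for illustration and is not the kernel coding scheme used in the paper.

```python
# Sketch of a region covariance descriptor (illustrative feature choice).
import numpy as np
from scipy.ndimage import sobel
from scipy.linalg import logm

def covariance_descriptor(gray):
    """Covariance of per-pixel features (x, y, intensity, |Ix|, |Iy|)."""
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([xs, ys, gray,
                      np.abs(sobel(gray.astype(float), axis=1)),
                      np.abs(sobel(gray.astype(float), axis=0))], axis=-1)
    return np.cov(feats.reshape(-1, feats.shape[-1]), rowvar=False)

def log_euclidean_vector(cov, eps=1e-6):
    """Map the covariance matrix to a Euclidean vector via the matrix log."""
    cov = cov + eps * np.eye(cov.shape[0])   # ensure positive definiteness
    L = np.real(logm(cov))
    return L[np.triu_indices_from(L)]        # upper triangle as a vector
```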
Shape of the scene

The paper (Oliva and Torralba 2001) takes a very different approach to scene recognition: rather than looking at the scene as a configuration of objects, the paper proposes to consider the scene as an individual object with a unitary shape. A computational model is proposed to find the shape of the scene using a few perceptual dimensions specifically dedicated to describing its spatial properties. It is shown that these holistic spatial scene properties, called Spatial Envelope (SE) properties, may be reliably estimated using spectral and coarsely localized information.

Given an environment V, its spatial envelope SE(V) is defined as the composite set of boundaries, such as walls, sections, and elevations, that define the shape of the space. A group of 17 observers was asked to categorize 81 images into categories based on some global aspect. Based on the classification results, the criteria for classification of scenes were agreed to be the degree of naturalness, degree of openness, degree of roughness, degree of expansion and degree of ruggedness. The purpose of the spatial envelope model is therefore to show that modeling these five spatial properties is adequate for a high-level description of the scene. Their algorithm learns the spectral signatures (the global energy spectrum and the spectrogram) of basic scene categories from labeled training data. A learning algorithm (regression) is then used to find the relation between the global features and the spectral features.
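In the same holistic spirit, a gist-flavored descriptor can be sketched as oriented filter energies averaged over a coarse spatial grid; a regressor on top of this vector would then estimate spatial-envelope-style properties. The Gabor parameters, grid size, and filter count below are assumptions for illustration, not the settings of the original model.

```python
# Sketch of a gist-flavored holistic descriptor: Gabor energies on a coarse grid.
import numpy as np
import cv2

def gist_like_descriptor(gray, orientations=8, scales=(4, 8, 16), grid=4):
    """Mean Gabor-filter energy in each cell of a grid x grid partition."""
    gray = np.float32(gray)
    h, w = gray.shape
    feats = []
    for lam in scales:
        for k in range(orientations):
            theta = np.pi * k / orientations
            kernel = cv2.getGaborKernel((31, 31), lam / 2.0, theta, lam, 0.5)
            energy = np.abs(cv2.filter2D(gray, cv2.CV_32F, kernel))
            bh, bw = h // grid, w // grid
            for i in range(grid):
                for j in range(grid):
                    feats.append(energy[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean())
    return np.asarray(feats)
```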
Beyond Scene recognition

Certain algorithms detect scenes and then use scene recognition as a prior in order to find more structure in the image, thus motivating further study in the field of scene analysis.

In Using the forest to see the trees (Murphy et al. 2003) an intuitive approach to detecting the presence of objects based on the detected scene is presented. The approach is suggested by psychological evidence that people perform rapid global scene analysis before conducting more detailed local object analysis. Based on this, the authors propose to use the whole image as a global feature in order to overcome ambiguities which might occur at the local level. They extend the notion of gist from (Oliva and Torralba 2006) by combining the prior suggested by the gist with the output of bottom-up local object detectors which are trained using boosting. They also use the same set of features for object detection in the image. The image is divided into patches at different scales (an image pyramid) and each patch is convolved with 13 zero-mean filters, which include oriented edges, a Laplacian filter, corner detectors and long edge detectors. This is represented by two statistics, variance and kurtosis, derived from the histogram of image patches at two scales and with 30 spatial masks. The kurtosis is omitted for scene recognition. The features are further reduced in dimensionality using PCA to give the PCA-gist. A one-vs-all binary classifier is trained for recognizing each type of scene using boosting applied to the gist. They take this further by using the scene as a latent common cause upon which the presence of an object is conditionally dependent.
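The scene-as-prior idea just described can be summarized with a small, heavily simplified sketch (our paraphrase, not the graphical model of the paper): a gist-based scene posterior supplies p(scene), an assumed table supplies p(object present | scene), and the local detector score is reweighted accordingly.

```python
# Illustrative combination of a scene prior with a local detector score.
def object_presence_posterior(scene_probs, presence_given_scene, detector_score):
    """scene_probs: scene -> p(scene | gist); presence_given_scene: scene ->
    p(object present | scene) (assumed learned); detector_score in [0, 1]."""
    prior = sum(scene_probs[s] * presence_given_scene[s] for s in scene_probs)
    # Simple product combination followed by renormalization (illustrative only).
    on = prior * detector_score
    off = (1.0 - prior) * (1.0 - detector_score)
    return on / (on + off + 1e-12)

# Made-up numbers: a weak "car" detection becomes more plausible when the gist
# says the scene is a street rather than a beach.
scene_probs = {"street": 0.7, "beach": 0.3}
p_car_given_scene = {"street": 0.6, "beach": 0.05}
print(object_presence_posterior(scene_probs, p_car_given_scene, 0.4))
```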
Understanding the whole image, or holistic scene understanding, is described in (Yao, Fidler, and Urtasun 2012). The idea is to jointly evaluate and draw conclusions about the location, regions, class and spatial information of objects, the presence of a class in an image, and the scene type, i.e., to recover and connect multiple different aspects of a scene. This problem is framed as a prediction problem in a graphical model defined over hierarchies of regions of different sizes, with auxiliary variables encoding the scene type, the presence of a given class in the scene, and the correctness of bounding boxes obtained by the object detector. Class labels are assigned to image segments at two different levels of the segmentation hierarchy, namely segments and large super-segments. Binary variables indicate which classes are present in the image and a multi-labeled variable represents the scene type. Segments and super-segments are used to assign semantic class labels to each pixel in the image. Super-segments are used to create long-range dependencies and they also prove to be more efficient computationally. A holistic loss function is defined as a weighted sum of losses from each task. State-of-the-art performance is achieved on the MSRC-21 benchmark, and the approach is much faster than existing approaches.

Another very interesting approach is presented in (Li and Fei-Fei 2007), which goes beyond scene recognition to event recognition. An event in a static image is defined as a human activity taking place in a specific environment. The objective is to recognize/classify the event in the image as well as to provide a number of semantic labels for the objects and the scene environment within the image. It is assumed that, conditioned on the event, scene and objects are independent of each other, but the presence of both influences the probability of predicting the event. For scene recognition they adopt a model similar to the Bayesian model of (Fei-Fei and Perona 2005). Scene recognition heavily influences the event recognition, and, in fact, as a first approximation, event recognition is essentially scene recognition. The robust bag of words model is used in order to recognize objects. In addition to scene and object recognition, they recognize the importance of the layout of the image in accurately identifying the event. They use some simple geometric cues to define the layout of the image and manage to provide integrative and hierarchical labels to an image by performing the what (event), where (scene) and who (object) recognition of the entire scene, using a generative model to represent the image.

An extensive Scene Understanding (SUN) database consisting of 899 categories and 130,519 images is created in (Xiao et al. 2010). This work is motivated by the authors' belief that the existing data sets for scene classification fail to capture the richness and diversity of daily life environments. The authors claim to have built the most complete dataset, with a number of different scene image categories with different functionalities that are important enough to have unique identities. They measure human performance on scene classification and compare it with state-of-the-art algorithms using the SUN database. Both humans and algorithms made errors, with humans erring between semantically similar categories, while algorithms erred between semantically unrelated scenes due to spurious visual matches. It was also recorded that the best features agree more with correct human classifications and make the same mistakes as humans do. The computational algorithms need a much larger number of features to perform as well as humans. The authors also propose the notion of recognizing scene types within images rather than labeling an entire image with a single scene, because the real world often contains combinations of scenes. This is an interesting new idea and could be one of the directions in which future scene recognition algorithms progress.

Conclusion

It can be seen, even from the limited number of papers reviewed here, that image understanding and scene recognition can be approached from various directions. At a very high level, the approaches can be divided into two main categories: using low-level features, and using object recognition. However, many other techniques are integrated into each of these approaches, including probabilistic and/or fuzzy techniques, in order to deal with the uncertainty which often attends the result of image understanding. When it comes to evaluating the low-level feature approach against the object recognition approach, the goal of the image understanding task must be taken into account. Scene recognition performs better when low-level features are used; local features help override the effects of occluded objects and low lighting conditions. The most commonly used features for scene detection include texture, texture orientation and strength, the 'gist' of the image, SIFT descriptors, edge orientation, histograms in different color spaces (e.g., Ohta, HSI, RGB), histograms of angles between segmented regions, and coefficients of the shift-invariant DCT. These features can be successfully mapped into semantic image descriptors.

References

Baldwin, J. F. 1986. Support logic programming. In Fuzzy Sets Theory and Applications. Springer. 133–170.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Bloch, I., and Ralescu, A. 2003. Directional relative position between objects in image processing: a comparison between fuzzy approaches. Pattern Recognition 36(7):1563–1582.

Bloch, I. 1999. Fuzzy relative position between objects in image processing: a morphological approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(7):657–664.

Bloch, I. 2005. Fuzzy spatial relationships for image processing and interpretation: a review. Image and Vision Computing 23(2):89–110.

Colliot, O.; Camara, O.; and Bloch, I. 2006. Integration of fuzzy spatial relations in deformable models: application to brain MRI segmentation. Pattern Recognition 39(8):1401–1414.

Csurka, G.; Dance, C.; Fan, L.; Willamowski, J.; and Bray, C. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, 1–2. Prague.

Fei-Fei, L., and Perona, P. 2005. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, 524–531. IEEE.

Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645.

Gorkani, M. M., and Picard, R. W. 1994. Texture orientation for sorting photos "at a glance". In Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 1, 459–464. IEEE.

Grauman, K., and Darrell, T. 2007. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research 8(Apr):725–760.

Guérin-Dugué, A., and Oliva, A. 2000. Classification of scene photographs from local orientations features. Pattern Recognition Letters 21(13):1135–1140.

Hoiem, D.; Efros, A. A.; and Hebert, M. 2005. Automatic photo pop-up. ACM Transactions on Graphics (TOG) 24(3):577–584.

Julesz, B. 1981. Textons, the elements of texture perception, and their interactions. Nature 290(5802):91–97.

Julesz, B. 1986. Texton gradients: The texton theory revisited. Biological Cybernetics 54(4):245–251.

Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, 2169–2178. IEEE.

Li, L.-J., and Fei-Fei, L. 2007. What, where and who? Classifying events by scene and object recognition. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 1–8. IEEE.

Li, F. F.; VanRullen, R.; Koch, C.; and Perona, P. 2003. Natural scene categorization in the near absence of attention: further explorations. Journal of Vision 3(9):331–331.

Li, L.-J.; Su, H.; Fei-Fei, L.; and Xing, E. P. 2010. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in Neural Information Processing Systems, 1378–1386.

Lipson, P.; Grimson, E.; and Sinha, P. 1997. Configuration based scene classification and image indexing. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, 1007–1013. IEEE.

Lowe, D. G. 1999. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, 1150–1157. IEEE.
Marr, D. 1977. Artificial intelligence – a personal view. Artificial Intelligence 9(1):37–48.

Marr, D. 1982. Vision: A computational approach.

Miyajima, K., and Ralescu, A. 1993. Modeling of natural objects including fuzziness and application to image understanding. In Fuzzy Systems, 1993., Second IEEE International Conference on, 1049–1054. IEEE.

Miyajima, K., and Ralescu, A. 1994. Spatial organization in 2D segmented images: representation and recognition of primitive spatial relations. Fuzzy Sets and Systems 65(2-3):225–236.

Murphy, K.; Torralba, A.; Freeman, W.; et al. 2003. Using the forest to see the trees: a graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems 16:1499–1506.

Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145–175.

Oliva, A., and Torralba, A. 2006. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research 155:23–36.

Pandey, M., and Lazebnik, S. 2011. Scene recognition and weakly supervised object localization with deformable part-based models. In Computer Vision (ICCV), 2011 IEEE International Conference on, 1307–1314. IEEE.

Ralescu, A. L., and Baldwin, J. F. 1987. Concept learning from examples with applications to a vision learning system. In Alvey Vision Conference, 1–8.

Ralescu, A. L., and Baldwin, J. F. 1989. Concept learning from examples and counter examples. International Journal of Man-Machine Studies 30(3):329–354.

Ralescu, A. L. 1995. Image understanding = verbal description of the image contents. SOFT, Journal of the Japanese Society for Fuzzy Theory 7(4):739–746.

Renninger, L. W., and Malik, J. 2004. When is scene identification just texture recognition? Vision Research 44(19):2301–2311.

Rock, I. 1990. The perceptual world. Scientific American 127.

Rosch, E., and Mervis, C. B. 1975. Family resemblances: Studies in the internal structure of categories. Cognitive Psychology 7(4):573–605.

Rosch, E.; Mervis, C. B.; Gray, W. D.; Johnson, D. M.; and Boyes-Braem, P. 1976. Basic objects in natural categories. Cognitive Psychology 8(3):382–439.

Rosch, E. H. 1973. Natural categories. Cognitive Psychology 4(3):328–350.

Serrano, N.; Savakis, A. E.; and Luo, J. 2004. Improved scene classification using efficient low-level features and semantic cues. Pattern Recognition 37(9):1773–1784.

Singhal, A.; Luo, J.; and Zhu, W. 2003. Probabilistic spatial context models for scene content understanding. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, I–I. IEEE.

Sivic, J., and Zisserman, A. 2009. Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4):591–606.

Sowa, J. F. 1983. Conceptual structures: information processing in mind and machine.

Szummer, M., and Picard, R. W. 1998. Indoor-outdoor image classification. In Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on, 42–51. IEEE.

Thorpe, S.; Fize, D.; and Marlot, C. 1996. Speed of processing in the human visual system. Nature 381(6582):520.

Van De Sande, K.; Gevers, T.; and Snoek, C. 2010. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1582–1596.

Vogel, J., and Schiele, B. 2004. A semantic typicality measure for natural scene categorization. In Joint Pattern Recognition Symposium, 195–203. Springer.

Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 3485–3492. IEEE.

Yang, C.; Liu, H.; Wang, S.; and Liao, S. 2016. Scene-level geographic image classification based on a covariance descriptor using supervised collaborative kernel coding. Sensors 16(3):392.

Yao, J.; Fidler, S.; and Urtasun, R. 2012. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 702–709. IEEE.