Vineeta Singh et al. MAICS 2017, pp. 85–91

Image understanding - a brief review of scene classification and recognition

Vineeta Singh, Deeptha Girish, and Anca Ralescu
EECS Department, ML 0030
University of Cincinnati
Cincinnati, OH 45221, USA
singhvi@mail.uc.edu, girishde@mail.uc.edu, Anca.Ralescu@uc.edu

Abstract

With over 40 years of history, image understanding, in particular scene classification and recognition, remains central to machine vision. With an abundance of image and video databases, it is necessary to be able to sort and retrieve images and videos in a way that is both efficient and effective. This is possible only if the categories of images and/or their context are known to a user. Hence, the ability to classify and recognize scenes accurately is of utmost importance. This paper presents a brief survey of the advances in scene recognition and classification algorithms.

Depending on its goal, image understanding (IU) can be defined in many different ways. However, in general, IU means describing the image content, the objects in it, the location of and relations between objects, and, most recently, the events in an image. In (Ralescu 1995) IU is equated with producing a verbal description of the image content. Scene analysis (as part of IU) and categorization is a highly useful ability of humans, who are able to categorize complex natural scenes containing animals or vehicles very quickly (Thorpe, Fize, and Marlot 1996), with little or no attention (Li et al. 2003). When a scene is presented to humans, they are able to identify it within a short period of exposure (< 100 ms). How humans perform all of these tasks the way they do is yet to be fully understood. To date, the classic text by Marr (Marr 1982) remains one of the main sources for understanding the human vision system.

Many researchers have tried to build this incredible capability of the human vision system into their algorithms for image processing, scene understanding and recognition. In the presence of a wealth of literature on this and related subjects, surveys of the field, even limited ones such as the present one (due to space constraints), are bound to be very useful by reviewing the methods for scene recognition and classification.

Perhaps the first issue to consider is the concept of scene as a technical concept that captures the natural one. According to Xiao et al. (Xiao et al. 2010), a scene is a place in which a human can act, or a place to which a human being could navigate. Therefore, scene recognition and scene classification algorithms must delve into understanding the semantic context of the scene. According to how a scene is recognized in an image, scene recognition algorithms can be broadly divided into two categories:

• Scene recognition based on object detection.
• Scene recognition using low-level image features.
Scene recognition using object recognition (SR-OR)

Using object recognition for scene classification is a straightforward and intuitive approach, and it can assist in distinguishing very complex scenes which might otherwise prove difficult to separate using standard low-level features.

In the paper by Li-Jia Li et al. (Li et al. 2010) the authors note that "robust low-level image features have been proven to be effective representations for scene classification; but pixels, or even local image patches, carry little semantic meanings. For high level visual tasks, such low-level image representations are potentially not enough." To combat this drawback of local features, they propose a high-level image representation, called the Object Bank (OB), in which an image is represented by integrating its responses to various object detectors. These object detectors or filters are blind to the testing dataset or visual task. Using the OB representation, superior performance on high-level visual recognition tasks can be achieved with simple regularized logistic regression. Their algorithm uses the object detectors of Felzenszwalb et al. (Felzenszwalb et al. 2010), as well as the geometric context classifiers (stuff detectors) of Hoiem et al. (Hoiem, Efros, and Hebert 2005), for pre-training the object detectors.

OB offers a rich set of object features, while presenting a challenge: a curse of dimensionality due to the presence of multiple classes of objects within a single image, which yields feature vectors of very high dimension. The performance of the system plateaus when the number of object detection filters is too high; according to the authors, the system performance is best when the number of object filters is moderate.
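For illustration only (this is a sketch of the general idea, not the authors' code), an Object Bank-style descriptor can be built by max-pooling detector responses over several image scales and training a regularized logistic regression on top. The `detectors` list, the scales, and the classifier hyperparameters below are all assumptions made for the example.

```python
# Sketch of an Object Bank-style descriptor (illustrative, not the original code).
# `detectors` is assumed to be a list of callables, each mapping a grayscale
# image (2-D array) to a 2-D detection-score map, e.g., pre-trained detectors.
import numpy as np
from scipy.ndimage import zoom
from sklearn.linear_model import LogisticRegression

def object_bank_descriptor(image, detectors, scales=(1.0, 0.5, 0.25)):
    """Concatenate max-pooled detector responses collected over several scales."""
    feats = []
    for detect in detectors:
        for s in scales:
            response = detect(zoom(image, s))   # 2-D score map at this scale
            feats.append(response.max())        # spatial max pooling
    return np.asarray(feats)

def train_scene_classifier(X, y):
    """Simple regularized logistic regression over OB-style feature rows."""
    return LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
```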
Scene recognition using low-level image features (SR-LLF)

Many of the papers in scene recognition are built around the question: can we recognize the context of a scene without having first recognized the objects that are present? There are several reasons for avoiding object recognition for the purpose of scene recognition. While there are many robust OR algorithms, using SR-OR can be problematic because the OR portion of the algorithm is treated as a black box, and therefore OR errors propagate to the SR stage. OR also faces problems due to lighting conditions and occlusion. To avoid this, many studies use low-level features for scene understanding.

The challenge in SR-LLF is to find low-level features in the image that can be used successfully to infer its semantic context. Among the many features that can be extracted from an image for the purpose of scene recognition, texture, orientation, and color have been used extensively in the literature, implemented with different data sets and different classifiers.

In (Renninger and Malik 2004) an algorithm which mimics the human ability to identify scenes with limited exposure is presented. The algorithm is based on a simple texture analysis of the image, which can provide a useful cue for rapid scene identification. The relevant features within a texture are the first-order statistics of textons, which determine the strength of texture discrimination. This idea is derived from Julesz's work (Julesz 1981), (Julesz 1986) (for a discussion of Julesz's work see (Marr 1977)). According to Julesz, textons are the elements in the image that govern our perception of texture. They are calculated by convolving the image with certain filter banks. The texton-based model learns the local texture features which correspond to various scene categories; this is done by filtering a set of 250 training images and then learning the prototypical distributions. The number of occurrences of each feature within a particular image is stored as a histogram, creating a holistic texture descriptor for the image. To learn the most prevalent features, they use k-means clustering to find the 100 prototypical responses. When identifying a new test image, its histogram is matched against stored examples. It is concluded that early scene identification can be explained with a simple texture recognition model. This model leads to identifications and confusions similar to those of a human subject.
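A minimal sketch of such a texton pipeline, under our own assumptions (a small Gaussian-derivative filter bank and illustrative parameter values, not the filters used in the paper), is given below: per-pixel filter responses are clustered with k-means into textons, and each image is summarized by its normalized texton histogram.

```python
# Texton-style sketch (our illustration): per-pixel filter-bank responses,
# k-means "textons", and a per-image texton histogram.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans

def filter_responses(image, sigmas=(1, 2, 4)):
    """Per-pixel responses to a small Gaussian-derivative filter bank."""
    maps = []
    for s in sigmas:
        for order in [(0, 1), (1, 0), (2, 0), (0, 2)]:
            maps.append(gaussian_filter(image.astype(float), s, order=order))
    return np.stack(maps, axis=-1).reshape(-1, len(sigmas) * 4)

def learn_textons(training_images, k=100):
    """Cluster pooled filter responses into k prototypical responses (textons)."""
    X = np.vstack([filter_responses(im) for im in training_images])
    return KMeans(n_clusters=k, n_init=10).fit(X)

def texton_histogram(image, textons):
    """Normalized histogram of texton assignments for one image."""
    labels = textons.predict(filter_responses(image))
    hist = np.bincount(labels, minlength=textons.n_clusters)
    return hist / hist.sum()

# A new image is then identified by matching its histogram against stored
# training histograms, e.g., with a nearest-neighbour comparison.
```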
The same objective, i.e., understanding human perception of scenes, is pursued in (Gorkani and Picard 1994). The paper investigates a measure of dominant perceived orientation developed to match the output of a human study involving 40 subjects. These global multi-scale orientation features were used to detect vacation photos belonging to "city/suburb". The authors state that orientation is an important feature for texture recognition and discrimination. The algorithm finds the local orientation and its strength at each pixel of the image. The implementation extracts orientation information over multiple scales using a steerable pyramid (Rock 1990) and then combines the orientations from these different scales and decides which one is dominant perceptually. The reported results show that the orientation features achieve agreement with the human classification in 91 of the 98 scenes (i.e., approximately 93%).

The paper by Guérin-Dugué et al. (Guérin-Dugué and Oliva 2000) uses an approach similar to Gorkani and Picard (Gorkani and Picard 1994), but extends it with more categories and introduces the selection of the optimal scale for this categorization task. They use local dominant orientation (LDO) features for classifying real-world scenes into four categories (outdoor urban scenes, indoor scenes, closed landscapes and open landscapes). Instead of using the LDO features directly, they propose a compact coding into a few features by Fourier series decomposition, and introduce a spatial scale parameter to optimize the categorization. For each scale and spatial location the dominant orientation and its strength are estimated. The best discrimination ratios were obtained with a representation at a median spatial scale or when combining two different scales.

The paper (Csurka et al. 2004) presents a bag of keypoints approach to visual categorization. The procedure starts with the detection and description of image patches. Then a vocabulary of image descriptors is created by applying a vector quantization algorithm. SIFT descriptors (Lowe 1999) are used as features for this algorithm. This is followed by constructing a bag of keypoints, which counts the number of patches assigned to each cluster. Finally, a multi-class classifier (SVM) is implemented, treating the bag of keypoints as the feature vector and thus determining which category the image belongs to.
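The bag-of-keypoints pipeline just described can be sketched in a few lines. This is an illustration under assumptions (OpenCV's SIFT implementation as the patch descriptor, a 1000-word vocabulary, a linear SVM), not the exact configuration used by Csurka et al.

```python
# Sketch of a bag-of-keypoints pipeline: local descriptors, k-means vocabulary,
# per-image visual-word histogram, and an SVM on top (parameters assumed).
import numpy as np
import cv2
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()   # requires an OpenCV build that provides SIFT

def local_descriptors(gray_image):
    """128-D SIFT descriptors for one 8-bit grayscale image."""
    _, desc = sift.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(images, k=1000):
    """Vector-quantize pooled descriptors into k visual words."""
    all_desc = np.vstack([local_descriptors(im) for im in images])
    return KMeans(n_clusters=k, n_init=4).fit(all_desc)

def bag_of_keypoints(image, vocabulary):
    """Normalized histogram counting patches assigned to each visual word."""
    words = vocabulary.predict(local_descriptors(image))
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)

# Multi-class categorization over the histograms, e.g.:
# clf = SVC(kernel="linear").fit(train_histograms, train_labels)
```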
It is clear that such counts, or histograms, suggest that scene recognition and analysis could benefit from probabilistic approaches. Indeed, some algorithms use probability models to describe the scene based on the extracted features.

The paper A Bayesian Hierarchical Model for Learning Natural Scene Categories (Fei-Fei and Perona 2005) uses low-level texture features as image descriptors. Each patch of the input image is represented using a code word (similar to the bag of keypoints approach). The code word is taken from a codebook, a large vocabulary of code words, obtained from 650 training examples from 13 categories (with around 50 images for each category). In this framework, the local regions are first clustered into intermediate themes and then into categories. The learning algorithm for the model that best represents the distribution of code words over scenes is a modified Latent Dirichlet Allocation model (Blei, Ng, and Jordan 2003). Unlike traditional scene models, where there is a hard assignment of an image to one theme, the algorithm produces a collection of themes that could be associated with an image.

In (Singhal, Luo, and Zhu 2003) a probabilistic approach is used for content detection within the scene. The labels generated are very similar to scene labels. The authors present a holistic approach to determine the scene content, based on a set of individual material detection algorithms as well as probabilistic spatial context models. Material detection is the problem of identifying key semantic objects such as sky, grass, foliage, water, and snow in images. In order to detect materials, the algorithm combines low-level features with unique region analysis and inputs this to a classifier to obtain individual material belief maps. To avoid misclassification of materials in images, they devise a spatial-context-aware material detection system which constrains the beliefs to conform to the probabilistic spatial context models.

The bag of keypoints model (Sivic and Zisserman 2009) corresponds to a histogram of the number of occurrences of particular image patterns in a given image. Most papers mentioned above use this concept in some form. It is adapted from the bag of words model in natural language processing.

In (Lazebnik, Schmid, and Ponce 2006) the authors argue that, in spite of impressive levels of performance, the bag of features model represents the image as an orderless collection of local features, thereby disregarding all information about their spatial layout. To overcome this, they devise a method for recognizing scene categories based on approximate global geometric correspondence. They compute a spatial pyramid by partitioning the image into increasingly fine sub-regions and computing histograms of the local features found in each sub-region. The spatial pyramid is an extension of the orderless bag of features model of image representation, which is improved upon by the introduction of a kernel-based recognition method. This method works by computing a rough geometric correspondence on a global scale using an approximation technique adapted from the pyramid matching scheme of (Grauman and Darrell 2007). The method involves repeatedly subdividing the image and computing histograms of local features at increasingly fine resolutions. The spatial pyramid approach can be thought of as an alternative formulation of a locally orderless image, where a fixed hierarchy of rectangular windows is defined. The spatial pyramid framework is based on the idea that the best results will be achieved when multiple resolutions are combined in a principled way. The features calculated are subdivided into weak features (oriented edge points) and strong features (SIFT descriptors). K-means clustering is performed on a random subset of patches from the training set to form a visual vocabulary. Multi-class classification is done with the support vector machine (SVM), trained using the one-versus-all rule.
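The spatial pyramid construction can be sketched directly from per-patch visual-word assignments. The sketch below uses the standard level weighting from the pyramid match scheme; the keypoint locations, word ids, and the choice of two pyramid levels are assumptions for illustration.

```python
# Sketch of a spatial pyramid over visual words: weighted word histograms over
# a 1x1, 2x2 and 4x4 grid, concatenated into one vector (illustrative only).
import numpy as np

def spatial_pyramid(points, words, image_size, vocab_size, levels=2):
    """points: (x, y) keypoint coordinates; words: their visual-word ids in
    [0, vocab_size). Returns the concatenated, weighted cell histograms."""
    width, height = image_size
    points = np.asarray(points, dtype=float)
    words = np.asarray(words)
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Standard pyramid-match weights: coarse levels count less than fine ones.
        weight = 1.0 / 2 ** levels if level == 0 else 1.0 / 2 ** (levels - level + 1)
        col = np.minimum((points[:, 0] / width * cells).astype(int), cells - 1)
        row = np.minimum((points[:, 1] / height * cells).astype(int), cells - 1)
        for r in range(cells):
            for c in range(cells):
                mask = (row == r) & (col == c)
                hist = np.bincount(words[mask], minlength=vocab_size)
                feats.append(weight * hist)
    return np.concatenate(feats)
```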
Though fewer algorithms use color-based features, in certain cases this descriptor is very powerful in discriminating scenes. Color descriptors can be used for scene and object recognition (Van De Sande, Gevers, and Snoek 2010) in order to increase illumination invariance and discriminative power. Theoretical and experimental results show that invariance to light intensity changes and light color changes affects category recognition. Various color descriptors were analyzed and evaluated: descriptors based on histograms, color moments, moment invariants, and color SIFT were compared, and it was concluded that SIFT-based descriptors performed considerably better than histogram- and moment-based descriptors.

Indoor-Outdoor classification

In (Szummer and Picard 1998) the authors show that high-level scene properties can be inferred from classification of low-level features, specifically for the indoor-outdoor scene retrieval problem. Their algorithm extracts three types of features: 1) histograms in the Ohta color space, 2) multi-resolution simultaneous autoregressive (MSAR) model parameters, and 3) coefficients of a shift-invariant DCT. They show that performance is improved by computing features on sub-blocks, classifying these sub-blocks, and then combining the results by stacking.

The paper by Serrano et al. (Serrano, Savakis, and Luo 2004) uses simplified low-level features to predict the semantic category of scenes. These are integrated probabilistically using a Bayesian network to give a final indoor/outdoor classification. Low-dimensional color and wavelet texture features are used to classify scenes using the support vector machine (SVM). The wavelet texture features are used here instead of the popular MSAR texture features to reduce the computational complexity.
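Both indoor-outdoor pipelines above start from simple per-sub-block color features. A minimal sketch of Ohta-space sub-block histograms follows; the 4 × 4 blocking and the bin counts are assumptions made for the example, not the values used in either paper.

```python
# Sketch of sub-block color features in the Ohta color space (illustrative).
import numpy as np

def ohta_channels(rgb):
    """Ohta channels I1=(R+G+B)/3, I2=R-B, I3=(2G-R-B)/2 for an RGB array."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    return (r + g + b) / 3.0, r - b, (2 * g - r - b) / 2.0

def block_histograms(rgb, grid=(4, 4), bins=32):
    """Concatenated per-block, per-channel histograms over a coarse grid."""
    h, w, _ = rgb.shape
    bh, bw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = rgb[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            for chan, rng in zip(ohta_channels(block),
                                 [(0, 255), (-255, 255), (-255, 255)]):
                hist, _ = np.histogram(chan, bins=bins, range=rng)
                feats.append(hist / chan.size)
    return np.concatenate(feats)

# Each sub-block can then be classified independently and the per-block
# decisions combined, e.g., by a second-stage (stacked) classifier.
```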
Other approaches

Various other approaches exist in the scene recognition literature, as reviewed below.

Semantic Typicality

The concepts of typicality and prototype have made a significant impact in cognitive science. See for example the work pioneered by Eleanor Rosch and her collaborators, (Rosch 1973), (Rosch and Mervis 1975), (Rosch et al. 1976). In computer vision, (Vogel and Schiele 2004) introduces an interesting concept of semantic typicality for categorizing real-world natural scenes. The proposed typicality measure is used to grade the similarity of an image with respect to a scene category. Typicality is defined as a measure of the uncertainty of the annotation judgment. This is an important concept because many natural scenes are ambiguous, and the categorization accuracy sometimes reflects the opinion of the particular person who performed the annotation. Therefore, the authors believe that attention should be directed to modeling the typicality of a particular scene after manual annotation. The semantic typicality measure is used to find the similarity of natural real-world scenes with respect to six scene categories: coasts, rivers/lakes, forests, plains, mountains and sky/clouds.

The typicality-based approach is evaluated on an image database of 700 natural scenes. The attribute score is a representation which is predictive of typicality. Typicality is a function of frequency of occurrence; that is, the items deemed most typical have attributes that are very common to the category. Local semantic concepts act as scene category attributes. They are calculated from sub-regions which are represented by a combined 84-bin linear histogram in the HSI color space and a 72-bin edge direction histogram. Classification is done by a k-nearest neighbor classifier. The categorization experiment was carried out using manually annotated images from the database. By analyzing the semantic similarities and dissimilarities of the aforementioned categories, a set of nine local semantic concepts emerged as being most discriminant: sky, water, grass, trunks, foliage, fields, rocks, flowers, and sand. The local semantic concepts were extracted on a 10 × 10 grid of image sub-regions, and the frequency of occurrence in a particular image was represented by a concept occurrence vector. For each category, a category prototype is defined as the most typical example for that category, constructed as the mean over the concept occurrence vectors of the category members. The image typicality was measured by computing the Mahalanobis distance between the image's concept occurrence vector and the prototypical representation in order to classify the image as a particular scene.
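The prototype and typicality computation can be sketched compactly once the concept occurrence vectors are available. The sketch below estimates one covariance per category; this is an assumption for illustration (a pooled covariance would also be a reasonable choice), not necessarily the exact estimator used in the paper.

```python
# Sketch: category prototypes from concept-occurrence vectors, and
# classification of a new image by Mahalanobis distance to each prototype.
import numpy as np

def category_prototype(occurrence_vectors):
    """Mean vector and (pseudo-)inverse covariance of a category's members."""
    X = np.asarray(occurrence_vectors, dtype=float)
    return X.mean(axis=0), np.linalg.pinv(np.cov(X, rowvar=False))

def mahalanobis(x, prototype, inv_cov):
    d = np.asarray(x, dtype=float) - prototype
    return float(np.sqrt(d @ inv_cov @ d))

def classify(x, prototypes):
    """`prototypes` maps a category name to (mean, inverse covariance)."""
    return min(prototypes, key=lambda c: mahalanobis(x, *prototypes[c]))
```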
Configural Recognition

The goal of (Lipson, Grimson, and Sinha 1997) is to classify scenes based on their content. Most of the available solutions for scene recognition rely on color histograms and local texture statistics. The authors state that these features cannot capture a scene's global configuration. To overcome this, they present a novel approach, which they call configural recognition, for encoding scene class structure in images. The configural recognition scheme encodes class models as a set of salient low-resolution image regions and salient qualitative relations between the regions. An example of such a qualitative relationship is: given three regions, a blue region (A), a white region (B) and a gray region (C), snow-capped mountain scenes always have region A above region B, which is above region C.

The class models are described using seven types of relative relationships between image patches, each taking the values less than, greater than, or equal to. The relationships encoded are the relative color between image regions, the relative luminance between the patches, the spatial relationships (relative horizontal and vertical descriptions) and the relative size of the patches. Based on this, each region in the image is grouped into directional equivalence classes, such as above and below.

The generated models act as deformable templates. When compared with the image, a model can be deformed by moving the patches around so that the model best matches the image in terms of relative luminance and photometric attributes. An improvement to this system would be, instead of hand-crafting the models, an automated process that takes a set of example images and generates a set of templates describing the relevant relationships within the example set.

A fuzzy part-based model was described in (Miyajima and Ralescu 1993), and fuzzy sets have also been widely and effectively used for spatial descriptors in an image. A very powerful formal model, based on fuzzy sets, for the description of spatial relations in an image was introduced in (Miyajima and Ralescu 1994) and further extended by (Bloch 1999). A comparison of the fuzzy approaches for the description of directional relative position between objects in an image can be found in (Bloch and Ralescu 2003), and a review of these approaches can be found in (Bloch 2005). Furthermore, more recently, fuzzy spatial relations were integrated in deformable models and applied to MRI images (Colliot, Camara, and Bloch 2006).

In (Ralescu and Baldwin 1987) a new approach for concept learning from examples and counter-examples, with applications to a vision learning system, later extended to a general concept learning problem (Ralescu and Baldwin 1989), was developed. It makes use of Conceptual Structures (Sowa 1983) for knowledge representation, and support logic programming (Baldwin 1986) for inference. Examples of a concept (e.g., a 'car') are used to construct a memory aggregate (MA), which, rather than averaging all examples, keeps track of various probability distributions of the object features. Counter-examples, i.e., descriptions that are very similar to a concept but fail to be instances of that concept, are used in a similar manner to construct a counter-example memory aggregate (CMA). Matching between conceptual structures describing an object candidate and the MA and CMA produces supports for and against the recognition of a concept. The result is therefore qualified by a support pair, whose values (1, 1) mean complete recognition, (0, 0) complete rejection, and (0, 1) total uncertainty.

Deformable part based models

In (Pandey and Lazebnik 2011), the authors comment that weakly supervised discovery of common visual structure in highly variable and cluttered images presents a major problem in recognition. In order to address this problem, the authors propose using deformable part-based models (DPM) with latent SVM training. For scene recognition, deformable part-based models capture recurring visual elements and salient objects. The DPM represents an object by a low-resolution root filter and a set of high-resolution part filters in a flexible spatial configuration. The image is represented by a variation of histogram of oriented gradients (HOG) features, which are used to classify scenes using a linear SVM.

Covariance descriptor

The paper (Yang et al. 2016) proposes a supervised collaborative kernel coding method based on the covariance descriptor (covd) for scene-level geographic image classification. The covariance descriptor is a covariance matrix of different features such as color, spatial location, and gradient; it is rotation and scale invariant, but it lies in a Riemannian (i.e., non-Euclidean) space, and therefore the traditional computational and mathematical models used in Euclidean space cannot be applied directly.

The major contribution of this paper is a supervised kernel coding model that transforms covd into a discriminative feature representation and obtains a corresponding linear classifier. The method can be seen as a three-step process. The first step is to extract the covd features from the geographical scene image. In the second step, supervised collaborative kernel coding, involving dictionary coefficients in the coding representation phase and the linear classification phase, is performed. Lastly, in the classification stage, a label vector is derived based on the dictionary coefficients and the learned linear classifier. A novel objective function is proposed to combine the collaborative kernel coding phase and the classification phase. This method gives satisfying performance on a high-resolution aerial image dataset, proving to be an efficient method for scene-level geographic image classification.
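To make the covariance descriptor concrete, the sketch below computes a region covariance from a few simple per-pixel features (position, intensity, gradient magnitudes). The log-Euclidean vectorization at the end is one common way of handling the Riemannian geometry of covariance matrices; it is our assumption for illustration and is not the kernel coding scheme used in the paper.

```python
# Sketch of a region covariance descriptor (illustrative feature choice).
import numpy as np
from scipy.ndimage import sobel
from scipy.linalg import logm

def covariance_descriptor(gray):
    """Covariance of per-pixel features (x, y, intensity, |Ix|, |Iy|)."""
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([xs, ys, gray,
                      np.abs(sobel(gray.astype(float), axis=1)),
                      np.abs(sobel(gray.astype(float), axis=0))], axis=-1)
    return np.cov(feats.reshape(-1, feats.shape[-1]), rowvar=False)

def log_euclidean_vector(cov, eps=1e-6):
    """Map the covariance matrix to a Euclidean vector via the matrix log."""
    cov = cov + eps * np.eye(cov.shape[0])   # ensure positive definiteness
    L = np.real(logm(cov))
    return L[np.triu_indices_from(L)]        # upper triangle as a vector
```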
Shape of the scene

The paper (Oliva and Torralba 2001) takes a very different approach to scene recognition: rather than looking at the scene as a configuration of objects, the paper proposes to consider the scene as an individual object with a unitary shape. A computational model is proposed to find the shape of the scene using a few perceptual dimensions specifically dedicated to describing its spatial properties. It is shown that these holistic spatial scene properties, called Spatial Envelope (SE) properties, may be reliably estimated using spectral and coarsely localized information.

Given an environment V, its spatial envelope SE(V) is defined as the composite set of boundaries, such as walls, sections, and elevations, that define the shape of the space. A group of 17 observers was asked to categorize 81 images into categories based on some global aspect. Based on the classification results, the criteria for classification of scenes were agreed to be the degree of naturalness, degree of openness, degree of roughness, degree of expansion and degree of ruggedness. The purpose of the spatial envelope model is therefore to show that modeling these five spatial properties is adequate for a high-level description of the scene. Their algorithm learns the spectral signatures (the global energy spectrum and the spectrogram) of basic scene categories from labeled training data. A learning algorithm (regression) is then used to find the relation between the global features and the spectral features.
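In the same holistic spirit, a gist-flavored descriptor can be sketched as oriented filter energies averaged over a coarse spatial grid; a regressor on top of this vector would then estimate spatial-envelope-style properties. The Gabor parameters, grid size, and filter count below are assumptions for illustration, not the settings of the original model.

```python
# Sketch of a gist-flavored holistic descriptor: Gabor energies on a coarse grid.
import numpy as np
import cv2

def gist_like_descriptor(gray, orientations=8, scales=(4, 8, 16), grid=4):
    """Mean Gabor-filter energy in each cell of a grid x grid partition."""
    gray = np.float32(gray)
    h, w = gray.shape
    feats = []
    for lam in scales:
        for k in range(orientations):
            theta = np.pi * k / orientations
            kernel = cv2.getGaborKernel((31, 31), lam / 2.0, theta, lam, 0.5)
            energy = np.abs(cv2.filter2D(gray, cv2.CV_32F, kernel))
            bh, bw = h // grid, w // grid
            for i in range(grid):
                for j in range(grid):
                    feats.append(energy[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean())
    return np.asarray(feats)
```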
Beyond Scene recognition

Certain algorithms detect scenes and then use scene recognition as a prior in order to find more structure in the image, thus motivating further study in the field of scene analysis.

In Using the forest to see the trees (Murphy et al. 2003) an intuitive approach to detecting the presence of objects based on the detected scene is presented. The approach is suggested by psychological evidence that people perform rapid global scene analysis before conducting more detailed local object analysis. Based on this, the authors propose to use the whole image as a global feature in order to overcome ambiguities which might occur at the local level. They extend the notion of gist from (Oliva and Torralba 2006) by combining the prior suggested by the gist with the output of bottom-up local object detectors which are trained using boosting. They also use the same set of features for object detection in the image. The image is divided into patches at different scales (an image pyramid) and each patch is convolved with 13 zero-mean filters, which include oriented edges, a Laplacian filter, corner detectors and long edge detectors. This is represented by two statistics, variance and kurtosis, derived from the histogram of image patches at two scales and with 30 spatial masks. The kurtosis is omitted for scene recognition. The features are further reduced in dimensionality using PCA to give the PCA-gist. A one-vs-all binary classifier is trained for recognizing each type of scene using boosting applied to the gist. They take this further by using the scene as a latent common cause upon which the presence of an object is conditionally dependent.
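The scene-as-prior idea just described can be summarized with a small, heavily simplified sketch (our paraphrase, not the graphical model of the paper): a gist-based scene posterior supplies p(scene), an assumed table supplies p(object present | scene), and the local detector score is reweighted accordingly.

```python
# Illustrative combination of a scene prior with a local detector score.
def object_presence_posterior(scene_probs, presence_given_scene, detector_score):
    """scene_probs: scene -> p(scene | gist); presence_given_scene: scene ->
    p(object present | scene) (assumed learned); detector_score in [0, 1]."""
    prior = sum(scene_probs[s] * presence_given_scene[s] for s in scene_probs)
    # Simple product combination followed by renormalization (illustrative only).
    on = prior * detector_score
    off = (1.0 - prior) * (1.0 - detector_score)
    return on / (on + off + 1e-12)

# Made-up numbers: a weak "car" detection becomes more plausible when the gist
# says the scene is a street rather than a beach.
scene_probs = {"street": 0.7, "beach": 0.3}
p_car_given_scene = {"street": 0.6, "beach": 0.05}
print(object_presence_posterior(scene_probs, p_car_given_scene, 0.4))
```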
Understanding the whole image, or holistic scene understanding, is described in (Yao, Fidler, and Urtasun 2012). The idea is to jointly evaluate and draw conclusions about the location, regions, class and spatial information of objects, the presence of a class in an image, and the scene type, i.e., to recover and connect multiple different aspects of a scene. This problem is framed as a prediction problem in a graphical model defined over hierarchies of regions of different sizes, with auxiliary variables encoding the scene type, the presence of a given class in the scene, and the correctness of bounding boxes obtained by the object detector. Class labels are assigned to image segments at two different levels of the segmentation hierarchy, namely segments and large super-segments. Binary variables indicate which classes are present in the image and a multi-labeled variable represents the scene type. Segments and super-segments are used to assign semantic class labels to each pixel in the image. Super-segments are used to create long-range dependencies and they also prove to be more efficient computationally. A holistic loss function is defined as a weighted sum of losses from each task. State-of-the-art performance is achieved on the MSRC-21 benchmark, and the approach is much faster than existing approaches.

Another very interesting approach is presented in (Li and Fei-Fei 2007), which goes beyond scene recognition to event recognition. An event in a static image is defined as a human activity taking place in a specific environment. The objective is to recognize/classify the event in the image as well as to provide a number of semantic labels for the objects and the scene environment within the image. It is assumed that, conditioned on the event, scene and objects are independent of each other, but the presence of both influences the probability of predicting the event. For scene recognition they adopt a model similar to the Bayesian model of (Fei-Fei and Perona 2005). Scene recognition heavily influences the event recognition, and, in fact, as a first approximation, event recognition is essentially scene recognition. The robust bag of words model is used in order to recognize objects. In addition to scene and object recognition, they recognize the importance of the layout of the image in accurately identifying the event. They use some simple geometric cues to define the layout of the image and manage to provide integrative and hierarchical labels to an image by performing the what (event), where (scene) and who (object) recognition of the entire scene, using a generative model to represent the image.

An extensive Scene Understanding (SUN) database consisting of 899 categories and 130,519 images is created in (Xiao et al. 2010). This work is motivated by the authors' belief that the existing data sets for scene classification fail to capture the richness and diversity of daily life environments. The authors claim to have built the most complete dataset, with a number of different scene image categories with different functionalities that are important enough to have unique identities. They measure human performance on scene classification and compare it with state-of-the-art algorithms using the SUN database. Both humans and algorithms made errors, with humans erring between semantically similar categories, while algorithms erred between semantically unrelated scenes due to spurious visual matches. It was also recorded that the best features agree more with correct human classifications and make the same mistakes as humans do. The computational algorithms need a much larger number of features to perform as well as humans. The authors also propose the notion of recognizing scene types within images rather than labeling an entire image with a single scene, because the real world often contains combinations of scenes. This is an interesting new idea and could be one of the directions in which future scene recognition algorithms progress.

Conclusion

It can be seen, even from the limited number of papers reviewed here, that image understanding and scene recognition can be approached from various directions. At a very high level, the approaches can be divided into two main categories: using low-level features, and using object recognition. However, many other techniques are integrated into each of these approaches, including probabilistic and/or fuzzy techniques, in order to deal with the uncertainty which often attends the result of image understanding. When it comes to evaluating the low-level feature approach against the object recognition approach, the goal of the image understanding task must be taken into account. Scene recognition performs better when low-level features are used; local features help override the effects of occluded objects and low lighting conditions. The most commonly used features for scene detection include texture, texture orientation and strength, the 'gist' of the image, SIFT descriptors, edge orientation, histograms in different color spaces (e.g., Ohta, HSI, RGB), histograms of angles between segmented regions, and coefficients of the shift-invariant DCT. These features can be successfully mapped into semantic image descriptors.

References

Baldwin, J. F. 1986. Support logic programming. In Fuzzy Sets Theory and Applications. Springer. 133–170.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Bloch, I., and Ralescu, A. 2003. Directional relative position between objects in image processing: a comparison between fuzzy approaches. Pattern Recognition 36(7):1563–1582.

Bloch, I. 1999. Fuzzy relative position between objects in image processing: a morphological approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(7):657–664.

Bloch, I. 2005. Fuzzy spatial relationships for image processing and interpretation: a review. Image and Vision Computing 23(2):89–110.

Colliot, O.; Camara, O.; and Bloch, I. 2006. Integration of fuzzy spatial relations in deformable models: application to brain MRI segmentation. Pattern Recognition 39(8):1401–1414.

Csurka, G.; Dance, C.; Fan, L.; Willamowski, J.; and Bray, C. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, 1–2. Prague.

Fei-Fei, L., and Perona, P. 2005. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, 524–531. IEEE.

Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645.

Gorkani, M. M., and Picard, R. W. 1994. Texture orientation for sorting photos "at a glance". In Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 1, 459–464. IEEE.

Grauman, K., and Darrell, T. 2007. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research 8(Apr):725–760.

Guérin-Dugué, A., and Oliva, A. 2000. Classification of scene photographs from local orientations features. Pattern Recognition Letters 21(13):1135–1140.

Hoiem, D.; Efros, A. A.; and Hebert, M. 2005. Automatic photo pop-up. ACM Transactions on Graphics (TOG) 24(3):577–584.

Julesz, B. 1981. Textons, the elements of texture perception, and their interactions. Nature 290(5802):91–97.

Julesz, B. 1986. Texton gradients: The texton theory revisited. Biological Cybernetics 54(4):245–251.

Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, 2169–2178. IEEE.

Li, L.-J., and Fei-Fei, L. 2007. What, where and who? Classifying events by scene and object recognition. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 1–8. IEEE.

Li, F. F.; VanRullen, R.; Koch, C.; and Perona, P. 2003. Natural scene categorization in the near absence of attention: further explorations. Journal of Vision 3(9):331–331.

Li, L.-J.; Su, H.; Fei-Fei, L.; and Xing, E. P. 2010. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in Neural Information Processing Systems, 1378–1386.

Lipson, P.; Grimson, E.; and Sinha, P. 1997. Configuration based scene classification and image indexing. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, 1007–1013. IEEE.

Lowe, D. G. 1999. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, 1150–1157. IEEE.
Marr, D. 1977. Artificial intelligence – a personal view. Artificial Intelligence 9(1):37–48.

Marr, D. 1982. Vision: A computational approach.

Miyajima, K., and Ralescu, A. 1993. Modeling of natural objects including fuzziness and application to image understanding. In Fuzzy Systems, 1993., Second IEEE International Conference on, 1049–1054. IEEE.

Miyajima, K., and Ralescu, A. 1994. Spatial organization in 2D segmented images: representation and recognition of primitive spatial relations. Fuzzy Sets and Systems 65(2-3):225–236.

Murphy, K.; Torralba, A.; Freeman, W.; et al. 2003. Using the forest to see the trees: a graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems 16:1499–1506.

Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145–175.

Oliva, A., and Torralba, A. 2006. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research 155:23–36.

Pandey, M., and Lazebnik, S. 2011. Scene recognition and weakly supervised object localization with deformable part-based models. In Computer Vision (ICCV), 2011 IEEE International Conference on, 1307–1314. IEEE.

Ralescu, A. L., and Baldwin, J. F. 1987. Concept learning from examples with applications to a vision learning system. In Alvey Vision Conference, 1–8.

Ralescu, A. L., and Baldwin, J. F. 1989. Concept learning from examples and counter examples. International Journal of Man-Machine Studies 30(3):329–354.

Ralescu, A. L. 1995. Image understanding = verbal description of the image contents. SOFT, Journal of the Japanese Society for Fuzzy Theory 7(4):739–746.

Renninger, L. W., and Malik, J. 2004. When is scene identification just texture recognition? Vision Research 44(19):2301–2311.

Rock, I. 1990. The perceptual world. Scientific American 127.

Rosch, E., and Mervis, C. B. 1975. Family resemblances: Studies in the internal structure of categories. Cognitive Psychology 7(4):573–605.

Rosch, E.; Mervis, C. B.; Gray, W. D.; Johnson, D. M.; and Boyes-Braem, P. 1976. Basic objects in natural categories. Cognitive Psychology 8(3):382–439.

Rosch, E. H. 1973. Natural categories. Cognitive Psychology 4(3):328–350.

Serrano, N.; Savakis, A. E.; and Luo, J. 2004. Improved scene classification using efficient low-level features and semantic cues. Pattern Recognition 37(9):1773–1784.

Singhal, A.; Luo, J.; and Zhu, W. 2003. Probabilistic spatial context models for scene content understanding. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, I–I. IEEE.

Sivic, J., and Zisserman, A. 2009. Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4):591–606.

Sowa, J. F. 1983. Conceptual structures: information processing in mind and machine.

Szummer, M., and Picard, R. W. 1998. Indoor-outdoor image classification. In Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on, 42–51. IEEE.

Thorpe, S.; Fize, D.; and Marlot, C. 1996. Speed of processing in the human visual system. Nature 381(6582):520.

Van De Sande, K.; Gevers, T.; and Snoek, C. 2010. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1582–1596.

Vogel, J., and Schiele, B. 2004. A semantic typicality measure for natural scene categorization. In Joint Pattern Recognition Symposium, 195–203. Springer.

Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 3485–3492. IEEE.

Yang, C.; Liu, H.; Wang, S.; and Liao, S. 2016. Scene-level geographic image classification based on a covariance descriptor using supervised collaborative kernel coding. Sensors 16(3):392.

Yao, J.; Fidler, S.; and Urtasun, R. 2012. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 702–709. IEEE.