<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Image understanding - a brief review of scene classification and recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vineeta Singh</string-name>
          <email>singhvi@mail.uc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deeptha Girish</string-name>
          <email>girishde@mail.uc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anca Ralescu</string-name>
          <email>Anca.Ralescu@uc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EECS Department, ML 0030, University of Cincinnati, Cincinnati</institution>
          ,
          <addr-line>OH 45221</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>85</fpage>
      <lpage>91</lpage>
      <abstract>
        <p>With over 40 years of history, image understanding, and in particular scene classification and recognition, remains central to machine vision. With an abundance of image and video databases, it is necessary to be able to sort and retrieve images and videos in a way that is both efficient and effective. This is possible only if the categories of the images and/or their context are known to a user. Hence, the ability to classify and recognize scenes accurately is of utmost importance. This paper presents a brief survey of the advances in scene recognition and classification algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Depending on its goal, image understanding (IU) can be
defined in many different ways. In general, however, IU
means describing the image content: the objects in it, their
locations and the relations between them, and, most recently,
the events taking place in an image. In
        <xref ref-type="bibr" rid="ref37">(Ralescu 1995)</xref>
        IU is
equated with producing a verbal description of the image
content. Scene analysis (as part of IU) and categorization is
a highly useful ability of humans, who are able to categorize
complex natural scenes containing animals or vehicles very
quickly
        <xref ref-type="bibr" rid="ref50">(Thorpe, Fize, and Marlot 1996)</xref>
        , with little or no
attention
        <xref ref-type="bibr" rid="ref19">(Li et al. 2003)</xref>
        . When a scene is presented to
humans, they are able to identify it within a very short
period of exposure (&lt; 100 ms). How humans perform
all of these tasks is yet to be fully understood. To
date, the classic text by Marr
        <xref ref-type="bibr" rid="ref26">(Marr 1982)</xref>
        remains a key source for understanding the human visual
system.
      </p>
      <p>Many researchers have tried to incorporate this remarkable
capability of the human vision system into their algorithms for
image processing, scene understanding and recognition. Given
the wealth of literature on this and related subjects, surveys
of the field, even a limited one such as the present paper
(necessarily so, due to space constraints), are bound to be
useful in reviewing the methods for scene recognition
and classification.</p>
      <p>
        Perhaps the first issue to consider is the concept of scene
as a technical concept that captures the natural one.
According to Xiao et al.
        <xref ref-type="bibr" rid="ref53">(Xiao et al. 2010)</xref>
        a scene is a place
in which a human can act, or a place to which a
human being could navigate. Therefore, scene recognition and
scene classification algorithms must delve into
understanding the semantic context of the scene. According to how
a scene is recognized in an image, scene recognition
algorithms can be broadly divided into two categories:
• scene recognition based on object detection;
• scene recognition using low-level image features.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Scene recognition using object recognition (SR-OR)</title>
      <p>Using object recognition for scene classification is a
straightforward and intuitive approach, and it can help
distinguish very complex scenes
which might otherwise prove difficult to separate using standard
low-level features.</p>
      <p>
        In the paper by Li-Jia Li et al.
        <xref ref-type="bibr" rid="ref21">(Li et al. 2010)</xref>
        the authors
argue that “robust low-level image features have
been proven to be effective representations for scene
classification; but pixels, or even local image patches, carry little
semantic meanings. For high level visual tasks, such low-level
image representations are potentially not enough.” To
combat this drawback of local features, they propose a high-level
image representation, called the Object Bank (OB), in which an
image is represented by integrating its responses
to various object detectors. These object detectors or
filters are blind to the testing dataset or visual task. Using the OB
representation, superior performance on high-level visual
recognition tasks can be achieved with simple regularized
logistic regression. Their algorithm uses the current
state-of-the-art object detectors of Felzenszwalb et al.
        <xref ref-type="bibr" rid="ref10">(Felzenszwalb
et al. 2010)</xref>
        , as well as the geometric context classifiers (stuff
detectors) of Hoiem et al.
        <xref ref-type="bibr" rid="ref14 ref8">(Hoiem, Efros, and Hebert 2005)</xref>
        for pre-training the object detectors.
      </p>
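      <p>To make the OB idea concrete, the following hedged sketch (not the authors' implementation) pools the response map of each object detector over coarse spatial grids and concatenates the pooled values into one feature vector; here detectors is a hypothetical list of callables, each returning a per-pixel response map for one object category.</p>
      <preformat>
import numpy as np

def object_bank_feature(image, detectors, grids=(1, 2)):
    """Pool per-object detector responses over coarse spatial grids.

    `detectors` is a hypothetical list of callables, each mapping an
    image to a 2-D response map for one object category (a stand-in
    for the pre-trained part-based and "stuff" detectors used by OB).
    """
    features = []
    for detect in detectors:
        response = detect(image)                  # 2-D response map, shape (H, W)
        h, w = response.shape
        for cells in grids:                       # e.g. 1x1 and 2x2 grids
            for i in range(cells):
                for j in range(cells):
                    block = response[i * h // cells:(i + 1) * h // cells,
                                     j * w // cells:(j + 1) * w // cells]
                    features.append(block.max())  # max-pool each grid cell
    return np.asarray(features)

# The dimension grows with the number of detectors times the number of grid
# cells, which is one way to see the curse of dimensionality noted below.
      </preformat>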
      <p>OB offers a rich set of object features, but it also presents a
challenge – the curse of dimensionality: the presence of
multiple classes of objects within a single image
yields feature vectors of very high dimension. The
performance of the system plateaus when the number of
object detection filters becomes too high; according to the authors,
performance is best when the number of object
filters is moderate.</p>
    </sec>
    <sec id="sec-3">
      <title>Scene recognition using low-level image features (SR-LLF)</title>
      <p>Many of the papers on scene recognition are built around the
question, ‘Can we recognize the context of a scene without
having first recognized the objects that are present?’ There
are several reasons for avoiding object recognition for the
purpose of scene recognition. While there are many robust
OR algorithms, using SR-OR can be problematic because
the OR portion of the algorithm is treated as a black box, and
therefore OR errors propagate to the SR stage. OR
also faces problems due to lighting conditions and occlusion.
To avoid these issues, many studies use low-level features for
scene understanding.</p>
      <p>The challenge in SR-LLF is to find low-level features in
the image that can be used successfully to infer its
semantic context. Among the many features that can be extracted from
an image for the purpose of scene recognition, texture,
orientation, and color have been used extensively in the literature,
implemented with different data sets and different
classifiers.</p>
      <p>
        In
        <xref ref-type="bibr" rid="ref38 ref52">(Renninger and Malik 2004)</xref>
        an algorithm which
mimics the human ability to identify scenes from limited
exposure is presented. The algorithm is based on a simple
texture analysis of the image, which can provide a useful cue
for rapid scene identification. The relevant features within a
texture are the first-order statistics of textons, which
determine the strength of texture discrimination. This idea is derived
from Julesz's work
        <xref ref-type="bibr" rid="ref15">(Julesz 1981)</xref>
        <xref ref-type="bibr" rid="ref16">(Julesz 1986)</xref>
        (for a
discussion of Julesz's work see
        <xref ref-type="bibr" rid="ref24">(Marr 1977)</xref>
        ). According to
Julesz, textons are the elements in the image that govern our
perception of texture. They are calculated by convolving the
image with certain filter banks. The texton-based model
learns the local texture features which correspond to
various scene categories; this is done by filtering a set of 250
training images and then learning the prototypical
distributions. The number of occurrences of each feature within a
particular image is stored as a histogram, creating a holistic
texture descriptor for the image. To learn the most prevalent
features, they use k-means clustering to find the 100
prototypical responses. When identifying a new test image, its
histogram is matched against stored examples. It is
concluded that early scene identification can be explained with
a simple texture recognition model, which leads to
identifications and confusions similar to those of a human subject.
      </p>
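      <p>A hedged sketch of a texton-style pipeline is given below; it assumes a small Gaussian-derivative filter bank rather than Renninger and Malik's exact filters, and uses scikit-learn's k-means to learn 100 prototypical responses. A test image would then be identified by matching its texton histogram against the stored training histograms.</p>
      <preformat>
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def filter_responses(image, sigmas=(1.0, 2.0, 4.0)):
    """Per-pixel responses to a simple stand-in filter bank."""
    image = np.asarray(image, dtype=float)
    responses = []
    for s in sigmas:
        responses.append(ndimage.gaussian_filter(image, s, order=(0, 1)))  # x-derivative
        responses.append(ndimage.gaussian_filter(image, s, order=(1, 0)))  # y-derivative
        responses.append(ndimage.gaussian_laplace(image, s))               # blob-like filter
    return np.stack(responses, axis=-1).reshape(-1, 3 * len(sigmas))

def learn_textons(training_images, n_textons=100):
    """Cluster pixel-wise filter responses into prototypical textons."""
    samples = np.vstack([filter_responses(im) for im in training_images])
    return KMeans(n_clusters=n_textons, n_init=4).fit(samples)

def texton_histogram(image, textons):
    """Holistic descriptor: normalized histogram of texton labels."""
    labels = textons.predict(filter_responses(image))
    hist = np.bincount(labels, minlength=textons.n_clusters).astype(float)
    return hist / hist.sum()
      </preformat>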
      <p>
        The same objective, i.e., understanding human perception
of scenes is pursued in
        <xref ref-type="bibr" rid="ref11 ref27">(Gorkani and Picard 1994)</xref>
        . The
paper investigates the measure of dominant perceived
orientation developed to match the output of a human study
involving 40 subjects. These global multi-scale orientation
features were used to detect vacation photos belonging to
the “city/suburb” category. The authors state that orientation is an
important feature for texture recognition and discrimination. The
algorithm finds the local orientation and its strength at each
pixel of the image. The implementation extracts orientation
information over multiple scales using a steerable pyramid
        <xref ref-type="bibr" rid="ref39">(Rock 1990)</xref>
        and then combines the orientations from these
different scales and decides which one is perceptually
dominant. The reported results show that the orientation features
achieve agreement with the human classification in 91 of
the 98 scenes (i.e., approximately 93%).
      </p>
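      <p>As a simplified, single-scale stand-in for the steerable-pyramid measure (the multi-scale combination and the perceptual dominance rule are omitted), the sketch below estimates a dominant orientation and its strength from a magnitude-weighted histogram of gradient orientations.</p>
      <preformat>
import numpy as np

def dominant_orientation(image, n_bins=36):
    """Dominant orientation and its strength from image gradients."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)        # orientation in [0, pi)
    hist, edges = np.histogram(angle, bins=n_bins, range=(0.0, np.pi),
                               weights=magnitude)
    peak = int(np.argmax(hist))
    strength = hist[peak] / (hist.sum() + 1e-12)     # fraction of gradient energy at the peak
    return edges[peak], strength
      </preformat>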
      <p>
        The paper by Guérin-Dugué and Oliva
        <xref ref-type="bibr" rid="ref13">(Guérin-Dugué and
Oliva 2000)</xref>
        uses an approach similar to Gorkani and Picard
        <xref ref-type="bibr" rid="ref11 ref27">(Gorkani and Picard 1994)</xref>
        but extends this approach with
more categories and introduces the selection of the optimal
scale for this categorization task. They use local dominant
orientation (LDO) features for classifying real-world scenes
into four categories (outdoor urban scenes, indoor, closed
landscapes and open landscapes). Instead of using the LDO
features directly, they propose compact coding in a few
features by Fourier series decomposition, and introduce the
spatial scale parameter to optimize the categorization. For each
scale and spatial location, the dominant orientation and its
strength are estimated. The best discrimination ratios were
obtained with a representation at a median spatial scale or
when combining two different scales.
      </p>
      <p>
        The paper
        <xref ref-type="bibr" rid="ref7">(Csurka et al. 2004)</xref>
        presents a bag of
keypoints approach to visual categorization. The procedure
starts with the detection and description of image patches. Then
a vocabulary of image descriptors is created by applying
a vector quantization algorithm. SIFT descriptors
        <xref ref-type="bibr" rid="ref23">(Lowe
1999)</xref>
        are used as features for this algorithm. This is
followed by constructing a bag of keypoints, which counts the
number of patches assigned to each cluster. Finally, a
multi-class classifier (SVM) is applied, treating the bag of
keypoints as the feature vector and thus determining which
category the image belongs to.
      </p>
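      <p>A hedged sketch of such a bag-of-keypoints pipeline is shown below, assuming an OpenCV build that provides SIFT; the vocabulary size, the k-means settings and the linear SVM are illustrative stand-ins rather than the exact choices of Csurka et al.</p>
      <preformat>
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()   # assumes an OpenCV build that ships SIFT

def sift_descriptors(gray_image):
    _, desc = sift.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.zeros((0, 128), dtype=np.float32)

def build_vocabulary(training_images, vocab_size=1000):
    """Vector-quantize all training descriptors into visual words."""
    all_desc = np.vstack([sift_descriptors(im) for im in training_images])
    return KMeans(n_clusters=vocab_size, n_init=3).fit(all_desc)

def bag_of_keypoints(gray_image, vocabulary):
    """Normalized histogram of visual-word assignments for one image."""
    words = vocabulary.predict(sift_descriptors(gray_image))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

# Training sketch (images are grayscale arrays, labels are integer classes):
# vocab = build_vocabulary(train_images)
# X = np.array([bag_of_keypoints(im, vocab) for im in train_images])
# clf = LinearSVC().fit(X, train_labels)
      </preformat>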
      <p>It is clear that counts, or histograms, suggest that scene
recognition and analysis could benefit from probabilistic
approaches. Indeed, some algorithms use probability models
to describe the scene based on the extracted features.</p>
      <p>
        The paper A Bayesian Hierarchical Model for Learning
Natural Scene Categories
        <xref ref-type="bibr" rid="ref8">(Fei-Fei and Perona 2005)</xref>
        uses
low-level texture features as image descriptors. Each patch
of the input image is represented using a code word
(similar to the bag of words approach). The code word is taken
from a codebook – a large vocabulary of code words –
obtained from 650 training examples from 13 categories (with
around 50 images for each category). In this framework,
initially the local regions are clustered into different
intermediate themes and then into categories. The learning
algorithm for obtaining the model that best represents the
distribution of code words for scenes is a modified latent
Dirichlet allocation model
        <xref ref-type="bibr" rid="ref2 ref3">(Blei, Ng, and Jordan 2003)</xref>
        . Unlike
traditional scene models where there is a hard assignment of an
image to one theme, the algorithm produces a collection of
themes that could be associated with an image.
      </p>
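      <p>The modified model of Fei-Fei and Perona is not available in standard libraries, but a hedged sketch using plain latent Dirichlet allocation (scikit-learn) over per-image codeword counts illustrates the soft assignment of an image to a collection of themes that the paragraph describes.</p>
      <preformat>
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# counts: array of shape (n_images, n_codewords); entry (i, w) is how many
# patches of image i were quantized to codeword w.
def fit_theme_model(counts, n_themes=40):
    lda = LatentDirichletAllocation(n_components=n_themes, max_iter=20)
    theme_mixtures = lda.fit_transform(counts)   # soft theme distribution per image
    return lda, theme_mixtures

# Each row of theme_mixtures is a distribution over intermediate themes,
# i.e. a collection of themes rather than one hard assignment; a scene
# classifier can then be trained on these mixtures.
      </preformat>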
      <p>
        In
        <xref ref-type="bibr" rid="ref3 ref45">(Singhal, Luo, and Zhu 2003)</xref>
        a probabilistic approach
is used for content detection within the scene. The labels
generated are very similar to scene labels. The authors
present a holistic approach to determine the scene content,
based on a set of individual material detection algorithms as
well as probabilistic spatial context models. Material
detection is the problem of identifying key semantic objects such
as sky, grass, foliage, water, and snow in images. In order to
detect materials the algorithm combines low-level features
with unique region analysis and inputs this to a classifier
to obtain individual material belief maps. To avoid
misclassification of materials in images, they devise a spatial
context-aware material detection system which constrains
the beliefs to conform to the probabilistic spatial context
models.
      </p>
      <p>
        The bag of keypoints model
        <xref ref-type="bibr" rid="ref47">(Sivic and Zisserman 2009)</xref>
        corresponds to a histogram of the number of occurrences
of particular image patterns in a given image. Most papers
mentioned above use this concept in some form. This is
adapted from the bag of words model in natural language
processing.
      </p>
      <p>
        In
        <xref ref-type="bibr" rid="ref17 ref31">(Lazebnik, Schmid, and Ponce 2006)</xref>
        the authors
argue that in spite of impressive levels of performance, the
bag of features model represents the image as an orderless
collection of local features, thereby disregarding all
information about the spatial layout of the features. To overcome
this aspect, they devise a method for recognizing scene
categories based on approximate global geometric
correspondence. They compute a spatial pyramid by partitioning the
image into increasingly fine sub-regions and computing
histograms of local features found in each sub-region. The
spatial pyramid is an extension of the orderless bag of features
model of image representation, which is improved upon by
the introduction of a kernel based recognition method. This
method works by computing a rough geometric
correspondence on a global scale using an approximation technique
adapted from the pyramid matching scheme of
        <xref ref-type="bibr" rid="ref12 ref18">(Grauman
and Darrell 2007)</xref>
        . This method involves repeatedly
subdividing the image and computing histograms of local
features at increasingly fine resolutions. The spatial pyramid
approach can be thought of as an alternative formulation of
a locally orderless image, where a fixed hierarchy of
rectangular windows is defined. The spatial pyramid framework is
based on the idea that the best results will be achieved when
multiple resolutions are combined in a principled way. The
computed features are divided into weak features
(oriented edge points) and strong features (SIFT descriptors).
K-means clustering is performed on a random subset of patches
from the training set to form a visual vocabulary.
Multi-class classification is done with a support vector machine
(SVM), trained using the one-versus-all rule.
      </p>
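      <p>A hedged sketch of the spatial-pyramid histogram itself (the pyramid-match kernel and the SVM training are omitted): given the visual-word index and (x, y) location of each local feature, histograms over increasingly fine grids are weighted and concatenated.</p>
      <preformat>
import numpy as np

def spatial_pyramid_histogram(word_ids, xy, image_shape, vocab_size, levels=3):
    """Concatenate weighted visual-word histograms over a 1x1, 2x2, 4x4 grid.

    word_ids : (N,) visual-word index of each local feature
    xy       : (N, 2) pixel coordinates (x, y) of each feature
    """
    h, w = image_shape
    parts = []
    for level in range(levels):
        cells = 2 ** level
        # pyramid-match style weighting: the finest level gets weight 1,
        # each coarser level half of the next finer one
        weight = 2.0 ** (level - levels + 1)
        col = np.minimum((xy[:, 0] * cells / w).astype(int), cells - 1)
        row = np.minimum((xy[:, 1] * cells / h).astype(int), cells - 1)
        for r in range(cells):
            for c in range(cells):
                in_cell = np.logical_and(row == r, col == c)
                hist = np.bincount(word_ids[in_cell], minlength=vocab_size)
                parts.append(weight * hist)
    vec = np.concatenate(parts).astype(float)
    return vec / (vec.sum() + 1e-12)
      </preformat>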
      <p>Though fewer algorithms use color-based features, in
certain cases this descriptor is very powerful for discriminating
scenes.</p>
      <p>
        Color descriptors can be used for scene and object
recognition
        <xref ref-type="bibr" rid="ref51">(Van De Sande, Gevers, and Snoek 2010)</xref>
        in order to
increase illumination invariance and discriminative power.
Theoretical and experimental results show that
invariance to light intensity changes and light color changes
affects category recognition accuracy. Various color descriptors
were analyzed and evaluated: descriptors based on
histograms, color moments, moment invariants and color SIFT.
It was concluded that SIFT-based descriptors
performed considerably better than
histogram- and moment-based descriptors.
      </p>
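      <p>A toy illustration of why the choice of color space matters for illumination invariance (not the descriptors evaluated in the paper): a joint RGB histogram changes when the light intensity is scaled, while a histogram over normalized rg chromaticity is approximately invariant to such scaling.</p>
      <preformat>
import numpy as np

def rgb_histogram(image, bins=8):
    """Joint RGB histogram; changes under light intensity scaling."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=bins, range=[(0, 255)] * 3)
    return hist.ravel() / hist.sum()

def rg_chromaticity_histogram(image, bins=16):
    """Histogram over r = R/(R+G+B), g = G/(R+G+B); intensity-scale invariant."""
    rgb = image.reshape(-1, 3).astype(float)
    s = rgb.sum(axis=1, keepdims=True) + 1e-12
    rg = (rgb / s)[:, :2]
    hist, _ = np.histogramdd(rg, bins=bins, range=[(0, 1), (0, 1)])
    return hist.ravel() / hist.sum()
      </preformat>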
    </sec>
    <sec id="sec-4">
      <title>Indoor-Outdoor classification</title>
      <p>
        In the paper by Szummer et al.
        <xref ref-type="bibr" rid="ref49">(Szummer and Picard 1998)</xref>
        the authors show that high-level scene properties can be
inferred from classification of low-level features, specifically
for the indoor-outdoor scene retrieval problem. Their
algorithm extracts three types of features: 1) histograms in the
Ohta color space, 2) multi-resolution simultaneous
autoregressive (MSAR) model parameters, and 3) coefficients of a shift-invariant
DCT. They show that performance is improved by
computing features on sub-blocks, classifying these sub-blocks,
and then combining the results by stacking.
      </p>
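      <p>A hedged sketch of the first feature type and the sub-block idea follows: the Ohta (I1, I2, I3) color transform and per-sub-block histograms. The MSAR and DCT features and the stacking classifier are not reproduced here, and the 4 x 4 grid is an assumption.</p>
      <preformat>
import numpy as np

def ohta_transform(rgb):
    """Ohta I1I2I3 color space: decorrelated linear combinations of R, G, B."""
    r, g, b = [rgb[..., i].astype(float) for i in range(3)]
    i1 = (r + g + b) / 3.0
    i2 = (r - b) / 2.0
    i3 = (2.0 * g - r - b) / 4.0
    return np.stack([i1, i2, i3], axis=-1)

def subblock_ohta_histograms(rgb, grid=4, bins=8):
    """Per-sub-block Ohta histograms; in the paper the sub-blocks are
    classified separately and their decisions combined by stacking."""
    ohta = ohta_transform(rgb)
    h, w, _ = ohta.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = ohta[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            for c in range(3):
                hist, _ = np.histogram(block[..., c], bins=bins)
                feats.append(hist / (hist.sum() + 1e-12))
    return np.concatenate(feats)
      </preformat>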
      <p>
        The paper
        <xref ref-type="bibr" rid="ref38 ref44 ref52">(Serrano, Savakis, and Luo 2004)</xref>
        by Serrano
et al. uses simplified low-level features to predict the
semantic category of scenes. This is integrated
probabilistically using a Bayesian network to give a final indoor/outdoor
classification. Low-dimensional color and wavelet texture
features are used to classify scenes using a support
vector machine (SVM). These wavelet texture features are used
here instead of the popular MSAR texture features to reduce
the computational complexity.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Other approaches</title>
      <p>Various other approaches exist in the literature of scene
recognition, as reviewed below.</p>
    </sec>
    <sec id="sec-6">
      <title>Semantic Typicality</title>
      <p>
        The concepts of typicality and prototype have made a
significant impact in cognitive science. See for example the work
pioneered by Eleanor Rosch and her collaborators,
        <xref ref-type="bibr" rid="ref43">(Rosch
1973)</xref>
        ,
        <xref ref-type="bibr" rid="ref40">(Rosch and Mervis 1975)</xref>
        ,
        <xref ref-type="bibr" rid="ref41">(Rosch et al. 1976)</xref>
        . In
computer vision,
        <xref ref-type="bibr" rid="ref38 ref52">(Vogel and Schiele 2004)</xref>
        introduces an
interesting concept of semantic typicality for categorizing
real-world natural scenes. The proposed typicality measure
is used to grade the similarity of an image with respect to
a scene category. Typicality is defined as a measure for the
uncertainty of annotation judgment. This is an important
concept because many natural scenes are ambiguous and the
categorization accuracy sometimes reflects the opinion of a
particular person who performed the annotation. Therefore,
the authors believe that attention should be directed to
modeling the typicality of a particular scene after manual
annotation. The semantic typicality measure is used to find the
similarity of natural real-world scenes with respect to six
scene categories: coasts, rivers/lakes, forests, plains,
mountains and sky/clouds.
      </p>
      <p>The typicality based approach is evaluated on an image
database of 700 natural scenes. The attribute score is a
representation which is predictive of typicality. Typicality
is a function of frequency of occurrence, that is, the items
deemed most typical have attributes that are very common to
the category. Local semantic concepts act as scene category
attributes. They are calculated from the sub-regions which
are represented by a combined 84-bin linear histogram in
the HSI color space, and a 72-bin edge direction histogram.
Classification is done by a k-nearest neighbor classifier. The
categorization experiment was carried out using manually
annotated images from the database. By analyzing the
semantic similarities and dissimilarities of the aforementioned
categories a set of nine local semantic concepts emerged as
being most discriminant: sky, water, grass, trunks, foliage,
fields, rocks, flowers, and sand. The local semantic concepts
were extracted on a 10 × 10 grid of image sub-regions and
the frequency of occurrence in a particular image was
represented by a concept occurrence vector. For each category, a
category prototype is defined as the most typical example for
that category, constructed as the mean over the
concept occurrence vectors of the category members. The
image typicality was measured by computing the Mahalanobis
distance between the images’ concept occurrence vector and
the prototypical representation in order to classify the image
as a particular scene.</p>
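      <p>A hedged sketch of the typicality computation described above: the category prototype is the mean concept-occurrence vector of the category members, and an image's (inverse) typicality with respect to the category is its Mahalanobis distance to that prototype.</p>
      <preformat>
import numpy as np

def category_prototype(occurrence_vectors):
    """Prototype = mean concept-occurrence vector of the category members,
    plus a regularized inverse covariance for the Mahalanobis distance."""
    X = np.asarray(occurrence_vectors, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mean, np.linalg.inv(cov)

def typicality_distance(occurrence_vector, prototype, inv_cov):
    """Mahalanobis distance to the prototype; smaller means more typical."""
    d = np.asarray(occurrence_vector, dtype=float) - prototype
    return float(np.sqrt(d @ inv_cov @ d))

# Classification assigns an image to the category whose prototype is
# closest in this Mahalanobis sense.
      </preformat>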
    </sec>
    <sec id="sec-7">
      <title>Configural Recognition</title>
      <p>
        The goal of
        <xref ref-type="bibr" rid="ref22">(Lipson, Grimson, and Sinha 1997)</xref>
        is to
classify scenes based on their content. Most of the solutions
that are available for scene recognition rely on color
histograms and local texture statistics. The authors state that
these features cannot capture a scene's global configuration.
To overcome this, they present a novel approach, which they
call configural recognition, for encoding scene class
structure in images. The configural recognition scheme encodes
class models as a set of salient low-resolution image regions
and salient qualitative relations between the regions. An
example of a qualitative relationship is: ‘given three regions,
a blue region (A), a white region (B) and a gray region (C),
snow-capped mountains always have region A above region
B, which is above region C’.
      </p>
      <p>The class models are described using seven types of
relative relationships between image patches, each of which
takes one of the values less than, greater than, or equal
to. The relationships encoded are the relative color between
image regions, the relative luminance between the patches, the
spatial relationships (relative horizontal and vertical
positions) and the relative size of the patches. Based on this, each
region in the image is grouped into directional equivalence
classes, such as above and below.</p>
      <p>The generated models act as deformable templates. When
compared with the image, a model can be deformed by
moving the patches around so that the model best matches
the image in terms of relative luminance and photometric
attributes. An improvement to this system would be to
replace the hand-crafted models with an automated process
that takes a set of example images and generates a set of
templates describing the relevant relationships between
the pictures in the example set.</p>
      <p>
        A fuzzy part-based model was described in
        <xref ref-type="bibr" rid="ref27">(Miyajima and
Ralescu 1993)</xref>
        and fuzzy sets were also widely and
effectively used for spatial descriptors in an image. A very
powerful formal model, based on fuzzy sets, for the description
of spatial relations in an image was introduced in
        <xref ref-type="bibr" rid="ref11 ref27">(Miyajima
and Ralescu 1994)</xref>
        ,
        <xref ref-type="bibr" rid="ref11 ref27">(Miyajima and Ralescu 1994)</xref>
        and
further extended by
        <xref ref-type="bibr" rid="ref4">(Bloch 1999)</xref>
        . A comparison of the fuzzy
approaches for the description of directional relative
position between objects in an image can be found in
        <xref ref-type="bibr" rid="ref3">(Bloch and
Ralescu 2003)</xref>
        , and a review of these approaches can be
found in
        <xref ref-type="bibr" rid="ref5">(Bloch 2005)</xref>
        . More recently, fuzzy
spatial relations were integrated into deformable models and
applied to MRI images
        <xref ref-type="bibr" rid="ref31 ref6">(Colliot, Camara, and Bloch 2006)</xref>
        .
      </p>
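      <p>As a toy, hedged stand-in for these formal definitions (it is neither the histogram-of-angles nor the morphological model cited above), the sketch below computes the degree to which region B lies above region A as the mean membership of sampled pixel-pair directions in a cosine-shaped ‘above’ membership function.</p>
      <preformat>
import numpy as np

def fuzzy_above(mask_a, mask_b, max_pairs=20000, seed=0):
    """Degree in [0, 1] to which region B is above region A.

    Both masks are boolean arrays over the image and are assumed non-empty;
    memberships are a simple cosine of the angle between each sampled
    pixel-pair direction and the upward direction.
    """
    ya, xa = np.nonzero(mask_a)
    yb, xb = np.nonzero(mask_b)
    rng = np.random.default_rng(seed)
    ia = rng.integers(0, len(ya), size=max_pairs)
    ib = rng.integers(0, len(yb), size=max_pairs)
    dy = ya[ia] - yb[ib]            # positive when B sits higher in the image
    dx = xb[ib] - xa[ia]
    angle = np.arctan2(dy, dx)      # direction from A to B; straight up is pi/2
    membership = np.clip(np.cos(angle - np.pi / 2.0), 0.0, 1.0)
    return float(membership.mean())
      </preformat>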
      <p>
        In
        <xref ref-type="bibr" rid="ref34">(Ralescu and Baldwin 1987)</xref>
        a new approach for
concept learning from examples and counter-examples with
applications to a vision learning system, later extended to
a general concept learning problem
        <xref ref-type="bibr" rid="ref36">(Ralescu and Baldwin
1989)</xref>
        , was developed. It makes use of Conceptual
Structures
        <xref ref-type="bibr" rid="ref48">(Sowa 1983)</xref>
        for knowledge representation, and
support logic programming
        <xref ref-type="bibr" rid="ref1">(Baldwin 1986)</xref>
        for inference.
Examples of a concept (e.g., a ’car’) are used to construct a
memory aggregate (MA), which, rather than averaging all
examples, keeps track of various probability distributions of
the object features. Counter-examples, i.e., descriptions that
are very similar to a concept, but fail to be instances of
that concept, are used in a similar manner to construct a
counter-example memory aggregate (CMA). Matching
between the conceptual structures describing an object candidate
and the MA and CMA produces supports for and against the
recognition of a concept. The result is therefore qualified by
a support pair, whose values mean: (1, 1) complete
recognition, (0, 0) complete rejection, and (0, 1) total uncertainty.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Deformable part based models</title>
      <p>
        In
        <xref ref-type="bibr" rid="ref33">(Pandey and Lazebnik 2011)</xref>
        , the authors observe that
weakly supervised discovery of common visual structure
in highly variable, cluttered images presents a
major problem in recognition. In order to address this
problem, they propose using deformable part-based
models (DPM) with latent SVM training. For scene recognition,
deformable part-based models capture recurring visual
elements and salient objects. The DPM represents an object by a
low-resolution root filter and a set of high-resolution part
filters in a flexible spatial configuration. The image is
represented by a variation of histogram of oriented gradients
(HOG) features, which are used to classify scenes with a
linear SVM.
      </p>
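      <p>The full latent-SVM DPM training is beyond a short snippet, but a hedged sketch of the final feature/classifier stage the paragraph mentions (HOG features plus a linear SVM, here via scikit-image and scikit-learn) looks roughly as follows; the image size and HOG parameters are illustrative assumptions.</p>
      <preformat>
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def hog_descriptor(gray_image, size=(128, 128)):
    """Fixed-length HOG descriptor for a whole scene image."""
    im = resize(gray_image, size, anti_aliasing=True)
    return hog(im, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Training sketch (images are grayscale arrays, labels are integer scene ids):
# X = np.array([hog_descriptor(im) for im in train_images])
# clf = LinearSVC().fit(X, train_labels)     # linear SVM on HOG features
      </preformat>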
    </sec>
    <sec id="sec-9">
      <title>Covariance descriptor</title>
      <p>
        The paper
        <xref ref-type="bibr" rid="ref54">(Yang et al. 2016)</xref>
        proposes a supervised
collaborative kernel coding method based on the covariance
descriptor (covd) for scene-level geographic image
classification. The covariance descriptor is a covariance matrix of
different features, such as color, spatial location, and gradient; it
is rotation and scale invariant, but it lies in a Riemannian
space (i.e., a non-Euclidean space) and therefore the
traditional computational and mathematical models used in
Euclidean space cannot be applied directly.
      </p>
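      <p>A hedged sketch of a basic region covariance descriptor follows (the feature set is an assumption, and the kernel coding of Yang et al. is not reproduced): per-pixel features are collected and their covariance matrix computed; mapping the matrix through the matrix logarithm is one common way of flattening the Riemannian geometry before applying Euclidean tools.</p>
      <preformat>
import numpy as np
from scipy import ndimage
from scipy.linalg import logm

def covariance_descriptor(rgb):
    """Covariance matrix of per-pixel features (x, y, R, G, B, gradient magnitude)."""
    h, w, _ = rgb.shape
    gray = rgb.astype(float).mean(axis=2)
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([xs, ys,
                      rgb[..., 0], rgb[..., 1], rgb[..., 2],
                      np.hypot(gx, gy)], axis=-1).reshape(-1, 6).astype(float)
    return np.cov(feats, rowvar=False)

def log_euclidean_vector(cov, eps=1e-6):
    """Flatten the SPD covariance matrix via the matrix logarithm so that
    ordinary Euclidean classifiers can be applied afterwards."""
    spd = cov + eps * np.eye(cov.shape[0])
    return logm(spd)[np.triu_indices(cov.shape[0])].real
      </preformat>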
      <p>The major contribution of this paper is a supervised
kernel coding model that transforms the covd into
a discriminative feature representation and obtains a
corresponding linear classifier. The method can be seen as a
three-step process. The first step is to extract the covd
features from the geographic scene image. In the second
step, supervised collaborative kernel coding involving
dictionary coefficients in the coding representation phase and the
linear classification phase is performed. Lastly, in the
classification stage, a label vector is derived from the dictionary
coefficients and the learned linear classifier. A novel
objective function is proposed to combine the collaborative
kernel coding phase and the classification phase. The method
gives satisfying performance on a high-resolution aerial
image dataset, proving to be an efficient method for scene-level
geographic image classification.</p>
    </sec>
    <sec id="sec-10">
      <title>Shape of the scene</title>
      <p>
        The paper
        <xref ref-type="bibr" rid="ref29">(Oliva and Torralba 2001)</xref>
        takes a very different
approach to scene recognition: rather than looking at the
scene as a configuration of objects the paper proposes to
consider the scene as an individual object, with a unitary
shape. A computational model to find the shape of the scene
using a few perceptual dimensions specifically dedicated to
describing spatial properties of the scene is proposed. It is
shown that the holistic spatial scene properties, called
Spatial Envelope (SE) properties, may be reliably estimated using
spectral and coarsely localized information.
      </p>
      <p>Given an environment V, its spatial envelope SE(V) is
defined as the composite set of boundaries, such as walls,
sections, elevations, etc., that define the shape of the space. A
group of 17 observers were asked to categorize 81 images
into categories based on some global aspect. Based on the
classification results, the criterion for classification of scenes
was agreed to be based upon the degree of naturalness,
degree of openness, degree of roughness, degree of expansion
and degree of ruggedness. Therefore, the purpose of the
spatial envelope model is to show that modeling these five
spatial properties is adequate for obtaining a high-level
description of the scene. Their algorithm learns the spectral
signatures (the global energy spectrum and the spectrogram)
of basic scene categories from labeled training data. A
learning algorithm (regression) is then used to find the relation
between the global features and the spectral features.</p>
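      <p>A hedged, much simplified sketch of the kind of ‘spectral and coarsely localized’ representation the spatial envelope model relies on (not Oliva and Torralba's actual GIST implementation): per-cell Fourier magnitude energy pooled into orientation and scale bins, which a regressor could then map to the five spatial-envelope properties.</p>
      <preformat>
import numpy as np

def spectral_energy_grid(gray, grid=4, n_orient=8, n_scale=4):
    """Coarsely localized spectral energy, pooled into orientation x scale bins."""
    h, w = gray.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = gray[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid].astype(float)
            spectrum = np.abs(np.fft.fftshift(np.fft.fft2(cell)))
            cy, cx = spectrum.shape[0] // 2, spectrum.shape[1] // 2
            ys, xs = np.mgrid[0:spectrum.shape[0], 0:spectrum.shape[1]]
            radius = np.hypot(ys - cy, xs - cx)
            orient = np.mod(np.arctan2(ys - cy, xs - cx), np.pi)
            r_bin = np.minimum((radius / (radius.max() + 1e-12) * n_scale).astype(int),
                               n_scale - 1)
            o_bin = np.minimum((orient / np.pi * n_orient).astype(int), n_orient - 1)
            cell_energy = np.zeros((n_scale, n_orient))
            np.add.at(cell_energy, (r_bin, o_bin), spectrum)
            feats.append(cell_energy.ravel() / (spectrum.sum() + 1e-12))
    return np.concatenate(feats)

# A regressor trained on labeled images can then map this descriptor to the
# degrees of naturalness, openness, roughness, expansion and ruggedness.
      </preformat>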
    </sec>
    <sec id="sec-11">
      <title>Beyond Scene recognition</title>
      <p>Certain algorithms detect scenes and then use scene
recognition as a prior in order to find more structure in the image,
thus motivating further study in the field of scene analysis.</p>
      <p>
        In “Using the Forest to See the Trees”
        <xref ref-type="bibr" rid="ref28">(Murphy et al. 2003)</xref>
        an
intuitive approach to detecting the presence of objects based on
the detected scene is presented. The approach is suggested
by psychological evidence that people perform rapid global
scene analysis before conducting more detailed local
object analysis. Based on this, the authors propose to use the
whole image as a global feature in order to overcome
ambiguities which might occur at the local level. They extend the
notion of gist from
        <xref ref-type="bibr" rid="ref31">(Oliva and Torralba 2006)</xref>
        by combining
the prior suggested by the gist to the output of bottom-up
local object detectors which are trained using boosting. They
also use the same set of features for object detection in the
image. The image is divided into patches at different scales
(image pyramid) and each patch is convolved with 13
zero-mean filters, which include oriented edges, a Laplacian filter,
corner detectors and long edge detectors. This is represented
by two statistics, variance and kurtosis, derived from the
histogram of image patches at two scales and with 30 spatial
masks. The kurtosis is omitted for scene recognition. The
features are further reduced in dimensionality using PCA to
give the PCA-gist. A one-vs-all binary classifier is trained
for recognizing each type of scene using boosting applied to
the gist. They take this further by using the scene as a latent common
cause upon which the presence of the object is conditionally
dependent.
      </p>
      <p>
        Whole-image understanding, or holistic scene
understanding, is described in
        <xref ref-type="bibr" rid="ref55">(Yao, Fidler, and Urtasun 2012)</xref>
        . The
idea is to jointly evaluate and draw conclusions about the
location, regions, class and spatial information of objects, the
presence of a class in an image and also the scene type; that is,
to recover and connect the multiple different aspects of a
scene. The problem is framed as a prediction problem in a
graphical model defined over hierarchies of regions of
different sizes, with auxiliary variables encoding the scene type, the
presence of a given class in the scene, and the correctness of bounding
boxes obtained by the object detector. The model reasons over class labels of image
segments at two different levels of the segmentation hierarchy,
namely segments and larger super-segments.
Binary variables indicate which classes are present in the image,
and a multi-labeled variable represents the scene type. Segments
and super-segments are used to assign semantic class labels
to each pixel in the image. Super-segments are used to create
long-range dependencies, and they also prove to be more
efficient computationally. A holistic loss function is defined
as a weighted sum of the losses from each task. State-of-the-art
performance is achieved on the MSRC-21 benchmark, and
the approach is much faster than existing approaches.
      </p>
      <p>
        Another very interesting approach is presented in
        <xref ref-type="bibr" rid="ref12 ref18">(Li and
Fei-Fei 2007)</xref>
        , which goes beyond scene recognition to event
recognition. An event in a static image is defined as a
human activity taking place in a specific environment. The
objective is to recognize/classify the event in the image as
well as provide a number of semantic labels to the object and
scene environment within the image. It is assumed that
conditioned on the event, scene and objects are independent of
each other, but both their presence influences the probability
of predicting the event. For scene recognition they adopt a
model similar to the Bayesian model of
        <xref ref-type="bibr" rid="ref8">(Fei-Fei and Perona
2005)</xref>
        . Scene recognition heavily influences event
recognition, and in fact, as a first approximation, event recognition
is essentially scene recognition. The robust bag of words
model is used in order to recognize objects. In addition to
scene and object recognition, they recognize the importance
of the layout of the image for accurately identifying the event.
They use some simple geometric cues to define the layout of
the image and manage to provide integrative and
hierarchical labels to an image by performing the what (event), where
(scene) and who (object) recognition of the entire scene by
using a generative model in order to represent the image.
      </p>
      <p>
        An extensive Scene Understanding (SUN) database
consisting of 899 categories and 130519 images is created in
        <xref ref-type="bibr" rid="ref53">(Xiao et al. 2010)</xref>
        . This work is motivated by the authors’
belief that the existing data sets for scene classification fail
to capture the richness and diversity of daily life
environments. The authors claim to have built the most complete
dataset with a number of different scene image categories
with different functionalities that are important enough to
have unique identities. They measure human performance
on scene classification and compare it with the state of the
art algorithms, using the SUN database. Both human and
algorithm results had errors: humans tended to err between
semantically similar categories, while algorithms erred
between semantically unrelated scenes due to spurious visual
matches. It was also observed that the best features agree
more with correct human classifications and make the same
mistakes as humans do. The computational algorithms need
a much larger number of features to perform as well as
humans. The authors also propose the notion of recognizing scene
types within images rather than labeling an entire image with
a single scene, because the real world often contains combinations
of scenes. This is an interesting new idea and could be
one of the directions in which future scene recognition
algorithms progress.
      </p>
    </sec>
    <sec id="sec-12">
      <title>Conclusion</title>
      <p>It can be seen, even from the limited number of papers
reviewed here, that image understanding, and in particular scene
recognition, can be approached from various directions. At
a very high level, the approaches can be divided into two
main categories - using low-level features, and using
object recognition. However, many other techniques are
integrated into each of these approaches, including
probabilistic and/or fuzzy techniques, in order to deal with the
uncertainty which often attends the result of image understanding.
When it comes to evaluating the low-level feature approach and
the object recognition approach, the goal of the image
understanding task must be taken into account. Scene recognition
performs better when low-level features are used; local features
help override the effects of occluded objects and low lighting
conditions. The most commonly used features for scene
recognition include texture, texture orientation and strength, the ‘gist’
of the image, SIFT descriptors, edge orientation, histograms
in different color spaces (e.g., Ohta, HSI, RGB), histograms
of angles between segmented regions, and coefficients of the
shift-invariant DCT. These features can be successfully mapped
into semantic image descriptors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          <year>1986</year>
          .
          <article-title>Support logic programming</article-title>
          .
          <source>In Fuzzy sets theory and applications</source>
          . Springer.
          <fpage>133</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ng</surname>
          </string-name>
          , A. Y.; and
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research 3</source>
          (Jan):
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ralescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Directional relative position between objects in image processing: a comparison between fuzzy approaches</article-title>
          .
          <source>pattern Recognition</source>
          <volume>36</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1563</fpage>
          -
          <lpage>1582</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Fuzzy relative position between objects in image processing: a morphological approach</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>21</volume>
          (7):
          <fpage>657</fpage>
          -
          <lpage>664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Fuzzy spatial relationships for image processing and interpretation: a review</article-title>
          .
          <source>Image and Vision Computing</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>89</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Colliot</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Camara</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bloch</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Integration of fuzzy spatial relations in deformable modelsapplication to brain mri segmentation</article-title>
          .
          <source>Pattern recognition 39</source>
          <volume>(8)</volume>
          :
          <fpage>1401</fpage>
          -
          <lpage>1414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Csurka</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Dance,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ;
            <surname>Willamowski</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Bray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Visual categorization with bags of keypoints</article-title>
          . In Workshop on statistical learning in
          <source>computer vision</source>
          , ECCV, volume
          <volume>1</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . Prague.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>A bayesian hierarchical model for learning natural scene categories</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          ,
          <year>2005</year>
          .
          <source>CVPR 2005. IEEE Computer Society Conference on</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>524</fpage>
          -
          <lpage>531</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Felzenszwalb</surname>
            ,
            <given-names>P. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McAllester</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Object detection with discriminatively trained part-based models</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>32</volume>
          (9):
          <fpage>1627</fpage>
          -
          <lpage>1645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Gorkani</surname>
            ,
            <given-names>M. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>R. W.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Texture orientation for sorting photos “at a glance”</article-title>
          .
          <source>In Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision &amp; Image Processing, Proceedings of the 12th IAPR International Conference on</source>
          , volume
          <volume>1</volume>
          ,
          <fpage>459</fpage>
          -
          <lpage>464</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>The pyramid match kernel: Efficient learning with sets of features</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>8</volume>
          (Apr):
          <fpage>725</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Guérin-Dugué</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Classification of scene photographs from local orientations features</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>21</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1135</fpage>
          -
          <lpage>1140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Hoiem</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ; and Hebert,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2005</year>
          .
          <article-title>Automatic photo pop-up</article-title>
          .
          <source>ACM transactions on graphics (TOG) 24</source>
          (
          <issue>3</issue>
          ):
          <fpage>577</fpage>
          -
          <lpage>584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Julesz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>1981</year>
          .
          <article-title>Textons, the elements of texture perception, and their interactions</article-title>
          .
          <source>Nature</source>
          <volume>290</volume>
          (
          <issue>5802</issue>
          ):
          <fpage>91</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Julesz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>1986</year>
          .
          <article-title>Texton gradients: The texton theory revisited</article-title>
          .
          <source>Biological cybernetics</source>
          <volume>54</volume>
          (4):
          <fpage>245</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ponce</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories</article-title>
          .
          <source>In Computer vision and pattern recognition</source>
          ,
          <source>2006 IEEE computer society conference on</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>2169</fpage>
          -
          <lpage>2178</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.-J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>What, where and who? classifying events by scene and object recognition</article-title>
          .
          <source>In Computer Vision</source>
          ,
          <year>2007</year>
          .
          <article-title>ICCV 2007</article-title>
          . IEEE 11th International Conference on,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F. F.; VanRullen</given-names>
          </string-name>
          , R.; Koch,
          <string-name>
            <surname>C.</surname>
          </string-name>
          ; and Perona,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2003</year>
          .
          <article-title>Natural scene categorization in the near absence of attention: further explorations</article-title>
          .
          <source>Journal of Vision</source>
          <volume>3</volume>
          (
          <issue>9</issue>
          ):
          <fpage>331</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          -J.; Su,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ; and
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. P.</surname>
          </string-name>
          <year>2010</year>
          .
          <article-title>Object bank: A high-level image representation for scene classification &amp; semantic feature sparsification</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <volume>1378</volume>
          -
          <fpage>1386</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Lipson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grimson</surname>
            , E.; and Sinha,
            <given-names>P.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Configuration based scene classification and image indexing</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          ,
          <year>1997</year>
          . Proceedings., 1997 IEEE Computer Society Conference on,
          <fpage>1007</fpage>
          -
          <lpage>1013</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Object recognition from local scaleinvariant features</article-title>
          .
          <source>In Computer vision</source>
          ,
          <year>1999</year>
          .
          <source>The proceedings of the seventh IEEE international conference on</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          . Ieee.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Marr</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1977</year>
          .
          <article-title>Artificial intelligence - a personal view</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>37</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Marr</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1982</year>
          .
          <article-title>Vision: A computational approach</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Miyajima</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ralescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1993</year>
          .
          <article-title>Modeling of natural objects including fuzziness and application to image understanding</article-title>
          .
        </mixed-citation>
        <mixed-citation>
          <string-name>
            <surname>Miyajima</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ralescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Spatial organization in 2d segmented images: representation and recognition of primitive spatial relations</article-title>
          .
          <source>Fuzzy Sets and Systems</source>
          <volume>65</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>225</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; et al.
          <year>2003</year>
          .
          <article-title>Using the forest to see the trees: a graphical model relating features, objects and scenes</article-title>
          .
          <source>Advances in neural information processing systems</source>
          <volume>16</volume>
          :
          <fpage>1499</fpage>
          -
          <lpage>1506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Modeling the shape of the scene: A holistic representation of the spatial envelope</article-title>
          .
          <source>International journal of computer vision</source>
          <volume>42</volume>
          (
          <issue>3</issue>
          ):
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Building the gist of a scene: The role of global image features in recognition.</article-title>
          <source>Progress in brain research</source>
          <volume>155</volume>
          :
          <fpage>23</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Scene recognition and weakly supervised object localization with deformable part-based models</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <year>2011</year>
          IEEE International Conference on,
          <fpage>1307</fpage>
          -
          <lpage>1314</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Ralescu</surname>
            ,
            <given-names>A. L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          <year>1987</year>
          .
          <article-title>Concept learning from examples with applications to a vision learning system.</article-title>
          <source>In Alvey Vision Conference</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Ralescu</surname>
            ,
            <given-names>A. L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Concept learning from examples and counter examples</article-title>
          .
          <source>International Journal of Man-Machine Studies</source>
          <volume>30</volume>
          (
          <issue>3</issue>
          ):
          <fpage>329</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Ralescu</surname>
            ,
            <given-names>A. L.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>Image understanding = verbal description of the image contents</article-title>
          .
          <source>SOFT, Journal of the Japanese Society for Fuzzy Theory</source>
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>739</fpage>
          -
          <lpage>746</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Renninger</surname>
            ,
            <given-names>L. W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>When is scene identification just texture recognition</article-title>
          ?
          <source>Vision research</source>
          <volume>44</volume>
          (19):
          <fpage>2301</fpage>
          -
          <lpage>2311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <surname>Rock</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>1990</year>
          .
          <article-title>The perceptual world</article-title>
          .
          <source>Scientific American</source>
          <volume>127</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Rosch</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mervis</surname>
            ,
            <given-names>C. B.</given-names>
          </string-name>
          <year>1975</year>
          .
          <article-title>Family resemblances: Studies in the internal structure of categories</article-title>
          .
          <source>Cognitive psychology</source>
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>573</fpage>
          -
          <lpage>605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Rosch</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mervis</surname>
            ,
            <given-names>C. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>W. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Boyes-Braem</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>1976</year>
          .
          <article-title>Basic objects in natural categories</article-title>
          .
          <source>Cognitive psychology</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>382</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>Rosch</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          <year>1973</year>
          .
          <article-title>Natural categories</article-title>
          .
          <source>Cognitive psychology</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ):
          <fpage>328</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Serrano</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Savakis</surname>
            ,
            <given-names>A. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Improved scene classification using efficient low-level features and semantic cues</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>37</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1773</fpage>
          -
          <lpage>1784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Probabilistic spatial context models for scene content understanding</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          ,
          <year>2003</year>
          . Proceedings. 2003 IEEE Computer Society Conference on, volume
          <volume>1</volume>
          ,
          <fpage>I</fpage>
          -
          <lpage>I</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Sivic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Efficient visual search of videos cast as text retrieval</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>31</volume>
          (4):
          <fpage>591</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>Sowa</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          <year>1983</year>
          .
          <article-title>Conceptual structures: information processing in mind and machine</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <surname>Szummer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Picard</surname>
            ,
            <given-names>R. W.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Indoor-outdoor image classification</article-title>
          .
          <source>In Content-Based Access of Image and Video Database</source>
          ,
          <year>1998</year>
          . Proceedings., 1998 IEEE International Workshop on,
          <fpage>42</fpage>
          -
          <lpage>51</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>Thorpe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fize</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Marlot</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Speed of processing in the human visual system</article-title>
          .
          <source>Nature</source>
          <volume>381</volume>
          (
          <issue>6582</issue>
          ):
          <fpage>520</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <string-name>
            <surname>Van De Sande</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gevers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Evaluating color descriptors for object and scene recognition</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>32</volume>
          (9):
          <fpage>1582</fpage>
          -
          <lpage>1596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <surname>Vogel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schiele</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>A semantic typicality measure for natural scene categorization</article-title>
          .
          <source>In Joint Pattern Recognition Symposium</source>
          ,
          <fpage>195</fpage>
          -
          <lpage>203</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ehinger</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Sun database: Large-scale scene recognition from abbey to zoo</article-title>
          . In
          <source>Computer vision and pattern recognition (CVPR)</source>
          ,
          <source>2010 IEEE conference on</source>
          ,
          <fpage>3485</fpage>
          -
          <lpage>3492</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Scene-level geographic image classification based on a covariance descriptor using supervised collaborative kernel coding</article-title>
          .
          <source>Sensors</source>
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>392</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fidler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2012 IEEE Conference on</source>
          ,
          <fpage>702</fpage>
          -
          <lpage>709</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>