<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Two-layered Photo Classification Based on Semantic and Syntactic Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seungji Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Man Ro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Image and Video Systems Lab., Information and Communications Univ.</institution>
          ,
          <addr-line>Munji 103-6, Yuseong, Daejeon, 305-714</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A novel approach to semantic classification of generic home photos is proposed. The proposed method consists of two-layered SVM classifiers. The first layer aims to predict the likelihood of pre-defined local photo semantics based on camera metadata and regional low-level visual features. In the second layer, one or more global photo semantics are detected based on the likelihood ratio. To construct classifiers in the first layer that produce a posterior probability, we use a parametric model to fit the output confidence value of the SVM classifiers to a posterior probability. We also exploit a concept merging process based on a set of semantic confidence maps in order to select the most likely photo semantics on overlapping local photo regions.</p>
      </abstract>
      <kwd-group>
        <kwd>Photo album</kwd>
        <kwd>Semantic classification</kwd>
        <kwd>Camera metadata</kwd>
        <kwd>SVM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Recently, it has become affordable to keep a complete digital record of one’s whole life. One
main issue is to minimize the user’s manual effort in organizing and managing large
photo collections. Semantic classification of arbitrary images has been a
challenge in recent years. The goal of semantic classification is to discover image
semantics from given pre-defined semantic concepts. The need for semantic
classification has rightly been raised in the digital home photo area.</p>
      <p>
        One state-of-the-art classification approach is to use the support vector machine (SVM)
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. So far, many classification methods have employed empirical risk minimization
(ERM) for learning classifiers. ERM only utilizes the loss function defined for the
classifier and is equivalent to Bayesian decision theory with a particular choice of
prior. Thus, ERM approaches often lead a classifier to be over-fitted, i.e., the classifier is
usually fitted too closely to the training data alone. Unlike ERM, structural risk
minimization (SRM) aims to minimize the generalization error. SVM is based on the idea
of SRM. The generalization error is bounded by the sum of the training set error and a
term depending on the VC dimension of the learning machine. By minimizing this
upper bound, high generalization can be achieved. The generalization error of SVM is
related not to the input dimensionality of the problem, but to the margin with which it
separates the data. This explains why SVM can have good performance even in
problems with a large number of inputs. To date, SVM has been applied successfully
to a wide range of problems.
      </p>
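      <p>As a concrete illustration of the classifier type discussed above, the following is a minimal sketch (assuming scikit-learn and synthetic data, not the experiments of this paper) of training an RBF-kernel SVM and reading its decision values:</p>

```python
# Minimal sketch: an RBF-kernel SVM on synthetic two-class data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two Gaussian blobs standing in for feature vectors of two photo classes.
X = np.vstack([rng.normal(0.0, 1.0, (50, 8)), rng.normal(3.0, 1.0, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# SRM view: maximizing the margin bounds the generalization error,
# so performance depends on the margin rather than the input dimensionality.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

labels = clf.predict(X[:2])            # predicted classes for two samples
scores = clf.decision_function(X[:2])  # signed distances to the hyper-plane
```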
      <sec id="sec-1-1">
        <p>
          In particular, the semantic classification problem can usually be made simpler and thus
easier to solve by using a multi-layered approach. The multi-layered classification approach aims to
solve a classical image understanding problem that requires the effective interaction
of high-level image semantics and low-level image features. Many researchers have
successfully employed the multi-layered approach for semantic classification.
Unfortunately, a naïve SVM is inappropriate for a multi-layered classifier because the
output of the SVM should be a calibrated posterior probability to enable
post-processing. Basically, SVM is a discriminative classifier, not based on any generative
model. So, the output confidence of any classifier in a certain layer should be
probabilistically modeled before being used as the probabilistic input of any classifier
in the next layer. A few studies have addressed this problem [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Platt
proposed a good parametric model to fit the SVM output to the posterior probability,
instead of estimating the class-conditional density. The parameters of the model are
adapted to give the best probability output [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Lin et al. improved the implementation of
Platt’s model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. They solved the problem that Platt’s implementation may not
converge to the minimum solution. Although Lin’s method increases complexity, it
gives better convergence properties.
        </p>
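        <p>As an illustrative sketch (not the implementation of [1] or [2], and assuming NumPy/SciPy), the sigmoid fit can be written as a small optimization over the two parameters of the model:</p>

```python
# Platt-style calibration sketch: fit P(y=1|f) = 1/(1+exp(A*f+B)) to SVM
# decision values f by minimizing the cross-entropy against Platt's
# regularized targets, turning a discriminative score into a posterior.
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """f: SVM decision values; y: labels in {0, 1}. Returns (A, B)."""
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    # Regularized targets used instead of hard 0/1 labels.
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        p = np.clip(1.0 / (1.0 + np.exp(A * f + B)), 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    return minimize(nll, x0=[-1.0, 0.0], method="Nelder-Mead").x

# Synthetic decision values: positives score high, negatives score low.
rng = np.random.default_rng(1)
f = np.concatenate([rng.normal(2, 1, 100), rng.normal(-2, 1, 100)])
y = np.array([1] * 100 + [0] * 100)
A, B = fit_sigmoid(f, y)
prob_pos = 1.0 / (1.0 + np.exp(A * 2.0 + B))     # calibrated P(y=1) at f = +2
prob_neg = 1.0 / (1.0 + np.exp(A * (-2.0) + B))  # calibrated P(y=1) at f = -2
```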
        <p>
          Nevertheless, capturing high-level image semantics with low-level features remains
a challenge for real applications due to low performance. Unlike generic images, photos usually
include their camera metadata as well as the pixel data itself. The metadata is obtained
from the Exif header of the photo file [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Camera metadata is of great benefit to semantic
photo classification in that it provides several useful cues. In particular, taken
date/time stamp has been successfully employed to cluster a sequence of unlabeled
photos by meaningful event or situation groups [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Especially in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
the taken date/time stamp and color features have been combined to cluster
photos by events in an automatic manner. In general, the event clusters users demand
tend to exhibit little coherence in terms of low-level features, though syntactic
information, such as camera metadata, could help to organize event clusters into more
semantically meaningful groups. In our previous studies [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], we also developed
an unsupervised photo clustering scheme based on situation – similar
background scenery taken in close temporal proximity – by associating camera
metadata and low-level features.
        </p>
        <p>
          Especially for semantic photo classification, Boutell et al. proposed a probabilistic
approach to incorporate camera metadata with content-based visual features in scene
classification [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. They exploited a useful set of camera metadata, which is related to
scene brightness, flash, subject distance, and focal length, and verified it on some global
visual semantics such as indoor/outdoor, sunset, and man-made/natural scenes.
However, Boutell’s method has one major disadvantage for applications to generic
scene classification: as assumed in his study, it has limited
application to a few global scenes since it uses only global features, such as camera
metadata and global visual features. A photo usually contains many local semantics.
So, to extend the use of camera metadata to the classification of many other local and
global visual semantics, the camera metadata probably needs to be incorporated with
visual features of local photo regions. For example, consider a photo that contains a human
face in the foreground against background scenery. If the camera focus is on the person, the
subject distance and focal length will be short. Given this knowledge, Boutell’s
classifier may have difficulty detecting the background scenery in spite of using
low-level visual features.
        </p>
      <p>In this paper, a semantic classification scheme for generic home photos is proposed.
The proposed method consists of two-layered SVM classifiers. The first layer aims to
predict the likelihood of pre-defined local photo semantics based on camera metadata
and regional low-level visual features. In the second layer, we determine one or more
global photo semantics based on the likelihood ratio. To construct classifiers in the
first layer that produce a posterior probability, we use a parametric model to fit the output
confidence value of the SVM classifiers to a posterior probability. Local photo semantics
provide an intermediate level of photo semantics, bridging the semantic gap between
low-level features and high-level photo semantics. We also exploit concept merging based
on a set of semantic confidence maps so as to select the most likely
photo semantics on overlapping local photo regions. For multi-class determination of
global photo semantics, we propose to use three different criteria.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Method</title>
      <sec id="sec-2-1">
        <title>2.1 Local Semantic Classification</title>
        <p>2.1.1 Regional Division for Local Semantics
Most current digital cameras support an auto-focusing (AF) system that works by
moving the camera lens in and out until the sharpest possible image of the subject is
projected onto the image receptor, such as a CCD or CMOS sensor. All AF systems provide a
certain number of sensing regions, and a sensing region usually forms a rectangle. This
means that the photographer’s intention can be found in the rectangular sensing regions.</p>
        <p>Indeed, the best representation of local visual semantics in a photo is given by object
segmentation, which could produce elaborate object contours. So far, however, there
seems to be no universal method for object segmentation. Rather, object segmentation is
usually computationally expensive and sometimes produces incomplete results on
complex natural images. So, instead, we adopt a simple block segmentation to
capture visual semantics that appear in local photo regions. The block segmentation
is relatively inexpensive. To compensate for its low segmentation performance, we employ
a set of region templates, denoted the photographic region template (PRT), whose idea
originates from the rectangular sensing system of digital cameras. Thus, although the PRT
is used in a block tessellation with a fixed number of blocks, it can be fast and good
enough to detect what the photographer intended to capture when taking the picture.
The basic observation behind the PRT is that mainly-concerned subjects would
usually be in focus, taking a larger portion of the frame and being sharper than other
subjects; most small, blurred subjects would often be out of
concern in the photo.</p>
        <p>In order to build meaningful region templates, three conditions are considered: the
region template should be large enough to detect semantics in a local photo region;
it should simultaneously be small enough not to be time-consuming in feature extraction and
similarity measurement; and it should support spatial scalability to detect photo semantics over
subjects of various scales. From these conditions, we propose the photographic region
template shown in Fig. 1. The region template is composed of ten local regions:
one center region (R1 in Fig. 1), four corner regions (R2, R3, R4, and R5 in Fig. 1),
two horizontal regions (R6 and R7 in Fig. 1), two vertical regions (R8 and R9 in Fig.
1), and the whole photo region (R10 in Fig. 1). The four corner regions are parts of the
vertical, horizontal, and whole regions. Note that the one center and four corner regions
are referred to as basis regions. The use of the basis region set will be presented in local
semantic classification. The center region overlaps partially with the corner, vertical,
and horizontal regions, and entirely with the whole photo region.</p>
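        <p>The ten PRT regions can be sketched in code as relative bounding boxes; the coordinates below are illustrative assumptions, since the exact geometry is given only by Fig. 1:</p>

```python
# Hypothetical PRT layout: ten regions as (left, top, right, bottom) boxes
# in relative [0, 1] photo coordinates.
def photographic_region_template():
    return {
        "R1":  (0.25, 0.25, 0.75, 0.75),  # center
        "R2":  (0.0, 0.0, 0.5, 0.5),      # top-left corner
        "R3":  (0.5, 0.0, 1.0, 0.5),      # top-right corner
        "R4":  (0.0, 0.5, 0.5, 1.0),      # bottom-left corner
        "R5":  (0.5, 0.5, 1.0, 1.0),      # bottom-right corner
        "R6":  (0.0, 0.0, 1.0, 0.5),      # upper horizontal
        "R7":  (0.0, 0.5, 1.0, 1.0),      # lower horizontal
        "R8":  (0.0, 0.0, 0.5, 1.0),      # left vertical
        "R9":  (0.5, 0.0, 1.0, 1.0),      # right vertical
        "R10": (0.0, 0.0, 1.0, 1.0),      # whole photo
    }

def to_pixels(box, width, height):
    """Convert a relative box to pixel coordinates for a given photo size."""
    l, t, r, b = box
    return (int(l * width), int(t * height), int(r * width), int(b * height))

regions = photographic_region_template()
center_px = to_pixels(regions["R1"], 640, 480)
```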
        <p>Fig. 1. Photographic region template: one center region (R1), four corner regions (R2–R5), two horizontal regions (R6, R7), two vertical regions (R8, R9), and the whole photo region (R10).</p>
        <p>
          2.1.2 Local Semantic Learning
SVM is employed for the local semantic classifiers in the first layer. It provides a good binary
classifier that finds the decision function of the optimal linear hyper-plane given
labeled training data. SVM is a constructive learning procedure rooted in statistical
learning theory [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. It is based on the principle of structural risk minimization, which
aims at minimizing the bound on the generalization error rather than minimizing the
mean square error over the data set. As a result, an SVM tends to perform well when
applied to data outside the training set. The hyper-plane can be linearly separable in a
high-dimensional feature space ( h ). An input feature in the space ( F ) is mapped onto
the feature space via a nonlinear mapping ( ϕ : F → h ), allowing one to perform
nonlinear analysis of the input features using a linear method. In generic SVM, a
kernel is designed to map the input data space to the feature space. With the ‘kernel
trick’ property [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the kernel can be considered as a similarity measure between two
feature vectors without explicit computation of the map ϕ . Using the kernel function,
the SVM classifier can be trained with the features of the training data. For this, an optimal
hyper-plane is found that correctly classifies the training data. By the optimization
theorem of SVM, the decision function ( Φ_n^local ) to predict the local concept ( x_n^local ) of an
unseen feature vector ( F ) is formed as follows,

Φ_n^local(F) = ∑_t a_n^t z_n^t K(F_n^t, F) + b_n ,   (1)

where K is a kernel function that can be a linear function, radial-basis function
(RBF), polynomial function, sigmoid function, etc.; in this paper, the RBF kernel,
the most popular choice of kernel type, is selected. F_n^t is the t-th support
vector of the hyper-plane for the local concept ( x_n^local ), a_n^t is the corresponding
weighting value of the support vector, z_n^t is the corresponding class label of the
support vector, and b_n is the threshold optimized for the local concept ( x_n^local ).
        </p>
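        <p>Numerically, the kernel-expansion decision function above can be sketched as follows; the support vectors, weights, and bias are made-up toy values, not a trained model:</p>

```python
# Toy evaluation of Phi(F) = sum_t a_t * z_t * K(F_t, F) + b with an RBF kernel.
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def decision_function(F, support_vectors, a, z, b, gamma=0.5):
    return sum(a_t * z_t * rbf_kernel(F_t, F, gamma)
               for a_t, z_t, F_t in zip(a, z, support_vectors)) + b

support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
a = [1.0, 1.0]   # weighting values of the support vectors
z = [+1, -1]     # class labels of the support vectors
b = 0.0          # optimized threshold

F = np.array([0.1, 0.0])        # unseen feature vector
score = decision_function(F, support_vectors, a, z, b)
label = 1 if score > 0 else -1  # the sign of Phi gives the predicted class
```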
        <p>
          To construct an SVM classifier that produces a posterior probability, the output
confidence value of the SVM is fitted to a parametric sigmoid model [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ]. The form
of the parametric sigmoid fitting model for the classifier of a local photo semantic ( x_n^local ) is

P_n(y = 1 | Φ_n^local(F)) = 1 / (1 + exp(A ⋅ Φ_n^local(F) + B)) ,   (2)

where A and B are parameters that determine the shape of the sigmoid model. So, the
SVM output, ranging from −∞ to ∞, is fitted to a probabilistic output ranging from 0
to 1.
        </p>
        <p>
          The best parameters (A, B) are estimated by solving the following regularized
maximum likelihood problem with a set of labeled training examples. Given a training
set (Φ_n^local(F_i), y_i), let us define a new training set (Φ_n^local(F_i), y'_i), where y'_i is a
target probability value. The new target value is used instead of {0, 1} for all of the
training data in the sigmoid fit. This aims at making the new target value converge to
{0, 1} as the training set size approaches infinity. The new target value y'_i is
defined as follows,

y'_i = (N_+ + 1) / (N_+ + 2)  if y_i = 1 ,
y'_i = 1 / (N_- + 2)          if y_i = −1 ,   (3)

where N_+ is the number of positive samples and N_- is the number of negative
samples. Then, the best parameters for a local photo semantic are obtained by
minimizing the following cross-entropy error function,

argmin_(A,B) − ∑_i { y'_i ⋅ log p_i + (1 − y'_i) ⋅ log(1 − p_i) } ,   (4)

where p_i denotes P_n(y_i | Φ_n^local(F_i)). We adopt Lin's method [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to find the optimized
parameters minimizing the above error function.
2.1.3 Integration of Camera Metadata and Local Visual Features
To integrate camera metadata with low-level visual features in the proposed photo
classification, we first generalize the following probabilistic combination scheme. Let
X = {x_1, x_2, …, x_I} be a set of I photo semantic classes that frequently appear in home
photos. Let F_cam = {f_cam^1, f_cam^2, …, f_cam^J} be a useful set of J camera metadata features, and
F_low = {f_low^1, f_low^2, …, f_low^K} be a set of K low-level visual features. Then, the likelihood
of a semantic class x_i ∈ X given the features F = {F_cam, F_low} can be represented
by the joint conditional probability as follows,
        </p>
        <p>P(x_i | F) = P(x_i | F_cam, F_low) .   (5)</p>
        <p>By the Bayesian theorem, the joint conditional probability can be decomposed as
follows,</p>
        <p>P(x_i | F) = P(x_i | F_cam, F_low) = P(x_i) P(F_cam, F_low | x_i) / P(F_cam, F_low) .   (6)</p>
        <p>Let us embody (5) for local semantics. For this, let X_local = {x_1^local, x_2^local, …, x_N^local} be a
set of N local semantics. Then, the joint conditional probability of a local semantic
x_n^local ∈ X_local given an input feature set F_local = {F_cam, F_low^local} – where the camera metadata is
not local, but global – for the local photo regions can be written as follows,

P(x_n^local | F_local) = P(x_n^local | F_cam, F_low^local) = P(x_n^local) P(F_cam, F_low^local | x_n^local) / P(F_cam, F_low^local) .   (7)</p>
        <p>The camera metadata ( F_cam ) is independent of the low-level features ( F_low^local ), so
that (7) can be written again as follows,</p>
        <p>P(x_n^local) P(F_cam, F_low^local | x_n^local) / P(F_cam, F_low^local) = P(x_n^local) P(F_cam | x_n^local) P(F_low^local | x_n^local) / ( P(F_cam) P(F_low^local) ) .   (8)</p>
        <p>2.1.4 Local Semantic Classification
As mentioned above, the input photo to be classified is divided into ten local regions
by the photographic region template. Multiple low-level visual features are extracted
from each local region and fed into the local concept detectors. For the local photo
semantic classification, let R = {R_1, R_2, …, R_10} be the set of local regions. Then, the
feature vector of a local region ( R ∈ R ) is denoted as F^R = {F_cam, F_low^R}. Equations (7)
and (8) can be specified for the local region as follows,</p>
        <p>P(x_n^local | F^R) = P(x_n^local | F_cam, F_low^R) = P(x_n^local) P(F_cam | x_n^local) P(F_low^R | x_n^local) / ( P(F_cam) P(F_low^R) ) ,   (9)</p>
        <p>where the camera metadata ( F_cam ) and the corresponding probability P(F_cam | x_n^local) are the
same over all local regions given an input photo. The P(F_low^R | x_n^local) is regarded as the
probability of the local region feature ( F_low^R ) under the SVM model of the local
concept ( x_n^local ). So, it is estimated by the sigmoid model as follows,</p>
        <p>P(F_low^R | x_n^local) ≈ 1 / (1 + exp(A ⋅ Φ_n^local(F_low^R) + B)) .   (10)</p>
        <p>Similarly, P(F_cam | x_n^local) is regarded as the probability of the camera metadata
feature ( F_cam ) under the SVM model of the local concept ( x_n^local ), so it is also
estimated by a sigmoid function,</p>
        <p>P(F_cam | x_n^local) ≈ 1 / (1 + exp(A ⋅ Φ_n(F_cam) + B)) .   (11)</p>
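        <p>As a toy illustration (made-up numbers, not trained models) of how the factorized posterior above combines the two sigmoid-calibrated likelihoods:</p>

```python
# Sketch: posterior of a local concept from a camera-metadata likelihood and
# a region-feature likelihood, each from a sigmoid-calibrated SVM output.
import math

def sigmoid_prob(phi, A, B):
    """Sigmoid-fitted probability for a raw SVM output phi."""
    return 1.0 / (1.0 + math.exp(A * phi + B))

def local_posterior(prior, p_cam, p_low, p_cam_marg, p_low_marg):
    """P(x|F) = P(x) * P(Fcam|x) * P(Flow|x) / (P(Fcam) * P(Flow))."""
    return prior * p_cam * p_low / (p_cam_marg * p_low_marg)

# Hypothetical calibrated likelihoods (sigmoid parameters A=-1, B=0 assumed).
p_cam = sigmoid_prob(1.2, A=-1.0, B=0.0)  # metadata supports the concept
p_low = sigmoid_prob(0.8, A=-1.0, B=0.0)  # region features support it too
posterior = local_posterior(prior=0.1, p_cam=p_cam, p_low=p_low,
                            p_cam_marg=0.5, p_low_marg=0.5)
```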
        <p>Over all local regions ( R ), the probability set of the local concept ( x_n^local ) can be
written as follows,</p>
        <p>P(x_n^local | F_low^R) = {P(x_n^local | F_low^R1), P(x_n^local | F_low^R2), …, P(x_n^local | F_low^R10)} .   (12)</p>
        <p>Given X_local = {x_1^local, x_2^local, …, x_N^local}, the probability set of the local concept set
( X_local ) can be written as follows,</p>
        <p>P(X_local | F_low^R) = {P(x_1^local | F_low^R), P(x_2^local | F_low^R), …, P(x_N^local | F_low^R)}
= { P(x_1^local | F_low^R1), P(x_1^local | F_low^R2), …, P(x_1^local | F_low^R10), …,
    P(x_N^local | F_low^R1), P(x_N^local | F_low^R2), …, P(x_N^local | F_low^R10) } .   (13)</p>
        <p>If v_{n,R}^local = P(x_n^local | F_low^R), (13) can be written again as follows,</p>
        <p>V_local = {v_{1,1}^local, v_{2,1}^local, …, v_{N,1}^local, v_{1,2}^local, v_{2,2}^local, …, v_{N,2}^local, …, v_{1,10}^local, v_{2,10}^local, …, v_{N,10}^local} ,   (14)</p>
        <p>where v_{n,R}^local stands for the degree of likelihood of the n-th local concept given the
feature of the local region R. Table 1 shows the probability of the local concept for each local
region.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Global Semantic Classification</title>
        <p>2.2.1 Association of Local Semantics with Global Semantics</p>
        <p>We express the degree of strength of the semantic link between local semantics and
global semantics; a higher value stands for a stronger connection between concepts.
This approach could bridge the semantic gap between low-level features and high-level
concepts. Thus, the global concepts are trained based on the confidence vectors
of the local SVM models. Similar to the local concepts, the decision function ( Φ_m^global )
to predict the global concept ( x_m^global ) of an unseen confidence feature vector ( V_local ) given
the local regions ( R ) is formed as follows,</p>
        <p>Φ_m^global(V_local) = ∑_t a_m^t z_m^t K(V_m^t, V_local) + b_m ,   (15)</p>
        <p>where V_m^t is the t-th support vector of the hyper-plane for the global concept ( x_m^global ).</p>
        <p>To find the more likely semantics on the overlapping local regions, concept
merging is performed by keeping the most confident concepts for the five basis local
regions ( R_basis ) that consist of one center and four corner regions; that is, the region
set can be defined as R_basis = {R_1, R_2, R_3, R_4, R_5}, where R_basis ⊂ R. The concept
merging is performed with a semantic confidence map used to keep the most confident
concept for the basis local region set.</p>
        <p>The semantic confidence map gives five different combinations of overlapping
local regions, as shown in Fig. 3. Then, the confidence value of a local concept ( x_n^local )
of a basis region ( R_b ∈ R_basis ) is calculated as follows,</p>
        <p>v_{n,b}^local = max(v_{n,t}^local | t ∈ R_b^map) ,   (16)</p>
        <p>where, for example, if the basis region is R2, v_{n,2}^local = max(v_{n,t}^local | t ∈ {2, 6, 8, 10}).</p>
        <p>Fig. 3. Five semantic confidence maps over the basis region set.</p>
        <p>
2.2.2 Global Semantic Classification
Given a basis local region, the merged confidence values for all local concepts are
used to classify the local regions into the target classes. In this paper, one of the main
targets is to detect multiple classes, meaning that an input photo can be labeled by one or
more classes. For this, we propose three criteria for multi-class categorization.
Given the probability values for the five basis local regions of an input photo, the
three categorization criteria are as follows:
1) α criterion: every basis local region has exactly one class, the one whose
probability value is the top-most over all global concept classes given that basis local
region.
2) β criterion: every basis local region has one class or none. That is, a basis
local region has a single class only if its probability value is high
enough, i.e., higher than a threshold.
3) γ criterion: first, the probability values over all basis local regions
are sorted. Then, the top-N classes with respect to the probability
value are assigned as classes of the input photo, provided their probability values are
high enough, i.e., higher than a threshold.</p>
        <p>In the case of the α criterion, the classifier assigns the class of a basis local region
( R_b ) to the concept satisfying the following MAP condition, given by,</p>
        <p>c_α = argmax_{c=1,2,…,M} { P(x_c^global) ∏_{t=1}^N P(v_{t,b}^local | x_c^global) / ∏_{t=1}^N P(v_{t,b}^local) }
    = argmax_{c=1,2,…,M} P(x_c^global) ∏_{t=1}^N P(v_{t,b}^local | x_c^global) ,   (17)
where c_α is the predicted class of the basis local region. Accordingly, the classifier
by the α criterion generates five predicted classes for an input photo.</p>
        <p>In the case of the β criterion, the classifier assigns the class of a basis local region
( R_b ) to one concept or none, satisfying the following condition, given by,
c_β = c_α if P(x_{c_α}^global) ∏_{t=1}^N P(v_{t,b}^local | x_{c_α}^global) ≥ P_th, and c_β = none otherwise ,   (18)</p>
        <p>where c_β is the predicted class of the basis local region, and P_th is the threshold
value for the categorization criterion. Accordingly, the classifier by the β criterion generates
five or fewer predicted classes for an input photo.</p>
        <p>In the case of the γ criterion, the classifier assigns the class of an input photo to
multiple concepts satisfying the following condition, given by,
c_γ = c if P(x_c^global) ∏_{t=1}^N P(v_{t,b}^local | x_c^global) ≥ P_th for any class and any basis region ,   (19)
where c_γ is a predicted class of the input photo.</p>
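        <p>The three criteria can be sketched on a toy score matrix (illustrative numbers; each entry stands for the product of prior and likelihoods in Eqs. (17)–(19)):</p>

```python
# Rows: the five basis regions; columns: the global classes.
import numpy as np

scores = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.10, 0.15, 0.05],
    [0.60, 0.30, 0.10],
    [0.20, 0.25, 0.55],
])
P_th = 0.45  # hypothetical threshold

# Alpha criterion: every basis region gets its top-scoring class (MAP).
alpha = scores.argmax(axis=1)

# Beta criterion: a region keeps its top class only above the threshold.
beta = [c if scores[b, c] >= P_th else None
        for b, c in enumerate(scores.argmax(axis=1))]

# Gamma criterion: the photo is labeled with every class whose score clears
# the threshold for any basis region.
gamma = sorted({c for b in range(scores.shape[0])
                for c in range(scores.shape[1]) if scores[b, c] >= P_th})
```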
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Experiments</title>
      <p>To demonstrate the proposed photo classification, experiments were performed with
the official database of the MPEG-7 visual core experiment 2 (VCE-2) test data set
that comprises 3086 real home photos. The goal of the MPEG-7 VCE-2 is to verify
the usefulness of the MPEG-7 visual descriptors for photo classification. All of the
photos in the database were contributed by several participants in the MPEG-7 VCE-2.
The MPEG-7 VCE-2 also provides corresponding ground truth (GT) set for the
databases.</p>
      <p>The official GT set is given by seven semantic classes that would popularly appear
in home photos. It was cross-verified by several participants in the MPEG-7 VCE-2
who are experts in content-based image analysis. The seven semantic classes include
‘architecture’, ‘indoor’, ‘terrain’, ‘night’, ‘snowscape’, ‘waterside’, and ‘sunset’. Note
that the GT set was strictly made to avoid missing any human visual preference in
browsing photos. That is, an important rule in the GT decision was that a photo could
be labeled with one or more semantic classes of which a scene could be detectable by
the human eye. Therefore, many of the photos were labeled by multiple classes.</p>
      <p>Totally independent of the test data set, 1597 photos were used as training data.
They were also from the MPEG-7 VCE-2 official training data set. Of the training set,
800 were general home photos, and 797 were from the Corel photo collection.
For training the local semantic classifiers, we patched the training photos into local regions
and then manually selected positive and negative samples for each class from the
sub-photo collection by human visual perception. The negative samples for each concept
were randomly selected from the positive samples of all other concepts.</p>
      <p>
        For learning local semantics, multiple low-level visual features are extracted from
the patched photo database. For this, five MPEG-7 descriptors are employed for color
and texture features [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: color structure (CS), color layout (CL), and scalable
color (SC) descriptors are used for color features; and homogeneous texture (HT) and
edge histogram (EH) descriptors are used for texture features.
      </p>
      <p>In this paper, we build nine important families of concepts that frequently
appear in local regions of general home photos. The families of the local concepts
consist of ‘ground’, ‘human’, ‘indoor’, ‘mountain’, ‘night’, ‘plant’, ‘sky’, ‘structure’,
and ‘water’. The concept families are sub-divided into the 34 local concepts as follows:
- Seven ‘ground’ concepts: ‘gravel’, ‘park’, ‘pavement’, ‘road’, ‘rock’, ‘sand’, and
‘sidewalk’;
- Two ‘human’ concepts: ‘face’ and ‘people’;
- Two ‘indoor’ concepts: ‘indoor’ and ‘indoor-light’;
- Three ‘mountain’ concepts: ‘field’, ‘peak’, and ‘wood’;
- Two ‘night’ concepts: ‘night’ and ‘street-light’;
- Three ‘plant’ concepts: ‘flowers’, ‘leaves’, and ‘trees’;
- Four ‘sky’ concepts: ‘cloudy’, ‘sunny’, ‘sunset’, and ‘sunset-on-mountain’;
- Five ‘structure’ concepts: ‘brick’, ‘arch’, ‘buildings’, ‘wall’, and ‘windows’;
- Six ‘water’ concepts: ‘beach’, ‘high-wave’, ‘low-wave’, ‘still water’, ‘mirrored
water’, and ‘ice (snow)’.</p>
      <p>Accuracy, recall, and precision are well-known measures for evaluating classification
performance. By the general definitions, accuracy = (TP + TN) / (total number of
samples), recall = TP / (TP + FN), and precision = TP / (TP + FP), where TP, TN, FP,
and FN stand for ‘true positive’ when the case is positive and predicted positive, ‘true
negative’ when the case is negative and predicted negative, ‘false positive’ when the
case is negative but predicted positive, and ‘false negative’ when the case is positive
but predicted negative, respectively.</p>
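      <p>The measures defined above can be computed directly from prediction lists; the example data here is made up for illustration:</p>

```python
# Accuracy, recall, and precision from TP/TN/FP/FN counts.
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy ground truth
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # toy predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
```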
      <p>The sigmoid parameters were calculated for each local semantic classifier. Fig. 3
shows the histogram of positive and negative samples for the indoor classifier. The solid
line is the class-conditional probability of negative samples, while the dashed line is
that of positive samples. As shown in Fig. 3, the histogram is not Gaussian, probably
due to the small amount of training data. Fig. 4 is derived by using Bayes’ rule on the
histogram estimates of the class-conditional densities. The sigmoid fit works well, as
can be seen in Fig. 4.</p>
      <p>First, we measured classification performance without local semantic features, i.e.,
with only global low-level features. Column (a) in Table 1 shows the average
performance for each global concept. The average performance was measured with a
threshold showing the minimum difference between recall and accuracy. The results
show that the night class has the best performance, at about 90%, and the architecture class the
worst, at about 61%. To verify the usefulness of the two-layered classification scheme,
we also measured classification performance with local low-level features and local
semantic features. Column (b) in Table 1 shows the average result for each global
concept. Adding local semantic features made global semantic classification perform
much better in the indoor class, compared with the case of using only global low-level
features. Thus, local semantic features would be useful for catching local indoor
semantics. In the other classes, recall and accuracy slightly increased.</p>
      <p>The camera metadata includes exposure time (ET), aperture number (AN),
focal length (FL), and whether the flash fired (FF). Note
that the camera metadata is considered only for the indoor/outdoor and
night/daytime classes, since it would not be useful for the other semantic classes.</p>
      <p>Given this constraint, to employ the camera metadata in local semantic
classification, we first constructed two local semantic classifiers: an indoor/outdoor
classifier and a night/daytime classifier, shown in Fig. 5. Fig. 5-(a) shows the
indoor/outdoor classifier, which outputs probability values for the indoor and
outdoor classes using several useful camera metadata as syntactic features.
Similarly, Fig. 5-(b) shows the night/daytime classifier, which outputs probability
values for the night and daytime classes. To associate the two classifiers with the
34 local concepts, we use the classification scheme shown in Fig. 5-(c). The first
step classifies the input camera metadata into the indoor or outdoor class: the
indoor probability is assigned to indoor classes and the outdoor probability to
outdoor classes. The second step classifies the input camera metadata into the night
or daytime class: the night probability is assigned to night classes and the daytime
probability to daytime classes, which include the ground, human, mountain, sky,
structure, plant, and water classes.</p>
      <p>[Fig. 5. Local semantic classification with camera metadata: (a) indoor/outdoor classifier, (b) night/daytime classifier, (c) combination of the two classifiers to detect local photo semantics.]</p>
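The two-step assignment can be sketched as follows. The paper routes the indoor/outdoor probability first and the night/daytime probability second; multiplying the outdoor probability into the night and daytime groups is one plausible reading of that cascade, and the concept names below are illustrative placeholders rather than the paper's 34 local concepts.

```python
# Hypothetical concept groups standing in for the paper's local concepts.
INDOOR = {"furniture", "wall"}
NIGHT = {"night_sky", "night_structure"}
DAYTIME = {"ground", "human", "mountain", "sky", "structure", "plant", "water"}

def metadata_priors(p_indoor, p_night):
    """Assign metadata-based probabilities to every local concept:
    indoor concepts get the indoor probability; outdoor concepts get the
    outdoor probability, further split by the night/daytime output."""
    p_outdoor = 1.0 - p_indoor
    p_daytime = 1.0 - p_night
    priors = {}
    for c in INDOOR:
        priors[c] = p_indoor
    for c in NIGHT:
        priors[c] = p_outdoor * p_night
    for c in DAYTIME:
        priors[c] = p_outdoor * p_daytime
    return priors

priors = metadata_priors(p_indoor=0.2, p_night=0.9)
```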
      <p>
        The proposed method was also compared with related work that uses a Bayesian
network classifier with global visual features and camera metadata [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The main difference between our method and Boutell's is that ours provides a
scheme for employing local semantic features, in particular through the two-layered
SVM classifier. We expect the proposed method to outperform the conventional one in
local photo semantic classification. Table 3 shows the categorization results of the
two methods; the training and testing data were the same as in the experiment above.
      </p>
      <sec id="sec-3-1">
        <title>Performance</title>
        <p>As the results show, almost all categories except architecture were detected better by the proposed method than by the conventional method. For the indoor and terrain categories both methods performed similarly, but the proposed method detected the other categories, such as night, snowscape, sunset, and waterside, much better.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
      <p>This paper has presented a scheme that employs syntactic features, namely
camera metadata, for semantic classification. We adopt a two-layered approach to
detect local and global photo semantics. Camera metadata provide useful cues that
are independent of photo content, facilitating the discovery of photo semantics. Our
approach is characterized by two schemes: one incorporates syntactic features
alongside low-level visual features to detect local photo semantics; the other uses
the local photo semantics as features to detect global photo semantics. Concept
merging is also proposed to select the more likely semantic concepts on overlapping
local regions. The efficacy of the proposed categorization method was demonstrated
on 3086 photos from the official MPEG-7 VCE-2 database. The experimental results
showed that the proposed method is useful for detecting the multiple semantic
meanings of generic home photos. In future work, we will extend the proposed
classification scheme to other syntactic features. In addition, we need to compare
the proposed method with other similar approaches, such as Boutell's Bayesian
network method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Platt</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Probabilistic outputs for support vector machines and comparison to regularized likelihood methods</article-title>
          . In:
          <string-name>
            <surname>Smola</surname>
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartlett</surname>
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schölkopf</surname>
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schuurmans</surname>
            <given-names>D.</given-names>
          </string-name>
          (eds.): Advances in Large Margin Classifiers. MIT Press, Cambridge (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lin</surname>
            <given-names>H.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>C.J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Weng</surname>
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>A note on Platt's probabilistic outputs for support vector machines</article-title>
          .
          <source>Technical report</source>
          , National Taiwan University (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>Exchangeable image file format for digital still cameras</article-title>
          ,
          <source>JEITA CP-3451</source>
          ,
          Japan Electronics and Information Technology Industries Association
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Loui</surname>
            <given-names>A.C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Savakis</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automated event clustering and quality screening of consumer pictures for digital albuming</article-title>
          .
          <source>IEEE Trans. on Multimedia</source>
          .
          <volume>5</volume>
          (
          <issue>3</issue>
          ) (
          <year>2003</year>
          )
          <fpage>390</fpage>
          -
          <lpage>402</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lim</surname>
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            <given-names>Q.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mulhem</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Home photo content modeling for personalized eventbased retrieval</article-title>
          .
          <source>IEEE Trans. on Multimedia</source>
          .
          <volume>10</volume>
          (
          <issue>4</issue>
          ) (
          <year>2003</year>
          )
          <fpage>24</fpage>
          -
          <lpage>37</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cooper</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foote</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girgensohn</surname>
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wilcox</surname>
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Temporal event clustering for digital photo collections</article-title>
          .
          <source>Proc. of ACM Multimedia</source>
          .
          (
          <year>2003</year>
          )
          <fpage>364</fpage>
          -
          <lpage>373</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yang</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            <given-names>H.K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ro</surname>
            <given-names>Y.M.</given-names>
          </string-name>
          :
          <article-title>Category Classification using Multiple MPEG-7 Descriptors</article-title>
          . CISST.
          <volume>1</volume>
          (
          <year>2002</year>
          )
          <fpage>396</fpage>
          -
          <lpage>401</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yang</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            <given-names>J.H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ro</surname>
            <given-names>Y.M.</given-names>
          </string-name>
          :
          <article-title>Automatic Image Categorization using MPEG-7 Description</article-title>
          .
          <source>Proc. of SPIE Electronic Imaging on Internet Imaging</source>
          .
          <volume>5018</volume>
          (
          <year>2003</year>
          )
          <fpage>139</fpage>
          -
          <lpage>147</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Boutell</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Beyond pixels: Exploiting camera metadata for photo classification</article-title>
          .
          <source>Pattern Recognition</source>
          .
          <volume>38</volume>
          (
          <year>2005</year>
          )
          <fpage>935</fpage>
          -
          <lpage>946</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Müller</surname>
            <given-names>K.-R.</given-names>
          </string-name>
          :
          <article-title>An introduction to kernel-based learning algorithms</article-title>
          .
          <source>IEEE Trans. on Neural Networks</source>
          .
          <volume>12</volume>
          (
          <issue>2</issue>
          ) (
          <year>2001</year>
          )
          <fpage>181</fpage>
          -
          <lpage>201</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ro</surname>
            <given-names>Y.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            <given-names>H.K.</given-names>
          </string-name>
          :
          <article-title>Hierarchical rotational invariant similarity measurement for MPEG-7 homogeneous texture descriptor</article-title>
          .
          <source>Electronics Letters</source>
          .
          <volume>36</volume>
          (
          <issue>15</issue>
          ) (
          <year>2000</year>
          )
          <fpage>1268</fpage>
          -
          <lpage>1270</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yang</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            <given-names>J.H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ro</surname>
            <given-names>Y.M.</given-names>
          </string-name>
          :
          <article-title>Automatic Image Categorization using MPEG-7 Description</article-title>
          .
          <source>Proc. of SPIE Electronic Imaging on Internet Imaging</source>
          .
          <volume>5018</volume>
          (
          <year>2003</year>
          )
          <fpage>139</fpage>
          -
          <lpage>147</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V.N.</given-names>
          </string-name>
          :
          <article-title>The Nature of Statistical Learning Theory</article-title>
          , second ed. Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>