<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Photography Aesthetics with Deep CNNs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gautam Malu∗</string-name>
          <email>gautam.malu@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raju S. Bapi</string-name>
          <email>raju.bapi@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bipin Indurkhya</string-name>
          <email>bipin@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ACM Reference format: Gautam Malu, Raju S. Bapi, and Bipin Indurkhya. 2017. Learning Photography Aesthetics with Deep CNNs. In Proceedings of The 28th Modern Artificial Intelligence and Cognitive Science Conference, Purdue University Fort Wayne</institution>
          ,
          <addr-line>April 2017 (MAICS'17), 8 pages.</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <institution>&amp; University of Hyderabad</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>129</fpage>
      <lpage>136</lpage>
      <abstract>
        <p>Automatic photo aesthetic assessment is a challenging artificial intelligence task. Existing computational approaches have focused on modeling a single aesthetic score or class (good or bad photo); however, these do not provide any details on why the photograph is good or bad, or on which attributes contribute to the quality of the photograph. To obtain both accuracy and human interpretability, we advocate learning the aesthetic attributes along with the prediction of the overall score. For this purpose, we propose a novel multi-task deep convolution neural network (DCNN), which jointly learns eight aesthetic attributes along with the overall aesthetic score. We report near-human performance in the prediction of the overall aesthetic score. To understand the internal representation of these attributes in the learned model, we also develop a visualization technique based on back-propagation of gradients. These visualizations highlight the important image regions for the corresponding attributes, thus providing insights about the model's understanding of these attributes. We showcase the diversity and complexity associated with different attributes through a qualitative analysis of the activation maps.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Aesthetics is the study of the concepts and perception
of beauty. Although the aesthetics of a photograph is subjective, some
aspects of it depend on standard photography practices and
general visual design rules. With the ever-increasing volume of
digital photographs, automatic aesthetic assessment is becoming
increasingly useful for various applications, such as personal
photo assistants, photo management, photo enhancement, and image
retrieval. Conventionally, automatic aesthetic assessment tasks
have been modeled either as a regression problem (single aesthetic
score) [
        <xref ref-type="bibr" rid="ref27 ref7 ref8">7, 8, 27</xref>
        ] or as a classification problem (aesthetically good or
bad photograph) [
        <xref ref-type="bibr" rid="ref13 ref14 ref2">2, 13, 14</xref>
        ].
      </p>
      <p>
        Intensive data-driven approaches have made substantial progress
on this task, although it is very subjective and context dependent.
Earlier approaches used custom-designed features based on
photography rules (e.g., focus, color harmony, contrast, lighting,
rule of thirds) and semantic information (e.g., human profile, scene
category) derived from low-level image descriptors (e.g., color histograms,
wavelet analysis) [
        <xref ref-type="bibr" rid="ref1 ref12 ref15 ref16 ref2 ref21 ref24 ref4 ref9">1, 2, 4, 9, 12, 15, 16, 21, 24</xref>
        ] and generic image
descriptors [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. With the evolution of deep learning based
techniques, recent approaches have introduced deep convolution neural
networks (DCNN) in aesthetic assessment tasks [
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref26 ref8">8, 10, 13, 14, 26</xref>
        ].
      </p>
      <p>
        Although these approaches give near-human performance in
classifying whether a photograph is “good" or “bad", they do not
give detailed insights or explanations for such claims. For example, if
a photograph received a bad rating, one would not get any insights
about the attributes (e.g., poor lighting, dull colors) that led to
that rating. We propose an approach in which we identify eight
such attributes (such as Color Harmony, Depth of Field) and
report those along with the overall score. For this purpose, we
propose a multi-task deep convolution network (DCNN) which
simultaneously learns the eight aesthetic attributes along with the
overall aesthetic score. We train and test our model on the recently
released aesthetics and attribute database (AADB) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The eight attributes, as defined in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], are shown in Figure 1.
      </p>
      <p>We also develop attribute activation maps (Figure 3) for the
visualization of these attributes. These maps highlight the salient regions
for the corresponding attribute, thus providing us insights about
the representation of these attributes in our trained model.
In summary, the following are the main contributions of our paper:
(1) We propose a novel deep learning based approach which
simultaneously learns eight aesthetic attributes along with
the overall score. These attributes enable us to provide
more detailed feedback in automatic aesthetic assessment.
(2) We also develop localized representations of these attributes
from our learned model, which we call attribute activation
maps (Figure 3). These maps provide more insights about the
model's interpretation of the attributes.</p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>
        Most of the earlier works used low-level image features to
design high-level aesthetic attributes as mid-level features and trained an
aesthetic classifier over these features. Datta et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed 56
visual features based on standard photography and visual design
rules to encapsulate aesthetic attributes from low-level image
features. Dhar et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] divided aesthetic attributes into three categories:
Compositional (e.g., depth of field, rule of thirds), Content (e.g., faces,
animals, scene types), and Sky-Illumination (e.g., clear sky, sunset sky).
They trained individual classifiers for these attributes from
low-level features (e.g., color histograms, center-surround wavelets,
Haar features) and used the outputs of these classifiers as input features
for the aesthetic classifier.
      </p>
      <p>
        Marchesotti et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] proposed to learn aesthetic attributes
from textual comments on the photographs using generic image
features. Despite increased performance, many of these textual
attributes (good, looks great, nice try) do not map to well-defined
visual characteristics. Lu et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed to learn several
meaningful style attributes, and used these to fine-tune the training of an
aesthetics classification network. Kong et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed an attribute-
and content-adaptive DCNN for aesthetic score prediction.
      </p>
      <p>However, none of the previous works report the aesthetic
attributes themselves; these attributes are used only as features to predict
the overall aesthetic score or class. In this paper, we learn the aesthetic
attributes along with the overall score, not just as intermediate
features but as auxiliary information. Aesthetic assessment is
relatively easier in images with evidently high or low aesthetics than in
ordinary images with marginal aesthetics (Figure 2). For such
images, attribute information would greatly supplement the quality
of feedback from an automatic aesthetic assessment system.</p>
      <p>
        Recently, deep learning techniques have shown significant
performance gains in various computer vision tasks such as object
classification and localization [
        <xref ref-type="bibr" rid="ref11 ref23 ref25">11, 23, 25</xref>
        ]. In deep learning, non-linear
features are learned in a hierarchical fashion of increasing
complexity (e.g., colors, edges, objects), and the aesthetic attributes can be learned as
combinations of these features. Deep learning techniques have shown
significant performance gains in comparison with traditional machine learning
approaches for aesthetic assessment tasks [
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref26 ref8">8, 10, 13, 14, 26</xref>
        ]. Unlike in
traditional machine learning techniques, the features are also learned during training
in deep learning techniques. However, the internal representations of
DCNNs are still opaque. Various visualization techniques [
        <xref ref-type="bibr" rid="ref17 ref22 ref28 ref29 ref30 ref5">5, 17, 22, 28–30</xref>
        ]
have been proposed to visualize the internal representations of DCNNs
in an attempt to better understand how they work. However,
these visualization techniques have not been applied to aesthetic assessment
tasks. In this article, we apply the gradient-based visualization technique
proposed by Zhou et al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] to obtain attribute activation maps. These maps
provide localized representations of the attributes. Additionally, we
apply a similar visualization technique [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] to the model provided by Kong
et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to obtain similar maps for a qualitative comparison of our results
with the earlier approach.
      </p>
      <sec id="sec-2-2">
        <title>3.1 Network Architecture</title>
        <p>
          We use the deep residual network (ResNet50) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to train all the attributes
along with the overall aesthetic score. ResNet50 has 50 layers, which can
be divided into 16 successive residual blocks. Each residual block contains
3 convolution layers, each followed by a batch normalization layer (Figure 3).
Each residual block is followed by a rectified linear activation layer (ReLU)
[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. We take the rectified convolution maps from the ReLU outputs of
all 16 residual blocks, and pool features from each of these 16 blocks
with a global average pooling (GAP) layer. The GAP layer gives the spatial
average of these rectified convolution maps. We then concatenate all these
pooled features and use the result as the input to a fully connected layer, which
produces the desired outputs (aesthetic attributes and the overall score)
as shown in Figure 3. We model the attribute and score prediction as a
regression problem with mean squared error as the loss function. Due to this
simple connectivity structure, we are able to identify the importance of
image regions by projecting the weights of the output layer onto the
rectified convolution maps, a technique we call attribute activation mapping.
This technique was first introduced by Zhou et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] to get class activation
maps for different semantic classes in image classification tasks.
        </p>
      </sec>
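The pool-and-concatenate head described above can be sketched numerically. This is a minimal NumPy illustration, not the trained model: the block outputs are random stand-ins for the ReLU maps, the channel counts follow ResNet50's four stages (3×256, 4×512, 6×1024, 3×2048), and the fully connected weights are untrained placeholders.

```python
import numpy as np

def global_average_pool(fmap):
    # fmap: (channels, h, w) rectified convolution maps from one residual block.
    # GAP replaces each channel's map with its spatial mean: one scalar per channel.
    return fmap.mean(axis=(1, 2))

rng = np.random.default_rng(0)

# Mock ReLU outputs of the 16 residual blocks (non-negative, as after ReLU).
block_channels = [256] * 3 + [512] * 4 + [1024] * 6 + [2048] * 3
block_outputs = [np.maximum(rng.standard_normal((c, 10, 10)), 0)
                 for c in block_channels]

# Concatenate the pooled features from all 16 blocks into one descriptor.
features = np.concatenate([global_average_pool(f) for f in block_outputs])

# One fully connected layer maps the descriptor to 9 outputs:
# 8 aesthetic attributes + 1 overall score (weights are random stand-ins).
W = rng.standard_normal((9, features.size)) * 0.01
b = np.zeros(9)
outputs = W @ features + b
print(outputs.shape)  # (9,)
```

The concatenated descriptor has 3·256 + 4·512 + 6·1024 + 3·2048 = 15104 dimensions, so the single output layer stays small compared to a stack of fully connected layers.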
    </sec>
    <sec id="sec-3">
      <title>3.2 Attribute Activation Mapping</title>
      <p>For a given image, let fk(x, y) represent the activation of unit k in the
rectified convolution map at spatial location (x, y). Then, for unit k, the
result of performing global average pooling is Fk = Σx,y fk(x, y). Thus, for
a given attribute a, the input to the regression layer, Ra, is Σk wka Fk, where
wka is the weight corresponding to attribute a for unit k. Essentially, wka
indicates the importance of Fk for attribute a, as shown in Figure 3.</p>
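This mapping can be checked numerically. In the sketch below the maps and regression weights are random stand-ins (the shapes are illustrative, not taken from the paper); projecting the weights wka back onto the maps fk gives a spatial map Ma(x, y) = Σk wka fk(x, y) whose spatial sum recovers the regression input Ra.

```python
import numpy as np

rng = np.random.default_rng(1)
K, H, W = 64, 10, 10                                  # units and map size (illustrative)
f = np.maximum(rng.standard_normal((K, H, W)), 0)     # f_k(x, y), ReLU-rectified maps
w_a = rng.standard_normal(K) * 0.1                    # w_k^a, weights for attribute a

# Global average pooling: F_k = sum over (x, y) of f_k(x, y)
# (using the sum instead of the mean only rescales by a constant).
F = f.sum(axis=(1, 2))

# Input to the regression layer for attribute a: R_a = sum over k of w_k^a * F_k
R_a = float(w_a @ F)

# Attribute activation map: M_a(x, y) = sum over k of w_k^a * f_k(x, y).
M_a = np.tensordot(w_a, f, axes=1)

# Summing M_a over space recovers R_a, which is why the map localizes
# the image regions that drive the attribute score.
assert np.isclose(M_a.sum(), R_a)
```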
      <p>
        We also synthesized similar attribute maps from the model proposed by
Kong et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We did not have the final attribute- and content-adapted
model from [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] due to patent rights, but Kong et al. shared the attribute-adapted
model with us. That model is based on the AlexNet architecture [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
consisting of fully connected layers along with convolution layers. In this
architecture, the outputs of the convolution layers are separated from the desired
outputs by three stacked fully connected layers, and the outputs of the last FC layer
are the regression scores of the attributes. Here we compute the weight
of unit k for attribute a as the summation of the gradients gka of the output with
respect to the k-th convolution map: wka = Σx,y gka(x, y). This technique was
first introduced by Selvaraju et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] to get class activation maps for
different semantic classes and visual explanations (answers for questions).
      </p>
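For the AlexNet-style network, where fully connected layers sit between the maps and the outputs, the weights come from gradients rather than from the output layer. A NumPy sketch with made-up maps and gradients (in the real model, gka would be obtained by backpropagating the attribute output through the network):

```python
import numpy as np

rng = np.random.default_rng(2)
K, H, W = 64, 10, 10
f = np.maximum(rng.standard_normal((K, H, W)), 0)  # conv maps f_k(x, y)
g = rng.standard_normal((K, H, W))                 # stand-in gradients g_k^a(x, y) of the
                                                   # attribute output w.r.t. the k-th map

# Gradient-derived weights: w_k^a = sum over (x, y) of g_k^a(x, y)
w_a = g.sum(axis=(1, 2))

# Project the weights onto the maps to get the attribute activation map.
M_a = np.tensordot(w_a, f, axes=1)
print(M_a.shape)  # (10, 10)
```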
    </sec>
    <sec id="sec-4">
      <title>3.3 Implementation Details</title>
      <p>
        Out of the 10000 samples present in the AADB dataset, we have trained our
model on 8500 training samples; 500 and 1000 images have been set aside for
validation and testing purposes, respectively. As the number of training
samples (8500) is not adequate for training such a deep network (23,715,852
parameters) from scratch, we used a pre-trained ResNet50. It was trained
on the 1000-class ImageNet classification dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with approximately 1.2
million images. We fixed the input image size to 299 × 299. We used
horizontal flips of the input images as a data augmentation technique. The last
residual block gives convolution maps of size 10 × 10, so we reduce the
convolution maps from the previous Res-Blocks to the same size with
appropriately sized average pooling. As ResNet50 has batch normalization
layers, it is very sensitive to batch size. We fixed the batch size to 16 and
trained for 16 epochs. We report our model's performance on the test set (1000
images) provided in AADB. We have made our implementation publicly
available1.
1https://github.com/gautamMalu/Aesthetic_attributes_maps
      </p>
      <p>
        As mentioned earlier, we have used the aesthetics and attribute database
(AADB) provided by Kong et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. AADB provides overall ratings for
the photographs along with ratings on the eleven aesthetic attributes
listed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (Figure 1). Users were asked to provide information
about the effect of these attributes on the overall aesthetic score. For
example, if object emphasis contributes positively towards the overall
aesthetics of a photograph, the user gives a score of +1 for the attribute;
if the object is not emphasized adequately and this contributes negatively
towards the overall aesthetic score of the photograph, the user gives a score
of -1 for the attribute (see Figure 5). The users also rated the overall aesthetic
score on a scale of 1 to 5, with 5 being the most aesthetically pleasing score.
Each image was rated by at least 5 persons. The mean score was taken as
the ground truth score for all attributes and the overall score.
      </p>
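The input pipeline described above (fixed 299 × 299 inputs, horizontal-flip augmentation, batches of 16) can be sketched as follows. The random image and the flip probability of 0.5 are stand-ins: the text fixes the input size and batch size but does not state the flip probability.

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.random((299, 299, 3))   # an input image already warped to 299 x 299

def augment(image, rng):
    # Horizontal flip -- the only augmentation the text mentions.
    # The 0.5 probability is an assumption for this sketch.
    if rng.random() < 0.5:
        return image[:, ::-1, :]  # reverse the width axis
    return image

# Assemble one training batch of the fixed size 16.
batch = np.stack([augment(img, rng) for _ in range(16)])
print(batch.shape)  # (16, 299, 299, 3)
```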
      <p>
        If an attribute has enhanced the image quality, it was rated positively,
and if the attribute has degraded the image aesthetics, it was rated
negatively. The default zero (null) means the attribute does not affect the image
aesthetics. For example, a positive vivid color rating means the vividness of the color
in an image has a positive effect on the image aesthetics, while
a negative vivid color rating means the image has a dull color composition. All the
attributes except for Repetition and Symmetry are normalized to the range
of [-1, 1]. Repetition and Symmetry are normalized to the range of [0, 1], as
negative values are not justifiable for these two attributes. The overall score
is normalized to the range of [0, 1]. Out of these eleven attributes, we omit
the Symmetry, Repetition and Motion Blur attributes from our experiment, as
most of the images were rated null for these attributes (Figure 4).
      </p>
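The rating aggregation can be illustrated as follows. The rater values are invented for the sketch, and the linear rescaling of the 1–5 overall score to [0, 1] is our assumption about the normalization; the text states only the target ranges.

```python
import numpy as np

# Illustrative data: each image gets >= 5 ratings per attribute in {-1, 0, +1}
# and an overall score in 1..5 (values below are made up for the sketch).
attr_ratings = np.array([[+1, 0, +1, -1, +1],   # e.g. Vivid Color, 5 raters
                         [ 0, 0, -1, -1,  0]])  # e.g. Object Emphasis
overall_ratings = np.array([4, 5, 3, 4, 4])

# Ground truth = mean over raters; attribute means already lie in [-1, 1].
attr_gt = attr_ratings.mean(axis=1)
overall_gt = overall_ratings.mean()

# Overall score rescaled from [1, 5] to [0, 1] (assumed linear mapping).
overall_norm = (overall_gt - 1) / 4
print(attr_gt, overall_norm)
```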
      <sec id="sec-4-1">
        <title>4 RESULTS</title>
        <p>
          To evaluate the aesthetic attribute scores predicted by our model, we report
the Spearman's ranking correlation coefficient (ρ) between the estimated
aesthetic attribute score and the corresponding ground truth score for the
testing data. The ranking correlation coefficient (ρ) evaluates the monotonic
relationship between estimated scores and ground truth scores, hence there
is no need for explicit calibration between them. The correlation coefficient
lies in the range of [-1, 1], with greater values corresponding to higher
correlation and vice-versa. For baseline comparison, we also train a model
by fine-tuning a pre-trained ResNet50 and label it ResNet50-FT.
Fine-tuning here refers to modifying the last layer of the pre-trained ResNet50
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and training it for our aesthetic attribute prediction task. Table 1 lists
the performance on AADB using the two approaches. We also report the
performance of the model shared by Kong et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          It should be noted that the Spearman's coefficient between the estimated
overall aesthetic score and the corresponding ground truth reported by Kong
et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was 0.678.2 They did not report any metrics for the other aesthetic
attributes. They used a ranking loss along with mean squared error as loss
functions, and their final approach was also content adaptive. As can be seen
from the results reported in Table 1, our model managed to outperform their
approach on the overall aesthetic score in spite of being trained only with mean
squared error and without any content-adaptive framework. Our model
significantly underperformed for the Rule of Thirds and Balancing Elements
attributes. These are location-sensitive attributes: Rule of Thirds
deals with the positioning of the salient elements, and Balancing Elements deals
with the relative positioning of objects with respect to each other and the frame. In our
model, due to the use of global average pooling (GAP) layers after the activation
layers, we lose location specificity. We selected the GAP layer to reduce
the number of parameters, since the number of training samples (8500) allows
learning of only a small parameter space. We also warp the input images
to the fixed input size (299 × 299), thus destroying the aspect ratio. These
could be possible reasons for the under-performance of the model on these
compositional and location-sensitive attributes. Across all the attributes,
our proposed method reports better results than the fine-tuned ResNet50 model.
Our model performs better than the model provided by Kong et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for
five out of eight attributes.
2The ρ reported by Kong et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for their final content and attribute adaptive model
is 0.678; here we are reporting the performance of the model shared by them.
        </p>
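Spearman's ρ as used here is the Pearson correlation of the rank-transformed scores, so no calibration between predicted and ground-truth scales is needed. The predicted and ground-truth values below are invented for illustration, and ties are ignored in this minimal version (a library routine such as scipy.stats.spearmanr handles ties properly).

```python
import numpy as np

def spearman_rho(a, b):
    # Spearman's rho = Pearson correlation of the ranks.
    # argsort twice turns values into ranks (assumes no ties).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Made-up predicted vs. ground-truth scores for 6 test images:
pred = np.array([0.61, 0.42, 0.77, 0.30, 0.55, 0.68])
gt   = np.array([0.58, 0.35, 0.80, 0.33, 0.71, 0.50])

# Only the ordering matters: any monotonic rescaling of pred leaves rho unchanged.
rho = spearman_rho(pred, gt)
```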
        <p>Aesthetic judgments are subjective in nature. In AADB, the ground-truth
score is the mean of the ratings given by different individuals. To quantify the
agreement between ratings, ρ between each individual's ratings and the ground-truth
scores was calculated; the average ρ is reported in Table 2. Our model actually
outperforms the human consistency (as measured by ρ) averaged across
all raters. However, when considering only the “power raters” who have
annotated more images, the human evaluators consistently outperform the model's
results.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 VISUALIZATION</title>
      <p>
        As mentioned above, we generate attribute activation maps for the different
attributes to get their localized representations. Here we omit the
Balancing Elements and Rule of Thirds attributes, as our
model's performance is very low for these attributes, as shown in Table 1.
For each attribute, we have analyzed the activation maps and present the
insights in this section. For illustration purposes, we have selected ten
samples for each attribute. Out of these ten samples, the first five are the highest
rated by our model, and the next five are the lowest rated. We selected
these samples from the 1000 test samples and not from the train samples. We
have also included the activation maps from the model provided by Kong et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
(Kong's model). These activation maps highlight the most important regions
for the given attributes. We refer to these activation maps as the “gaze" of the
model.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5.1 Object Emphasis</title>
      <p>Qualitative analysis of the activation maps for object emphasis shows
that the model gazes at the main object in the image. Even when the
model predicts a negative rating, i.e., the object is not emphasized, the model
searches for regions which contain objects (Figure 6). In comparison, the
activation maps from Kong's model are not always consistent, as can be seen in
the second row of activation maps in Figure 6. This suggests that our model
has learned object emphasis as an attribute that is indeed
related to objects.</p>
    </sec>
    <sec id="sec-7">
      <title>5.2 Content</title>
      <p>Interestingness of content is a significantly subjective and context-dependent
attribute. However, if a model is trained on this attribute, one would expect
the model to have maximum activation at the content of the image
while making this judgment. If there is a well-defined object in an image,
then that object is considered the content of the image, e.g., the 2nd and
3rd columns of Figure 7. Further, it can be observed in these columns that
our proposed approach is better at identifying the content than Kong's
model. Without the presence of explicit objects, the content of the image is
difficult to localize, e.g., the 1st and 5th columns of Figure 7. As shown in
Figure 7, our model's activation maps are maximally active at the content
of the image. In comparison, the activation maps from Kong's model are not
consistent.</p>
    </sec>
    <sec id="sec-8">
      <title>5.3 Depth of Field</title>
      <p>On analyzing the representations of shallow depth of field, it was observed
that the model looks for blurry regions near the main object of the image while
making the judgment, as showcased in Figure 8. The shallow depth of field
technique is used to make the subject of the photograph stand out from
its background, and the model's interpretation of it is in that direction. For the
images on which the model predicted a negative score for this attribute,
the activation maps are random. Activation maps from Kong's model
showcase similar behavior; these maps are more active at the corners of
the images.</p>
    </sec>
    <sec id="sec-9">
      <title>5.4 Vivid Color</title>
      <p>Vivid Color means the presence of bright and bold colors, and the model's
interpretation of this attribute seems to be along these lines. As shown in
Figure 9, the model gazes at vivid color areas while making the judgment about
this attribute. For example, in the 2nd column of Figure 9, the pink color of the
flowers and scarf, and in the 3rd column the butterfly and flower, were the most
activated regions. We could not find any pattern in the activation maps
from Kong's model.</p>
    </sec>
    <sec id="sec-10">
      <title>5.5 Light</title>
      <p>Good lighting is quite a challenging concept to grasp. It does not merely
depend on the light in the photograph, but rather on how that light
complements the whole composition. As shown in Figure 10, most of the time the
model seems to look at bright light, or the source of the light, in the photograph.
Although the model's behavior is consistent, its understanding of this attribute
is incomplete. This is also evident in the low correlation ratings of our
proposed model for this attribute, as reported in Table 1.</p>
    </sec>
    <sec id="sec-11">
      <title>5.6 Color Harmony</title>
      <p>Although the model's performance is significant for this attribute, we could
not find any consistent pattern in its activation maps. As color harmony
is of many types, e.g., analogous, complementary, triadic, it is difficult to
get a single representation pattern. For example, in the first example shown
in Figure 11, the green color of the hills is in analogous harmony with the blue
color of the water and sky; in the 3rd example, the brown sand color is in split
complementary harmony with blue and green. The attribute activation
maps for Color Harmony are shown in Figure 11.</p>
    </sec>
    <sec id="sec-12">
      <title>6 CONCLUSION</title>
      <p>In this paper, we have proposed a deep convolution neural network (DCNN)
architecture to learn aesthetic attributes. Results show that the estimated scores
of five aesthetic attributes (Interestingness of Content, Object Emphasis,
shallow Depth of Field, Vivid Color, and Color Harmony) correlate
significantly with their respective ground truth scores, whereas for
attributes such as Balancing Elements, Light and Rule of Thirds the
correlation is inferior. The activation maps corresponding to the learned aesthetic
attributes such as object emphasis, content, depth of field and vivid color
indicate that the model has acquired an internal representation suitable to
highlight these attributes automatically. However, for color harmony and
light, the visualization maps were not consistent.</p>
      <p>
        Aesthetic judgment involves a degree of subjectivity. For example, in
AADB the average correlation between the mean score and an individual's
score for the overall aesthetic score is 0.67. Moreover, as reported by Kong
et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a model learned on a particular dataset might not work on a
different dataset. Considering all these factors, the empirical validity of aesthetic
judgment models is still a challenge. We suggest that the visualization
techniques presented in the current work are a step forward in that direction.
Empirical validation could proceed by asking subjects to annotate the images
(identifying the regions that correspond to different aesthetic attributes);
these empirical maps could in turn be compared with the predicted
maps of the model. Such experiments need to be conducted in future to
validate the current approach.
      </p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors would like to thank Ms. Shruti Naik and Mr. Yashaswi Verma
of International Institute of Information Technology, Hyderabad, India for
their help in manuscript preparation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Subhabrata</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          , Rahul Sukthankar, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A framework for photo-quality assessment and enhancement based on visual aesthetics</article-title>
          .
          <source>In Proceedings of the 18th ACM international conference on Multimedia. ACM</source>
          ,
          <volume>271</volume>
          -
          <fpage>280</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ritendra</given-names>
            <surname>Datta</surname>
          </string-name>
          , Dhiraj Joshi,
          <string-name>
            <given-names>Jia</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and James Z</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Studying aesthetics in photographic images using a computational approach</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          . Springer,
          <fpage>288</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wei Dong, Richard Socher,
          <string-name>
            <given-names>Li-Jia</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on</source>
          . IEEE,
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sagnik</given-names>
            <surname>Dhar</surname>
          </string-name>
          , Vicente Ordonez, and Tamara L Berg.
          <year>2011</year>
          .
          <article-title>High level describable attributes for predicting aesthetics and interestingness</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2011 IEEE Conference on. IEEE</source>
          ,
          <fpage>1657</fpage>
          -
          <lpage>1664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Brox</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Inverting convolutional networks with convolutional networks</article-title>
          .
          <source>CoRR abs/1506.02753</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yueying</given-names>
            <surname>Kao</surname>
          </string-name>
          , Kaiqi Huang, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Maybank</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Hierarchical aesthetic quality assessment using deep convolutional neural networks</article-title>
          .
          <source>Signal Processing: Image Communication</source>
          <volume>47</volume>
          (
          <year>2016</year>
          ),
          <fpage>500</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yueying</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chong</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kaiqi</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Visual aesthetic quality assessment with a regression model</article-title>
          .
          <source>In Image Processing (ICIP)</source>
          ,
          <source>2015 IEEE International Conference on. IEEE</source>
          ,
          <fpage>1583</fpage>
          -
          <lpage>1587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yan</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Feng</given-names>
            <surname>Jing</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The design of high-level features for photo quality assessment</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          ,
          <year>2006</year>
          IEEE Computer Society Conference on, Vol.
          <volume>1</volume>
          . IEEE,
          <fpage>419</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Shu</given-names>
            <surname>Kong</surname>
          </string-name>
          , Xiaohui Shen,
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Radomir</given-names>
            <surname>Mech</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Charless</given-names>
            <surname>Fowlkes</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Photo aesthetics ranking network with attributes and content adaptation</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          . Springer,
          <fpage>662</fpage>
          -
          <lpage>679</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Li-Yun</given-names>
            <surname>Lo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ju-Chin</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A statistic approach for photo quality assessment</article-title>
          .
          <source>In Information Security and Intelligence Control (ISIC)</source>
          ,
          <source>2012 International Conference on. IEEE</source>
          ,
          <fpage>107</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Lu</surname>
          </string-name>
          , Zhe Lin, Hailin Jin, Jianchao Yang, and James Z. Wang
          .
          <year>2014</year>
          .
          <article-title>RAPID: Rating pictorial aesthetics using deep learning</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>457</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Lu</surname>
          </string-name>
          , Zhe Lin,
          <string-name>
            <given-names>Xiaohui</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Radomir</given-names>
            <surname>Mech</surname>
          </string-name>
          , and James Z Wang.
          <year>2015</year>
          .
          <article-title>Deep multi-patch aggregation network for image style, aesthetics, and quality estimation</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>990</fpage>
          -
          <lpage>998</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaogang</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Content-based photo quality assessment</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <source>2011 IEEE International Conference on. IEEE</source>
          ,
          <fpage>2206</fpage>
          -
          <lpage>2213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Yiwen</given-names>
            <surname>Luo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Photo and video quality evaluation: Focusing on the subject</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          . Springer,
          <fpage>386</fpage>
          -
          <lpage>399</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Aravindh</given-names>
            <surname>Mahendran</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Understanding deep image representations by inverting them</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>5188</fpage>
          -
          <lpage>5196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Marchesotti</surname>
          </string-name>
          , Naila Murray, and
          <string-name>
            <given-names>Florent</given-names>
            <surname>Perronnin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Discovering beautiful attributes for aesthetic image analysis</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>113</volume>
          ,
          <issue>3</issue>
          (
          <year>2015</year>
          ),
          <fpage>246</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Marchesotti</surname>
          </string-name>
          , Florent Perronnin, Diane Larlus, and
          <string-name>
            <given-names>Gabriela</given-names>
            <surname>Csurka</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Assessing the aesthetic quality of photographs using generic image descriptors</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <source>2011 IEEE International Conference on. IEEE</source>
          ,
          <fpage>1784</fpage>
          -
          <lpage>1791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Vinod</given-names>
            <surname>Nair</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Rectified linear units improve restricted Boltzmann machines</article-title>
          .
          <source>In Proceedings of the 27th international conference on machine learning (ICML-10)</source>
          .
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Masashi</given-names>
            <surname>Nishiyama</surname>
          </string-name>
          , Takahiro Okabe,
          <string-name>
            <given-names>Imari</given-names>
            <surname>Sato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoichi</given-names>
            <surname>Sato</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Aesthetic quality classification of photographs based on color harmony</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2011 IEEE Conference on. IEEE</source>
          ,
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Ramprasaath R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          , Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and
          <string-name>
            <given-names>Dhruv</given-names>
            <surname>Batra</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization</article-title>
          .
          <source>arXiv preprint arXiv:1610.02391</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Xiaoshuai</given-names>
            <surname>Sun</surname>
          </string-name>
          , Hongxun Yao, Rongrong Ji, and Shaohui Liu.
          <year>2009</year>
          .
          <article-title>Photo assessment based on computational visual attention model</article-title>
          .
          <source>In Proceedings of the 17th ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>541</fpage>
          -
          <lpage>544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Wei Liu, Yangqing Jia,
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Sermanet</surname>
          </string-name>
          , Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Xinmei</given-names>
            <surname>Tian</surname>
          </string-name>
          , Zhe Dong,
          <string-name>
            <given-names>Kuiyuan</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tao</given-names>
            <surname>Mei</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Query-dependent aesthetic model with deep learning for photo quality assessment</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>17</volume>
          ,
          <issue>11</issue>
          (
          <year>2015</year>
          ),
          <fpage>2035</fpage>
          -
          <lpage>2048</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Ou</given-names>
            <surname>Wu</surname>
          </string-name>
          , Weiming Hu, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Learning to predict the perceived visual quality of photos</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <source>2011 IEEE International Conference on. IEEE</source>
          ,
          <fpage>225</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Matthew D.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Visualizing and understanding convolutional networks</article-title>
          .
          <source>In European conference on computer vision</source>
          . Springer,
          <fpage>818</fpage>
          -
          <lpage>833</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
          <year>2014</year>
          .
          <article-title>Object detectors emerge in deep scene CNNs</article-title>
          .
          <source>arXiv preprint arXiv:1412.6856</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba.
          <year>2016</year>
          .
          <article-title>Learning deep features for discriminative localization</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>2921</fpage>
          -
          <lpage>2929</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>