=Paper=
{{Paper
|id=Vol-1964/NN2
|storemode=property
|title=Learning Photography Aesthetics with Deep CNNs
|pdfUrl=https://ceur-ws.org/Vol-1964/NN2.pdf
|volume=Vol-1964
|authors=Gautam Malu,Raju Surampudi Bapi,Bipin Indurkhya
|dblpUrl=https://dblp.org/rec/conf/maics/MaluSI17
}}
==Learning Photography Aesthetics with Deep CNNs==
Gautam Malu (International Institute of Information Technology, Hyderabad, India; gautam.malu@research.iiit.ac.in; corresponding author), Raju S. Bapi (International Institute of Information Technology, Hyderabad & University of Hyderabad, Hyderabad, India; raju.bapi@iiit.ac.in), Bipin Indurkhya (International Institute of Information Technology, Hyderabad, India; bipin@iiit.ac.in)

MAICS 2017, pp. 129–136. © 2017 Copyright held by author(s).

ABSTRACT

Automatic photo aesthetic assessment is a challenging artificial intelligence task. Existing computational approaches have focused on modeling a single aesthetic score or class (good or bad photo); however, these do not provide any details on why the photograph is good or bad, or which attributes contribute to the quality of the photograph. To obtain both accuracy and human interpretability, we advocate learning the aesthetic attributes along with the prediction of the overall score. For this purpose, we propose a novel multi-task deep convolution neural network (DCNN), which jointly learns eight aesthetic attributes along with the overall aesthetic score. We report near-human performance in the prediction of the overall aesthetic score. To understand the internal representation of these attributes in the learned model, we also develop a visualization technique using back-propagation of gradients. These visualizations highlight the image regions important for the corresponding attributes, thus providing insights into the model's understanding of these attributes. We showcase the diversity and complexity associated with different attributes through a qualitative analysis of the activation maps.

KEYWORDS

Photography, Aesthetics, Aesthetic Attributes, Deep Convolution Neural Network, Residual Networks

ACM Reference format: Gautam Malu, Raju S. Bapi, and Bipin Indurkhya. 2017. Learning Photography Aesthetics with Deep CNNs. In Proceedings of The 28th Modern Artificial Intelligence and Cognitive Science Conference, Purdue University Fort Wayne, April 2017 (MAICS'17), 8 pages.

1 INTRODUCTION

Aesthetics is the study of the concept and perception of beauty. Although the aesthetics of a photograph is subjective, some aspects of it depend on standard photography practices and general visual design rules. With the ever increasing volume of digital photographs, automatic aesthetic assessment is becoming increasingly useful for various applications, such as personal photo assistants, photo managers, photo enhancement, and image retrieval. Conventionally, automatic aesthetic assessment has been modeled either as a regression problem (a single aesthetic score) [7, 8, 27] or as a classification problem (aesthetically good or bad photograph) [2, 13, 14].

Intensive data-driven approaches have made substantial progress on this task, although it is very subjective and context dependent. Earlier approaches used custom-designed features based on photography rules (e.g., focus, color harmony, contrast, lighting, rule of thirds) and semantic information (e.g., human profile, scene category) computed from low-level image descriptors (e.g., color histograms, wavelet analysis) [1, 2, 4, 9, 12, 15, 16, 21, 24] and generic image descriptors [19]. With the evolution of deep learning techniques, recent approaches have introduced deep convolution neural networks (DCNN) into aesthetic assessment tasks [8, 10, 13, 14, 26].

Although these approaches give near-human performance in classifying whether a photograph is "good" or "bad", they do not give detailed insights or explanations for such claims. For example, if a photograph received a bad rating, one would not get any insight into the attributes (e.g., poor lighting, dull colors) that led to that rating. We propose an approach in which we identify eight such attributes (such as Color Harmony and Depth of Field) and report those along with the overall score. For this purpose, we propose a multi-task deep convolution network (DCNN) which simultaneously learns the eight aesthetic attributes along with the overall aesthetic score. We train and test our model on the recently released aesthetics and attribute database (AADB) [10]. The eight attributes, as defined in [10] (Figure 1), are:

(1) Balancing Element - Whether the image contains balanced elements.
(2) Content - Whether the image has good/interesting content.
(3) Color Harmony - Whether the overall color composition is harmonious.
(4) Depth of Field - Whether the image has shallow depth of field.
(5) Light - Whether the image has good/interesting lighting.
(6) Object Emphasis - Whether the image emphasizes foreground objects.
(7) Rule of Thirds - Whether the image follows the rule-of-thirds principle: the photo is divided into 9 parts by 2 vertical and 2 horizontal lines, and the important elements and leading lines are placed on or near these lines and their intersections (see the sketch after this list).
(8) Vivid Color - Whether the image has vivid colors, not necessarily harmonious colors.
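For illustration, the rule-of-thirds geometry in item (7) can be expressed in a few lines of Python. This is a minimal sketch; the function name, the tolerance parameter, and the point-based salience test are illustrative assumptions, not part of the method in this paper:

```python
def near_third_lines(x, y, width, height, tolerance=0.05):
    """Check whether a salient point (x, y) lies on or near one of the four
    rule-of-thirds lines (2 vertical, 2 horizontal) of a width x height frame.
    `tolerance` is the allowed distance as a fraction of the frame size."""
    near_vertical = any(abs(x - t) <= tolerance * width
                        for t in (width / 3, 2 * width / 3))
    near_horizontal = any(abs(y - t) <= tolerance * height
                          for t in (height / 3, 2 * height / 3))
    return near_vertical or near_horizontal


print(near_third_lines(150, 100, 300, 200))  # False: a dead-centered subject breaks the rule
print(near_third_lines(100, 67, 300, 200))   # True: on the left third line
```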
We also develop attribute activation maps (Figure 3) for the visualization of these attributes. These maps highlight the salient regions for the corresponding attribute, thus providing insights into how these attributes are represented in our trained model.

Figure 1: Sample images taken from AADB for each attribute. Top row: highest rated images; bottom row: lowest rated images. All images were padded to maintain aspect ratio for illustration purposes.

In summary, the main contributions of our paper are:

(1) We propose a novel deep learning based approach which simultaneously learns eight aesthetic attributes along with the overall score. These attributes enable us to provide more detailed feedback on automatic aesthetic assessment.
(2) We develop localized representations of these attributes from our learned model, which we call attribute activation maps (Figure 3). These maps provide more insights into the model's interpretation of the attributes.

Figure 2: Sample images from AADB testing data. First column: images rated high on aesthetic score; second column: images rated at mid-level; third column: images rated low.

2 RELATED WORK

Most of the earlier works have used low-level image features to design high-level aesthetic attributes as mid-level features, and trained an aesthetic classifier over these features. Datta et al. [2] proposed 56 visual features based on standard photography and visual design rules to encapsulate aesthetic attributes from low-level image features. Dhar et al. [4] divided aesthetic attributes into three categories: Compositional (e.g., depth of field, rule of thirds), Content (e.g., faces, animals, scene types), and Sky-Illumination (e.g., clear sky, sunset sky). They trained individual classifiers for these attributes from low-level features (e.g., color histograms, center-surround wavelets, Haar features) and used the outputs of these classifiers as input features for the aesthetic classifier.

Marchesotti et al. [18] proposed to learn aesthetic attributes from textual comments on the photographs using generic image features. Despite increased performance, many of these textual attributes ("good", "looks great", "nice try") do not map to well-defined visual characteristics. Lu et al. [13] proposed to learn several meaningful style attributes and used these to fine-tune the training of an aesthetics classification network. Kong et al. [10] proposed an attribute and content adaptive DCNN for aesthetic score prediction.

However, none of the previous works report the aesthetic attributes themselves; these attributes are used only as features to predict the overall aesthetic score/class. In this paper, we learn the aesthetic attributes along with the overall score, not just as intermediate features but as auxiliary information. Aesthetic assessment is relatively easier in images with evidently high or low aesthetics than in ordinary images with marginal aesthetics (Figure 2). For these images, attribute information would greatly supplement the quality of feedback from an automatic aesthetic assessment system.
Recently, deep learning techniques have shown significant performance gains in various computer vision tasks such as object classification and localization [11, 23, 25]. In deep learning, non-linear features are learned in a hierarchical fashion of increasing complexity (e.g., colors, edges, objects), and aesthetic attributes can be learned as combinations of these features. Deep learning techniques have shown significant performance gains over traditional machine learning approaches for aesthetic assessment tasks [8, 10, 13, 14, 26]. Unlike traditional machine learning techniques, in deep learning the features themselves are learned during training. However, these internal representations of DCNNs are still opaque. Various visualization techniques [5, 17, 22, 28-30] have been proposed to visualize the internal representations of DCNNs in an attempt to better understand their working, but these visualization techniques have not been applied to aesthetic assessment tasks. In this article, we apply the activation mapping technique proposed by Zhou et al. [30] to obtain attribute activation maps, which provide localized representations of the attributes. Additionally, we apply a similar gradient-based visualization technique [22] to the model provided by Kong et al. [10] to obtain comparable maps for a qualitative comparison of our results with the earlier approach.

3 METHOD

3.1 Architecture

We use the deep residual network (ResNet50) [6] to learn all the attributes along with the overall aesthetic score. ResNet50 has 50 layers which can be divided into 16 successive residual blocks. Each residual block contains 3 convolution layers, each followed by a batch normalization layer (Figure 3), and each residual block is followed by a rectified linear activation (ReLU) [20]. We take the rectified convolution maps from the ReLU output of all 16 residual blocks and pool the features from each block with a global average pooling (GAP) layer, which gives the spatial average of the rectified convolution maps. We then concatenate all the pooled features and feed them to a fully connected layer which produces the desired outputs (the aesthetic attributes and the overall score), as shown in Figure 3. We model attribute and score prediction as a regression problem with mean squared error as the loss function. Due to this simple connectivity structure, we can identify the importance of image regions by projecting the weights of the output layer onto the rectified convolution maps, a technique we call attribute activation mapping. This technique was first introduced by Zhou et al. [30] to obtain class activation maps for different semantic classes in an image classification task.
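The head described in Section 3.1 is compact enough to sketch directly. Below is a minimal PyTorch sketch of this architecture, not the authors' released implementation (linked in Section 3.3). It assumes torchvision's ResNet50, whose 16 bottleneck blocks each end with a ReLU after the skip connection, so hooking each block's output yields the rectified convolution maps; the maps are average-pooled to a common 10 × 10 size (see Section 3.3), globally averaged, concatenated, and fed to a single linear regression layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class MultiTaskAestheticNet(nn.Module):
    """Multi-task head sketch: GAP features from all 16 residual blocks of
    ResNet50 are concatenated and fed to one linear regression layer that
    predicts the eight attributes plus the overall aesthetic score."""

    def __init__(self, num_outputs=9):
        super().__init__()
        # ImageNet-pretrained weights would be loaded here, as in the paper.
        self.backbone = resnet50(weights=None)
        self.maps, self.pooled = [], []
        for stage in (self.backbone.layer1, self.backbone.layer2,
                      self.backbone.layer3, self.backbone.layer4):
            for block in stage:  # 3 + 4 + 6 + 3 = 16 residual blocks
                block.register_forward_hook(self._collect)
        # Pooled feature size: 3*256 + 4*512 + 6*1024 + 3*2048 = 15104.
        self.regressor = nn.Linear(15104, num_outputs)

    def _collect(self, module, inputs, output):
        # Each block's output is already rectified (ReLU follows the skip add).
        maps = F.adaptive_avg_pool2d(output, 10)    # pool every block to 10x10
        self.maps.append(maps)                      # kept for activation mapping
        self.pooled.append(maps.mean(dim=(2, 3)))   # global average pooling

    def forward(self, x):
        self.maps.clear()
        self.pooled.clear()
        self.backbone(x)  # hooks gather the features; classifier output unused
        return self.regressor(torch.cat(self.pooled, dim=1))


model = MultiTaskAestheticNet()
preds = model(torch.randn(2, 3, 299, 299))  # 299x299 inputs, as in the paper
loss = F.mse_loss(preds, torch.rand(2, 9))  # mean-squared-error regression
```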
3.2 Attribute Activation Mapping

For a given image, let $f_k(x, y)$ denote the activation of unit $k$ in the rectified convolution map at spatial location $(x, y)$. For unit $k$, the result of global average pooling is $F_k = \sum_{x,y} f_k(x, y)$. Thus, for a given attribute $a$, the input to the regression layer is $R_a = \sum_k w_k^a F_k$, where $w_k^a$ is the weight corresponding to attribute $a$ for unit $k$. Essentially, $w_k^a$ indicates the importance of $F_k$ for attribute $a$, as shown in Figure 3. The attribute activation map is then obtained by projecting these weights onto the rectified convolution maps: $M_a(x, y) = \sum_k w_k^a f_k(x, y)$.

Figure 3: Our approach for generating attribute activation maps. The predicted score for a given attribute (object emphasis in the figure) is mapped back to the rectified convolution layers to generate the attribute activation maps. These maps highlight the attribute-specific discriminative regions, as shown in the bottom section.

We also synthesized similar attribute maps from the model proposed by Kong et al. [10]. We did not have access to their final attribute and content adapted model due to patent rights, but Kong et al. shared the attribute-adapted model with us. That model is based on the AlexNet architecture [11], in which the outputs of the convolution layers are separated from the desired outputs by three stacked fully connected layers; the outputs of the last fully connected layer are the regression scores of the attributes. For this architecture, we compute the weight of convolution map $k$ for attribute $a$ as the summation of the gradients $g_k^a$ of the output with respect to the $k$-th convolution map: $w_k^a = \sum_{x,y} g_k^a(x, y)$. This technique was first introduced by Selvaraju et al. [22] to obtain class activation maps for different semantic classes and visual explanations (answers to questions).
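In code, the projection above is just a channel-weighted sum over the maps. A small NumPy sketch (the array shapes and function names are our assumptions):

```python
import numpy as np


def attribute_activation_map(feature_maps, weights, attribute):
    """Project regression weights onto the rectified convolution maps:
    M_a(x, y) = sum_k w_k^a * f_k(x, y).

    feature_maps : (K, H, W) array of maps f_k(x, y)
    weights      : (A, K) array of regression weights w_k^a
    attribute    : index a of the attribute of interest
    """
    amap = np.tensordot(weights[attribute], feature_maps, axes=1)  # (H, W)
    amap -= amap.min()
    if amap.max() > 0:
        amap /= amap.max()  # rescale to [0, 1] for display
    return amap


def gradient_weights(grads):
    """Grad-CAM-style weights used for Kong et al.'s AlexNet-based model:
    w_k^a = sum_{x,y} g_k^a(x, y), for gradients of shape (K, H, W)."""
    return grads.sum(axis=(1, 2))
```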
3.3 Implementation Details

Out of the 10000 samples present in the AADB dataset, we trained our model on 8500 training samples; 500 and 1000 images were set aside for validation and testing, respectively. As the number of training samples (8500) is not adequate for training such a deep network (23,715,852 parameters) from scratch, we used a ResNet50 pre-trained on the 1000-class ImageNet classification dataset [3] of approximately 1.2 million images. We fixed the input image size to 299 × 299 and used horizontal flips of the input images as a data augmentation technique. The last residual block gives convolution maps of size 10 × 10, so we reduce the convolution maps from the earlier residual blocks to the same size with appropriately sized average pooling. As ResNet50 has batch normalization layers, it is very sensitive to batch size; we fixed the batch size to 16 and trained for 16 epochs. We report our model's performance on the test set (1000 images) provided in AADB. We have made our implementation publicly available at https://github.com/gautamMalu/Aesthetic_attributes_maps.
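A possible torchvision rendering of this input pipeline; a sketch of the preprocessing described above, not the authors' exact code:

```python
from torchvision import transforms

# Warp to 299x299 (aspect ratio is not preserved; see Section 4) and
# augment training images with random horizontal flips, as described above.
train_transform = transforms.Compose([
    transforms.Resize((299, 299)),      # warps the image to a fixed size
    transforms.RandomHorizontalFlip(),  # horizontal-flip augmentation
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
])
```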
3.4 Dataset

As mentioned earlier, we used the aesthetics and attribute database (AADB) provided by Kong et al. [10]. AADB provides an overall rating for each photograph along with ratings on eleven aesthetic attributes [10] (Figure 1). Raters were asked to indicate the effect of each attribute on the overall aesthetic score. For example, if object emphasis contributes positively towards the overall aesthetics of a photograph, the rater gives a score of +1 for the attribute; if the object is not emphasized adequately and this contributes negatively towards the overall aesthetic score, the rater gives a score of -1 (see Figure 5). The raters also scored the overall aesthetics on a scale of 1 to 5, with 5 being the most aesthetically pleasing. Each image was rated by at least 5 persons, and the mean score was taken as the ground truth for all attributes and the overall score.

In short, if an attribute enhanced the image quality, it was rated positively, and if it degraded the image aesthetics, it was rated negatively; the default zero (null) means the attribute does not affect the image aesthetics. For example, positive vivid color means the vividness of the colors in an image has a positive effect on the image aesthetics, while negative vivid color means the image has a dull color composition. All attributes except Repetition and Symmetry are normalized to the range [-1, 1]; Repetition and Symmetry are normalized to [0, 1], as negative values are not justifiable for these two attributes. The overall score is normalized to [0, 1]. Out of these eleven attributes, we omit the Symmetry, Repetition, and Motion Blur attributes from our experiment, as most of the images are rated null for these attributes (Figure 4). We model the other eight attributes, along with the overall aesthetic score, as a regression problem.

Figure 4: The distribution of all eleven attributes in the training data of AADB. Most of the images are rated neutral for motion blur, repetition, and symmetry.

Figure 5: Interface of the data collection adopted by Kong et al. [10].

4 RESULTS & DISCUSSION

To evaluate the aesthetic attribute scores predicted by our model, we report Spearman's rank correlation coefficient (ρ) between the estimated aesthetic attribute score and the corresponding ground truth score on the testing data. The rank correlation coefficient evaluates the monotonic relationship between estimated and ground truth scores, hence there is no need for explicit calibration between them. The coefficient lies in the range [-1, 1], with greater values corresponding to higher correlation. For a baseline comparison, we also train a model by fine-tuning a pre-trained ResNet50, labeled ResNet50-FT; fine-tuning here refers to replacing the last layer of the pre-trained ResNet50 [6] and training it for our aesthetic attribute prediction task. Table 1 lists the performance on AADB of the two approaches. We also report the performance of the model shared by Kong et al. [10].

Table 1: Spearman's rank correlations for all the attributes. All correlation coefficients (ρ) are significant at p < 0.0001. The coefficients marked with a * are the best results for the respective attributes.

Attribute                 ResNet50-FT   Kong et al. [10]   Our method
Balancing Elements        0.184         0.220*             0.186
Content                   0.572         0.508              0.584*
Color Harmony             0.452         0.471              0.475*
Depth of Field            0.450         0.479              0.495*
Light                     0.379         0.443*             0.399
Object Emphasis           0.658         0.602              0.666*
Rule of Thirds            0.175         0.225*             0.178
Vivid Colors              0.661         0.648              0.681*
Overall Aesthetic Score   0.665         0.654 / 0.678      0.689*

It should be noted that the Spearman coefficient between the estimated overall aesthetic score and the corresponding ground truth reported by Kong et al. [10] for their final content and attribute adaptive model was 0.678; the 0.654 in Table 1 is the performance of the model they shared with us. They did not report any metrics for the other aesthetic attributes. They used a ranking loss along with mean squared error as loss functions, and their final approach was also content adaptive. As can be seen from the results reported in Table 1, our model manages to outperform their approach on the overall aesthetic score, in spite of being trained only with mean squared error and without any content adaptive framework. Our model significantly underperforms for the Rule of Thirds and Balancing Elements attributes. These are location-sensitive attributes: rule of thirds deals with the positioning of the salient elements, and balancing elements deals with the relative positioning of objects with respect to each other and the frame. In our model, due to the use of global average pooling (GAP) layers after the activation layers, we lose location specificity. We selected the GAP layer to reduce the number of parameters, since the number of training samples (8500) allows learning of only a small parameter space. We also warp the input images to the fixed input size (299 × 299), thus destroying the aspect ratio. These could be possible reasons for the under-performance of the model on these compositional and location-sensitive attributes. Across all the attributes, our proposed method reports better results than the fine-tuned ResNet50 model, and our model performs better than the model provided by Kong et al. [10] for five out of eight attributes.

Aesthetic judgments are subjective in nature. To quantify this subjectivity, recall that in AADB the ground-truth score is the mean of the ratings given by different individuals; the agreement between ratings is measured as the ρ between each individual's ratings and the ground-truth scores, averaged across raters, and reported in Table 2. Our model actually outperforms this average human consistency (as measured by ρ). However, when considering only the "power raters" who annotated more images, human evaluators consistently outperform our model's results.

Table 2: Human performance on AADB. Our model (last row) outperforms the average human consistency across all raters (first row), as measured by ρ. However, when considering only the "power raters" who annotated more images, humans consistently outperform our model (second and third rows).

Number of Images Rated   Number of Raters   ρ
> 0                      195                0.6738
> 100                    65                 0.7013
> 200                    42                 0.7112
Our Approach             -                  0.689
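The evaluation metric above is straightforward to reproduce with SciPy. In this sketch the score arrays are random stand-ins for per-image predictions and ground truth:

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative stand-ins for per-image scores over the 1000 test images.
rng = np.random.default_rng(0)
ground_truth = rng.random(1000)
predicted = ground_truth + 0.3 * rng.standard_normal(1000)  # noisy estimates

# rho measures monotonic agreement, so no calibration of score scales is needed.
rho, p_value = spearmanr(predicted, ground_truth)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.2g})")
```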
5 VISUALIZATION

As mentioned above, we generate attribute activation maps for the different attributes to obtain their localized representations. Here we omit the balancing element and rule of thirds attributes, as our model's performance is very low for them (Table 1). For each attribute, we analyzed the activation maps and present the insights in this section. For illustration purposes, we selected ten samples for each attribute: the first five are the highest rated by our model, and the next five are the lowest rated. These samples are drawn from the 1000 test samples, not from the training samples. We also include the activation maps from the model given by Kong et al. [10] (Kong's model). These activation maps highlight the most important regions for the given attribute; we refer to them as the "gaze" of the model.
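The heat overlays in Figures 6-11 can be reproduced by upsampling a map to the image size and blending it over the photograph. A plausible Pillow/Matplotlib sketch, not the authors' plotting code:

```python
import numpy as np
from PIL import Image
from matplotlib import cm


def overlay_activation(image, amap, alpha=0.5):
    """Blend a [0, 1] activation map over an RGB PIL image as a 'jet'
    heatmap, roughly how the figures in this section are rendered."""
    heat = Image.fromarray((cm.jet(amap)[..., :3] * 255).astype(np.uint8))
    heat = heat.resize(image.size, Image.BILINEAR)
    return Image.blend(image.convert("RGB"), heat, alpha)


# Stand-ins: a gray test image and a random 10x10 map from the model.
img = Image.new("RGB", (299, 299), "gray")
amap = np.random.default_rng(0).random((10, 10))
overlay_activation(img, amap).save("activation_overlay.png")
```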
5.1 Object Emphasis

From a qualitative analysis of the object emphasis activation maps, we observed that the model gazes at the main object of the image. Even when the model predicts a negative rating, i.e., the object is not emphasized, the model searches for regions which contain objects (Figure 6). In comparison, the activation maps from Kong's model are not always consistent, as can be seen in the second row of activation maps in Figure 6. This suggests that our model has learned object emphasis as an attribute that is indeed related to objects.

Figure 6: Object Emphasis activation maps. First row: original images (marked with the ground truth score at the bottom right); second row: activation maps from the Kong et al. [10] model (marked with that model's predicted score); third row: activation maps from our method (marked with our predicted score). The color bar indicates the color encoding of the activation.

5.2 Content

Interestingness of content is significantly subjective and context dependent. However, if a model is trained on this attribute, one would expect it to have maximum activation at the content of the image while making this judgment. If there is a well-defined object in an image, then that object is considered the content of the image, e.g., the 2nd and 3rd columns of Figure 7; it can be observed in these columns that our proposed approach is better at identifying the content than Kong's model. Without explicit objects, the content of the image is difficult to localize, e.g., the 1st and 5th columns of Figure 7. As shown in Figure 7, our model's activation maps are maximally active at the content of the image; in comparison, the activation maps from Kong's model are not consistent.

Figure 7: Content activation maps. Rows and markings as in Figure 6.

5.3 Depth of Field

On analyzing the representations of shallow depth of field, we observed that the model looks for blurry regions near the main object of the image while making the judgment, as showcased in Figure 8. The shallow depth of field technique is used to make the subject of the photograph stand out from its background, and the model's interpretation points in that direction. For images on which the model predicts a negative score for this attribute, the activation maps are random. Activation maps from Kong's model show similar behavior, though they are more active at the corners of the images.

Figure 8: Depth of Field activation maps. Rows and markings as in Figure 6.

5.4 Vivid Color

Vivid color means the presence of bright and bold colors, and the model's interpretation of this attribute seems to be along these lines. As shown in Figure 9, the model gazes at vividly colored areas while judging this attribute. For example, in the 2nd column of Figure 9 the pink of the flowers and scarf, and in the 3rd column the butterfly and flower, were the most activated regions. We could not find any pattern in the activation maps from Kong's model.

Figure 9: Vivid Color activation maps. Rows and markings as in Figure 6.

5.5 Light

Good lighting is quite a challenging concept to grasp: it does not merely depend on the light in the photograph, but rather on how that light complements the whole composition. As shown in Figure 10, most of the time the model seems to look at bright light, or the source of the light, in the photograph. Although the model's behavior is consistent, its understanding of this attribute is incomplete. This is also evident in the low correlation of our proposed model for this attribute, as reported in Table 1.

Figure 10: Light activation maps. Rows and markings as in Figure 6.

5.6 Color Harmony

Although the model's performance is significant for this attribute, we could not find any consistent pattern in its activation maps. As color harmony is of many types, e.g., analogous, complementary, triadic, it is difficult to get a single representative pattern. For example, in the first example shown in Figure 11, the green of the hills is in analogous harmony with the blue of the water and sky; in the 3rd example, the brown sand is in split-complementary harmony with blue and green. The attribute activation maps for Color Harmony are shown in Figure 11.

Figure 11: Color Harmony activation maps. Rows and markings as in Figure 6.
6 CONCLUSION

In this paper, we have proposed a deep convolution neural network (DCNN) architecture to learn aesthetic attributes. Results show that the estimated scores of five aesthetic attributes (interestingness of content, object emphasis, shallow depth of field, vivid color, and color harmony) correlate significantly with their respective ground truth scores, whereas for attributes such as balancing elements, light, and rule of thirds, the correlation is inferior. The activation maps corresponding to learned attributes such as object emphasis, content, depth of field, and vivid color indicate that the model has acquired internal representations suitable for highlighting these attributes automatically. However, for color harmony and light, the visualization maps were not consistent.

Aesthetic judgment involves a degree of subjectivity: for example, in AADB the average correlation between the mean score and an individual's score for the overall aesthetic rating is 0.67 (Table 2). Moreover, as reported by Kong et al. [10], a model learned on a particular dataset might not work on a different dataset. Considering all these factors, the empirical validity of aesthetic judgment models is still a challenge. We suggest that the visualization techniques presented in the current work are a step forward in that direction. Empirical validation could proceed by asking subjects to annotate the images (identifying the regions that correspond to different aesthetic attributes), and these empirical maps could in turn be compared with the maps predicted by the model. Such experiments need to be conducted in the future to validate the current approach.

ACKNOWLEDGMENTS

The authors would like to thank Ms. Shruti Naik and Mr. Yashaswi Verma of the International Institute of Information Technology, Hyderabad, India for their help in manuscript preparation.

REFERENCES

[1] Subhabrata Bhattacharya, Rahul Sukthankar, and Mubarak Shah. 2010. A framework for photo-quality assessment and enhancement based on visual aesthetics. In Proceedings of the 18th ACM International Conference on Multimedia. ACM, 271–280.
[2] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. 2006. Studying aesthetics in photographic images using a computational approach. In European Conference on Computer Vision. Springer, 288–301.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009). IEEE, 248–255.
[4] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. 2011. High level describable attributes for predicting aesthetics and interestingness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011). IEEE, 1657–1664.
[5] Alexey Dosovitskiy and Thomas Brox. 2015. Inverting convolutional networks with convolutional networks. CoRR abs/1506.02753.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Yueying Kao, Kaiqi Huang, and Steve Maybank. 2016. Hierarchical aesthetic quality assessment using deep convolutional neural networks. Signal Processing: Image Communication 47, 500–510.
[8] Yueying Kao, Chong Wang, and Kaiqi Huang. 2015. Visual aesthetic quality assessment with a regression model.
In IEEE International Conference on Image Processing (ICIP 2015). IEEE, 1583–1587.
[9] Yan Ke, Xiaoou Tang, and Feng Jing. 2006. The design of high-level features for photo quality assessment. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), Vol. 1. IEEE, 419–426.
[10] Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. 2016. Photo aesthetics ranking network with attributes and content adaptation. In European Conference on Computer Vision. Springer, 662–679.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[12] Li-Yun Lo and Ju-Chin Chen. 2012. A statistic approach for photo quality assessment. In International Conference on Information Security and Intelligence Control (ISIC 2012). IEEE, 107–110.
[13] Xin Lu, Zhe Lin, Hailin Jin, Jianchao Yang, and James Z. Wang. 2014. RAPID: Rating pictorial aesthetics using deep learning. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 457–466.
[14] Xin Lu, Zhe Lin, Xiaohui Shen, Radomir Mech, and James Z. Wang. 2015. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In Proceedings of the IEEE International Conference on Computer Vision. 990–998.
[15] Wei Luo, Xiaogang Wang, and Xiaoou Tang. 2011. Content-based photo quality assessment. In IEEE International Conference on Computer Vision (ICCV 2011). IEEE, 2206–2213.
[16] Yiwen Luo and Xiaoou Tang. 2008. Photo and video quality evaluation: Focusing on the subject. In European Conference on Computer Vision. Springer, 386–399.
[17] Aravindh Mahendran and Andrea Vedaldi. 2015. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5188–5196.
[18] Luca Marchesotti, Naila Murray, and Florent Perronnin. 2015. Discovering beautiful attributes for aesthetic image analysis. International Journal of Computer Vision 113, 3, 246–266.
[19] Luca Marchesotti, Florent Perronnin, Diane Larlus, and Gabriela Csurka. 2011. Assessing the aesthetic quality of photographs using generic image descriptors. In IEEE International Conference on Computer Vision (ICCV 2011). IEEE, 1784–1791.
[20] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[21] Masashi Nishiyama, Takahiro Okabe, Imari Sato, and Yoichi Sato. 2011. Aesthetic quality classification of photographs based on color harmony. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011). IEEE, 33–40.
[22] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. 2016. Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391.
[23] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
[24] Xiaoshuai Sun, Hongxun Yao, Rongrong Ji, and Shaohui Liu. 2009. Photo assessment based on computational visual attention model. In Proceedings of the 17th ACM International Conference on Multimedia. ACM, 541–544.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[26] Xinmei Tian, Zhe Dong, Kuiyuan Yang, and Tao Mei. 2015. Query-dependent aesthetic model with deep learning for photo quality assessment. IEEE Transactions on Multimedia 17, 11, 2035–2048.
[27] Ou Wu, Weiming Hu, and Jun Gao. 2011. Learning to predict the perceived visual quality of photos. In IEEE International Conference on Computer Vision (ICCV 2011). IEEE, 225–232.
[28] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.
[29] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856.
[30] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2921–2929.