<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Lightweight Auto-Crop Based on Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kunxiang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shaoqiang Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junsong Zhang</string-name>
          <email>zhangjs@xmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Central China Normal University</institution>
          ,
          <addr-line>Wuhan, Hubei</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We study the problem of image cropping, which aims to improve the aesthetic quality of images by gradually cutting from the edges of the image and re-composing it. Correct composition is the key to high-quality images. Most previous cropping approaches generate a great number of candidate boxes from the input image and select the most pleasing one as the final cropped image, which is time-consuming and risks the best cropping box not being among the candidates. To address these issues, we propose a real-time and lightweight framework based on a deep reinforcement learning algorithm, namely advantage actor-critic (A2C), to achieve fast and automatic cropping. Specifically, the sequential cropping actions are learned automatically through a policy network that contains a MobileNetV2 model, and the average intersection-over-union (IOU) value is designed as part of the learning reward. The model is trained by synchronous policy gradient, and we show that parallel actor-learners learn image cropping efficiently. We evaluate on the Flickr Cropping Dataset (FCD), and the experimental results show that our method reaches state-of-the-art performance with fewer cropping steps and less time than previous automatic cropping tools.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Most previous approaches generate a large number of candidate boxes from the input image
using the sliding window method and then select the best one with an aesthetic evaluation
model. However, this process is time-consuming (thousands of candidate images must be
filtered), and there is a risk that the best cropping box is not among the candidates. Some methods
regard the image cropping process as a Markov decision process [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], simulating
the way a human crops an image [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. According to the global and local characteristics of the input
picture, the model generates the corresponding cropping action, and the image is gradually
cropped inward from the four edges until the model outputs the termination action or a limit
is exceeded (such as cropping up to 20 times). But aesthetic quality assessment, i.e., quantifying
image aesthetics, is a long-standing problem in computer vision, and such scores lack the
robustness required of a reward function [
        <xref ref-type="bibr" rid="ref1 ref26">26, 1</xref>
        ]. Based on the above discussion, in this paper we propose a lightweight
image cropping method, named LA2C, based on the deep reinforcement learning algorithm
Advantage Actor-Critic (A2C) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We regard image cropping as a Markov decision
process [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and show the sequential cropping process in Figure 1. Our main contributions are
summarized as follows:
1. Based on deep reinforcement learning, we propose a lightweight Auto-Crop method that
crops images automatically, quickly, and correctly.
2. We abandon the use of an aesthetic score, which is difficult to quantify accurately, as the
reward; instead, we use the IOU value as part of the reward function.
3. We use the pre-trained MobileNetV2 [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] model in place of plain convolution layers for feature
extraction, which improves the extracted image features and accelerates training.
4. We simplify the action space to basic cropping actions and one termination action.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Image cropping aims at improving the aesthetic quality of a photographic or illustrated
image by removing unwanted outer areas. Most previous cropping methods rely on
aesthetic quality assessment. We summarize representative works in image cropping [
        <xref ref-type="bibr" rid="ref16 ref28 ref29">29, 16, 28</xref>
        ].
      </p>
      <p>
        Recently, deep reinforcement learning has shown promising success in automatic image
cropping. The work in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] shows that extracting high-level features with CNNs and learning to crop photos
with an asynchronous advantage actor-critic algorithm can yield state-of-the-art
cropping performance. It formulates automatic image cropping as a sequential decision-making
process and proposes a novel Aesthetics Aware Reinforcement Learning (A2-RL) model for the
weakly supervised cropping problem. The model is based on the asynchronous advantage
actor-critic (A3C) algorithm: CNN layers extract high-level features from 227×227 input images,
an LSTM layer records the observation history, and FC layers output the action. Aesthetic
scores are then calculated as part of the reward function. But the key to this model is finding
an appropriate metric that assigns a precise aesthetic score to a photo, and
traditional image evaluation metrics may not work well in this situation [
        <xref ref-type="bibr" rid="ref1 ref15 ref19 ref26 ref31 ref7">31, 26, 1, 7, 15, 19</xref>
        ].
Most previous methods for automatic image cropping include attention-based [
        <xref ref-type="bibr" rid="ref21 ref25 ref3 ref3">3, 3, 21, 25</xref>
        ]
and aesthetics-based methods [
        <xref ref-type="bibr" rid="ref10 ref22">10, 22</xref>
        ]. Recently, deep learning cropping frameworks have combined
attention and aesthetics components [
        <xref ref-type="bibr" rid="ref17 ref27 ref28 ref29">29, 28, 17, 27</xref>
        ], which, different from deep reinforcement
learning, formulate photo cropping as a determining-adjusting process. An attention model
predicts the locations of the most visually salient regions and generates 1,296 cropping
candidates in total by sliding a window over the human attention map. An aesthetics-aware part
then selects the candidate with the highest aesthetic score as the final cropping. But selecting
the highest aesthetic value from 1,296 candidates means each image must be evaluated 1,296
times by the aesthetics model. Besides, a pleasing cropping window may not be among the
candidates generated from the visual saliency map; a sketch of this candidate-generation cost
follows below.
      </p>
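      <p>As a rough illustration of that cost, the following Python sketch enumerates sliding-window candidates over a regular grid. The scales and stride are illustrative assumptions, not the grids used by the cited works; the point is that every enumerated box must be scored by the aesthetics model.</p>
      <preformat>
# Illustrative sliding-window candidate generation (assumed parameters).
def sliding_window_candidates(img_w, img_h, scales, stride):
    """Enumerate crop boxes at several scales over a regular grid."""
    boxes = []
    for s in scales:
        w, h = int(img_w * s), int(img_h * s)
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                boxes.append((x, y, x + w, y + h))
    return boxes

# Each candidate requires one pass through the aesthetics model.
boxes = sliding_window_candidates(640, 480, scales=(0.5, 0.6, 0.7, 0.8, 0.9), stride=32)
print(len(boxes))
      </preformat>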
      <p>
        Early methods [
        <xref ref-type="bibr" rid="ref14 ref18 ref6 ref8">6, 8, 14, 18</xref>
        ] design handcrafted features based on aesthetic knowledge.
However, due to the subjectivity and diversity in measuring image aesthetic
quality, it is difficult to determine the type and number of reliable features. Deep learning
performs better on aesthetic assessment and image cropping [
        <xref ref-type="bibr" rid="ref1 ref16 ref26 ref29">26, 1, 29, 16</xref>
        ].
      </p>
      <p>
        Deep reinforcement learning has been widely used in image captioning [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], image editing [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ],
object detection [
        <xref ref-type="bibr" rid="ref13 ref2">2, 13</xref>
        ], etc. Photo cropping based on deep reinforcement learning has been found
to deliver state-of-the-art performance. We propose a novel system that achieves automatic
cropping of images within a DRL framework, performing better and faster.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>
        We formulate automatic image cropping as a sequential decision-making process and an
agent-environment interaction problem, i.e., a Markov Decision Process (MDP) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We
propose a novel automatic cropping method based on the advantage actor-critic (A2C) algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
Figure 2 shows the overall framework and process. The agent contains a policy network, which
generates a series of cropping actions based on the current input image by sampling the
corresponding actions from the action space. The sampled action then interacts with the
environment, and the image is cropped inward from its four edges. After the cropping action is
executed at each step, the rollout storage stores the rewards returned by the environment for
subsequent loss calculations, and the goal of the agent is to maximize the reward after each
cropping; this loop is sketched below. Next, this simple and lightweight framework is described
in detail in three parts: the environment, the agent, and the training process.
      </p>
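      <p>A minimal sketch of the interaction loop follows; env and policy are hypothetical stand-ins for the environment and policy network defined in the next subsections, with env.step(action) returning (observation, reward, done) and policy(obs) returning action probabilities and a state-value estimate.</p>
      <preformat>
import random

def sample_action(probs):
    """Sample an action index from a list of action probabilities."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

def collect_rollout(env, policy, max_steps=20):
    """Run one cropping episode and fill the rollout storage of Figure 2."""
    storage = []
    obs = env.reset()                      # initial observation O_0
    for _ in range(max_steps):             # crop at most max_steps times
        probs, value = policy(obs)         # policy network forward pass
        action = sample_action(probs)
        obs, reward, done = env.step(action)
        storage.append((obs, action, reward, value))
        if done:                           # the termination action was chosen
            break
    return storage
      </preformat>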
      <sec id="sec-3-1">
        <title>Environment</title>
        <p>The role of the environment is as follows:
1. Provide the current observation for the agent, starting from O_0. Each completed cropping
action changes the original image I_0 into a new cropped image I_t, which replaces the last
observation (I_t → O_t). The advantage of using only the cropped local image as the observation,
instead of combining global and local features, is that it reduces duplicated pixel regions and
features and avoids wasting compute resources. The input image is resized to (224, 224) before
entering the policy network.</p>
        <p>2. Give a reward for each performed cropping action. Different cropping actions directly
affect the next observation, and the reward for the corresponding action is given by the
environment, mimicking the design of the Atari game environments. This is completely different
from the reward design of previous deep reinforcement learning cropping tools, which use
aesthetic quality assessment scores as rewards; accurately quantifying the aesthetic quality of a
picture, however, is a long-standing problem in computer vision. We propose to use the IOU
value as the reward instead of an aesthetic quality score, because the IOU value correctly
reflects the cropping quality; as a result, the agent learns faster and more effectively.</p>
        <p>3. Maintain the action space and perform cropping actions. There are 9 actions in the
action space: 4 expansion actions, 4 zoom-out actions, and 1 termination action. Each action
crops with a stride of 1/30 of the image height or width; this 1/30 stride reaches the target box
more accurately than a larger stride would. The termination action means that the model
learns to decide when to terminate the cropping and output the final cropped image. The
cropping size is therefore, in theory, arbitrary. A minimal environment sketch is given after
this list.</p>
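        <p>The following is a minimal sketch of such an environment. The mapping of the first eight action indices to edge moves is our assumption, and the observation is simplified to the crop box rather than the cropped image.</p>
        <preformat>
def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

class CropEnv:
    def __init__(self, img_w, img_h, gt_box):
        self.w, self.h = float(img_w), float(img_h)
        self.gt = gt_box                        # ground-truth box (l, t, r, b)
        self.box = [0.0, 0.0, self.w, self.h]   # start from the full image
        self.prev_iou = iou(self.box, self.gt)

    def step(self, action):
        if action == 8:                         # termination action
            return self.box, 0.0, True
        dx, dy = self.w / 30.0, self.h / 30.0   # the 1/30 cropping stride
        moves = [(-dx, 0, 0, 0), (0, -dy, 0, 0), (0, 0, dx, 0), (0, 0, 0, dy),  # 4 expansions
                 (dx, 0, 0, 0), (0, dy, 0, 0), (0, 0, -dx, 0), (0, 0, 0, -dy)]  # 4 zoom-outs
        l, t, r, b = self.box
        ml, mt, mr, mb = moves[action]
        self.box = [max(0.0, l + ml), max(0.0, t + mt),
                    min(self.w, r + mr), min(self.h, b + mb)]
        cur = iou(self.box, self.gt)
        # Formula (1): reward +iou if the IOU increased, -iou if it decreased.
        if cur > self.prev_iou:
            reward = cur
        elif self.prev_iou > cur:
            reward = -cur
        else:
            reward = 0.0
        self.prev_iou = cur
        return self.box, reward, False
        </preformat>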
        <p>In addition, the environments (envs) in the advantage actor-critic algorithm operate in
parallel. The number of envs in this article is 16; these envs run independently of each other
and interact with the same agent. After running a certain number of steps, our method performs
a synchronous update across the network, as sketched below.</p>
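        <p>A sketch of this synchronous scheme, with make_env and agent as hypothetical stand-ins (a real implementation batches the 16 observations through the policy network in one forward pass):</p>
        <preformat>
def run_parallel(make_env, agent, num_envs=16, sync_steps=5):
    """Collect sync_steps transitions from num_envs independent environments
    interacting with one shared agent; the batch then drives one synchronous
    gradient update. The value of sync_steps is an assumed hyperparameter."""
    envs = [make_env(i) for i in range(num_envs)]
    observations = [env.reset() for env in envs]
    batch = []
    for _ in range(sync_steps):
        actions = agent.act(observations)        # one action per environment
        results = [env.step(a) for env, a in zip(envs, actions)]
        batch.append(results)
        observations = [obs for obs, reward, done in results]
    return batch
        </preformat>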
      </sec>
      <sec id="sec-3-2">
        <title>The Agent</title>
        <p>The agent is the core part of the automatic cropping framework. In a nutshell, at every
step the agent outputs an action according to the current observation and passes the action
to the envs, which crop the current image by selecting the corresponding cropping method
from the action space. Below, we expand the description of the policy network, the loss
function, and the implementation details.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Policy network</title>
        <p>
          The policy network consists of a pre-trained MobileNetV2 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] and two fully connected layers.
MobileNetV2 is a lightweight, efficient CNN model designed primarily for mobile vision
applications. It uses depthwise separable convolutions as an efficient building block and
introduces two new architectural features: 1) linear bottleneck layers between the layers,
and 2) shortcut connections between the bottleneck layers.
        </p>
        <p>
          Drawing on the idea of transfer learning, using a CNN model pre-trained on ImageNet [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as the
feature extraction module can effectively reduce training time and improve the training effect;
the comparison results are shown in the experimental results. First, the current observation
is fed into the MobileNetV2 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] feature extraction model with its last layer removed,
yielding the current feature map. The output is then passed in parallel to an FC layer with
9 nodes and an FC layer with 1 node. The former outputs 9 action values,
output = [P(0), P(1), ..., P(8)], where P(t) indicates the probability that the action is t;
the latter outputs the state value V(s_t), which evaluates the expected reward of the current
observation. A sketch of this network is given below.
        </p>
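        <p>A PyTorch sketch of this network, assuming the standard torchvision MobileNetV2 (1280-dimensional final features; the pooling choice and weight identifier follow torchvision, not the paper):</p>
        <preformat>
import torch
import torch.nn as nn
import torchvision.models as models

class PolicyNet(nn.Module):
    """Pre-trained MobileNetV2 backbone with two parallel FC heads:
    9 action probabilities and 1 state value."""
    def __init__(self, num_actions=9):
        super().__init__()
        backbone = models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.features = backbone.features           # conv layers, classifier removed
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.actor = nn.Linear(1280, num_actions)    # FC with 9 nodes
        self.critic = nn.Linear(1280, 1)             # FC with 1 node

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)   # (N, 1280) feature vector
        probs = torch.softmax(self.actor(f), dim=1)  # [P(0), ..., P(8)]
        value = self.critic(f).squeeze(1)            # V(s_t)
        return probs, value

# The observation is resized to (224, 224) before entering the network:
net = PolicyNet()
action_probs, state_value = net(torch.randn(1, 3, 224, 224))
        </preformat>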
        <sec id="sec-3-3-1">
          <title>The loss function</title>
        <p>
          In order to get the best cropping effect, we abandon the approach of many previous
methods that use an aesthetic score as part of the reward function. Quantifying image
aesthetic quality has long been a difficult problem in computer vision; at present, even the
advanced quantitative model NIMA [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] cannot yet accurately assign an aesthetic score to every
image. Therefore, for stability and accuracy, we propose to use the average
Intersection-over-Union (IOU) value to evaluate the cropped image at each step, and the IOU value is naturally
used as the reward. The IOU value is the usual criterion for measuring the accuracy of cropping,
and the specific calculation method is explained in the implementation details. When
the agent outputs an action a(s_t) based on the current observation, the environment executes
the action, obtains a cropped image, and calculates the corresponding IOU value as the reward
R_t. That is, each time a crop increases the IOU value, the agent receives a reward and,
conversely, a penalty; when the termination action is output or the allowed number of cropping
steps is exceeded, there is no reward. At this point, the reward for one step of the image
cropping process can be designed as Formula (1):
        </p>
        <disp-formula id="eq1">
          <tex-math><![CDATA[
r_t = \begin{cases} +\,\mathrm{iou}_t, & \Delta\mathrm{iou} > 0 \\ -\,\mathrm{iou}_t, & \Delta\mathrm{iou} < 0 \\ 0, & \text{otherwise} \end{cases} \qquad (1)
]]></tex-math>
        </disp-formula>
        <p>The loss function is designed as Formulas (2)-(5):</p>
        <disp-formula id="eq2">
          <tex-math><![CDATA[
loss = loss_{action} + loss_{value} \qquad (2)
]]></tex-math>
        </disp-formula>
        <disp-formula id="eq3">
          <tex-math><![CDATA[
loss_{action} = -\log p(a_t \mid s_t; \theta)\,\bigl(R_t - V(s_t; \theta_v)\bigr) + loss_{dist} \qquad (3)
]]></tex-math>
        </disp-formula>
        <disp-formula id="eq4">
          <tex-math><![CDATA[
loss_{value} = \frac{1}{t}\sum_{i=1}^{t} \bigl(R_i - V(s_i; \theta_v)\bigr)^2 \qquad (4)
]]></tex-math>
        </disp-formula>
        <disp-formula id="eq5">
          <tex-math><![CDATA[
loss_{dist} = -H(p(s_t; \theta)) \qquad (5)
]]></tex-math>
        </disp-formula>
        <p>where θ denotes the parameters of the policy (actor) output, θ_v denotes the parameters of
the value (critic) output, R_t is the return, and H is the entropy of the action distribution,
which encourages exploration. A sketch of this loss follows below. The evaluation metrics BDE
(Formula 6) and the average cropping step (Formula 7) are defined in Section 4.3.</p>
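        <p>A sketch of Formulas (2)-(5) for one episode of t steps, following our reconstruction above (the sign conventions and the absence of an entropy weight are assumptions):</p>
        <preformat>
import torch

def a2c_loss(probs, actions, returns, values):
    """probs:   (t, 9) action probabilities p(.|s_i; theta)
    actions: (t,) int64 sampled action indices a_i
    returns: (t,) returns R_i computed from the IOU rewards
    values:  (t,) state values V(s_i; theta_v)"""
    log_p = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    advantage = returns - values                        # R_i - V(s_i; theta_v)
    entropy = -(probs * torch.log(probs)).sum(dim=1)    # H(p(s_i; theta))
    loss_dist = -entropy.mean()                         # Formula (5)
    # Formula (3); the advantage is detached so the policy-gradient term
    # does not backpropagate into the critic head.
    loss_action = -(log_p * advantage.detach()).mean() + loss_dist
    loss_value = advantage.pow(2).mean()                # Formula (4)
    return loss_action + loss_value                     # Formula (2)
        </preformat>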
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental results</title>
      <p>
        We first present the cropping process and the test databases and then exhibit the results
of this framework on the test sets, using the same evaluation indicators as
previous work [
        <xref ref-type="bibr" rid="ref16 ref29 ref5">29, 16, 5</xref>
        ], namely the average IOU value and the average boundary displacement,
in addition to two new metrics: the average number of cropping steps per image and the
cropping time.
      </p>
      <sec id="sec-4-1">
        <title>CUHK-ICD</title>
        <p>
          The CUHK-ICD [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] test set contains 150 images, each given a cropping window
by each of 3 photographers. The original images are collected from the Chinese University of
Hong Kong's image cropping database [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. When we obtain the final cropped image, we calculate
the IOU value and the BDE value against each of the 3 ground-truth boxes and record the
statistics.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Flickr cropping dataset</title>
        <p>
          The FCD [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] test set contains 374 images, each with a manually annotated box;
the resulting cropping window is compared against this box to obtain the IOU value. Figure 3
shows the Deep-crop cropping results.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Evaluation metrics and results</title>
        <p>
          To assess the capabilities of our method (LA2C), we test the IOU value, BDE value, cropping
steps, cropping time, and other metrics on the CUHK-ICD [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and FCD [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] test sets, as shown
in Table 1 and Table 2.
        </p>
        <p>The Boundary Displacement Error (BDE) is defined as the average displacement of the four
edges between the cropping box and the ground-truth rectangle:</p>
        <disp-formula id="eq6">
          <tex-math><![CDATA[
BDE = \frac{1}{4}\sum_{i} \lVert B_i^{g} - B_i^{c} \rVert \qquad (6)
]]></tex-math>
        </disp-formula>
        <p>where i ∈ {left, right, bottom, up} and B_i^g, B_i^c denote the i-th edge of the ground-truth
window and of the cropping window, respectively. The lower the BDE value, the better the
cropping effect.</p>
        <p>The cropping step is defined as the cumulative number of cropping actions for each image
from the start of cropping to the end. The cropping step reflects the cropping efficiency of
the model and whether an optimal cropping sequence can be found; the fewer the cropping
steps, the higher the cropping efficiency. The average cropping step is computed as
Formula (7):</p>
        <disp-formula id="eq7">
          <tex-math><![CDATA[
Avg_{step} = \frac{1}{n}\sum_{i=1}^{n} step\_num_i \qquad (7)
]]></tex-math>
        </disp-formula>
        <p>where step_num_i represents the number of cropping steps of the i-th image and n represents
the number of test images.</p>
        <p>Crop time is defined as the time it takes for each image to be cropped from start to finish,
reflecting the cropping speed of the model. The shorter the cropping time, the faster the
cropping speed. A sketch of these metrics follows below.</p>
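        <p>A sketch of the metrics, assuming boxes are (left, top, right, bottom) coordinates normalized by the image size and reusing the iou() helper from the environment sketch above:</p>
        <preformat>
import time

def bde(pred_box, gt_box):
    """Boundary Displacement Error, Formula (6): the mean displacement of the
    four edges between the cropping box and the ground-truth box."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0

def avg_step(step_nums):
    """Average cropping step over the n test images, Formula (7)."""
    return sum(step_nums) / len(step_nums)

def timed_crop(crop_fn, image):
    """Measure the per-image cropping time of a crop function."""
    start = time.perf_counter()
    box = crop_fn(image)
    return box, time.perf_counter() - start
        </preformat>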
        <p>
          From the experimental data recorded in Table 1, it is obvious that our method (LA2C)
performs well on the FCD database [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] test set, with all four indicators (avg IOU value, avg
displacement error, avg steps, and avg time) fully ahead of the RankSVM+DeCAF [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
VFN+SW++ [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], A2-RL [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] methods. From Table 2, our method (LA2C) performs very close to
A2-RL on the CUHK-ICD test set [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. From Table 3, which compares RankSVM+DeCAF [<xref ref-type="bibr" rid="ref4">4</xref>], VFN+SW++ [<xref ref-type="bibr" rid="ref5">5</xref>], A2-RL [<xref ref-type="bibr" rid="ref16">16</xref>], LA2C-LSTM (ours),
and LA2C (ours), our method (LA2C) performs
better than the VFN+SW++ and A2-RL [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] models on the FCD [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] test set; in particular, the number of
cropping steps is reduced by an average of 5.6 (a 41.29% improvement), and the cropping time
is shortened by 0.046 s per image (a 31.29% improvement).
        </p>
        <p>Compared with methods based on the sliding window, which select the most aesthetic
image from a large number (1,125) of candidate windows in an inefficient and time-consuming
cropping process, our method regards the cropping process as a Markov decision process and,
based on deep reinforcement learning, has the advantages of shorter cropping times and more
human-like behavior. Compared with the previous method based on deep reinforcement
learning, our method has a great advantage in cropping steps and cropping time.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Limitations and future work</title>
        <p>The proposed method suffers from a few limitations. One potential deficiency is
that our method (LA2C) does not incorporate professional photography or aesthetic knowledge,
simply allowing the model to learn how to crop images on its own. This may result in a large
difference between the cropping results and human aesthetics. On the other hand, the training
samples in the database are all positive samples, so the model lacks negative-sample learning.
In the future, we will continue to study the problem of automatic image cropping from the
following aspects: making the automatic cropping model integrate more professional
photography and aesthetic knowledge, and studying how to quantify and evaluate the aesthetic
quality of images during model learning, which would help to further improve the cropping
effect. In addition, we will try to transfer the method of automatic image cropping to video
auto-cropping and composition.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this paper, we regard the automatic image cropping problem as a Markov
decision process [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and propose a novel, simple, and lightweight method based on the deep
reinforcement learning algorithm Advantage Actor-Critic (A2C) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. With an accurate indicator
of the cropping effect, the IOU-value reward, and a network with a strong ability to extract
features, our LA2C method improves cropping accuracy and achieves real-time cropping while
reducing the average number of cropping steps.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Larbi</given-names>
            <surname>Abdenebaoui</surname>
          </string-name>
          , Benjamin Meyer, Albert Bruns, and
          <string-name>
            <given-names>Susanne</given-names>
            <surname>Boll</surname>
          </string-name>
          .
          <article-title>Unna: A unified neural network for aesthetic assessment</article-title>
          .
          <source>In 2018 International Conference on Content-Based Multimedia Indexing (CBMI)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Juan</surname>
            <given-names>C Caicedo</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          .
          <article-title>Active object localization with deep reinforcement learning</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pages
          <fpage>2488</fpage>
          –
          <lpage>2496</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jiansheng</given-names>
            <surname>Chen</surname>
          </string-name>
          , Gaocheng Bai, Shaoheng Liang, and
          <string-name>
            <given-names>Zhengqin</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Automatic image cropping: A computational complexity study</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>507</fpage>
          –
          <lpage>515</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Yi-Ling</surname>
            <given-names>Chen</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzu-Wei</surname>
            <given-names>Huang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kai-Han</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu-Chen</surname>
            <given-names>Tsai</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hwann-Tzong Chen</surname>
          </string-name>
          , and
          <string-name>
            <surname>Bing-Yu Chen</surname>
          </string-name>
          .
          <article-title>Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study</article-title>
          .
          <source>In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV)</source>
          , pages
          <fpage>226</fpage>
          –
          <lpage>234</lpage>
          . IEEE,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Yi-Ling</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Jan Klopp, Min Sun,
          <string-name>
            <surname>Shao-Yi Chien</surname>
          </string-name>
          , and
          <string-name>
            <surname>Kwan-Liu Ma</surname>
          </string-name>
          .
          <article-title>Learning to compose with professional photographs on the web</article-title>
          .
          <source>In Proceedings of the 25th ACM international conference on Multimedia</source>
          , pages
          <fpage>37</fpage>
          –
          <lpage>45</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ritendra</given-names>
            <surname>Datta</surname>
          </string-name>
          , Dhiraj Joshi,
          <string-name>
            <given-names>Jia</given-names>
            <surname>Li</surname>
          </string-name>
          , and James Z Wang.
          <article-title>Studying aesthetics in photographic images using a computational approach</article-title>
          .
          <source>In European conference on computer vision</source>
          , pages
          <fpage>288</fpage>
          –
          <lpage>301</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yubin</given-names>
            <surname>Deng</surname>
          </string-name>
          , Chen Change Loy, and
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Image aesthetic assessment: An experimental survey</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ):
          <fpage>80</fpage>
          –
          <lpage>106</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sagnik</given-names>
            <surname>Dhar</surname>
          </string-name>
          , Vicente Ordonez, and Tamara L Berg.
          <article-title>High level describable attributes for predicting aesthetics and interestingness</article-title>
          .
          <source>In CVPR 2011</source>
          , pages
          <fpage>1657</fpage>
          –
          <lpage>1664</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Seyed</surname>
            <given-names>A Esmaeili</given-names>
          </string-name>
          , Bharat Singh, and Larry S Davis.
          <article-title>Fast-at: Fast automatic thumbnail generation using deep neural networks</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>4622</fpage>
          –
          <lpage>4630</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Chen</surname>
            <given-names>Fang</given-names>
          </string-name>
          , Zhe Lin, Radomir
          <string-name>
            <surname>Mech</surname>
            , and
            <given-names>Xiaohui</given-names>
          </string-name>
          <string-name>
            <surname>Shen</surname>
          </string-name>
          .
          <article-title>Automatic image cropping using visual composition, boundary simplicity and content preservation models</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Multimedia</source>
          , pages
          <fpage>1105</fpage>
          –
          <lpage>1108</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Eunbin</surname>
            <given-names>Hong</given-names>
          </string-name>
          , Junho Jeon, and
          <string-name>
            <given-names>Seungyong</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Cnn based repeated cropping for photo composition enhancement</article-title>
          .
          <source>In CVPR workshop</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ronald</surname>
            <given-names>A</given-names>
          </string-name>
          <string-name>
            <surname>Howard</surname>
          </string-name>
          .
          <article-title>Dynamic programming and markov processes</article-title>
          .
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Zequn</surname>
            <given-names>Jie</given-names>
          </string-name>
          , Xiaodan Liang, Jiashi Feng, Xiaojie Jin, Wen Lu, and
          <string-name>
            <given-names>Shuicheng</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <article-title>Tree-structured reinforcement learning for sequential object localization</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>127</fpage>
          –
          <lpage>135</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Yan</surname>
            <given-names>Ke</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Feng</given-names>
            <surname>Jing</surname>
          </string-name>
          .
          <article-title>The design of high-level features for photo quality assessment</article-title>
          .
          <source>In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>419</fpage>
          –
          <lpage>426</lpage>
          . IEEE,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Shu</surname>
            <given-names>Kong</given-names>
          </string-name>
          , Xiaohui Shen,
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Lin</surname>
          </string-name>
          , Radomir
          <string-name>
            <surname>Mech</surname>
            , and
            <given-names>Charless</given-names>
          </string-name>
          <string-name>
            <surname>Fowlkes</surname>
          </string-name>
          .
          <article-title>Photo aesthetics ranking network with attributes and content adaptation</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          , pages
          <fpage>662</fpage>
          –
          <lpage>679</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Debang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Huikai Wu</surname>
          </string-name>
          , Junge Zhang, and Kaiqi Huang.
          <article-title>A2-RL: Aesthetics aware reinforcement learning for image cropping</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>8193</fpage>
          –
          <lpage>8201</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Debang</given-names>
            <surname>Li</surname>
          </string-name>
          , Junge Zhang, Kaiqi Huang, and
          <string-name>
            <surname>Ming-Hsuan Yang</surname>
          </string-name>
          .
          <article-title>Composing good shots by exploiting mutual relations</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>4213</fpage>
          –
          <lpage>4222</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Wei</surname>
            <given-names>Luo</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaogang</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Content-based photo quality assessment</article-title>
          .
          <source>In 2011 International Conference on Computer Vision</source>
          , pages
          <fpage>2206</fpage>
          –
          <lpage>2213</lpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Long</surname>
            <given-names>Mai</given-names>
          </string-name>
          , Hailin Jin, and Feng Liu.
          <article-title>Composition-preserving deep photo aesthetics assessment</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <fpage>497</fpage>
          –
          <lpage>506</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Volodymyr</surname>
            <given-names>Mnih</given-names>
          </string-name>
          , Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
          <string-name>
            <given-names>and Koray</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          .
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          .
          <source>In International conference on machine learning</source>
          , pages
          <fpage>1928</fpage>
          –
          <lpage>1937</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Naila</surname>
            <given-names>Murray</given-names>
          </string-name>
          , Luca Marchesotti, and
          <string-name>
            <given-names>Florent</given-names>
            <surname>Perronnin</surname>
          </string-name>
          .
          <article-title>Ava: A large-scale database for aesthetic visual analysis</article-title>
          .
          <source>In 2012 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>2408</fpage>
          –
          <lpage>2415</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Masashi</surname>
            <given-names>Nishiyama</given-names>
          </string-name>
          , Takahiro Okabe,
          <string-name>
            <given-names>Yoichi</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Imari</given-names>
            <surname>Sato</surname>
          </string-name>
          .
          <article-title>Sensation-based photo cropping</article-title>
          .
          <source>In Proceedings of the 17th ACM international conference on Multimedia</source>
          , pages
          <fpage>669</fpage>
          –
          <lpage>672</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Zhou</surname>
            <given-names>Ren</given-names>
          </string-name>
          , Xiaoyu Wang, Ning Zhang, Xutao Lv, and
          <string-name>
            <surname>Li-Jia Li</surname>
          </string-name>
          .
          <article-title>Deep reinforcement learning-based image captioning with embedding reward</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>290</fpage>
          –
          <lpage>298</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          , Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and
          <string-name>
            <surname>Liang-Chieh Chen</surname>
          </string-name>
          .
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>4510</fpage>
          –
          <lpage>4520</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Bongwon</surname>
            <given-names>Suh</given-names>
          </string-name>
          , Haibin Ling, Benjamin B Bederson, and David W Jacobs.
          <article-title>Automatic thumbnail cropping and its effectiveness</article-title>
          .
          <source>In Proceedings of the 16th annual ACM symposium on User interface software and technology</source>
          , pages
          <fpage>95</fpage>
          –
          <lpage>104</lpage>
          . ACM,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Hossein</given-names>
            <surname>Talebi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peyman</given-names>
            <surname>Milanfar</surname>
          </string-name>
          .
          <article-title>Nima: Neural image assessment</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          ,
          <volume>27</volume>
          (
          <issue>8</issue>
          ):
          <fpage>3998</fpage>
          –
          <lpage>4011</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Yi</surname>
            <given-names>Tu</given-names>
          </string-name>
          , Li Niu,
          <string-name>
            <given-names>Weijie</given-names>
            <surname>Zhao</surname>
          </string-name>
          , Dawei Cheng, and Liqing Zhang.
          <article-title>Image cropping with composition and saliency aware aesthetic score map</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          , pages
          <fpage>12104</fpage>
          –
          <lpage>12111</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Wenguan</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jianbing</given-names>
            <surname>Shen</surname>
          </string-name>
          .
          <article-title>Deep cropping via attention box prediction and aesthetics assessment</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pages
          <fpage>2186</fpage>
          –
          <lpage>2194</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Wenguan</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Jianbing</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Haibin</given-names>
            <surname>Ling</surname>
          </string-name>
          .
          <article-title>A deep network solution for attention and aesthetics aware photo cropping</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Jianzhou</surname>
            <given-names>Yan</given-names>
          </string-name>
          , Stephen Lin, Sing Bing Kang, and
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Learning the change for automatic image cropping</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>971</fpage>
          –
          <lpage>978</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Runsheng</surname>
            <given-names>Yu</given-names>
          </string-name>
          , Wenyu Liu, Yasen Zhang, Zhi Qu,
          <string-name>
            <given-names>Deli</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Bo</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Deepexposure: Learning to expose photos with asynchronously reinforced adversarial learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2153</fpage>
          –
          <lpage>2163</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>