<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Area Assessment of Objects Using Deep Learning Approach and Satellite Imagery Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kirill Tsyganov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexey Kozionov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaroslav Bologov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandr Andreev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleg Mangutov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Gorokhov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deloitte Analytics Institute, ZAO Deloitte &amp; Touche CIS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe a real-world case of applying deep neural networks to assess the area of different types of objects in a selected geographical region through analysis of satellite images. The task was to detect, segment and assess the area of buildings and agricultural lands in satellite images. We present our framework for solving the problem and our methods for validating the results. We compare the performance of different convolutional neural networks on our case and discuss the best segmentation model we found, the U-net convolutional network. No training dataset of images and corresponding masks was available for our geographical region, so we constructed our own training set. The paper reports in detail on satellite imagery data preparation, image pre-processing, construction of the training dataset and training of the neural networks.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning</kwd>
        <kwd>Image segmentation</kwd>
        <kwd>Object detection</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>U-net</kwd>
        <kwd>Satellite imagery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The paper presents the main technical details of a real-life client case from the
practice of Deloitte Analytics Institute (Moscow). The paper does not claim scientific
novelty for the methods applied in the solution; rather, it describes our approach to
using recent developments in machine learning in an actual industrial case.</p>
      <p>Due to the existing country legislation, the client faced a lack of systematic
records on agricultural and residential area assessments and other national
statistics. The client wanted to perform a structured audit of agricultural lands
and residential areas, paired with further monitoring of their development over time.
The client asked us to provide a solution for automated area assessment
based on analysis of satellite imagery.</p>
      <p>
        Since the problem required an accurate solution, we decided to use a supervised
deep learning approach. This required a training dataset, a neural
network architecture for image segmentation and computational hardware resources
to train the network on the training data. We planned to experiment with a publicly
available dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and, in case of poor performance on test images of our region,
to create our own dataset for our region of interest. For the neural network
architectures we took a straightforward CNN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and a more complex architecture in which
outputs of earlier layers are passed on to later layers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To evaluate the networks’ performance
we used the Jaccard index.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Satellite imagery data used in the solution</title>
      <sec id="sec-2-1">
        <title>Data specific restrictions</title>
        <p>In order to apply a deep learning approach to image segmentation we needed
a training set of images, i.e. pairs of satellite images and their corresponding masks
in which only the objects to be detected are marked.</p>
        <p>There was no training dataset with the agricultural lands of our interest, so we
had to construct our own dataset.</p>
        <p>The geographical region of our research has a specific desert environment, and
there was no training dataset of images for building segmentation in this region.
To overcome this lack of labeled data we tried to use a publicly available aerial
imagery training set<sup>1</sup> of another geographical region (fig. 1). However, models
trained on this open dataset showed insufficient recognition quality when tested
on images of our region of interest. Possible causes of the poor quality are the
following:
– because the training and test images came from distinct geographical regions,
the buildings in the training dataset and in the test images were very
different: both roof colors and building shapes differed;
– projection angles of the training and test images were different, which changed the size
of object shadows;
– image color schemes of the training and test sets were significantly different.</p>
        <p><sup>1</sup> Massachusetts Buildings Dataset, publicly available at http://www.cs.toronto.edu/~vmnih/data/.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Training dataset construction</title>
        <p>After several unsuccessful attempts to use open training datasets for our
problem, we decided to use satellite imagery of our region of interest as the
training dataset. Since no labeled dataset was available, we
constructed one ourselves.</p>
        <p>We used satellite images with a resolution of 1 meter per pixel for the training and
test sets. This resolution enabled the neural network to detect the border structure of
small buildings with an area of approximately 30 square meters.</p>
        <p>To construct the training dataset we took several small subregions and manually
drew masks of buildings and agricultural lands for them (fig. 2 and fig. 3). In
order to improve the generalization ability of our models we included in the training dataset
buildings and agricultural lands of all types from different geographical
subregions. Forming the training dataset was an iterative process:
1. We trained the model on the training dataset.
2. We tested the model on the test dataset.
3. We visually examined the model’s recognition quality on the test images and
looked for subregions where the model showed low accuracy.
4. We manually created masks for the poorly recognized subregions
and added these image-mask pairs to the training
dataset.
5. We returned to step 1.</p>
        <p>To speed up the formation of the training dataset, the images in the initial
training dataset were rectangles of different sizes. However, the input of the
neural network must have one predefined size. Therefore,
for every image in the training dataset and its corresponding mask
we took patches with a sliding window of size 64 × 64 and step 16 (fig. 4).</p>
        <p>In order to enlarge the training dataset without additional manual labelling of images
we used standard image data augmentation techniques, i.e. rotations and
reflections of the original images (fig. 5). The augmentation is applied to
square patches, so every element of the symmetry group of the square is
applied to every patch.</p>
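        <p>The patch extraction and augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: the function names, the NumPy representation of images and masks, and the random demo tile are hypothetical; only the window size 64, the step 16 and the use of the symmetry group of the square come from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def extract_patches(image, mask, size=64, step=16):
    """Slide a size x size window with the given step over an image/mask pair."""
    patches = []
    height, width = mask.shape[:2]
    for top in range(0, height - size + 1, step):
        for left in range(0, width - size + 1, step):
            patches.append((image[top:top + size, left:left + size],
                            mask[top:top + size, left:left + size]))
    return patches


def square_symmetries(patch):
    """Return the 8 images of a square patch under the symmetry group of the square."""
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)         # rotation by k * 90 degrees
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # the same rotation followed by a mirror
    return variants


# Hypothetical 96 x 128 RGB tile with a random binary mask (illustration only).
rng = np.random.default_rng(0)
image = rng.random((96, 128, 3))
mask = (rng.random((96, 128)) > 0.5).astype(np.uint8)

pairs = extract_patches(image, mask)
# The same symmetry is applied to a patch and to its mask so they stay aligned.
augmented = [(img_v, msk_v)
             for img, msk in pairs
             for img_v, msk_v in zip(square_symmetries(img), square_symmetries(msk))]
```
]]></preformat>
        <p>In this sketch each patch yields eight image-mask pairs (four rotations, each with and without a mirror), and the rectangle is covered with overlapping windows because the step is smaller than the window size.</p>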
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation metrics</title>
      <p>Object segmentation problems are commonly evaluated by the Jaccard index and
visual analysis. We used the following metrics to assess the performance of the
models: Jaccard index, area error, precision and recall.</p>
      <sec id="sec-3-1">
        <title>Jaccard index</title>
        <p>The Jaccard index, also known as Intersection over Union (IU), is a measure of
similarity and diversity of two sets. To compute the Jaccard index of
two finite sets A and B, divide the cardinality of the intersection of A
and B by the cardinality of the union of A and B:</p>
        <p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}, \qquad 0 \le J(A, B) \le 1.</tex-math>
          </disp-formula>
        </p>
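        <p>For binary masks, the sets A and B in (1) are the sets of pixels marked as object in the predicted and in the reference mask. The following sketch computes the Jaccard index together with precision and recall. The function name and the toy masks are hypothetical, and returning 1.0 for an empty denominator is a convention assumed here, not something stated in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def segmentation_metrics(pred, truth):
    """Jaccard index, precision and recall of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # |A intersect B|
    fp = np.count_nonzero(pred & ~truth)   # predicted object, actually background
    fn = np.count_nonzero(~pred & truth)   # missed object pixels
    union = tp + fp + fn                   # |A union B|
    return {
        "jaccard": tp / union if union else 1.0,
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
    }


# Toy example: two 2 x 2 squares that overlap in half of their pixels.
truth = np.zeros((4, 4), dtype=bool)
truth[0:2, 0:2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 1:3] = True

metrics = segmentation_metrics(pred, truth)
```
]]></preformat>
        <p>In the toy example precision and recall are both 0.5 while the Jaccard index is 1/3, because its denominator counts false positives and false negatives together.</p>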
        <p>
          The Jaccard index penalizes errors of both types more heavily than precision
and recall, since it uses both false positive and false negative statistics (fig. 6).
        </p>
        <p>
          We had two classes of objects to detect and segment: buildings and agricultural
lands with growing plants. Based on the conclusion that independent
segmentation of each class performs better than simultaneous multinomial segmentation of
multiple classes [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we decided to solve the segmentation problem
for each class separately. There was an additional argument for this
separation: since the second class of objects was agricultural land with
growing plants only, we were going to use additional features, such as vegetation indexes [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
in order to distinguish growing from non-growing plants more accurately.
        </p>
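        <p>The text does not name a specific vegetation index. As one common example, the normalized difference vegetation index (NDVI) introduced in [<xref ref-type="bibr" rid="ref7">7</xref>] is computed from the near-infrared and red bands. The sketch below assumes that both bands are available as reflectance arrays of the same shape; the function name and the small constant that guards against division by zero are illustrative.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def ndvi(nir, red, eps=1e-9):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    # eps is an assumed safeguard against division by zero on dark pixels.
    return (nir - red) / (nir + red + eps)


# Hypothetical reflectances: a vegetated pixel and a bare-soil-like pixel.
values = ndvi([0.8, 0.3], [0.2, 0.3])
```
]]></preformat>
        <p>Values close to 1 indicate dense green vegetation, while values near 0 are typical of surfaces that reflect red and near-infrared light about equally.</p>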
      </sec>
      <sec id="sec-3-2">
        <title>Buildings segmentation network architecture</title>
        <p>
          We examined several architectures of convolutional neural networks: U-net [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
with different hyperparameter settings, and a neural network with mixed
convolutional and fully connected layers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>The architecture of the best-performing CNN is based on U-net. Among other
differences, our network has fewer merging layers (2 merges instead of
3): we found that training a CNN with 3 merges is very time consuming
but does not give a significant benefit in performance.</p>
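        <p>To illustrate what a merge means in a U-net-like network, the following shape-level sketch follows a 64 × 64 patch through two pooling steps and two merges. It is only an illustration of the data flow: the learned convolutions and dropout are omitted, and the function names and the three-channel input are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def max_pool_2x2(x):
    """2 x 2 max-pooling of an array of shape (height, width, channels)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


def upsample_2x(x):
    """Nearest-neighbour upsampling by a factor of 2 in both spatial axes."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def merge(upsampled, skip):
    """Append contracting-path features to expanding-path features (channel axis)."""
    return np.concatenate([upsampled, skip], axis=-1)


x0 = np.random.default_rng(0).random((64, 64, 3))  # hypothetical input patch
x1 = max_pool_2x2(x0)                              # 32 x 32, contracting path
x2 = max_pool_2x2(x1)                              # 16 x 16, bottom of the network
u1 = merge(upsample_2x(x2), x1)                    # first merge, 32 x 32
u0 = merge(upsample_2x(u1), x0)                    # second merge, 64 x 64
```
]]></preformat>
        <p>After the second merge the feature map has the spatial size of the input again, and the channels copied from the contracting path carry fine detail that is lost in pooling.</p>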
        <p>Our network (fig. 7) starts with a contracting path of repeated
convolution, max-pooling and dropout layers and proceeds with an expansive path
in which max-pooling is replaced by upsampling. The most important and
beneficial feature of the network is that the output of the contracting
layers is appended to the input of the expansive layers. This approach significantly improves
network performance in extracting the border structure of buildings. All
convolutional layers except the last one use the ReLU activation function, and the output
layer uses softmax.</p>
        <p>In general, the problem of land segmentation is analogous to building
segmentation. However, the average farm is much bigger than the average building,
so one needs to cut the initial image into considerably larger patches to preserve
the information about the farm structure and its surroundings. The segmentation
problem becomes computationally expensive when the neural network has to
process such large image patches.</p>
        <p>A different approach was applied to the recognition of circular farms in order to overcome
the computational difficulties. Its main feature is the
combination of two heatmaps, produced by different processing techniques, into
the final segmentation map. The first heatmap is produced by applying elliptical
filters of various sizes to the initial image. The exact sizes of the filters depend on the image
resolution; in this paper 5 × 5 and 50 × 50 filters were applied to 1 meter per pixel
maps. An elliptical filter may be described as a binary image of a circle inscribed in
a square of a certain size, or as a matrix of zeros and ones with the ones filling the
central circle-shaped region of the matrix. Applying the filter performs an erosion
operation: the filter slides over the image (like a kernel in a CNN
convolution layer), and at each position the minimum of the image values covered by
the ones of the filter is assigned to the anchor point, which is
set at the center of the filter. Thus, applying a filter transforms the initial
image similarly to a CNN convolution layer followed by a min-pooling layer.
As a result, filtering, like a CNN, also produces a heatmap, which is shown in
Figure 9.</p>
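        <p>The erosion step can be sketched in plain NumPy as follows. The sketch assumes a single-channel image and pads the border with the image maximum so that the border itself does not lower the minimum; the function names, the padding choice and the synthetic test image are assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def disk_filter(size):
    """Binary matrix with ones inside the circle inscribed in a size x size square."""
    radius = size / 2.0
    coords = np.arange(size) + 0.5 - radius  # pixel centres relative to the middle
    yy, xx = np.meshgrid(coords, coords, indexing="ij")
    return (xx ** 2 + yy ** 2) <= radius ** 2


def erode(image, kernel):
    """Grey-scale erosion: minimum of the pixels covered by the ones of the kernel."""
    size = kernel.shape[0]
    before = size // 2
    after = size - 1 - before
    padded = np.pad(image, ((before, after), (before, after)),
                    mode="constant", constant_values=image.max())
    windows = sliding_window_view(padded, kernel.shape)  # (H, W, size, size)
    # Pixels under the zeros of the kernel are ignored by replacing them with +inf.
    return np.where(kernel, windows, np.inf).min(axis=(-2, -1))


# Synthetic image: a bright disk of radius 10 on a dark background.
yy, xx = np.mgrid[0:40, 0:40]
image = ((yy - 20) ** 2 + (xx - 20) ** 2 <= 100).astype(float)
eroded = erode(image, disk_filter(5))
```
]]></preformat>
        <p>The bright disk survives the erosion but shrinks, which matches the observation that areas detected by filtering are smaller than the actual ones.</p>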
        <p>The second heatmap is produced by a random forest classifier that
was trained to predict the pixel class (farm / non-farm) based on its color.</p>
        <p>
          The idea behind the proposed approach is to combine the advantages of two
techniques that compensate for each other’s flaws. The color segmentation method produces
a relatively noisy heatmap, as the color of hills and roads is somewhat similar to
the color of farms (especially when the crop is not yet grown). The shape detection method
(filtering) produces much less noise, but the detected farm areas are significantly
smaller than the actual ones due to the information loss during the erosion process. In the
joint heatmap, calculated as the average of the previous two, the intensity of noise is lower
than in the color segmentation map and the boundaries of farms are closer to the actual ones
than in the shape detection map. The remaining noise can be removed by applying
thresholding and a median filter [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
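        <p>The fusion of the two heatmaps can be sketched as follows. The text does not give the threshold value or the size of the median filter, so the threshold 0.5 and the 3 × 3 window below are assumptions, as are the function names and the toy heatmaps.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_3x3(binary_map):
    """3 x 3 median filter with edge replication at the border."""
    padded = np.pad(binary_map, 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1))


def fuse_heatmaps(shape_heatmap, color_heatmap, threshold=0.5):
    """Average two heatmaps, threshold the result and remove isolated noise."""
    joint = (shape_heatmap + color_heatmap) / 2.0
    binary = (joint >= threshold).astype(np.uint8)
    return median_filter_3x3(binary).astype(np.uint8)


# Toy heatmaps: the shape map underestimates the farm, the color map is noisy.
shape_heatmap = np.zeros((10, 10))
shape_heatmap[3:7, 3:7] = 1.0
color_heatmap = np.zeros((10, 10))
color_heatmap[2:8, 2:8] = 1.0
color_heatmap[0, 9] = 1.0  # isolated false detection

farm_mask = fuse_heatmaps(shape_heatmap, color_heatmap)
```
]]></preformat>
        <p>In the toy example the isolated false detection is removed by the median filter, while the farm region extends beyond the smaller area found by the shape heatmap alone.</p>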
      </sec>
      <sec id="sec-3-3">
        <title>Polygons extraction</title>
        <p>
          Because of the final softmax activation function, the neural network output provided
us with two probabilistic heatmaps: one with the probabilities of buildings and
its inverse. To present the recognition results in a geospatial
system, however, the heatmaps must be converted into polygons. For this task we
used thresholding of the heatmaps and the Douglas-Peucker algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
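        <p>The simplification step can be illustrated with a compact recursive version of the Douglas-Peucker algorithm. The sketch assumes that a contour has already been traced from the thresholded heatmap and is given as a list of (x, y) points; it measures the distance to the segment between the end points, which is a common variant. The function names, the tolerance and the sample outline are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment between points a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Drop points that lie within the tolerance of the simplified polyline."""
    if len(points) < 3:
        return list(points)
    distances = [point_segment_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    farthest = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[farthest - 1] <= tolerance:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[:farthest + 1], tolerance)
    right = douglas_peucker(points[farthest:], tolerance)
    return left[:-1] + right


# Hypothetical noisy outline of a building edge followed by two clean edges.
outline = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (3, 3), (0, 3)]
simplified = douglas_peucker(outline, tolerance=0.5)
```
]]></preformat>
        <p>With this tolerance the two slightly displaced points on the first edge are removed and the corner points are kept.</p>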
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment results</title>
      <p>We obtained a Jaccard index of approximately 0.61 for building recognition and
0.65 for circular agricultural farms. The recognition results for buildings and
circular agricultural farms are shown in figures 8 and 9, respectively. In
addition to the Jaccard index, we computed the total area accuracy, which was
94% for the building segmentation problem on the validation dataset.</p>
      <p>Since we have a binary classification problem (buildings, background) we used
binary cross entropy as the loss function:</p>
      <p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>H_b(p) = H(p, 1 - p) = -p \log(p) - (1 - p) \log(1 - p).</tex-math>
        </disp-formula>
      </p>
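      <p>Equation (2) is written in terms of a single probability p. In training, the binary cross entropy is usually evaluated between a target label y (1 for building, 0 for background) and the predicted probability p, and averaged over pixels. The sketch below assumes this standard form; the function name and the clipping constant are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def binary_cross_entropy(target, prob, eps=1e-7):
    """Mean of -y*log(p) - (1-y)*log(1-p) over all pixels."""
    target = np.asarray(target, dtype=float)
    # Clipping is an assumed safeguard that keeps the logarithms finite.
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    losses = -target * np.log(prob) - (1.0 - target) * np.log(1.0 - prob)
    return float(np.mean(losses))


labels = [1, 0]
confident = binary_cross_entropy(labels, [0.9, 0.1])
uncertain = binary_cross_entropy(labels, [0.6, 0.4])
```
]]></preformat>
      <p>Confident correct predictions give a lower loss than uncertain ones, which is what drives the predicted heatmap towards the reference mask.</p>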
      <p>The training of the U-net with input and output patches of size 64 × 64 did
not overfit until approximately epoch 85: from epoch 85 onwards the validation
loss deviated significantly while the training loss kept decreasing smoothly, and this hurt the
quality on the test data (left plot in fig. 10).</p>
      <p>Jaccard indices for different patch sizes (as input and output shape of the
neural network) behaved the same starting from epoch 9 (right plot in fig. 10).</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>We highlight the following directions in which our
solution could be improved:</p>
      <p>
        – Color histogram equalization of satellite images.
Since the satellite images in the initial photo bank could have been taken by
different satellites, the color histograms of the images can differ significantly.
Such variety could harm the recognition quality. Therefore the images’ color
histograms should be equalized before further analysis. We suggest that
contrast limited adaptive histogram equalization (CLAHE) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is the most
appropriate method for equalizing the images’ colors.
      </p>
      <p>
        – Additional spectral bands.
The near-infrared (NIR) range and the red edge channel could significantly enhance
the quality of recognition algorithms, especially for agricultural lands. For
example, a combination of different bands with different resolutions from different
satellites in one regression model demonstrates high accuracy in assessing agricultural
land condition [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>– Training dataset formation.
Creating a mask for a satellite image is a laborious task. In order to
significantly improve recognition quality it is necessary to have masks
for all types of objects of a given class. We suggest extending the training dataset
not only by augmenting the available images but also by including
poorly recognized regions.</p>
      <p>
        – Object detection phase.
Region proposal networks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] address the problem of object
detection. An object detection phase can be used before image segmentation
in order to reduce noise from other objects [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        – Object boundaries adjustment by probabilistic graphical models.
In order to improve the localization accuracy of object boundaries, a
combination of methods from DCNNs and probabilistic
graphical models has been proposed [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Since CNNs can predict the rough position of objects but
have difficulty delineating their boundaries, the authors presented an
approach that refines object boundaries by applying fully connected
conditional random fields (CRF) after the final
layer of the CNN. They demonstrated improved performance of this approach on the
PASCAL VOC-2012 image segmentation task, so we think that the solution
can be applied to our problem with benefit.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        We present a report on applying a deep learning approach to the real-life problem
of object area assessment. We describe the whole solution process: collection
of satellite imagery with appropriate resolution, creation of the training dataset by
manual labelling and data augmentation techniques, training and testing of CNNs,
and extraction of building polygons from the CNN’s output heatmaps. We obtained
sufficient recognition quality (a Jaccard index of 0.61 for buildings) with a CNN
based on the U-net architecture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Finally, we propose next steps for the
recognition model design and feature engineering.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Brox</surname>
          </string-name>
          :
          <article-title>U-Net: Convolutional Networks for Biomedical Image Segmentation</article-title>
          . In:
          <source>Medical Image Computing and Computer-Assisted Intervention (MICCAI)</source>
          , Springer, LNCS, Vol.
          <volume>9351</volume>
          :
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Shunta</given-names>
            <surname>Saito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Takayoshi</given-names>
            <surname>Yamashita</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoshimitsu</given-names>
            <surname>Aoki</surname>
          </string-name>
          :
          <article-title>Multiple Object Extraction from Aerial Imagery with Convolutional Neural Networks</article-title>
          . In:
          <source>Journal of Imaging Science and Technology</source>
          , Volume
          <volume>60</volume>
          , Number
          <issue>1</issue>
          , January
          <year>2016</year>
          , pp.
          <fpage>10402-1</fpage>
          -
          <lpage>10402-9</lpage>
          (9).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          :
          <article-title>Machine Learning for Aerial Image Labeling</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Toronto,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Georgia Gkioxari, Piotr Dollar, Ross Girshick:
          <article-title>Mask R-CNN</article-title>
          .
          <source>PAMI</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Tukey</surname>
          </string-name>
          :
          <article-title>Non-linear (non-superposable) methods for smoothing data</article-title>
          ,
          <source>Int. Conf. Rec. 1974 EASCON</source>
          , p.
          <fpage>673</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>David</given-names>
            <surname>Douglas</surname>
          </string-name>
          , Thomas Peucker:
          <article-title>Algorithms for the reduction of the number of points required to represent a digitized line or its caricature</article-title>
          ,
          <source>The Canadian Cartographer</source>
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <fpage>112</fpage>
          -
          <lpage>122</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rouse</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>R.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schell</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Deering</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Monitoring Vegetation Systems in the Great Plains with ERTS</article-title>
          .
          <source>Proceedings, 3rd Earth Resource Technology Satellite (ERTS) Symposium</source>
          <year>1974</year>
          , vol.
          <volume>1</volume>
          , p.
          <fpage>309</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>K.</given-names>
            <surname>Zuiderveld</surname>
          </string-name>
          :
          <article-title>Contrast limited adaptive histogram equalization</article-title>
          ,
          <source>Graphics gems IV</source>
          , San Diego, CA:Academic Press Professional, Inc,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Rasmus</given-names>
            <surname>Houborg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthew F.</given-names>
            <surname>McCabe</surname>
          </string-name>
          :
          <article-title>High-Resolution NDVI from Planet's Constellation of Earth Observing Nano-Satellites: A New Data Source for Precision Agriculture</article-title>
          ,
          <source>Remote Sens.</source>
          <year>2016</year>
          ,
          <volume>8</volume>
          ,
          <fpage>768</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , Jeff Donahue, Trevor Darrell, Jitendra Malik:
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          ,
          <source>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          :
          <article-title>Fast R-CNN</article-title>
          ,
          <source>IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , Kaiming He, Ross Girshick, Jian Sun:
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          ,
          <source>Neural Information Processing Systems</source>
          (NIPS)
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Papandreou, I. Kokkinos, K. Murphy,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          :
          <article-title>Semantic image segmentation with deep convolutional nets and fully connected CRFs</article-title>
          , ICLR,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>