Building Change Detection in Aerial Images
                    Fatima Mroueh                                           Ihab Sbeity                        Mohamad Chaitou
          Computer Science Department                          Computer Science Department               Computer Science Department
               Lebanese University                                  Lebanese University                      Lebanese University
                 Beirut, Lebanon                                     Beirut, Lebanon                          Beirut, Lebanon
          fatima.mroueh249@gmail.com                              ihab.sbeity@gmail.com                   mohamad.chaitou@ul.edu.lb


   Abstract—In this paper, we provide an approach that detects                    data, such as data generated from Digital Elevation Models
the changes in buildings between two multi-temporal aerial                        (DEM), Light Detection and Ranging (LiDAR) technology
images of different sources. Since the images in most cases are                   and other kinds of remote sensing technologies [1] [4], or
not perfectly aligned, our approach takes into consideration the
differences in the geometric aspects of the images. Differences in                limited to work with specific types of images such as GeoTiff
scale, view point or overlapping regions may be present between                   images that contain accurate geographic information such as
the pair of images. Our approach relies on segmentation to                        coordinates is the global coordinates system. Furthermore,
extract building masks from the original aerial images. Changes                   they are also limited to aligned images that are of the same
are then found by comparing the features of the pair of masks                     scale and view point (same height, same camera calibrations,
using image matching algorithms. This procedure is applied
on a set of 80 pairs of aerial images of different sizes and                      same coordinates...).
with different applied transformations, and an evaluation has                        The main problem with these techniques is that they rely
been considered in comparison with the corresponding ground                       too much on the information provided with the images, and
truth references. The evaluation yields buildings change detection                therefore they cannot be applied to any image that is not
rate of 92.7%. The results of our proposed approach suggest                       enriched with any information such as geo-spatial information.
that automatic building change detection is possible, but further
research should include improvement of the segmentation phase                        Nowadays, the automatic analysis techniques of images
to better distinguish buildings and enhancement of the change                     are very essential. Machine Learning and Computer Vision
detection method. Real time application of the process is also a                  techniques, and more specifically the image matching algo-
challenging perspective.                                                          rithms, has proven to be very efficient in the field of image
   Index Terms—change detection, aerial images, image seg-                        processing and comparison. Furthermore, there are still poor
mentation, image matching algorithms, SIFT, feature detection,
feature description                                                               scientific methodologies for detecting changes in aerial im-
                                                                                  ages, especially those that differ in geometric aspects such as
                                                                                  scale and orientation, and without being limited to additional
                            I. I NTRODUCTION
                                                                                  information about the images. Going deeper into the topic
   Aerial imagery is - as it sounds - the process of taking                       is essential to introduce new efficient insights in the field of
images from the air. It is a subset of a larger domain called                     change detection in aerial images.
Remote Sensing. It consists of acquiring data without making                         Accordingly, this research provides a complete procedure
physical contact with the objects in study [1].                                   for building change detection in aerial images using machine
   Aerial images such as satellite imagery or drone imagery                       learning and computer vision techniques and algorithms.
are considered one of the richest sources of data that can be                        The main advantage of our approach is that it does not
used in various applications. Change detection in aerial images                   take benefit from any of the information derived from the
is detecting new or disappeared objects in images registered                      aerial images. It deals with aerial images as simple PNG or
at different moments of time and possibly in various lighting,                    JPG formats without any enrichment. More importantly, it can
heights and camera calibrations [2]. Detecting the changes in                     detect changes in aerial images that differ in scale and view
aerial images of the same region and taken at different times                     point and images that have overlapped regions. This way, our
is useful and important in many domains such as: automatic                        approach can be applied to any pair of aerial images despite
map updating, field change after catastrophic events, detecting                   of their related information or their geometric aspects.
illegal buildings areas and undeclared refugees camps, analysis
of urban and suburban areas, a base for automatic monitoring                                         II. P REVIOUS S TUDIES
system and some other military applications. For these rea-                          Detecting the changes in aerial images has been an old and
sons, detecting changes in aerial images has thus become an                       long journey. In particular, changes in buildings, is an essential
important research topic [3].                                                     part of this journey.
   In fact, several techniques and approaches are designed and                       Looking at the previous studies related to our topic, one can
implemented to detect changes in aerial images. However, all                      see that most of these studies rely on data fusion; they integrate
these techniques were motivated by the availability and fusion                    multiple data sources to produce more consistent and accurate
of different types of useful and profitable remote sensing                        information than that provided by any individual data source.

 Copyright © 2019 for this paper by its authors. Use permitted under Creative
 Commons License Attribution 4.0 International (CC BY 4.0).


                                                                                                                                                       13
   For example, in their work, Nebiker et al. used image-based         changes by finding where the update line features spatially
dense digital surface models (DSMs) in order to compute                match the base line features and detects spatial changes,
a depth value for every pixel of an image, combined with               attribute changes, or both, as well as no change. However,
the aerial images for the detection of individual buildings.           all inputs to this tool must be in the same coordinates system
They used these models with object-based image analysis to             [18]. While in our case, we aim to detect changes even if we
detect changes [5]. As well, the study of Chen Lin was                 do not know the spatial location of the geographic region we
based on multi-source data. They pre-processed the data using          are working on.
triangulation of an irregular network of data points collected            To the best of our knowledge, we did not find any study that
by Light Detection And Ranging (LiDAR) technology, and                 processes aerial images independently from any other source
then, the changes were detected by finding differences in              of information to extract buildings from. Moreover, computer
height by comparing the LiDAR point measurements and the               vision techniques such as image matching algorithms are not
estimates of the building models [6]. Furthermore, Alonso et           employed to detect changes although they proved to be very
al. applied the support vector machine (SVM) classification            efficient in the comparison of images.
algorithm to a joint satellite and laser data set for the extraction      To overcome the two problems cited above, our approach
of buildings. For change detection, they suggested to compare          works in three steps. As we are interested in small-scale
an old map with more recent spatial information instead of             change detection (buildings), the first step is the segmentation
comparing a pair of images [7]. Many other studies took                phase, in which we eliminate a large part of the scene without
benefit from data sources other than the aerial image itself           losing any actual building. This is possible by extracting build-
such as Digital Elevation Models (DEMs), laser scanner data,           ings’ footprints from the aerial images. Second, we use the
indicator of vegetation (NDVI), the relationship between the           SIFT image matching algorithm to check the correspondence
buildings and their shadows, and high resolution aerial images         of the pair of images, i.e. to make sure that the images taken
in order to detect changes in buildings [8] [9] [10] [11] [12].        correspond to the same geographic region. Third, we detect
Most of these studies suffered from significant problems with          the type of transformation applied to one of the images with
small buildings and with buildings surrounded by high trees.           respect to the other (scale, rotation, overlap). The detected
   Talking about extracting the buildings before detecting             transformation is then reversed to get two images of the same
changes, this step was included in numerous studies. Some              scale and view. The last step, the difference image can be
of them went for region-based classification where each small          computed and post-processed. In this step, the changes in the
region was classified to “building” or “no-building” based on          buildings are detected.
a decision tree induced from training data (edge recording of
the buildings), and then classified to “change” or “no change”                               III. BACKGROUD
based on some conditions [13]. Other used the indicator of
                                                                       A. Image Segmentation
vegetation NDVI for distinguishing buildings from trees since
both have similar height information [14]. Neural network                 Computer vision is a field that is intended to make comput-
classifier also was employed in order to classify the regions          ers accurately understand and efciently process visual data like
in an aerial image into multiple classes (grove, building, tree,       images. Extracting information from images and understand-
shadow . . . ) by feeding the neural network with many inputs          ing image information is very critical in many applications in
such as area, average gray level, shape factor and compactness         this domain. Computer vision helps in extracting features of
[3]. Region-based segmentation was also applied using a                an image in order to simplify image analysis [19].
decision tree that rely on the geometric properties of the land           In several cases, we may not be interested in all the
cover objects such as elevation, spectral information, texture         components of the image, but only for some areas or objects
and shape [15]. The most important and precise segmentation            that have certain characteristics related to our task. Image
was applied using Convolutional Neural Networks where the              segmentation is one of the best techniques to handle this issue.
large imagery is divided into small patches, and then CNN              This technique works by isolating objects from the rest of the
is trained with those patches and their corresponding three-           image [20] [21] [22] [23] [24]. Image segmentation mainly has
channel map patches (building, road and background) [16].              the role of classifying each pixel of an image into meaningful
However this work was not including change detection.                  classes that refer to specific objects. It involves grouping of
   As for detecting changes in aerial images that have different       the elements of an image by certain criteria of homogeneity
views, Bourdis et al. stated that camera motion and viewpoint          [4]. It does not only make a prediction for an input providing
differences introduce parallax effects. Therefore, in order to be      classes, but also provides additional information regarding the
robust to viewpoint differences, they introduce an algorithm to        location of such classes.
distinguish between real changes and parallax effects based on            Deep learning techniques have proven to be very efficient in
optical flow constrained with epipolar geometry [17]. In other         solving such problems. These techniques can learn patterns in
works concerning this point, knowing the calibration of the            order to predict classes. The main deep learning architecture
camera or the spatial information about the geographic area            that is used for image segmentation, and generally speaking
were essential in order to achieve the goal [13] [17].                 for image processing, is the Convolutional Neural Network
   Furthermore, ArcGIS Pro offers a tool that detect feature           (CNN).


                                                                                                                                           14
   Frameworks like MaskRCNN, RetinaNet allow to apply
image segmentation using deep learning. However the domain
of application of some of them is restricted to scene images,
and they cannot be used in case of aerial images [25] [26].
Other frameworks that work with aerial images such as ENVI,
ERDAS Imagine, eCognition and others are also available [27]
[28]. Nevertheless they have many limitations. Some of them
do not have any vectorization tool to convert the segmented
result to use them in further analysis, others are confused
with images where the building roofs were dark and having
intensities very less as compared to other building objects [29].
B. Image Matching
   In order to compare the images, we look for specific patterns
                                                                                      Fig. 1. Worflow of our approach.
or specific features that are unique in the images and that
can be easily compared. A feature is a relevant piece of
information. It is a specific structure in the image such as         matching feature in the other image. We can summarize the
a point, an edge or a corner. The operation of finding the           process of image matching as follows: 1- Find a set of
features of an image is called Feature Detection.                    distinctive keypoints. 2- Define a region around each keypoint.
   Feature detection is the process of transforming the visual       3- Extract and normalize the region content. 4- Compute a
information of the image into the vector space. It is basically      local descriptor from the normalized region. 5- Match local
finding keypoints (or interest points) in the image. A keypoint      descriptors.
is a unique point in the local area around it. A keypoint can           Many comparative studies have been published assessing
be matched to a corresponding point in another image. The            the performance of the image matching algorithms. The real
main purpose of detecting features is giving us the possibility      challenge is to achieve true invariant feature detection under
to perform mathematical operations on them, and thus to find         any image transformation. It seems that the selection of the
similar vectors that lead us to similar objects or scenes in         adequate algorithm to complete the matching task significantly
the images. Ideally, this information should be invariant under      depends on the type of the image to be matched and in the
image transformations, so we can find the same features again        variations within an image and its matching pair in scale,
even if the image is transformed in some way.                        orientation or other transformations. Most of these studies has
   Using a specific feature detection algorithm, we search           stated that Scale Invariant Feature Transform SIFT algorithm
for such features in the first image and then we look for            performs the best against different image transformations [30]
the same features in the other image. As a result, we get a          [31] [32] [33].
set of points (xi , yi ) for each image, where xi and yi are
the coordinates of the point i detected as a feature in the                              IV. M ETHODOLODY
image. After detecting interest points, we continue to compute
a descriptor for each one of them. The regions around the               Fig. 1 represents the overall process of our approach: First,
features should be described so that the algorithm can find the      the buildings footprints from the acquired aerial image will be
similar features in the other image. This is called the Feature      extracted in order to use them in detecting changes instead of
Description.                                                         the original aerial images. To achieve this step, a segmentation
   The local appearance around each feature point is described       model for extracting buildings masks from aerial images is
in some way that is invariant under changes in translation,          built. Second, we suppose that a database is already prepared
scale and rotation. Therefore, we end up with a descriptor           containing preprocessed aerial images’ masks of the region
vector for each feature point. Feature descriptors encode            of interest. In this step, we look into in the database for the
interesting information into a series of numbers and act as a        mask that corresponds to our input mask. This step is achieved
sort of numerical ‘fingerprint’ that can be used to differentiate    by computing a similarity measure between each couple of
one image from another. Once the features and the descriptors        images using SIFT image matching algorithm. Finally, after
are extracted and computed, some preliminary feature matches         aligning the couple of masks, we detect changes by filtering
between these images will be established.                            their difference image.
   Feature matching, or more generally image matching, is
the task of establishing correspondences between two images.         A. Buildings’ Footprints Extraction
Keypoints between two images are matched by identifying                 Extracting buildings’ footprints from aerial images is a kind
their nearest neighbors. This is achieved by comparing the           of preprocessing of the images before matching. It helps us to
descriptors across the images to identify similar features.          get better results in detecting changes, since by segmenting the
For any two images, we get a set of pairs (xi , yi ), (x0i , yi0 )   images, we get rid of every element that is considered noise
where (xi , yi ) is a feature in one image and (x0i , yi0 ) is its   for us (not an object of interest).


                                                                                                                                        15
                                                                     trails, railway stations and much more, all over the world.
                                                                     OSM maps are saved on the internet and they are totally free.
                                                                     But the most important thing is that OSM is accurate and up
                                                                     to date (normally updated every day) [35].
                                                                        There are two reasons for which we are building our own
                                                                     segmentation model instead of using OSM data directly. The
                                                                     first reason is that OSM data do not cover all the regions we
                                                                     are interested in. In Lebanon for example, buildings masks are
                                                                     not provided for all the country. So, we take benefit from the
                                                                     available geometries provided by OSM in order to build the
                                                                     segmentation model. Later, this model will provide us with the
                                                                     buildings footprints for regions that are not covered by OSM.
                                                                     The second reason, which is the most important one, is that
                                                                     we may not be aware of the exact location of the image in the
                                                                     global coordinates system. In such case, we cannot use OSM
                                                                     extracts.
                                                                        GeoFabrik server from OSM provides a convenient and
       Fig. 2. Aerial image and its corresponding buildings mask.    updated extracts which we can work with [36]. GeoFabrik
                                                                     team extract, select and process free geodata for everyone.
                                                                     They create shape files, maps, and map tiles with a free
   A segmentation model is needed for this purpose. Many             of charge download service. The geometries extracted from
tools that implement this techniques are available. The used         GeoFabrik server are shape files of extension .shp. A shape
tool to achieve our goal is RoboSat [34]. RoboSat is an              file is a simple format that is used for storing the geometric
end-to-end pipeline written in Python3 for feature extraction        location and attribute information of geographic features that
from aerial and satellite imagery. Features can be anything          can be represented by points, lines or polygons. We are only
visually distinguishable in the imagery such as buildings, roads     interested in polygon representation of the buildings. These
or cars [34]. We chose to work with RoboSat since it is              shape files can be visualized as vector layer in GIS tools,
specially designed to work with aerial images and it has shown       which help us to decide at what locations we need to download
important results in this domain.                                    satellite imagery to complete the dataset.
   The data preparation tools in RoboSat help us to create              Although the masks are not always perfect, but a slightly
and prepare the dataset for training feature extraction models.      noisy dataset will still work fine with training the model on
Also, the modelling tools in RoboSat help with training fully        thousands of images and masks.
convolutional neural networks for segmentation [34].                    The next step is to download the corresponding aerial
   Fig. 2 represents an aerial image with its corresponding          imagery. Our aerial imagery is downloaded from Mapbox [37].
buildings mask.                                                      Mapbox satellite is a full global base map. It uses global
      a) Data Preparation: We first walk through creating a          satellite and aerial imagery from commercial providers such
dataset for training feature extraction model. Such dataset con-     as NASA and USAS. Mapbox provides an API that allows us
sists of satellite imagery combined with their corresponding         to download the needed satellite imagery [37].
masks for the feature we want to extract, which is building             RoboSat works with the Slippy Map tile format to abstract
in our case. We can think of these masks as binary images            away georeferenced imagery behind tiles of the same size. A
which take the value zero where there is no building and one         Slippy Map is, in general, a term referring to modern web
for building areas.                                                  maps which let you zoom and pan around. By default, the
   This dataset will serve as training set for the segmentation      Slippy Map renders tiles. Tiles are 256 x 256 pixel PNG
model. The goal is to have a model that accepts an aerial            files. Each tile is a file in a directory representing a column,
image and outputs its corresponding buildings footprints. As         and each column is a subdirectory that represents the zoom
mentioned before, the footprints will be used to detect changes      level. RoboSat offers the tool that is responsible for tiling the
instead of original aerial images, and this is to reduce all kinds   collected aerial images as well as extracted geometries.
of noise that may affect the accuracy of our application. Our           With downloaded satellite imagery and rasterized corre-
objects of interest are only buildings.                              sponding masks, our dataset is complete and ready. Fig.3
   We start by extracting geometries from OpenStreetMap              shows the downloaded aerial imagery tiles with their corre-
(OSM) project. We try then to figure out where we need               sponding buildings footprints.
satellite imagery in order to complete the training set [35].              b) Training and Modelling: The RoboSat segmentation
OpenStreetMap (OSM) project creates and provides free ge-            model is a kind of fully convolutional neural network which
ographic data. The OpenStreetMap Foundation is an interna-           we train on pairs of aerial images and corresponding masks.
tional not-to-profit organization supporting the OpenStreetMap       The training process takes place within a GeForce GTX 1080
project. This project maintains data about roads, buildings,         platform. When picking up the best checkpoint, the model


                                                                                                                                         16
                                                                        Fig. 4. SIFT matches between pairs of masks having different transformations.


     Fig. 3. Aerial images tiles with their corresponding footprints.   keypoints, the set of matched keypoints between the pair of
                                                                        images and many other useful information.
                                                                           Fig. 4 represents the matching points between pairs of
allows to predict the segmentation probabilities for every              images having different transformations. For visualization, the
pixel in an image. These segmentation probabilities indicate            original image is put on the left side and the other image is put
how likely each pixel is a background or a building. These              on the right side and the matches are drawn as lines between
probabilities are then turned into discrete segmentation masks.         both images.
The same segmentation model is used for extracting buildings               Let n and m be the number of keypoints in the first mask and
footprints from old imagery as well as the input aerial image.          second mask respectively. And let S = {Pi /i 2 1, 2, . . . , n}
                                                                        be the set of detected keypoints in the first mask and S 0 =
B. Image Correspondence
                                                                        {Pi0 /i 2 1, 2, . . . , m} be the set of detected keypoints in the
   At this point, after extracting buildings footprints from the        second mask. Let M be the set containing the pair of keypoints
original input aerial image, we need to find its corresponding          indices that match with each other. Then M = {(i, j)/Pi 2
mask from the already prepared dataset. The pair of masks               SandPj0 2 S 0 } are found as matched keypoints. This naming
will not be perfectly aligned. Many types of transformations            will be used in all next sections.
may be applied to one of the images with respect to the other.             Both images are of the same scene then there must be
Different scales, different views, overlapping regions between          proportionality between the relative distances of the keypoints.
images are examples of such transformations.                            Thus in all cases this condition must be satisfied:
   Here, and because we do not know the exact location of
the images in the global coordinates system, some similarity                                   d(Pa , Pb )   d(Pe0 , Pf0 )
                                                                                                           ⇡                                     (1)
measure is needed to find the mask from the dataset that best                                  d(Pc , Pd )   d(Pg0 , Ph0 )
matches with our input image. This similarity measure will
help us in deciding whether the two images are for the same             such that a, b, c, d 2 {1, 2, . . . , n} and e, f, g, g 2
scene or not. For this purpose, SIFT image matching algorithm           {1, 2, . . . , m} and {a, e}, {b, f }, {c, g}, {d, h} 2 M .
is used.                                                                   We compute this factor for the matched keypoints found for
   The objective here is to find a similarity measure that              the pair of images. In some cases, there might be false matches
helps us to know that the masks are extracted from the same             which lead to some disparity in the values of the factor
geographic region despite of the applied transformation.                between the matching pairs. To remove this inconsistency, we
   First, we use SIFT image matching algorithm to detect the            remove all the matching pairs that give a factor which is far
interest points in both masks (having different transforma-             from the most frequent factor. Then we compute the ratio of
tions). Then, we compute the descriptors for each one of the            the number of the remaining matching pairs over the total
images in order to use them in the matching process. SIFT               number of good matches. We rely on this ratio as a similarity
algorithm provides us with the coordinates of the detected              measure between the two images.


                                                                                                                                                        17
   In other cases, this similarity factor can vary. In order to
have a threshold that can be used in any other case, we com-
pute the similarity factor for 408 pair of masks with different
sizes and different applied transformations. We computed the
average of the proportionality factor of the 408 pairs of masks
and we got 0.88685 as an average factor. But since we are
assuming that the pair of masks that we need to compare have
differences in buildings, we accept 0.7 as a threshold.

C. Change Detection
   After finding the corresponding masks, SIFT matching
algorithm is very efficient in detecting the type of the transfor-
mation applied to one of the images with respect to the other.
We differentiate between three main types of transformations:
masks that have overlapping regions, masks that are different
in scale and masks that are different in rotation angle. We will
explain in details how to detect each type of these transfor-
mations by applying simple mathematics on the information
provided by SIFT algorithm.
      a) Overlapping Regions: For this type of transforma-
tion, we use template matching algorithms. This algorithm
is available in OpenCV library for computer vision [20].
This algorithm proved to be very efficient in detecting the
overlapping regions between two images.
   After computing the similarity measure between the pair of
masks and checking that the views correspond to the same                              Fig. 5. Difference image before and after filtering.
scene, we try to find the overlapping regions between the pair
of masks. The bigger image is then cropped to be aligned with
its overlapping region. We apply template matching algorithm                have to find the angle between the lines that are formed by
to search for the small mask in the bigger one. Although there              respective matched points. So
are some differences in the buildings, the template matching                                        d(Pa , Pb )   d(Pe0 , Pf0 )
algorithm gives us an accurate result. Now, we have two                                       ✓=                ⇡                            (3)
                                                                                                    d(Pc , Pd )   d(Pg0 , Ph0 )
aligned pair of masks that are ready to detect changes between
them.                                                                          We also remove inconsistencies because of the presence of
      b) Scale Transformation: In this type of transformation,              false matches. Now, we have two aligned pair of masks that
we aim to find the scale factor between the pair of masks.                  are ready to detect changes between them.
When we get the scale ratio , we can transform both masks to                D. Difference Image
be in the same scale. The process now is very similar to the
                                                                               Whatever was the transformation applied to one of the
one performed in computing the similarity measure since the
                                                                            images, at this point we have two aligned images. All what
ratio of distances computed there was in fact the scale factor.
                                                                            we have to do is to find the difference image. Of course,
So
                                                                            the difference image will contain some noise because of the
                       d(Pa , Pb )    d(Pe0 , Pf0 )
                    =              ⇡                           (2)          differences in the resolution of the pair of masks, the fact that
                       d(Pc , Pd )    d(Pg0 , Ph0 )                         drives us to filter the difference image.
                                                                               Filtering the noise in the difference image consists of
for all a, b, c, d 2 {1, 2, . . . , n} and e, f, g, g 2 {1, 2, . . . , m}   finding the contours in it. Contours are curves joining all
and {a, e}, {b, f }, {c, g}, {d, h} 2 M .                                   the continuous points (along the boundary) having same color
   We also remove inconsistencies because of the presence of                or intensity. Contours are very helpful for shape detection or
false matches. Now, we have two aligned pair of masks that                  recognition. Since we are using binary images, we have more
are ready to detect changes between them.                                   chance to get a better accuracy. Finding such contours rely on
     c) View Point Transformation (Orientation: In this type                detecting Canny Edges [38].
of transformation, we aim to find the rotation angle between                   Fig. 5 represents the noisy difference image and the filtered
the pair of masks. When we get the rotation angle, we                       one.
can transform both masks to be in the same orientation. To                     The contours are projected finally onto one of the original
calculate the angle of rotation between the two masks, we                   images to show the differences clearly.


                                                                                                                                                   18
                                                                                                         TABLE I
                                                                                         D ESCRIPTION OF THE TRANSFORMED MASKS

                                                                                                           Description
                                                                                           A      Scale factor = 1.56, 4 changes
                                                                                           B     Rotation angle = 73°, 2 changes
                                                                                           C    Have overlapping region, 3 changes


                                                                                                       TABLE II
                                                                            R ESULTS OF SIFT ALGORITHM APPLIED ON DIFFERENT PAIRS OF MASKS


                                                                                    No. of keypoints          No. of keypoints       No. of matches
                                                                                  in the original mask   in the transformed mask
                                                                             A             261                      216                      169
                                                                             B             261                      779                      171
                                                                             C             261                      121                      91


                                                                            in order to guarantee that the results of the image segmentation
                                                                            do not affect our evaluation.
    Fig. 6. Evaluation metrics for each epoch of the training process.         In order to show the results of the whole workflow, refer
                                                                            to Fig. 2 that shows an aerial image and its corresponding
                                                                            mask. We suppose that this image is acquired now from an
                                                                            aircraft. We also suppose that we have a database containing
                                                                            old masks (extracted from old aerial images). The goal is to
                                                                            find the mask in the database that corresponds to the mask of
                                                                            this aerial image by computing the similarity measure between
                                                                            each pair of masks.
                                                                               After extracting the buildings’ footprints from the image,
                                                                            we manually apply different transformations to the mask in
                                                                            order to evaluate our procedure. Table 1 shows the description
                                                                            of the applied transformations. We also apply manually some
                                                                            changes between the masks.
                                                                               First, we apply SIFT algorithm to the original mask with
                                                                            each one of the applied mask. Table 2 represents the results
                                                                            of SIFT algorithm, and Fig. 8 shows the resulting matching
                                                                            points found by SIFT for each pair of masks.
                                                                               Now, we compute the similarity measure and the geometric
                                                                            parameters of the pair of masks to compare them with the
                                                                            ground truth shown in Table 1. The results are shown in Table
                                                                            3.
                                                                               As shown in the table, all the similarity measure for the
Fig. 7. Comparison between ground truth buildings masks and the predicted   transformed mask with respect to the original mask are greater
ones.                                                                       or equal to the threshold. As for scale factor, the difference
                                                                            between the computed scale factor and the real one for the four
                                                                            pairs of masks does not exceed 0.1. As well, the difference
                V. R ESULTS AND D ISCUSSIONS                                between the computed rotated angle and the real one does not
   Fig. 6 shows the evaluation metrics for both training and                exceed 0.1°. Now, the difference image is computed for each
evaluation sets during the training process. We pick up the
checkpoint of the epoch 66 since it has the best values for
the validation set. This epoch has the least loss value which                                         TABLE III
is 0.0491. At the same time this epoch has the highest mean                 S IMILARITY MEASURES AND GEOMETRIC PARAMETERS COMPUTED FOR
                                                                                                    THE PAIRS OF MASKS
intersection over union value which is 0.757.
   Fig. 7 shows a comparison between ground truth masks (on                            Similarity measure    Scale factor   Rotation angle
the left) an the predicted masks (on the right).                                  A         0.692308           1.53581         0.12881
                                                                                  B          0.88636           1.01337         73.0351
   For change detection, we used the accurate buildings ex-                       C           0.875            0.94348         -0.00921
tracted from OSM to evaluate the change detection procedure,


                                                                                                                                                      19
          Fig. 8. SIFT matches between each pair of masks.


pair of masks after aligning them. Fig. 8 shows the difference
image of each of the four pair of masks.
   The procedure was applied on a test set of 80 pairs of
aerial images with different characteristics and different ap-
plied transformations in order to evaluate our procedure. The
following histograms show the accuracy rate of the results of
the change detection as well as for the geometric parameters
for each type of transformation.
   It is clear from the obtained results that our procedure works
the best with the scale transformation as well as overlapping
regions. However, some errors were encountered with rotation
and mixed transformations. The results are expectable since
SIFT algorithm is designed to be robust with scale transfor-
mation.
   As an overall rate, our procedure gives 92.7% of true change
                                                                    Fig. 9. Difference image between each pair of masks.
detection for different types of transformations.
   The strengths of our procedure can be summarized by the
following points: (1) this procedure works with simple PNG
aerial images without any additional metadata, (2) if the shape


                                                                                                                           20
                          TABLE IV                                       Furthermore, points of interest other than buildings can be
  ACCURACY RATE OF THE RESULTS OF THE CHANGE DETECTION WITH           taken into consideration in the process of change detection.
              DIFFERENT TYPES OF TRANSFORMATIONS
                                                                      It can also include roads, vegetation and any other class of
                                Transformations                       objects that can be present in aerial images.
                   Scale   Orientation Overlapping      Mixed            Additionally, enhancing the segmentation model with a
    Accuracy (%)    100       85.5           99          86.3
                                                                      larger and more suitable dataset is essential in a further
                                                                      research due to its considerable and significant effect in the
                                                                      improvement of the overall results of the approach, since
of buildings in another region are different from the buildings       detecting changes rely truly on the extracted buildings’ foot-
in the training set, simply anyone can train his own dataset          prints.
and then use the same procedure to detect changes, (3) this
                                                                                                    R EFERENCES
procedure can be extended to point of interests other than
                                                                       [1] J. D. Kiser and D. P. Paine, Aerial Photography And Image Interpreta-
buildings, finally (4) this procedure is robust against different          tion, Canada: John Wiley Sons, Inc., 2012.
types of transformations.                                              [2] M. N. Favorskaya and L. C. Jain, Computer Vision in Control Systems,
   However two main limitations encounter this procedure                   Aerial and Satellite Image Processing, vol. 135, M. N. Favorskaya and
                                                                           J. C. Lakhmi, Eds., Canberra: Springer International Publishing, 2018.
which are (1) this procedure has expensive computation time            [3] N. Paparoditis, M. Jordan and J. P. Cocquerez, ”Building Detection
so it cannot act as real time application and (2) the final results        and Reconstruction from Mid- and High-Resolution Aerial Imagery,”
are always dependent to the accuracy of the segmentation                   Computer Vision And Image Understanding, vol. 72, pp. 122-142, 1997.
                                                                       [4] G. Wilhauck, ”Comparison of Object Oriented Classification Techniques
phase.                                                                     and Standard Image Analysis For the Use of Change Detection Between
                                                                           SPOT multispectral Satellite Images and Aerial Photos,” International
                       VI. C ONCLUSION                                     Archives of Photogtammetry and Remote Sensing, vol. XXXIII, 2000.
                                                                       [5] S. Nebiker, N. Lack and M. Deuber, ”Building change detection from
   Building change detection in aerial images that differ in               historical aerial photographs using dense image matching and object-
many geometric aspects such as scale and view point is a                   based image analysis,” Remote Sensing, vol. 6, pp. 8310-8336, Septem-
challenging research topic nowadays. A complete solution                   ber 2014.
                                                                       [6] L.-C. Chen and L.-J. Lin, ”Detection of building changes from aerial
for this problem is not yet found and developed. This work                 images and light detection and ranging (LIDAR) data,” Journal of
has presented a complete procedure to detect new and de-                   Applied Remote Sensing, vol. 4, no. 1, 2010.
molished buildings in two aerial images taken at different             [7] M. C. Alonso, J. A. Malpica, F. Papi, A. Arozarena and A. Martinez-
                                                                           Agirre, ”Change detection of buildings from satellite imagery and lidar
times. Our procedure worked in three steps. The first step                 data,” International Journal of Remote Sensing, vol. 34, no. 5, p. 1652,
which is extracting building footprints from original aerial               March 2013.
images was accomplished using a segmentation model. Using              [8] I. Tomljenovic, D. Tiede and T. Blaschke, ”A building extraction ap-
                                                                           proach for airborne laser scanner data utilizaing the object based image
machine learning, specifically convolutional neural network,               analysis paradigm,” International Journal of Applied Earth Observation
this model was built by training on a large number of aerial               and Geoinformation, vol. 52, pp. 137-148, October 2016.
images coupled with their buildings masks. The second step             [9] R. B. Irvin and D. M. McKeown, ”Methods for exploiting the rela-
                                                                           tionship between buildings and their shadows in aerial imagery,” IEEE
which is image correspondence was done by calculating a                    Transactions on Systems, Man and Cybernetics, vol. 19, no. 6, 1989.
similarity factor between each pair of images. At this point,         [10] Y. Wang, ”Automatic extraction of building outline from high resolution
the pair of images that represent the same geographic area                 aerial imagery,” The International Archives of the Photogrammetry,
                                                                           Remote Sensing and Spatial Information Sciences, Vols. XLI-B3, 2016.
is found. The last step, which is change detection, benefits          [11] M. Leena, J. Hyyppa and H. Kaartinen, ”Automatic detection of changes
from image matching algorithms in particular SIFT algorithm.               from laser scanner and aerial image data for updating buildings map,”
This algorithm is applied to align the pair of the images and              Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci., vol. 35, pp. 434-
                                                                           439, July 2004.
then compute their difference in order to detect the changed          [12] M. C. A. Turker and B. Cetinkaya, ”Automatic detection of earthquake-
buildings. This procedure showed a change detection rate of                damaged buildings using DEMs created from pre- and post-earthquake
92.7% for different types of transformations.                              stereo aerial photographs,” International Journal of Remote Sensing, vol.
                                                                           26, no. 4, pp. 823-832, 16 August 2006.
                                                                      [13] F. Jung, ”Detecting building changes from multitemporal aerial stere-
          VII. C HALLENGES AND F UTURE W ORK                               opairs,” ISPRS Journal of Photogrammetry and Remote Sensing, vol.
   A big challenge is faced in our approach which is inability to          58, no. 3-4, pp. 187-201, January 2004.
                                                                      [14] F. Rottensteiner, S. Clode, J. C. Trinder and K. Kubik, ”Fusing airborne
be implemented as a real time system. The image segmentation               laser scanner data and aerial imagery for the automatic extraction
phase as well as searching in a database for the mask that                 of buildings in densely built-up areas,” International Archives of the
corresponds with the input image are computationally expen-                Photogrammetry, Remote Sensing and Spatial Information Sciences, vol.
                                                                           35, 2006.
sive, although building the model and preparing the dataset are       [15] L. Y. Chen, T.-A. Teo, Y.-C. Shao and Y.-C. Lai, ”Fusion of LIDAR
carried out only once. Further studies must be accomplished                data and optical imagery for building modeling,” International Archives
in order to find suitable solutions for this critical issue.               of Photogrammetry, Remote Sensing and Spatial Information Sciences,
                                                                           vol. 35, 2004.
   Moreover, future work can have a more specific design for          [16] S. Saito and Y. Aoki, ”Building and road detection from large aerial
the experiment. The overall findings that emerged from our                 imagery,” in Image Processing: Machine Vision Applications VIII, San
experiments gave us some promising directions to follow for                Francisco, 2015.
                                                                      [17] N. Bourdis, M. Denis and H. Sahbi, ”Constrained optical flow for aerial
building an optimal, operative, complete and automatic system              image change detection,” in 2011 IEEE International Geoscience and
in the future.                                                             Remote Sensing Symposium (IGARSS), Vancouver, Canada, 2011.


                                                                                                                                                       21
[18] ”Arcgis.om,” 2019. [Online]. Available: https://pro.arcgis.com/en/pro-
     app/tool-reference/data-management/detect-feature-changes.htm.
[19] S. Yuheng and Y. Hao, ”Image Segmentation Algorithms Overview,”
     2017.
[20] G. Bradski and A. Kaehler, Learning OpenCV, United States: O’Reilly
     Media, Inc., December 2016.
[21] N. R. Pal and S. K. Pal, ”A Review on Image Segmentation Techniques,”
     Pattern Recognition, vol. 26, no. 9, pp. 1277-1294, September 1993.
[22] K. S. Fu and J. K. Mui, ”A Survey on Image Segmentation,” Pattern
     Recoginition, vol. 13, no. 1, pp. 3-16, 1981.
[23] P.-g. Ho, Ed., Image Segmentation, 2011.
[24] A. P. Dhawan, ”Image Segmentation,” in Medical Image Analysis,
     Wiley-IEEE Press, 2011, pp. 229-264.
[25] ”fizyr/keras-retinanet,”         2019.       [Online].         Available:
     https://github.com/fizyr/keras-retinanet.
[26] W. Abdullah, ”Mask R-CNN for object detection and instance segmen-
     tation on Keras and TensorFlow,” GitHub Repository, 2017.
[27] ”ENVI - The Leading Geospatial Analytics Software,” 2019.
     [Online].      Available:     https://www.harrisgeospatial.com/Software-
     Technology/ENVI.
[28] M. J. Canty, Image Analysis, Classification and Change Detection in
     Remote Sensing, New York: Taylor Francis Group, 2014.
[29] V. Srivastava, ”Evaluation of various segmentation tools for extraction
     of urban features using high resolution remote sensing data,” The
     International Archives of the Photogrammetry, Remote Sensing and
     Spatial Information Sciences, vol. 34, no. XXX.
[30] J. Bauer, N. Sunderhauf and P. Protzel, ”Comapring Several Implemen-
     tations of Two Recently Published Feature Detectors,” in International
     Conference on Intelligent and Autonomous Systems (ICAS), Toulouse,
     France, 2007.
[31] P. M. Panchal, S. R. Panchal and S. K. Shah, ”A Comparison of SIFT
     and SURF,” International Journal of Innovative Research in Computer
     and Communication Engineering, vol. 1, no. 2, April 2013.
[32] U. M. Babri, M. Tnavir and K. Khurshid, ”Feature Based Corre-
     pondence: A Comparative Study on Image Matching Algorithms,”
     International Journal of Advamced Computer Science and Applications
     (IJACSA), vol. 7, no. 3, 2016.
[33] E. Karami, S. Prasad and M. Shehata, ”Image Matching Using SIFT,
     SURF, BRIEF and ORB: Performance Comparison for Distorted Im-
     ages,” in Newfoundland Electrical and Computer Engineering Confer-
     ence, Canada, 2015.
[34] ”Github - RoboSat,” Mapbox, 2018. [Online]. Available:
     https://github.com/mapbox/robosat.
[35] ”OpenStreetMap OSM,” OpenStreetMap Foundation OSMF, 2010. [On-
     line]. Available: www/openstreetmap.org.
[36] ”GeoFabrik,” OpenStreetMap, 2018. [Online]. Available: geofabrik.de.
[37] ”Mapbox,” Mapbox, 2010. [Online]. Available: www.mapbox.com.
[38] J. Canny, ”A Computational Approach to Edge Detection,” in Readings
     in Computer Vision, M. A. Fischler and O. Firschein, Eds., Elsevier,
     1987, pp. 184-203.


                                                                                 22