Classifying land cover from satellite images using time series
                           analytics
                 Patrick Schäfer                                      Dirk Pflugmacher                                       Ulf Leser
       Computer Science Department                                     Patrick Hostert                          Computer Science Department
       Humboldt-Universität zu Berlin                            Geography Department                           Humboldt-Universität zu Berlin
                  Germany                                    Humboldt-Universität zu Berlin                     leser@informatik.hu-berlin.de
       patrick.schaefer@hu-berlin.de                        [dirk.pflugmacher,patrick.hostert]
                                                                    @geo.hu-berlin.de

ABSTRACT                                                                                 time, e.g. because of vegetation senescence or harvesting [27].
The Earth’s surface is continuously observed by satellites, lead-                        The task can be approached in different ways. In a typical base-
ing to large multi-spectral image data sets of increasing spatial                        line setting, the different measurements per pixel are used as
resolution and temporal density. One important application of                            independent features for a classical machine learning-based clas-
satellite data is the mapping of land cover and land use changes                         sifier, such as Naive Bayes or Decision Trees [20]. In this ap-
such as urbanization, deforestation, and desertification. This in-                       proach, the temporal order of the measurements is ignored as
formation should be obtained automatically, with high accuracy,                          all features are treated as orthogonal dimensions of the feature
and at the pixel level, which implies the need to classify millions                      space. An alternative method is to include the consideration of
of pixels even when only small regions are studied. Balancing                            the order of measurements by using methods from time series
runtime and accuracy for this task becomes even more challeng-                           analytics [2, 24]. Here, every pixel is considered as a temporally
ing with the recent availability of multiple time points per pixel,                      ordered (and aligned) series of measurements, and the specific
created by periodically performed satellite scans. In this paper                         changes (increasing or decreasing slopes, periodic changes etc.)
we describe a novel approach to classify land cover from series of                       of the measurements over time are analyzed to find common-
multi-spectral satellite images based on multivariate time series                        alities and to derive classification models. Previous works (see
analytics. The main advantage of our method is that it inherently                        Section 3) have shown that this can be advantageous as land cover
models the periodic changes (seasons, agriculture etc.) underly-                         are temporally variable and often follow characteristic temporal
ing many types of land covers and that it is comparably robust to                        patterns, such as those imposed by seasons. However, given the
noise. Compared to a classical feature-based classifier, our new                         enormous scale of the data to be classified, not only the accuracy
method shows a slightly superior overall accuracy, with an in-                           of an approach is important, but also the runtime performance
crease of up to 20% in accuracy for rare land cover classes, though                      has a critical role in any practical application.
at the cost of notably increased runtime. The highest accuracy is                            In this work, we evaluate the recently proposed multivariate
achieved by combining both approaches.                                                   time series classification algorithm WEASEL+MUSE [26] for land
                                                                                         cover classification using temporally dense, medium-resolution
                                                                                         satellite images. WEASEL+MUSE models multivariate time series
1    INTRODUCTION                                                                        using the truncated Fourier transformation and discretizes mea-
Monitoring changes in land usage is an important area of research                        surements, both of which to reduce noise, builds a rich feature
as land cover is a key variable driving the Earth’s energy balance,                      space to capture also subsequences in the time series, is able to
hydrological and carbon cycle, and the provisioning of natural                           exploit similar temporal subsequences even when appearing at
resources and habitat [22]. Over the last three decades, satellite-                      very different offsets within a time series, and uses aggressive
based Earth Observation (EO) programs have made tremendous                               feature selection to remove irrelevant features and thereby speed-
progress in acquiring medium-resolution (10 − 100m) images                               up classification. Although WEASEL+MUSE was not developed
around the globe systematically and with increasing frequency                            specifically for land cover classification, many of its aspects fit
(revisit time). As a result, large volumes of medium-resolution                          nicely to the specificities of this domain, such as the inherent
satellite images are now available free of charge, enabling au-                          noise reduction and the exploitation of repetitive behaviour.
tomatic approaches to the identification of land usage and the                               We compare the prediction performance of WEASEL+MUSE
detection of land surface changes over large areas. For example,                         with the performance of an established and popular machine
the American Landsat 8 sensor images the Earth at 30-m spatial                           learning-based approach, Random Forests, using the same input
resolution every 16 days, and two European Sentinel-2 sensors ac-                        features on 23 Landsat 8 images collected in 16 day-intervals over
quire images with a revisit time of 5 days and a spatial resolution                      Reunion Island. The study region covers an area of 2866x2633
of 10 − 20 meters, which amounts to roughly 60 measurements                              pixels at 30~m spatial resolution. As reference dataset for model
for more than 300 Billions pixels (excluding oceans) in a year.                          training and validation, we used a sample of 81714 pixels that
   The free availability of medium-resolution satellite image time                       had been manually classified into 9 land cover classes.
series has spawned new possibilities for mapping land cover [9].                             Our results indicate that time-series-based algorithms improve
In the past, classification approaches operated on single images or                      land cover classification accuracy compared to non-temporal al-
stacks of images (i.e. composite classification). Multi-date classifi-                   gorithms. Our time series algorithm WEASEL+MUSE achieved
cation approaches exploit the notion that land cover can vary over                       higher classification accuracies than Random Forests. The im-
                                                                                         provement in accuracy was most notable with rare and/or difficult
© 2018 Copyright held by the owner/author(s). Published in the Workshop                  classes. Here, class-wise accuracies increased by 8 and 3 percent-
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted
                                                                                         age points, respectively. Overall accuracy improved by 1%-point
under the terms of the Creative Commons license CC-by-nc-nd 4.0.                         owing to the fact that the dominant classes were less affected by


                                                                                    10
the choice of algorithm. Interestingly, Random Forests captures              applications and reference data. However, there has been a trend
different signals in the data than the time-series-based approach,           away from parametric statistical models to machine learning to
as a simple Ensemble of both approaches further improved clas-               deal with the complexity of input data and class legends.
sification accuracy. However, this increased accuracy comes at                   It has only recently become feasible to build land cover map-
the cost of an increased runtime. Thus, we will focus future work            ping algorithms that exploit the temporal domain of entire pixel
on improving the runtime of our method without loosing the                   time series with medium resolution data [12]. To this end, satel-
advancements in prediction performance we observed.                          lite images typically first undergo a series of pre-processing steps,
    The rest of this paper is organized as follows: Section 2 pro-           including the correction of atmospheric effects, geometric align-
vides background on land cover classification using satellite im-            ment and cloud and cloud-shadow masking. Once these steps
ages. Section 3 presents the current state of the art in land cover          are finished, spectral values of pixels can be traced over time to
mapping and time series classification. Section 4 describes the              identify and detect land surface changes such as deforestation [8],
test dataset and the tested classification methods: Random Forests           or urbanization [19].
and WEASEL+MUSE. Section 5 presents the experiments.
                                                                             3   RELATED WORK
2   LAND COVER CLASSIFICATION FROM                                           Although, time-series based classification (TSC) is a relatively
    SATELLITE IMAGES                                                         new area in remote sensing, the topic itself has a long tradi-
                                                                             tion and dozens of approaches exist (see [2, 24], for instance).
The classification of satellite images to extract information on
                                                                             Time-series-based classifiers can broadly be categorized into two
land cover and land use has a long history. A significant turning
                                                                             classes: Similarity-based methods use a similarity measure over
point in terrestrial Earth Observation was the launch of Landsat-1
                                                                             sequences, such as Dynamic Time Warping (DTW), to perform a
in 1972 (then called Earth Resource Technology Satellite). For the
                                                                             point-wise comparison of two time series. In contrast, feature-
first time, Landsat delivered systematic observations of the Earth
                                                                             based TSC rely on comparing features extracted from the different
land surface for land monitoring, leading to new ways of machine-
                                                                             time series, typically generated from their substructures.
assisted approaches for mapping land cover from space [7]. In the
                                                                                These two approaches are also the basis for methods in multi-
1990s and early 2000s, international and national agencies started
                                                                             variate time series classification (MTSC), where a time series is
to adopt operational land cover mapping programs with Landsat
                                                                             not made of a single stream of values but by multiple streams,
and Landsat-like data, e.g. in Europe (CORINE), USA (NLCD),
                                                                             each one usually synchronized in time. A popular similarity-
Canada (EOSD), and Australia (NCAS-LCCP). Even decades later,
                                                                             based MTSC method is dynamic time warping (DTW) [21].
the Landsat program is still active today. Since February 2013,
                                                                             Feature-based MTSC can be grouped into those methods that
Landsat 8 is taking images of the Earth at 30 m spatial resolution
                                                                             build feature from so-called shapelets [30], which are short and
in 8 spectral bands every 16 days. The radiometric quality and
                                                                             maximally discriminative subsequences of the time series, and
spectral resolution has greatly improved since the early satellites,
                                                                             methods using the bag-of-patterns (BOP) approach [3]. The stan-
and because of new global acquisition strategies and on-board
                                                                             dard BOP model [17] break up a time series into windows, rep-
storage and download capabilities, so has the sheer number of
                                                                             resent the corresponding subsequences within each window as
available images. The number of medium spatial resolution (10 −
                                                                             discrete features (or words), and finally derives a classifier from
100 m) sensors has been increasing [4], thus dense time series of
                                                                             the frequencies of the words. Different approaches to TSC (and
medium resolution are available for many parts of the globe.
                                                                             MTSC) differ not only in their accuracy on different data sets, but
    Multi-spectral sensors like Landsat record the sun’s energy
                                                                             also in their runtime for classification, which is a critical issue
reflected by a surface in a few distinct spectral wavelengths
                                                                             when it comes to very large data sets as is the case of satellite
(bands), e.g. blue, green, red in the visible spectrum (400 nm to
                                                                             images. For instance, we recently evaluated the runtime of 12
700 nm), near infrared (700 to 1100 nm), and short-wave infrared
                                                                             different state-of-the-art methods for (univariate) TSC and found
(1100 to 3000 nm). Since land surfaces with different chemical and
                                                                             differences of up to three orders of magnitude [24, 25].
structural properties often absorb and reflect sunlight differently
                                                                                Most approaches to land cover classification rely on traditional
and wavelength-dependent, information on land cover can be
                                                                             machine-learning methods (see previous section), and there have
derived from these spectral bands. For example, water absorbs
                                                                             been only a few prior studies on using time series information. For
much of the near-infrared radiation, so these wavelengths are
                                                                             example, [32] fitted harmonic functions to each satellite band and
useful for discerning land-water boundaries that are not obvious
                                                                             used the fitted parameters as features in subsequent classification
in visible light. Similarly, green vegetation absorbs much of the
                                                                             (which implies that it falls into the class of feature-based MTSC
incoming radiation in the red spectrum while reflecting about
                                                                             methods). [14] found that including temporal information into a
50% of the radiation in the near-infrared spectrum.
                                                                             model can have a bigger impact on classification accuracy than
    Much of the past research on classification algorithms has fo-
                                                                             the choice of the particular classification algorithm. New time-
cused on exploiting the spectral and spatial properties of land cov-
                                                                             series-based algorithms are needed to leverage the predictive
ers, including artificial neural networks [1], decision trees [11],
                                                                             potential of satellite time series images [6].
support vector machines [15], and spatial segmentation algo-
rithms [10]. Each algorithm has its strength and weakness with
                                                                             4 METHODS
respect to: the distributional assumptions made about the data,
training requirements, computational complexity, and robust-                 4.1 Description of Data
ness to overfitting, data noise, and errors in training data. Also           For our analysis, we used a public dataset taken from the TiSeLaC
common to all algorithms is that the work is supervised, i.e., they          (Time Series Land Cover Classification Challenge) [28]. This
need an independent reference data set (i.e., land cover infor-              dataset consists of time series of 23 Landsat 8 images collected in
mation collected in the field or from air-photos) for training a             16 day-intervals over Reunion Island in 2014. Landsat data were
model. It is fair to say, that no single algorithm works best for all        provided by the French Pôle Thématique Surfaces Continentales


                                                                        11
       1.0                                          Urban Areas                    4.2    Time Series Analytics
                                                    Other built-up surfaces
       0.8
                                                    Forests                        In general, a time series dataset contains N time series. Each time
                                                    Sparse Vegetation              series is associated with a class label y from a predefined set of
                                                    Rocks and bare soil
       0.6                                          Grassland                      labels Y. Time series classification (TSC) is the task of predicting a
                                                    Sugarcane crops                class label for a time series whose label is unknown. A classifier
       0.4                                          Other crops                    is a function that is learned from a set of labelled time series
                                                    Water
NDVI


                                                                                   (the training data), takes an unlabelled time series as input and
       0.2
                                                                                   outputs a label. In this paper, each pixel should be labeled by one
       0.0                                                                         of the nine reference land cover classes.
                                                                                      The 23 Landsat 8 images contain 10 time series of spectral
       0.2                                                                         features each of length 23. At the pixel level, this data can be
                                                                                   interpreted as a multivariate time series (MTS).
       0.4
             0       5        10        15    20                                      A multivariate time series (MTS) T = {t 1 , . . . , tn } is an ordered
                           time steps                                              sequence of n ∈ N streams ti = (ti,1 , . . . ti,m ) ∈ Rm , i.e., m
                                                                                   recorded values at each fixed time stamp. As we address MTS
 Figure 1: Normalized Difference Vegetation Index (NDVI)                           generated from satellites with a fixed revisit time, we can safely
 of 9 different land cover classes for our reference                               ignore the concrete time stamps. MTS are typically produced by
 dataset [28].                                                                     sensors recording data over time like motion captures, gestures,
                                                                                   EEG signals, hand-written letters, sign language, or the multi-
                                                                                   spectral image data sets captured by satellites.
                  ID           Land cover class    Samples
            1        Urban Areas         16000                                     4.3    Machine Learning Approach using
            2 Other built-up surfaces     3236                                            Random Forests
            3           Forests          16000
                                                                                   Random forests (RF) are an ensemble learning method widely used
            4     Sparse Vegetation      16000
                                                                                   in Earth Observation (EO) for classifying land cover and land use
            5     Rocks and bare soil    12942
                                                                                   (e.g. [31]). RF build ensembles of decision trees wherein each tree
            6         Grassland           5681
                                                                                   is trained on randomly selected features of a bootstrapped train-
            7      Sugarcane crops        7656
                                                                                   ing sample. Node splits are performed using a random subset of
            8        Other crops          1600                                     predictor variables. Because of these random components, the
            9           Water             2599                                     RF approach does not require tree pruning and is relatively in-
 Table 1: Land cover classes and distribution of reference                         sensitive to overfitting [5]. To predict a class label, a RF classifies
 samples [28].                                                                     the sample with all its decision trees and returns the mode of the
                                                                                   predicted classes. A RF approach to satellite image / pixel classi-
                                                                                   fication uses the concatenated spectral features of each pixel as
                                                                                   input (this is a vector of length 230 and the 2 coordinates, lon-
 (THEIA), and atmospherically corrected, geometrically corrected,                  gitude and latitude). RF are a typical machine learning method
 and cloud-masked with the Multi-sensor Atmospheric Correction                     which do not consider any order of the features. If time series
 and Cloud Screening (MACCS) level 2A processor developed at the                   are used as features, each point in time is considered as an in-
 French National Space Agency (CNES). Data pre-processing and                      dependent feature and the order of measurements is ignored.
 temporal gap filling was performed using the iota21 Land Cover                    This implies that such methods are unable to reproduce serial
 processor developed by CESBIO2 . For each time step and pixel,                    correlations or to detect temporal trends in the data.
 ten spectral features were extracted, i.e., the seven reflectance
 bands and three vegetation indices: the Normalized Difference                     4.4    Land Cover Classification with
 Vegetation Index (NDVI), the Normalized Difference Water Index                           WEASEL+MUSE
 (NDWI), and the Brightness Index (BI). Figure 1 shows examples                    WEASEL+MUSE (Word ExtrAction for time SEries cLassification +
 of NDVI time series for different land cover classes.                             MUltivariate Symbols and dErivatives) [26] is a state-of-the-art
    Reference land cover data were derived from two publicly                       MTS classifier that is composed of the building blocks depicted
 available dataset: the 2012 CORINE Land Cover (CLC) map and                       in Figure 2. It conceptually builds upon the univariate Bag-of-
 the 2014 farmers’ graphical land parcel registration (Régistre                    Patterns model applied to each dimension. In the BOP model [23],
 Parcellaire Graphique - RPG). The most significant classes for                    a time series is characterized by the frequency of occurrence of
 the study area were retained, and a spatial processing (aided by                  substructures. BOP-based algorithms build a classification model
 photo-interpretation) was performed to ensure consistency with                    by (1) extracting subsequences from a time series, (2) transform-
 image geometry. Finally, a pixel-based random sampling of this                    ing each subsequence (of real values, the measurements) into a
 dataset was applied to provide an almost balanced ground truth.                   discrete-valued word (a sequence of symbols over a fixed alpha-
 The final reference dataset consisted of a total of 81714 pixels                  bet), (3) building a feature vector from word counts (histogram),
 distributed over 9 classes (Table1). We split this reference dataset              and (4) finally using a classification model from the machine
 randomly in half for model training and testing.                                  learning repertoire on these feature vectors.
                                                                                      Specifically, WEASEL+MUSE treats the pixel time series of the
 1 http://tully.ups-tlse.fr/jordi/iota2.git                                        10 spectral features as a MTS with 10 dimensions. For each spec-
 2 http://www.cesbio.ups-tlse.fr                                                   tral feature, subsequences using varying lengths are extracted,


                                                                              12
      Figure 2: WEASEL+MUSE Pipeline: Feature extraction, univariate BOP models and the multivariate BOP model.


approximated using the truncated Fourier transform, and dis-                  e.g. 49% vs 51%. To combine the two sets of class probabilities
cretized into words using equi-depth or equi-frequency binning.               (2x9 class probabilities), we trained a RF model using the 18
A feature vector is built from the words (unigrams) and pairs                 class probabilities from the training dataset as predictors and the
of words (bigrams) to obtain order awareness. Finally, features               corresponding land cover class labels as response.
are concatenated with the sensor id, to maintain a disjoint word
space for each dimension. This high dimensional feature space                 5 EXPERIMENTS
is subsequently filtered using statistical feature selection (Chi-
squared test); finally, a logistic regression classifier is trained,          5.1 Experimental setup
assigning weights to characteristic word in each spectral band.               Datasets: We evaluated our competitors using the described
    Because WEASEL+MUSE is multivariate, the algorithm can                    Landsat dataset. The dataset was randomly split into 40857 train-
leverage the multi-spectral information of satellite time series.             ing and 40857 test samples. The training dataset was used to
This is an advantage over univariate time series models that                  train each classifier. All reported accuracies are based on the test
operate on single indices (i.e. vegetation indices), as spectral-             dataset.
temporal patterns may differ from sensor band to sensor band.                    Competitors: We performed a series of experiments:
    The feature extraction and selection in WEASEL+MUSE make                        • Build RF on the training dataset and apply it to test dataset.
it interesting for land cover recognition:                                          • Build WEASEL+MUSE on the training dataset and apply
      • Features extraction: The words are derived from sub-                          it to test dataset.
        sequences extracted at multiple window lengths in each                      • Build the combined model on the RF- and MUSE-predicted
        spectral feature using the truncated Fourier transform                        class probabilities on the training dataset and apply it to
        and discretization. The Fourier transform reduces noise                       the test dataset.
        introduced by preprocessing, such as the cloud mask or                   Training: For WEASEL+MUSE we performed 10-fold cross-
        geometric alignment, and the different window lengths                 validation on the training dataset to optimize the parameters
        capture the seasonal trends at different time granularity. Bi-        for the SFA quantization method (equi-depth or equi-frequency
        grams can capture seasonal trends, e.g., higher intensities           binning). To perform the RF analysis, we used the R statistical
        in the spectrum in summer than in autumn. By extracting               language and the RF package from [16]. We set the algorithm to
        subsequences from the pixel stream, the classifier allows                                                      √
                                                                              build 1000 trees and randomly sampled p variables as candidates
        for small shifts in the time line, e.g. a delayed bloom of            at each split (where p = 232, and p is the total number of predictor
        crops in some regions.                                                variables).
      • Feature selection: The wide range of words consid-
        ered also introduces many irrelevant features. Therefore,
                                                                              5.2     Results
        WEASEL+MUSE applies statistical feature selection to re-
        move irrelevant words from each class. These may be a                 Table 2 presents the class-wise accuracies and overall accuracy
        result of erroneous information introduced by the image               (weighted and simple average F1-scores) on the test samples for
        capture or preprocessing.                                             each of the three methods. Overall, the combined model had
                                                                              the highest F1-score of 91.1% followed by WEASEL+MUSE with
The resulting feature vector is highly discriminative and contains            89.6% and RF with 89.0% (weighted averages). Thus, the choice of
words that are characteristic for each class, which allows the use            the classifier had a relatively small effect on the overall accuracy.
of fast logistic regression classifier. To perform our analysis, we           However, classes with high F1-scores were also better represented
used the JAVA implementation available from [29].                             in the training and test samples. The effect on overall accuracy
                                                                              was therefore higher when sample sizes were ignored.
4.5     Ensemble Approach                                                        When looking at the results in detail, all classifiers showed
To understand whether there is value in combining the time-                   high F1-scores of about 90% for all but two classes: “other built-
series-based approach and the feature-based approach, we                      up” and “other crops”. For these classes, the accuracy was only
build and tested a third model based on the ensemble of                       ~60% and ~50%, respectively. While RF showed a high precision,
WEASEL+MUSE and RF. Both approaches output class probabili-                   WEASEL+MUSE had a higher recall and F1-score, indicating that
ties for each pixel and select the class with the highest probability.        the temporal profile improved the detection of challenging classes.
Pixels belonging to a unique spectral class may be associated with            However, the largest improvement in accuracy was obtained from
high class probabilities, whereas pixels with less distinct class             combining the two classifiers, which further improved the F1-
membership may have more equally distributed probabilities,                   score by up to 20 percentage points for the “other crops” class.


                                                                         13
                                 Random Forests (RF)                              WEASEL+MUSE                                     Combined
        Land cover      Precision Recall F1-score Precision Recall F1-score Precision Recall F1-score
           Urban          86.7%     91.1%     88.9%      87.6%      91.6%      89.6%   88.9%    91.6%    90.2%
      Other built-up      75.0%     52.6%     61.9%      73.8%      57.7%      64.8%   71.9%    61.6%    66.3%
           Forests        86.1%     93.1%     89.4%      87.9%      91.7%      89.8%   90.2%    92.5%    91.3%
    Sparse Vegetation     92.3%     93.5%     92.9%      90.5%      93.3%      91.8%   93.9%    94.6%    94.2%
           Barren         96.2%     95.3%     95.7%      96.0%      92.6%      94.3%   96.7%    95.9%    96.3%
         Grassland        83.1%     84.5%     83.8%      85.1%      86.5%      85.8%   87.1%    89.1%    88.1%
         Sugarcane        94.1%     94.1%     94.1%      94.1%      94.9%      94.5%   94.7%    95.3%    95.0%
       Other crops        95.2%     34.5%     50.6%      81.6%      45.3%     58.3%   85.3%     58.2%   69.2%
           Water          90.0%     78.1%     83.6%      91.3%      84.0%      87.5%   91.0%    86.6%    88.7%
    Weight. Average       89.4%     89.4%     89.0%      89.5%      89.6%     89.6%   91.1%     91.2%   91.1%
          Average         88.7%     79.6%     82.3%      87.5%      82.0%     84.0%   88.9%     85.1%   86.6%
Table 2: Class-wise accuracy and overall accuracy based on test sample for RF, WEASEL+MUSE, and our combined model.


                                                                   Vegetation .


                                                                                                              Sugarcane
                                                                                                  Grassland


                                                                                                                                           Accuracy
                                                                     Sparse
                                             built-up


                                                                                                                                            Users
                                                         Forests
                                              Other


                                                                                       Barren
                                    Urban


                                                                                                                                   Water
                                                                                                                          Other
                                                                                                                          crops
                Urban               7274      544     118      59      24   79      90        142                                   59     86.7%
            Other built-up           163      840     12       41      4     8      16         5                                    31     75.0%
                Forests              273       60    7448     146      30  282      73        237                                  106     86.1%
           Sparse Vegetation          50       33     195    7494     226   33       8         9                                    72     92.3%
                Barren                5         7     15      209    6164    2       0         3                                     4     96.2%
               Grassland             123       60     137      48      2   2382     38         75                                    2     83.1%
              Sugarcane               69       20     50        0      0    32     3669        52                                    6     94.1%
              Other crops             1         1     11        0      0     0       1        276                                    0     95.2%
                Water                 26       31     14       17      18    1       2         2                                   998     90.0%
          Producers Accuracy       91.1%     52.6%  93.1%    93.5%  95.3% 84.5% 94.2%        34.5%                                78.1%
                                   Table 3: Confusion matrix from Random Forests classification.


   The confusion matrix for the RF (Table 3) gives a detailed                      125 between the values of the same symbol. This noise reduc-
picture. “other build-up” is often confused with “urban”, and                      tion is useful for applications like gesture recognition [25], but it
“other crops” is often confused with the “other forests”, “urban”                  seems to be too aggressive here.
or “grassland”. This might be a result of an under-representation
of these land cover classes in the dataset, as these two classes are               5.3          Sensitivity Analysis
the ones with the lowest number of instances (Table 1). On aver-                   To better judge the performance of the two classification methods,
age WEASEL+MUSE required 5.8 ms for a single pixel prediction,                     we tested their sensitivity regarding the weighted F1-score to
as a result of the feature extraction and selection phases prior to                varying sizes of the training sample (Figure 3). Specifically, it
classification. The RF took 2.7 ms on average per pixel for classifi-              was unclear whether the superior performance of the time series
cation. For the 7.5 Mio pixels of the study area, this translates into             classifier could be replicated with small sample sizes. Starting
a total, single-CPU runtime of about 5.4 hours for RF compared to                  with a random sample of 10% of the original training sample, we
12.2 hours with WEASEL+MUSE. WEASEL+MUSE obtained very                             iteratively increased the training sample size up to 100% , and
promising accuracy for many classes. However we observed a                         tested all models using the test samples. The results show that
limitation of WEASEL+MUSE and all Bag-of-Pattern approaches                        WEASEL+MUSE scored consistently higher than RF across all
when applied in the context of land cover classification, namely                   sample sizes. The absolute difference was constant, indicating
the discretization step introduced for noise reduction and to ob-                  that both approaches were equally sensitive to the sample size.
tain words from real-valued sequences. For discretization, the
value range is divided into bins, and each one is associated with                  6       CONCLUSION
a label. However, only a limited number of symbols, typically                      Our objective was to test a time-series-based classification ap-
between 4 to 8, can be used to discretize the value range with-                    proach (WEASEL+MUSE) to earth observation time series for land
out negatively impacting accuracy [17]. For the Landsat data                       cover classification and conventional machine learning approach
the spectral range is between 0 and 1000 and the absolute dif-                     (Random Forests). We reported results of a series of experiments
ference between pixels is important for classification. However,                   using 23 Landsat 8 images collected in 16 day-intervals over Re-
after discretization using 8 symbols (i.e., a = [000 − 125] and                    union Island. The reference dataset consisted of a total of 81714
b = [151 − 250], . . . ) there can be a difference between 0 up to                 pixels distributed over 9 classes. Our key finding is that the use
                                                                                   of temporal information improved the classification accuracy,


                                                                           14
                                                                                               satellites. ISPRS Journal of Photogrammetry and Remote Sensing 103 (2015),
                                                                                               115–128.
                                                                                           [5] L. Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
                                                                                           [6] L. Bruzzone and B. Demir. 2014. A review of modern approaches to classifica-
                                                                                               tion of remote sensing data. In Land Use and Land Cover Mapping in Europe.
                                                                                               Springer, 127–143.
                                                                                           [7] J.R.W. Ellefsen, L. Gaydos. 1974. Computer Aided Mapping of Land Use ERTS
                                                                                               Multispectral Scanner Data. First Pan American Congress on Photogrammetry.
                                                                                           [8] K. Grogan, D. Pflugmacher, P. Hostert, R. Kennedy, and R. Fensholt. 2015. Cross-
                                                                                               border forest disturbance and the role of natural rubber in mainland Southeast
                                                                                               Asia using annual Landsat time series. Remote Sensing of Environment 169
                                                                                               (2015), 438–453.
                                                                                           [9] M. Hansen and T. Loveland. 2012. A review of large area monitoring of land
                                                                                               cover change using Landsat data. Remote sensing of Environment 122 (2012),
                                                                                               66–74.
                                                                                          [10] G. Hay, G. Castilla, M. Wulder, and J. Ruiz. 2005. An automated object-based
                                                                                               approach for the multiscale image segmentation of forest scenes. International
                                                                                               Journal of Applied Earth Observation and Geoinformation 7, 4 (2005), 339–359.
                                                                                          [11] C. Homer, C. Huang, L. Yang, B. Wylie, and M. Coan. 2004. Development of
                                                                                               a 2001 national land-cover database for the United States. Photogrammetric
                                                                                               Engineering &amp; Remote Sensing 70, 7 (2004), 829–840.
Figure 3: Overall accuracy of WEASEL+MUSE and Ran-                                        [12] P. Hostert, P. Griffiths, S. van der Linden, and D. Pflugmacher. 2015. Time
                                                                                               series analyses in a new era of optical satellite data. In Remote Sensing Time
dom Forests (RF) classification with varying train sample                                      Series. Springer, 25–41.
sizes, i.e., 10%-100% of the training samples to build the                                [13] B. Jakimow, P. Griffiths, S. van der Linden, and P. Hostert. 2017. Mapping
model and 100% to estimate accuracy.                                                           pasture management in the Brazilian Amazon from dense Landsat time series.
                                                                                               Remote Sensing of Environment (2017).
                                                                                          [14] K. Jia, X. Wei, X. Gu, Y. Yao, X. Xie, and B. Li. 2014. Land cover classification
                                                                                               using Landsat 8 operational land imager data in Beijing, China. Geocarto
                                                                                               International 29, 8 (2014), 941–951.
but not for all land cover classes. The improvement in accuracy                           [15] T. Kuemmerle, O. Chaskovskyy, J. Knorn, V. Radeloff, I. Kruhlov, W. Keeton,
was highest for rare and difficult classes. For these, a combined                              and P. Hostert. 2009. Forest cover change and illegal logging in the Ukrainian
classifier was able to improve the F1-score even further. We used                              Carpathians in the transition period from 1988 to 2007. Remote Sensing of
                                                                                               Environment 113, 6 (2009), 1194–1207.
Random Forests because they are widely used and robust, but                               [16] A. Liaw, M. Wiener, et al. 2002. Classification and regression by randomForest.
the ensemble would also work with other classifiers that output                                R news 2, 3 (2002), 18–22.
                                                                                          [17] J. Lin, R. Khade, and Y. Li. 2012. Rotation-invariant similarity in time series
class probabilities. Regarding classification times, the Random                                using bag-of-patterns representation. Journal of Intelligent Information Systems
Forests approach was twice as fast as WEASEL+MUSE, as Ran-                                     39, 2 (2012), 287–315.
dom Forests do not require feature extraction or selection, as                            [18] Nicola Di Mauro, Antonio Vergari, Teresa Maria Altomare Basile, Fabrizio G.
                                                                                               Ventola, and Floriana Esposito. 2017. End-to-end Learning of Deep Spatio-
opposed to time-series-based approaches.                                                       temporal Representations for Satellite Image Time Series Classification. In
    Further research is needed to understand how                                               Proceedings of the ECML/PKDD Discovery Challenges.
MUSE+WEASEL scales over larger areas, e.g. continental                                    [19] S. Powell, W. Cohen, Z. Yang, J. Pierce, and M. Alberti. 2008. Quantification
                                                                                               of impervious surface in the Snohomish water resources inventory area of
scale. The complexity of land cover processes and their                                        western Washington from 1972–2006. Remote Sensing of Environment 112, 4
spectral-temporal patterns will probably grow with increasing                                  (2008), 1895–1908.
                                                                                          [20] J. Ross Quinlan. 1986. Induction of decision trees. Machine learning 1, 1 (1986),
area size. From an application point of view, our presented                                    81–106.
method may be of particular interest for mapping agricultural                             [21] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu,
land use patterns. Agricultural land can be highly dynamic and                                 J. Zakaria, and E. Keogh. 2012. Searching and mining trillions of time series
                                                                                               subsequences under dynamic time warping. In Proceedings of the 2012 ACM
spectrally variable throughout the year [13]. This land use class                              SIGKDD International Conference on Knowledge Discovery and Data Mining.
is therefore likely to benefit from time-series based approaches.                              ACM, 262–270.
Future research could test the performance of our time-series                             [22] S. Running. 2008. Climate change - Ecosystem disturbance, carbon, and climate.
                                                                                               Science 321, 5889 (2008), 652–653.
based method for classifying a broader range of crop types and                            [23] P. Schäfer. 2015. The BOSS is concerned with time series classification in
cropping cycles. The presented approaches essentially disregard                                the presence of noise. Data Mining and Knowledge Discovery 29, 6 (2015),
                                                                                               1505–1530.
the spatial neighbourhood of a pixel. Experiments have shown                              [24] P. Schäfer and U. Leser. 2017. Benchmarking Univariate Time Series Classifiers.
that including spatial information can improve classification                                  In BTW 2017. 289–298.
accuracies, similar to the results reported in [18].                                      [25] P. Schäfer and U. Leser. 2017. Fast and Accurate Time Series Classification
                                                                                               with WEASEL. Proceedings of the 2017 ACM on Conference on Information and
                                                                                               Knowledge Management (2017), 637–646.
ACKNOWLEDGEMENTS                                                                          [26] P. Schäfer and U. Leser. 2017. Multivariate Time Series Classification with
                                                                                               WEASEL+MUSE. ArXiv e-prints (Nov. 2017). arXiv:1711.11343
This work was partly supported by the German Federal Ministry                             [27] C. Senf, P. Leitão, D. Pflugmacher, S. van der Linden, and P. Hostert. 2015.
of Education and Research through the GeoMultiSens project                                     Mapping land cover in complex Mediterranean landscapes using Landsat: Im-
                                                                                               proved classification accuracies from integrating multi-seasonal and synthetic
(grant no. 01IS14010B).                                                                        imagery. Remote Sensing of Environment 156 (2015), 527–536.
                                                                                          [28] TiSeLaC : Time Series Land Cover Classification Challenge. 2017. https:
REFERENCES                                                                                     //sites.google.com/site/dinoienco/tiselc. (2017).
                                                                                          [29] WEASEL+MUSE implementation. 2016. https://github.com/patrickzib/SFA/.
[1] H. Bagan, Q. Wang, M. Watanabe, Y. Yang, and J. Ma. 2005. Land cover                       (2016).
    classification from MODIS EVI times-series data using SOM neural network.             [30] M. Wistuba, J. Grabocka, and L. Schmidt-Thieme. 2015. Ultra-fast shapelets
    International Journal of Remote Sensing 26, 22 (2005), 4999–5012.                          for time series classification. arXiv preprint arXiv:1503.05018 (2015).
[2] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. 2016. The Great Time        [31] H. Yin, D. Pflugmacher, R. Kennedy, D. Sulla-Menashe, and P. Hostert. 2014.
    Series Classification Bake Off: An Experimental Evaluation of Recently Pro-                Mapping annual land use and land cover changes using MODIS time series.
    posed Algorithms. Extended Version. Data Mining and Knowledge Discovery                    IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
    (2016), 1–55.                                                                              7, 8 (2014), 3421–3427.
[3] M. Baydogan and G. Runger. 2016. Time series representation and similarity            [32] Z. Zhu and C. Woodcock. 2014. Continuous change detection and classification
    based on local autopatterns. Data Mining and Knowledge Discovery 30, 2 (2016),             of land cover using all available Landsat data. Remote sensing of Environment
    476–509.                                                                                   144 (2014), 152–171.
[4] A. Belward and J. Skøien. 2015. Who launched what, when and why; trends
    in global land-cover observation capacity from civilian earth observation


                                                                                     15