An Empirical Evaluation of Early Time-Series Classification
                           Algorithms
                Evgenios Kladis                                       Charilaos Akasiadis                      Evangelos Michelioudakis
              NCSR “Demokritos”                                       NCSR “Demokritos”                             NCSR “Demokritos”
                Athens, Greece                                           Athens, Greece                              Athens, Greece
           eukladis@iit.demokritos.gr                             cakasiadis@iit.demokritos.gr                   vagmcs@iit.demokritos.gr

                                               Elias Alevizos                                Alexandros Artikis
                                          NCSR “Demokritos”                                    NCSR “Demokritos”
                                            Athens, Greece                                       Athens, Greece
                                    alevizos.elias@iit.demokritos.gr                       a.artikis@iit.demokritos.gr

ABSTRACT                                                                                  The integration of state-of-the-art sensory and telecommu-
Early Time-Series Classification (ETSC) is the task of discerning                      nication devices on ships provides a constant stream of data
the class of time-series observations, as accurately and fast as                       reporting the geospatial positioning, direction, or trajectory in
possible. Such approaches can be incorporated in forecasting,                          the form of time-series data. These can be incorporated in ETSC,
and this way assist on many research fields. However, available                        for example to aid the tasks of naval authorities [22]. Some cases
approaches are not suitable for all problems, since the shape                          worth noting are the early detection of ship collisions [27], which
and the nature of data can impact their performance. In the                            prevents the dangers of the densely used naval routes, or to detect
context of this work, we empirically evaluate five state-of-the-                       naval smuggling events [1] as soon as possible, which calls for
art ETSC algorithms on publicly available data, as well as on                          immediate action. Another important application of ETSC is the
two newly introduced datasets, originating from the biological                         prediction of earthquakes and the intensity of aftershocks from
and maritime application areas. The first dataset refers to cancer                     respective sensor readings; and in the energy field, in order to
simulation data, while the second consists of vessel geospatial                        predict the energy consumption of households and optimize the
information. The aim is to extensively evaluate ETSC algorithms,                       energy supply system [32].
and provide intuition on how such approaches work, and what                               In this paper, we focus on the state-of-the-art of ETSC aiming
are the problem characteristics that may render each method                            to extend the benchmarking literature and tools. We employ a
successful. Also, the framework we used for the evaluation can                         collection of various algorithms, and extensively evaluate them
serve as a benchmark for new related approaches.                                       on publicly available datasets that include multivariate and uni-
                                                                                       variate cases of different sizes. Two new datasets are employed,
                                                                                       originating from drug treatment discovery, and maritime surveil-
1    INTRODUCTION                                                                      lance. Overall, we bundle together algorithms and datasets in a
The evolution of computer systems and the proliferation of In-                         single and publicly available framework, which can serve as a
ternet of Things has significantly aided to the expansion of time-                     benchmark for future developments in this domain.
series data production, collection, and storage [35]. For instance,                       The rest of the paper is organized as follows. In Section 2
in the life sciences field, simulation frameworks aim to estimate                      we discuss related work and the algorithms that are evaluated.
how cellular structures respond to treatments, e.g. in the face of                     Section 3, describes the datasets that we utilize. In Section 4 we
new experimental pharmaceuticals [9]. Such simulations require                         present the experimental results, and, finally, in Section 5, we
vast amounts of computational resources, as well as time, and                          conclude.
produce gigabytes of data in each run. Thus, non-interesting
cases should be detected at early stages and stopped as soon as                        2   ETSC ALGORITHMS
possible, since they are not expected to provide any useful result.
                                                                                       Although ETSC is not a novel field, most of the existing work com-
    However, many problems remain regarding the detection of
                                                                                       pares methods for regular time-series classification, or forecast-
situations and ailments ahead of time. Many effective methods
                                                                                       ing, using the whole time-series for predictions. A representative
for tackling such problems stem from the Early Time-Series Clas-
                                                                                       example is the work of Demsar et al. [6], which evaluates classic
sification (ETSC) domain. Contrary to standard time-series clas-
                                                                                       classifiers, such as, k-NN and Naive Bayes, by incorporating a
sification, which requires all the time-points to make a decision,
                                                                                       variety of methods for statistical comparison. Multiple datasets
ETSC approaches aim to classify time-series as early in time as
                                                                                       are included in the evaluation, and conclusions are drawn using
possible, always pursuing the smallest impact in the accuracy of
                                                                                       ANOVA and Friedman tests. The work of [28], does the same, but
the predictions. By providing results ahead of time, early classifi-
                                                                                       also for Artificial Neural Networks (ANN) approaches, evaluated
cation can give e.g. critical data about a patient’s future health
                                                                                       on a dataset regarding smartphone data usage. The analysis of
complications in the long-term within the predictive medicine
                                                                                       the results highlights that classifiers can be quite effective in
field [20], or in the short-term, such as arrhythmia detection [16]
                                                                                       the smartphone data usage application, with the Random Forest
or stroke prediction [3].
                                                                                       approach being the most well performing one. Also, different
© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed-
                                                                                       algorithms seem to perform more effectively for certain datasets
ings of the EDBT/ICDT 2021 Joint Conference (March 23-26, 2021, Nicosia, Cyprus)       compared to others, rendering the evaluation of multiple algo-
on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0           rithms on each dataset necessary. Moreover, given the increasing
International (CC BY 4.0)
                                                                                       data sources worldwide, as well as the rising complexity in the
contemporary data analysis processes, focus has been put on                   A similar approach to the EDSC is followed by the Mining
multivariate time-series [8]. In a recent work presented in [7],          Core Feature for Early Classification (MCFEC) [11] algorithm.
authors examine the state-of-the-art methods for solely multi-            The difference in this case lies in the criteria for shapelet se-
variate time-series classification and compare inter-dimensional          lection, and also in that MCFEC has the capability to process
dependencies among classifiers, providing this way further in-            multivariate time-series. Like EDSC, for each variable, all the
sight to this field of research.                                          sub-series of minimum and maximum lengths are extracted, but
   A recent work was also conducted in ETSC, reviewing ex-                are additionally grouped using Silhouette Clustering [26]. Then,
isting approaches related to the field [10]. The algorithms are           the best shapelets from each cluster are selected for classification
categorized into four groups and the analysis is performed on             based on an extension of the F1 -score metric.
a theoretical level, highlighting their strengths, limitations, and           Effective Confidence-based Early Classification (ECEC) [19]
major concerns for each group. Our approach differs, since we             utilizes subsequences, but in a quite different way. ECEC employs
follow an empirical approach. We categorize ETSC algorithms               a transformation algorithm called WEASEL [29] that extracts
according to their internal structure and subcomponents and               windows (e.g. {1137, 1227, 1205}), transforms them to symbols
evaluate them on multiple datasets. Here, we distinguish three            that form words (e.g. (abc, cba . . . )), and measures the frequency
main types, i.e. those that rely on subsequence extraction and            of their appearance in each time-series. In more detail, the user
analysis, algorithms that are based on clustering of the time-            provides as input (i) the desired window length and (ii) the length
series, and others that employ ANNs. Note that some algorithms            of the words. Then, the frequency of each word (e.g. [1, 2, 1, 0, 0]
could be considered to belong to more than one category. In this          for a time-series and all window lengths), is passed on to a lo-
case, the type was chosen based on which component was the                gistic regression classifier, which in turn produces probabilistic
most significant for the classification task.                             predictions. ECEC truncates the input into t different prefix sizes,
                                                                          beginning from the 1/t of the time-series up to its full length. For
                                                                          each prefix, it trains a WEASEL classifier and performs a 5-fold
2.1    Subsequence-based Approaches                                       cross validation on the dataset as an evaluation. The predictions
The algorithms that use subsequences, in principle, isolate smaller       acquired are then passed into a cost function, which, based on the
windows of the time-series—termed as subseries—and manipulate             probability of the prediction being correct, calculates a threshold
them accordingly, to maximize information gain and achieve the            θ . During the testing phase, using the classifier of the minimum
earliest and most accurate results. We distinguish five cases that        prefix size, a prediction is made for each prefix, and the corre-
belong to this category. An example of such an algorithm is               sponding cost is calculated. If the cost is higher than the threshold
Early Distinctive Shapelet Classification (EDSC) [34] where the           θ then the prediction is produced, otherwise the algorithm waits
classification task is based on subseries extraction. Initially, the      until enough data have been accumulated to form the next prefix.
user provides as input the minimum and maximum lengths of the                 The WEASEL algorithm for transforming the input data com-
windows that are to be extracted. Then, during the extraction step,       bined with logistic regression is utilized by the Two-tier Early
all the corresponding sub-series are isolated from the full time-         and Accurate Series classifiER (TEASER) method [30]. This algo-
series and, by using Chebyshev Inequality on each extracted part,         rithm truncates the time-series into S prefixes, each containing
                                                                          length of time-series
a threshold is calculated. This threshold controls the minimum                                    more time points than the previous prefix.
                                                                                      S
similarity that a sub-series should have, so to be assigned in the        For each prefix, TEASER applies the WEASEL - logistic regres-
same class. Next, each sub-series is transformed into a triplet           sion pipeline to obtain respective probabilistic values. In order
containing the class of the time-series it was extracted from, the        to ensure the reliability of the classification result, the outputs
sub-series itself, and the threshold, forming this way a shapelet.        are passed into a One-Class SVM constructed for each prefix,
    Having all the shapelets in hand, a utility function calculates       that accepts or rejects a prediction. The final requirement for
a measure similar to the F1 -score for each one, which represents         the approval of a classification is that a prediction should be
in some sense the distinctive capability of the shapelet, i.e., how       consistently the same for v out of S prefixes, with v being an
appropriate a shapelet is to be considered as a template for a            algorithm’s hyper-parameter subject to optimization.
particular class. Ranked by their utility values, the best shapelets          Early Classification framework for time series based on class
are selected and stored into a “pool” that constitutes the clas-          Discriminativeness and Reliability(ECDIRE) [21] also uses part
sification basis. To ease the reader we incorporate a running             of the time-series to produce reliable and early predictions. The
example for the biological case. Suppose we have the time-series          algorithm requires as input the set E which contains time-series
t = {1137, 1227, 1205, 1082, 893, 736, 664, 639, 631} which is part       lengths at which there is high information value, in order to
of a time-series from our biological dataset. The user gives as           reduce granularity of data. Consequently, 10 times 5-fold cross-
input minimum length 2 and maximum, 4. Firstly, all the sub-              validation is done for each length using Gaussian Process Classi-
series between sizes 2 and 4 are going to be extracted from the           fiers [24], storing the predictions and probabilities for each class.
time-series. Let one sub-series be tpr ime = {1137, 1227, 1205}.          Based on the predictions, an earliness threshold is constructed,
For tpr ime , a threshold δ = 3.05 is calculated based on Cheby-          which indexes the time-point from which class instances differ
shev’s inequality, which ensures the similarity of tpr ime only           from the other labeled instances. Using the probabilities from the
with time-series of distance less or equal than the value of δ .          cross-validation, a second probabilistic threshold is calculated,
After this step, spr ime = (class, tpr ime , δ ) is created, and then,    that ensures the reliability of the prediction. A classification result
the weighted F1 -Score of each shapelet is calculated and stored          must satisfy both thresholds in order to be passed as a prediction.
in a list. Assuming that the list is [1.3, 3.67, 0.83] and 3.67 is the        A more recent approach for early time-series classification is
score of tpr ime , the top K shapelets from the list are selected as      the Distance Transformation based Early Classification (DTEC)
the classifier shapelets. When the distance of the test time-series       algorithm [36]. DTEC relies on the transformation of data and
from a shapelet is less than δ , then the shapelet’s class is assigned.   the usage of probabilistic classifiers to achieve early and accurate
results. In particular, DTEC extracts subsequences of time-series             The Multivariate LSTM-FCNs (MLSTM-FCN) [18] method uti-
and maps them to another space based on distances. The trans-             lizes neural networks and operates on multivariate time-series.
formed data are then used to train a probabilistic classifier by          This is actually an extension of the original algorithm LSTM-
forming the confidence area, which is drawn based on the proba-           FCN [17], which supported only univariate datasets. Both algo-
bility strength of the class to be passed as a prediction, making         rithms duplicate the input and pass the data to two sub-models:
this way the classification result more reliable.                         The first sub-model consists of three convolutional neuron lay-
   We selected EDSC, TEASER, and ECEC for evaluation from                 ers. The use of CNNs is widely adopted in regular time-series
this category. EDSC is considered a widely referenced method              classification [37], since it extracts important features from se-
for ETSC, since most recently developed algorithms use it as a            quences of data that often derive from imagery. In the described
baseline. TEASER and ECEC are also interesting choices, since             model, data that exit from each of the first two layers are first
they follow different internal processes and have readily available       batch normalized [15] and then passed to an activation function,
implementations that provide good results.                                i.e. a Rectified Linear Unit (ReLU). In order to maximize the ef-
                                                                          ficiency of the model on multivariate time-series, the activated
                                                                          output is also passed into a squeeze and excitation block [13].
2.2    Clustering-based Approaches
                                                                          A squeeze and excite network consists of a global pooling layer
Clustering is the task of forming groups of data similar to each          and two dense layers that give to each variable of the time-series
other based on some similarity measure. We refer to two dis-              a unique weight, so that to increase the sensitivity of predic-
tinct cases in this category. Early Classification on Time Series         tions. The second sub-model, consists of a masking layer and
(ECTS) [33] is an algorithm that relies on clustering, as well as,        the output is passed on an attention based LSTM. LSTM [12]
the 1-Nearest Neighbors (1-NN) search, to perform accurate and            is a Recurrent Neural Network model, popular in time-series
early classification. In detail, first the 1-NN set of each time-series   classification, because of its ability to remember inter time-series
is calculated. Based on that set, the algorithm follows the Reverse       dependencies with minimal computational cost and high accu-
Nearest Neighbors (R-NN) approach, which in a sense shows how             racy for time-series of length less than a thousand time points [2].
many time-series consider the examined one as a nearest neigh-            Attention based LSTMs are variations of the normal LSTMs, with
bor. Based on the consistency of the R-NN set for each different          increased computational complexity, which nevertheless results
subsequence of the time-series, ECTS computes the minimum                 to increased overall performance. The output of the two sub-
prediction length for each time-series that represents how many           models is concatenated and passed through a dense layer with
time-points are required for the time-series to act as a reliable         as many neurons as the classes, and via a softmax activation
classifier. First, the time-series are clustered using agglomera-         function predictions as a probabilistic output are provided.
tive hierarchical clustering [31] based on Euclidean distances.               Another similar approach is the Multi-Domain Deep Neural
The minimum prediction length of each cluster is considered               Network (MDDNN) [14] algorithm, which utilizes simple neu-
and appointed to each time-series in the cluster. During the test         ral networks based on the same principles as the previous one.
phase, time-series are paired with NN, using initially only the           However, this algorithm consists of two sub-models which are
first time-point and increasing through each iteration, until the         identical with respect to their structure, but differ on the input
time-series they pair with has minimum prediction length less             they receive. The first model takes as input the z-normalized
or equal to that time-point’s position.                                   raw time-series represented on the time domain, and the sec-
    TRIGGER [4] is another algorithm that uses clustering in com-         ond takes as input the z-normalized fast-fourier transform of
bination with a cost function to estimate if a time-series can            the time-series represented on the frequency domain. Both mod-
provide a trustworthy result. The time-series are grouped into            els consist of two CNN layers. After each layer the output is
clusters using k-means [31] based on Euclidean distances. For             again batch normalized and passed onto a one-dimensional max
each cluster k and time-step t ∈ [1, . . . ,T ], a Multilayer Percep-     pooling layer, activated with a ReLU layer. The output of each
tron [31] hkt is trained using time-series of length t from the clus-     model is then concatenated, flattened, and passed through two
ter k. When a new time-series arrives, a “membership” probability         dense layers. Finally a softmax layer is applied, just as in the
is calculated to appoint it to a cluster and, based on the cluster        MLSTM-FCN. We selected the MLSTM-FCN for our evaluation,
and the length of the time-series, a prediction is performed. The         since there are much more details available for this particular
results of the classifier are handled with a cost function, which         approach than others. Also, it is worth noting that the authors
is calculated for every time-point between the current and the            provide respective source-code and datasets to be used by the
last of the time-series. If the cost function returns a value of zero     community.
the result of the classifier is deemed safe and is returned along
with the size of the incomplete time-series up to the currently ex-
amined time-step. We choose to incorporate the ECTS algorithm
in our evaluation since, similarly to EDSC, it is one of the oldest
and most known approaches for early time-series classification.           3   DATASETS
                                                                          In this section, we describe the datasets used in our evaluation,
                                                                          i.e., the publicly available UCR [5], as well as two new ones, from
2.3    Artificial Neural Networks-based                                   the domains of life-sciences and maritime. Note that, instances
       Approaches                                                         included from the publicly available collection are univariate,
First of all, a typical ANN-based approach performs time-series           while the two new cases are both multivariate. Also, since not
classification given data in the full-length time-series. However,        every selected algorithm is designed to work on multivariate time-
this can be tackled on a higher level, by supplying to the in-            series, we implemented a simple voting method for classification.
put only parts of the time-series, thus making them capable of            This applies to the ECTS, EDSC, TEASER, and ECEC algorithms.
conducting ETSC. We refer to two cases in this category.
3.1     Dataset of cancer cell simulations                                Harmonic Mean. Harmonic mean is a metric proposed in [30]
This dataset includes the count of tumor cells population during        and it represents the Pythagorean harmonic mean between earli-
the administration of specific drug treatments as resulting from        ness and accuracy.
large-scale model exploration experiments. Each time-series rep-                                         2 · (1 − earliness) · accuracy
                                                                                 HarmonicMean =
resents one simulation experiment of a specific drug treatment                                             (1 − earliness) · accuracy
configuration, which differs from the others based on a set of
                                                                        Note that the algorithms were evaluated using 5-fold cross-validation
configurable parameters, such as the time of administration, the
                                                                        on the two real-world datasets.
duration, and the drug concentration. A time-point of each in-
stance corresponds to three different integer values, indicating           Training and Testing times. Measured in minutes and seconds
the number of Alive, Necrotic and Apoptotic cells. The time-series      respectively.
are preclassified as interesting or non-interesting, based on if the
drug treatment was effective or not, according to a particular clas-    4.2     Comparison of ETSC Approaches
sification rule that was defined by domain experts. The dataset         We present results from an evaluation of 5 state-of-the-art algo-
consists of 644 time-series, each having 48 time-points. The time-      rithms, i.e, EDSC, ECTS, TEASER, ECEC, and MLSTM-FCN. The
series instances included in the dataset were created by executing      implementations were readily available, except for ECTS, which is
a parallel version of the PhysiBoSSv2 simulator.1                       a custom implementation. In particular, ECEC as well as TEASER
    Since the dataset originates from large-scale simulation ex-        are written in Java, while ECTS and MLSTM-FCN in Python, and,
periments oriented to drug treatment discovery, the classes are         finally, EDSC in C++. The algorithms were tested using 5-fold
rather imbalanced. Specifically, the configurations that produced       cross validation for the maritime and biological datasets. The
interesting time-series constitute the 20% of the dataset, while        results were run on a server with an Intel Xeon E5-2630 2.60GHz
the remaining 80% account for non-interesting cases. Moreover,          processor, with 25-cores and 252 GB RAM.
many interesting and non-interesting examples tend to be very               Since most algorithms don’t support multivariate time-series, a
similar during the early stages of the simulation, until the drug       voting method was applied. According to the voting method, clas-
treatment takes effect, which, as observed, is usually after the        sifiers are trained, and tested on each of the folds, but separately
first 30% of the time-points of the time-series. This fact makes the    for each dimension of the input vector shape. Upon collecting
dataset an interesting benchmark, since it is nearly impossible to      each of the outputs, the most frequent or voted prediction was
obtain accurate predictions in less time than this.                     chosen, however, assigned with the worst earliness among them.
                                                                        In the case of equal votes among classifiers, the earliest one made
3.2     Dataset of vessel position signals                              is selected. MLSTM-FCN is tested on the [40%, 50%, 60%] of the
The maritime dataset contains data of nine different sea vessels        time-series length in each dataset, and the length with the best
that cruise around the port of Brest, France. This dataset is derived   results based on harmonic mean is chosen, for each dataset indi-
from [23, 25]. Each measurement corresponds to a vector of val-         vidually. The number of LSTM cells is set to 8 for all experiments
ues for longitude, latitude, speed, heading, course over ground of      as default. For TEASER, S is set to 10 for the biological and mar-
a vessel at a given time-point. The time-series are fragmented to       itime case whereas for the UCR dataset is fixed to 20. The code
a specific length, and divided into two classes, based on whether       of our testing framework is available publicly, accompanied with
the ship entered or didn’t enter this particular port during the        the respective datasets and algorithm configurations.2
time course of the corresponding time-series fragment. In total,            It should be noted that fig. 1 presents for the biological and
there are 5249 instances of 30 time-points each.                        maritime cases using the 5-fold cross validation over the multi-
                                                                        variate datasets. Also for the UCR datasets, each entry is a dataset
4 EXPERIMENTAL RESULTS                                                  and the F1 -Score bar represents the average F1 -score from all
                                                                        dataset’s classes. The training and testing time plots were made
4.1 Evaluation Metrics
                                                                        using log-scale on the y-axis.
For our evaluation we use five well-known metrics.                          ECTS, as one of the older and simpler approaches in the ETSC
  Earliness. Earliness counts how many time-points were re-             algorithmic ‘arsenal’, has considerably high training times, an
quired to make the final prediction for each time-series.               undesired characteristic that results to the algorithm being unable
                                                                        to provide results within several hours. In particular, for the
                             consumed time series lenдth
             Earliness =                                                maritime dataset, this algorithm did not progress after 48 hours
                              total lenдth o f timeseries               of execution. This can be explained by the fact that ECTS uses
  Accuracy. Accuracy is translated to how many time-series              1-NN as the main procedure behind the early classification. 1-NN
were classified correctly during the testing process.                   requires to calculate Euclidean distances between all time-series,
                                                                        resulting to very high computation times for big datasets both
                           number o f correct predictions
            Accuracy =                                                  in length and size. One notable example is the StarLightCurves
                             number o f test instances
                                                                        dataset, which contained 1000 training time-series of length 1024,
   F1 -score. F1 -score is the harmonic mean of precision and recall:   resulting to more than 65 hours of training and testing times.
                                        tp                              However, for the biological dataset (fig. 1a), the computation
                     F 1 -score =                                       time was on average around 30 minutes per fold. It reaches at
                                      1
                                  tp + (f p + f n)                      top accuracy and over 90% F1 -scores which shows that ECTS can
                                      2
                                                                        successfully tackle the class imbalance.
where tp, f p, and f n are the true positives, false positives, and         EDSC, the first approach to propose the use of sub-sequences
false negatives respectively.                                           for early time-series classification, is also struggling with time
1 https://github.com/xarakas/spheroid-tnf-v2-emews                      2 https://github.com/Eukla/ETS
        1.0                                                                     1.0                                                                        1.0

        0.8                                                                     0.8                                                                        0.8
                     Metric
        0.6      F1-Score                                                       0.6                                                                        0.6
Score


                                                                        Score


                                                                                                                                                   Score
                 Earliness
        0.4      Accuracy                                                       0.4          Metric                                                        0.4          Metric
                 HarmonicMean                                                            F1-Score                                                                   F1-Score
                                                                                         Earliness                                                         0.2      Earliness
        0.2                                                                     0.2      Accuracy                                                                   Accuracy
                                                                                         HarmonicMean                                                               HarmonicMean
        0.0                                                                     0.0                                                                        0.0
              ECTS      EDSC   TEASER MLSTM                 ECEC                      ECTS        EDSC    TEASER MLSTM                ECEC                       ECTS      EDSC     TEASER MLSTM   ECEC
                               Method                                                                     Method                                                                    Method

                        (a) Biological                                                            (b) Maritime                                                                   (c) UCR


                                                                                        Dataset                         103                                      Dataset
                                                                                             UCR                                                                  UCR
                                               102                                           Biological                 102                                       Biological
                                                                                             Maritime                                                             Maritime
                                                                                                                        101
                                  Time (min)


                                                                                                           Time (sec)
                                               101                                                                      100
                                                                                                                        10 1
                                               100
                                                                                                                        10 2
                                                                                                                        10 3
                                                     ECTS      EDSC    TEASER MLSTM             ECEC                           ECTS     EDSC    TEASER MLSTM            ECEC
                                                                      Algorithm                                                                Algorithm

                                                             (d) Training time                                                        (e) Testing Time

    Figure 1: Evaluation measures (top) and training/testing times (bottom) for each dataset and each of the algorithms.


complexities. In fact, training and testing times are greatly influ-                                                correct. For the biological dataset, the ECEC has the highest accu-
enced by the size of the dataset and, more importantly, the length                                                  racy with very low length usage, being the best one in harmonic
of the time-series. Having lengths of more than a thousand time-                                                    mean among the evaluated algorithms. However, this could be
points, and dataset sizes larger than a hundred time-series leads                                                   considered as a misleading result due to the weakness of the
to increased computation times, e.g. the Haptics dataset, which                                                     confidence threshold as a measure of classifications reliability,
has 155 time-series of size 1093 time-points and training times of                                                  and the strong predisposition of the algorithm to choose the
more than 24 hours. For the available biological dataset the per-                                                   majority class as a prediction over the minority. Because the
formance is high in terms of accuracy. However, the data usage is                                                   80% of the dataset is considered as non-interesting, even if all
also very high. For the maritime (fig. 1b), the performance is sub-                                                 predictions are the same, high accuracy can be attained. Note
optimal, since F1 -score is relatively low and the earliness reaches                                                though that the F1 -Score does not suffer from this. For the Mar-
to 100%. When EDSC is unable to find a prediction, returns the                                                      itime dataset, where the classes are more balanced, the harmonic
most frequent class label with the maximum data-usage. EDSC                                                         mean of ECEC is significantly low, with about 80% data usage. It
for both the biological and the maritime dataset was unable to                                                      should be noted that the F1 -Score is on the higher end, at about
make predictions and returned the default answer.                                                                   92% which is explained by increased data usage. Times increase
   We can observe increased earliness values for ECTS and EDSC.                                                     significantly with the size of each dataset, since for datasets such
Recall that the biological dataset consists of 3 variables, the Alive,                                              StarLightCurves training needed more than 5 hours.
Necrotic and Apoptotic cell counts at each time-step. While for                                                        TEASER, uses WEASEL just like ECEC but is shown to perform
the Alive and Necrotic variables numerical value differences                                                        differently. The computation times for the biological, maritime
exist between different class instances, for the Apoptotic, the                                                     and UCR dataset are very low. Testing times are also the lowest.
number of cells is more consistent in both classes. This creates                                                    As one of the most promising presented algorithm in this paper,
ambiguity for a simple classifier such as ECTS and EDSC, resulting                                                  achieves steadily over 80% accuracy, with less than 50% earliness
in very delayed classification when using the Apoptotic variable                                                    on all datasets. Also this was achieved by conducting multivariate
as training. During a test instance, predictions with minimal data                                                  time-series classification using the simple voting method. The
usage are produced by the classifiers trained with the Alive and                                                    3-step prediction process containing, the transformation and lo-
Necrotic subseries but delayed classifications are made from the                                                    gistic regression, the One-Class SVM, and the check of prediction
Apoptotic trained ones. The voting classification method chooses                                                    consistency along with the hyperparameter searching step, give
the most voted prediction, which is usually the one made from                                                       reliable results, very quickly, remaining unaffected from dataset
the Apoptotic classifier, and assigns as earliness the largest value                                                sizes. A weakness of TEASER is the high data usage, in the Inli-
among the three options, which in this dataset is the one from                                                      neSkate dataset of UCR (fig. 1c) for example, which, nevertheless,
Apoptotic. For this reason earliness becomes worse.                                                                 can be tackled by increasing the available RAM. Also the F1 -Score
   Up next, we have the ECEC method. As already explained, the                                                      is high but still has room for improvement.
ECEC spilts the dataset to prefixes and computes a confidence                                                          Lastly, MLSTM-FCN uses Deep Neural Networks to make pre-
threshold based on the a priori probability of a prediction being                                                   dictions. Since we essentially run MLSTM-FCN 3 times for each
dataset of the UCR, for the [40%, 50%, 60%] of the dataset, both                        [10] A. Gupta, H. P. Gupta, B. Biswas, and T. Dutta. 2020. Approaches and Appli-
training and testing times are generally high for the UCR. How-                              cations of Early Classification of Time Series: A Review. IEEE Transactions on
                                                                                             Artificial Intelligence 1, 1 (2020), 47–61.
ever, for the two distinct datasets using 40% of the time-series                        [11] G. He, Y. Duan, R. Peng, X. Jing, T. Qian, and L. Wang. 2015. Early classification
had the minimal trade-off in accuracy and earliness. For each                                on multivariate time series. Neurocomputing 149 (02 2015), 777–787.
                                                                                        [12] S. Hochreiter and J. Schmidhuber. 1997. Long Short-term Memory. Neural
fold, the MLSTM-FCN, proved the biggest divergence of accuracy.                              computation 9 (12 1997), 1735–80.
Weights are randomly initialized at the start of each run, and                          [13] J. Hu, L. Shen, and G. Sun. 2018. Squeeze-and-Excitation Networks. In 2018
Neural Networks are generally sensitive to dataset information,                              IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7132–7141.
                                                                                        [14] H. Huang, C. Liu, and V. S. Tseng. 2018. Multivariate Time Series Early
sometimes leading to erratic results. Nevertheless, MLSTM-FCN                                Classification Using Multi-Domain Deep Neural Network. In 2018 IEEE 5th
shows potential in both unique datasets with very low computa-                               Int. Conf. on Data Science and Advanced Analytics (DSAA). 90–98.
tion times, good accuracy, and early predictions. The F1 -Score                         [15] S. Ioffe and C. Szegedy. 2015. Batch Normalization: Accelerating Deep Network
                                                                                             Training by Reducing Internal Covariate Shift. In Proc. of the 32nd Int. Conf.
is also high on average except for a few particular folds and                                on Machine Learning 37. JMLR.org, 448–456.
datasets. Also, for the UCR datasets, MLSTM-FCN shows some                              [16] M. Kachuee, S. Fazeli, and M. Sarrafzadeh. 2018. ECG Heartbeat Classification:
                                                                                             A Deep Transferable Representation. In 2018 IEEE Int. Conf. on Healthcare
low accuracies compared to other algorithms. Insufficient results                            Informatics (ICHI). 443–444.
derive from the lack of hyper parameter optimization that has                           [17] F. Karim, S. Majumdar, H. Darabi, and S. Chen. 2018. LSTM fully convolutional
not yet been integrated on the used MLSTM-FCN for univariate                                 networks for time series classification. IEEE Access 6 (2018), 1662–1669.
                                                                                        [18] F. Karim, S. Majumdar, H. Darabi, and S. Harford. 2019. Multivariate LSTM-
and multivariate time-series. For the LSTM layer, we used the                                FCNs for time series classification. Neural Networks 116 (2019), 237 – 245.
default 8 cells. Using a grid search for cells might improve UCR                        [19] J. Lv, X. Hu, L. Li, and P. Li. 2019. An Effective Confidence-Based Early
results, but would drastically increase computation times.                                   Classification of Time Series. IEEE Access 7 (2019), 96113–96124.
                                                                                        [20] L. Miner, P. Bolding, J. Hilbe, M. Goldstein, T. Hill, R. Nisbet, N. Walton, and G.
                                                                                             Miner. 2016. Practical Predictive Analytics and Decisioning Systems for Medicine:
5    SUMMARY AND FURTHER WORK                                                                Informatics Accuracy and Cost-Effectiveness for Healthcare Administration and
                                                                                             Delivery Including Medical Research (1st ed.). Academic Press, Inc., USA.
In this work, we evaluated state-of-the-art for ETSC on pub-                            [21] U. Mori, A. Mendiburu, E. Keogh, and J. A. Lozano. 2017. Reliable early
licly available datasets and newly introduced ones originating                               classification of time series based on discriminating the classes over time.
                                                                                             Data Mining and Knowledge Discovery 31, 1 (01 Jan 2017), 233–263.
from real-world applications. The results were very positive,                           [22] K. Patroumpas, E. Alevizos, A. Artikis, Marios Vodas, N. Pelekis, and Y.
since all the algorithms achieved over 80% accuracy, with the                                Theodoridis. 2017. Online event recognition from moving vessel trajecto-
most recently introduced ones using less than half of the avail-                             ries. GeoInformatica 21 (01 04 2017), 389–427.
                                                                                        [23] K. Patroumpas, D. Spirelis, E. Chondrodima, H. Georgiou, P. Petrou, P. Tam-
able time-points, i.e., producing accurate predictions quite early.                          pakis, S. Sideridis, N. Pelekis, and Y. Theodoridis. 2018. Final dataset of Trajec-
Each algorithm performed slightly differently on each dataset,                               tory Synopses over AIS kinematic messages in Brest area (ver. 0.8).
but TEASER and MLSTM-FCN provided a stable and good per-                                [24] C. E. Rasmussen and C. K. I. Willliams. 2006. Gaussian processes for machine
                                                                                             learning. The MIT Press, Cambridge.
formance on both classification precision and earliness, in all                         [25] C. Ray, R. Dreo, E. Camossi, and A. L. Jousselme. 2018. Heterogeneous Inte-
datasets. We also provide a public repository with all of the algo-                          grated Dataset for Maritime Intelligence, Surveillance, and Reconnaissance,
                                                                                             10.5281/zenodo.1167595.
rithms and datasets used, therefore allowing to test and choose                         [26] P. J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and val-
the most appropriate algorithm for new applications.                                         idation of cluster analysis. Journal of computational and applied mathematics
   A possible extension of this suite is the integration of online                           20 (1987), 53–65.
                                                                                        [27] I. Rudan, V. Frančić, M. Valčić, and M. Sumner. 2019. Early detection of
data streams to preform time-series classification, allowing to                              vessel collision situtations in a vessel traffic services area. Transport 35 (2019),
test such algorithms in realistic settings.                                                  121–132.
                                                                                        [28] I. H. Sarker, A. S. M. Kayes, and P. Watters. 2019. Effectiveness analysis of
                                                                                             machine learning classification models for predicting personalized context-
ACKNOWLEDGEMENT                                                                              aware smartphone usage. Journal of Big Data 6 (2019), 57.
                                                                                        [29] P. Schäfer and U. Leser. 2017. Fast and Accurate Time Series Classification
This work has received funding from the EU Horizon 2020 RIA                                  with WEASEL. In Proc. of the 2017 ACM Conf. on Information and Knowledge
program INFORE under grant agreement No 825070.                                              Management. Association for Computing Machinery, 637–646.
                                                                                        [30] P. Schäfer and U. Leser. 2020. TEASER: early and accurate time series classifi-
                                                                                             cation. Data Mining and Knowledge Discovery 34, 5 (01 Sep 2020), 1336–1362.
REFERENCES                                                                              [31] P.-N. Tan, M. Steinbach, and V. Kumar. 2016. Introduction to data mining.
 [1] N. Brax, E. Andonoff, and M. Gleizes. 2012. A Self-adaptive Multi-Agent System          Pearson Education India.
     for Abnormal Behavior Detection in Maritime Surveillance. In Agent and Multi-      [32] G. K. F. Tso and K. K. W. Yau. 2007. Predicting electricity energy consumption:
     Agent Systems. Technologies and Applications. Springer Berlin Heidelberg,               A comparison of regression analysis, decision tree and neural networks. Energy
     Berlin, Heidelberg, 174–185.                                                            32, 9 (2007), 1761 – 1768.
 [2] J. Brownlee. 2016. Sequence Classification with LSTM Recurrent Neural              [33] Z. Xing, J. Pei, and P. Yu. 2011. Early classification on time series. Knowledge
     Networks in Python with Keras.          https://machinelearningmastery.com/             and Information Systems - KAIS 31 (01 04 2011), 105–127.
     sequence-classification-lstm-recurrent-neural-networks-python-keras/               [34] Z. Xing, J. Pei, P. Yu, and K. Wang. 2011. Extracting Interpretable Features for
 [3] P. Chantamit-o pas and M. Goyal. 2017. Prediction of Stroke Using Deep Learn-           Early Classification on Time Series. Proc. of the 11th SIAM Int. Conf. on Data
     ing Model. In Neural Information Processing. Springer International Publishing,         Mining, SDM 2011, 247–258.
     Cham, 774–781.                                                                     [35] X. Xu, S. Huang, Y. Chen, K. Browny, I. Halilovicy, and W. Lu. 2014. TSAaaS:
 [4] A. Dachraoui, A. Bondu, and A. Cornuéjols. 2015. Early Classification of Time           Time Series Analytics as a Service on IoT. In 2014 IEEE Int. Conf. on Web
     Series as a Non Myopic Sequential Decision Making Problem. In Machine                   Services. IEEE, 249–256.
     Learning and Knowledge Discovery in Databases. Springer Int. Publishing,           [36] L. Yao, Y. Li, Y. Li, H. Zhang, M. Huai, J. Gao, and A. Zhang. 2019. DTEC:
     Cham, 433–447.                                                                          Distance Transformation Based Early Time Series Classification. Proceedings
 [5] H. A. Dau, A. Bagnall, K. Kamgar, C. M. Yeh, Y. Zhu, S. Gharghabi, C. A.                of the 2019 SIAM International Conference on Data Mining, 486–494.
     Ratanamahatana, and E. Keogh. 2019. The UCR Time Series Archive.                   [37] B. Zhao, H. Lu, S. Chen, J. Liu, and D. Wu. 2017. Convolutional neural networks
     arXiv:cs.LG/1810.07758                                                                  for time series classification. Journal of Systems Engineering and Electronics
 [6] J. Demšar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets.        28 (02 2017), 162–169.
     J. Mach. Learn. Res. 7 (Dec. 2006), 1–30.
 [7] B. Dhariyal, T. Le Nguyen, S. Gsponer, and G. Ifrim. 2020. An Examination
     of the State-of-the-Art for Multivariate Time Series Classification. In LITSA,
     ICDM 2020.
 [8] J. Fan, F. Han, and H. Liu. 2014. Challenges of big data analysis. National
     science review 1, 2 (2014), 293–314.
 [9] N. Giatrakos, N. Katzouris, A. Deligiannakis, A. Artikis, M. Garofalakis, G.
     Paliouras, H. Arndt, R. Grasso, R. Klinkenberg, M. Ponce-De-Leon, et al. 2019.
     Interactive extreme: Scale analytics towards battling cancer. IEEE Technology
     and Society Magazine 38, 2 (2019), 54–61.