An Empirical Evaluation of Early Time-Series Classification Algorithms Evgenios Kladis Charilaos Akasiadis Evangelos Michelioudakis NCSR “Demokritos” NCSR “Demokritos” NCSR “Demokritos” Athens, Greece Athens, Greece Athens, Greece eukladis@iit.demokritos.gr cakasiadis@iit.demokritos.gr vagmcs@iit.demokritos.gr Elias Alevizos Alexandros Artikis NCSR “Demokritos” NCSR “Demokritos” Athens, Greece Athens, Greece alevizos.elias@iit.demokritos.gr a.artikis@iit.demokritos.gr ABSTRACT The integration of state-of-the-art sensory and telecommu- Early Time-Series Classification (ETSC) is the task of discerning nication devices on ships provides a constant stream of data the class of time-series observations, as accurately and fast as reporting the geospatial positioning, direction, or trajectory in possible. Such approaches can be incorporated in forecasting, the form of time-series data. These can be incorporated in ETSC, and this way assist on many research fields. However, available for example to aid the tasks of naval authorities [22]. Some cases approaches are not suitable for all problems, since the shape worth noting are the early detection of ship collisions [27], which and the nature of data can impact their performance. In the prevents the dangers of the densely used naval routes, or to detect context of this work, we empirically evaluate five state-of-the- naval smuggling events [1] as soon as possible, which calls for art ETSC algorithms on publicly available data, as well as on immediate action. Another important application of ETSC is the two newly introduced datasets, originating from the biological prediction of earthquakes and the intensity of aftershocks from and maritime application areas. The first dataset refers to cancer respective sensor readings; and in the energy field, in order to simulation data, while the second consists of vessel geospatial predict the energy consumption of households and optimize the information. The aim is to extensively evaluate ETSC algorithms, energy supply system [32]. and provide intuition on how such approaches work, and what In this paper, we focus on the state-of-the-art of ETSC aiming are the problem characteristics that may render each method to extend the benchmarking literature and tools. We employ a successful. Also, the framework we used for the evaluation can collection of various algorithms, and extensively evaluate them serve as a benchmark for new related approaches. on publicly available datasets that include multivariate and uni- variate cases of different sizes. Two new datasets are employed, originating from drug treatment discovery, and maritime surveil- 1 INTRODUCTION lance. Overall, we bundle together algorithms and datasets in a The evolution of computer systems and the proliferation of In- single and publicly available framework, which can serve as a ternet of Things has significantly aided to the expansion of time- benchmark for future developments in this domain. series data production, collection, and storage [35]. For instance, The rest of the paper is organized as follows. In Section 2 in the life sciences field, simulation frameworks aim to estimate we discuss related work and the algorithms that are evaluated. how cellular structures respond to treatments, e.g. in the face of Section 3, describes the datasets that we utilize. In Section 4 we new experimental pharmaceuticals [9]. Such simulations require present the experimental results, and, finally, in Section 5, we vast amounts of computational resources, as well as time, and conclude. produce gigabytes of data in each run. Thus, non-interesting cases should be detected at early stages and stopped as soon as 2 ETSC ALGORITHMS possible, since they are not expected to provide any useful result. Although ETSC is not a novel field, most of the existing work com- However, many problems remain regarding the detection of pares methods for regular time-series classification, or forecast- situations and ailments ahead of time. Many effective methods ing, using the whole time-series for predictions. A representative for tackling such problems stem from the Early Time-Series Clas- example is the work of Demsar et al. [6], which evaluates classic sification (ETSC) domain. Contrary to standard time-series clas- classifiers, such as, k-NN and Naive Bayes, by incorporating a sification, which requires all the time-points to make a decision, variety of methods for statistical comparison. Multiple datasets ETSC approaches aim to classify time-series as early in time as are included in the evaluation, and conclusions are drawn using possible, always pursuing the smallest impact in the accuracy of ANOVA and Friedman tests. The work of [28], does the same, but the predictions. By providing results ahead of time, early classifi- also for Artificial Neural Networks (ANN) approaches, evaluated cation can give e.g. critical data about a patient’s future health on a dataset regarding smartphone data usage. The analysis of complications in the long-term within the predictive medicine the results highlights that classifiers can be quite effective in field [20], or in the short-term, such as arrhythmia detection [16] the smartphone data usage application, with the Random Forest or stroke prediction [3]. approach being the most well performing one. Also, different © 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed- algorithms seem to perform more effectively for certain datasets ings of the EDBT/ICDT 2021 Joint Conference (March 23-26, 2021, Nicosia, Cyprus) compared to others, rendering the evaluation of multiple algo- on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 rithms on each dataset necessary. Moreover, given the increasing International (CC BY 4.0) data sources worldwide, as well as the rising complexity in the contemporary data analysis processes, focus has been put on A similar approach to the EDSC is followed by the Mining multivariate time-series [8]. In a recent work presented in [7], Core Feature for Early Classification (MCFEC) [11] algorithm. authors examine the state-of-the-art methods for solely multi- The difference in this case lies in the criteria for shapelet se- variate time-series classification and compare inter-dimensional lection, and also in that MCFEC has the capability to process dependencies among classifiers, providing this way further in- multivariate time-series. Like EDSC, for each variable, all the sight to this field of research. sub-series of minimum and maximum lengths are extracted, but A recent work was also conducted in ETSC, reviewing ex- are additionally grouped using Silhouette Clustering [26]. Then, isting approaches related to the field [10]. The algorithms are the best shapelets from each cluster are selected for classification categorized into four groups and the analysis is performed on based on an extension of the F1 -score metric. a theoretical level, highlighting their strengths, limitations, and Effective Confidence-based Early Classification (ECEC) [19] major concerns for each group. Our approach differs, since we utilizes subsequences, but in a quite different way. ECEC employs follow an empirical approach. We categorize ETSC algorithms a transformation algorithm called WEASEL [29] that extracts according to their internal structure and subcomponents and windows (e.g. {1137, 1227, 1205}), transforms them to symbols evaluate them on multiple datasets. Here, we distinguish three that form words (e.g. (abc, cba . . . )), and measures the frequency main types, i.e. those that rely on subsequence extraction and of their appearance in each time-series. In more detail, the user analysis, algorithms that are based on clustering of the time- provides as input (i) the desired window length and (ii) the length series, and others that employ ANNs. Note that some algorithms of the words. Then, the frequency of each word (e.g. [1, 2, 1, 0, 0] could be considered to belong to more than one category. In this for a time-series and all window lengths), is passed on to a lo- case, the type was chosen based on which component was the gistic regression classifier, which in turn produces probabilistic most significant for the classification task. predictions. ECEC truncates the input into t different prefix sizes, beginning from the 1/t of the time-series up to its full length. For each prefix, it trains a WEASEL classifier and performs a 5-fold 2.1 Subsequence-based Approaches cross validation on the dataset as an evaluation. The predictions The algorithms that use subsequences, in principle, isolate smaller acquired are then passed into a cost function, which, based on the windows of the time-series—termed as subseries—and manipulate probability of the prediction being correct, calculates a threshold them accordingly, to maximize information gain and achieve the θ . During the testing phase, using the classifier of the minimum earliest and most accurate results. We distinguish five cases that prefix size, a prediction is made for each prefix, and the corre- belong to this category. An example of such an algorithm is sponding cost is calculated. If the cost is higher than the threshold Early Distinctive Shapelet Classification (EDSC) [34] where the θ then the prediction is produced, otherwise the algorithm waits classification task is based on subseries extraction. Initially, the until enough data have been accumulated to form the next prefix. user provides as input the minimum and maximum lengths of the The WEASEL algorithm for transforming the input data com- windows that are to be extracted. Then, during the extraction step, bined with logistic regression is utilized by the Two-tier Early all the corresponding sub-series are isolated from the full time- and Accurate Series classifiER (TEASER) method [30]. This algo- series and, by using Chebyshev Inequality on each extracted part, rithm truncates the time-series into S prefixes, each containing length of time-series a threshold is calculated. This threshold controls the minimum more time points than the previous prefix. S similarity that a sub-series should have, so to be assigned in the For each prefix, TEASER applies the WEASEL - logistic regres- same class. Next, each sub-series is transformed into a triplet sion pipeline to obtain respective probabilistic values. In order containing the class of the time-series it was extracted from, the to ensure the reliability of the classification result, the outputs sub-series itself, and the threshold, forming this way a shapelet. are passed into a One-Class SVM constructed for each prefix, Having all the shapelets in hand, a utility function calculates that accepts or rejects a prediction. The final requirement for a measure similar to the F1 -score for each one, which represents the approval of a classification is that a prediction should be in some sense the distinctive capability of the shapelet, i.e., how consistently the same for v out of S prefixes, with v being an appropriate a shapelet is to be considered as a template for a algorithm’s hyper-parameter subject to optimization. particular class. Ranked by their utility values, the best shapelets Early Classification framework for time series based on class are selected and stored into a “pool” that constitutes the clas- Discriminativeness and Reliability(ECDIRE) [21] also uses part sification basis. To ease the reader we incorporate a running of the time-series to produce reliable and early predictions. The example for the biological case. Suppose we have the time-series algorithm requires as input the set E which contains time-series t = {1137, 1227, 1205, 1082, 893, 736, 664, 639, 631} which is part lengths at which there is high information value, in order to of a time-series from our biological dataset. The user gives as reduce granularity of data. Consequently, 10 times 5-fold cross- input minimum length 2 and maximum, 4. Firstly, all the sub- validation is done for each length using Gaussian Process Classi- series between sizes 2 and 4 are going to be extracted from the fiers [24], storing the predictions and probabilities for each class. time-series. Let one sub-series be tpr ime = {1137, 1227, 1205}. Based on the predictions, an earliness threshold is constructed, For tpr ime , a threshold δ = 3.05 is calculated based on Cheby- which indexes the time-point from which class instances differ shev’s inequality, which ensures the similarity of tpr ime only from the other labeled instances. Using the probabilities from the with time-series of distance less or equal than the value of δ . cross-validation, a second probabilistic threshold is calculated, After this step, spr ime = (class, tpr ime , δ ) is created, and then, that ensures the reliability of the prediction. A classification result the weighted F1 -Score of each shapelet is calculated and stored must satisfy both thresholds in order to be passed as a prediction. in a list. Assuming that the list is [1.3, 3.67, 0.83] and 3.67 is the A more recent approach for early time-series classification is score of tpr ime , the top K shapelets from the list are selected as the Distance Transformation based Early Classification (DTEC) the classifier shapelets. When the distance of the test time-series algorithm [36]. DTEC relies on the transformation of data and from a shapelet is less than δ , then the shapelet’s class is assigned. the usage of probabilistic classifiers to achieve early and accurate results. In particular, DTEC extracts subsequences of time-series The Multivariate LSTM-FCNs (MLSTM-FCN) [18] method uti- and maps them to another space based on distances. The trans- lizes neural networks and operates on multivariate time-series. formed data are then used to train a probabilistic classifier by This is actually an extension of the original algorithm LSTM- forming the confidence area, which is drawn based on the proba- FCN [17], which supported only univariate datasets. Both algo- bility strength of the class to be passed as a prediction, making rithms duplicate the input and pass the data to two sub-models: this way the classification result more reliable. The first sub-model consists of three convolutional neuron lay- We selected EDSC, TEASER, and ECEC for evaluation from ers. The use of CNNs is widely adopted in regular time-series this category. EDSC is considered a widely referenced method classification [37], since it extracts important features from se- for ETSC, since most recently developed algorithms use it as a quences of data that often derive from imagery. In the described baseline. TEASER and ECEC are also interesting choices, since model, data that exit from each of the first two layers are first they follow different internal processes and have readily available batch normalized [15] and then passed to an activation function, implementations that provide good results. i.e. a Rectified Linear Unit (ReLU). In order to maximize the ef- ficiency of the model on multivariate time-series, the activated output is also passed into a squeeze and excitation block [13]. 2.2 Clustering-based Approaches A squeeze and excite network consists of a global pooling layer Clustering is the task of forming groups of data similar to each and two dense layers that give to each variable of the time-series other based on some similarity measure. We refer to two dis- a unique weight, so that to increase the sensitivity of predic- tinct cases in this category. Early Classification on Time Series tions. The second sub-model, consists of a masking layer and (ECTS) [33] is an algorithm that relies on clustering, as well as, the output is passed on an attention based LSTM. LSTM [12] the 1-Nearest Neighbors (1-NN) search, to perform accurate and is a Recurrent Neural Network model, popular in time-series early classification. In detail, first the 1-NN set of each time-series classification, because of its ability to remember inter time-series is calculated. Based on that set, the algorithm follows the Reverse dependencies with minimal computational cost and high accu- Nearest Neighbors (R-NN) approach, which in a sense shows how racy for time-series of length less than a thousand time points [2]. many time-series consider the examined one as a nearest neigh- Attention based LSTMs are variations of the normal LSTMs, with bor. Based on the consistency of the R-NN set for each different increased computational complexity, which nevertheless results subsequence of the time-series, ECTS computes the minimum to increased overall performance. The output of the two sub- prediction length for each time-series that represents how many models is concatenated and passed through a dense layer with time-points are required for the time-series to act as a reliable as many neurons as the classes, and via a softmax activation classifier. First, the time-series are clustered using agglomera- function predictions as a probabilistic output are provided. tive hierarchical clustering [31] based on Euclidean distances. Another similar approach is the Multi-Domain Deep Neural The minimum prediction length of each cluster is considered Network (MDDNN) [14] algorithm, which utilizes simple neu- and appointed to each time-series in the cluster. During the test ral networks based on the same principles as the previous one. phase, time-series are paired with NN, using initially only the However, this algorithm consists of two sub-models which are first time-point and increasing through each iteration, until the identical with respect to their structure, but differ on the input time-series they pair with has minimum prediction length less they receive. The first model takes as input the z-normalized or equal to that time-point’s position. raw time-series represented on the time domain, and the sec- TRIGGER [4] is another algorithm that uses clustering in com- ond takes as input the z-normalized fast-fourier transform of bination with a cost function to estimate if a time-series can the time-series represented on the frequency domain. Both mod- provide a trustworthy result. The time-series are grouped into els consist of two CNN layers. After each layer the output is clusters using k-means [31] based on Euclidean distances. For again batch normalized and passed onto a one-dimensional max each cluster k and time-step t ∈ [1, . . . ,T ], a Multilayer Percep- pooling layer, activated with a ReLU layer. The output of each tron [31] hkt is trained using time-series of length t from the clus- model is then concatenated, flattened, and passed through two ter k. When a new time-series arrives, a “membership” probability dense layers. Finally a softmax layer is applied, just as in the is calculated to appoint it to a cluster and, based on the cluster MLSTM-FCN. We selected the MLSTM-FCN for our evaluation, and the length of the time-series, a prediction is performed. The since there are much more details available for this particular results of the classifier are handled with a cost function, which approach than others. Also, it is worth noting that the authors is calculated for every time-point between the current and the provide respective source-code and datasets to be used by the last of the time-series. If the cost function returns a value of zero community. the result of the classifier is deemed safe and is returned along with the size of the incomplete time-series up to the currently ex- amined time-step. We choose to incorporate the ECTS algorithm in our evaluation since, similarly to EDSC, it is one of the oldest and most known approaches for early time-series classification. 3 DATASETS In this section, we describe the datasets used in our evaluation, i.e., the publicly available UCR [5], as well as two new ones, from 2.3 Artificial Neural Networks-based the domains of life-sciences and maritime. Note that, instances Approaches included from the publicly available collection are univariate, First of all, a typical ANN-based approach performs time-series while the two new cases are both multivariate. Also, since not classification given data in the full-length time-series. However, every selected algorithm is designed to work on multivariate time- this can be tackled on a higher level, by supplying to the in- series, we implemented a simple voting method for classification. put only parts of the time-series, thus making them capable of This applies to the ECTS, EDSC, TEASER, and ECEC algorithms. conducting ETSC. We refer to two cases in this category. 3.1 Dataset of cancer cell simulations Harmonic Mean. Harmonic mean is a metric proposed in [30] This dataset includes the count of tumor cells population during and it represents the Pythagorean harmonic mean between earli- the administration of specific drug treatments as resulting from ness and accuracy. large-scale model exploration experiments. Each time-series rep- 2 · (1 − earliness) · accuracy HarmonicMean = resents one simulation experiment of a specific drug treatment (1 − earliness) · accuracy configuration, which differs from the others based on a set of Note that the algorithms were evaluated using 5-fold cross-validation configurable parameters, such as the time of administration, the on the two real-world datasets. duration, and the drug concentration. A time-point of each in- stance corresponds to three different integer values, indicating Training and Testing times. Measured in minutes and seconds the number of Alive, Necrotic and Apoptotic cells. The time-series respectively. are preclassified as interesting or non-interesting, based on if the drug treatment was effective or not, according to a particular clas- 4.2 Comparison of ETSC Approaches sification rule that was defined by domain experts. The dataset We present results from an evaluation of 5 state-of-the-art algo- consists of 644 time-series, each having 48 time-points. The time- rithms, i.e, EDSC, ECTS, TEASER, ECEC, and MLSTM-FCN. The series instances included in the dataset were created by executing implementations were readily available, except for ECTS, which is a parallel version of the PhysiBoSSv2 simulator.1 a custom implementation. In particular, ECEC as well as TEASER Since the dataset originates from large-scale simulation ex- are written in Java, while ECTS and MLSTM-FCN in Python, and, periments oriented to drug treatment discovery, the classes are finally, EDSC in C++. The algorithms were tested using 5-fold rather imbalanced. Specifically, the configurations that produced cross validation for the maritime and biological datasets. The interesting time-series constitute the 20% of the dataset, while results were run on a server with an Intel Xeon E5-2630 2.60GHz the remaining 80% account for non-interesting cases. Moreover, processor, with 25-cores and 252 GB RAM. many interesting and non-interesting examples tend to be very Since most algorithms don’t support multivariate time-series, a similar during the early stages of the simulation, until the drug voting method was applied. According to the voting method, clas- treatment takes effect, which, as observed, is usually after the sifiers are trained, and tested on each of the folds, but separately first 30% of the time-points of the time-series. This fact makes the for each dimension of the input vector shape. Upon collecting dataset an interesting benchmark, since it is nearly impossible to each of the outputs, the most frequent or voted prediction was obtain accurate predictions in less time than this. chosen, however, assigned with the worst earliness among them. In the case of equal votes among classifiers, the earliest one made 3.2 Dataset of vessel position signals is selected. MLSTM-FCN is tested on the [40%, 50%, 60%] of the The maritime dataset contains data of nine different sea vessels time-series length in each dataset, and the length with the best that cruise around the port of Brest, France. This dataset is derived results based on harmonic mean is chosen, for each dataset indi- from [23, 25]. Each measurement corresponds to a vector of val- vidually. The number of LSTM cells is set to 8 for all experiments ues for longitude, latitude, speed, heading, course over ground of as default. For TEASER, S is set to 10 for the biological and mar- a vessel at a given time-point. The time-series are fragmented to itime case whereas for the UCR dataset is fixed to 20. The code a specific length, and divided into two classes, based on whether of our testing framework is available publicly, accompanied with the ship entered or didn’t enter this particular port during the the respective datasets and algorithm configurations.2 time course of the corresponding time-series fragment. In total, It should be noted that fig. 1 presents for the biological and there are 5249 instances of 30 time-points each. maritime cases using the 5-fold cross validation over the multi- variate datasets. Also for the UCR datasets, each entry is a dataset 4 EXPERIMENTAL RESULTS and the F1 -Score bar represents the average F1 -score from all dataset’s classes. The training and testing time plots were made 4.1 Evaluation Metrics using log-scale on the y-axis. For our evaluation we use five well-known metrics. ECTS, as one of the older and simpler approaches in the ETSC Earliness. Earliness counts how many time-points were re- algorithmic ‘arsenal’, has considerably high training times, an quired to make the final prediction for each time-series. undesired characteristic that results to the algorithm being unable to provide results within several hours. In particular, for the consumed time series lenдth Earliness = maritime dataset, this algorithm did not progress after 48 hours total lenдth o f timeseries of execution. This can be explained by the fact that ECTS uses Accuracy. Accuracy is translated to how many time-series 1-NN as the main procedure behind the early classification. 1-NN were classified correctly during the testing process. requires to calculate Euclidean distances between all time-series, resulting to very high computation times for big datasets both number o f correct predictions Accuracy = in length and size. One notable example is the StarLightCurves number o f test instances dataset, which contained 1000 training time-series of length 1024, F1 -score. F1 -score is the harmonic mean of precision and recall: resulting to more than 65 hours of training and testing times. tp However, for the biological dataset (fig. 1a), the computation F 1 -score = time was on average around 30 minutes per fold. It reaches at 1 tp + (f p + f n) top accuracy and over 90% F1 -scores which shows that ECTS can 2 successfully tackle the class imbalance. where tp, f p, and f n are the true positives, false positives, and EDSC, the first approach to propose the use of sub-sequences false negatives respectively. for early time-series classification, is also struggling with time 1 https://github.com/xarakas/spheroid-tnf-v2-emews 2 https://github.com/Eukla/ETS 1.0 1.0 1.0 0.8 0.8 0.8 Metric 0.6 F1-Score 0.6 0.6 Score Score Score Earliness 0.4 Accuracy 0.4 Metric 0.4 Metric HarmonicMean F1-Score F1-Score Earliness 0.2 Earliness 0.2 0.2 Accuracy Accuracy HarmonicMean HarmonicMean 0.0 0.0 0.0 ECTS EDSC TEASER MLSTM ECEC ECTS EDSC TEASER MLSTM ECEC ECTS EDSC TEASER MLSTM ECEC Method Method Method (a) Biological (b) Maritime (c) UCR Dataset 103 Dataset UCR UCR 102 Biological 102 Biological Maritime Maritime 101 Time (min) Time (sec) 101 100 10 1 100 10 2 10 3 ECTS EDSC TEASER MLSTM ECEC ECTS EDSC TEASER MLSTM ECEC Algorithm Algorithm (d) Training time (e) Testing Time Figure 1: Evaluation measures (top) and training/testing times (bottom) for each dataset and each of the algorithms. complexities. In fact, training and testing times are greatly influ- correct. For the biological dataset, the ECEC has the highest accu- enced by the size of the dataset and, more importantly, the length racy with very low length usage, being the best one in harmonic of the time-series. Having lengths of more than a thousand time- mean among the evaluated algorithms. However, this could be points, and dataset sizes larger than a hundred time-series leads considered as a misleading result due to the weakness of the to increased computation times, e.g. the Haptics dataset, which confidence threshold as a measure of classifications reliability, has 155 time-series of size 1093 time-points and training times of and the strong predisposition of the algorithm to choose the more than 24 hours. For the available biological dataset the per- majority class as a prediction over the minority. Because the formance is high in terms of accuracy. However, the data usage is 80% of the dataset is considered as non-interesting, even if all also very high. For the maritime (fig. 1b), the performance is sub- predictions are the same, high accuracy can be attained. Note optimal, since F1 -score is relatively low and the earliness reaches though that the F1 -Score does not suffer from this. For the Mar- to 100%. When EDSC is unable to find a prediction, returns the itime dataset, where the classes are more balanced, the harmonic most frequent class label with the maximum data-usage. EDSC mean of ECEC is significantly low, with about 80% data usage. It for both the biological and the maritime dataset was unable to should be noted that the F1 -Score is on the higher end, at about make predictions and returned the default answer. 92% which is explained by increased data usage. Times increase We can observe increased earliness values for ECTS and EDSC. significantly with the size of each dataset, since for datasets such Recall that the biological dataset consists of 3 variables, the Alive, StarLightCurves training needed more than 5 hours. Necrotic and Apoptotic cell counts at each time-step. While for TEASER, uses WEASEL just like ECEC but is shown to perform the Alive and Necrotic variables numerical value differences differently. The computation times for the biological, maritime exist between different class instances, for the Apoptotic, the and UCR dataset are very low. Testing times are also the lowest. number of cells is more consistent in both classes. This creates As one of the most promising presented algorithm in this paper, ambiguity for a simple classifier such as ECTS and EDSC, resulting achieves steadily over 80% accuracy, with less than 50% earliness in very delayed classification when using the Apoptotic variable on all datasets. Also this was achieved by conducting multivariate as training. During a test instance, predictions with minimal data time-series classification using the simple voting method. The usage are produced by the classifiers trained with the Alive and 3-step prediction process containing, the transformation and lo- Necrotic subseries but delayed classifications are made from the gistic regression, the One-Class SVM, and the check of prediction Apoptotic trained ones. The voting classification method chooses consistency along with the hyperparameter searching step, give the most voted prediction, which is usually the one made from reliable results, very quickly, remaining unaffected from dataset the Apoptotic classifier, and assigns as earliness the largest value sizes. A weakness of TEASER is the high data usage, in the Inli- among the three options, which in this dataset is the one from neSkate dataset of UCR (fig. 1c) for example, which, nevertheless, Apoptotic. For this reason earliness becomes worse. can be tackled by increasing the available RAM. Also the F1 -Score Up next, we have the ECEC method. As already explained, the is high but still has room for improvement. ECEC spilts the dataset to prefixes and computes a confidence Lastly, MLSTM-FCN uses Deep Neural Networks to make pre- threshold based on the a priori probability of a prediction being dictions. Since we essentially run MLSTM-FCN 3 times for each dataset of the UCR, for the [40%, 50%, 60%] of the dataset, both [10] A. Gupta, H. P. Gupta, B. Biswas, and T. Dutta. 2020. Approaches and Appli- training and testing times are generally high for the UCR. How- cations of Early Classification of Time Series: A Review. IEEE Transactions on Artificial Intelligence 1, 1 (2020), 47–61. ever, for the two distinct datasets using 40% of the time-series [11] G. He, Y. Duan, R. Peng, X. Jing, T. Qian, and L. Wang. 2015. Early classification had the minimal trade-off in accuracy and earliness. For each on multivariate time series. Neurocomputing 149 (02 2015), 777–787. [12] S. Hochreiter and J. Schmidhuber. 1997. Long Short-term Memory. Neural fold, the MLSTM-FCN, proved the biggest divergence of accuracy. computation 9 (12 1997), 1735–80. Weights are randomly initialized at the start of each run, and [13] J. Hu, L. Shen, and G. Sun. 2018. Squeeze-and-Excitation Networks. In 2018 Neural Networks are generally sensitive to dataset information, IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7132–7141. [14] H. Huang, C. Liu, and V. S. Tseng. 2018. Multivariate Time Series Early sometimes leading to erratic results. Nevertheless, MLSTM-FCN Classification Using Multi-Domain Deep Neural Network. In 2018 IEEE 5th shows potential in both unique datasets with very low computa- Int. Conf. on Data Science and Advanced Analytics (DSAA). 90–98. tion times, good accuracy, and early predictions. The F1 -Score [15] S. Ioffe and C. Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. of the 32nd Int. Conf. is also high on average except for a few particular folds and on Machine Learning 37. JMLR.org, 448–456. datasets. Also, for the UCR datasets, MLSTM-FCN shows some [16] M. Kachuee, S. Fazeli, and M. Sarrafzadeh. 2018. ECG Heartbeat Classification: A Deep Transferable Representation. In 2018 IEEE Int. Conf. on Healthcare low accuracies compared to other algorithms. Insufficient results Informatics (ICHI). 443–444. derive from the lack of hyper parameter optimization that has [17] F. Karim, S. Majumdar, H. Darabi, and S. Chen. 2018. LSTM fully convolutional not yet been integrated on the used MLSTM-FCN for univariate networks for time series classification. IEEE Access 6 (2018), 1662–1669. [18] F. Karim, S. Majumdar, H. Darabi, and S. Harford. 2019. Multivariate LSTM- and multivariate time-series. For the LSTM layer, we used the FCNs for time series classification. Neural Networks 116 (2019), 237 – 245. default 8 cells. Using a grid search for cells might improve UCR [19] J. Lv, X. Hu, L. Li, and P. Li. 2019. An Effective Confidence-Based Early results, but would drastically increase computation times. Classification of Time Series. IEEE Access 7 (2019), 96113–96124. [20] L. Miner, P. Bolding, J. Hilbe, M. Goldstein, T. Hill, R. Nisbet, N. Walton, and G. Miner. 2016. Practical Predictive Analytics and Decisioning Systems for Medicine: 5 SUMMARY AND FURTHER WORK Informatics Accuracy and Cost-Effectiveness for Healthcare Administration and Delivery Including Medical Research (1st ed.). Academic Press, Inc., USA. In this work, we evaluated state-of-the-art for ETSC on pub- [21] U. Mori, A. Mendiburu, E. Keogh, and J. A. Lozano. 2017. Reliable early licly available datasets and newly introduced ones originating classification of time series based on discriminating the classes over time. Data Mining and Knowledge Discovery 31, 1 (01 Jan 2017), 233–263. from real-world applications. The results were very positive, [22] K. Patroumpas, E. Alevizos, A. Artikis, Marios Vodas, N. Pelekis, and Y. since all the algorithms achieved over 80% accuracy, with the Theodoridis. 2017. Online event recognition from moving vessel trajecto- most recently introduced ones using less than half of the avail- ries. GeoInformatica 21 (01 04 2017), 389–427. [23] K. Patroumpas, D. Spirelis, E. Chondrodima, H. Georgiou, P. Petrou, P. Tam- able time-points, i.e., producing accurate predictions quite early. pakis, S. Sideridis, N. Pelekis, and Y. Theodoridis. 2018. Final dataset of Trajec- Each algorithm performed slightly differently on each dataset, tory Synopses over AIS kinematic messages in Brest area (ver. 0.8). but TEASER and MLSTM-FCN provided a stable and good per- [24] C. E. Rasmussen and C. K. I. Willliams. 2006. Gaussian processes for machine learning. The MIT Press, Cambridge. formance on both classification precision and earliness, in all [25] C. Ray, R. Dreo, E. Camossi, and A. L. Jousselme. 2018. Heterogeneous Inte- datasets. We also provide a public repository with all of the algo- grated Dataset for Maritime Intelligence, Surveillance, and Reconnaissance, 10.5281/zenodo.1167595. rithms and datasets used, therefore allowing to test and choose [26] P. J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and val- the most appropriate algorithm for new applications. idation of cluster analysis. Journal of computational and applied mathematics A possible extension of this suite is the integration of online 20 (1987), 53–65. [27] I. Rudan, V. Frančić, M. Valčić, and M. Sumner. 2019. Early detection of data streams to preform time-series classification, allowing to vessel collision situtations in a vessel traffic services area. Transport 35 (2019), test such algorithms in realistic settings. 121–132. [28] I. H. Sarker, A. S. M. Kayes, and P. Watters. 2019. Effectiveness analysis of machine learning classification models for predicting personalized context- ACKNOWLEDGEMENT aware smartphone usage. Journal of Big Data 6 (2019), 57. [29] P. Schäfer and U. Leser. 2017. Fast and Accurate Time Series Classification This work has received funding from the EU Horizon 2020 RIA with WEASEL. In Proc. of the 2017 ACM Conf. on Information and Knowledge program INFORE under grant agreement No 825070. Management. Association for Computing Machinery, 637–646. [30] P. Schäfer and U. Leser. 2020. TEASER: early and accurate time series classifi- cation. Data Mining and Knowledge Discovery 34, 5 (01 Sep 2020), 1336–1362. REFERENCES [31] P.-N. Tan, M. Steinbach, and V. Kumar. 2016. Introduction to data mining. [1] N. Brax, E. Andonoff, and M. Gleizes. 2012. A Self-adaptive Multi-Agent System Pearson Education India. for Abnormal Behavior Detection in Maritime Surveillance. In Agent and Multi- [32] G. K. F. Tso and K. K. W. Yau. 2007. Predicting electricity energy consumption: Agent Systems. Technologies and Applications. Springer Berlin Heidelberg, A comparison of regression analysis, decision tree and neural networks. Energy Berlin, Heidelberg, 174–185. 32, 9 (2007), 1761 – 1768. [2] J. Brownlee. 2016. Sequence Classification with LSTM Recurrent Neural [33] Z. Xing, J. Pei, and P. Yu. 2011. Early classification on time series. Knowledge Networks in Python with Keras. https://machinelearningmastery.com/ and Information Systems - KAIS 31 (01 04 2011), 105–127. sequence-classification-lstm-recurrent-neural-networks-python-keras/ [34] Z. Xing, J. Pei, P. Yu, and K. Wang. 2011. Extracting Interpretable Features for [3] P. Chantamit-o pas and M. Goyal. 2017. Prediction of Stroke Using Deep Learn- Early Classification on Time Series. Proc. of the 11th SIAM Int. Conf. on Data ing Model. In Neural Information Processing. Springer International Publishing, Mining, SDM 2011, 247–258. Cham, 774–781. [35] X. Xu, S. Huang, Y. Chen, K. Browny, I. Halilovicy, and W. Lu. 2014. TSAaaS: [4] A. Dachraoui, A. Bondu, and A. Cornuéjols. 2015. Early Classification of Time Time Series Analytics as a Service on IoT. In 2014 IEEE Int. Conf. on Web Series as a Non Myopic Sequential Decision Making Problem. In Machine Services. IEEE, 249–256. Learning and Knowledge Discovery in Databases. Springer Int. Publishing, [36] L. Yao, Y. Li, Y. Li, H. Zhang, M. Huai, J. Gao, and A. Zhang. 2019. DTEC: Cham, 433–447. Distance Transformation Based Early Time Series Classification. Proceedings [5] H. A. Dau, A. Bagnall, K. Kamgar, C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. of the 2019 SIAM International Conference on Data Mining, 486–494. Ratanamahatana, and E. Keogh. 2019. The UCR Time Series Archive. [37] B. Zhao, H. Lu, S. Chen, J. Liu, and D. Wu. 2017. Convolutional neural networks arXiv:cs.LG/1810.07758 for time series classification. Journal of Systems Engineering and Electronics [6] J. Demšar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets. 28 (02 2017), 162–169. J. Mach. Learn. Res. 7 (Dec. 2006), 1–30. [7] B. Dhariyal, T. Le Nguyen, S. Gsponer, and G. Ifrim. 2020. An Examination of the State-of-the-Art for Multivariate Time Series Classification. In LITSA, ICDM 2020. [8] J. Fan, F. Han, and H. Liu. 2014. Challenges of big data analysis. National science review 1, 2 (2014), 293–314. [9] N. Giatrakos, N. Katzouris, A. Deligiannakis, A. Artikis, M. Garofalakis, G. Paliouras, H. Arndt, R. Grasso, R. Klinkenberg, M. Ponce-De-Leon, et al. 2019. Interactive extreme: Scale analytics towards battling cancer. IEEE Technology and Society Magazine 38, 2 (2019), 54–61.