<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Information Technology and Interactions, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for Anomaly Detection in an Industrial Transporting System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kyrylo Kadomskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska st., Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>0</volume>
      <fpage>2</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>Cyber-Physical Production Systems (CPPS) require robust techniques for detecting anomalies and their root causes. Model-based diagnosis is a commonly used approach in which a dynamic process model captures spatio-temporal features of the system's behavior. Because precise mathematical or expert modeling is infeasible, algorithms have been developed for learning such models from system observations. These algorithms are highly domain-specialized and yield relatively poor performance in other use cases. In this paper CPPS data is used on which existing models have proven ineffective, and the prospect of applying a deep learning approach to constructing a process model in such systems is investigated. The main idea is to move from models with a fixed structure to more universal techniques that learn the optimal structure from the data. The challenges of evaluating dynamic system models of this class are identified, and evaluation criteria are proposed for representative comparison and benchmarking of the models. It is shown that deep learning models provide an increase in anomaly detection score but require additional verification of model robustness.</p>
      </abstract>
      <kwd-group>
        <kwd>anomaly detection</kwd>
        <kwd>autoencoder</kwd>
        <kwd>model evaluation</kwd>
        <kwd>cyber-physical production systems</kwd>
        <kwd>industrial IoT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>Industrial AI is an emergent research field that is actively revolutionizing production plants.
Increasing product variety, product complexity and pressure for efficiency lead to systems that
contain a growing set of sensors to facilitate automation [1]. In this context diagnosis of complex
production processes has gained new attention due to research agendas such as Cyber-Physical
Production Systems (CPPS) [2, 3] and the initiatives of the Industrial Internet of Things (IIoT) and Industrie
4.0. In these agendas the most important goals of self-diagnosis are the identification of anomalous
system behavior, suboptimal energy consumption, or wear in CPPS [4, 5].</p>
      <p>The most accepted method is model-based diagnosis [4], where the features of normal and
anomalous system behavior are captured by a process model. Modern CPPS are adaptable and
changeable, which makes both precise mathematical modelling and manual expert modelling costly
and ineffective [6]. Thus, to build the model, the process features must be extracted from sensory
measurements. As the process is often highly dynamic and variable, the most informative features are
spatio-temporal and include sequential events, timing and duration of specific process stages, or the
boundaries on observed values specific to each given stage.</p>
      <p>To achieve this, novel dynamic modelling techniques are being developed [3, 4, 7, 8] and are
currently replacing traditional methods, such as Statistical Process Control (SPC) and Bayesian
inference with time dependency. While showing good results in certain applications, these models yield
relatively poor performance in other similar use cases [9, 7, 8].</p>
      <p>2020 Copyright for this paper by its authors.</p>
      <p>The hypothesis is that this effect is due
to the limited nature and fixed structure of the spatio-temporal features learned by the model, which are
imposed by the structure of the model itself. The informativeness of the learned features then varies across
different physical systems, which can explain the observed effect.</p>
      <p>In this study Deep Learning (DL) models, such as autoencoders [10], are applied to remove the
mentioned limitation by automatically selecting the most relevant features and structure to represent
the data. These models are evaluated on a dataset that has proven challenging for novel
dynamic models, aiming for accurate benchmarking of the two approaches. This in turn
makes it possible to assess the limits of model-based anomaly detection in the given class of CPPS.</p>
      <p>As results of traditional evaluation techniques in CPPS applications may not be representative [9],
the challenges of evaluating dynamic system models in CPPS are identified by analyzing data
collected from DL models, and robustness criteria are proposed to increase evaluation
representativeness.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The system and the data</title>
      <p>Currently several projects are aimed at utilizing new technical possibilities to meet the challenges
of Industrial IoT and Industrie 4.0. Under the European Union’s Horizon 2020 research project
IMPROVE [11] a number of experiments in industrial systems were made, and environments were
designed specifically to test novel methods for self-diagnosis (including monitoring, anomaly
detection) and self-optimization [12]. The High Rack Storage System or HRSS is a demonstrator
system built in SmartFactoryOWL in Lemgo, Germany. The system transports pallets between its
different shelves, as shown in Figure 1.</p>
      <p>Measurements of position, power and voltage are made at each of the system’s drives during full
transporting cycles. Anomalies in this system include shortening of cycles, pauses, abnormal timing,
duration, or sequence of different process stages, as well as increase or decrease in one or multiple
signals at certain stages. The task is to detect HRSS anomalies and to localize them with time-step
precision by constructing the model of normal system behavior in an unsupervised manner.</p>
      <p>A time series dataset [13] was collected in this system under IMPROVE project and is being
actively used to test novel approaches to anomaly detection [9, 14]. The data contains 18 real-valued
signals sampled 15–20 times per second. It includes time series of 106 normal cycles (25,907
observations) and 111 cycles containing labelled anomalies (23,645 observations). The dataset is
unbalanced, with 76.0% negative examples. Statistical distributions of the classes (i.e. normal and
anomalous measurements) are not distinguishable in feature space, which excludes the direct application of
traditional Machine Learning (ML) methods for anomaly detection (e.g. linear models, decision trees,
SVM, etc.). At the same time, PCA analysis shows that the first 10 principal components cover 98.1% of
the data variation, so linear dimensionality reduction techniques can be useful. Data quality issues that
may affect model performance include high noisiness, strong outliers, and differences in feature ranges
of several orders of magnitude.</p>
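      <p>For illustration, an explained-variance figure of this kind can be reproduced from the singular values of the centered data matrix. The sketch below uses NumPy on synthetic low-rank data standing in for the 18 HRSS signals; the function name and the synthetic data are assumptions, not part of the dataset.</p>

```python
import numpy as np

def explained_variance_ratio(X: np.ndarray) -> np.ndarray:
    """Fraction of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                    # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values
    var = s ** 2                               # unnormalized component variances
    return var / var.sum()

# synthetic stand-in: 18 signals driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
X = latent @ rng.normal(size=(3, 18)) + 0.01 * rng.normal(size=(1000, 18))

ratio = explained_variance_ratio(X)
# for low-rank data the leading components carry nearly all the variance
```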
    </sec>
    <sec id="sec-3">
      <title>3. Background research</title>
      <p>As the statistical separation of classes is not possible in this task, constructing a model from
process measurements involves learning spatio-temporal patterns and events, which are typically
characterized by timing and duration of different process stages.</p>
      <p>To address this goal the use of dynamic process models such as Hybrid Timed Automata (HTA)
has been proposed [9]. To apply a discrete state HTA model to continuous process measurements the
unsupervised data preprocessing with self-organizing maps (SOM) and watershed transformations
were utilized. This method detects anomalies with timestep precision. Yet, having proven effective in
other CPPS applications [7, 8], it yields low performance on HRSS data with 30.76% F1 score and
26.7% recall (1516 true positives).</p>
      <p>In another study Deep Learning architectures were applied to the same data [14]: a Siamese
LSTM model was used for binary classification of full process cycles into ‘normal’ and ‘anomalous’
classes. Targeting a minimal false-positive score, this model yields 25.6% F1 measure, 88.2% precision,
and 15.0% recall, while being unable to localize anomalies within a cycle.</p>
      <p>In both studies anomaly detection rates are low compared to other CPPS applications; thus
learning a model from the process measurements in the HRSS plant remains a challenging task. To
address this task, the features of the HRSS system that explain the observed drop in efficiency must be identified.
As the results of the two studies are not directly comparable, the prospect of applying DL models
in this class of CPPS also remains an open question. Answering it requires strict evaluation of DL
models, as well as assessment of the effect of architectural variations. As the representativeness of
evaluation results remains unknown [9], additional measures must be developed to assess model
robustness.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The method</title>
      <p>In this study a set of autoencoder architectures is applied to the task of anomaly detection [10] in
a setup shown in Figure 2. The DL model, i.e. the autoencoder, is trained in an unsupervised manner to
reconstruct normal time series, targeting minimal reconstruction loss. Then the trained model is used
to reconstruct unseen time series with anomalies, where the reconstruction error is expected to peak at
anomalous intervals. To evaluate the model, the distributions of reconstruction error in normal and
anomalous intervals are analyzed for being statistically distinguishable. Finally, from the error
distributions a decision-rule classifier for anomaly detection is built in a supervised mode.</p>
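      <p>The decision step of this setup can be sketched as follows: per-timestep reconstruction error is computed between the input and the model's reconstruction, and timesteps whose error exceeds a threshold are flagged. The cycle, the injected anomaly, and the threshold value below are synthetic illustrations, not taken from the paper.</p>

```python
import numpy as np

def anomaly_flags(x: np.ndarray, x_hat: np.ndarray, threshold: float) -> np.ndarray:
    """Per-timestep anomaly decisions from reconstruction error.

    x and x_hat have shape (timesteps, features); returns shape (timesteps,).
    """
    err = np.abs(x - x_hat).mean(axis=1)   # MAE per timestep
    return err > threshold

# a perfectly reconstructed normal cycle with an injected amplitude anomaly
t = np.linspace(0, 2 * np.pi, 300)
x_hat = np.stack([np.sin(t), np.cos(t)], axis=1)   # model reconstruction
x = x_hat.copy()
x[100:120] += 1.5                                  # type-1 (amplitude) anomaly
flags = anomaly_flags(x, x_hat, threshold=0.5)
print(flags[100:120].all(), flags[:100].any())     # → True False
```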
      <p>The processing pipeline shown in Figure 2 comprises the stages: measurements (time series); preprocessing and feature engineering; features; autoencoder; distance measure; decision tree; reconstructed time series; anomaly prediction.</p>
      <sec id="sec-4-5">
        <title>Anomaly prediction</title>
        <p>This method detects anomalies with time-step precision, and most of the evaluated models can be
applied in real time. The models were
trained and evaluated in a setup allowing for direct benchmarking against the background research. For
the results to be representative, the models' robustness must be assessed. From the analysis of evaluation
results, two challenges were identified that must be met to achieve model robustness and
representativeness of evaluation.</p>
        <p>1. One distinct feature of the HRSS plant is low process variation in normal conditions, with 12.6%
mean absolute deviation from the averaged process cycle. Under such conditions an
autoencoder model can reach a local minimum of reconstruction error without reconstructing
individual features of distinct cycles (i.e. different process runs). In this case the model's output is
close to the average training cycle, with reconstruction loss close to the normal level. Such a model
performs well on HRSS data, where process variation is low, but it will not be useful in most
CPPS applications, where process variation is higher.</p>
        <p>2. The presence of anomalies may affect the model's performance in reconstructing neighboring
normal intervals. This is expected behavior in models with internal time dependency, which
are used in this study. In this case the model's robustness is limited by the type and the length of
anomalies, which typically are not known at training time.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.1. Robustness criteria</title>
      <p>To address the mentioned challenges, two robustness criteria are proposed for representative model
evaluation.</p>
      <p>RC1. Reconstructed variation rate is calculated in unsupervised mode using the training set of
normal process cycles, by comparing the step-wise standard deviation of the reconstructed signal
to the standard deviation of the model input:</p>
      <p>RC1 = σ(x̂) / σ(x).</p>
      <p>RC2. Reconstruction sensitivity to anomalies is assessed in supervised mode on the set of
anomalous cycles (i.e. the evaluation set) as the correlation between the error of reconstructing normal
intervals and the strength of anomalies in the same process cycle or time window:</p>
      <p>RC2 = corr(|x̂ − x| over normal intervals, s),</p>
      <p>where the anomaly strength measure s is domain-specific and accounts for the type, time length and
magnitude of the anomaly.</p>
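      <p>Under one possible reading of the criteria, RC1 and RC2 can be computed as below; the array shapes and the per-cycle anomaly-strength measure are assumptions made for illustration.</p>

```python
import numpy as np

def rc1(inputs: np.ndarray, recon: np.ndarray) -> float:
    """Reconstructed variation rate: step-wise standard deviation of the
    reconstructions relative to that of the inputs, averaged over
    timesteps and features. Arrays: (cycles, timesteps, features)."""
    return float(recon.std(axis=0).mean() / inputs.std(axis=0).mean())

def rc2(normal_errors: np.ndarray, anomaly_strengths: np.ndarray) -> float:
    """Correlation between reconstruction error on normal intervals and
    anomaly strength, one value per anomalous cycle."""
    return float(np.corrcoef(normal_errors, anomaly_strengths)[0, 1])

# a degenerate model that outputs the average training cycle gets RC1 = 0,
# since its reconstructions do not vary between cycles
rng = np.random.default_rng(1)
cycles = rng.normal(size=(10, 50, 4))
recon_avg = np.broadcast_to(cycles.mean(axis=0), cycles.shape)
```

A robust model should keep RC1 close to 1 while keeping RC2 low, i.e. its error on normal intervals should not track the anomalies elsewhere in the cycle.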
      <p>In the HRSS plant two distinct types of anomalies are present.</p>
      <p>Type 1: amplitude deviations from the normal signal.</p>
      <p>Type 2: deviations in timing, duration, or sequence of process stages.</p>
      <p>In practice, anomalous cycle duration and long-term type-2 anomalies have a noticeable effect on
RC2, as shown in Figure 3.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2. Evaluation techniques</title>
      <p>To evaluate the DL models, the HRSS dataset is split into three parts.</p>
      <p>The training set contains a randomly selected 2/3 of the normal cycles and is used to train the
autoencoder.</p>
      <p>The test set contains the remaining normal cycles and is used to validate the autoencoder and test it for
overtraining.</p>
      <p>The evaluation set contains all cycles with anomalies and is used to assess anomaly detection
performance and to justify the selection of the decision threshold.</p>
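      <p>Assuming cycles are stored as a list, the split can be sketched as follows (the 106 normal and 111 anomalous cycle counts come from the dataset description above; the function name is illustrative):</p>

```python
import numpy as np

def split_cycles(normal_cycles, anomalous_cycles, seed=0):
    """2/3 of normal cycles for training, the rest for testing,
    and all anomalous cycles for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(normal_cycles))
    cut = (2 * len(normal_cycles)) // 3
    train = [normal_cycles[i] for i in idx[:cut]]
    test = [normal_cycles[i] for i in idx[cut:]]
    return train, test, list(anomalous_cycles)

train, test, evaluation = split_cycles(list(range(106)), list(range(111)))
print(len(train), len(test), len(evaluation))  # → 70 36 111
```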
    </sec>
    <sec id="sec-7">
      <title>4.2.1. Choice of performance measures</title>
      <p>The architecture consists of two parts: the autoencoder which is used to reconstruct input time
sequence, and the classifier used for anomaly detection. So, two performance indicators are required.</p>
      <p>The performance of signal reconstruction was measured with the MAE loss function, which is more
outlier-resistant and more suitable for high-dimensional data compared to MSE.</p>
      <p>Anomaly detection performance was measured with the F1 score and confusion matrix. The F1 score
has the advantage of accounting for both false positives and false negatives. Compared to accuracy
and correlation-based measures, which also account for true negatives, the F1 score better suits an
unbalanced dataset. Also, the F1 score with a confusion matrix enables direct comparison with the
background research.</p>
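      <p>For reference, the F1 score follows directly from the confusion-matrix counts and, unlike accuracy, never touches the true negatives; a minimal helper (names are illustrative):</p>

```python
def f1_from_confusion(tp: int, fp: int, fn: int) -> float:
    """F1 score from confusion-matrix counts; true negatives are not used,
    which suits the unbalanced HRSS data."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```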
    </sec>
    <sec id="sec-8">
      <title>4.2.2. Selecting decision threshold</title>
      <p>In the anomaly detector a threshold must be set for the signal reconstruction error. Let L0 be the
distribution of signal reconstruction loss obtained on the training set (i.e. in normal cycles); let Ln and
La be the distributions of loss obtained on validation data, in normal intervals and in anomalies
respectively. Then the optimal value of the classification threshold θ can be assessed from L0, Ln and La
in two ways:</p>
      <p>Unsupervised: θ = E(L0) + 2σ(L0).</p>
      <p>Supervised: θ = argmax P(Ln, La, θ), where P is a performance measure for anomaly
detection.</p>
      <p>Experiments on HRSS data show that the optimal threshold value for different architectural
modifications varies in a broad range. While the first assessment can be far from optimal, the second
assessment may not be possible in most applications, where labelled anomalous data is not available.</p>
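      <p>Both selection modes can be sketched in a few lines. The supervised variant below maximizes F1 over candidate thresholds drawn from the observed losses, which is one concrete instance of the argmax over a performance measure; variable names are illustrative.</p>

```python
import numpy as np

def unsupervised_threshold(train_losses: np.ndarray) -> float:
    """Mean plus two standard deviations of the training (normal) losses."""
    return float(train_losses.mean() + 2 * train_losses.std())

def supervised_threshold(normal_losses: np.ndarray, anomalous_losses: np.ndarray) -> float:
    """Candidate threshold maximizing F1 on labelled validation losses."""
    losses = np.concatenate([normal_losses, anomalous_losses])
    labels = np.concatenate([np.zeros(len(normal_losses)), np.ones(len(anomalous_losses))])
    best_theta, best_f1 = 0.0, -1.0
    for theta in np.unique(losses):          # each observed loss is a candidate
        pred = losses >= theta
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_theta, best_f1 = float(theta), f1
    return best_theta
```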
    </sec>
    <sec id="sec-9">
      <title>4.2.3. Evaluation steps</title>
      <p>Evaluation steps include:
1. calculating performance measures;
2. assessing the statistical separation between the autoencoder's responses to normal and anomalous
signals (i.e. the corresponding loss distributions);
3. assessing robustness criteria RC1 and RC2;
4. selecting the optimal model by maximal performance, among models that have passed
robustness tests.</p>
    </sec>
    <sec id="sec-10">
      <title>5. Models</title>
      <p>The DL models being tested are divided into two groups by DL architecture type: LSTM and
convolutional. In each group the first model is a traditional architecture used for anomaly detection.
The other models are built to assess the effect of architectural modifications on model performance.</p>
      <p>The choice of the model's hyper-parameters affects both experimental performance and
robustness. Hyper-parameters include the number, types and sizes of layers, the compression rate of the
autoencoder, the use of dropouts, as well as internal layer parameters (e.g. kernel size, activation
function). As no computationally effective techniques exist for finding the optimal architecture
through hyper-parameter selection, this task remains tedious and highly intuition-driven
[15, 16]. In this study a grid search approach was applied for each model type, obtaining the models
shown in Table 1.</p>
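      <p>Such a grid search simply enumerates the Cartesian product of hyper-parameter values and evaluates a model for each combination; the parameter names and values below are illustrative, not the grid actually used in the study.</p>

```python
from itertools import product

def grid_candidates(grid: dict):
    """Yield every hyper-parameter combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# hypothetical search space
grid = {
    "layers": [2, 3, 4],
    "units": [32, 64],
    "dropout": [0.0, 0.2],
}
candidates = list(grid_candidates(grid))
print(len(candidates))  # → 12
```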
    </sec>
    <sec id="sec-11">
      <title>6. Experimental setup</title>
      <p>The models were implemented using Keras with a TensorFlow backend. Training was performed
using the Adam optimizer and MAE loss function with learning rate α = 0.005, β1 = 0.9, β2 =
0.999, and fuzzy factor ε = 10⁻⁷ [19]. The time series of complete process cycles, padded to a
constant length of 300 timesteps, were used as both input and target. Training was run for 130
epochs for LSTM models and 300 epochs for ConvNet models, in mini-batch mode with batch size
32. To rule out the effect of batch averaging on robustness criterion RC1, training was repeated in
stochastic mode (batch size 1). In this setup the number of epochs was reduced by a factor of 5, as
epochs are more time-consuming in this mode, but epoch-to-epoch convergence is faster. As no
significant influence of the batch size on the evaluation criteria was observed in experiments, only results
obtained in mini-batch mode are presented. As the reconstruction loss fluctuates between training
epochs, averaging across the last 10 epochs was used for a reliable performance estimate.</p>
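      <p>For reference, a single Adam update with the stated settings (learning rate 0.005, β1 = 0.9, β2 = 0.999, ε = 10⁻⁷) works out as follows. This is a minimal NumPy sketch of the optimizer from [19], not the Keras implementation used in the experiments.</p>

```python
import numpy as np

# optimizer settings from the experimental setup
ALPHA, BETA1, BETA2, EPS = 0.005, 0.9, 0.999, 1e-7

def adam_step(theta, grad, m, v, t):
    """One Adam update with bias correction [19]."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)                # bias-corrected moments
    v_hat = v / (1 - BETA2 ** t)
    theta = theta - ALPHA * m_hat / (np.sqrt(v_hat) + EPS)
    return theta, m, v

theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, np.array([2.0]), m, v, t=1)
# the first step moves the parameter by roughly ALPHA, regardless of gradient scale
```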
      <p>Data pre-processing included the following steps:</p>
      <p>• Introducing velocity features, calculated with second-order accurate central differences.</p>
      <p>• Dimensionality reduction from 24 to 12 components with PCA, which preserves 98.2% of the data
variance.</p>
      <p>• Normalization and scaling to the range (0, 1), which unifies the value ranges of features.</p>
      <p>• Time smoothing with a Gaussian kernel of width 15 and standard deviation 3.</p>
      <p>• Unifying time series length by padding.</p>
    </sec>
    <sec id="sec-12">
      <title>7. Results</title>
      <p>The reconstruction rates of all models fall into a narrow range, as shown in Table 2, with the exception of
the classic LSTM autoencoder (LSTM 1), which proved unable to accurately reconstruct the process.
Thus, the reconstruction loss measure cannot be used to assess the efficiency of an autoencoder model in
the CPPS anomaly detection task. Instead, statistical analysis of the loss distributions must be applied.</p>
      <p>[Table 2: evaluation results for models LSTM 1–6 and ConvNet 1–3 against the target value.]</p>
      <p>Evaluation results indicate that increasing the complexity of DL models (top down in Table 2) leads to
a higher performance measure. However, this is not the case with robustness. Deep LSTM models with
heterogeneous layers (LSTM 5 and LSTM 6) tend to average out all variation in the signal (i.e., have
low RC1), while deeper convolutional networks lose the ability to reconstruct the normal signal in the presence of
type 2 anomalies (i.e., have high RC2). It may be concluded that traditional performance metrics for
model evaluation are misleading in the case of HRSS, favoring models with low robustness according to
criteria RC1 and RC2.</p>
      <p>Considering both the performance measure and the proposed robustness criteria, the LSTM 4 model is
selected as the best choice for HRSS data. The model's architecture is shown in Figure 6.</p>
      <p>Compared to traditional LSTM autoencoder architectures [17, 18], this model introduces two
distinct architectural features. First, input time series are not flattened into a vector, and thus the
model has a lower compression rate. Experimental evidence (Table 2) suggests that preserving the time
dimension in the encoder generally leads to better performance in the anomaly detection task. Second, an
additional convolution layer is added at the model's bottleneck to capture long-term features in the input
time series.</p>
      <p>The obtained LSTM 4 model provides a 62.3±2.1% overall anomaly detection rate (F1 score) and
59.1% recall with 3350 true positives, as shown in Table 3. Compared to the baseline efficiency [9],
an increase of 102% in the anomaly detection score and an increase of 121% in recall are achieved.</p>
    </sec>
    <sec id="sec-13">
      <title>8. Conclusions</title>
      <p>The problem of model-based anomaly detection in industrial CPPS was addressed in the Deep
Learning paradigm by applying autoencoder architectures. The specific case of the HRSS plant was
studied, in which the construction and evaluation of process models had proven to be a challenging task.
The major challenges of applying Deep Learning models were identified as low process variation in
the training set and the presence of two distinct types of anomalies, whose detection requires different
algorithms or settings.</p>
      <p>It was shown that increasing model complexity, both in LSTM and convolution-based models,
makes it possible to increase anomaly detection performance but carries a strong robustness tradeoff. This indicates
that model evaluation in systems of this class cannot rely completely on performance metrics. For
evaluation results to be representative, detection rates of different anomaly types must be assessed
separately, and additional robustness criteria must be considered. Such criteria were proposed based
on statistical analysis of both the data and the model output in a supervised training context.</p>
      <p>In the studied industrial transporting system (HRSS), applying deep learning models and
autoencoder techniques allowed for a 102% performance gain in F1 score while preserving model
robustness. A wider assessment of the prospects of CPPS applications requires further experimental
research on cases with higher variance in the normal process as well as different types of anomalies.</p>
    </sec>
    <sec id="sec-14">
      <title>9. Acknowledgements</title>
      <p>This research utilizes the data collected at SmartFactoryOWL Lemgo, Germany, under the
European Union’s Horizon 2020 research project IMPROVE [12]. The data was made publicly
available by inIT [13] under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International license (CC BY-NC-SA 4.0).</p>
    </sec>
    <sec id="sec-15">
      <title>10. References</title>
      <p>[1] Factories of the future: multi-annual roadmap for the contractual PPP under HORIZON 2020,
Publications Office of the European Union, Luxembourg, 2013.
[2] E. A. Lee, Cyber physical systems: design challenges. In: Proceedings of the 11th IEEE
international symposium on Object Oriented Real-Time Distributed Computing (ISORC),
Orlando, FL, 2008, pp. 363–369. doi: 10.1109/ISORC.2008.25.
[3] O. Niggemann, C. Frey, Data-driven anomaly detection in cyber-physical production systems,</p>
      <p>AT – Automatisierungstechnik, 2015, vol. 63, issue 10. doi: 10.1515/auto-2015-0060.
[4] L. Christiansen, A. Fay, B. Opgenoorth, J. Neidig, Improved diagnosis by combining structural
and process knowledge, in: Proceedings of the 16th IEEE conference on Emerging Technologies
Factory Automation, ETFA, Toulouse, France, 2011, pp. 1–8. doi:
10.1109/ETFA.2011.6059056.
[5] S. Windmann, S. Jiao, O. Niggemann, H. Borcherding, A stochastic method for the detection of
anomalous energy consumption in hybrid industrial systems, in: Proceedings of the 11th
international IEEE conference on Industrial Informatics, INDIN, Bochum, Germany, 2013. doi:
10.1109/INDIN.2013.6622881.
[6] B. Vogel-Heuser, C. Diedrich, A. Fay, S. Jeschke, M. Kowalewski, S. Wollschlaeger,
P. Goehner, Challenges for software engineering in automation, Journal of Software Engineering
and Applications 7 (2014) 440–451. doi: 10.4236/jsea.2014.75041.
[7] N. Hranisavljevic, O. Niggemann, A. Maier, A novel anomaly detection algorithm for hybrid
production systems based on deep learning and timed automata, in: Proceedings of the 27th
international workshop on Principles of Diagnosis, DX-2016, Denver, Colorado, 2016.
[8] A. von Birgelen, O. Niggemann, Enable learning of hybrid timed automata in absence of discrete
events through self-organizing maps, in: O. Niggemann, P. Schüller (eds.), IMPROVE –
Innovative modelling approaches for production systems to raise validatable efficiency.
Technologien für die intelligente automation (Technologies for intelligent automation), vol. 8,
Springer Vieweg, Berlin, Heidelberg, 2018. doi: 10.1007/978-3-662-57805-6_3.
[9] A. von Birgelen, O. Niggemann, Using self-organizing maps to learn hybrid timed automata in
absence of discrete events, in: Proceedings of the 22nd IEEE international conference on
Emerging Technologies and Factory Automation, ETFA, Limassol, Cyprus, 2017, pp. 1–8. doi:
10.1109/ETFA.2017.8247695.
[10] C. Zhou, R. C. Paffenroth, Anomaly detection with robust deep autoencoders, in: Proceedings of
the 23rd ACM SIGKDD international conference on Knowledge Discovery and Data Mining,
KDD '17, Halifax NS, Canada, 2017, pp. 665–674. doi: 10.1145/3097983.3098052.
[11] IMPROVE. Creating the factory of the future with 4.0 solutions, 2016. URL:
http://improvevfof.eu/.
[12] Physical factory / demonstrators IMPROVE, 2016. URL:
http://improvevfof.eu/background/physical-factory-demonstrators.
[13] inIT, High storage system data for energy optimization, 2018. URL:
https://www.kaggle.com/inIT-OWL/high-storage-system-data-for-energy-optimization.
[14] M. Cerliani, Predictive maintenance with LSTM siamese network, 2019. URL:
https://towardsdatascience.com/predictive-maintenance-with-lstm-siamese-network-51ee7df29767.
[15] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, R. M. Patton, Optimizing deep learning
hyper-parameters through an evolutionary algorithm, in: Proceedings of the workshop on
Machine Learning in High-Performance Computing Environments, MLHPC '15, Austin, Texas,
2015, article no. 4. doi: 10.1145/2834892.2834896.
[16] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, The journal of machine
learning research, 13 (2012), pp. 281–305.
[17] A. Sagheer, M. Kotb. Unsupervised pre-training of a deep LSTM-based stacked autoencoder for
multivariate time series forecasting problems, Scientific Reports 9, 19038 (2019). doi:
10.1038/s41598-019-55320-6.
[18] A. H. Mirza, S. Cosan, Computer network intrusion detection using sequential LSTM neural
networks autoencoders, in: Proceedings of the 26th Signal Processing and Communications
Applications Conference, SIU, Izmir, Turkey, 2018, pp. 1–4. doi: 10.1109/SIU.2018.8404689.
[19] D. P. Kingma, J. Ba. Adam: a method for stochastic optimization, in: Proceedings of the 3rd
international conference for Learning Representations, CoRR, San Diego, CA, 2014,
abs/1412.6980.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>