Evaluating Deep Learning Models for Anomaly Detection
in an Industrial Transporting System
Kyrylo Kadomskyi
Taras Shevchenko National University of Kyiv, 64/13, Volodymyrska st., Kyiv, 01601, Ukraine


                 Abstract
                 Cyber-Physical Production Systems (CPPS) require robust techniques for detecting
                 anomalies and their root causes. Model-based diagnosis is a commonly used approach
                 in which a dynamic process model captures the spatio-temporal features of the
                 system's behavior. Because precise mathematical or expert modeling is infeasible,
                 algorithms have been developed for learning such models from system observations.
                 These algorithms are highly domain-specialized and yield relatively poor
                 performance in other use cases.
                 In this paper CPPS data is used on which existing models have proven ineffective.
                 The prospect of applying a deep learning approach to constructing a process model
                 in such systems is investigated. The main idea is to move from models with a fixed
                 structure to more universal techniques that learn the optimal structure from
                 dynamic observations. The challenges of evaluating dynamic system models of this
                 class are identified, and evaluation criteria are proposed for representative
                 comparison and benchmarking of the models. It is shown that deep learning models
                 provide an increase in anomaly detection score but require additional verification
                 of model robustness.

                 Keywords
                 Anomaly detection, autoencoder, model evaluation, cyber-physical production systems,
                 industrial IoT

1. Motivation
   Industrial AI is an emerging research field that is actively revolutionizing production plants.
Increasing product variety, product complexity and pressure for efficiency lead to systems that
contain a growing set of sensors to facilitate automation [1]. In this context the diagnosis of complex
production processes has gained new attention due to research agendas such as Cyber-Physical
Production Systems (CPPS) [2, 3] and the initiatives of the Industrial Internet of Things (IIoT) and
Industrie 4.0. In these agendas the most important goals of self-diagnosis are the identification of
anomalous system behavior, suboptimal energy consumption, and wear in CPPS [4, 5].
   The most accepted method is model-based diagnosis [4], where the features of normal and
anomalous system behavior are captured by a process model. Modern CPPS are adaptable and
changeable, which makes both precise mathematical modelling and manual expert modelling costly
and ineffective [6]. Thus, to build the model, the process features must be extracted from sensory
measurements. As the process is often highly dynamic and variable, the most informative features are
spatio-temporal and include sequential events, the timing and duration of specific process stages, or
the boundaries on observed values specific to each given stage.
   To achieve this, novel dynamic modelling techniques are being developed [3, 4, 7, 8] and are
currently replacing traditional methods such as Statistical Process Control (SPC) and Bayesian
inference with time dependency. While showing good results in certain applications, these models yield

IT&I-2020 Information Technology and Interactions, December 02–03, 2020, KNU Taras Shevchenko, Kyiv, Ukraine
EMAIL: cyril.kadomsky@gmail.com (K. Kadomskyi)
ORCID: 0000-0002-6163-3704
            © 2020 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)



relatively poor performance in other similar use cases [9, 7, 8]. The hypothesis is that this effect is due
to the limited nature and fixed structure of the spatio-temporal features learned by the model, which are
imposed by the structure of the model itself. The informativeness of the learned features then varies
across different physical systems, which can explain the observed effect.
    In this study Deep Learning (DL) models, such as autoencoders [10], are applied to remove this
limitation by automatically selecting the most relevant features and structure to represent the data.
These models are evaluated on a dataset that has proven challenging for the novel dynamic models,
aiming for an accurate benchmark of the two approaches. This in turn makes it possible to assess the
limits of model-based anomaly detection in the given class of CPPS.
    As the results of traditional evaluation techniques in CPPS applications may not be representative
[9], the challenges of evaluating dynamic system models in CPPS are identified by analyzing data
collected from DL models, and robustness criteria are proposed to increase the representativeness of
the evaluation.

2. The system and the data

    Currently several projects aim to utilize new technical possibilities to meet the challenges
of the Industrial IoT and Industrie 4.0. Under the European Union's Horizon 2020 research project
IMPROVE [11] a number of experiments in industrial systems were conducted, and environments were
designed specifically to test novel methods for self-diagnosis (including monitoring and anomaly
detection) and self-optimization [12]. The High Rack Storage System (HRSS) is a demonstrator
system built in the SmartFactoryOWL in Lemgo, Germany. The system transports pallets between its
different shelves, as shown in Figure 1.




Figure 1: A schematic representation of the system. The system consists of two stationary ('BLO',
'BRU') and two movable ('BHL', 'BHR') conveyor belts, as well as vertical rails ('HL', 'HR'). The arrows
show three of the possible transporting paths.
Source: https://www.kaggle.com/inIT-OWL/high-storage-system-data-for-energy-optimization.

   Measurements of position, power and voltage are made at each of the system’s drives during full
transporting cycles. Anomalies in this system include shortening of cycles, pauses, abnormal timing,
duration, or sequence of different process stages, as well as increase or decrease in one or multiple
signals at certain stages. The task is to detect HRSS anomalies and to localize them with time-step
precision by constructing the model of normal system behavior in an unsupervised manner.
   A time series dataset [13] was collected in this system under the IMPROVE project and is being
actively used to test novel approaches to anomaly detection [9, 14]. The data contains 18 real-valued
signals sampled 15–20 times per second. It includes time series of 106 normal cycles (25,907
observations) and 111 cycles containing labelled anomalies (23,645 observations). The dataset is
unbalanced, with 76.0% negative examples. The statistical distributions of the classes (i.e. normal and
anomalous measurements) are not distinguishable in feature space, which precludes the direct
application of traditional Machine Learning (ML) methods for anomaly detection (e.g. linear models,
decision trees, SVM). At the same time, PCA shows that the first 10 principal components cover 98.1%
of the data variation, so linear dimensionality reduction techniques can be useful. Data quality issues that

may affect model performance include high noisiness, strong outliers, and differences in feature ranges
of several orders of magnitude.
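   As an illustration, the PCA variance check described above can be reproduced with a short Python
sketch. The file and column names below are hypothetical placeholders for the public Kaggle CSVs,
not the paper's actual code:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical file name; the Kaggle dataset ships several CSV files.
df = pd.read_csv("HRSS_normal.csv")
# Drop non-signal columns if present (column names are assumptions).
X = df.drop(columns=["Timestamp", "Labels"], errors="ignore").values

# Standardize first: feature ranges differ by orders of magnitude.
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=10).fit(X)
print("Variance covered by 10 components: %.1f%%"
      % (100 * pca.explained_variance_ratio_.sum()))
```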

3. Background research
    As statistical separation of the classes is not possible in this task, constructing a model from
process measurements involves learning spatio-temporal patterns and events, which are typically
characterized by the timing and duration of different process stages.
    To address this goal, the use of dynamic process models such as Hybrid Timed Automata (HTA)
has been proposed [9]. To apply a discrete-state HTA model to continuous process measurements,
unsupervised data preprocessing with self-organizing maps (SOM) and watershed transformations
was utilized. This method detects anomalies with time-step precision. Yet, having proven effective in
other CPPS applications [7, 8], it yields low performance on the HRSS data, with a 30.76% F1 score
and 26.7% recall (1516 true positives).
    In another study Deep Learning architectures were applied to the same data [14]: a Siamese
LSTM model was used for binary classification of full process cycles into 'normal' and 'anomalous'
classes. Targeting a minimal false-positive rate, this model yields a 25.6% F1 measure, 88.2% precision,
and 15.0% recall, while being unable to localize anomalies within a cycle.
    In both studies anomaly detection rates are low compared to other CPPS applications; thus
learning a model from the process measurements of the HRSS plant remains a challenging task. To
address it, the features of the HRSS system that explain the observed drop in efficiency must be
identified. As the results of the two studies are not directly comparable, the prospect of applying DL
models in this class of CPPS also remains an open question. Answering it requires strict evaluation of
DL models, as well as an assessment of the effect of architectural variations. As the representativeness
of evaluation results remains unknown [9], additional measures must be developed to assess model
robustness.

4. The method

    In this study a set of autoencoder architectures is applied to the task of anomaly detection [10] in
the setup shown in Figure 2. The DL model, i.e. the autoencoder, is trained in an unsupervised manner
to reconstruct normal time series, targeting minimal reconstruction loss. The trained model is then used
to reconstruct unseen time series with anomalies, where the reconstruction error is expected to peak at
anomalous intervals. To evaluate the model, the distributions of the reconstruction error in normal and
anomalous intervals are analyzed for statistical distinguishability. Finally, from the error
distributions a decision-rule classifier for anomaly detection is built in a supervised mode.


Figure 2: Solution architecture. Measurements (time series) pass through preprocessing and feature
engineering; the resulting features are fed to the autoencoder; the reconstructed time series is
compared to the input with a distance measure, and a decision tree produces the anomaly prediction.

   This method detects anomalies with time-step precision, and most of the evaluated models can be
applied in real time. A minimal sketch of this detection loop is given below, assuming an already-built
Keras autoencoder and preprocessed cycle tensors of shape (cycles, 300, features); the function names
are illustrative, not the paper's code:
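```python
import numpy as np

def fit_autoencoder(autoencoder, X_normal, epochs=130):
    # Unsupervised training: the model reconstructs its own input.
    autoencoder.compile(optimizer="adam", loss="mae")
    autoencoder.fit(X_normal, X_normal, epochs=epochs, batch_size=32)
    return autoencoder

def detect_anomalies(autoencoder, X, threshold):
    # Per-timestep distance measure: MAE over the feature axis.
    X_rec = autoencoder.predict(X)
    error = np.mean(np.abs(X - X_rec), axis=-1)   # shape (cycles, 300)
    return error > threshold                      # boolean anomaly mask
```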



   Five modifications of the LSTM autoencoder and three modifications of the ConvNet autoencoder
were modelled and evaluated in a setup allowing direct benchmarking against the background research.
For the results to be representative, model robustness must be assessed. From the analysis of the
evaluation results, two challenges were identified that must be met to achieve model robustness and
representativeness of evaluation.
   1. One distinct feature of the HRSS plant is low process variation under normal conditions, with
        12.6% mean absolute deviation from the averaged process cycle. Under such conditions an
        autoencoder can reach a local minimum of reconstruction error without reconstructing the
        individual features of distinct cycles (i.e. different process runs). In this case the model's
        output is close to the average training cycle, with reconstruction loss close to the normal
        process variation. Such a model performs well on HRSS data, where process variation is low,
        but it will not be useful in most CPPS applications, where process variation is higher.
   2. The presence of anomalies may affect the model's performance in reconstructing neighboring
        normal intervals. This is expected behavior in models with internal time dependency, which
        are used in this study. In this case the model's robustness is limited by the type and length of
        the anomalies, which are typically not known at training time.

4.1.    Robustness criteria

   To address the mentioned challenges, two robustness criteria are proposed for representative model
evaluation.
   RC1. Reconstructed variation rate is calculated in unsupervised mode on the training set of
normal process cycles, by comparing the step-wise standard deviation of the reconstructed signal
$s_{reconstructed}^{normal}$ to the standard deviation of the model input $s_{input}^{normal}$:

$$RC1 = \frac{\sigma(s_{reconstructed}^{normal})}{\sigma(s_{input}^{normal})} \qquad (1)$$
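   A possible reading of equation (1) in NumPy is sketched below, assuming arrays of shape
(cycles, timesteps, features); averaging the step-wise deviations into a single ratio is an assumption,
as the text does not fix the aggregation:

```python
import numpy as np

def rc1(x_input, x_reconstructed):
    # Step-wise standard deviation: across cycles, per timestep/feature.
    sigma_rec = np.std(x_reconstructed, axis=0)
    sigma_in = np.std(x_input, axis=0)
    # Aggregate into a single variation ratio (assumption).
    return float(np.mean(sigma_rec) / np.mean(sigma_in))
```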
    RC2. Reconstruction sensitivity to anomalies is assessed in supervised mode on the set of
anomalous cycles (i.e. the evaluation set) as the correlation between the error of reconstructing normal
intervals and the strength of anomalies in the same process cycle or time window:

$$RC2 = \mathrm{corr}\left(\mathrm{M}\left|s_{reconstructed}^{normal} - s_{input}^{normal}\right|,\ anomaly\_estimate\right), \qquad (2)$$

where $anomaly\_estimate$ is domain specific and includes the type, time length and strength of the
anomaly.
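   Once per-cycle mean errors and anomaly estimates are available, equation (2) reduces to a single
correlation coefficient; a minimal sketch:

```python
import numpy as np

def rc2(normal_interval_error, anomaly_estimate):
    # Both inputs: 1-D arrays with one value per cycle (or time window).
    # np.corrcoef returns the 2x2 correlation matrix; take the
    # off-diagonal element.
    return float(np.corrcoef(normal_interval_error, anomaly_estimate)[0, 1])
```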
   In the HRSS plant two distinct types of anomalies are present:
    Type 1: amplitude deviations from the normal signal
    Type 2: deviations in the timing, duration, or sequence of process stages
   In practice, anomalous cycle duration and long-term type 2 anomalies have a noticeable effect on
RC2, as shown in Figure 3.




Figure 3: Evaluation results for two different autoencoder models which demonstrate low (left) and
high (right) values of RC2, respectively. Results were obtained from LSTM 2 and ConvNet 2 models.


4.2.    Evaluation techniques
   To evaluate the DL models, the HRSS dataset is split into three parts.
    The training set contains a randomly selected 2/3 of the normal cycles and is used to train the
       autoencoder.
    The test set contains the remaining normal cycles and is used to validate the autoencoder and
       test it for overtraining.
    The evaluation set contains all cycles with anomalies and is used to assess anomaly detection
       performance and to justify the selection of the decision threshold.

4.2.1. Choice of performance measures

   The architecture consists of two parts: the autoencoder, which reconstructs the input time
sequence, and the classifier used for anomaly detection. Two performance indicators are therefore
required.
   The performance of signal reconstruction was measured with the MAE loss function, which is more
outlier-resistant and better suited to high-dimensional data than MSE.
   Anomaly detection performance was measured with the F1 score and the confusion matrix. The F1
score has the advantage of accounting for both false positives and false negatives. Compared to
accuracy and correlation-based measures, which also account for true negatives, the F1 score better
suits an unbalanced dataset. The F1 score together with the confusion matrix also enables direct
comparison with the background research.

4.2.2. Selecting decision threshold
    In the anomaly detector a threshold must be set for the signal reconstruction error. Let $L_n$ be the
distribution of signal reconstruction loss obtained on the training set (i.e. in normal cycles); let $L_{vn}$
and $L_{va}$ be the distributions of loss obtained on validation data, in normal intervals and in anomalies
respectively. Then the optimal value of the classification threshold can be assessed from $L_n$, $L_{vn}$
and $L_{va}$ in two ways:
     Unsupervised: $T = \mathrm{E}(L_n) + 2\sigma(L_n)$.
     Supervised: $T = \operatorname{argmax}_T S(L_{vn}, L_{va}, T)$, where $S$ is a performance measure for anomaly
        detection.
    Experiments on HRSS data show that the optimal threshold value varies over a broad range across
architectural modifications. While the first estimate can be far from optimal, the second may not be
feasible in most applications, where labelled anomalous data is not available. A sketch of both
estimates follows.
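   This is a minimal sketch assuming 1-D arrays of reconstruction losses ($L_n$ for training/normal,
$L_{vn}$ and $L_{va}$ for validation normal and anomalous intervals); the grid search is one plausible way
to realize the argmax, not the paper's exact procedure:

```python
import numpy as np
from sklearn.metrics import f1_score

def threshold_unsupervised(L_n):
    # T = E(L_n) + 2*sigma(L_n), computable without labels.
    return L_n.mean() + 2 * L_n.std()

def threshold_supervised(L_vn, L_va, n_grid=200):
    # Grid search for the threshold maximizing F1 on labelled
    # validation losses (normal vs. anomalous intervals).
    losses = np.concatenate([L_vn, L_va])
    labels = np.concatenate([np.zeros(len(L_vn), int), np.ones(len(L_va), int)])
    candidates = np.linspace(losses.min(), losses.max(), n_grid)
    scores = [f1_score(labels, (losses > t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(scores))]
```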

4.2.3. Evaluation steps

   Evaluation steps include:
   1. calculating the performance measures;
   2. assessing the statistical separation between the autoencoder's responses to normal and
      anomalous signals ($L_n$, $L_{vn}$ and $L_{va}$);
   3. assessing the robustness criteria RC1 and RC2;
   4. selecting the optimal model by maximal performance among the models that have passed the
      robustness tests.

5. Models

   The DL models being tested are divided into two groups by DL architecture type: LSTM and
convolutional. In each group the first model is a traditional architecture used for anomaly detection.
The other models are built to assess the effect of architectural modifications on model performance.


   The choice of the model's hyper-parameters affects both experimental performance and
robustness. Hyper-parameters include the number, types and sizes of layers, the compression rate of
the autoencoder, the use of dropouts, as well as internal layer parameters (e.g. kernel size, activation
function). As no computationally efficient techniques exist for finding the optimal architecture
through hyper-parameter selection, this task remains tedious and highly intuition-driven [15, 16]. In
this study a grid search approach was applied for each model type, obtaining the models shown in
Table 1.

Table 1
Architecture modifications of the autoencoder. All models have decoder layers symmetrical to the
encoder. Models were compiled with the 'tanh' function for layer activation and the 'sigmoid' function
for recurrent activation. The "CR" column gives the compression rate of the encoder.

| Model     | Description                                          | Layers in encoder                                                                                                        | CR | Transformations at the bottleneck                               |
|-----------|------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|----|-----------------------------------------------------------------|
| LSTM 1    | Classic LSTM architecture [17, 18]                   | 2 LSTM layers (filters: 30, 60)                                                                                            | 60 | Final LSTM output repeated for each timestep                    |
| LSTM 2    | Model without time compressing                       | 2 LSTM layers (filters: 60, 4)                                                                                             | 3  | No transformation                                               |
| LSTM 3    | Model with time pooling                              | 3 LSTM layers (filters: 25, 25, 6), 2 pooling layers (factor 3, 2)                                                         | 12 | No transformation                                               |
| LSTM 4    | Model with time pooling and convolutional layers     | 3 LSTM layers (filters: 25, 25, 6), 3 pooling layers (factor 3, 3)                                                         | 24 | 1 convolutional layer (6 filters, kernel size 3)                |
| LSTM 5    | Model with time pooling and locally connected layers | 3 LSTM layers (filters: 25, 25, 6), 2 pooling layers (factor: 3, 2)                                                        | 12 | 1 locally connected layer (6 filters, kernel size 5)            |
| LSTM 6    | Model with time pooling and dense layers             | 3 LSTM layers (filters: 25, 25, 6), 3 pooling layers (factor: 3, 2, 2), 1 convolutional layer (6 filters, kernel size 3)   | 48 | Flattening, dense layer (75 filters), dense layer (150 filters) |
| ConvNet 1 | Classic convolutional model                          | 3 convolutional layers (filters: 32, 16, 6; kernel size 5), 3 pooling layers (factor 2)                                    | 16 | 1 convolutional layer (6 filters, kernel size 5)                |
| ConvNet 2 | Extended convolutional model                         | 3 convolutional layers (filters: 64, 32, 6; kernel size: 5, 10, 20), 3 pooling layers (factor 3, 2, 2)                     | 24 | 1 convolutional layer (6 filters, kernel size 5)                |
| ConvNet 3 | Convolutional model with dense layers                | 3 convolutional layers (filters: 64, 32, 6; kernel size: 5, 10, 20), 3 pooling layers (factor 3, 2, 2)                     | 48 | Flattening, dense layer (75 filters), dense layer (150 filters) |


6. Experimental setup
   The models were implemented using Keras with a TensorFlow backend. Training was performed
with the 'Adam' optimizer and the MAE loss function, with learning rate $\alpha = 0.005$, $\beta_1 = 0.9$,
$\beta_2 = 0.999$, and fuzzy factor $\varepsilon = 10^{-7}$ [19]. The time series of complete process cycles,
padded to a constant length of 300 timesteps, were used as both input and target. Training was run for
130 epochs for LSTM models and 300 epochs for ConvNet models, in mini-batch mode with batch
size 32. To rule out the effect of batch averaging on robustness criterion RC1, training was repeated in
stochastic mode (batch size 1). In this setup the number of epochs was reduced by a factor of 5, as
epochs are more time-consuming in this mode but epoch-to-epoch convergence is faster. As no
significant influence of the batch size on the evaluation criteria was observed in experiments, only
results obtained in mini-batch mode are presented. As the reconstruction loss fluctuates between
training epochs, averaging across the last 10 epochs was used for a reliable performance estimate.
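   A sketch of this training configuration in Keras, assuming the autoencoder model and the padded
training tensor are already built:

```python
from tensorflow.keras.optimizers import Adam

def compile_and_train(autoencoder, X_train, epochs=130):
    # Adam with the hyper-parameters listed above; MAE reconstruction loss.
    opt = Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
    autoencoder.compile(optimizer=opt, loss="mae")
    # Padded cycles serve as both input and target (self-reconstruction).
    return autoencoder.fit(X_train, X_train, epochs=epochs, batch_size=32)
```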
   Data pre-processing included the following steps (a sketch is given after this list):
    Introducing velocity features, calculated with second-order accurate central differences.
    Dimensionality reduction from 24 to 12 components with PCA, which preserves 98.2% of the
       data variance.
    Normalization and scaling to the range (0, 1), which unifies the value ranges of the features.
    Time smoothing with a Gaussian kernel of width 15 and standard deviation 3.
    Unifying time series length by padding.
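   The sketch below follows the stated parameters. Which raw signals receive velocity channels is not
specified in the text (18 raw signals plus 6 velocities would give the 24 features entering PCA), so
`position_cols` is a hypothetical parameter:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from scipy.ndimage import gaussian_filter1d

def preprocess(X_raw, position_cols):
    # Velocity features via second-order accurate central differences.
    velocity = np.gradient(X_raw[:, position_cols], axis=0, edge_order=2)
    X = np.hstack([X_raw, velocity])

    X = PCA(n_components=12).fit_transform(X)    # keeps ~98.2% of variance
    X = MinMaxScaler((0, 1)).fit_transform(X)    # unify feature ranges

    # Gaussian time smoothing: sigma 3, radius 7 => kernel width 15.
    return gaussian_filter1d(X, sigma=3, axis=0, truncate=7 / 3)
```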

7. Results
   The reconstruction rates of all models fall into a narrow range, as shown in Table 2, with the
exception of the classic LSTM autoencoder (LSTM 1), which proved unable to accurately reconstruct
the process. Thus, the reconstruction loss alone cannot be used to assess the efficiency of an
autoencoder in the CPPS anomaly detection task; instead, statistical analysis of the loss distributions
must be applied.

Table 2
Model evaluation results. Column "1" gives the anomaly detection score obtained with supervised
optimal threshold selection using labelled anomalies; column "2" gives the score obtained with the
unsupervised threshold estimate. Separate assessments for type 1 and type 2 anomalies are obtained
with optimal threshold selection. The reconstruction score is assessed relative to the amplitude
variation of the normal signal. In some models RC1 varies significantly during the process cycle,
starting at 0.1–7.5%; for such models the mean value is given, marked with an asterisk "*", while the
model's robustness drops significantly at the beginning of each cycle.

| Model        | F1 overall (1), % | F1 overall (2), % | F1 type 1 anomalies, % | F1 type 2 anomalies, % | Reconstruction, MAE, % | RC1, % | RC2         |
|--------------|-------------------|-------------------|------------------------|------------------------|------------------------|--------|-------------|
| Target value | 100               | 100               | 100                    | 100                    | 100                    | 100    | none or low |
| LSTM 1       | 52.6              | 38.3              | 25.6                   | 81.8                   | 42.5 ± 1.5             | 0.0012 | none        |
| LSTM 2       | 46.3              | 36.3              | 34.2                   | 59.0                   | 12.0 ± 0.53            | 51.2   | low         |
| LSTM 3       | 69.2              | 67.9              | 37.7                   | 85.9                   | 13.4 ± 0.92            | 27.14* | low         |
| LSTM 4       | 62.3              | 59.2              | 34.5                   | 78.5                   | 14.7 ± 0.83            | 36.5*  | low         |
| LSTM 5       | 75.4              | 75.3              | 39.3                   | 92.9                   | 12.63 ± 0.27           | 0.014  | none        |
| LSTM 6       | 75.7              | 75.2              | 40.2                   | 93.2                   | 12.8 ± 0.41            | 0.0064 | none        |
| ConvNet 1    | 38.8              | 25.8              | 30.8                   | 55.1                   | 10.9 ± 0.81            | 0.652  | medium      |
| ConvNet 2    | 48.3              | 42.7              | 33.6                   | 62.8                   | 14.9 ± 0.75            | 26.5   | high        |
| ConvNet 3    | 69.9              | 68.0              | 44.3                   | 85.2                   | 10.4 ± 0.41            | 42.6   | high        |


Figure 4: Error distributions $L_n$, $L_{vn}$ and $L_{va}$ for three models with close performance estimates in
normal signal reconstruction: a) ConvNet 1 model: the classes are not separated;
b) LSTM 2 model: the classes are overlapping; c) LSTM 4 model: the classes are well separated

    Figure 4 shows the statistical distributions of the autoencoder's responses to normal and anomalous
data. The performance of anomaly detection is defined by the quality of class separation, but it is also
highly dependent on the method of selecting the decision threshold. The empirically defined optimal
threshold for the tested models varies over a wide range, from 0.20 to 1.12 (Table 2, column "1").
Optimal threshold selection is only possible in supervised mode with the use of labelled anomalies,
while in most practical applications unsupervised threshold selection must be applied (Table 2,
column "2").
    While some models achieve high scores in detecting type 2 anomalies, they proved insensitive to
type 1 anomalies, e.g. 20% amplitude deviations from the normal signal. Their high overall detection
score is explained by the large relative abundance of type 2 anomalies in the HRSS data. For
evaluation results to be representative, the detection rates for the different anomaly types must be
assessed separately.
    Figure 5 shows the results of reconstructing a single process cycle containing a type 2 anomaly.
Graphs a and b are obtained from robust models (having a high RC1 value and a low RC2 value),
graph c demonstrates an extreme case of zero RC1, and graph d is a case of high RC2. In cases a and b
the model captures the features of the individual observed cycle, so it is expected to show comparable
performance in other typical CPPS applications. In case c the model output follows the averaged
training data, regardless of the observed process features. The high anomaly detection score in this
case is not representative and is only observed due to the low variance of training cycles in HRSS. In
case d the presence of a type 2 anomaly strongly affects the reconstruction of the preceding normal
interval, making the statistical separation of the two in time (as well as the resulting performance
estimate) inadequate.



Figure 5: Reconstruction of a signal containing a type 2 anomaly: a) using the LSTM 2 model;
b) using the LSTM 3 model; c) using the LSTM 5 model; d) using the ConvNet 2 model

    The evaluation results indicate that increasing the complexity of the DL models (top down in
Table 2) leads to higher performance measures. However, this is not the case for robustness. Deep
LSTM models with heterogeneous layers (LSTM 5 and LSTM 6) tend to average out all variation in
the signal (i.e., have low RC1), while deeper convolutional networks lose the ability to reconstruct the
normal signal in the presence of type 2 anomalies (i.e., have high RC2). It may be concluded that
traditional performance metrics for model evaluation are misleading in the case of HRSS, favoring
models with low robustness according to criteria RC1 and RC2.
    Considering both the performance measure and the proposed robustness criteria, the LSTM 4
model is selected as the best choice for HRSS data. The model's architecture is shown in Figure 6.
    Compared to traditional LSTM autoencoder architectures [17, 18], this model introduces two
distinct architectural features. First, the input time series are not flattened into a vector, and thus the
model has a lower compression rate. Experimental evidence (Table 2) suggests that preserving the
time dimension in the encoder generally leads to better performance in the anomaly detection task.
Second, an additional convolutional layer is added at the model's bottleneck to capture long-term
features in the input time series.
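   A hedged Keras sketch of this architecture follows. Layer sizes are taken from Table 1; details the
paper leaves open are assumptions, in particular the pooling factors (3, 2, 2), chosen here so that the
compression rate of 24 is reproduced (300×12 input → 25×6 bottleneck), and the exact decoder wiring:

```python
from tensorflow.keras import Input, Model, layers

def build_lstm4(timesteps=300, n_features=12):
    inp = Input(shape=(timesteps, n_features))
    # Encoder: 3 LSTM layers (25, 25, 6 units) interleaved with pooling.
    x = layers.LSTM(25, activation="tanh", recurrent_activation="sigmoid",
                    return_sequences=True)(inp)
    x = layers.MaxPooling1D(3)(x)                  # 300 -> 100
    x = layers.LSTM(25, return_sequences=True)(x)
    x = layers.MaxPooling1D(2)(x)                  # 100 -> 50
    x = layers.LSTM(6, return_sequences=True)(x)
    x = layers.MaxPooling1D(2)(x)                  # 50 -> 25
    # Bottleneck: one convolution to capture long-term features.
    x = layers.Conv1D(6, kernel_size=3, padding="same")(x)
    # Decoder, symmetrical to the encoder, with upsampling.
    x = layers.UpSampling1D(2)(x)                  # 25 -> 50
    x = layers.LSTM(6, return_sequences=True)(x)
    x = layers.UpSampling1D(2)(x)                  # 50 -> 100
    x = layers.LSTM(25, return_sequences=True)(x)
    x = layers.UpSampling1D(3)(x)                  # 100 -> 300
    x = layers.LSTM(25, return_sequences=True)(x)
    out = layers.TimeDistributed(layers.Dense(n_features))(x)
    return Model(inp, out)
```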
    The obtained LSTM 4 model provides a 62.3±2.1% overall anomaly detection rate (F1 score) and
59.1% recall with 3350 true positives, as shown in Table 3. Compared to the baseline efficiency [9],
an increase of 102% in anomaly detection score and an increase of 121% in recall are achieved.

Table 3
Confusion matrix obtained with the selected model.

| Labelled | Predicted negative | Predicted positive |
|----------|--------------------|--------------------|
| Negative | 16237              | 1738               |
| Positive | 2320               | 3350               |
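The reported scores follow directly from these counts; as a quick check:

```latex
\mathrm{recall} = \frac{TP}{TP + FN} = \frac{3350}{3350 + 2320} \approx 59.1\%, \qquad
\mathrm{precision} = \frac{TP}{TP + FP} = \frac{3350}{3350 + 1738} \approx 65.8\%,

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
           {\mathrm{precision} + \mathrm{recall}}
    = \frac{2 \cdot 0.658 \cdot 0.591}{0.658 + 0.591} \approx 62.3\%
```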

Figure 6: Selected autoencoder architecture

8. Conclusions

    The problem of model-based anomaly detection in industrial CPPS was addressed in the Deep
Learning paradigm by applying autoencoder architectures. The specific case of the HRSS plant was
studied, in which the construction and evaluation of process models had proven to be a challenging
task. The major challenges of applying Deep Learning models were identified as low process
variation in the training set and the presence of two distinct types of anomalies, the detection of which
requires different algorithms or settings.
    It was shown that increasing model complexity, in both LSTM and convolution-based models,
increases anomaly detection performance but carries a strong robustness tradeoff. This indicates that
model evaluation in systems of this class cannot rely solely on performance metrics. For evaluation
results to be representative, the detection rates for different anomaly types must be assessed
separately, and additional robustness criteria must be considered. Such criteria were proposed based
on statistical analysis of both the data and the model output in a supervised training context.
    In the studied industrial transporting system (HRSS), applying deep learning models and
autoencoder techniques allowed for a 102% gain in anomaly detection performance (F1 score) while
preserving model robustness. A wider assessment of the prospects of CPPS applications requires
further experimental research on cases with higher variance in the normal process as well as different
types of anomalies.

9. Acknowledgements

   This research utilizes the data collected at the SmartFactoryOWL in Lemgo, Germany, under the
European Union's Horizon 2020 research project IMPROVE [12]. The data was made publicly
available by inIT [13] under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International license (CC BY-NC-SA 4.0).

10. References
[1] Factories of the future: multi-annual roadmap for the contractual PPP under HORIZON 2020,
    Publications Office of the European Union, Luxembourg, 2013.

[2] E. A. Lee, Cyber physical systems: design challenges. In: Proceedings of the 11th IEEE
     international symposium on Object Oriented Real-Time Distributed Computing (ISORC),
     Orlando, FL, 2008, pp. 363–369. doi: 10.1109/ISORC.2008.25.
[3] O. Niggemann, C. Frey, Data-driven anomaly detection in cyber-physical production systems,
     AT – Automatisierungstechnik, 2015, vol. 63, issue 10. doi: 10.1515/auto-2015-0060.
[4] L. Christiansen, A. Fay, B. Opgenoorth, J. Neidig, Improved diagnosis by combining structural
     and process knowledge, in: Proceedings of the 16th IEEE conference on Emerging Technologies
     Factory      Automation,        ETFA,     Toulouse,     France,     2011,     pp. 1–8.      doi:
     10.1109/ETFA.2011.6059056.
[5] S. Windmann, S. Jiao, O. Niggemann, H. Borcherding, A stochastic method for the detection of
     anomalous energy consumption in hybrid industrial systems, in: Proceedings of the 11th
     international IEEE conference on Industrial Informatics, INDIN, Bochum, Germany, 2013. doi:
     10.1109/INDIN.2013.6622881.
[6] B. Vogel-Heuser, C. Diedrich, A. Fay, S. Jeschke, M. Kowalewski, S. Wollschlaeger,
     P. Goehner, Challenges for software engineering in automation, Journal of Software Engineering
     and Applications 7 (2014) 440–451. doi: 10.4236/jsea.2014.75041.
[7] N. Hranisavljevic, O. Niggemann, A. Maier, A novel anomaly detection algorithm for hybrid
     production systems based on deep learning and timed automata, in: Proceedings of the 27th
     international workshop on Principles of Diagnosis, DX-2016, Denver, Colorado, 2016.
[8] A. von Birgelen, O. Niggemann, Enable learning of hybrid timed automata in absence of discrete
     events through self-organizing maps, in: O. Niggemann, P. Schüller (eds.), IMPROVE –
     Innovative modelling approaches for production systems to raise validatable efficiency.
     Technologien für die intelligente automation (Technologies for intelligent automation), vol. 8,
     Springer Vieweg, Berlin, Heidelberg, 2018. doi: 10.1007/978-3-662-57805-6_3.
[9] A. von Birgelen, O. Niggemann, Using self-organizing maps to learn hybrid timed automata in
     absence of discrete events, in: Proceedings of the 22nd IEEE international conference on
     Emerging Technologies and Factory Automation, ETFA, Limassol, Cyprus, 2017, pp. 1–8. doi:
     10.1109/ETFA.2017.8247695.
[10] C. Zhou, R. C. Paffenroth, Anomaly detection with robust deep autoencoders, in: Proceedings of
     the 23rd ACM SIGKDD international conference on Knowledge Discovery and Data Mining,
     KDD '17, Halifax NS, Canada, 2017, pp. 665–674. doi: 10.1145/3097983.3098052.
[11] IMPROVE. Creating the factory of the future with 4.0 solutions, 2016. URL:
     http://improve-vfof.eu/.
[12] Physical factory / demonstrators IMPROVE, 2016. URL:
     http://improve-vfof.eu/background/physical-factory-demonstrators.
[13] inIT, High storage system data for energy optimization, 2018. URL:
     https://www.kaggle.com/inIT-OWL/high-storage-system-data-for-energy-optimization.
[14] M. Cerliani. Predictive maintenance with LSTM siamese network, 2019. URL:
     https://towardsdatascience.com/predictive-maintenance-with-lstm-siamese-network-51ee7df29767.
[15] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, R. M. Patton, Optimizing deep learning
     hyper-parameters through an evolutionary algorithm, in: Proceedings of the workshop on
     Machine Learning in High-Performance Computing Environments, MLHPC '15, Austin, Texas,
     2015, article no. 4. doi: 10.1145/2834892.2834896.
[16] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, The journal of machine
     learning research, 13 (2012), pp. 281–305.
[17] A. Sagheer, M. Kotb. Unsupervised pre-training of a deep LSTM-based stacked autoencoder for
     multivariate time series forecasting problems, Scientific Reports 9, 19038 (2019). doi:
     10.1038/s41598-019-55320-6.
[18] A. H. Mirza, S. Cosan, Computer network intrusion detection using sequential LSTM neural
     networks autoencoders, in: Proceedings of the 26th Signal Processing and Communications
     Applications Conference, SIU, Izmir, Turkey, 2018, pp. 1–4. doi: 10.1109/SIU.2018.8404689.
[19] D. P. Kingma, J. Ba. Adam: a method for stochastic optimization, in: Proceedings of the 3rd
     international conference for Learning Representations, CoRR, San Diego, CA, 2014,
     abs/1412.6980.
