A Customized Approach to Anomalies Detection by using Autoencoders

Roberto Aureli (a), Nicolo' Brandizzi (a), Giorgio De Magistris (a) and Rafał Brociek (b)

(a) Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
(b) Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland

SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics, July 27-29, 2021, Catania, IT
aureli.1757131@studenti.uniroma1.it (R. Aureli); brandizzi@diag.uniroma1.it (N. Brandizzi); demagistris@diag.uniroma1.it (G. De Magistris); rafal.brociek@polsl.pl (R. Brociek)
ORCID: 0000-0002-3076-4509 (G. De Magistris); 0000-0002-7255-6951 (R. Brociek)

Abstract
When dealing with sensor data, it is important to keep track of what is really happening in the monitored environments, since failures, interruptions and misreadings must be expected at any time. Especially with logging processes that produce extremely voluminous reports, an automatic method to detect entries that do not follow the normal distribution of the data (i.e. anomalies) is the ideal solution. In the presented work, the task performed by the autoencoder is to generate a reproduction error, used as a metric for the classification of a sample into one of two classes: anomalous or non-anomalous.

Keywords
Anomaly Detection, Autoencoders, Reproduction Error, Unsupervised Learning

1. Introduction

When dealing with sensor data, it is important to keep track of what is really happening in the monitored environments, since failures, interruptions and misreadings must be expected at any time. Especially with logging processes that produce extremely voluminous reports, an automatic method to detect entries that do not follow the normal distribution of the data (i.e. anomalies) is the ideal solution.
Neural networks can be used in this type of task as detectors of the distance of a sample from the natural distribution underlying the dataset.
In particular, autoencoders [1] are a type of neural network capable of compressing the input into a reduced, meaningful representation and finally decoding it back, reproducing it with the minimum error possible [2, 3]. This type of network has been used successfully for image denoising [4, 5, 6], NLP tasks and generic dimensionality reduction [7, 8]. The first use of this type of network dates back to the 80s; however, its origins and authors are unclear because of changes in nomenclature and definitions.
In this work, the reproduction error (i.e. the error between the input sample and the output of the autoencoder) over a set of samples is exploited to discriminate which samples in the given set are anomalous and which are not. Ideally, the anomalies are a minimal part of a dataset, with generally low probability of being drawn from the distribution describing the set: this scarcity implies a large reproduction error from the autoencoder. Moreover, the more gradient descent is performed over a set of inputs, the more the loss should decrease (until it hits its minimum); vice versa, if a datum is not common, its loss is greater with respect to other, well known data.
This method is shown to work over a real-life, unlabelled dataset, posing the problem in the unsupervised learning landscape.

2. Related Works

Anomaly detection tasks have already been studied and solved with neural networks exploiting the reproduction error: the difference between a generic sample and a reconstruction of itself performed by some mathematical model.
In [9], a module made up of stacked LSTM networks is trained over non-anomalous data and its prediction error over the future steps is used as an indicator of the anomaly of the sample. However, this approach needs the dataset to be labelled, increasing the work needed for its creation and the difficulty of application in real-life scenarios.
Similarly, in [10], the authors proposed a novel architecture called ALAD (Adversarially Learned Anomaly Detection), an approach based on generative adversarial networks. The GAN generates an adversarially learned set of features used to project the high-dimensional original space of the dataset into a reduced one. The reduced representations are then decoded and the reproduction error is used as an anomaly indicator.
More similar to the approach proposed in this work, but much more advanced, [11] makes use of robust techniques paired with autoencoders to detect anomalies and types of anomalies such as random corruptions, recurrent corruptions (i.e. corruptions present in more than one instance) and so on.
In the presented work, the task performed by the autoencoder is to generate a reproduction error, used as a metric for the classification of a sample into one of two classes: anomalous or non-anomalous.
3. Background and method description

3.1. Autoencoder

An autoencoder is made up of two functions: an encoder and a decoder. The goal of the encoding function is to map (i.e. to encode) the input into a different space. Symmetrically, the decoder must map the encoded vector back to the original input, without losing information. Conveniently, the encoding space is often chosen smaller in dimension than the original space, so that the autoencoder also performs a dimensionality reduction.
More formally, given an encoding function E(x) and a decoding function D(x), D(E(x)) must return the original x. The two functions mentioned above can be approximated by symmetric neural networks, solving the following optimization problem:

min_{θ_D, θ_E} ‖x − D(E(x))‖²

where θ_D and θ_E are the parameters of the respective neural networks.

3.2. Architecture

The only constraint that needs to be taken into consideration is the presence of a bottleneck, a layer smaller than all the other layers in the network, essential for the dimensionality reduction. Without a bottleneck, the network is not "forced" to ignore useless or non-representative features in the input, losing the capacity of mapping the input into a denser space.
The networks' architectures are completely adaptable to the problem. Normally, the input (e.g. time series, images) is embedded into a vectorial representation and then reduced up to the bottleneck. This representation is the projection of the input into a different space carried out by the encoder part. The latent vector (i.e. the output of the bottleneck) is then passed to a generally symmetric network, returning a representation belonging to the original input space.
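As an illustration of the encoder-bottleneck-decoder structure described above, the following is a minimal PyTorch sketch of a symmetric fully connected autoencoder (the paper's implementation is in PyTorch, see Section 5.4). The layer sizes, module names and the choice of activations are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Symmetric fully connected autoencoder with a small bottleneck.

    Dimensions are illustrative: in_dim is the size of an input sample,
    hidden_dim the hidden layer, bottleneck_dim the latent vector h.
    """
    def __init__(self, in_dim: int, hidden_dim: int = 16, bottleneck_dim: int = 2):
        super().__init__()
        # Encoder E: input space -> latent (bottleneck) space
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Decoder D: latent space -> input space, symmetric to the encoder
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
            nn.Sigmoid(),  # inputs are assumed normalized in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)       # h = E(x), the reduced representation
        x_rec = self.decoder(h)   # x' = D(h), back in the original space
        return x_rec
```

A bottleneck smaller than the input dimension is what forces the network to discard non-representative features, as discussed above.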
3.3. Training

The training procedure is the standard one used for generic neural networks. A sample x is drawn from the dataset, sampled from the distribution p_dataset(X). The latent vector (i.e. the output of the bottleneck layer) h is generated by the encoder part of the autoencoder as h = E(x). At the end of the training, this representation can be used as an approximate reduced representation of the original input. Finally, h is used as input for the decoder D, generating x' = D(h), which belongs to the same space as x. In this case, the L1 error between x and x' is minimized to train the parameters of the network. It is crucial that the goal of the training is to guarantee the minimum difference between x and x'.

3.4. Anomalies Detection

The key concept of the method is the notion of scarcity: an anomaly, to be defined as such, needs to be a noteworthy event. In practical cases, it could be an unpredictable spike in a time series, a set of burned pixels in a photo, the saturation of a sensor, an unexpected down time of a link and so on.
It is worth noting that, when sampling an element from a distribution, it is far more probable to sample a normal entry than an anomalous one (if not, other methods must be used, or there could be some problems in the dataset). With these premises, the capacity of the autoencoder to capture a distribution and project it into another space is exploited: during the training phase, multiple epochs are performed over all the samples in the dataset, meaning that the network will experience a gradient descent over the loss generated by the same samples multiple times, but without modifying the ratio between non-anomalous and anomalous samples. Moreover, the reproduction error will be minimized over the most prominent distribution in the dataset, generating larger errors on the anomalous subsets.
As shown in Figure 1, scarcity is the fundamental parameter that separates a normal sample from an anomalous one. When the contamination proportion is inverted, the originally good samples result in a loss distributed higher than the respective anomalous ones. This example is purely demonstrative of the analysis made on the loss: a real-life dataset has a percentage of anomalies sensibly lower than the ones used here.

Figure 1: The loss distributions of an autoencoder trained on an artificially contaminated labelled dataset composed of normalized vectors of dimension 24, sampled from a multivariate Gaussian for the non-anomalous data and from a uniform distribution for the anomalous data.

The last thing needed to define an anomaly is the loss threshold: after analyzing the distribution of the autoencoder's reproduction loss on all the samples in the dataset, a threshold must be manually imposed, and a sample with a loss beyond it is considered anomalous. A compromise must be reached, since all the anomalous samples must be included without including non-anomalous ones.

3.5. Workflow

The method is fast and, with big enough datasets, a single training can be performed to analyze future, unseen events.
The preprocessing of the dataset is unavoidable: normalizing data and orders of magnitude is necessary to have a good, consistent loss analysis. Samples larger in module can significantly alter the distribution.
Then, the model is fine-tuned (i.e. autoencoder architecture, bottleneck dimensions and so on) and a standard training procedure is performed. The loss distribution is analyzed and the threshold is manually selected by the user.
Once new data arrive, there is no need to retrain the autoencoder: the sample is normalized and passed to the model, and finally it is classified according to the threshold. It may happen that long-term changes in the dataset distribution completely alter the outcome of the process, requiring a new training and threshold selection.
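To make the workflow concrete, the sketch below shows one way the training and thresholding steps could look, reusing the hypothetical Autoencoder module sketched above. The L1 loss and the threshold-based classification come from this section; the optimizer, batch size and number of epochs are illustrative assumptions, and the threshold itself is chosen by the user after inspecting the loss distribution, as described above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_autoencoder(model: nn.Module, data: torch.Tensor,
                      epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    """Minimize the L1 reproduction error over the (mostly normal) dataset."""
    loader = DataLoader(TensorDataset(data), batch_size=256, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    model.train()
    for _ in range(epochs):
        for (batch,) in loader:
            optimizer.zero_grad()
            loss = criterion(model(batch), batch)
            loss.backward()
            optimizer.step()
    return model

def reproduction_errors(model: nn.Module, data: torch.Tensor) -> torch.Tensor:
    """Per-sample L1 reproduction error, used as the anomaly score."""
    model.eval()
    with torch.no_grad():
        return (model(data) - data).abs().sum(dim=1)

def classify(model: nn.Module, data: torch.Tensor, threshold: float) -> torch.Tensor:
    """Boolean mask: True where the sample is flagged as anomalous."""
    return reproduction_errors(model, data) > threshold
```

In practice, the threshold passed to classify would be picked manually from the histogram of the per-sample errors, as discussed in Section 3.4.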
4. Experiment

4.1. Dataset

The dataset is composed of information gathered from real car sensors. Each car is assigned to a client that can have multiple fleets and, every day that the car has been used, the total time of connection to the network is logged, saving the duration in seconds, the start date and the end date of the session. In case of disconnections during the same day, multiple entries will describe the multiple sessions.

• VID: an integer identifying a single car in a unique way.
• First Ping: a date in the format yyyy-dd-mm representing the day of the session.
• TotalTime: seconds (integer) elapsed from the first connection to the last disconnection, the same for all the sessions of the day.
• start: timestamp representing the start of the session.
• end: timestamp representing the end of the session.
• env: an integer identifying the client owning the car in a unique way.
• Service: an integer representing the fleet of the client, unique for the given client.

4.2. Preprocessing

The structure of the dataset resembles a time series in which, eventually, each day is made up of more entries for the same car. To tackle this complexity, a reduction to a single entry for each car in a given day is performed, in such a way as to get a single point for each pair (VID, First Ping) (First Ping is then dropped at the end of the preprocessing phase). However, in order not to lose any important information, some fields have been added to record what is implicit in the original dataset, while VID, TotalTime, env and Service are kept the same. Since start and end are removed, TotalTime_OFF has been added to record the total time in which the car has been disconnected from the network. Moreover, n_disconnections is added to record the number of sessions in a single day for the given car. The final attribute added is C_v_off_time, namely the coefficient of variation of the down-times between each session, an index of dispersion. The coefficient is defined as:

C_v(x) = σ(x) / μ(x)    (1)

where x is a set of data, σ(x) is the standard deviation of x and μ(x) is its mean. Practically, a higher C_v means that the data are unbalanced, implying a big difference across the elements of the set. For example, C_v(10, 10, 10, 10, 10) = 0, meaning no dispersion in the data, while C_v(35, 5, 5, 2, 3) = 1.4, evidencing a set of data more scattered around the mean.
The meaning given to the coefficient of variation can vary based on the needs: in some cases it is better to have short disconnections than a long one, since it could be easier to deduce missing locations or missing data, therefore preferring lower C_v values. Contrary to the last sentence, a C_v near 0 could also represent a set of long disconnection times followed by small ones: this value alone is not enough to get an idea of how a link is performing, since it does not contain information about the quantities in the set, only a normalized index of dispersion.
After the preprocessing, the dataset has been reduced from 573064 entries to 99386. As for the training process, the autoencoder is trained over TotalTime, TotalTime_OFF, C_v_off_time and n_disconnections, normalized between 0 and 1. This choice is justified by the fact that the remaining variables are categorical ones, hence completely arbitrary values used only for identification purposes.
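A possible sketch of this aggregation step is given below, using pandas. The column names are the ones listed in Section 4.1, but the exact aggregation rules are assumptions: in particular, TotalTime_OFF is computed here as the sum of the gaps between consecutive sessions of the same day, and start/end are assumed to be parsed as datetimes.

```python
import numpy as np
import pandas as pd

def coefficient_of_variation(x: np.ndarray) -> float:
    """C_v = sigma / mu, using the sample standard deviation as in the worked example above."""
    if len(x) < 2 or np.mean(x) == 0:
        return 0.0
    return float(np.std(x, ddof=1) / np.mean(x))

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-session log into one row per (VID, First Ping) pair."""
    df = df.sort_values("start")

    def aggregate(group: pd.DataFrame) -> pd.Series:
        # Down-times between consecutive sessions of the same car on the same day
        off_times = (group["start"].shift(-1) - group["end"]).dropna().dt.total_seconds()
        return pd.Series({
            "TotalTime": group["TotalTime"].iloc[0],     # identical for all sessions of the day
            "TotalTime_OFF": off_times.sum(),            # assumed definition of the off-time
            "n_disconnections": len(group),
            "C_v_off_time": coefficient_of_variation(off_times.to_numpy()),
            "env": group["env"].iloc[0],
            "Service": group["Service"].iloc[0],
        })

    return df.groupby(["VID", "First Ping"]).apply(aggregate).reset_index()
```

The four numerical columns would then be min-max normalized to [0, 1] before training, as stated above.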
4.3. Model architecture and Training

The simplicity of the reduced dataset permits the use of a fully connected network. A hidden layer (for each network) is enough to capture the dimensionality, ending in a bidimensional bottleneck. The nonlinearity is introduced by a ReLU function and a Sigmoid [12] at the output of the two networks. To ease the training process and avoid possible local-minima situations, a dropout layer is used in order to randomly drop to 0 the weights of the network with a probability of 25%. The associated loss is an L1 distance defined as d(x, y) = Σ_{i=1}^{m} |x_i − y_i|, where x and y are vectors of length m.
The training process is supervised by an early stopping mechanism, keeping the best model before reaching a situation of overfit. The performance is tracked by computing the loss over a set of unseen samples, extracted with a proportion of 30% from the original dataset.

Figure 2: Autoencoder's layers and dimensions.

Figure 3: Training profile of the autoencoder's losses: there isn't a significant overfit in the last part.

4.4. Training results

As mentioned before, after the training the loss distribution must be analyzed. The histogram in Figure 4 shows a decreasing trend before 0.10, then a plateau that drops at 0.25, followed by some sparse samples. The quality of the training is confirmed by the peak of the loss distribution near 0, evidencing an overall low loss.
By an empirical choice, the threshold is set at 0.10, where 1803 anomalies are found, 1.84% of the whole dataset.

Figure 4: Histogram showing the distribution of the loss over all the dataset. Logarithmic scale on the y-axis.
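As a sketch of how this analysis could be reproduced, the snippet below builds the log-scale histogram of the per-sample reproduction errors and counts the entries above a chosen threshold. The input array would come from something like the hypothetical reproduction_errors helper sketched in Section 3; the 0.10 default mirrors the empirical choice made here, but it remains a user decision, not a fixed value of the method.

```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_threshold(errors: np.ndarray, threshold: float = 0.10) -> int:
    """Plot the loss histogram (log-scale y-axis) and count samples above the threshold."""
    plt.hist(errors, bins=100)
    plt.yscale("log")  # the log scale makes the rare, high-loss tail visible
    plt.xlabel("reproduction error (L1)")
    plt.ylabel("number of samples")
    plt.axvline(threshold, color="red", linestyle="--", label=f"threshold = {threshold}")
    plt.legend()
    plt.show()

    n_anomalies = int((errors > threshold).sum())
    print(f"{n_anomalies} anomalies ({100 * n_anomalies / len(errors):.2f}% of the dataset)")
    return n_anomalies
```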
5. Results

The following results are extracted from the dataset after the classification process mentioned before: each sample is expanded with the associated, previously removed, categorical variables, in order to contextualize the results in the dataset domain.
Each analysis starts from the comparison of the distributions of the two classes: anomalous and non-anomalous data. In each plot, the blue distribution represents the whole dataset (labels on the left y-axis), while the orange one shows the anomalies (labels on the right y-axis).

5.1. Time anomalies

As expected, the notion of anomaly in a connection-to-network domain is in direct correlation with the duration of the down-time.

Figure 5: Distribution of the Total Time.

Figure 5 shows a blue peak of the distribution at the maximum admissible value (i.e. 86400 seconds in a day), followed by a peak of the anomalies in the same position. The interesting part of the plot is where the anomalies are distributed more than the dataset itself, hence at the lower values of the x-axis. As expected, an important number of entries with a low total connection time is classified as anomalous.
Symmetrically, the plot of the down-time shows a similar situation:

Figure 6: Distribution of the Total Time of disconnection.

Also in this case, there is an important anomalies' peak near the dataset distribution, followed by many samples on the right part of the x-axis. An anomalous entry is also described by a high disconnection time.
To better see the motivation of the overlapping peaks, a new type of plot can be introduced:

Figure 7: Scattered representation of each entry with TotalTime and TotalTime_OFF as coordinates for each point. Non-anomalous points in blue, anomalous points in red.

In this plot each entry is scattered on a plane, with TotalTime on the x-axis and TotalTime_OFF on the y-axis. An ideal entry has the maximum TotalTime and the minimum TotalTime_OFF, placing itself in the lower right corner of the plot.
The yellow star represents the center of mass of the non-anomalous distribution (in blue), very near to the ideal point, while the green star represents the center of mass of the anomalies distribution. There are ∼80k blue points and only ∼2k red ones, evidencing a big difference in concentration. It is worth noting that a different threshold would have moved the frontier between the two clusters up or down.
The reason for the overlapping peaks lies in the fact that a big number of anomalies is on the vertical line at the maximum of TotalTime (i.e. x = 86400) and on the horizontal line at the minimum of TotalTime_OFF (i.e. y = 0), meaning that one variable is in a good range while the other one is not. The worst anomalies are the ones lying near the center of the plot, containing a discrepancy in both variables.
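A plot in the style of Figure 7 could be produced from the labelled dataframe as sketched below. The column names come from the preprocessing step, while the boolean anomaly flag (here called is_anomaly) is an assumed name for the output of the threshold classification.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_time_scatter(df: pd.DataFrame, anomaly_col: str = "is_anomaly") -> None:
    """Scatter TotalTime vs TotalTime_OFF, highlighting flagged anomalies (cf. Figure 7)."""
    normal = df[~df[anomaly_col]]
    anomalous = df[df[anomaly_col]]

    plt.scatter(normal["TotalTime"], normal["TotalTime_OFF"], s=2, c="blue", label="normal")
    plt.scatter(anomalous["TotalTime"], anomalous["TotalTime_OFF"], s=2, c="red", label="anomalous")
    # Centers of mass of the two clusters (the stars in Figure 7)
    plt.scatter(normal["TotalTime"].mean(), normal["TotalTime_OFF"].mean(),
                marker="*", s=200, c="yellow", edgecolors="black", label="normal center")
    plt.scatter(anomalous["TotalTime"].mean(), anomalous["TotalTime_OFF"].mean(),
                marker="*", s=200, c="green", edgecolors="black", label="anomaly center")
    plt.xlabel("TotalTime [s]")
    plt.ylabel("TotalTime_OFF [s]")
    plt.legend()
    plt.show()
```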
5.2. Number of disconnections

A counterintuitive result is shown in the next plots.

Figure 8: Plot of the distributions of the number of disconnections (left), plot of the average number of disconnections for each car (right).

In the left plot there are no notable results, the two distributions appear the same: the number of disconnections is evenly distributed over each entry in the dataset and in the anomalies.
When computing the distribution over the average number of disconnections for each car (right plot), the histograms show an important difference: the peak of the anomalies is lower than the peak of the dataset, meaning that the anomalous entries have a lower number of disconnections. This could cause some confusion, since a larger number is expected for this type of variable; however, analyzing the results paired with the ones obtained in the previous subsection, the lower number of disconnections reveals a longer down-time.
This result is confirmed by the following plot, where the anomalies distribution is slightly translated to the right, meaning a less uniform set of disconnections that can be caused by a long disconnection time followed by a set of short ones.

Figure 9: Plot of the distribution of the average coefficient of variation for each car.

5.3. Categorical analysis

A final analysis can be made on the categorical variables, answering the practical question: "Are there bad cars or bad clients?".
Following the previous analysis, it is possible to retrieve an overview of the presence of a single car in the anomalies by plotting the distribution of the VIDs. In Figure 11 (left), the anomalies are concentrated in the first part of the x-axis, showing a peak near 0. A set of cars that accounts for less than 10.0% of the dataset is responsible for ∼30.0% of the anomalies. A zoomed version of the same plot can be seen in Figure 11 (right), where only the cars with a VID lower than 2000 are shown. In a practical way, this result can help focusing on the subset of cars that is more present in the anomalies, saving time on the analysis.

Figure 11: Plot of the distributions of the VIDs (left), plot of the distributions of the VIDs restricted to the first 2000 vehicles (right).

From this result it is possible to derive the conclusion of the final analysis, the client analysis. In Figure 10, the biggest percentage of anomalies is covered by client number 1. The duality of the results can show that most of the first 2000 cars are assigned to the first client, a notion that may help with the understanding of the failures.

Figure 10: Plot of the distribution of the client ids.

5.4. Reproducibility

The method explained here is implemented in PyTorch [13]. All the results mentioned are perfectly reproducible by utilizing the same saved model (i.e. the same architecture with the same weights loaded in) and the same threshold. The only stochastic variable in the model is the dropout layer, which must be deactivated before the evaluation. A useful property is that the threshold is not a hyperparameter of the network, meaning that it can be changed, according to the needs of the user, after the training phase. A variation of the anomalies threshold could completely alter the result by including more or fewer entries in the anomalous set, creating a more (or less) severe detection system.
Finally, every evaluation can be made in real time (after the training), with times that vary according to the hardware and the architecture used. On an NVIDIA MX150, the evaluation over the whole dataset takes approximately 20 seconds.
It is possible that the model needs a retraining if the distributions in the dataset change in an unexpected way (e.g. logging temperatures can require a retraining between summer and winter if the entries are not enough for each season).
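A minimal sketch of how a new, unseen sample could be scored in real time is given below. It assumes the min-max statistics of the training features have been saved; the function names and the normalization details are illustrative, while the call to model.eval() reflects the point made above about deactivating the dropout layer.

```python
import torch

def score_new_sample(model: torch.nn.Module, sample: torch.Tensor,
                     feat_min: torch.Tensor, feat_max: torch.Tensor,
                     threshold: float) -> bool:
    """Normalize one raw feature vector with the training-set statistics and classify it."""
    model.eval()  # deactivates dropout: the evaluation becomes deterministic
    x = (sample - feat_min) / (feat_max - feat_min)  # same min-max scaling used for training
    with torch.no_grad():
        error = (model(x) - x).abs().sum()
    return bool(error > threshold)  # True -> anomalous
```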
6. Conclusions

The method presented in this paper has shown the ability to perform a classification of an unlabelled dataset. It is a fast way to identify outliers in datasets or in a real-time data feed (after a first training over a big enough dataset composed of recorded logs).
Its flexibility can be a great incentive to its utilization, being tied only to the use of an autoencoder, an architecture that can be expanded and customized for each eventuality. Moreover, the fact that the threshold must be imposed after the training can be exploited to increase or decrease the severity of the system in real time, following changes in the needs of the user.
However, the manual choice of a loss threshold could be an element of imprecision, considering that a slight alteration could drastically change the set of samples considered as anomalous. Another downside could be the need for a big enough dataset, since with a small one there could be difficulties in learning the right distribution. However, this is a common problem among all the autoencoder applications.
Finally, the system is sensitive to the dataset's dimensions, requiring an adequate normalization that must also be applied to real-time samples.
In conclusion, as future work the method can be expanded to automatically detect the threshold, removing the manual component that could completely change the outcome of the process.

References

[1] D. Bank, N. Koenigstein, R. Giryes, Autoencoders, CoRR abs/2003.05991 (2020). URL: https://arxiv.org/abs/2003.05991. arXiv:2003.05991.
[2] B. Nowak, R. Nowicki, M. Woźniak, C. Napoli, Multi-class nearest neighbour classifier for incomplete data handling, volume 9119, 2015, pp. 469–480.
[3] S. Russo, S. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[4] L. Gondara, Medical image denoising using convolutional denoising autoencoders, in: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), IEEE, 2016, pp. 241–246.
[5] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, M. Woźniak, A novel neural networks-based texture image processing algorithm for orange defects classification, International Journal of Computer Science and Applications 13 (2016) 45–60.
[6] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41–45.
[7] Y. Wang, H. Yao, S. Zhao, Auto-encoder based dimensionality reduction, Neurocomputing 184 (2016) 232–242.
[8] G. Capizzi, S. Coco, G. Sciuto, C. Napoli, A new iterative FIR filter design approach using a Gaussian approximation, IEEE Signal Processing Letters 25 (2018) 1615–1619.
[9] P. Malhotra, L. Vig, G. Shroff, P. Agarwal, Long short term memory networks for anomaly detection in time series, 2015.
[10] H. Zenati, M. Romain, C.-S. Foo, B. Lecouat, V. Chandrasekhar, Adversarially learned anomaly detection, in: 2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 727–736. doi:10.1109/ICDM.2018.00088.
[11] C. Zhou, R. C. Paffenroth, Anomaly detection with robust deep autoencoders, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 665–674.
[12] C. Nwankpa, W. Ijomah, A. Gachagan, S. Marshall, Activation functions: Comparison of trends in practice and research for deep learning, 2018. arXiv:1811.03378.
[13] A. Paszke et al., PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.