A Customized Approach to Anomalies Detection by using Autoencoders

Roberto Aureli (a), Nicolo' Brandizzi (a), Giorgio De Magistris (a) and Rafał Brociek (b)

(a) Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy
(b) Department of Mathematics Applications and Methods for Artificial Intelligence, Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland

SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics, July 27-29, 2021, Catania, IT
aureli.1757131@studenti.uniroma1.it (R. Aureli); brandizzi@diag.uniroma1.it (N. Brandizzi); demagistris@diag.uniroma1.it (G. De Magistris); rafal.brociek@polsl.pl (R. Brociek)
ORCID: 0000-0002-3076-4509 (G. De Magistris); 0000-0002-7255-6951 (R. Brociek)

Abstract
When dealing with sensor data, it is important to keep track of what is really happening in the monitored environments, since failures, interruptions and misreadings must be expected at any time. Especially with logging processes that produce extremely voluminous reports, an automatic method to detect entries that do not follow the normal distribution of the data (i.e. anomalies) is the ideal solution. In the presented work, the task performed by the autoencoder is to generate a reproduction error, used as a metric for the classification of a sample into one of two classes: anomalous or non-anomalous.

Keywords
Anomaly Detection, Autoencoders, Reproduction Error, Unsupervised Learning

1. Introduction

When dealing with sensor data, it is important to keep track of what is really happening in the monitored environments, since failures, interruptions and misreadings must be expected at any time. Especially with logging processes that produce extremely voluminous reports, an automatic method to detect entries that do not follow the normal distribution of the data (i.e. anomalies) is the ideal solution.
Neural networks can be used in this type of task as detectors of the distance of a sample from the natural distribution underlying the dataset.
In particular, autoencoders [1] are a type of neural network capable of compressing the input into a reduced, meaningful representation and finally decoding it back, reproducing it with the minimum error possible [2, 3]. This type of network has been used successfully for image denoising [4, 5, 6], NLP tasks and generic dimensionality reduction [7, 8]. The first use of this type of network dates back to the 80s; however, its origins and authors are unclear because of changes in nomenclature and definitions.
In this work, the reproduction error (i.e. the error between the input sample and the output of the autoencoder) over a set of samples is exploited to discriminate which samples in the given set are anomalous and which are not. Ideally, the anomalies are a minimal part of a dataset, with generally low probability of being drawn from the distribution describing the set: this scarcity implies a large reproduction error from the autoencoder. Moreover, the more gradient descent is performed over a set of inputs, the more the loss should decrease (until it hits its minimum); vice versa, if a datum is not common, its loss is greater with respect to other, well known data.
This method is shown to work over a real-life, unlabelled dataset, posing the problem in the unsupervised learning landscape.

2. Related Works

Anomaly detection tasks have already been studied and solved with neural networks exploiting the reproduction error: the difference between a generic sample and a reconstruction of itself performed by some mathematical model.
In [9], a module made up of stacked LSTM networks is trained over non-anomalous data and its prediction error over the future steps is used as an indicator of the anomaly of the sample. However, this approach needs the dataset to be labelled, increasing the work needed for its creation and the difficulty of application in real-life scenarios.
Similarly, in [10], the authors proposed a novel architecture called ALAD (Adversarially Learned Anomaly Detection), an approach based on generative adversarial networks. The GAN generates an adversarially learned set of features used to project the high-dimensional original space of the dataset into a reduced one. The reduced representations are then decoded and the reproduction error is used as an anomaly indicator.
More similar to the approach proposed in this work, but much more advanced, [11] makes use of robust techniques paired with autoencoders to detect anomalies and types of anomalies such as random corruptions, recurrent corruptions (i.e. corruptions present in more than one instance) and so on.
In the presented work, the task performed by the autoencoder is to generate a reproduction error, used as a metric for the classification of a sample into one of two classes: anomalous or non-anomalous.
3. Background and method description

3.1. Autoencoder

An autoencoder is made up of two functions: an encoder and a decoder. The goal of the encoding function is to map (i.e. to encode) the input into a different space. Symmetrically, the decoder must map the encoded vector back to the original input, without losing information. Conveniently, the encoding space is often chosen smaller in dimension than the original space, so that the autoencoder also performs a dimensionality reduction.
More formally, given an encoding function E(x) and a decoding function D(x), D(E(x)) must return the original x. The two functions mentioned above can be approximated by symmetric neural networks, solving the following optimization problem:

min_{θ_D, θ_E} ‖x − D(E(x))‖²

where θ_D and θ_E are the parameters of the respective neural networks.

3.2. Architecture

The only constraint that needs to be taken into consideration is the presence of a bottleneck, a layer smaller than all the other layers in the network, essential for the dimensionality reduction. Without a bottleneck, the network is not "forced" to ignore useless or non-representative features in the input, losing the capacity of mapping the input into a denser space.
The networks' architectures are completely adaptable to the problem. Normally, the input (e.g. time series, images) is embedded into a vectorial representation and then reduced up to the bottleneck. This representation is the projection of the input into a different space carried out by the encoder part. The latent vector (i.e. the output of the bottleneck) is then passed to a generally symmetric network, returning a representation belonging to the original input space.
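As an illustration of the encoder-bottleneck-decoder structure described above, the following is a minimal PyTorch sketch of a symmetric fully connected autoencoder (the paper's implementation is in PyTorch, see Section 5.4). The layer sizes, module names and the choice of activations are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Symmetric fully connected autoencoder with a small bottleneck.

    Dimensions are illustrative: in_dim is the size of an input sample,
    hidden_dim the hidden layer, bottleneck_dim the latent vector h.
    """
    def __init__(self, in_dim: int, hidden_dim: int = 16, bottleneck_dim: int = 2):
        super().__init__()
        # Encoder E: input space -> latent (bottleneck) space
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # Decoder D: latent space -> input space, symmetric to the encoder
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
            nn.Sigmoid(),  # inputs are assumed normalized in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)       # h = E(x), the reduced representation
        x_rec = self.decoder(h)   # x' = D(h), back in the original space
        return x_rec
```

A bottleneck smaller than the input dimension is what forces the network to discard non-representative features, as discussed above.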
3.3. Training

The training procedure is the standard one used for generic neural networks. A sample x is drawn from the dataset, sampled from the distribution p_dataset(X). The latent vector (i.e. the output of the bottleneck layer) h is generated by the encoder part of the autoencoder as h = E(x). At the end of the training, this representation can be used as an approximate reduced representation of the original input. Finally, h is used as input for the decoder D, generating x' = D(h), which belongs to the same space as x. In this case, the L1 error between x and x' is minimized to train the parameters of the network. It is crucial that the goal of the training is to guarantee the minimum difference between x and x'.

3.4. Anomalies Detection

The key concept of the method is the notion of scarcity: an anomaly, to be defined as such, needs to be a noteworthy event. In practical cases, it could be an unpredictable spike in a time series, a set of burned pixels in a photo, the saturation of a sensor, an unexpected down time of a link and so on.
It is worth noting that, when sampling an element from a distribution, it is far more probable to sample a normal entry than an anomalous one (if not, other methods must be used, or there could be some problems in the dataset). With these premises, the capacity of the autoencoder to capture a distribution and project it into another space is exploited: during the training phase, multiple epochs are performed over all the samples in the dataset, meaning that the network will experience a gradient descent over the loss generated by the same samples multiple times, but without modifying the ratio between non-anomalous and anomalous samples. Moreover, the reproduction error will be minimized over the most prominent distribution in the dataset, generating larger errors on the anomalous subsets.
As shown in Figure 1, scarcity is the fundamental parameter that separates a normal sample from an anomalous one. When the contamination proportion is inverted, the originally good samples result in a loss distributed higher than the respective anomalous ones. This example is purely demonstrative of the analysis made on the loss: a real-life dataset has a percentage of anomalies sensibly lower than the ones used here.

Figure 1: The loss distributions of an autoencoder trained on an artificially contaminated labelled dataset composed of normalized vectors of dimension 24, sampled from a multivariate Gaussian for the non-anomalous data and from a uniform distribution for the anomalous data.

The last thing needed to define an anomaly is the loss threshold: after analyzing the distribution of the autoencoder's reproduction loss on all the samples in the dataset, a threshold must be manually imposed, and a sample with a loss beyond it is considered anomalous. A compromise must be reached, since all the anomalous samples must be included without including non-anomalous ones.

3.5. Workflow

The method is fast and, with big enough datasets, a single training can be performed to analyze future, unseen events.
The preprocessing of the dataset is unavoidable: normalizing data and orders of magnitude is necessary to have a good, consistent loss analysis. Samples larger in module can significantly alter the distribution.
Then, the model is fine-tuned (i.e. autoencoder architecture, bottleneck dimensions and so on) and a standard training procedure is performed. The loss distribution is analyzed and the threshold is manually selected by the user.
Once new data arrive, there is no need to retrain the autoencoder: the sample is normalized and passed to the model, and finally it is classified according to the threshold. It may happen that long-term changes in the dataset distribution completely alter the outcome of the process, requiring a new training and threshold selection.
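To make the workflow concrete, the sketch below shows one way the training and thresholding steps could look, reusing the hypothetical Autoencoder module sketched above. The L1 loss and the threshold-based classification come from this section; the optimizer, batch size and number of epochs are illustrative assumptions, and the threshold itself is chosen by the user after inspecting the loss distribution, as described above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_autoencoder(model: nn.Module, data: torch.Tensor,
                      epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    """Minimize the L1 reproduction error over the (mostly normal) dataset."""
    loader = DataLoader(TensorDataset(data), batch_size=256, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    model.train()
    for _ in range(epochs):
        for (batch,) in loader:
            optimizer.zero_grad()
            loss = criterion(model(batch), batch)
            loss.backward()
            optimizer.step()
    return model

def reproduction_errors(model: nn.Module, data: torch.Tensor) -> torch.Tensor:
    """Per-sample L1 reproduction error, used as the anomaly score."""
    model.eval()
    with torch.no_grad():
        return (model(data) - data).abs().sum(dim=1)

def classify(model: nn.Module, data: torch.Tensor, threshold: float) -> torch.Tensor:
    """Boolean mask: True where the sample is flagged as anomalous."""
    return reproduction_errors(model, data) > threshold
```

In practice, the threshold passed to classify would be picked manually from the histogram of the per-sample errors, as discussed in Section 3.4.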
4. Experiment

4.1. Dataset

The dataset is composed of information gathered from real car sensors. Each car is assigned to a client that can have multiple fleets and, every day that the car has been used, the total time of connection to the network is logged, saving the duration in seconds, the start date and the end date of the session. In case of disconnections during the same day, multiple entries will describe the multiple sessions.

• VID: an integer identifying a single car in a unique way.
• First Ping: a date in the format yyyy-dd-mm representing the day of the session.
• TotalTime: seconds (integer) elapsed from the first connection to the last disconnection, the same for all the sessions of the day.
• start: timestamp representing the start of the session.
• end: timestamp representing the end of the session.
• env: an integer identifying the client owning the car in a unique way.
• Service: an integer representing the fleet of the client, unique for the given client.

4.2. Preprocessing

The structure of the dataset resembles a time series in which, eventually, each day is made up of more entries for the same car. To tackle this complexity, a reduction to a single entry for each car in a given day is performed, in such a way as to get a single point for each pair (VID, First Ping) (First Ping is then dropped at the end of the preprocessing phase). However, in order not to lose any important information, some fields have been added to record what is implicit in the original dataset, while VID, TotalTime, env and Service are kept the same. Since start and end are removed, TotalTime_OFF has been added to record the total time in which the car has been disconnected from the network. Moreover, n_disconnections is added to record the number of sessions in a single day for the given car. The final attribute added is C_v_off_time, namely the coefficient of variation of the down-times between each session, an index of dispersion. The coefficient is defined as:

C_v(x) = σ(x) / μ(x)    (1)

where x is a set of data, σ(x) is the standard deviation of x and μ(x) is its mean. Practically, a higher C_v means that the data are unbalanced, implying a big difference across the elements of the set. For example, C_v(10, 10, 10, 10, 10) = 0, meaning no dispersion in the data, while C_v(35, 5, 5, 2, 3) = 1.4, evidencing a set of data more scattered around the mean.
The meaning given to the coefficient of variation can vary based on the needs: in some cases it is better to have short disconnections than a long one, since it could be easier to deduce missing locations or missing data, therefore preferring lower C_v values. Contrary to the last sentence, a C_v near 0 could also represent a set of long disconnection times followed by small ones: this value alone is not enough to get an idea of how a link is performing, since it does not contain information about the quantities in the set, only a normalized index of dispersion.
After the preprocessing, the dataset has been reduced from 573064 entries to 99386. As for the training process, the autoencoder is trained over TotalTime, TotalTime_OFF, C_v_off_time and n_disconnections, normalized between 0 and 1. This choice is justified by the fact that the remaining variables are categorical ones, hence completely arbitrary values used only for identification purposes.
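A possible sketch of this aggregation step is given below, using pandas. The column names are the ones listed in Section 4.1, but the exact aggregation rules are assumptions: in particular, TotalTime_OFF is computed here as the sum of the gaps between consecutive sessions of the same day, and start/end are assumed to be parsed as datetimes.

```python
import numpy as np
import pandas as pd

def coefficient_of_variation(x: np.ndarray) -> float:
    """C_v = sigma / mu, using the sample standard deviation as in the worked example above."""
    if len(x) < 2 or np.mean(x) == 0:
        return 0.0
    return float(np.std(x, ddof=1) / np.mean(x))

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the per-session log into one row per (VID, First Ping) pair."""
    df = df.sort_values("start")

    def aggregate(group: pd.DataFrame) -> pd.Series:
        # Down-times between consecutive sessions of the same car on the same day
        off_times = (group["start"].shift(-1) - group["end"]).dropna().dt.total_seconds()
        return pd.Series({
            "TotalTime": group["TotalTime"].iloc[0],     # identical for all sessions of the day
            "TotalTime_OFF": off_times.sum(),            # assumed definition of the off-time
            "n_disconnections": len(group),
            "C_v_off_time": coefficient_of_variation(off_times.to_numpy()),
            "env": group["env"].iloc[0],
            "Service": group["Service"].iloc[0],
        })

    return df.groupby(["VID", "First Ping"]).apply(aggregate).reset_index()
```

The four numerical columns would then be min-max normalized to [0, 1] before training, as stated above.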
4.3. Model architecture and Training

The simplicity of the reduced dataset permits the use of a fully connected network. A hidden layer (for each network) is enough to capture the dimensionality, ending in a bidimensional bottleneck. The nonlinearity is introduced by a ReLU function and a Sigmoid [12] at the output of the two networks. To ease the training process and avoid possible local-minima situations, a dropout layer is used in order to randomly drop to 0 the weights of the network with a probability of 25%. The associated loss is an L1 distance defined as d(x, y) = Σ_{i=1}^{m} |x_i − y_i|, where x and y are vectors of length m.
The training process is supervised by an early stopping mechanism, keeping the best model before reaching a situation of overfit. The performance is tracked by computing the loss over a set of unseen samples, extracted with a proportion of 30% from the original dataset.

Figure 2: Autoencoder's layers and dimensions.

Figure 3: Training profile of the autoencoder's losses: there isn't a significant overfit in the last part.

4.4. Training results

As mentioned before, after the training the loss distribution must be analyzed. The histogram in Figure 4 shows a decreasing trend before 0.10, then a plateau that drops at 0.25, followed by some sparse samples. The quality of the training is confirmed by the peak of the loss distribution near 0, evidencing an overall low loss.
By an empirical choice, the threshold is set at 0.10, where 1803 anomalies are found, 1.84% of the whole dataset.

Figure 4: Histogram showing the distribution of the loss over all the dataset. Logarithmic scale on the y-axis.
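As a sketch of how this analysis could be reproduced, the snippet below builds the log-scale histogram of the per-sample reproduction errors and counts the entries above a chosen threshold. The input array would come from something like the hypothetical reproduction_errors helper sketched in Section 3; the 0.10 default mirrors the empirical choice made here, but it remains a user decision, not a fixed value of the method.

```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_threshold(errors: np.ndarray, threshold: float = 0.10) -> int:
    """Plot the loss histogram (log-scale y-axis) and count samples above the threshold."""
    plt.hist(errors, bins=100)
    plt.yscale("log")  # the log scale makes the rare, high-loss tail visible
    plt.xlabel("reproduction error (L1)")
    plt.ylabel("number of samples")
    plt.axvline(threshold, color="red", linestyle="--", label=f"threshold = {threshold}")
    plt.legend()
    plt.show()

    n_anomalies = int((errors > threshold).sum())
    print(f"{n_anomalies} anomalies ({100 * n_anomalies / len(errors):.2f}% of the dataset)")
    return n_anomalies
```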
5. Results

The following results are extracted from the dataset after the classification process mentioned before: each sample is expanded with the associated, previously removed, categorical variables, in order to contextualize the results in the dataset domain.
Each analysis starts from the comparison of the distributions of the two classes: anomalous and non-anomalous data. In each plot, the blue distribution represents the whole dataset (labels on the left y-axis), while the orange one shows the anomalies (labels on the right y-axis).

5.1. Time anomalies

As expected, the notion of anomaly in a connection-to-network domain is in direct correlation with the duration of the down-time.

Figure 5: Distribution of the Total Time.

Figure 5 shows a blue peak of the distribution at the maximum admissible value (i.e. 86400 seconds in a day), followed by a peak of the anomalies in the same position. The interesting part of the plot is where the anomalies are distributed more than the dataset itself, hence at the lower values of the x-axis. As expected, an important number of entries with a low total connection time is classified as anomalous.
Symmetrically, the plot of the down-time shows a similar situation:

Figure 6: Distribution of the Total Time of disconnection.

Also in this case, there is an important anomalies' peak near the dataset distribution, followed by many samples on the right part of the x-axis. An anomalous entry is also described by a high disconnection time.
To better see the motivation of the overlapping peaks, a new type of plot can be introduced:

Figure 7: Scattered representation of each entry with TotalTime and TotalTime_OFF as coordinates for each point. Non-anomalous points in blue, anomalous points in red.

In this plot each entry is scattered on a plane, with TotalTime on the x-axis and TotalTime_OFF on the y-axis. An ideal entry has the maximum TotalTime and the minimum TotalTime_OFF, placing itself in the lower right corner of the plot.
The yellow star represents the center of mass of the non-anomalous distribution (in blue), very near to the ideal point, while the green star represents the center of mass of the anomalies distribution. There are ∼80k blue points and only ∼2k red ones, evidencing a big difference in concentration. It is worth noting that a different threshold would have moved the frontier between the two clusters up or down.
The reason for the overlapping peaks lies in the fact that a big number of anomalies is on the vertical line at the maximum of TotalTime (i.e. x = 86400) and on the horizontal line at the minimum of TotalTime_OFF (i.e. y = 0), meaning that one variable is in a good range while the other one is not. The worst anomalies are the ones lying near the center of the plot, containing a discrepancy in both variables.
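A plot in the style of Figure 7 could be produced from the labelled dataframe as sketched below. The column names come from the preprocessing step, while the boolean anomaly flag (here called is_anomaly) is an assumed name for the output of the threshold classification.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_time_scatter(df: pd.DataFrame, anomaly_col: str = "is_anomaly") -> None:
    """Scatter TotalTime vs TotalTime_OFF, highlighting flagged anomalies (cf. Figure 7)."""
    normal = df[~df[anomaly_col]]
    anomalous = df[df[anomaly_col]]

    plt.scatter(normal["TotalTime"], normal["TotalTime_OFF"], s=2, c="blue", label="normal")
    plt.scatter(anomalous["TotalTime"], anomalous["TotalTime_OFF"], s=2, c="red", label="anomalous")
    # Centers of mass of the two clusters (the stars in Figure 7)
    plt.scatter(normal["TotalTime"].mean(), normal["TotalTime_OFF"].mean(),
                marker="*", s=200, c="yellow", edgecolors="black", label="normal center")
    plt.scatter(anomalous["TotalTime"].mean(), anomalous["TotalTime_OFF"].mean(),
                marker="*", s=200, c="green", edgecolors="black", label="anomaly center")
    plt.xlabel("TotalTime [s]")
    plt.ylabel("TotalTime_OFF [s]")
    plt.legend()
    plt.show()
```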
5.2. Number of disconnections

A counterintuitive result is shown in the next plots.

Figure 8: Plot of the distributions of the number of disconnections (left), plot of the average number of disconnections for each car (right).

In the left plot there are no notable results, the two distributions appear the same: the number of disconnections is evenly distributed over each entry in the dataset and in the anomalies.
When computing the distribution over the average number of disconnections for each car (right plot), the histograms show an important difference: the peak of the anomalies is lower than the peak of the dataset, meaning that the anomalous entries have a lower number of disconnections. This could cause some confusion, since a larger number is expected for this type of variable; however, analyzing the results paired with the ones obtained in the previous subsection, the lower number of disconnections reveals a longer down-time.
This result is confirmed by the following plot, where the anomalies distribution is slightly translated to the right, meaning a less uniform set of disconnections that can be caused by a long disconnection time followed by a set of short ones.

Figure 9: Plot of the distribution of the average coefficient of variation for each car.

5.3. Categorical analysis

A final analysis can be made on the categorical variables, answering the practical question: "Are there bad cars or bad clients?".
Following the previous analysis, it is possible to retrieve an overview of the presence of a single car in the anomalies by plotting the distribution of the VIDs. In Figure 11 (left), the anomalies are concentrated in the first part of the x-axis, showing a peak near 0. A set of cars that accounts for less than 10.0% of the dataset is responsible for ∼30.0% of the anomalies. A zoomed version of the same plot can be seen in Figure 11 (right), where only the cars with a VID lower than 2000 are shown. In a practical way, this result can help focusing on the subset of cars that is more present in the anomalies, saving time on the analysis.

Figure 11: Plot of the distributions of the VIDs (left), plot of the distributions of the VIDs restricted to the first 2000 vehicles (right).

From this result it is possible to derive the conclusion of the final analysis, the client analysis. In Figure 10, the biggest percentage of anomalies is covered by client number 1. The duality of the results can show that most of the first 2000 cars are assigned to the first client, a notion that may help with the understanding of the failures.

Figure 10: Plot of the distribution of the client ids.

5.4. Reproducibility

The method explained here is implemented in PyTorch [13]. All the results mentioned are perfectly reproducible by utilizing the same saved model (i.e. the same architecture with the same weights loaded in) and the same threshold. The only stochastic variable in the model is the dropout layer, which must be deactivated before the evaluation. A useful property is that the threshold is not a hyperparameter of the network, meaning that it can be changed, according to the needs of the user, after the training phase. A variation of the anomalies threshold could completely alter the result by including more or fewer entries in the anomalous set, creating a more (or less) severe detection system.
Finally, every evaluation can be made in real time (after the training), with times that vary according to the hardware and the architecture used. On an NVIDIA MX150, the evaluation over the whole dataset takes approximately 20 seconds.
It is possible that the model needs a retraining if the distributions in the dataset change in an unexpected way (e.g. logging temperatures can require a retraining between summer and winter if the entries are not enough for each season).
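A minimal sketch of how a new, unseen sample could be scored in real time is given below. It assumes the min-max statistics of the training features have been saved; the function names and the normalization details are illustrative, while the call to model.eval() reflects the point made above about deactivating the dropout layer.

```python
import torch

def score_new_sample(model: torch.nn.Module, sample: torch.Tensor,
                     feat_min: torch.Tensor, feat_max: torch.Tensor,
                     threshold: float) -> bool:
    """Normalize one raw feature vector with the training-set statistics and classify it."""
    model.eval()  # deactivates dropout: the evaluation becomes deterministic
    x = (sample - feat_min) / (feat_max - feat_min)  # same min-max scaling used for training
    with torch.no_grad():
        error = (model(x) - x).abs().sum()
    return bool(error > threshold)  # True -> anomalous
```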
6. Conclusions

The method presented in this paper has shown the ability to perform a classification of an unlabelled dataset. It is a fast way to identify outliers in datasets or in a real-time data feed (after a first training over a big enough dataset composed of recorded logs).
Its flexibility can be a great incentive to its utilization, being tied only to the use of an autoencoder, an architecture that can be expanded and customized for each eventuality. Moreover, the fact that the threshold must be imposed after the training can be exploited to increase or decrease the severity of the system in real time, following changes in the needs of the user.
However, the manual choice of a loss threshold could be an element of imprecision, considering that a slight alteration could drastically change the set of samples considered as anomalous. Another downside could be the need for a big enough dataset, since with a small one there could be difficulties in learning the right distribution. However, this is a common problem among all the autoencoder applications.
Finally, the system is sensitive to the dataset's dimensions, requiring an adequate normalization that must also be applied to real-time samples.
In conclusion, as future work the method can be expanded to automatically detect the threshold, removing the manual component that could completely change the outcome of the process.

References

[1] D. Bank, N. Koenigstein, R. Giryes, Autoencoders, CoRR abs/2003.05991 (2020). URL: https://arxiv.org/abs/2003.05991. arXiv:2003.05991.
[2] B. Nowak, R. Nowicki, M. Woźniak, C. Napoli, Multi-class nearest neighbour classifier for incomplete data handling, volume 9119, 2015, pp. 469–480.
[3] S. Russo, S. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated oncological patients by means of decision trees, volume 2768, 2020, pp. 46–53.
[4] L. Gondara, Medical image denoising using convolutional denoising autoencoders, in: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), IEEE, 2016, pp. 241–246.
[5] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, M. Woźniak, A novel neural networks-based texture image processing algorithm for orange defects classification, International Journal of Computer Science and Applications 13 (2016) 45–60.
[6] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, volume 2768, 2020, pp. 41–45.
[7] Y. Wang, H. Yao, S. Zhao, Auto-encoder based dimensionality reduction, Neurocomputing 184 (2016) 232–242.
[8] G. Capizzi, S. Coco, G. Sciuto, C. Napoli, A new iterative FIR filter design approach using a Gaussian approximation, IEEE Signal Processing Letters 25 (2018) 1615–1619.
[9] P. Malhotra, L. Vig, G. Shroff, P. Agarwal, Long short term memory networks for anomaly detection in time series, 2015.
[10] H. Zenati, M. Romain, C.-S. Foo, B. Lecouat, V. Chandrasekhar, Adversarially learned anomaly detection, in: 2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 727–736. doi:10.1109/ICDM.2018.00088.
[11] C. Zhou, R. C. Paffenroth, Anomaly detection with robust deep autoencoders, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 665–674.
[12] C. Nwankpa, W. Ijomah, A. Gachagan, S. Marshall, Activation functions: Comparison of trends in practice and research for deep learning, 2018. arXiv:1811.03378.
[13] A. Paszke et al., PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.