<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3292500.3330871</article-id>
      <title-group>
        <article-title>On the Environmental Impact of the Algorithm LatentOut for Unsupervised Anomaly Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabrizio Angiulli</string-name>
          <email>f.angiulli@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Fassetti</string-name>
          <email>f.fassetti@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Ferragina</string-name>
          <email>luca.ferragina@unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIMES Dept., University of Calabria</institution>
          ,
          <addr-line>87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>13601</volume>
      <fpage>10</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Because of their astonishing performances, Deep Neural Network-based approaches have become pervasive in many human activities. However, they often require a long, energy-intensive training phase, which has a huge environmental impact. In recent years, there has been a significant increase in the emphasis placed on environmental themes across various sectors, driven by growing concerns over climate change and sustainability. This heightened focus has led to many initiatives, policies and discussions aimed at addressing ecological challenges and promoting a more sustainable future. For the reasons stated above, Deep Learning cannot be exempted from such initiatives, and the literature is starting to pay attention to these issues. This paper aims at contributing to this field, in particular concerning the Anomaly Detection task, whose environmental impact, due to its widespread employment, deserves to be addressed.</p>
      </abstract>
      <kwd-group>
        <kwd>Anomaly Detection</kwd>
        <kwd>Variational Autoencoder</kwd>
        <kwd>Carbon Footprint</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <p>In this work, we compare LatentOut with other Anomaly Detection Neural Network-based methods, and we highlight that it is the one that obtains the best results in terms of a balance between high accuracy performance and low carbon footprint.</p>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Anomalies can be defined as examples that deviate from the majority of the data so significantly as to raise the suspicion that they were generated by a different mechanism. Anomaly Detection represents a fundamental task in many human activities, including Healthcare, Cyber-security, Industrial Monitoring, Fraud Detection, and many others.</p>
      <p>
        It is possible to identify three different types of settings for Anomaly Detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the Supervised setting, a dataset whose items are labeled as normal or abnormal is available to build a classifier; typically, the dataset is highly unbalanced and the anomalies form a rare class. The Semi-supervised setting, also called one-class, is characterized by the presence in input of only examples from the normal class, which are used to train the detector. In the Unsupervised setting, the goal is to assign an anomaly score to each object of the input dataset in order to find the anomalies in it.
      </p>
      <p>
        Classical data mining and machine learning algorithms performing the task of detecting outliers
include statistical-based [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], distance-based [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ], density-based [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], reverse nearest-neighbor-based [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ], SVM-based [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], and many others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Recently, the approaches that have achieved the most success have been those based on deep learning
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which can be divided into three main families: reconstruction error-based methods employing
Autoencoders (AE), models based on Generative Adversarial Networks (GAN), and SVM-like neural
architectures.
      </p>
      <p>
        At the basis of the application of Autoencoders (AE) and Variational Autoencoders (VAE) [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">15, 16, 14</xref>
        ] to
Anomaly Detection lies the concept of reconstruction error. More in detail, (Variational) Autoencoders
are trained to map data into a low dimensional latent space and then map them back into the original
space generating in output a reconstruction of the input as similar as possible to it. Since the majority
of the data used for training models belongs to the normal class, it is assumed that these networks are
able to reconstruct the inliers better than the outliers and, thus, the reconstruction error can be adopted
as an anomaly score.
      </p>
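<p>The reconstruction-error principle described above can be illustrated with a linear autoencoder obtained via PCA; this is a sketch under our own assumptions (the paper's methods use deep (V)AEs, and all names and data below are illustrative).</p>
<preformat>
```python
import numpy as np

def fit_linear_autoencoder(X, latent_dim):
    """Fit a linear 'autoencoder' via PCA: the encoder projects onto
    the top principal axes, the (tied) decoder maps back."""
    mu = X.mean(axis=0)
    # SVD of the centered data yields the principal directions
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:latent_dim]

def reconstruction_scores(X, mu, W):
    """Anomaly score = squared L2 reconstruction error per point."""
    Z = (X - mu) @ W.T          # encode into the latent space
    X_hat = mu + Z @ W          # decode back into the input space
    return ((X - X_hat) ** 2).sum(axis=1)

# toy data: inliers lie close to a 2-D plane embedded in 5-D space
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))
inliers = rng.normal(size=(200, 2)) @ A + 0.01 * rng.normal(size=(200, 5))
outlier = 3.0 * rng.normal(size=(1, 5))   # generic point, off the plane
X = np.vstack([inliers, outlier])

mu, W = fit_linear_autoencoder(X, latent_dim=2)
scores = reconstruction_scores(X, mu, W)
# the off-plane point (index 200) reconstructs worst
```
</preformat>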
      <p>
        GAN-based models [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20">17, 18, 19, 20</xref>
        ] basically consist in the combined, adversarial training of two
sub-architectures, the generator and the discriminator. Specifically, the generator network produces
artificial anomalies as realistic as possible, and the discriminator assigns an anomaly score to each item.
      </p>
      <p>
        SVM-like methods [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ] leverage the idea of enclosing normal data into a hypersphere employing
a One-Class SVM-like loss function combined with a deep neural architecture. A slightly different approach, which can be included in this family, is introduced in [24], where the architecture presents an additional final layer composed of just one neuron producing an anomaly score that, for anomalies, is as far as possible from a reference value obtained as the average of the anomaly scores of randomly sampled normal items.
      </p>
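<p>The hypersphere idea underlying these methods can be sketched as follows; here the neural embedding is omitted (the data are used as their own features), so this is only an illustration of a One-Class-style score under our own assumptions, not of any specific method above.</p>
<preformat>
```python
import numpy as np

def hypersphere_scores(train_embed, test_embed):
    """One-Class-style anomaly score: squared distance from the center
    c of the (normal) training embeddings; points far from c fall
    outside the hypersphere enclosing the normal data."""
    c = train_embed.mean(axis=0)   # hypersphere center
    return ((test_embed - c) ** 2).sum(axis=1)

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, size=(500, 3))
queries = np.array([[0.0, 0.0, 0.0],    # near the center: low score
                    [6.0, 6.0, 6.0]])   # far away: high score
scores = hypersphere_scores(normal, queries)
```
</preformat>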
      <p>Moreover, in [25] Deep Isolation Forest (DIF) has been introduced, a novel methodology that utilizes randomly initialized neural networks to map the original data into random representation ensembles, where random axis-parallel cuts are subsequently applied to partition the data.</p>
      <p>Nevertheless, the high accuracy and training speed of Deep Learning models come at the cost of large power and energy consumption. This is leading researchers to become aware of the environmental impact of deep neural architectures, by trading off accuracy against energy consumption and also by performing characterizations in terms of performance, power, and energy to guide the architecture design of DNN models [26, 27, 28, 29].</p>
      <p>This paper aims to provide a contribution in this direction and, in particular, to the field of Anomaly Detection, by analyzing the behaviour of recent methods from the point of view of detection performance as well as from the point of view of their carbon footprint. Specifically, we focus on the LatentOut algorithm [30, 31, 32, 33], an anomaly detection framework that can be applied on top of any deep neural architecture, used as a baseline, to obtain a refined score; we compare it with the baseline architecture on which it is applied and with deep learning-based competitors from the other families.</p>
    </sec>
    <sec id="sec-2b">
      <title>2. The LatentOut algorithm for Unsupervised Anomaly Detection</title>
      <p>Due to the quite good performances they obtain, as well as their versatility, the approaches based on (Variational) Autoencoders have become the most widespread Anomaly Detection methods relying on Deep Neural Networks.</p>
      <p>The main issue with these models is that they often generalize so well that they accurately reconstruct anomalies too [30], thus worsening the capability of the reconstruction error to detect anomalies.</p>
      <p>In [31] LatentOut is introduced: a methodology that combines the reconstruction error and the latent space distribution of the Variational Autoencoder in order to obtain a refined anomaly score. Specifically, the first variant of the LatentOut algorithm (Figure 1) considers the enlarged feature space ℱ = ℒ × E, where ℒ represents the latent space and E is the reconstruction error space (usually E ⊆ ℝ), and performs a k-NN density estimation in the space ℱ.</p>
      <p>In Figure 1 the complete workflow of LatentOut is shown. Each point x ∈ X of the dataset is mapped into the latent space ℒ of the VAE (blue points represent inliers, red ones represent anomalies) by means of the encoder, and then reconstructed back into the original space as x̂ ∈ X by means of the decoder. Then, the reconstruction error E(x) = ‖x − x̂‖²₂ is computed, the feature space ℱ = ℒ × E is created, and the k-NN density estimation is performed in it to compute the LatentOut anomaly score.</p>
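<p>Once the latent coordinates and the reconstruction errors have been computed by the network, the workflow above reduces to a density estimation in the augmented space. The following sketch is our own illustrative code, with the distance to the k-th nearest neighbor standing in for the k-NN density estimate:</p>
<preformat>
```python
import numpy as np

def latentout_like_scores(Z, errors, k):
    """Build the augmented feature space F = latent x error and score
    each point by its distance to the k-th nearest neighbor in F
    (a simple inverse-density estimate; sparser regions score higher)."""
    F = np.hstack([Z, errors.reshape(-1, 1)])
    # pairwise squared Euclidean distances in the augmented space
    D = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    # column 0 of each sorted row is the distance of a point to itself
    return np.sort(D, axis=1)[:, k]

rng = np.random.default_rng(1)
Z = rng.normal(0.0, 0.05, size=(100, 2))      # inlier latent coords
errors = rng.normal(0.10, 0.01, size=100)     # inlier recon. errors
# an anomaly: ordinary latent position but noticeably larger error
Z_all = np.vstack([Z, [[0.0, 0.0]]])
errors_all = np.append(errors, 0.5)
scores = latentout_like_scores(Z_all, errors_all, k=10)
# the anomaly (index 100) lies in the sparsest region of F
```
</preformat>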
      <p>The motivation behind this procedure is based on the observation that anomalies tend to lie in the sparsest regions of the augmented feature space ℱ. This happens because, even when their reconstruction error is not exceptionally large, it is still significantly larger than that of their most similar normal items.</p>
      <p>In [32] LatentOut has been extended so that it can potentially be applied to any neural architecture that has three fundamental properties:
• it outputs an anomaly score,
• it has a latent space ℒ,
• it performs a mapping from the original data space X to ℒ through an encoder-shaped module.
In particular, the neural models on which LatentOut has actually been tested are AE, VAE, GANomaly, Fast-AnoGAN, SO-GAAL, and MO-GAAL.</p>
      <p>Moreover, in [33] it has been shown that the separation properties of the enlarged space ℱ allow any generic anomaly score (not only the k-NN) to perform better when applied on it than on the input data space X.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental results</title>
      <sec id="sec-3-1">
        <title>3.1. Experimental setup</title>
        <p>In our experiments we consider the tabular datasets cardio, letter, lympho, mammography, pendigits,
pima, satellite, satimage-2, speech, thyroid, from the ODDS repository [34] as well as the image datasets
MNIST [35], Fashion-MNIST [36], and CIFAR10 [37].</p>
        <p>The last three datasets (differently from the ones from the ODDS repository) are multi-class; thus, to make them suitable for the anomaly detection task, we adopt a one-vs-all strategy, meaning that we consider one class as normal and we randomly sample a fixed number m of items from each other class. If not otherwise stated, we set m = 10. Specifically, we select the class “0” as normal for the MNIST dataset, the class “Sandal” for Fashion-MNIST, and the class “deer” for CIFAR-10.</p>
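<p>The one-vs-all construction just described can be sketched as follows (the function name and the symbol m are ours):</p>
<preformat>
```python
import numpy as np

def one_vs_all(X, y, normal_class, m, seed=0):
    """Keep every item of `normal_class` as normal (label 0) and draw
    m random items from each other class as anomalies (label 1)."""
    rng = np.random.default_rng(seed)
    mask = y == normal_class
    parts, labels = [X[mask]], [np.zeros(int(mask.sum()), dtype=int)]
    for c in np.unique(y):
        if c == normal_class:
            continue
        idx = rng.choice(np.flatnonzero(y == c), size=m, replace=False)
        parts.append(X[idx])
        labels.append(np.ones(m, dtype=int))
    return np.vstack(parts), np.concatenate(labels)

# toy example: 3 classes of 10 items each, class 0 normal, m = 3
X = np.arange(30.0).reshape(30, 1)
y = np.repeat([0, 1, 2], 10)
X_ad, y_ad = one_vs_all(X, y, normal_class=0, m=3)
# X_ad holds 10 normal items plus 3 anomalies from each other class
```
</preformat>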
        <p>As for the implementation details of the algorithm, we consider the original version of LatentOut with the VAE as baseline architecture, and the k-NN with k = 50 as estimator of the density of the feature space ℱ. The latent space dimension ℓ of the VAE is set to ℓ = 2 for the tabular ODDS datasets and to ℓ = 32 for the image datasets. As for the encoder structure (the decoder is symmetric to it), we adopt the same strategy used in [33], i.e. we insert hidden layers of dimension ⌊d/4<sup>i</sup>⌋ between the input d-dimensional space and the ℓ-dimensional latent space, for each i ∈ ℕ⁺ such that ⌊d/4<sup>i</sup>⌋ &gt; ℓ.</p>
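<p>Assuming the rule reads "insert a hidden layer of dimension ⌊d/4<sup>i</sup>⌋ for each i ∈ ℕ⁺ such that ⌊d/4<sup>i</sup>⌋ &gt; ℓ", the encoder layout can be computed as follows (the function name is ours):</p>
<preformat>
```python
def encoder_dims(d, latent_dim):
    """Hidden-layer sizes between the d-dimensional input and the
    latent space: floor(d / 4**i) for i = 1, 2, ... while the value
    stays strictly above the latent dimension (decoder is symmetric)."""
    dims, i = [], 1
    while d // 4 ** i > latent_dim:
        dims.append(d // 4 ** i)
        i += 1
    return dims

# e.g. an MNIST-like input (d = 784) with latent dimension 32
print(encoder_dims(784, 32))   # -> [196, 49]
# a tabular dataset with d = 21 attributes and latent dimension 2
print(encoder_dims(21, 2))     # -> [5]
```
</preformat>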
        <p>The CO₂ emissions are estimated by means of the Python library CodeCarbon [38], which bases its tracking on the power consumption and on the geographic location where the code is executed.</p>
      </sec>
      <sec id="sec-3-2a">
        <title>3.2. Evolution of performance and emissions of LatentOut and VAE during training</title>
        <p>The energy consumption of any Deep Learning model is related to the training phase and, in particular, to the number of training epochs.</p>
        <p>Therefore, it is of crucial importance to understand the behavior of these algorithms as the training proceeds, in order to optimize the trade-off between maximizing the detection performance and minimizing the energy consumption.</p>
        <p>The quantity of CO₂ produced by LatentOut, which we denote by ℰ<sub>LatentOut</sub>, is fundamentally constituted by two terms:
• the emissions ℰ<sub>VAE</sub> needed for the training of the architecture and the computation of its score, which is shared with the Variational Autoencoder,
• the emissions ℰ<sub>k-NN</sub> needed for the building of the feature space ℱ and the computation of the k-NN algorithm in it.</p>
        <p>Since the two operations are carried out in sequence and independently of each other, we have that ℰ<sub>LatentOut</sub> = ℰ<sub>VAE</sub> + ℰ<sub>k-NN</sub>, which means that, for an equal number of training epochs, the carbon footprint of LatentOut is always greater than the one of the Variational Autoencoder. Thus, for a fair comparison, we train the Variational Autoencoder for 100 epochs and we stop the training earlier when evaluating the LatentOut score.</p>
        <p>In Figures 2, 3, 4, we show the performances of both LatentOut (in orange) and the standard Variational Autoencoder (in blue) in terms of Area Under the ROC Curve (AUC) as the training proceeds. Observe that the horizontal axis reports the CO₂ emissions, which means that, for the reasons stated above, each value of the AUC of LatentOut is obtained with fewer epochs than the corresponding value of the VAE.</p>
        <p>As we can see, in almost every plot the curve of LatentOut is placed above the curve of the VAE. Moreover, the trend of LatentOut is much more regular than the one of the VAE (see in particular the plots of the datasets cardio, mammography, satellite, satimage-2, mnist, cifar). This implies that, if we fix a threshold on the amount of CO₂ we want to emit, the score of LatentOut always outperforms the standard score of the VAE. In other words, LatentOut is able to exploit the emissions produced better than the standard architecture on which it is applied.</p>
        <p>This happens because, as the training proceeds, the reconstruction capabilities of the VAE improve so much that at some point it becomes able to reconstruct also the outliers, thus lowering the anomaly detection performances of the model. On the other hand, LatentOut benefits from the organization of the latent space, which produces a progressively better separation between normal examples and anomalies in the feature space ℱ.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Comparison with competitors</title>
        <p>
          We consider as competitors some of the neural network algorithms implemented in the Python library
PyOD [39], namely Deep-SVDD [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], from the SVM-like family, AnoGAN [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and ALAD [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], from
the GAN family, and DIF [25]. For the implementation details (number of layers and neurons, training epochs, learning rate, potential hyperparameters), we refer to the default values fixed in PyOD. As for LatentOut, we consider again the setup described in Section 3.1 and we perform a few-epochs training, motivated by the good convergence properties observed in the previous section. Specifically, the VAE is trained for 15 epochs.
        </p>
        <p>As evaluation metrics we adopt the standard Area Under the ROC Curve (AUC) and the ratio CO₂/AUC between the emissions of CO₂ produced for the training and the inference of a model and the obtained AUC. This last value is a measure combining both performance and energy consumption; indeed, it indicates how much CO₂ is needed (on average) to obtain a single percentage point of AUC.</p>
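<p>As a sketch, the combined metric amounts to the following (the function name is ours; the emission units are those reported by the tracker):</p>
<preformat>
```python
def co2_per_auc_point(emissions, auc):
    """Combined metric: emitted CO2 divided by the AUC expressed in
    percentage points, i.e. the average emission cost of one AUC
    point (lower is better)."""
    return emissions / (100.0 * auc)

# two models with the same AUC: halving the emissions halves the cost
a = co2_per_auc_point(2.0, 0.90)
b = co2_per_auc_point(1.0, 0.90)
```
</preformat>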
        <p>Table 1 shows the results in terms of AUC. As we can see, LatentOut is the best method on half of the datasets, achieving performances close to the best also on the other half. In particular, confirming the observation made in [31], LatentOut is especially effective on higher dimensional, structured data (for example speech and the image datasets). Table 2 shows the results of the experiment in terms of the ratio CO₂/AUC. Here, LatentOut outperforms its competitors on all but one dataset, exhibiting the best trade-off between the performances obtained and the emissions of CO₂ produced.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we have focused on the algorithm LatentOut for unsupervised anomaly detection, in order to evaluate its performances and measure the environmental impact of its executions. When compared to the standard architecture on which it is applied, i.e. the Variational Autoencoder, LatentOut shows that a low-energy-consumption training can lead it to conspicuously better results. Moreover, in comparison with other neural network-based anomaly detection approaches, it has shown superior performances both in terms of absolute AUC and, most importantly, in terms of the ratio between the emitted CO₂ and the AUC obtained.</p>
      <p>As future development, we intend to expand the discussion about the environmental impact of LatentOut by including a deeper analysis of all its variants and an investigation specialized on the hardware type (e.g., CPU vs. GPU), as well as to propose novel measures to better capture the trade-off between emissions and performances. Finally, as a more ambitious goal, we aim at introducing a mechanism enabling LatentOut to take the green-aware aspect into account at training time.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 9
Green-aware AI, under the NRRP MUR program funded by the NextGenerationEU.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ruff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kaufmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Vandermeulen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Montavon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Samek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kloft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>A unifying review of deep and shallow anomaly detection</article-title>
          ,
          <source>Proc. IEEE</source>
          <volume>109</volume>
          (
          <year>2021</year>
          )
          <fpage>756</fpage>
          -
          <lpage>795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gather</surname>
          </string-name>
          ,
          <article-title>The identification of multiple outliers</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          <volume>88</volume>
          (
          <year>1993</year>
          )
          <fpage>782</fpage>
          -
          <lpage>792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Knorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tucakov</surname>
          </string-name>
          ,
          <article-title>Distance-based outlier: algorithms and applications</article-title>
          ,
          <source>VLDB Journal 8</source>
          (
          <year>2000</year>
          )
          <fpage>237</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Angiulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pizzuti</surname>
          </string-name>
          ,
          <article-title>Outlier mining in large high-dimensional data sets</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>2</volume>
          (
          <year>2005</year>
          )
          <fpage>203</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Angiulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pizzuti</surname>
          </string-name>
          ,
          <article-title>Distance-based detection and prediction of outliers</article-title>
          ,
          <source>IEEE Trans. on Knowledge and Data Engineering</source>
          <volume>2</volume>
          (
          <year>2006</year>
          )
          <fpage>145</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Angiulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fassetti</surname>
          </string-name>
          ,
          <article-title>DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets</article-title>
          ,
          <source>ACM Trans. Knowl. Disc. Data (TKDD) 3</source>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          )
          <article-title>Article 4</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Breunig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <article-title>LOF: Identifying density-based local outliers</article-title>
          ,
          <source>in: Proc. Int. Conf. on Management of Data (SIGMOD)</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Mining top-n local outliers in large databases</article-title>
          ,
          <source>in: Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD)</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Hautamäki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kärkkäinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fränti</surname>
          </string-name>
          ,
          <article-title>Outlier detection using k-nearest neighbour graph</article-title>
          ,
          <source>in: International Conference on Pattern Recognition (ICPR)</source>
          , Cambridge, UK, August 23-26,
          <year>2004</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Radovanović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ivanović</surname>
          </string-name>
          ,
          <article-title>Reverse nearest neighbors in unsupervised distance-based outlier detection</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>27</volume>
          (
          <year>2015</year>
          )
          <fpage>1369</fpage>
          -
          <lpage>1382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Angiulli</surname>
          </string-name>
          ,
          <article-title>CFOF: A concentration free measure for anomaly detection</article-title>
          ,
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD)</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          4:1-4:53.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <article-title>Estimating the support of a high-dimensional distribution</article-title>
          ,
          <source>Neural Computation</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. M. J.</given-names>
            <surname>Tax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P. W.</given-names>
            <surname>Duin</surname>
          </string-name>
          ,
          <article-title>Support vector data description</article-title>
          ,
          <source>Mach. Learn.</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chalapathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <article-title>Deep learning for anomaly detection: A survey</article-title>
          ,
          <year>2019</year>
          . arXiv:1901.03407.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hawkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baxter</surname>
          </string-name>
          ,
          <article-title>Outlier detection using replicator neural networks</article-title>
          ,
          <source>in: International Conference on Data Warehousing and Knowledge Discovery (DAWAK)</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>170</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Variational autoencoder based anomaly detection using reconstruction probability</article-title>
          ,
          <source>Technical Report 3, SNU Data Mining Center</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schlegl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Seeböck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Waldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Schmidt-Erfurth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Langs</surname>
          </string-name>
          ,
          <article-title>Unsupervised anomaly detection with generative adversarial networks to guide marker discovery</article-title>
          ,
          <year>2017</year>
          . arXiv:1703.05921
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Akcay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Atapour-Abarghouei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Breckon</surname>
          </string-name>
          ,
          <article-title>GANomaly: Semi-supervised anomaly detection via adversarial training</article-title>
          ,
          <year>2018</year>
          . arXiv:1805.06725.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Generative adversarial active learning for unsupervised outlier detection</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>1517</fpage>
          -
          <lpage>1528</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zenati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-S.</given-names>
            <surname>Foo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <article-title>Adversarially learned anomaly detection</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Data Mining (ICDM)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>727</fpage>
          -
          <lpage>736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ruff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Görnitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Vandermeulen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kloft</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>