<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised Anomaly Detection in Industrial Image Data with Autoencoders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tulsi Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gautam Malik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Puglisi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome.</institution>
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>91</lpage>
      <abstract>
        <p>Traditional quality control techniques can miss small defects in manufacturing environments, reducing the quality of the final product. Using the MVTec dataset, a commonly used benchmark in industrial visual inspection, this study investigates two types of autoencoders, denoising autoencoders (DAE) and contractive autoencoders (CAE), to solve the problem of defect identification in industrial processes. The presence of both textured and non-textured objects allows a direct comparison between materials with different surface characteristics. The VGG16 and ResNet models pre-trained on ImageNet are used as encoders. Three variants of DAE and three of CAE are designed and evaluated. Both the MSE (Mean Squared Error) loss and the SSIM (Structural Similarity Index Measure) are used to compare the reconstruction quality and the defect detection capability. The results highlight performance differences between DAE and CAE and between different object categories, providing useful insights into the effectiveness of each approach in different industrial scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Unsupervised Learning</kwd>
        <kwd>Denoising Autoencoder</kwd>
        <kwd>Contractive Autoencoder</kwd>
        <kwd>Anomaly Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Quality control is an essential part of many manufacturing industries. It is usually performed manually, but the problem with manual visual inspection is that it is error-prone; for this reason, vision-based inspection can be used. Deep neural networks have played an important role in the automation industry, and with them visual inspection can also be automated. Many image processing and machine learning methods have already been used to achieve automated defect detection in production parts. However, image processing methods have limitations, as implicit engineering features are used for the application, which can be misleading for complex cases. Deep convolutional networks are a solution for automating quality control in the manufacturing industry, since they have the ability to obtain the best features from images, but these methods are limited by data availability. There are two problems to consider: one is the imbalance of data between normal and defective images; the other is the annotation of the data. To overcome these problems, defect detection is treated as an anomaly detection problem.
      </p>
      <p>
        Due to the absence of labels in the data, the problem can be addressed through unsupervised learning, by training convolutional networks on normal images and testing them on images containing defects [
        <xref ref-type="bibr" rid="ref23">1, 2, 3, 4, 5</xref>
        ]. In this work, convolutional neural network autoencoders [6, 7] are used to perform anomaly detection; in particular, the study focuses on the use of denoising autoencoders and contractive autoencoders to improve the effectiveness of defect detection.
      </p>
      <p>
        The autoencoder encodes the input into a lower-dimensional representation known as the latent space, from which the decoder reconstructs the output. A modification of the autoencoder, called a denoising autoencoder, stops the network from learning the identity function. To be more precise, if the autoencoder is too large, it can simply memorize the data, producing output equivalent to the input without doing any beneficial representation learning or dimensionality reduction. Denoising autoencoders address this issue by purposefully introducing errors, noise, or masking some input values [
        <xref ref-type="bibr" rid="ref13">8, 9</xref>
        ].
      </p>
      <p>
        A contractive autoencoder is an unsupervised deep learning method that aids a neural network in encoding unlabeled training input. In general, autoencoders are employed to discover a representation, or encoding, for a set of unlabeled data, typically as the initial step toward dimensionality reduction or the creation of new data models. In a contractive autoencoder, the traditional reconstruction cost function is enhanced by a penalty term: the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. This penalty term causes a localized space contraction, which in turn produces robust features on the activation layer. The penalty aids in sculpting a representation that is more invariant to most directions orthogonal to the manifold, while also better capturing the local directions of variation required by the data, which correspond to a lower-dimensional non-linear manifold [
        <xref ref-type="bibr" rid="ref14 ref2 ref29 ref34 ref44 ref9">10</xref>
        ].
      </p>
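The contractive penalty described above can be made concrete with a small sketch. This is an illustrative example of ours, not the paper's implementation: it assumes a single-layer sigmoid encoder and a linear decoder, for which the squared Frobenius norm of the encoder Jacobian has a closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_loss(x, W, b, W_dec, b_dec, lam=1e-3):
    """MSE reconstruction loss plus the contractive penalty: the squared
    Frobenius norm of the Jacobian of the encoder output w.r.t. the input."""
    h = sigmoid(x @ W + b)          # encoder activations, shape (n, k)
    x_hat = h @ W_dec + b_dec       # linear decoder reconstruction
    mse = np.mean((x - x_hat) ** 2)
    # For a sigmoid encoder, dh_i/dx_j = h_i * (1 - h_i) * W[j, i], so the
    # squared Frobenius norm factorizes per sample (Rifai et al., 2011):
    jac_frob2 = np.sum(((h * (1 - h)) ** 2) * np.sum(W ** 2, axis=0), axis=1)
    return mse + lam * np.mean(jac_frob2)
```

Setting `lam` to zero recovers the plain reconstruction loss; increasing it trades reconstruction fidelity for a flatter, more contractive encoder.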
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>For anomaly detection, SIFT and SURF are used to extract features from the images and train the model on normal images. Image features can sometimes be misleading, depending on the nature of the application. Machine learning algorithms can be used to classify anomalies against normal images. A supervised learning approach is not well suited to this application, but semi-supervised and unsupervised models increase model performance: when supervised and semi-supervised approaches are compared, the semi-supervised model performs better than the supervised one [11]. In [1] the authors propose a model based on point features of the images. A hand-crafted Harris-Laplace point detector is used in that study to detect the anomalies: the Harris corner detector finds key points, and SIFT descriptors extract the local shape around them. Different loss functions are used for the unsupervised deep learning model, and the study shows that the model performs differently for different types of objects, with a specific type of loss suitable for a specific application. To identify and categorize defects of LED chips, Lin et al. [12] suggested the LEDNet network. Cha et al. [13] suggested utilizing Faster R-CNN, which showed promising results also in other applications [14], to identify five distinct flaws for structural damage detection.</p>
      <p>The use of autoencoders for unsupervised anomaly identification based on reconstruction loss is examined in [15], highlighting both its strengths and weaknesses. Using an analogous situation from particle physics, it demonstrates that the standard autoencoder configuration is not a model-independent anomaly tagger. In the work of Lupo et al. [16], generative models are used to detect anomalies in texts, exploring approaches ranging from machine learning to deep learning; among the analyzed models, the variational autoencoder shows the most promising performance for this task. Vincent et al. [17] instead studied denoising autoencoders for the extraction of robust features from images, demonstrating that these models improve the representation of visual features and, consequently, the overall image quality. Similar architectures have also been employed in the field of audio processing, for example for the automatic identification of speech disorders, exploiting unlabeled speech signals [18]. Bionda et al. [19] propose a deep convolutional autoencoder to detect anomalies in textured images: MSE, a pixel-wise error, is not suitable for textured images as it only focuses on pixels, so SSIM is used as a loss function to improve the performance of the autoencoder, and complex wavelet SSIM performs better than MSE for textured images. Depending on the application, the loss function plays a great role in generative models [20]. In [21] the authors present a novel approach called PNI that, given neighborhood characteristics and a multi-layer perceptron network model, computes the normal distribution using conditional probability. Additionally, a histogram of typical characteristics is made for each point to use position information. The suggested technique uses an extra refining network trained on fabricated anomaly pictures, in addition to the anomaly map, to better interpolate and account for the shape and edge of the input image. Yang et al. [22] present a novel method for detecting industrial image anomalies based on a self-supervised learning and self-attentive graph convolution (SLSG) network. In SLSG, pseudo-prior knowledge of anomalies is introduced by simulated abnormal samples, and the encoder is assisted in learning the embedding of normal patterns and position connections. Holly et al. [23] suggest a technique that uses a total reconstruction error and an autoencoder to locate system problems: in order to pinpoint the source of a problem, the signals that contribute the most to the overall reconstruction error are identified by computing the individual reconstruction error for each sensor signal.</p>
      <sec id="sec-meth">
        <title>3. Methodology</title>
        <sec id="sec-meth-1">
          <title>3.1. Dataset</title>
          <p>The dataset used is the MVTec industrial images dataset, which contains images of many different industrial products. It is an industrial inspection-focused dataset for evaluating anomaly detection techniques. Over 5000 high-resolution images in fifteen different object and texture categories make up this collection. Each category includes a set of defect-free training images, as well as a test set containing both defect-free photos and photos with various types of faults [24].</p>
        </sec>
        <sec id="sec-meth-2">
          <title>3.2. Preprocessing</title>
          <p>During the training phase, only normal images, free of defects or anomalies, are used. Images containing defects of various types, specific to each product, are instead used in the testing phase. All images provided are high-resolution RGB images with dimensions of 1024×1024. To reduce computational complexity, they are downsampled to 256×256 and normalized by dividing the pixel values by 255.</p>
          <p>When using a denoising autoencoder, noise is introduced into the data to artificially corrupt it. Similarly, the test images are also corrupted, and the output of the model is compared with the original uncorrupted images. An example of the images with noise is shown in Figure 1. For the contractive autoencoder, resized but uncorrupted images are fed directly as input, and the model is trained to faithfully reproduce the same images as output.</p>
        </sec>
        <sec id="sec-meth-3">
          <title>3.3. Model Architecture</title>
          <p>The autoencoder model adopted in this work is based on a structure composed entirely of convolutional layers. Its architecture can be conceptually divided into three fundamental components, namely the encoder, the decoder and the latent space, often referred to as the bottleneck. The encoder has the task of compressing the data, reducing the dimensionality of the input image until a compact representation is obtained in the latent space. This encoded representation is then passed to the decoder, whose role is to expand the data again and reconstruct an image as similar as possible to the initial one.</p>
          <p>Four convolutional layers and four max-pooling layers form the encoder, which gradually reduces the spatial resolution of the image. The type of autoencoder used determines the structure of the bottleneck: in the denoising autoencoder, the latent space is composed directly of the final output of the encoder, while in the contractive autoencoder the representation passes through a dense layer that acts as the bottleneck. Both architectures use the same decoder, consisting of four upsampling layers and five convolutional layers. To restore the output to its initial size, a final convolutional layer is added. The complete architectures of the two models are reported in Figure 2 and Figure 3.</p>
          <p>
            To compare our model, we also took into account pre-trained encoders, in particular the VGG16 model already trained on the ImageNet dataset. This model includes sixteen layers in total, thirteen of which are convolutional and three fully connected. In our implementation, only the convolutional layers were kept, while the fully connected ones were removed [
            <xref ref-type="bibr" rid="ref17">25</xref>
            ]. The ResNet50 model pre-trained on the ImageNet dataset was also used. Although the encoder architectures are different, in both cases the decoder was kept unchanged, so as to make the comparison between the configurations fairer and more meaningful.
          </p>
        </sec>
      </sec>
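A rough Keras sketch of the convolutional autoencoder described above. The layer counts (four conv + four max-pooling in the encoder; four upsampling + five conv in the decoder, counting the final reconstruction layer) follow the text, but the filter counts and activations are not stated in the paper, so the ones below are our assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    # Encoder: four conv + four max-pooling layers (256x256 -> 16x16)
    for f in (32, 64, 128, 256):
        x = layers.Conv2D(f, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)
    # Decoder: four upsampling + four conv layers (16x16 -> 256x256) ...
    for f in (256, 128, 64, 32):
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(f, 3, activation="relu", padding="same")(x)
    # ... plus a final conv layer restoring the three RGB channels
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)
    return models.Model(inp, out)
```

For the contractive variant, a dense bottleneck layer would be inserted between encoder and decoder, as described in the text.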
      <sec id="sec-2-1">
        <title>3.3.1. Loss Function</title>
        <p>For training and testing losses, the denoising autoencoder uses the MSE. The mean squared error computes the average of the squared differences between corresponding pixel values of two images:</p>
        <p>MSE(x, y) = (1 / (M N)) ∑_{i=1}^{M} ∑_{j=1}^{N} [x(i, j) − y(i, j)]²   (1)</p>
        <p>Another loss function, used for the textured images, is the SSIM, a measure of resemblance between two pictures. SSIM compares the brightness, contrast, and structural elements of two images to determine how similar they are. It makes use of statistical metrics including pixel intensity mean, variance, and covariance. C1 and C2 are constants added to prevent denominator instability [26]. It is often adopted as a loss function for picture-based optimization tasks, for example image denoising or super-resolution, and as a quality metric for image compression or restoration:</p>
        <p>SSIM(x, y) = ((2 μ_x μ_y + C1)(2 σ_xy + C2)) / ((μ_x² + μ_y² + C1)(σ_x² + σ_y² + C2))   (2)</p>
        <p>In the case of a contractive autoencoder, the bottleneck, or latent dimension, of the autoencoder is used to compute the contractive penalty, which is added to the MSE; the autoencoder is trained using the sum of the two terms as its contractive loss. In both cases, training is aimed at minimizing the loss function.</p>
        <sec id="sec-2-1-1">
          <title>3.3.2. Training and Optimization</title>
          <p>To train the denoising autoencoder, input images are artificially corrupted with noise, while the corresponding clean images are used as targets. The model is optimized using the Adam optimizer, with a learning rate of 0.0001 and a batch size of 32. The contractive autoencoder is trained with Adam as well, but its loss function includes a regularization term on the encoder’s Jacobian in addition to the reconstruction loss. Both loss functions are applied to each sample type, in order to analyze the model performance in different scenarios.</p>
          <p>As for the pretrained encoders, the VGG16 and ResNet50 architectures have been considered, both optimized with the same decoder used in the other models. In the case of VGG16, the last three fully connected layers are removed, keeping the five convolutional blocks that are subsequently trained together with the decoder. After the encoder, a dense layer is inserted to act as a bottleneck. As for ResNet50, the last layer is removed and all the other layers are fine-tuned with the decoder.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.3.3. Thresholding and Classification</title>
          <p>For anomaly detection using autoencoders, a threshold based on the reconstruction error is adopted to distinguish between normal images and images containing defects. The error is calculated by comparing each reconstructed image with the corresponding original image, using MSE or SSIM depending on the type of sample. The threshold is determined starting from the training images, which are all free of anomalies: for each of them, the reconstruction error is calculated, after which the threshold is obtained as the average of these errors. Once established, this threshold allows the test images to be classified: those with an error lower than the threshold are considered normal, while those with an error higher than the threshold are classified as anomalous.</p>
          <p>To evaluate the effectiveness of the classification process, the accuracy and the F1 score are used. Even if there are different types of anomalies in the test dataset, in this study they are treated as belonging to a single class.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>4. Results and Analysis</title>
          <p>The model was evaluated on the hazelnut, pill, bottle, screw and tile categories of the MVTec dataset. The pill and tile classes represent textured samples, while the others have smooth surfaces. Both autoencoders, DAE and CAE, were trained using the two loss functions, MSE and SSIM, and tested on all categories. The results for DAE are reported in Table 1.</p>
        </sec>
      </sec>
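The thresholding and classification procedure of Section 3.3.3 can be sketched in a few lines of NumPy. This is a minimal version of ours using MSE as the per-image error; the function names are illustrative, not the paper's.

```python
import numpy as np

def per_image_mse(originals, reconstructions):
    """Mean squared reconstruction error for each image in a batch."""
    err = (originals - reconstructions) ** 2
    return err.reshape(len(err), -1).mean(axis=1)

def fit_threshold(train_images, train_reconstructions):
    """Threshold = average reconstruction error on defect-free training images."""
    return float(per_image_mse(train_images, train_reconstructions).mean())

def classify(test_images, test_reconstructions, threshold):
    """Return 1 for anomalous (error above threshold), 0 for normal."""
    errors = per_image_mse(test_images, test_reconstructions)
    return (errors > threshold).astype(int)
```

The same scheme works with SSIM by replacing `per_image_mse` with a per-image (1 − SSIM) score, since defective images should reconstruct worse under either measure.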
      <sec id="sec-2-2">
        <p>For the hazelnut and bottle classes, the classic DAE with MSE loss achieved the best performance. In the case of the screw class, the use of SSIM led to a higher accuracy, probably due to the geometric complexity of the spirals. For textured samples such as pill and tile, the DAE with SSIM consistently provided better results. The VGG16 encoder, despite slightly increasing overall accuracy, reduced the ability to detect normal images, whereas the classical DAE with SSIM maintained a better balance between normal and defective cases.</p>
        <p>The performance of the contractive autoencoder is reported in Table 2. For the hazelnut and screw classes, the classical CAE with MSE performed best. For the bottle class, the CAE with VGG16 encoder and SSIM loss showed the highest performance. For textured samples, using SSIM also proved more effective; in particular, for the pill class, the CAE with VGG16 achieved the best results. For the tile class, the classical CAE showed solid performance, while the CAE with VGG16 and MSE loss achieved the highest absolute accuracy, but failed to properly distinguish normal images.</p>
        <p>The DAE proved to be more effective in most cases, although its performance drops with ResNet encoders, while the CAE maintains more consistency across classes. The latter is more stable on textured samples, but less accurate in reconstructing details.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>This study has shown that autoencoders are a valid and
promising solution for unsupervised anomaly detection
in industrial image data. In particular, the denoising
autoencoder (DAE) achieved consistently better results than
the contractive autoencoder (CAE) across most object
categories. This confirms that the introduction of noise
during training encourages more robust feature
learning and improves the generalization ability of the model
when tested on unseen defective images.</p>
      <p>The experiments demonstrated that the Structural
Similarity Index Measure (SSIM) is more effective than Mean
Squared Error (MSE) when dealing with textured surfaces.
SSIM is sensitive to structural deformations, brightness,
and contrast, and is therefore better suited for materials
where texture plays a key role in defect identification.
On smooth objects, instead, MSE remains competitive
and sometimes preferable.</p>
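As a small illustration of this point (our example, using scikit-image's `structural_similarity`): a uniform brightness shift and heavy pixel noise can yield a comparable MSE, yet SSIM stays high for the shift, which preserves structure, and drops sharply for the noise, which destroys it.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Smooth horizontal-gradient test image with values in [0.2, 0.8]
img = np.tile(np.linspace(0.2, 0.8, 256), (256, 1))

shifted = img + 0.1                                             # brightness shift
noisy = img + 0.1 * np.random.default_rng(0).standard_normal(img.shape)

mse_shift = np.mean((img - shifted) ** 2)                       # exactly 0.01
mse_noise = np.mean((img - noisy) ** 2)                         # also ~0.01

# SSIM separates the two corruptions even though MSE does not:
ssim_shift = structural_similarity(img, shifted, data_range=1.0)  # stays high
ssim_noise = structural_similarity(img, noisy, data_range=1.0)    # much lower
```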
      <p>One important finding is that using pre-trained
encoders such as VGG16 or ResNet50 does not always
improve results. While VGG16 provided a slight
improvement in some categories, it sometimes reduced the correct
classification of normal samples. ResNet, in particular,
underperformed in most configurations, possibly due to
its architectural complexity and its limited adaptability to
small or subtle defect patterns after fine-tuning. This
indicates that a careful balance must be found between the
use of pre-trained knowledge and the specific needs of
anomaly detection tasks, where fine-grained pixel-level
reconstruction is crucial.</p>
      <p>From a methodological perspective, the combination of
classical convolutional autoencoders with loss functions
adapted to the type of image (MSE for regular shapes,
SSIM for textures) provides a strong and flexible
framework. Moreover, the use of a thresholding strategy based
on reconstruction errors proved simple yet effective in
binary classification between normal and defective cases.</p>
      <p>Despite the relatively small size of the training data used, the models were able to achieve good classification performance, confirming the potential of unsupervised learning techniques in real-world industrial inspection scenarios. These methods avoid the need for large labeled datasets and are capable of identifying a wide range of defects without explicit annotation.</p>
      <p>For future work, it would be beneficial to explore the integration of attention mechanisms or generative adversarial networks (GANs) to enhance reconstruction quality and reduce false positives. Furthermore, expanding the set of evaluated loss functions to include perceptual losses or multi-scale SSIM might improve results on more complex textures. Finally, applying these models in real-time settings, with hardware constraints and on-the-fly decision-making, remains a key area for further development and practical validation in industrial contexts.</p>
    </sec>
    <sec id="sec-4">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: grammar and spelling check, paraphrase and reword. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Journal of Intelligent Systems</source>
          <volume>36</volume>
          (
          <year>2021</year>
          )
          <fpage>2443</fpage>
          -
          <lpage>2464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>doi:10</source>
          .1002/int.22386. [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ponzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          , Keeping eyes on the
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>85</fpage>
          -
          <lpage>95</lpage>
          . [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Boutarfaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tibermacine</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. E</surname>
          </string-name>
          . Tiber-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>shop Proceedings</source>
          , volume
          <volume>3695</volume>
          ,
          <year>2023</year>
          , p.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          . [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , G. Nandi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kala</surname>
          </string-name>
          , Static hand gesture
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>coders, in: 2014 Seventh International Conference</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>on Contemporary Computing (IC3)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>99</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          104. doi:
          <volume>10</volume>
          .1109/IC3.
          <year>2014</year>
          .
          <volume>6897155</volume>
          . [9]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Nowak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          , C. Napoli,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>puter Science)</source>
          , volume
          <volume>9119</volume>
          ,
          <year>2015</year>
          , p.
          <fpage>469</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1007/978-3-
          <fpage>319</fpage>
          -19324-3_
          <fpage>42</fpage>
          . [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rifai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          , Y. Ben-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>during feature extraction</source>
          ,
          <year>2011</year>
          . [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bilik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Horak</surname>
          </string-name>
          ,
          <article-title>Sift and surf based fea-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>ture extraction for the anomaly detection</article-title>
          ,
          <year>2022</year>
          . [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kamoona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Gostar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bab-Hadiashar</surname>
          </string-name>
          , arXiv:
          <fpage>2203</fpage>
          .
          <fpage>13068</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoseinnezhad</surname>
          </string-name>
          , Point pattern feature-based [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shu</surname>
          </string-name>
          , S. Niu, Automated de-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>cess 9</source>
          (
          <year>2021</year>
          )
          <fpage>158672</fpage>
          -
          <lpage>158681</lpage>
          . URL: https://doi. ing
          <volume>30</volume>
          (
          <year>2019</year>
          )
          <fpage>2525</fpage>
          -
          <lpage>2534</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>org/10</source>
          .1109%
          <fpage>2Faccess</fpage>
          .
          <year>2021</year>
          .
          <volume>3130261</volume>
          . doi:
          <volume>10</volume>
          .1109/ [13]
          <string-name>
            <given-names>Y.-J.</given-names>
            <surname>Cha</surname>
          </string-name>,
          <string-name><given-names>W.</given-names> <surname>Choi</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Suh</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mahmoudkhani</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Büyüköztürk</surname></string-name>,
          <article-title>Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types</article-title>,
          <source>Computer-Aided Civil and Infrastructure Engineering</source> <volume>33</volume> (<year>2018</year>) <fpage>731</fpage>-<lpage>747</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <label>[2]</label>
        <mixed-citation>
          <string-name><given-names>G.</given-names> <surname>Capizzi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Coco</surname></string-name>,
          <string-name><given-names>G. L.</given-names> <surname>Sciuto</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Napoli</surname></string-name>,
          <article-title>A new […] Gaussian approximation</article-title>,
          <source>IEEE Signal Processing Letters</source> <volume>25</volume> (<year>2018</year>) <fpage>1615</fpage>-<lpage>1619</lpage>. doi:10.1109/LSP.2018.2866926.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <label>[3]</label>
        <mixed-citation>
          <string-name><given-names>D.</given-names> <surname>Połap</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Woźniak</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Napoli</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Tramontana</surname></string-name>,
          <article-title>[…] via cuckoo search algorithm</article-title>,
          <source>International Journal of Electronics and Telecommunications</source> <volume>61</volume> (<year>2015</year>) <fpage>333</fpage>-<lpage>338</lpage>. doi:10.1515/eletel-2015-0043.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <label>[4]</label>
        <mixed-citation>
          <string-name><given-names>G.</given-names> <surname>Capizzi</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Bonanno</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Napoli</surname></string-name>,
          <article-title>Hybrid neural […] prediction of new generation batteries storage</article-title>,
          in: <source>3rd International Conference on Clean Electrical Power: Renewable Energy Resources Impact, ICCEP 2011</source>, <year>2011</year>, pp. <fpage>341</fpage>-<lpage>344</lpage>. doi:10.1109/ICCEP.2011.6036301.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <label>[5]</label>
        <mixed-citation>
          <string-name><given-names>G.</given-names> <surname>Lo Sciuto</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Capizzi</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Shikler</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Napoli</surname></string-name>,
          <article-title>Organic solar cells defects classification by using a […]</article-title>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <label>[14]</label>
        <mixed-citation>
          <string-name><given-names>F.</given-names> <surname>Fiani</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Puglisi</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Napoli</surname></string-name>, et al.,
          <article-title>Enhancing object detection robustness for cross-depiction […]</article-title>,
          in: <source>CEUR WORKSHOP PROCEEDINGS</source>, volume <volume>3684</volume>, CEUR-WS, <year>2023</year>, pp. <fpage>15</fpage>-<lpage>20</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <label>[15]</label>
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Finke</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Krämer</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Morandini</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mück</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Oleksiyuk</surname></string-name>,
          <article-title>Autoencoders for unsupervised anomaly detection in high energy physics</article-title>,
          <source>Journal of High Energy Physics</source> <volume>2021</volume> (<year>2021</year>) <fpage>161</fpage>. URL: https://doi.org/10.1007%2Fjhep06%282021%29161. doi:10.1007/jhep06(2021)161.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <label>[16]</label>
        <mixed-citation>
          <string-name><given-names>F.</given-names> <surname>Lupo</surname></string-name>,
          <article-title>Variational autoencoder for unsupervised anomaly detection</article-title>,
          <source>Master's thesis</source>, Corso di Laurea Magistrale in Ingegneria Matematica, <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <label>[17]</label>
        <mixed-citation>
          <string-name><given-names>P.</given-names> <surname>Vincent</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Larochelle</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name>,
          <string-name><given-names>P.-A.</given-names> <surname>Manzagol</surname></string-name>,
          <article-title>Extracting and composing robust features with denoising autoencoders</article-title>,
          <year>2008</year>, pp. <fpage>1096</fpage>-<lpage>1103</lpage>. doi:10.1145/1390156.1390294.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <label>[18]</label>
        <mixed-citation>
          <string-name><given-names>L.</given-names> <surname>Corvitto</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Faiella</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Napoli</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Puglisi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Russo</surname></string-name>,
          […], in: <source>CEUR WORKSHOP PROCEEDINGS</source>, volume <volume>3869</volume>, CEUR-WS, <year>2024</year>, pp. <fpage>19</fpage>-<lpage>31</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <label>[19]</label>
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Bionda</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Frittoli</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Boracchi</surname></string-name>,
          <article-title>Deep autoencoders for anomaly detection in textured images using CW-SSIM</article-title>,
          in: <source>Image Analysis and Processing - ICIAP 2022</source>, Springer International Publishing, <year>2022</year>, pp. <fpage>669</fpage>-<lpage>680</lpage>. URL: https://doi.org/10.1007%2F978-3-031-06430-2_56. doi:10.1007/978-3-031-06430-2_56.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <label>[20]</label>
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Puglisi</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Fiani</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>De Magistris</surname></string-name>, et al.,
          <article-title>Increased […] using GANs</article-title>,
          in: <source>ICYRIME</source>, <year>2023</year>, pp. <fpage>39</fpage>-<lpage>45</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <label>[21]</label>
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Bae</surname></string-name>,
          <string-name><given-names>J.-H.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>,
          <article-title>PNI: Industrial anomaly detection using position and neighborhood information</article-title>,
          <year>2023</year>. arXiv:2211.12634.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <label>[22]</label>
        <mixed-citation>
          <string-name><given-names>M.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>,
          <article-title>SLSG: Industrial […] feature embeddings and one-class classification</article-title>,
          <year>2023</year>. arXiv:2305.00398.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <label>[23]</label>
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Holly</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Heel</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Katic</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Schoefl</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Stiftinger</surname></string-name>,
          <article-title>[…] fault localization in industrial cooling systems</article-title>,
          <year>2022</year>. arXiv:2210.08011.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <label>[24]</label>
        <mixed-citation>
          <string-name><given-names>P.</given-names> <surname>Bergmann</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Fauser</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Sattlegger</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Steger</surname></string-name>,
          <article-title>MVTec AD - A comprehensive real-world dataset for unsupervised anomaly detection</article-title>,
          in: <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <year>2019</year>, pp. <fpage>9584</fpage>-<lpage>9592</lpage>. doi:10.1109/CVPR.2019.00982.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <label>[25]</label>
        <mixed-citation>
          <string-name><given-names>K.</given-names> <surname>Simonyan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Zisserman</surname></string-name>,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>,
          <year>2015</year>. arXiv:1409.1556.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <label>[26]</label>
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Nilsson</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Akenine-Möller</surname></string-name>,
          <article-title>Understanding SSIM</article-title>,
          <year>2020</year>. arXiv:2006.13846.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>