<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for Accurate Anomaly Detection in Industry 5.0</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Capogrosso</string-name>
          <email>luigi.capogrosso@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alvise Vivenza</string-name>
          <email>alvise.vivenza@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Chiarini</string-name>
          <email>andrea.chiarini@univr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Setti</string-name>
          <email>francesco.setti@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Cristani</string-name>
          <email>marco.cristani@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering for Innovation Medicine, University of Verona</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Management, University of Verona</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>QUALYCO S.r.l., Spin-off of the University of Verona</institution>
          ,
          <addr-line>Verona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>4</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>Defect detection is the task of identifying defects in production samples. Usually, defect detection classifiers are trained on ground-truth data formed by normal samples (negative data) and samples with defects (positive data), where the latter are consistently fewer than normal samples. State-of-the-art data augmentation procedures add synthetic defect data by superimposing artifacts on normal samples to mitigate problems related to unbalanced training data. These techniques often produce out-of-distribution images, resulting in systems that learn what is not a normal sample but cannot accurately identify what a defect looks like. In this paper, we show the research we are carrying out in collaboration with QUALYCO, a startup spin-off of the University of Verona, on multimodal Latent Diffusion Models (LDMs) for accurate anomaly detection in Industry 5.0. Unlike conventional image generation techniques, we work within a human feedback loop pipeline, where domain experts provide multimodal guidance to the model through text descriptions and region localization of the possible anomalies. This strategic shift enhances the interpretability of results and fosters a more robust human feedback loop, facilitating iterative improvements of the generated outputs. Remarkably, our approach operates in a zero-shot manner, avoiding time-consuming fine-tuning procedures while achieving superior performance. We demonstrate its efficacy and versatility on the challenging KSDD2 dataset, achieving state-of-the-art results.</p>
      </abstract>
      <kwd-group>
        <kwd>Diffusion Models</kwd>
        <kwd>Anomaly Detection</kwd>
        <kwd>Industry 5.0</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Surface Defect Detection (SDD) is a challenging problem in industrial scenarios, defined as the task of identifying samples containing a defect [1]. In many real-world applications, a human expert inspects every product and removes the defective pieces. Unfortunately, human experts are often inaccurate, and their outputs can be inconsistent or biased. Moreover, humans are relatively slow at this task, and their performance is subject to stress and fatigue.</p>
      <p>Automated defect detection systems [2] can easily overcome most of these issues by learning classifiers on defective and nominal training products. The main drawback is the data collection process required to train a model effectively. Indeed, defective items (i.e., positive samples) are relatively rare compared to nominal items (i.e., negative samples). Thus, the user may need to collect massive amounts of data to have enough positive samples. Moreover, with the rise of Industry 5.0 [3] and the transition towards flexible manufacturing processes where human operators and production line components actively collaborate, there is an increasing demand for systems that can quickly adapt to new production setups, i.e., customized products manufactured in small batches. Traditional automated systems cannot comply with these demands since data collection could easily involve the whole batch.</p>
      <p>Recent studies on SDD focused on limiting the impact
of the labeling process by formulating the problem under
the unsupervised learning paradigm [4] or training
exclusively on nominal samples [5], possibly using few-shot
learning strategies [6]. In both cases, the goal is to
generate an accurate model of the nominal sample distribution
and predict everything with a low probability score as
anomalies. However, due to the limited restoration
capability of these models, these approaches tend to generate
many false positives, especially on datasets with complex
structures or textures [7].</p>
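      <p>The model-fitting idea behind these nominal-only approaches can be sketched as follows: fit a simple Gaussian model to nominal feature vectors and flag every sample far from the nominal prototype as anomalous. This is an illustrative sketch only (synthetic 2-D features and a hypothetical threshold of 3.0), in the spirit of patch-distribution methods such as [11], not the exact procedure of any cited work.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Nominal training features (e.g., pooled CNN embeddings); here synthetic 2-D data.
nominal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Fit a single Gaussian model of the nominal distribution.
mu = nominal.mean(axis=0)
cov = np.cov(nominal, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    """Distance of sample x from the nominal prototype."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Everything farther than a predefined threshold is predicted anomalous.
threshold = 3.0
test_nominal = np.array([0.1, -0.2])
test_defect = np.array([6.0, 6.0])

print(mahalanobis(test_nominal) > threshold)  # → False (predicted nominal)
print(mahalanobis(test_defect) > threshold)   # → True (predicted anomalous)
```

      <p>Such a detector never sees a positive sample at training time, which is exactly why its false-positive behavior on complex textures becomes the limiting factor discussed above.</p>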
      <p>It is worth noting that, in industrial setups, anomalies are not generated by Gaussian processes but are the outcome of specific, often predictable, issues during the production process. Consequently, the anomalous samples are not randomly distributed outside the nominal distribution; they can be modeled as a mixture of Gaussian distributions in the feature space instead. While general, unpredictable anomalies can still happen, expert operators can easily define the main problems they can expect from the manufacturing process, such as which kind of defects, in which locations, and how often they expect them to appear. Thus, generative AI can represent a powerful tool for SDD, with defect image generation emerging as a promising approach to enhance detector performance.</p>
      <p>[Figure 1: Overview of our approach. Normal samples from the production line are combined with region localization and text descriptions (e.g., “A photo of a scratched surface”, with the negative text description “A photo of a smooth surface”) to guide image generation via a Latent Diffusion Model; the generated samples are then used for TinyML-based image classification (anomalous vs. not anomalous).]</p>
      <p>Specifically, in this paper, we report the results of our research on Latent Diffusion Models (LDMs), a powerful class of generative models, to produce fine-grained realistic defect images that can be used as positive samples to train an anomaly detection model. We name our approach DIAG, a training-free Diffusion-based In-distribution Anomaly Generation pipeline for data augmentation in the SDD task. By leveraging pre-trained LDMs with multimodal conditioning, we can exploit domain experts’ knowledge to generate plausible anomalies without needing real positive data. When using these augmented images to train an anomaly detection model, we show a notable increase in the detection performance compared to previous state-of-the-art augmentation pipelines. Specifically, this research is being carried out in collaboration with QUALYCO, a startup spin-off of the University of Verona. Figure 1 outlines our approach.</p>
      <p>The main contributions of our research are as follows:</p>
      <p>• We present a complete pipeline for training anomaly detection models based on nominal images and textual prompts. We showcase the superior outcomes achieved by utilizing generated defective samples compared to previous state-of-the-art approaches.</p>
      <p>• We dive into spatial control approaches to enable the synthesis of defect samples incorporating regional information and exhibit enhanced controllability of the image generation through a human feedback loop pipeline, effectively utilizing domain expertise to generate more plausible in-distribution anomalies.</p>
    </sec>
    <sec id="sec-2b">
      <title>2. Related Work</title>
      <p>Research on SDD has been conducted according to different setups: unsupervised approaches [8] use a mixture of unlabelled positive and negative sample images for training; supervised approaches require labeled samples in the form of binary masks representing the defects (full supervision) [9] or simply as a tag for the whole image (weak supervision) [10]. Supervised methods demonstrated superior accuracy in the identification of anomalies. Nevertheless, the effort required to provide good annotations is not always justified. Collecting positive samples can be time- and resource-consuming due to the low rate of defective products generated by industrial lines. Thus, many recent approaches adopt a “clean” setup, where the training set consists of only nominal samples.</p>
      <p>Two strategies can be adopted in clean setups: model fitting and image generation. Model fitting approaches aim at generating an accurate model of the nominal distribution, considering as an outlier every sample with a likelihood lower than –or a distance from the nominal prototype higher than– a predefined threshold [11].</p>
      <p>On the contrary, data augmentation approaches leverage generative methods to synthesize images of defects and use these images as positive samples for training a supervised model. Specifically, this work focuses on generation-based data augmentation under clean setups. The most popular data augmentation pipeline for SDD consists of a series of random standard transformations of the input image –such as mirroring, rotations, and color changes– followed by the super-imposition of noisy patches [12].</p>
      <p>In MemSeg [12], the pipeline for the generation of the abnormal synthetic examples is divided into three steps: i) a Region of Interest (ROI) indicating where the defect will be located is generated using Perlin noise and the target foreground; ii) the ROI is applied to a noise image to generate a noise foreground ROI; iii) the noise foreground ROI is super-imposed on the original image to obtain the simulated anomalous image. However, all these approaches are based on generating out-of-distribution patterns that do not faithfully represent the target-domain anomalies.</p>
      <p>More recently, the first work that draws attention to in-distribution defect data is In&amp;Out [13], in which we empirically show that diffusion models provide more realistic in-distribution defects. Here, we significantly improve the generation of in-distribution anomalous samples of [13], incorporating domain knowledge provided by an expert user through textual prompts and localization of salient regions in a training-free setup.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Multimodal Diffusion-based image generation</title>
        <p>LDMs [14, 15] are a class of deep latent variable models that work by modeling the joint distribution of the data over a Markovian inference process. This process consists of small perturbations of the data with a variance-preserving property [16], such that the limit distribution after the diffusion process is approximately identical to a known prior distribution. Starting with samples from the prior, a reverse diffusion process is learned by gradually denoising the sample so that it resembles the initial data by the end of the procedure.</p>
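        <p>The forward perturbation described above can be sketched numerically. The linear noise schedule below is a common illustrative choice (its values are assumptions, not taken from the cited works): after enough steps, the signal coefficient decays towards zero and the sample is approximately distributed as the standard normal prior.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule (assumed values, not from the paper).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention coefficient

def q_sample(x0, t):
    """Variance-preserving perturbation:
    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(4, 4))   # stand-in for a latent image
xT = q_sample(x0, T - 1)       # sample after the full diffusion process

# Almost no signal remains: x_T is close to a draw from the N(0, I) prior.
print(float(np.sqrt(alpha_bar[-1])) < 0.01)  # → True
```

        <p>The learned reverse process simply inverts this chain step by step, which is what the pre-trained LDM provides out of the box.</p>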
        <p>We leveraged the natural ability of LDMs to incorporate multimodal conditioning in the generation process, taking inspiration from [17, 18, 19]. Specifically, we use as textual descriptions a prompt and a negative prompt, i.e., a prompt that guides the image generation “away” from the concepts it describes, resulting in high-quality images that comply with the given descriptions [20, 21].</p>
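        <p>At sampling time, the prompt and the negative prompt are typically combined through classifier-free guidance [17]: the denoiser is evaluated once per description, and the final noise prediction is pushed towards the prompt and away from the negative prompt. A minimal sketch with dummy arrays (the guidance scale of 7.5 is an assumed, commonly used value):</p>

```python
import numpy as np

def guided_noise(eps_prompt, eps_negative, guidance_scale):
    """Classifier-free guidance: extrapolate from the negative-prompt
    prediction towards the prompt prediction."""
    return eps_negative + guidance_scale * (eps_prompt - eps_negative)

# Dummy noise predictions standing in for the denoiser's two conditioned passes.
eps_prompt = np.array([1.0, 0.0])
eps_negative = np.array([0.0, 1.0])

print(guided_noise(eps_prompt, eps_negative, 7.5))  # → [ 7.5 -6.5]
```

        <p>With a scale above 1, the combination amplifies whatever distinguishes the prompt from the negative prompt, which is why a negative prompt steers the output away from the concepts it names.</p>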
        <p>In particular, we do not perform full image generation: to effectively enhance spatial control, we opt to utilize an inpainting model, as demonstrated in [14, 18]. Given an image with a masked region, inpainting seamlessly fills it with content that harmonizes with the surrounding image. Although typically employed to eliminate undesired artifacts, the inpainting process ensures that the masked area incorporates the provided prompt, effectively merging textual and visual content.</p>
      </sec>
      <sec id="sec-3-1b">
        <title>3.2. Our proposed pipeline</title>
        <p>To generate an anomalous image, the process starts by sampling a random negative image, an anomaly description, and a mask, forming a triplet. These pieces of information are then fed to a text-conditioned LDM to perform inpainting on the negative image using the mask. The anomaly description guides the generation, filling the masked region with an anomaly that complies with the prompt. To generate images resembling real anomalous samples, domain knowledge from industrial experts is exploited, providing textual descriptions of the potential anomalies’ type, shape, and spatial information.</p>
        <p>Formally, given pictures of defect-free (negative) samples, domain experts provide textual descriptions of what different anomalies may look like. At the same time, regions where these anomalies may appear on the defect-free samples are designated as a set of binary masks encoding possible anomaly shapes and locations. The LDM is then conditioned on this information to inpaint plausible anomalies on the defect-free samples. The result of this operation is an anomalous version of the negative image, where an anomaly has been inpainted in the masked region. Due to the stochastic nature of LDMs, this process can be repeated multiple times to generate an augmented set of anomalous sample images. Finally, this set can be used as data augmentation for training anomaly detection models, as presented in the following section.</p>
      </sec>
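      <p>The triplet-sampling loop described above can be sketched as follows. The file names are hypothetical and the inpaint function is a stub standing in for the text-conditioned LDM inpainting call; only the sampling structure reflects the pipeline described in this section.</p>

```python
import random

random.seed(0)

negatives = ["neg_001.png", "neg_002.png", "neg_003.png"]  # defect-free samples
descriptions = ["a photo of a scratched surface"]          # expert anomaly descriptions
masks = ["mask_a.png", "mask_b.png"]                       # expert-designated regions

def inpaint(image, prompt, mask):
    """Stub for the text-conditioned LDM inpainting call; here it only
    records the triplet it received."""
    return {"source": image, "prompt": prompt, "mask": mask}

def generate_augmented_set(n_aug):
    """Repeat the stochastic triplet sampling to build the augmented positive set."""
    augmented = []
    for _ in range(n_aug):
        triplet = (random.choice(negatives),
                   random.choice(descriptions),
                   random.choice(masks))
        augmented.append(inpaint(*triplet))
    return augmented

aug = generate_augmented_set(100)
print(len(aug))  # → 100
```

      <p>Because each draw is independent, the same negative image can yield many distinct synthetic positives, which is what makes the augmented set grow without any real defect data.</p>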
      <sec id="sec-3-2">
        <title>3.3. The anomaly detection task</title>
        <p>We approach the anomaly detection problem as a binary classification problem, where the objective is to predict whether a sample belongs to one of two classes. Specifically, we utilized a ResNet-50 [22] backbone trained with a binary cross-entropy loss function denoted as ℒBCE. Mathematically, it is defined as:</p>
        <p>ℒBCE(y, ŷ) = −(1/N) ∑_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)], (1)</p>
        <p>where y represents the ground truth labels, ŷ represents the predicted probabilities, and N is the number of samples. In detail, y_i denotes the true label for sample i, which can be either 0 or 1, while ŷ_i signifies the predicted probability that sample i belongs to class 1.</p>
        <p>Ongoing developments aim to optimize the model through TinyML [23] techniques in order to have an ultra-efficient system that can work smoothly in real time on a production line.</p>
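        <p>Eq. (1) can be written directly in code; the following sketch evaluates the loss on a toy example to check the formula (the clipping is a numerical-stability detail added here, not part of the equation):</p>

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy of Eq. (1): mean over the N samples."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

# A confident, mostly correct classifier yields a small loss.
print(round(bce_loss([1, 0], [0.9, 0.1]), 5))  # → 0.10536
```
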
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experiment setup</title>
        <p>Datasets We use the Kolektor Surface-Defect Dataset 2 (KSDD2) [10], one of the most recent, complex, and real-world SDD datasets. This dataset comprises 246 positive and 2085 negative images in the training set and 110 positive and 894 negative images in the testing set. Positive images are images with visible defects, such as scratches, spots, and surface imperfections. Since the images have different dimensions, we standardize the dataset resolution, resizing all the images to 224 × 632 pixels while keeping the number of normal and anomalous samples unchanged.</p>
        <p>ResNet-50 training and testing For a fair comparison
with [13], we use the same PyTorch implementation of
the ResNet-50 [22] as our anomaly detection model, in
which we substitute the fully connected layers after the
backbone to make it a binary classifier. The network is
trained for 50 epochs with Adam [24] as an optimizer, a
learning rate of 0.0001, and a batch size of 32. To maintain
consistency with the training and evaluation procedures
of KSDD2, our setup is the same as presented in [10, 13],
where only the images and ground truth labels are used
to train the model.</p>
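        <p>For reference, the metrics used in the quantitative evaluation (Average Precision, Precision, and Recall) can be sketched as below. This is an illustrative implementation of the standard definitions, not necessarily the exact code of the evaluation protocol in [13]:</p>

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for binary predictions (1 = anomalous)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(y_true, scores):
    """AP as the mean of precision values at the rank of each positive sample."""
    order = np.argsort(-np.asarray(scores))
    y_sorted = np.asarray(y_true)[order]
    hits, precisions = 0, []
    for rank, label in enumerate(y_sorted, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions))

p, r = precision_recall([1, 0, 1, 0], [1, 0, 0, 0])
print(p, r)  # → 1.0 0.5
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 4))  # → 0.8333
```
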
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation details</title>
        <p>In this section, we specify all the implementation details for reproducibility. All training and inference were conducted on an NVIDIA RTX 3090 GPU.</p>
        <p>Inpainting via Diffusion Models We use the pre-trained implementation of SDXL [21] from Diffusers as our text-conditioned LDM. Following the procedure outlined in Section 3.2, we use the negative images of KSDD2 as the set of defect-free samples. As the anomaly descriptions, we used the prompts “white marks on the wall” and “copper metal scratches”. Instead, “smooth, plain, black, dark, shadow” was used as a negative prompt to further improve the performance. These prompts were chosen after a series of tests, simulating the iterative process of our human feedback loop pipeline until the resulting images resembled plausible anomalies. We used the segmentation masks of positive samples in the KSDD2 dataset to simulate the domain experts’ definition of plausible anomalous regions. Then, these data are fed to the pre-trained SDXL model to perform inpainting on the negative images in a training-free process, generating the set of augmented anomalous images as described in Section 3.2. Finally, the generated images are added to the training set, which will be used to train the anomaly detection model.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Quantitative results</title>
        <p>Evaluation metrics The anomaly detection performance was evaluated based on Average Precision (AP), Precision, and Recall, following the evaluation protocol defined in [13].</p>
        <p>Zero-shot data augmentation Here, we emulate the situation where no original positive samples are available in the training set. This scenario makes generating augmented positive samples necessary and restricts the users to augmentation procedures that do not rely on positive images. To do this, we build the set of augmented anomalous images by generating Naug augmented positive samples with different pipelines, i.e., MemSeg [12], In&amp;Out [13], and DIAG. Then, we train the ResNet-50 model on a dataset that includes the original negative samples and the augmented positive samples. Finally, we evaluate the model on the original test set.</p>
        <p>Table 1 reports the comparison between the models trained with MemSeg, In&amp;Out, and DIAG augmented data at different values of Naug. As we can see, our proposed method achieves the highest AP (.801), recorded at 100 augmented images, while also resulting in a consistently higher AP when compared to the MemSeg and In&amp;Out pipelines. These results highlight how, through domain expertise in the form of anomaly descriptions and segmentation masks, it is possible to generate in-distribution images able to meaningfully guide an anomaly detection network, even in a complicated scenario where no real anomalous data is available.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Results of MemSeg, In&amp;Out, and DIAG when no anomalous samples are available. In bold, the best results. Underlined, the second best.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Naug</th><th>AP ↑</th><th>Precision ↑</th><th>Recall ↑</th></tr>
            </thead>
            <tbody>
              <tr><td>DIAG (ours)</td><td>80</td><td>.769</td><td>.851</td><td>.673</td></tr>
              <tr><td>DIAG (ours)</td><td>100</td><td>.801</td><td>.924</td><td>.664</td></tr>
              <tr><td>DIAG (ours)</td><td>120</td><td>.739</td><td>.944</td><td>.609</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Surprisingly, the DIAG performance with Naug = 120 augmented images is lower than using a smaller number of augmented images. We hypothesize this is due to the stochastic nature of LDM image generation. While it allows the generation of various images given the same guidance, it can also lower, in some cases, the predictability of the quality of the generated samples, which sometimes may not faithfully comply with the prompt. Future works will focus on studying quality consistency in the image generation pipeline.</p>
        <p>Full-shot data augmentation To showcase DIAG as a general data augmentation technique, we also explore the scenario where real positive samples are available in the training set. To this aim, we include all the 246 real positive samples in the training set, together with the real negative images and the Naug augmented positive images.</p>
        <p>As we can see from Table 2, DIAG achieves the highest average AP yet (.924), surpassing the .782 set by the previous state-of-the-art data augmentation pipeline [13]. When comparing these results to the ones obtained in the “zero-shot data augmentation” scenario, it is clear how more in-distribution images improve model performance during training. This is highlighted by the improvement in performance of all the models when adding the real positive images to the training set. At the same time, the inclusion of DIAG augmented images allows the model to explore the anomaly distribution further, resulting in the difference in performance between the different data augmentation pipelines.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Qualitative results</title>
        <p>The main goal of our data augmentation pipeline is to generate in-distribution synthetic positive images, meaning images that closely resemble the real ones. Figure 2 shows qualitative results. It is evident that the images produced by DIAG are markedly more realistic compared to those generated by MemSeg [12] and In&amp;Out [13].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work presents DIAG, a novel data augmentation pipeline that leverages visual language models to produce training-free positive images for enhancing the performance of an SDD model. We introduced domain experts in the generation pipeline, asking them to describe with textual prompts how a defect should look and where it can be localized. Then, we adopt a pre-trained LDM to generate defective images and train a binary classifier for isolating the anomalous images. We focus our experiments on the KSDD2 dataset and establish ourselves as the new state-of-the-art data augmentation pipeline, surpassing previous approaches in both the zero-shot and full-shot data augmentation scenarios with an AP of .801 and .924, respectively. These results highlight the potential of in-distribution data augmentation in the anomaly detection field, where training-free generative model pipelines such as DIAG can provide meaningful data for downstream classification, making them appealing solutions in scenarios where real anomalous data is difficult to collect or unavailable. These promising results promote further exploration across various datasets, particularly investigating how robust the image generation is compared to noisy textual prompts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1"><mixed-citation>[1] T. Wang, Y. Chen, M. Qiao, H. Snoussi, A fast and robust convolutional neural network-based defect detection model in product quality control, The International Journal of Advanced Manufacturing Technology 94 (2018) 3465–3471.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] S. H. Hanzaei, A. Afshar, F. Barazandeh, Automatic detection and classification of the ceramic tiles’ surface defects, Pattern Recognition 66 (2017) 174–189.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] P. K. R. Maddikunta, Q.-V. Pham, B. Prabadevi, N. Deepa, K. Dev, T. R. Gadekallu, R. Ruby, M. Liyanage, Industry 5.0: A survey on enabling technologies and potential applications, Journal of Industrial Information Integration 26 (2022) 100257.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, P. Gehler, Towards total recall in industrial anomaly detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 14318–14328.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] M. Rudolph, B. Wandt, B. Rosenhahn, Same same but DifferNet: Semi-supervised defect detection with normalizing flows, in: Winter Conference on Applications of Computer Vision (WACV), 2021.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] Y. Song, T. Wang, P. Cai, S. K. Mondal, J. P. Sahoo, A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities, ACM Computing Surveys 55 (2023) 1–40.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] Y. Chen, Y. Ding, F. Zhao, E. Zhang, Z. Wu, L. Shao, Surface defect detection methods for industrial products: A review, Applied Sciences 11 (2021) 7657.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] X. Tao, D. Zhang, W. Ma, Z. Hou, Z. Lu, C. Adak, Unsupervised anomaly detection for surface defects with dual-siamese network, IEEE Transactions on Industrial Informatics 18 (2022) 7707–7717.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] C. Luan, R. Cui, L. Sun, Z. Lin, A siamese network utilizing image structural differences for cross-category defect detection, in: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] J. Božič, D. Tabernik, D. Skočaj, Mixed supervision for surface-defect detection: From weakly to fully supervised learning, Computers in Industry 129 (2021) 103459.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] T. Defard, A. Setkov, A. Loesch, R. Audigier, PaDiM: a patch distribution modeling framework for anomaly detection and localization, in: International Conference on Pattern Recognition (ICPR), 2021.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] M. Yang, P. Wu, H. Feng, MemSeg: A semi-supervised method for image surface defect detection using differences and commonalities, Engineering Applications of Artificial Intelligence 119 (2023) 105835.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] L. Capogrosso, F. Girella, F. Taioli, M. Dalla Chiara, M. Aqeel, F. Fummi, F. Setti, M. Cristani, Diffusion-based image generation for in-distribution data augmentation in surface defect detection, in: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), 2024. doi:10.5220/0012350400003660.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in: International Conference on Machine Learning (ICML), 2015.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020) 6840–6851.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations (ICLR), 2020.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] J. Ho, T. Salimans, Classifier-free diffusion guidance, arXiv preprint arXiv:2207.12598 (2022).</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] L. Capogrosso, A. Mascolini, F. Girella, G. Skenderi, S. Gaiardelli, N. Dall’Ora, F. Ponzio, E. Fraccaroli, S. Di Cataldo, S. Vinco, et al., Neuro-symbolic empowered denoising diffusion probabilistic models for real-time anomaly detection in industry 4.0: Wild-and-crazy-idea paper, in: 2023 Forum on Specification &amp; Design Languages (FDL), IEEE, 2023, pp. 1–4.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125 (2022).</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, R. Rombach, SDXL: Improving latent diffusion models for high-resolution image synthesis, arXiv preprint arXiv:2307.01952 (2023).</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] L. Capogrosso, F. Cunico, D. S. Cheng, F. Fummi, M. Cristani, A machine learning-oriented survey on tiny machine learning, IEEE Access (2024).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation></ref>
    </ref-list>
  </back>
</article>