Exploiting Multimodal Latent Diffusion Models for Accurate Anomaly Detection in Industry 5.0

Luigi Capogrosso1,∗, Alvise Vivenza1,2, Andrea Chiarini2,3, Francesco Setti1,2 and Marco Cristani1,2

1 Department of Engineering for Innovation Medicine, University of Verona, Italy
2 QUALYCO S.r.l., Spin-off of the University of Verona, Verona, Italy
3 Department of Management, University of Verona, Italy

Abstract
Defect detection is the task of identifying defects in production samples. Usually, defect detection classifiers are trained on ground-truth data formed by normal samples (negative data) and samples with defects (positive data), where the latter are consistently fewer than normal samples. State-of-the-art data augmentation procedures add synthetic defect data by superimposing artifacts on normal samples to mitigate problems related to unbalanced training data. These techniques often produce out-of-distribution images, resulting in systems that learn what is not a normal sample but cannot accurately identify what a defect looks like. In this paper, we present the research we are carrying out in collaboration with QUALYCO, a startup spin-off of the University of Verona, on multimodal Latent Diffusion Models (LDMs) for accurate anomaly detection in Industry 5.0. Unlike conventional image generation techniques, we work within a human feedback loop pipeline, where domain experts provide multimodal guidance to the model through text descriptions and region localization of the possible anomalies. This strategic shift enhances the interpretability of results and fosters a more robust human feedback loop, facilitating iterative improvements of the generated outputs. Remarkably, our approach operates in a zero-shot manner, avoiding time-consuming fine-tuning procedures while achieving superior performance. We demonstrate its efficacy and versatility on the challenging KSDD2 dataset, achieving state-of-the-art results.
Keywords
Diffusion Models, Anomaly Detection, Industry 5.0

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy.
∗ Corresponding author.
luigi.capogrosso@univr.it (L. Capogrosso); alvise.vivenza@univr.it (A. Vivenza); andrea.chiarini@univr.it (A. Chiarini); francesco.setti@univr.it (F. Setti); marco.cristani@univr.it (M. Cristani)
ORCID: 0000-0002-4941-2255 (L. Capogrosso); 0000-0003-4915-5145 (A. Chiarini); 0000-0002-0015-5534 (F. Setti); 0000-0002-0523-6042 (M. Cristani)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Surface Defect Detection (SDD) is a challenging problem in industrial scenarios, defined as the task of individuating samples containing a defect [1]. In many real-world applications, a human expert inspects every product and removes the defective pieces. Unfortunately, human experts are often inaccurate, and their outputs can be inconsistent or biased. Moreover, humans are relatively slow in accomplishing this task, and their performance is subject to stress and fatigue.

Automated defect detection systems [2] can easily overcome most of these issues by learning classifiers on defective and nominal training products. The main drawback is the data collection process required to train a model effectively. Indeed, defective items (i.e., positive samples) are relatively rare compared to nominal items (i.e., negative samples). Thus, the user may need to collect massive amounts of data to have enough positive samples. Moreover, with the rise of Industry 5.0 [3] and the transition towards flexible manufacturing processes, where human operators and production line components actively collaborate, there is an increasing demand for systems that can quickly adapt to new production setups, i.e., customized products manufactured in small batches. Traditional automated systems cannot comply with these demands, since data collection could easily involve the whole batch.

Recent studies on SDD focused on limiting the impact of the labeling process by formulating the problem under the unsupervised learning paradigm [4] or training exclusively on nominal samples [5], possibly using few-shot learning strategies [6]. In both cases, the goal is to generate an accurate model of the nominal sample distribution and predict everything with a low probability score as an anomaly. However, due to the limited restoration capability of these models, these approaches tend to generate many false positives, especially on datasets with complex structures or textures [7].

Figure 1: Our pipeline. Starting from normal samples, we leverage a Latent Diffusion Model (LDM) to synthesize novel in-distribution high-quality images of defective surfaces, based on defect localization via gestures and textual prompts provided through a human feedback loop. Then, these synthetic images are used as anomaly samples to train a TinyML-based binary classifier directly on the production line for real-time anomaly detection.

It is worth noting that, in industrial setups, anomalies are not generated by Gaussian processes but are the outcome of specific, often predictable, issues during the production process. Consequently, the anomalous samples are not randomly distributed outside the nominal distribution; they can be modeled as a mixture of Gaussian
distributions in the feature space instead. While general, unpredictable anomalies can still happen, expert operators can easily define the main problems they expect from the manufacturing process, such as which kinds of defects, in which locations, and how often they are likely to appear. Thus, generative AI can represent a powerful tool for SDD, with defect image generation emerging as a promising approach to enhance detector performance.

Specifically, in this paper, we report the results of our research on Latent Diffusion Models (LDMs), a powerful class of generative models, to produce fine-grained, realistic defect images that can be used as positive samples to train an anomaly detection model. We name our approach DIAG, a training-free Diffusion-based In-distribution Anomaly Generation pipeline for data augmentation in the SDD task. By leveraging pre-trained LDMs with multimodal conditioning, we can exploit domain experts' knowledge to generate plausible anomalies without needing real positive data. When using these augmented images to train an anomaly detection model, we show a notable increase in detection performance compared to previous state-of-the-art augmentation pipelines. Specifically, this research is being carried out in collaboration with QUALYCO1, a startup spin-off of the University of Verona. Figure 1 outlines our approach.

The main contributions of our research are as follows:

• We present a complete pipeline for training anomaly detection models based on nominal images and textual prompts. We showcase the superior outcomes achieved by utilizing generated defective samples compared to previous state-of-the-art approaches.

• We dive into spatial control approaches to enable the synthesis of defect samples incorporating regional information, and exhibit enhanced controllability of the image generation through a human feedback loop pipeline, effectively utilizing domain expertise to generate more plausible in-distribution anomalies.

1 https://qualyco.com

2. Related Work

Research on SDD has been conducted according to different setups: unsupervised approaches [8] use a mixture of unlabelled positive and negative sample images for training; supervised approaches require labeled samples in the form of binary masks representing the defects (full supervision) [9] or simply as a tag for the whole image (weak supervision) [10]. Supervised methods demonstrated superior accuracy in the identification of anomalies. Nevertheless, the effort required to provide good annotations is not always justified. Collecting positive samples can be time- and resource-consuming due to the low rate of defective products generated by industrial lines.

Thus, many recent approaches adopt a "clean" setup, where the training set consists of only nominal samples. Two strategies can be adopted in clean setups: model fitting and image generation. Model fitting approaches aim at generating an accurate model of the nominal distribution, considering as an outlier every sample with a likelihood lower than, or a distance from the nominal prototype higher than, a predefined threshold [11]. On the contrary, data augmentation approaches leverage generative methods to synthesize images of defects and use these images as positive samples for training a supervised model. Specifically, this work focuses on generation-based data augmentation under clean setups.

The most popular data augmentation pipeline for SDD consists of a series of random standard transformations of the input image, such as mirroring, rotations, and color changes, followed by the superimposition of noisy patches [12]. In MemSeg [12], the pipeline for the generation of the abnormal synthetic examples is divided into three steps: i) a Region of Interest (ROI) indicating where the defect will be located is generated using Perlin noise and the target foreground; ii) the ROI is applied to a noise image to generate a noise foreground ROI; iii) the noise foreground ROI is super-imposed on the original image to obtain the simulated anomalous image. However, all these approaches are based on generating out-of-distribution patterns that do not faithfully represent the target-domain anomalies.

More recently, the first work that draws attention to in-distribution defect data is In&Out [13], in which we empirically show that diffusion models provide more realistic in-distribution defects. Here, we significantly improve the generation of in-distribution anomalous samples of [13], incorporating domain knowledge provided by an expert user through textual prompts and localization of salient regions in a training-free setup.

3. Methodology

3.1. Multimodal Diffusion-based image generation

LDMs [14, 15] are a class of deep latent variable models that work by modeling the joint distribution of the data over a Markovian inference process. This process consists of small perturbations of the data with a variance-preserving property [16], such that the limit distribution after the diffusion process is approximately identical to a known prior distribution. Starting with samples from the prior, a reverse diffusion process is learned by gradually denoising the sample to resemble the initial data by the end of the procedure.

We leverage the natural ability of LDMs to incorporate multimodal conditioning in the generation process, taking inspiration from [17, 18, 19]. Specifically, we use as textual descriptions a prompt and a negative prompt, i.e., a prompt that guides the image generation "away" from its concepts and towards the desired output, resulting in high-quality images that comply with the given descriptions [20, 21].

In particular, we do not perform full image generation: to effectively enhance spatial control, we opt to utilize an inpainting model, as demonstrated in [14, 18]. Given an image with a masked region, inpainting seamlessly fills it with content that harmonizes with the surrounding image. Although typically employed to eliminate undesired artifacts, the inpainting process ensures that the masked area incorporates the provided prompt, effectively merging textual and visual content.

3.2. Our proposed pipeline

To generate an anomalous image i_a, the process starts by sampling a random negative image, an anomaly description, and a mask, forming the triplet (i_n, d_a, m_a). These pieces of information are then fed to a text-conditioned LDM to perform inpainting on image i_n using the mask m_a.

The anomaly description d_a guides the generation, filling the masked region of i_n with an anomaly that complies with the prompt. To generate images resembling real anomalous samples, domain knowledge from industrial experts is exploited, providing textual descriptions of the potential anomalies' type, shape, and spatial information.

The LDM is then conditioned on this information to inpaint plausible anomalies on defect-free samples. Formally, given pictures of defect-free (negative) samples I_n, domain experts provide textual descriptions D_a of what different anomalies may look like. At the same time, regions where these anomalies may appear on the defect-free samples are designated. We define this set of regions as a set of binary masks M_a of possible anomaly shapes and locations. The result of this operation is i_a, an anomalous version of i_n, where an anomaly has been inpainted in the masked region m_a. Due to the stochastic nature of LDMs, this process can be repeated multiple times to generate an augmented set of anomalous sample images I_a. Finally, the set I_a can be used as data augmentation for training anomaly detection models, as presented in the following section.

3.3. The anomaly detection task

We approach the anomaly detection problem as a binary classification problem, where the objective is to predict whether a sample belongs to one of two classes. Specifically, we utilize a ResNet-50 [22] backbone trained with a binary cross-entropy loss function, denoted as L_BCE. Mathematically, it is defined as:

L_BCE(y, ŷ) = −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],   (1)

where y represents the ground-truth labels, ŷ represents the predicted probabilities, and N is the number of samples. In detail, y_i denotes the true label for sample i, which can be either 0 or 1, while ŷ_i signifies the predicted probability that sample i belongs to class 1.

Ongoing developments aim to optimize the model through TinyML [23] techniques in order to have an ultra-efficient system that can work smoothly in real time on a production line.

4. Experiments

4.1. Experiment setup

Datasets. We use the Kolektor Surface-Defect Dataset 2 (KSDD2) [10], one of the most recent, complex, and realistic SDD datasets. This dataset comprises 246 positive and 2085 negative images in the training set, and 110 positive and 894 negative images in the testing set.

Table 1: Results of MemSeg, In&Out, and DIAG when no anomalous samples are available. In bold, the best results; underlined, the second best.

Model          N_aug   AP ↑   Precision ↑   Recall ↑
MemSeg [12]    80      .514   .733          .436
MemSeg [12]    100     .388   .633          .432
MemSeg [12]    120     .511   .683          .470
In&Out [13]    80      .556   .530          .655
In&Out [13]    100     .626   .742          .568
In&Out [13]    120     .536   .699          .534
DIAG (ours)    80      .769   .851          .673
DIAG (ours)    100     .801   .924          .664
DIAG (ours)    120     .739   .944          .609
Positive images are images with visible defects, such as scratches, spots, and surface imperfections. Since the images have different dimensions, we standardize the dataset resolution, resizing all images to 224 × 632 pixels while keeping the number of normal and anomalous samples unchanged.

Evaluation metrics. The anomaly detection performance was evaluated based on Average Precision (AP), Precision, and Recall, following the evaluation protocol defined in [13].

4.2. Implementation details

In this section, we specify all the implementation details for reproducibility. All training and inference were conducted on an NVIDIA RTX 3090 GPU.

Inpainting via Diffusion Models. We use the pre-trained implementation of SDXL [21] from Diffusers as our text-conditioned LDM. Following the procedure outlined in Section 3.2, we use the negative images of KSDD2 as the set I_n. As the set of anomaly descriptions D_a, we used the prompts "white marks on the wall" and "copper metal scratches", while "smooth, plain, black, dark, shadow" was used as a negative prompt to further improve performance. These prompts were chosen after a series of tests, simulating the iterative process of our human feedback loop pipeline, until the resulting images resembled plausible anomalies. We used the segmentation masks of positive samples in the KSDD2 dataset to simulate the domain experts' definition of plausible anomalous regions. Then, these data are fed to the pre-trained SDXL model to perform inpainting on the negative images in a training-free process, generating the set of augmented anomalous images I_a as described in Section 3.2. Finally, the generated images I_a are added to the training set, which will be used to train the anomaly detection model.

ResNet-50 training and testing. For a fair comparison with [13], we use the same PyTorch implementation of ResNet-50 [22] as our anomaly detection model, in which we substitute the fully connected layers after the backbone to make it a binary classifier. The network is trained for 50 epochs with Adam [24] as the optimizer, a learning rate of 0.0001, and a batch size of 32. To maintain consistency with the training and evaluation procedures of KSDD2, our setup is the same as presented in [10, 13], where only the images and ground-truth labels are used to train the model.

4.3. Quantitative results

Zero-shot data augmentation. Here, we emulate the situation where no original positive samples are available in the training set. This scenario makes generating augmented positive samples necessary and restricts users to augmentation procedures that do not rely on positive images. To do this, we build the set of augmented anomalous images I_a by generating N_aug augmented positive samples with different pipelines, i.e., MemSeg [12], In&Out [13], and DIAG. Then, we train the ResNet-50 model on a dataset that includes the original negative samples I_n and the augmented positive samples I_a. Finally, we evaluate the model on the original test set.

Table 1 reports the comparison between the models trained with MemSeg, In&Out, and DIAG augmented data at different values of N_aug. As we can see, our proposed method achieves the highest AP (.801), recorded at 100 augmented images, while also resulting in a consistently higher AP when compared to the MemSeg and In&Out pipelines. These impressive results highlight how, through domain expertise in the form of anomaly descriptions and segmentation masks, it is possible to generate in-distribution images able to meaningfully guide an anomaly detection network, even in a complicated scenario where no real anomalous data is available.
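The triplet-based augmentation loop of Sections 3.2 and 4.2 can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: the `inpaint` argument stands in for the text-conditioned LDM inpainting call, and all function and variable names are assumptions made here for clarity.

```python
import random

def generate_augmented_set(negative_images, descriptions, masks, inpaint, n_aug):
    """Build the augmented anomalous set I_a by repeated triplet sampling.

    `inpaint` is a stand-in for the text-conditioned LDM inpainting call;
    injecting it keeps the loop itself framework-agnostic and testable.
    """
    augmented = []
    for _ in range(n_aug):
        # Sample a random triplet (i_n, d_a, m_a): a defect-free image,
        # an expert-written anomaly description, and a binary region mask.
        i_n = random.choice(negative_images)
        d_a = random.choice(descriptions)
        m_a = random.choice(masks)
        # The LDM fills the masked region of i_n with an anomaly that
        # complies with the textual description d_a (Section 3.2).
        augmented.append(inpaint(image=i_n, mask=m_a, prompt=d_a))
    return augmented
```

With Hugging Face Diffusers, `inpaint` would typically wrap a call to a pre-trained SDXL inpainting pipeline, passing the prompt, negative prompt, source image, and mask image; the exact checkpoint and sampler settings depend on the deployment and are not specified here.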
Surprisingly, the DIAG performance with N_aug = 120 augmented images is lower than with a smaller number of augmented images. We hypothesize this is due to the stochastic nature of LDM image generation. While it allows the generation of various images given the same guidance, it can also lower, in some cases, the predictability of the quality of the generated samples, which sometimes may not faithfully comply with the prompt. Future works will focus on studying quality consistency in the image generation pipeline.

Table 2: Results of MemSeg, In&Out, and DIAG when all the anomalous samples are available. In bold, the best results; underlined, the second best.

Model          N_aug   AP ↑   Precision ↑   Recall ↑
MemSeg [12]    80      .744   .851          .691
MemSeg [12]    100     .774   .814          .752
MemSeg [12]    120     .734   .772          .707
In&Out [13]    80      .747   .764          .734
In&Out [13]    100     .775   .868          .720
In&Out [13]    120     .782   .906          .689
DIAG (ours)    80      .869   .912          .755
DIAG (ours)    100     .911   .978          .800
DIAG (ours)    120     .924   .896          .864

4.4. Qualitative results

The main goal of our data augmentation pipeline is to generate in-distribution synthetic positive images, meaning images that closely resemble the real ones. Figure 2 shows qualitative results. It is evident that the images produced by DIAG are markedly more realistic than those generated by MemSeg [12] and In&Out [13].

Figure 2: The first row displays some negative samples from the KSDD2 dataset. The second row shows some positive samples from the same dataset. The third row shows MemSeg-generated defect samples. The fourth row shows In&Out-generated defect samples. Lastly, the final row showcases some images generated with DIAG. Notably, the defect images that DIAG generated are more realistic and in-distribution.
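The AP values reported in Tables 1 and 2 summarize the precision-recall trade-off over the ranked classifier scores. As a reference, a minimal sketch of this metric follows; it assumes untied scores and at least one positive label, and is only an illustration (the paper's exact evaluation protocol follows [13]).

```python
def precision_recall_points(labels, scores):
    """Precision/recall after each prediction, ranked by descending score.

    Assumes untied scores; labels are 0/1 with at least one positive.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / total_pos))
    return points

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve:
    AP = sum_n (R_n - R_{n-1}) * P_n."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in precision_recall_points(labels, scores):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

A perfect ranking (all positives scored above all negatives) yields AP = 1.0, while mixing positives and negatives in the ranking lowers the score, which is why AP is a convenient single-number summary for the unbalanced SDD setting.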
Full-shot data augmentation. To showcase DIAG as a general data augmentation technique, we also explore the scenario where real positive samples are available in the training set. To this aim, we include all the 246 real positive samples I_p in the training set, together with the real negative images I_n and the N_aug augmented positive images I_a.

As we can see from Table 2, DIAG achieves the highest average AP yet (.924), surpassing the .782 set by the previous state-of-the-art data augmentation pipeline [13]. When comparing these results to the ones obtained in the "zero-shot data augmentation" scenario, it is clear how more in-distribution images improve model performance during training. This is highlighted by the improvement in performance of all the models when adding the real positive images I_p to the training set. At the same time, the inclusion of DIAG augmented images allows the model to explore the anomaly distribution further, resulting in the difference in performance between the different data augmentation pipelines.

5. Conclusions

This work presents DIAG, a novel data augmentation pipeline that leverages visual language models to produce training-free positive images for enhancing the performance of an SDD model. We introduced domain experts in the generation pipeline, asking them to describe with textual prompts how a defect should look and where it can be localized. Then, we adopt a pre-trained LDM to generate defective images and train a binary classifier for isolating the anomalous images. We focus our experiments on the KSDD2 dataset and establish ourselves as the new state-of-the-art data augmentation pipeline, surpassing previous approaches in both the zero-shot and full-shot data augmentation scenarios with an AP of .801 and .924, respectively. These results highlight the potential of in-distribution data augmentation in the anomaly detection field, where training-free generative model pipelines such as DIAG can provide meaningful data for downstream classification, making them appealing solutions in scenarios where real anomalous data is difficult to collect or unavailable. These promising results promote further exploration across various datasets, particularly investigating how robust the image generation is to noisy textual prompts.

References

[1] T. Wang, Y. Chen, M. Qiao, H. Snoussi, A fast and robust convolutional neural network-based defect detection model in product quality control, The International Journal of Advanced Manufacturing Technology 94 (2018) 3465–3471.
[2] S. H. Hanzaei, A. Afshar, F. Barazandeh, Automatic detection and classification of the ceramic tiles' surface defects, Pattern Recognition 66 (2017) 174–189.
[3] P. K. R. Maddikunta, Q.-V. Pham, B. Prabadevi, N. Deepa, K. Dev, T. R. Gadekallu, R. Ruby, M. Liyanage, Industry 5.0: A survey on enabling technologies and potential applications, Journal of Industrial Information Integration 26 (2022) 100257.
[4] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, P. Gehler, Towards total recall in industrial anomaly detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 14318–14328.
[5] M. Rudolph, B. Wandt, B. Rosenhahn, Same same but differnet: Semi-supervised defect detection with normalizing flows, in: Winter Conference on Applications of Computer Vision (WACV), 2021.
[6] Y. Song, T. Wang, P. Cai, S. K. Mondal, J. P. Sahoo, A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities, ACM Computing Surveys 55 (2023) 1–40.
[7] Y. Chen, Y. Ding, F. Zhao, E. Zhang, Z. Wu, L. Shao, Surface defect detection methods for industrial products: A review, Applied Sciences 11 (2021) 7657.
[8] X. Tao, D. Zhang, W. Ma, Z. Hou, Z. Lu, C. Adak, Unsupervised anomaly detection for surface defects with dual-siamese network, IEEE Transactions on Industrial Informatics 18 (2022) 7707–7717.
[9] C. Luan, R. Cui, L. Sun, Z. Lin, A siamese network utilizing image structural differences for cross-category defect detection, in: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020.
[10] J. Božič, D. Tabernik, D. Skočaj, Mixed supervision for surface-defect detection: From weakly to fully supervised learning, Computers in Industry 129 (2021) 103459.
[11] T. Defard, A. Setkov, A. Loesch, R. Audigier, PaDiM: A patch distribution modeling framework for anomaly detection and localization, in: International Conference on Pattern Recognition (ICPR), 2021.
[12] M. Yang, P. Wu, H. Feng, MemSeg: A semi-supervised method for image surface defect detection using differences and commonalities, Engineering Applications of Artificial Intelligence 119 (2023) 105835.
[13] L. Capogrosso, F. Girella, F. Taioli, M. Dalla Chiara, M. Aqeel, F. Fummi, F. Setti, M. Cristani, Diffusion-based image generation for in-distribution data augmentation in surface defect detection, in: International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), 2024. doi:10.5220/0012350400003660.
[14] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in: International Conference on Machine Learning (ICML), 2015.
[15] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020) 6840–6851.
[16] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based generative modeling through stochastic differential equations, in: International Conference on Learning Representations (ICLR), 2020.
[17] J. Ho, T. Salimans, Classifier-free diffusion guidance, arXiv preprint arXiv:2207.12598 (2022).
[18] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[19] L. Capogrosso, A. Mascolini, F. Girella, G. Skenderi, S. Gaiardelli, N. Dall'Ora, F. Ponzio, E. Fraccaroli, S. Di Cataldo, S. Vinco, et al., Neuro-symbolic empowered denoising diffusion probabilistic models for real-time anomaly detection in industry 4.0: Wild-and-crazy-idea paper, in: 2023 Forum on Specification & Design Languages (FDL), IEEE, 2023, pp. 1–4.
[20] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv:2204.06125 (2022).
[21] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, R. Rombach, SDXL: Improving latent diffusion models for high-resolution image synthesis, arXiv preprint arXiv:2307.01952 (2023).
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[23] L. Capogrosso, F. Cunico, D. S. Cheng, F. Fummi, M. Cristani, A machine learning-oriented survey on tiny machine learning, IEEE Access (2024).
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).