<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Zero-Shot Anomaly Detection with CLIP in Medical Imaging: Are We There Yet?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aldo Marzullo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Bianca Maria Ranzini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRCCS Humanitas Research Hospital - via Manzoni 56</institution>
          ,
          <addr-line>20089 Rozzano, Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Zero-shot anomaly detection (ZSAD) offers potential for identifying anomalies in medical imaging without task-specific training. In this paper, we evaluate CLIP-based models, originally developed for industrial tasks, on brain tumor detection using the BraTS-MET dataset. Our analysis examines their ability to detect medical-specific anomalies with no or minimal supervision, addressing the challenges posed by limited data annotation. While these models show promise in transferring general knowledge to medical tasks, their performance falls short of the precision required for clinical use. Our findings highlight the need for further adaptation before CLIP-based models can be reliably applied to medical anomaly detection.</p>
      </abstract>
      <kwd-group>
<kwd>anomaly detection</kwd>
        <kwd>domain generalization</kwd>
        <kwd>medical imaging</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Anomaly detection (AD) in medical imaging plays a critical role in identifying and diagnosing diseases,
often detecting rare or subtle anomalies that may go unnoticed by human observers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular,
zero-shot anomaly detection (ZSAD), which aims to identify anomalies without specific training on
abnormal samples, holds great promise for medical applications where obtaining labeled data is both
challenging and costly. Despite the progress in anomaly detection, the majority of current models
are trained on large, domain-specific datasets, which limits their applicability to novel tasks and
categories—especially in the medical field, where training data encompassing real-world variability are
often difficult to obtain [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Recently, models based on Contrastive Language-Image Pretraining (CLIP)
have shown remarkable success in zero- and few-shot tasks across various domains [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These models,
trained on vast amounts of diverse, publicly available image-text pairs, excel at generalizing to unseen
categories with minimal task-specific fine-tuning. However, their effectiveness in the medical domain,
where anomaly detection often involves identifying subtle, domain-specific abnormalities, remains
underexplored [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this paper, we explore the potential of CLIP-based models for medical anomaly detection by
focusing on a brain tumor detection task using the BraTS dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While CLIP-based models have
demonstrated superior performance in industrial AD tasks, it is unclear whether these models are capable
of achieving similar success in the medical domain, where higher accuracy and domain specificity
are critical for clinical applications. We compare the performance of several CLIP-based models on
the BraTS dataset, aiming to assess whether these zero-shot models are ready for medical anomaly
detection or if further domain adaptation is required. Our findings reveal that, while CLIP models
show promise in transferring knowledge from general tasks to medical imaging, their performance in
detecting anomalies such as brain tumors is not yet sufficient for clinical use. Therefore, significant
improvements and adaptations are needed to fully harness the potential of CLIP-based models in medical
anomaly detection, especially for tasks that require high sensitivity and precision.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset Description</title>
        <p>
          We tested selected CLIP-based abnormality detection models to detect and segment brain metastases.
This task was chosen due to the clinical significance of brain metastases, which are common secondary
tumors that pose significant challenges in diagnosis and treatment. They are indeed characterized by
an extreme variability in terms of lesion size, shape and localisation within the brain, thus representing
an optimal case study for AD. To this aim, we used the BraTS 2023 Brain Metastases dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. It
consists of a retrospective collection of treatment-naive brain metastases mpMRI scans obtained from
various institutions. The dataset includes pre-contrast T1-weighted (t1w), post-contrast T1-weighted
(t1c), T2-weighted (t2w), and T2-weighted FLAIR (t2f) sequences. All data underwent standardized
preprocessing, including conversion to NIfTI format, co-registration, resampling to 1 mm³ resolution,
and skull-stripping. Imaging volumes were manually segmented and refined by neuroradiologists.
        </p>
        <p>For this work, we used only the BraTS 2023 training set (165 patients), splitting 70% for training and
30% for testing. The whole tumor ground truth mask (WT = non-enhancing tumor core + surrounding
non-enhancing FLAIR hyperintensity + enhancing tumor) was used as the target mask. As a case study,
only the axial view of t2w images was considered. CLIP’s preprocessing standardization was applied
using OpenAI’s ImageNet mean and standard deviation on each slice separately. Data augmentation
techniques were employed to reduce overfitting during training.</p>
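        <p>For illustration, the per-slice standardization step described above can be sketched as follows. This is a minimal example, assuming a single float-valued axial slice; the rescaling step and the function name are illustrative assumptions, and the channel statistics are those used by OpenAI’s CLIP preprocessing.</p>
        <preformat>
# Minimal sketch: per-slice CLIP-style standardization of one axial MR slice.
# The rescaling to [0, 1] and the function name are illustrative assumptions.
import numpy as np

CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess_slice(slice_2d: np.ndarray) -> np.ndarray:
    """Rescale one slice, replicate it to 3 channels, and standardize."""
    lo, hi = float(slice_2d.min()), float(slice_2d.max())
    rescaled = (slice_2d - lo) / (hi - lo + 1e-8)     # per-slice rescaling to [0, 1]
    rgb = np.repeat(rescaled[..., None], 3, axis=-1)  # grayscale to 3 channels
    return (rgb - CLIP_MEAN) / CLIP_STD               # channel-wise standardization
        </preformat>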
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Zero Shot Anomaly Detection using CLIP</title>
        <p>
          CLIP (Contrastive Language-Image Pretraining) is a large-scale vision-language model that has shown
remarkable success in zero-shot image classification tasks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. CLIP learns joint embeddings of images
and text by aligning them in a shared feature space. This makes it particularly powerful for tasks
requiring minimal supervision, as it can match images to text prompts without the need for training on
specific labeled datasets. While CLIP’s ability to generalize across domains has been demonstrated in
various vision tasks, its capacity for anomaly detection, particularly in medical contexts, remains an
open question.
        </p>
        <p>
          ZSAD aims to identify anomalous patterns in images from categories that are not present during
training. Given an image x ∈ R^(H×W×3), the objective is to compute an image-level anomaly score
s ∈ [0, 1] and a pixel-level anomaly map M ∈ R^(H×W), where larger values indicate a higher likelihood
of an anomaly. Unlike traditional anomaly detection methods, ZSAD does not rely on training data
from the target categories but instead leverages a pre-trained vision-language (VL) model such as CLIP,
which has been trained on natural image-text pairs. The VL model encodes both image and text features.
To perform ZSAD, textual prompts describing normal and anomalous states are commonly used, e.g.,
“A photo of a normal {object}” or “A photo of a damaged {object},” where {object} is replaced with the
category of interest. The model computes cosine similarities between the image embeddings F_img and
text embeddings F_text for normal and anomalous states. The pixel-level anomaly map M is defined by
comparing the similarity between image patch embeddings F_patch and the text embeddings for both
normal (F_normal) and anomalous (F_anomalous) states:
        </p>
        <p>M_(i,j) = exp(cos(F_patch^(i,j), F_anomalous)) / [exp(cos(F_patch^(i,j), F_normal)) + exp(cos(F_patch^(i,j), F_anomalous))],
where cos(·, ·) denotes the cosine similarity and (i, j) refers to the spatial location of the patch. The
image-level anomaly score s is then computed by aggregating the anomaly map M to provide an
overall score for the image.</p>
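        <p>A minimal PyTorch sketch of this computation is given below. It assumes L2-normalized patch embeddings and one text embedding per state; the tensor names and the max-based aggregation are illustrative choices rather than the exact procedure of any specific method.</p>
        <preformat>
# Minimal sketch: two-way softmax over normal/anomalous similarities per patch.
# Assumes L2-normalized embeddings, so cosine similarity is a dot product.
import torch

def anomaly_map(patch_emb, text_normal, text_anomal, grid_hw):
    # patch_emb: (N_patches, D); text_normal, text_anomal: (D,)
    sim_n = patch_emb @ text_normal          # similarity to the "normal" prompt
    sim_a = patch_emb @ text_anomal          # similarity to the "anomalous" prompt
    # (a temperature factor is often applied to these logits before the softmax)
    probs = torch.softmax(torch.stack([sim_n, sim_a], dim=-1), dim=-1)
    amap = probs[:, 1].reshape(grid_hw)      # probability of "anomalous" per patch
    score = amap.max()                       # one common image-level aggregation
    return amap, score
        </preformat>
        <p>Aggregation by the maximum is only one common choice; the methods described below differ precisely in how the text embeddings are produced and in how the anomaly map is aggregated.</p>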
        <p>Several variations of the above-described approach have been developed, introducing features such
as object-agnostic and learnable text prompts or value-wise attention mechanisms for fine-grained
localization. In these cases, a set of auxiliary data D_train = {(x_i, y_i)}_(i=1..N), where x_i represents the
images and y_i ∈ {0, 1}^(H×W) are the corresponding ground-truth masks for anomalies, is typically available
to train the adaptation layers. In this paper, we evaluate four AD methods that adapt CLIP for zero-shot
and few-shot anomaly detection in medical imaging.</p>
        <p>
          • AnomalyCLIP [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]: This method leverages CLIP’s vision-language alignment for zero-shot anomaly
detection (ZSAD) across diverse domains. AnomalyCLIP introduces learnable object-agnostic
text prompts to capture generic notions of normality and abnormality, allowing it to focus on
detecting anomalies regardless of foreground object semantics. By ignoring object class labels
and concentrating on abnormal regions, AnomalyCLIP aims to improve ZSAD performance in
highly variable datasets, such as defect inspection and medical imaging (a schematic sketch of
the learnable-prompt idea follows this list).
• VAND [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]: Developed for the Visual Anomaly and Novelty Detection (VAND) challenge, this
method augments CLIP with additional trainable linear layers to map image features into the
joint embedding space, enabling better alignment with text features. For the zero-shot setting,
VAND compares the features of test images with reference images stored in memory banks, which
improves anomaly detection accuracy. This method showed excellent results in industrial settings,
winning the zero-shot track of the VAND challenge, but its performance on medical tasks, such
as brain tumor detection, remains to be seen.
• AnoVL [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]: AnoVL addresses CLIP’s limitation in capturing fine-grained, patch-level anomalies
by introducing a value-wise attention mechanism within the visual encoder. This mechanism
enhances the ability to localize anomalies at the pixel level. Additionally, AnoVL utilizes domain-aware
state prompting to refine the matching between visual anomalies and abnormal state text
prompts. Further improvements are made using a test-time adaptation (TTA) technique that
refines the anomaly localization results by fine-tuning lightweight adapters based on pseudo-labels
generated by AnoVL itself.
• AdaCLIP [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: AdaCLIP enhances CLIP for zero-shot anomaly detection by introducing hybrid
learnable prompts, combining static prompts shared across images with dynamic prompts
generated for each test image. This hybrid approach improves adaptability and generalization across
diverse anomaly categories. AdaCLIP also integrates a Hybrid Semantic Fusion (HSF) module,
boosting both pixel- and image-level detection accuracy.
        </p>
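        <p>To make the learnable-prompt idea concrete, the following schematic sketch shows how object-agnostic context vectors could be prepended to an embedded state token before being passed through CLIP’s text transformer. The hyperparameters and module names are hypothetical and do not reproduce the authors’ implementations.</p>
        <preformat>
# Schematic sketch of object-agnostic learnable text prompts (AnomalyCLIP-style).
# "n_ctx" and "embed_dim" are illustrative hyperparameters, not the paper's values.
import torch

class LearnablePrompt(torch.nn.Module):
    def __init__(self, n_ctx: int = 12, embed_dim: int = 512):
        super().__init__()
        # One learnable context per state, shared across all object categories.
        self.normal_ctx = torch.nn.Parameter(0.02 * torch.randn(n_ctx, embed_dim))
        self.anomal_ctx = torch.nn.Parameter(0.02 * torch.randn(n_ctx, embed_dim))

    def forward(self, state_tokens: torch.Tensor, normal: bool) -> torch.Tensor:
        # Prepend the learned context to the embedded state token(s); the result
        # would then be fed to CLIP's text transformer (not shown here).
        ctx = self.normal_ctx if normal else self.anomal_ctx
        return torch.cat([ctx, state_tokens], dim=0)
        </preformat>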
        <p>
          In addition to using the standard CLIP backbone, we also leverage a specialized version of CLIP
for medical imaging, called PMC-CLIP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. PMC-CLIP is pretrained on the PMC-OA dataset, which
contains 1.6 million image-caption pairs from PubMedCentral’s OpenAccess subset, covering a wide
range of biomedical modalities and diseases. This biomedical-specific pretraining may potentially be
more suitable for tasks such as brain metastasis detection than the original CLIP model. Notably, both
CLIP and PMC-CLIP are designed and trained to process 2D images. To the best of our knowledge
and at the time of writing, there is no publicly available CLIP-like model for 3D anomaly detection.
This might represent a limitation of such 2D approaches in MR anomaly detection, as they would not
be able to exploit the rich three-dimensional information for a more accurate and spatially-coherent
identification of anomalies. With our experiments we also aim to investigate the impact of using a 2D
approach on volume-wise (3D) anomaly detection.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Experiments</title>
        <p>
          In this section, we describe the benchmark designed to evaluate four CLIP-based anomaly detection (AD)
methods—VAND, AnomalyCLIP, AnoVL and AdaCLIP — on the brain tumor segmentation task using the
BraTS dataset. The benchmark is designed to assess the capability of these methods, originally developed
for industrial AD tasks, in detecting medical anomalies such as brain tumors. We explore different
training strategies to test the generalization of these models across domains, including industrial,
medical, and brain tumor datasets. In more detail, we designed four experimental setups:
• (Industrial) Pretrained on Industrial AD Dataset: In the first setup, we evaluate each method
in its original form, using models pretrained on an industrial anomaly detection dataset. This
setup allows us to examine the generalization capabilities of these models in detecting medical
anomalies such as brain tumors without any additional medical-specific training.
• (Finetune) Finetuned on BraTS: Next, we finetune the models using the BraTS dataset, which
contains labeled examples of brain tumors. This setup assesses the improvement in performance
when the models are adapted to a specific medical domain through supervised learning. In
this setting, the CLIP model weights are frozen; only the adaptive layers introduced by each
modification are finetuned on BraTS (see the sketch after this list).
• (Brats) Training from scratch: In this experiment, we train the model in its original form, with
a randomly initialized adapter layer, on the BraTS dataset. As in the previous setting, the CLIP
model weights are frozen.
• (PMC) Using PMC-CLIP [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as backbone and training from scratch on BraTS: Finally, we replace
the CLIP model pretrained on industrial data with a CLIP model pretrained on the PMC dataset.
As in the third setting, the adaptive layers on each anomaly-detection approach are trained from
scratch on the BraTS dataset. This setup explores whether pretraining on a medical dataset
followed by further adaptation on the target medical task enhances the models’ ability to detect
brain tumors.
        </p>
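        <p>The training scheme shared by the Finetune, Brats, and PMC setups can be sketched as follows: the CLIP backbone is frozen and only the adapter parameters are optimized. The module names and learning rate below are hypothetical placeholders.</p>
        <preformat>
# Minimal sketch: freeze the CLIP backbone, train only the adapter layers.
# "clip_model" and "adapter" are hypothetical stand-ins for the backbone and
# each method's adaptation layers.
import torch

def build_optimizer(clip_model: torch.nn.Module,
                    adapter: torch.nn.Module,
                    lr: float = 1e-3) -> torch.optim.Optimizer:
    for param in clip_model.parameters():
        param.requires_grad = False   # backbone stays frozen
    clip_model.eval()                 # no dropout/batch-norm updates in the backbone
    return torch.optim.Adam(adapter.parameters(), lr=lr)
        </preformat>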
        <p>
          These methods are designed for 2D image processing. Since brain MRIs are 3D volumes, we treat
each slice (axial view) as a 2D image and reconstruct the 3D anomaly map by stacking the 2D anomaly
maps. For all four setups, we measure the performance of each method using Dice score (equivalent to
the F1-score) as well as sensitivity, specificity, and positive predictive value (PPV). These metrics are
computed for each 3D volume and averaged across all test subjects. Additionally, we compute the Area
Under the Receiver Operating Characteristic curve (AUROC) and the maximum F1-score (F1-max) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
which are commonly used metrics for assessing segmentation quality in AD tasks [
          <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
          ]. Note that to
calculate pixel-wise metrics, the AD problem is framed as a segmentation task by thresholding the 3D
anomaly maps to generate binary masks. In all experiments, we used the open-source implementations
provided by the authors of each method. We kept the default settings and hyperparameters unchanged
for training (e.g., number of epochs and batch size), except for the image size, which was set to 240 × 240.
For AdaCLIP specifically, due to the significant time required for training and inference, we randomly
sampled 50 patients for training (same size as the test set). Note that, due to its training-free
design, AnoVL was only evaluated in a single setting (namely, industrial). More details on model
training are reported in Appendix Table 2.
        </p>
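        <p>The volumetric evaluation can be sketched as follows: per-slice anomaly maps are stacked into a volume, binarized, and compared against the 3D ground-truth mask. This is a minimal example; the threshold value is an illustrative assumption.</p>
        <preformat>
# Minimal sketch: stack 2D anomaly maps into a 3D volume, threshold,
# and compute the volumetric Dice score. The threshold is illustrative.
import numpy as np

def dice_3d(slice_maps, gt_volume, threshold=0.5):
    # slice_maps: list of 2D anomaly maps (one per axial slice), values in [0, 1]
    # gt_volume: binary ground-truth mask of shape (n_slices, H, W)
    volume = np.stack(slice_maps, axis=0)     # (n_slices, H, W)
    pred = np.greater(volume, threshold)      # binarize the anomaly volume
    gt = gt_volume.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom else 1.0
        </preformat>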
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Overall Performance Comparison</title>
        <p>The 3D Dice scores across subjects and models are summarized in Fig. 1. This plot compares the
performance of four anomaly detection models (AdaCLIP, AnomalyCLIP, VAND, and AnoVL) for each
of the four experimental setups (pretrained on industrial datasets, finetuned on BraTS, training on
BraTS from scratch, and training on BraTS using the PMC-CLIP backbone).</p>
<p>AdaCLIP demonstrates the highest variance in performance across experiments, with its 3D Dice Scores
ranging from approximately 0.1 to 0.7, performing best when the adapters are trained on the BraTS
dataset. VAND also shows relatively high performance, especially when using the PMC-CLIP backbone,
where its scores vary between 0.1 and 0.6. In contrast, AnomalyCLIP shows consistently lower Dice
Scores across all experiments, particularly struggling with scores close to 0 when using the industrial
and industrial-finetuned weights. AnoVL shows the weakest performance overall, with very low scores
across all experiments, rarely exceeding 0.1, reflecting its poor capacity for anomaly detection in this
setting.</p>
        <p>
          For further comparison, we analyze additional metrics as reported in Table 1. Notably, the trend
observed in the AUROC values contrasts sharply with the Dice scores. We attribute this discrepancy to
the AUROC’s sensitivity to data imbalance [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which can significantly affect the analysis of small lesions within large 3D volumes. Consequently,
AUROC values may provide an inflated representation of performance that does not accurately reflect
segmentation quality.</p>
        <p>Table 1: evaluation metrics for AdaCLIP, AnoVL, AnomalyCLIP, and VAND under the four experimental settings: industrial, finetune, brats, and pmc.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Per-Subject Performance Analysis</title>
        <p>To further investigate model performance, we analyzed Dice Scores per subject across the experiments,
as shown in Fig. 2. Due to its low performance, we exclude AnoVL from this analysis. Each plot
presents 2D Dice Scores across individual slices (blue bars) and the corresponding 3D Dice Scores (red
crosses). This comparison illustrates the extent of variability between 2D slice-level performance and
3D volumetric segmentation accuracy.</p>
        <p>• VAND exhibits relatively stable 2D segmentation accuracy across subjects, especially when
trained on BraTS from scratch and when using PMC-CLIP backbone, where Dice Scores typically
range from 0.4 to 0.8. However, the 3D Dice Scores are notably lower for most subjects, with
values between 0.2 and 0.5, highlighting challenges in volumetric segmentation. Of note, while
showing similar variability of Dice scores on 2D slices, the experiment on the PMC-CLIP backbone
reports higher 3D Dice scores compared to industrial-CLIP with training on BraTS, suggesting
that, for this architecture, medical-specific CLIP embeddings are more suitable for the task of
anomaly detection.
• AdaCLIP also shows a wide range of 2D Dice Scores, particularly excelling in the training on
BraTS, where it achieves values as high as 0.8 in certain subjects. However, like VAND, the
corresponding 3D Dice Scores are consistently lower, ranging between 0.2 and 0.4 for many
subjects.
• AnomalyCLIP, on the other hand, underperforms across all datasets. Similarly to VAND, the
greatest improvement is obtained by using the PMC-CLIP backbone and training the adaptive layers
on BraTS. However, its 2D Dice Scores are generally below 0.3 for most subjects, and the 3D
Dice Scores are even lower, barely exceeding 0.1 for the majority of cases. This suggests poor
performance for anomaly detection in both 2D and 3D segmentations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Error Distribution Across Brain Regions</title>
        <p>To understand how segmentation performance changes spatially within the brain, we plotted the mean
2D Dice Score for each discretized distance from the brain center (normalized from -1 to 1) in Fig. 3.
The number of slices containing lesions at each distance is also shown (yellow bars), providing insight
into how lesion location affects segmentation accuracy.</p>
        <p>For both VAND and AdaCLIP, segmentation accuracy is highest near the upper part of the brain
(normalized distance around 0.25), where the number of lesion-containing slices is also highest. Dice
Scores reach values between 0.4 and 0.7 at these upper-central regions, while they drop to around
0.2 towards the periphery (distances near -1 and 1). This suggests that these models struggle more
with peripheral brain regions, where lesion frequency is lower or more variable. AnomalyCLIP,
however, despite showing a similar trend as the other methods, achieves lower performance across all
distances overall, with mean Dice Scores consistently below 0.3. This suggests that the model is unable
to effectively capture lesion characteristics regardless of their location in the brain.</p>
        <p>In summary, the models tend to achieve reasonable 2D segmentation performance near the center
of the brain but struggle towards the periphery. Their 3D performance remains lower than their 2D
scores, suggesting challenges in volumetric segmentation.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Discussion and Conclusion</title>
        <p>The results from our experiments show consistently low performance for all models on the BraTS
brain tumor segmentation task, with Dice scores below 50%, even when using PMC-CLIP, a
vision-language model pretrained on a medical dataset. Although a comprehensive quantitative comparison is
challenging due to variations in medical applications and datasets, we observe that the AUROC values
obtained in our experiments (Table 1) are generally lower than those reported in the related studies.
Indeed, despite pixel-level Dice scores not being reported for Brain MRI, AdaCLIP presents AUROC
values ranging from 0.772 to 0.904, while AnomalyCLIP reports values between 0.789 and 0.897 for
ZSAD across other (2D) medical imaging applications (e.g. skin, colon, thyroid). Concerning brain
anomaly detection, their reported image-level AUROC values exceed 90%. However, such high values
do not align with pixel-level metrics observed in our experiments, also due to the fact that AUROC
might not be sensitive enough to capture anomaly segmentation quality. Therefore, the Dice score
should be preferred as a more reliable metric for performance evaluation in this context, particularly
given the sparsity of anomalous voxels.</p>
        <p>
          Nevertheless, the observed discrepancy may be attributable to the nature of the BraTS dataset, which
presents unique challenges such as greater variability in tumor appearances and diffuse boundaries
compared to the possibly simpler brain MRI datasets used in related work experiments, e.g.
BrainMRI [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This latter dataset contains only 2D axial slices of 15 patients that have glial brain tumors
mostly located in the upper part of the brain, which may be an easier task compared to the detection of
metastasis in a 3D setting. This also aligns with the observation that, as shown in Figure 3, all tested
algorithms report higher performance in the central-upper part of the brain, suggesting that these areas
might be better captured by the CLIP embeddings with respect to the rest of the brain. Therefore, our
findings suggest that CLIP-based models may struggle with more complex and heterogeneous medical
anomalies, thus highlighting the need for more specialized fine-tuning and adaptation in these domains.
One notable limitation of our study is the slow training process associated with the AdaCLIP model,
which stems from the use of batch size 1. Due to computational constraints, the cohort used for training
was subsampled, potentially affecting the robustness and comparability of the results. The inefficiency
caused by the small batch size severely hampered the speed of convergence and made the fine-tuning
process impractically slow. Additionally, another critical factor that may have contributed to the low
performance of the models is the choice of the initial prompt. The prompts used for zero-shot anomaly
detection in brain imaging may not have been optimally suited for identifying brain anomalies, such
as metastases, which have unique and complex characteristics. Future research may involve more
specialized prompt templates to better align with the medical domain. Finally, while PMC-CLIP has
demonstrated promising results when fine-tuned for downstream medical tasks, it has been observed
that it may lack strong zero-shot anomaly detection capabilities. Other models, such as BiomedCLIP
[
          <xref ref-type="bibr" rid="ref14">14</xref>
], may offer more suitable alternatives for ZSAD tasks, although technical challenges in adapting
these models to existing architectures still need to be addressed.
        </p>
        <p>
          Finally, several other CLIP-based models have already been designed for medical imaging, including
MedicalCLIP [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], MediCLIP [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and MVFA-AD [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. While these models show promise, their
evaluation has so far been limited to 2D medical datasets, leaving uncertainty about their applicability to more
complex 3D medical imaging tasks, such as brain metastasis detection. In a real-world context, the
detection of brain metastases from 3D MRI images presents significant challenges due to the intricacies
of tumor shape, size, and location, as well as the variability in imaging protocols. Thus, it remains
unclear whether these models can generalize effectively to such challenging datasets, suggesting the
need for future work on adapting these methods to handle 3D data and validating their performance on
tasks involving more complex anatomical structures. Such evaluation is the goal of our future work.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgments</title>
      <p>This work was partially supported by the “Bando 5× mille Ministero della Salute 2024” funding assigned
to IRCCS Humanitas.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Experiment settings</title>
    </sec>
    <sec id="sec-6">
      <title>B. Variability of lesions at normalized distances.</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Tschuchnig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gadermayr</surname>
          </string-name>
          ,
          <article-title>Anomaly detection in medical imaging-a mini review</article-title>
          ,
          <source>in: Data Science-Analytics and Applications: Proceedings of the 4th International Data Science Conference-iDSC2021</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Segato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marzullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Calimeri</surname>
          </string-name>
          , E. De Momi,
          <article-title>Artificial intelligence for brain diseases: A systematic review</article-title>
          ,
          <source>APL bioengineering 4</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Clip in medical imaging: A comprehensive survey</article-title>
          ,
          <source>arXiv preprint arXiv:2312.07353</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Moawad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Baid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jekel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Krantchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saluja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Osenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilms</surname>
          </string-name>
          , et al.,
          <article-title>The brain tumor segmentation (brats-mets) challenge 2023: Brain metastasis segmentation on pre-treatment mri</article-title>
          ,
          <source>arXiv preprint arXiv:2306.00838</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Pang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection</article-title>
          ,
          <source>arXiv preprint arXiv:2310.18961</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>April-GAN: A zero-/few-shot anomaly classification and segmentation method for CVPR 2023 VAND workshop challenge tracks 1&amp;2: 1st place on zero-shot AD and 4th place on few-shot AD</article-title>
          ,
          <source>arXiv preprint arXiv:2305.17382</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Anovl: Adapting vision-language models for unified zero-shot anomaly localization</article-title>
          ,
          <source>arXiv preprint arXiv:2308.15939</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Frittoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boracchi</surname>
          </string-name>
          ,
          <article-title>AdaCLIP: Adapting CLIP with hybrid learnable prompts for zero-shot anomaly detection</article-title>
          ,
          <source>arXiv preprint arXiv:2407.15795</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Wu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>PMC-CLIP: Contrastive language-image pre-training using biomedical documents</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravichandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dabeer</surname>
          </string-name>
          ,
          <article-title>WinCLIP: Zero-/few-shot anomaly classification and segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2303.14814</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goadrich</surname>
          </string-name>
          ,
          <article-title>The relationship between precision-recall and roc curves</article-title>
          ,
          <source>in: Proceedings of the 23rd international conference on Machine learning</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Kanade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gumaste</surname>
          </string-name>
          ,
          <article-title>Brain tumor detection using mri images</article-title>
          ,
          <source>Brain</source>
          <volume>3</volume>
          (
          <year>2015</year>
          )
          <fpage>146</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bagga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Valluri</surname>
          </string-name>
          , et al.,
          <article-title>Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs</article-title>
          ,
          <source>arXiv preprint arXiv:2303.00915</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>MedicalCLIP: Anomaly-detection domain generalization with asymmetric constraints</article-title>
          ,
          <source>Biomolecules</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>590</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>MediCLIP: Adapting CLIP for few-shot medical image anomaly detection</article-title>
          ,
          <source>arXiv preprint arXiv:2405.11315</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Adapting visual-language models for generalizable anomaly detection in medical images</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>11375</fpage>
          -
          <lpage>11385</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>