A self-guided anomaly detection-inspired few-shot
segmentation network
Suaiba Amina Salahuddin1,* , Stine Hansen1 , Srishti Gautam1 , Michael Kampffmeyer1
and Robert Jenssen1
1 Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø NO-9037, Norway


                                         Abstract
                                         Standard strategies for fully supervised semantic segmentation of medical images require large pixel-level
                                         annotated datasets. This makes such methods challenging due to the manual labor required and limits the
                                         usability when segmentation is needed for new classes for which data is scarce. Few-shot segmentation
                                         (FSS) is a recent and promising direction within the deep learning literature designed to alleviate these
                                         challenges. In FSS, the aim is to create segmentation networks with the ability to generalize based on
                                         just a few annotated examples, inspired by human learning. A dominant direction in FSS is based on
                                         matching representations of the image to be segmented with prototypes acquired from a few annotated
                                         examples. A recent method called the ADNet, inspired by anomaly detection, computes only a single
                                         prototype. This prototype captures the properties of the foreground segment. In this paper, the aim is
                                         to investigate whether the ADNet may benefit from more than one prototype to capture foreground
                                         properties. We take inspiration from the very recent idea of self-guidance, where an initial prediction
                                         of the support image is used to compute two new prototypes, representing the covered region and the
                                         missed region. We couple these more fine-grained prototypes with the ADNet framework to form what
                                         we refer to as the self-guided ADNet, or SG-ADNet for short. We evaluate the proposed SG-ADNet on a
                                         benchmark cardiac MRI data set, achieving competitive overall performance compared to the baseline
                                         ADNet, helping reduce over-segmentation errors for some classes.

                                         Keywords
                                         Medical image segmentation, Few-Shot, Self-supervision




1. Introduction
Significant advances have been made toward image classification and semantic segmentation
tasks by deep convolutional neural network-driven approaches such as U-Net, V-Net, FCN
and 3D U-Net [1, 2, 3, 4]. Standard fully supervised semantic segmentation strategies can
be impractical particularly for medical images as they require large pixel-level annotated
datasets which are expensive and time-consuming to acquire, needing considerable clinical
expertise. Furthermore, once trained, the models suffer from poor generalisability to classes not
encountered during training.
   Few-shot learning (FSL) is a promising way to address these challenges. Inspired by how
humans are able to distinguish a new concept with just a handful of examples, FSL seeks to learn

The 11th Colour and Visual Computing Symposium, September 08-09, 2022, Gjøvik, Norway
* Corresponding author.
suaiba.a.salahuddin@uit.no (S. A. Salahuddin); s.hansen@uit.no (S. Hansen); srishti.gautam@uit.no (S. Gautam);
michael.c.kampffmeyer@uit.no (M. Kampffmeyer); robert.jenssen@uit.no (R. Jenssen)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                       CEUR Workshop Proceedings (CEUR-WS.org)
a model that uses only a single or few annotated samples to segment images from previously
unseen classes. Initially, FSL was applied to classification tasks; a key effort was presented
in [5]. This laid the basis for more recent applications to the more challenging
task of semantic segmentation. Most existing few-shot segmentation (FSS) techniques adopt
so-called prototypical learning [6, 7, 8, 9, 10, 11]. These approaches usually entail a two-branch
encoder-decoder architecture (support and query branches). Here, the support set refers to a small
set of annotated images of certain classes that guides learning of the desired segmentation
task. The query set refers to the set of images to be segmented, composed of one or more of the
same classes as the support. Typically, within a standard prototype-based FSS framework, the
support branch extracts class-wise prototypes from the support image, which then guide the
segmentation of the query image in the query branch. Usually, global average pooling (GAP) is
used to extract the prototypes and the query image is segmented based on e.g. cosine similarity
between query pixels and the prototypes in the embedding space.
   For the particular task of medical image segmentation by FSS, [12] recently proposed self-
supervised training with supervoxel pseudo-labels to generate support and query sets for the
training phase. Supervoxels refer to groups of similar image voxels and were generated offline
as proposed in [13]. The idea is to use an unlabeled image slice and one of its random supervoxel
segmentations (as the foreground mask) to be the support image-label pair. Then the query
image-label pair is constructed by arbitrary transformations performed on the support data.
During testing, new classes are segmented using only a few annotated image slices.
   A key drawback of the prototypical FSS approaches is the loss of intra-class local information
after GAP. In the context of medical images with large, spatially variant background classes
containing all classes other than the foreground, this is particularly disadvantageous. Several
approaches have attempted to address this with additional prototypes learned class-wise. To
tackle this issue and boost segmentation accuracy, [12] incorporated adaptive local prototype
pooling. With this strategy, local prototypes were evaluated within a local pooling window
overlaid over support data. Recently, [14] postulated that the background volume characteristics
cannot be sufficiently represented by prototypes evaluated from only a few support image
slices. To overcome this limitation they proposed a novel anomaly detection-inspired technique
(ADNet) whereby background prototypes are excluded and only one representative prototype is
extracted from the more homogeneous foreground class (i.e. an organ). Then anomaly scores are
computed between this foreground prototype and each query pixel to evaluate dissimilarity. In
this scheme, the segmentation is then performed by thresholding anomaly scores with a learned
threshold value. This approach also includes a novel 3D super-voxel based self-supervision
scheme to leverage volumetric data from medical images.
   However, the foreground may be too complex to be modelled using a single prototype. This
was supported by [15], showing that a single extracted support prototype carried insufficient
information to obtain accurate query segmentations even in the case of using the same image
as both support and query. They argued that average pooling operations inevitably lose
useful information needed as support for some query pixels. They sought to overcome this
using a self-guided module (SGM) which first extracted a prediction for the labelled support
image using the original prototype and then primary and auxiliary prototypes were extracted
with masked GAP from the covered and uncovered foreground areas respectively. The primary
and auxiliary support vectors were then combined to achieve boosted query segmentation
performances.
   Motivated by the findings of [14] and [15], in this work we propose a self-guided ADNet,
or SG-ADNet for short, which generates more fine-grained representations of the foreground
properties. In our SG-ADNet, we adopt an Adaptive Self-Guided Module (ASGM) which, differing
from [15], determines the primary and auxiliary prototype outputs based upon the number
of covered and uncovered foreground regions. Our objective is to investigate the properties and
potential benefits of the proposed SG-ADNet, having different prototypes (primary and auxiliary)
focusing on different foreground regions of interest, to address the potential drawbacks of the
single foreground-prototype ADNet framework.
   In summary, the main contributions of this work are:
   1. We propose a novel framework, SG-ADNet, to help address limitations of the single fore-
      ground prototype based ADNet. We leverage multiple foreground prototypes generated
      using a novel self-guidance module, ASGM, to better account for foreground regions
      “missed” by a single prototype.
   2. We show that the SG-ADNet achieves competitive results relative to the ADNet baseline
      whilst diminishing over-segmentation in cardiac segmentation tasks.
In Section 2 we introduce FSS and our proposed SG-ADNet. Section 3 presents the experimental
setup and data used. Section 4 presents results for the experiments, highlighting properties of
the proposed method. Appendix A gives additional results. Finally, Section 5 concludes the
paper.


2. Self-guided anomaly detection-inspired few shot
   segmentation for medical images
In order to present the SG-ADNet and its context, we will for the benefit of the reader review
the FSS problem setting (Section 2.1), give a brief explanation of the self-supervision stage
which is important in FSS approaches to medical image segmentation (Section 2.2), and then
briefly give an overview of the ADNet (Section 2.3). This puts us in a position to explain how
the principle of self-guidance is leveraged to propose the novel SG-ADNet.

2.1. The FSS problem setting
In FSS, the model is typically trained on an annotated set 𝐷𝑡𝑟𝑎𝑖𝑛 with classes 𝐶𝑡𝑟𝑎𝑖𝑛 and then
this trained model is used to make predictions on a different test set, 𝐷𝑡𝑒𝑠𝑡 with new classes
𝐶𝑡𝑒𝑠𝑡 for which only a few annotated examples are available. The training and testing are
typically performed episodically [16]; usually with an N-way-K-shot scheme whereby there
are N different classes to be distinguished with K examples of each. An episode incorporates a
support set 𝑆 and a query set 𝑄 for a particular class. The support set contains K annotated
images for each of the N classes and serves as input. The query set contains a query image
containing one or more of the N classes. The model learns from the information in the support
set about the N classes and then segments the query set, outputting a predicted query mask with
height and width dimensions 𝐻 and 𝑊 respectively.
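   As a rough illustration, the contents of a 1-way K-shot episode can be organised as in the
following sketch; the container and field names are ours, not taken from any particular
implementation.

    # Illustrative container for a 1-way K-shot segmentation episode (names are ours).
    from dataclasses import dataclass
    from typing import List
    import torch

    @dataclass
    class Episode:
        support_images: List[torch.Tensor]  # K support images, each of shape (C, H, W)
        support_masks: List[torch.Tensor]   # K binary foreground masks, each (H, W)
        query_image: torch.Tensor           # query image of shape (C, H, W)
        query_mask: torch.Tensor            # ground-truth mask (H, W), used only for the loss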
2.2. Supervoxel-based self supervision
Due to the particular characteristics of medical images, [12] and later [14] developed training
episodes for FSS by a self-supervised learning approach via supervoxels. In particular, we adopt
the 3D supervoxel approach put forward by [14]. The motivation is to use 3D supervoxels,
which are a collection of similar voxels from localised areas within the image volume, to sample
“pseudo”-labels for semantically uniform image locations that then guide training. Supervoxels
are computed offline using a 3D extension of the efficient, unsupervised, graph-based image-
segmentation algorithm prescribed in [13].
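   The 3D extension of [13] used here is not part of standard libraries; purely as an illustration
of the underlying graph-based algorithm, the 2D version available in scikit-image can be applied
to a single slice as in the sketch below. This is not the authors' offline 3D implementation, and
the parameter values are arbitrary.

    # 2D illustration only: the paper uses a 3D extension of [13], while scikit-image
    # provides the 2D algorithm. Parameter values below are arbitrary placeholders.
    import numpy as np
    from skimage.segmentation import felzenszwalb

    rng = np.random.default_rng(0)
    slice_2d = rng.random((256, 256)).astype(np.float32)   # stand-in for an MRI slice
    labels = felzenszwalb(slice_2d, scale=100, sigma=0.8, min_size=200)

    # One randomly chosen segment serves as a pseudo foreground mask during training.
    chosen = rng.choice(np.unique(labels))
    pseudo_mask = (labels == chosen).astype(np.uint8)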
   In training, each episode involves one unlabeled image volume along with one of its ran-
domly sampled supervoxel masks, representative of the foreground, to first yield a 3D binary
mask. Next, 2D slices including this supervoxel class are sampled from the image volume to
make up the support and query images. As in [14], we also apply random transformations to the
support/query images and exploit volumetric data across slices; this 3D approach makes additional
information available compared to related 2D variants.

2.3. Anomaly detection-inspired FSS - ADNet
As mentioned, in the ADNet approach the support and query images, obtained via self-supervised
learning (supervoxels), are embedded into features $F^s$ and $F^q$, respectively. Within each
episode, the focus is on the foreground class $c$ only, and GAP is hence applied only to this class
of interest. To ensure masking can be performed, the support feature maps are resized to the same
dimensions as the masks, i.e. $(H, W)$.
   One original foreground prototype $p \in \mathbb{R}^d$, with $d$ indicating the dimensionality of the
embedding space, is extracted. This is done as follows:

$$ p = \frac{\sum_{x,y} F^{s}(x, y) \circ \mathbf{y}_{fg}(x, y)}{\sum_{x,y} \mathbf{y}_{fg}(x, y)}, \qquad (1) $$

where $\circ$ represents the Hadamard product and $\mathbf{y}_{fg}(x, y) = \mathbb{1}\big(\mathbf{y}(x, y) = c\big)$ represents the binary
foreground mask. This support foreground prototype is the input to our self-guided module, as
discussed later when presenting the SG-ADNet.
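   For concreteness, Eq. (1) corresponds to the masked average pooling sketched below; the function
name and tensor shapes are our assumptions, not the authors' code.

    # Sketch of masked average pooling (Eq. 1).
    # fs: support feature map (d, H, W); y_fg: binary foreground mask (H, W).
    import torch

    def masked_gap(fs: torch.Tensor, y_fg: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
        """Return the d-dimensional foreground prototype p."""
        num = (fs * y_fg.unsqueeze(0)).sum(dim=(1, 2))  # sum of features over foreground pixels
        den = y_fg.sum() + eps                          # number of foreground pixels
        return num / den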
   Furthermore, in the original ADNet, segmentation is based on the foreground prototype by
adopting a threshold-based metric learning scheme. This involves evaluating an anomaly score $S$
for each query feature vector as the negative, scaled cosine similarity to the foreground
prototype $p$ of that episode:

$$ S(x, y) = -\alpha\, d\big(F^{q}(x, y),\, p\big). \qquad (2) $$

Here, $d(\cdot, \cdot)$ denotes the cosine similarity between $F^q$ and $p$, and the scaling factor $\alpha$ is set
to 20 as in [14]. Then, using a learned variable $T$, the anomaly scores are thresholded to yield
the foreground mask prediction. A shifted sigmoid $\sigma(\cdot)$ is applied to perform soft thresholding,
which makes the procedure differentiable: $\hat{\mathbf{y}}^{q}_{fg}(x, y) = 1 - \sigma(S(x, y) - T)$. As a result, query
features with anomaly scores below the threshold $T$ receive a foreground probability greater than
0.5. The background mask is $\hat{\mathbf{y}}^{q}_{bg}(x, y) = 1 - \hat{\mathbf{y}}^{q}_{fg}(x, y)$.
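   A minimal sketch of this anomaly scoring and soft thresholding (Eq. (2) together with the shifted
sigmoid) is given below; the names and shapes are our assumptions.

    # Sketch of anomaly scoring (Eq. 2) and the shifted-sigmoid soft threshold.
    # fq: query feature map (d, H, W); p: prototype (d,); T: learned threshold (scalar tensor).
    import torch
    import torch.nn.functional as F

    def soft_foreground(fq: torch.Tensor, p: torch.Tensor, T: torch.Tensor,
                        alpha: float = 20.0) -> torch.Tensor:
        cos = F.cosine_similarity(fq, p.view(-1, 1, 1).expand_as(fq), dim=0)  # (H, W)
        anomaly = -alpha * cos                   # S(x, y)
        y_fg = 1.0 - torch.sigmoid(anomaly - T)  # soft threshold with the learned T
        return y_fg                              # the background probability is 1 - y_fg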
  Following this, the predicted foreground and background masks are upsampled to the same
dimensions as the images (𝐻, 𝑊 ) and then a binary cross-entropy loss for segmentation is
employed:
$$ L_{seg} = -\frac{1}{HW} \sum_{x,y} \Big[ \mathbf{y}^{q}_{bg}(x, y)\, \log\big(\hat{\mathbf{y}}^{q}_{bg}(x, y)\big) + \mathbf{y}^{q}_{fg}(x, y)\, \log\big(\hat{\mathbf{y}}^{q}_{fg}(x, y)\big) \Big]. \qquad (3) $$

   As prescribed in earlier approaches [7, 8, 14, 12] a regularization prototype alignment loss
is also adopted whereby the roles of the support and query images are exchanged, i.e. the
predicted query mask guides segmentation of the support image:
$$ L_{reg} = -\frac{1}{HW} \sum_{x,y} \Big[ \mathbf{y}^{s}_{bg}(x, y)\, \log\big(\hat{\mathbf{y}}^{s}_{bg}(x, y)\big) + \mathbf{y}^{s}_{fg}(x, y)\, \log\big(\hat{\mathbf{y}}^{s}_{fg}(x, y)\big) \Big]. \qquad (4) $$

   Another loss term, $L_t = \frac{T}{\alpha}$, is also added to minimise the threshold $T$. The total
loss is hence evaluated as $L = L_{seg} + L_{reg} + L_t$.

2.4. The proposed SG-ADNet
Whereas the ADNet approach extracts one prototype representative of each class,
we are interested in extracting multiple foreground-only prototypes, while still excluding the
background. To achieve this, we take inspiration from [15] and design our proposed ASGM for
our framework. Figure 1 depicts the ASGM. The ASGM coupled with the ADNet constitutes
the proposed SG-ADNet. An overview of the framework is presented in Figure 2.
   Training of the proposed strategy is conducted in an end-to-end manner. The first steps
involve feature extraction (encoding) with a ResNet-101 backbone. The backbone feature
extractor with shared weights is used to embed support and query data into deep feature maps.
Then metric based learning is used to perform segmentation in the embedding space. The goal
is to best utilise the support information. In our approach, first masked GAP is performed
over all support foreground pixels to generate initial support prototypes. These prototypes are
then input to our ASGM along with the original support masks. The ASGM produces two new
prototypes, primary and auxiliary, based on the true positive and false negative pixels of the
predicted support masks, respectively. The primary prototype preserves the main support
information by gathering the True Positive (TP) predictions. The auxiliary prototype collects the ‘lost’
key information not accounted for by the primary prototype (the False Negative (FN) predictions),
which could not be predicted using the initial support prototype vector. Thus, by aggregating
both the primary and auxiliary prototypes, we aim to leverage more comprehensive information
that could be useful in segmenting the query images.
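   As a sketch, and under the assumption that the initial prediction is binarised at 0.5, the primary
and auxiliary prototypes can be formed from the TP and FN pixels roughly as follows (names are ours):

    # Sketch: primary and auxiliary prototypes from the TP and FN pixels of the
    # initial support prediction. The 0.5 binarisation and names are assumptions.
    import torch

    def asgm_prototypes(fs: torch.Tensor, y_fg: torch.Tensor,
                        y_pred_soft: torch.Tensor, eps: float = 1e-5):
        y_pred = (y_pred_soft > 0.5).float()
        tp_mask = y_fg * y_pred           # foreground covered by the initial prediction (TP)
        fn_mask = y_fg * (1.0 - y_pred)   # foreground missed by the initial prediction (FN)
        p_primary = (fs * tp_mask).sum(dim=(1, 2)) / (tp_mask.sum() + eps)
        p_auxiliary = (fs * fn_mask).sum(dim=(1, 2)) / (fn_mask.sum() + eps)
        return p_primary, p_auxiliary, tp_mask.sum(), fn_mask.sum()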
   After the incorporation of our proposed ASGM, a second threshold and a corresponding loss
term 𝐿𝑡2 are added to the overall loss function. Also, as we now have two query predictions,
one from the original and one from the self-guided prototype, we have two segmentation
losses, 𝐿𝑠𝑒𝑔1 and 𝐿𝑠𝑒𝑔2 respectively. Therefore, the new, complete loss term obtained is as
follows: 𝐿 = 𝐿𝑠𝑒𝑔1 + 𝐿𝑠𝑒𝑔2 + 𝐿𝑟𝑒𝑔 + 𝐿𝑡1 + 𝐿𝑡2 . Here, the 𝐿𝑡1 and 𝐿𝑡2 refer to threshold losses
corresponding to learned thresholds 𝑇1 and 𝑇2 for thresholding the anomaly scores of the
original and the new, self-guided prototypes, respectively. Also notable is that, after
Figure 1: Illustration of the ASGM. The support feature map $F^s$, the support foreground label $Y^s$ and
the original prototype extracted from the support are the inputs, from which two new prototypes are
produced: primary and auxiliary. The new prototypes are obtained by comparing the ground truth with
the prediction of the original prototype. In the illustration, the original prototype's predictions are
delineated in green and overlaid on the ground truth labels for comparison. Here, the primary prototype
encodes the true positive regions of the original prototype's prediction, and the auxiliary prototype
encodes the false negative regions.


the ASGM, for the term 𝐿𝑟𝑒𝑔 , we can utilise two predicted query masks (from the original single
prototype and the new, self-guided prototype approaches respectively) to compute prototypes
for support image segmentation. Note that we only consider the “new” prediction from the combined
prototypes and not the original prototype prediction for evaluation.
   We investigated different approaches to aggregate the information from the two prototypes
(primary and auxiliary). This is represented by the purple box in Figure 2. For our proposed
framework, we achieved the best segmentation performance with a weighted sum of the two
(primary and auxiliary) prototypes. With this approach, within the ASGM it was determined
whether the sums of all evaluated True Positive (TP) and False Negative (FN) pixels exceeded a
selected threshold value, 𝜏 . Based upon this, four different scenarios were possible, with four
different ASGM outputs; this is summarised in Figure 3.
   The ASGM outputs were determined for each of the four cases as follows (a code sketch of this
case logic is given after the list):
   1. Case 1: both sums of all TP and FN pixels are below 𝜏 . Output: original prototype as in
      [14].
   2. Case 2: sum of TP pixels is above 𝜏 but the sum of FN pixels is not. Output: primary
      prototype only.
   3. Case 3: sum of FN pixels is above 𝜏 but the sum of TP pixels is not. Output: auxiliary
Figure 2: The proposed self-guided, multi-prototype FSS framework with anomaly detection, SG-ADNet.
Stage I depicts the prototype extraction and stage II the prediction. Note that we only consider the
“new” prediction from the combined prototypes and not the original prototype prediction for
evaluation.


      prototype only.
   4. Case 4: sums of both TP and FN pixels are above 𝜏 . Output: weighted sum of primary
      and auxiliary prototypes. Here, the weights for the primary and auxiliary prototypes are
      $w_1 = \frac{TP}{TP + FN}$ and $w_2 = \frac{FN}{TP + FN}$, respectively.
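   A minimal sketch of this case logic is given below; the function and variable names, and the
handling of values exactly equal to 𝜏 , are our assumptions rather than the authors' implementation.

    # Sketch of the case-based ASGM output; n_tp and n_fn are the sums of TP and FN
    # pixels and tau is the pixel-count threshold (tau = 1 in our experiments).
    # Names and the boundary handling at tau are assumptions, not the authors' code.
    def asgm_output(p_orig, p_primary, p_auxiliary, n_tp, n_fn, tau=1.0):
        if n_tp <= tau and n_fn <= tau:   # Case 1: fall back to the original prototype
            return p_orig
        if n_tp > tau and n_fn <= tau:    # Case 2: primary prototype only
            return p_primary
        if n_tp <= tau and n_fn > tau:    # Case 3: auxiliary prototype only
            return p_auxiliary
        w1 = n_tp / (n_tp + n_fn)         # Case 4: weighted sum of primary and auxiliary
        w2 = n_fn / (n_tp + n_fn)
        return w1 * p_primary + w2 * p_auxiliary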


   We also investigated different settings where certain loss terms were excluded from the total
loss to determine the optimal configuration for our SG-ADNet. We found the optimal setting
to be as follows: 𝐿 = 𝐿𝑠𝑒𝑔1 + 𝐿𝑠𝑒𝑔2 + 𝐿𝑟𝑒𝑔 + 𝐿𝑡2 . Here, the 𝐿𝑡1 term is excluded as we do
not want to constrain the learning of the model too much. In addition, the threshold 𝑇1 and
the corresponding original prototype are not used in our final predictions. The term 𝐿𝑟𝑒𝑔 is
evaluated as the sum of two losses, obtained with the original and the combined predicted query
masks guiding the support image segmentation, respectively.


3. Experiments
Our framework is evaluated on the MS-CMRSeg (bSSFP fold) dataset from the MICCAI 2019
Multi-sequence Cardiac MRI Segmentation Challenge [17]. This dataset contains 35 clinical 3D
cardiac MRI scans with labels for the 3 classes Right-Ventricle (RV), Left-Ventricle blood-pool
Figure 3: Overview of proposed ASGM’s case-based prototype outputs.


(LV-BP) and Left-Ventricle myocardium (LV-MYO). This is a much-used benchmark dataset in
FSS for medical images.
   For all experiments conducted, self-supervised training is performed and evaluation is done
with a 5-fold cross-validation. Per fold, support images are sampled from one subject and the
rest are selected as query images.
   To account for the stochasticity in the model and optimization, each fold is repeated three times.
Therefore, per fold, the baseline model is trained three times, and each of these baseline models
is used to train a corresponding run of SG-ADNet.
   Some strategies other than a weighted sum to aggregate the information from the primary
and auxiliary prototypes included: stacking the prototypes and, similar to [15], concatenating
the anomaly scores of the primary and auxiliary prototypes with the query features, from which
a prediction was obtained using convolutional layers. However, as mentioned above, a weighted
sum of the two (primary and auxiliary) prototypes is a good choice in our experience. Investigations
were also performed with and without pre-training of the model with the single-prototype
method (ADNet) as prescribed in [14], which is considered the baseline method for this work. The
key parameters affected by pre-training included the learning rate and the model threshold values:
the threshold value learned with the single prototype (T) was used to initialise both thresholds
(T1 and T2) in the multi-prototype approach.
   Additionally, motivated by the successful utilisation of intermediate layer outputs in [6],
we explored using the outputs of different layers of the ResNet, i.e. the final layer output only, the
output of layer 4 (the penultimate layer) and a combination of layers 3 and 4 only. Ultimately, the
standard setting with final layer output only was adopted.
3.1. Implementation details
In the proposed SG-ADNet approach, a weighted sum of self-guided prototypes is used, 𝜏 is set
to 1 to avoid prototypes computed from only one pixel, 𝐿𝑟𝑒𝑔 is determined using both original
and new predictions, and 𝐿𝑡1 is excluded.
   The implementation of the proposed framework is based on the PyTorch implementation of
the ADNet [14]. The feature extractor backbone architecture chosen is ResNet-101 pre-trained
on MS-COCO [18]. The proposed approach is pre-trained with a baseline model as in ADNet
with a single prototype. We use a stochastic gradient descent optimiser with a momentum of
0.9. The proposed model is trained for 50 epochs, with its optimiser initialised with the learning
rate obtained at the end of pre-training the baseline model.

3.2. Evaluation metrics
We use two standard evaluation metrics, mean dice scores and Intersection over Union (IoU)
scores, as per prior approaches [7, 12, 8]. For two segmentation masks $A$ and $B$, the dice score is
defined as $\mathrm{Dice}(A, B) = \frac{2\|A \cap B\|}{\|A\| + \|B\|}$, and the IoU score can be expressed as
$\mathrm{IoU}(A, B) = \frac{\|A \cap B\|}{\|A \cup B\|}$. Both dice and IoU scores range from 0 to 1, with
0 indicating no and 1 indicating full overlap between A and B. The two metrics are comparable
and both are commonly used for evaluating Few-Shot approaches.
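   As a small sketch, both metrics can be computed for binary masks as follows (function names are ours):

    # Sketch of the Dice and IoU metrics for binary masks a and b (NumPy arrays of 0/1).
    import numpy as np

    def dice(a: np.ndarray, b: np.ndarray) -> float:
        inter = np.logical_and(a, b).sum()
        return 2.0 * inter / (a.sum() + b.sum())

    def iou(a: np.ndarray, b: np.ndarray) -> float:
        inter = np.logical_and(a, b).sum()
        return inter / np.logical_or(a, b).sum()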


4. Results
For quantitative results, we first report the mean dice and IoU scores obtained over 3 runs
performed for each of the 5 folds of MS-CMRSeg data. These are compared to the implementation
of the baseline approach in [14], with a single prototype. The results are summarized in Tables
1 and 2.
   From Tables 1 and 2, we observe that the proposed approach achieves somewhat higher
mean dice and IoU scores compared to the baseline approach. For the classes LV-BP and RV,
the results with SG-ADNet show an improvement over the ADNet. However, for the LV-MYO
structure the ADNet performs better. Thus, it appears that the incorporation of ASGM improves
performance for 2 classes for this dataset, namely RV and LV-BP, at the expense of performance
for LV-MYO. However, the SG-ADNet provides an increase in the overall mean dice and IoU
results over the three classes.
   Since the ADNet is the state-of-the-art method in the literature for this benchmark dataset,
we consider this to be promising.
   The results may indicate that for the classes RV and LV-BP, the foreground representation benefits
from the more fine-grained prototypes obtained by leveraging self-guidance, and that the opposite
effect observed on LV-MYO is not strong enough to hinder SG-ADNet from performing well on the
overall mean as compared to ADNet in this case.
   For the benefit of the reader, we include qualitative results from one representative example
image slice from the cardiac MRI dataset. This is illustrated in Figure 4. The image is visualised
at the mid-slice level. Each prediction made is visualised overlaid on the Ground Truth (GT)
labels. The predictions are outlined with a green border and filled in with a red colour. If the
Table 1
Dice scores for proposed and baseline approaches averaged over 3 runs for each of the 5 folds. Highest
dice scores are reported in bold lettering.

                         Method       LV-MYO       LV-BP     RV       Mean
                         ADNet         0.595        0.832   0.645     0.691
                        SG-ADNet        0.585      0.840    0.697     0.707

Table 2
IoU for proposed and baseline approaches averaged over 3 runs for each of the 5 folds. Highest IoU
scores are reported in bold lettering.

                         Method       LV-MYO       LV-BP     RV       Mean
                         ADNet         0.426        0.717   0.485     0.543
                        SG-ADNet        0.417      0.728    0.516     0.554


GT and prediction intersect, the colour observed is a pale red, and if the prediction exceeds
the GT (i.e. over-segmentation occurred) a dark red colour is seen. From Figure 4, for all 3
regions of interest, the proposed model's predictions resemble the GT labels more closely than the
predictions made with a single prototype. The proposed model's predictions appear to rectify
some of the over-segmentation in the predictions made with a single prototype. The evaluated dice
scores for RV, LV-BP and LV-MYO for this example image are 0.714, 0.883 and 0.631 respectively.
   For completeness, we provide further details and results of the investigations. These are
presented in the Appendix with the aim to: i) compare the different strategies we explore for
aggregating the data from the primary and auxiliary prototypes, and ii) examine the effect of
design choices such as pre-training, different final loss terms and the choice of hyperparameters
in our proposed model.


5. Conclusion
In this work, we have proposed SG-ADNet, a novel self-guided anomaly detection-inspired few-
shot segmentation framework utilising multiple foreground prototypes. We have particularly
investigated the proposed approach for cardiac MRI segmentation. Qualitative results showed
the effectiveness of the SG-ADNet in correcting over-segmentation, relative to the baseline, for
all segmented structures.
   Notably, the quantitative results showed improvements in dice and IoU scores for two classes,
RV and LV-BP, over the baseline approach. However, this improvement was at the expense
of a reduced score for the LV-MYO class, relative to the baseline. Further research should be
conducted to better understand these results and the effects of the self-guidance on the cardiac
segmentation performance (i.e. why the results improved only for RV and LV-BP while they
declined for LV-MYO). In future work, we aim to explore additional medical applications and
datasets to assess SG-ADNet’s full potential and will analyse how it generalises to other imaging
modalities.
Figure 4: Qualitative results on an example image slice from the cardiac MRI data. Top to bottom the
regions of interest are: RV, LV-BP and LV-MYO. From left to right within each row, the visualisations
are: i) The original image with the GT label overlaid. The GT label here is highlighted with a red colour,
ii) The GT label, iii) The prediction obtained with a single prototype overlaid on the GT label. The
prediction here is outlined with a green border and filled in with a red colour & iv) the prediction of
the proposed approach with multiple prototypes overlaid on the GT label. Again, the prediction here is
outlined with a green border and filled in with a red colour.


Acknowledgments
This work was supported by The Research Council of Norway (RCN), through its Centre for
Research-based Innovation funding scheme [grant number 309439] and Consortium Partners;
RCN FRIPRO [grant number 315029]; RCN IKTPLUSS [grant number 303514]; and the UiT
Thematic Initiative.
References
 [1] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image
     segmentation, in: International Conference on Medical image computing and computer-
     assisted intervention, Springer, 2015, pp. 234–241.
 [2] F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for
     volumetric medical image segmentation, in: 2016 fourth international conference on 3D
     vision (3DV), IEEE, 2016, pp. 565–571.
 [3] W. Sun, R. Wang, Fully convolutional networks for semantic segmentation of very high
     resolution remotely sensed images combined with dsm, IEEE Geoscience and Remote
     Sensing Letters 15 (2018) 474–478.
 [4] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, O. Ronneberger, 3d u-net: learning dense
     volumetric segmentation from sparse annotation, in: International conference on medical
     image computing and computer-assisted intervention, Springer, 2016, pp. 424–432.
 [5] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, Advances in
     Neural Information Processing Systems 2017-December (2017) 4078–4088.
 [6] C. Zhang, G. Lin, F. Liu, R. Yao, C. Shen, Canet: Class-agnostic segmentation networks
     with iterative refinement and attentive few-shot learning, in: Proceedings of the IEEE/CVF
     Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 [7] K. Wang, J. H. Liew, Y. Zou, D. Zhou, J. Feng, Panet: Few-shot image semantic segmentation
     with prototype alignment, in: Proceedings of the IEEE/CVF International Conference on
     Computer Vision (ICCV), 2019.
 [8] Y. Liu, X. Zhang, S. Zhang, X. He, Part-aware prototype network for few-shot semantic
     segmentation, Lecture Notes in Computer Science (including subseries Lecture Notes in
     Artificial Intelligence and Lecture Notes in Bioinformatics) 12354 LNCS (2020) 142–158.
     doi:10.1007/978-3-030-58545-7_9.
 [9] G. Li, V. Jampani, L. Sevilla-Lara, D. Sun, J. Kim, J. Kim, Adaptive prototype learning and
     allocation for few-shot segmentation, in: Proceedings of the IEEE/CVF Conference on
     Computer Vision and Pattern Recognition, 2021, pp. 8334–8343.
[10] Q. Yu, K. Dang, N. Tajbakhsh, D. Terzopoulos, X. Ding, A location-sensitive local prototype
     network for few-shot medical image segmentation, in: 2021 IEEE 18th International
     Symposium on Biomedical Imaging (ISBI), IEEE, 2021, pp. 262–266.
[11] F. Cermelli, M. Mancini, Y. Xian, Z. Akata, B. Caputo, A few guidelines for incremental
     few-shot segmentation, arXiv e-prints (2020) arXiv–2012.
[12] C. Ouyang, C. Biffi, C. Chen, T. Kart, H. Qiu, D. Rueckert, Self-supervision with superpixels:
     Training few-shot medical image segmentation without annotation, volume 12374 LNCS,
     Springer Science and Business Media Deutschland GmbH, 2020, pp. 762–780. doi:10.1007/
     978-3-030-58526-6_45.
[13] P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graph-based image segmentation, Interna-
     tional journal of computer vision 59 (2004) 167–181.
[14] S. Hansen, S. Gautam, R. Jenssen, M. Kampffmeyer, Anomaly detection-inspired few-shot
     medical image segmentation through self-supervision with supervoxels, Medical Image
     Analysis 78 (2022) 102385.
[15] B. Zhang, J. Xiao, T. Qin, Self-guided and cross-guided learning for few-shot segmentation,
     in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
     (CVPR), 2021, pp. 8312–8321.
[16] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot
     learning, Advances in neural information processing systems 29 (2016).
[17] X. Zhuang, Multivariate mixture model for cardiac segmentation from multi-sequence
     mri, in: International Conference on Medical Image Computing and Computer-Assisted
     Intervention, Springer, 2016, pp. 581–588.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick,
     Microsoft coco: Common objects in context, in: European conference on computer vision,
     Springer, 2014, pp. 740–755.



A. Appendix
The details of the experiments conducted to determine the optimal strategy for aggregating data
from the prototypes produced by our ASGM are presented here.
   First, the relative performances of the methods we considered for aggregating information
from the new primary and auxiliary prototypes obtained with our ASGM are summarised in
Table 3. The methods compared include: i) stacking the prototypes with 𝜏 = 1, ii) concatenating
the anomaly scores of the primary and auxiliary prototypes with the query features and passing
them through two 3 × 3 convolutional layers to obtain the prediction, and iii) the weighted sum of prototypes as
in our proposed approach. The reported metrics are dice scores for fold 1 of the MS-CMRSeg
cardiac MRI data. Our proposed strategy of using a weighted sum was chosen as it had an overall
good performance across all three classes. The proposed weighted sum approach had the
best performance, amongst the methods compared, for the LV-BP and RV classes and had the
second-best performance for LV-MYO compared to the baseline.
   Next, Table 4 summarises mean dice scores over 5 folds obtained with different combinations
of the final loss term and thresholds applied. In each investigated approach, the weighted
sum of primary and auxiliary prototypes as described previously was adopted. Three settings
of 𝐿𝑟𝑒𝑔 were explored: using the original prediction only, using the new (self-guided) prediction
only, and using both predictions. The effects of whether or not to: i) initialise the learning rate
with pre-training, ii) include 𝐿𝑠𝑒𝑔1 , iii) include 𝐿𝑡1 , iv) include 𝐿𝑡2 , and v) keep 𝑇1 fixed to the
pre-trained value were investigated.
   Comparing the dice score results reported, the most appropriate approach determined was:
the initialisation of learning rate with pre-training, 𝐿𝑟𝑒𝑔 evaluated with both original and new,


Table 3
Comparison of investigated strategies for aggregating data from primary and auxiliary prototypes for
fold 1. Highest dice scores in bold.
    Aggregation approach                                           LV-MYO      LV-BP      RV
    Stacked prototypes                                               0.527      0.760    0.641
    Concatenation of anomaly scores with predictions                 0.530      0.832    0.602
    Weighted sum of prototypes (proposed)                           0.595      0.861     0.706
Table 4
Investigation of different final loss terms for pre-trained, weighted sum prototype approach to determine
optimal configuration. The evaluation metric reported here is the dice score. Highest dice scores are
reported in bold lettering. The selected approach for the proposed scheme is shown in the last row.
                   Lseg1    𝑇1 fixed     Lt1    Lt2    LV-MYO     LV-BP        RV
                                                        0.600      0.857      0.703
                                ×         ×             0.595     0.861       0.706
                                ×                       0.605      0.852      0.698
                                ×         ×      ×     0.612       0.841      0.682
                     ×          ×                       0.583      0.860      0.710
                                ×         ×             0.595     0.861       0.706


self-guided predictions, the inclusion of 𝐿𝑠𝑒𝑔1 and 𝐿𝑡2 , excluding 𝐿𝑡1 , and not fixing 𝑇1 . This
selected approach is shown in the last row of Table 4.