<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A self-guided anomaly detection-inspired few-shot segmentation network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suaiba Amina Salahuddin</string-name>
          <email>suaiba.a.salahuddin@uit.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stine Hansen</string-name>
          <email>s.hansen@uit.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Srishti Gautam</string-name>
          <email>srishti.gautam@uit.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kampfmeyer</string-name>
          <email>michael.c.kampfmeyer@uit.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Jenssen</string-name>
          <email>robert.jenssen@uit.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Physics and Technology, UiT The Arctic University of Norway</institution>
          ,
          <addr-line>Tromsø NO-9037</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Standard strategies for fully supervised semantic segmentation of medical images require large pixel-level annotated datasets. This makes such methods challenging due to the manual labor required, and limits their usability when segmentation is needed for new classes for which data is scarce. Few-shot segmentation (FSS) is a recent and promising direction within the deep learning literature designed to alleviate these challenges. In FSS, the aim is to create segmentation networks with the ability to generalize based on just a few annotated examples, inspired by human learning. A dominant direction in FSS is based on matching representations of the image to be segmented with prototypes acquired from a few annotated examples. A recent method, the anomaly detection-inspired ADNet, computes only a single prototype, which captures the properties of the foreground segment. In this paper, the aim is to investigate whether the ADNet may benefit from more than one prototype to capture foreground properties. We take inspiration from the very recent idea of self-guidance, where an initial prediction of the support image is used to compute two new prototypes, representing the covered region and the missed region. We couple these more fine-grained prototypes with the ADNet framework to form what we refer to as the self-guided ADNet, or SG-ADNet for short. We evaluate the proposed SG-ADNet on a benchmark cardiac MRI data set, achieving competitive overall performance compared to the baseline ADNet and helping reduce over-segmentation errors for some classes.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical image segmentation</kwd>
        <kwd>Few-Shot</kwd>
        <kwd>Self-supervision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Significant advances have been made toward image classification and semantic segmentation
tasks by deep convolutional neural network-driven approaches such as U-Net, V-Net, FCN
and 3D U-Net [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. Standard fully supervised semantic segmentation strategies can
be impractical, particularly for medical images, as they require large pixel-level annotated
datasets which are expensive and time-consuming to acquire, needing considerable clinical
expertise. Furthermore, once trained, the models suffer from poor generalisability to classes not
encountered during training.
      </p>
      <p>
        Few-shot learning (FSL) is a promising way to address these challenges. Inspired by how
humans are able to distinguish a new concept with just a handful of examples, FSL seeks to learn
a model that uses only a single or few annotated samples to segment images from previously
unseen classes. Initially, FSL was applied to classification tasks. A key effort was presented
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This laid the basis for more recent applications to the more challenging
task of semantic segmentation. Most existing few-shot segmentation (FSS) techniques adopt
so-called prototypical learning [
        <xref ref-type="bibr" rid="ref10 ref11 ref6 ref7 ref8 ref9">6, 7, 8, 9, 10, 11</xref>
        ]. These approaches usually entail a two-branch
encoder-decoder architecture (support and query branches). Here, the support set refers to a set
of images with a few annotations of certain classes, which helps the model learn the desired
segmentation tasks. The query set refers to the set of images to be segmented, composed of one or more of the
same classes as the support. Typically, within a standard prototype-based FSS framework, the
support branch extracts class-wise prototypes from the support image, which then guide the
segmentation of the query image in the query branch. Usually, global average pooling (GAP) is
used to extract the prototypes, and the query image is segmented based on e.g. the cosine similarity
between query pixels and the prototypes in the embedding space.
      </p>
      <p>
        For the particular task of medical image segmentation by FSS, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] recently proposed
self-supervised training with supervoxel pseudo-labels to generate support and query sets for the
training phase. Supervoxels refer to groups of similar image voxels and were generated offline
as proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The idea is to use an unlabeled image slice and one of its random supervoxel
segmentations (as the foreground mask) to be the support image-label pair. Then the query
image-label pair is constructed by arbitrary transformations performed on the support data.
During testing, new classes are segmented using only a few annotated image slices.
      </p>
      <p>
        A key drawback of the prototypical FSS approaches is the loss of intra-class local information
after GAP. This is particularly disadvantageous in the context of medical images, where a large,
spatially variant background class contains all classes other than the foreground. Several
approaches have attempted to address this with additional prototypes learned class-wise. To
tackle this issue and boost segmentation accuracy, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] incorporated adaptive local prototype
pooling. With this strategy, local prototypes were evaluated within a local pooling window
overlaid over support data. Recently, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] postulated that the background volume characteristics
cannot be sufficiently represented by prototypes evaluated from only a few support image
slices. To overcome this limitation, they proposed a novel anomaly detection-inspired technique
(ADNet) whereby background prototypes are excluded and only one representative prototype is
extracted from the more homogeneous foreground class (i.e. an organ). Anomaly scores are then
computed between this foreground prototype and each query pixel to evaluate dissimilarity. In
this scheme, segmentation is then performed by thresholding the anomaly scores with a learned
threshold value. This approach also includes a novel 3D supervoxel-based self-supervision
scheme to leverage volumetric data from medical images.
      </p>
      <p>
        However, the foreground may be too complex to be modelled using a single prototype. This
was supported by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], showing that a single extracted support prototype carries insufficient
information to obtain accurate query segmentations, even when the same image is used
as both support and query. They argued that average pooling operations inevitably lose
useful information needed as support for some query pixels. They sought to overcome this
using a self-guided module (SGM), which first extracts a prediction for the labelled support
image using the original prototype; primary and auxiliary prototypes are then extracted
with masked GAP from the covered and uncovered foreground areas, respectively. The primary
and auxiliary support vectors are then combined to boost query segmentation
performance.
      </p>
      <p>
        Motivated by the findings of [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], in this work we propose a self-guided ADNet,
or SG-ADNet for short, which generates more fine-grained representations of the foreground
properties. In our SG-ADNet, we adopt an Adaptive Self-Guided Module (ASGM) which,
differing from [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], determines its primary and auxiliary prototype outputs based upon the number
of covered and uncovered foreground pixels. Our objective is to investigate the properties and
potential benefits of the proposed SG-ADNet, in which different prototypes (primary and auxiliary)
focus on different foreground regions of interest, to address the potential drawbacks of the
single-foreground-prototype ADNet framework.
      </p>
      <p>In summary, the main contributions of this work are:
1. We propose a novel framework, SG-ADNet, to help address limitations of the single
foreground-prototype-based ADNet. We leverage multiple foreground prototypes generated
using a novel self-guidance module, ASGM, to better account for foreground regions
“missed” by a single prototype.
2. We show that the SG-ADNet achieves competitive results relative to the ADNet baseline
whilst diminishing over-segmentation in cardiac segmentation tasks.</p>
      <p>In Section 2 we introduce FSS and our proposed SG-ADNet. Section 3 presents the experimental
setup and data used. Section 4 presents results for the experiments, highlighting properties of
the proposed method. Appendix A gives additional results. Finally, Section 5 concludes the
paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Self-guided anomaly detection-inspired few shot segmentation for medical images</title>
      <p>To present the SG-ADNet in context, we first review the FSS problem setting (Section 2.1),
briefly explain the self-supervision stage, which is important in FSS approaches to medical
image segmentation (Section 2.2), and then give an overview of the ADNet (Section 2.3). This
puts us in a position to explain how the principle of self-guidance is leveraged to propose the
novel SG-ADNet.</p>
      <sec id="sec-2-1">
        <title>2.1. The FSS problem setting</title>
        <p>
          In FSS, the model is typically trained on an annotated training set D_train with classes C_train, and
this trained model is then used to make predictions on a different test set, D_test, with new classes
C_test, for which only a few annotated examples are available. The training and testing are
typically performed episodically [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]; usually with an N-way-K-shot scheme, whereby there
are N different classes to be distinguished with K examples of each. An episode incorporates a
support set S and a query set Q for a particular class. The support set contains K annotated
images for each of the N classes and serves as input. The query set contains a query image containing
one or multiple of the N classes. The model learns about the N classes from the information in the
support set and then segments the query, outputting a predicted query mask with height and
width dimensions H and W, respectively.
        </p>
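As an illustration, the episodic sampling described above can be sketched as follows. This is a hypothetical 1-way-K-shot episode builder over 2D slices; `sample_episode` and its inputs are illustrative names, not part of the cited implementations.

```python
import numpy as np

def sample_episode(volumes, masks, cls, k=1, rng=None):
    """Assemble one 1-way-k-shot episode for class `cls`.

    volumes / masks: lists of 2D image slices and their label maps.
    Returns a support set of k (image, binary mask) pairs and one
    query (image, binary mask) pair for the same class.
    """
    rng = rng or np.random.default_rng(0)
    # indices of slices that actually contain the class of interest
    idx = [i for i, m in enumerate(masks) if (m == cls).any()]
    chosen = rng.choice(idx, size=k + 1, replace=False)
    support = [(volumes[i], (masks[i] == cls).astype(np.uint8))
               for i in chosen[:k]]
    q = chosen[k]
    query = (volumes[q], (masks[q] == cls).astype(np.uint8))
    return support, query
```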
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Supervoxel-based self supervision</title>
        <p>
          Due to the particular characteristics of medical images, [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and later [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] developed training
episodes for FSS by a self-supervised learning approach via supervoxels. In particular, we adopt
the 3D supervoxel approach put forward by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The motivation is to use 3D supervoxels,
which are collections of similar voxels from localised areas within the image volume, to sample
“pseudo”-labels for semantically uniform image locations that then guide training. Supervoxels
are computed offline using a 3D extension of the efficient, unsupervised, graph-based
image segmentation algorithm prescribed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          In training, each episode involves one unlabeled image volume along with one of its
randomly sampled supervoxel masks, representative of the foreground, to first yield a 3D binary
mask. Next, 2D slices including this supervoxel class are sampled from the image volume to
make up the support and query images. As in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we also apply random transformations to the
support/query images and exploit volumetric data across slices; this 3D approach makes
added information available compared to related 2D variants.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Anomaly detection-inspired FSS - ADNet</title>
        <p>As mentioned, in the ADNet approach the support and query images, leveraging self-supervised
learning (supervoxels), are embedded into features F^s and F^q, respectively. Within each episode
the focus is on the foreground class c only, and GAP is hence applied only for this class of
interest. To ensure masking can be performed, the support feature maps are resized to the same
dimensions as the masks (i.e. (H, W)).</p>
        <p>One original foreground prototype p ∈ R^d, with d indicating the dimensionality of the
embedding space, is extracted via masked GAP:
p = Σ_{x,y} F^s(x, y) ∘ y^s_fg(x, y) / Σ_{x,y} y^s_fg(x, y), (1)
where ∘ represents the Hadamard product and y^s_fg(x, y) = 1(y^s(x, y) = c) represents the binary
foreground mask. This support foreground prototype is input to our self-guided module, as
discussed later when presenting the SG-ADNet.</p>
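A minimal sketch of the masked GAP above, written in NumPy for readability (the cited implementation is in PyTorch):

```python
import numpy as np

def masked_gap(features, fg_mask):
    """Masked global average pooling: average the d-dimensional feature
    vectors over foreground pixels only, yielding one prototype in R^d.

    features: (H, W, d) support feature map, resized to the mask size
    fg_mask:  (H, W) binary foreground mask y_fg(x, y) = 1(y(x, y) = c)
    """
    m = fg_mask[..., None].astype(features.dtype)       # (H, W, 1)
    # Hadamard product with the mask, then normalise by the mask area
    return (features * m).sum(axis=(0, 1)) / np.maximum(m.sum(), 1.0)
```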
        <p>Furthermore, in the original ADNet, segmentation is based on the foreground prototype by
adopting a threshold-based metric learning scheme. This involves evaluating an anomaly score S
per query feature vector using a negative, scaled cosine similarity to the foreground
prototype p for that episode:
S(x, y) = −α cos(F^q(x, y), p). (2)</p>
        <p>
          Here, cos(F^q(x, y), p) represents the cosine similarity between F^q(x, y) and p, and the scaling factor α is set
to 20 as in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Then, using a learned variable T, the anomaly scores are thresholded to yield
the foreground mask prediction. A shifted sigmoid σ(·) is applied to perform soft thresholding,
which makes this procedure differentiable: ŷ^q_fg(x, y) = 1 − σ(S(x, y) − T). As a result, query
features with anomaly scores below the threshold T receive a foreground probability greater than
0.5. The background mask is ŷ^q_bg(x, y) = 1 − ŷ^q_fg(x, y).
        </p>
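The anomaly scoring and soft thresholding can be sketched as follows (NumPy; the scale α = 20 follows the text, while the use of a plain logistic sigmoid is an assumption about the exact shifted-sigmoid form):

```python
import numpy as np

def cosine_sim(F, p, eps=1e-8):
    """Cosine similarity between each query feature vector and prototype p.
    F: (H, W, d) query features, p: (d,) prototype."""
    num = F @ p
    den = np.linalg.norm(F, axis=-1) * np.linalg.norm(p) + eps
    return num / den

def predict_foreground(F, p, T, alpha=20.0):
    """Anomaly scores S = -alpha * cos(F, p), soft-thresholded with a
    shifted sigmoid: pixels with S < T get foreground probability > 0.5."""
    S = -alpha * cosine_sim(F, p)
    fg = 1.0 - 1.0 / (1.0 + np.exp(-(S - T)))   # 1 - sigmoid(S - T)
    return fg, 1.0 - fg                          # foreground, background
```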
        <p>Following this, the predicted foreground and background masks are upsampled to the same
dimensions as the images (H, W), and a binary cross-entropy loss for segmentation is
employed:
L_seg = −(1/HW) Σ_{x,y} [y^q_fg(x, y) log(ŷ^q_fg(x, y)) + y^q_bg(x, y) log(ŷ^q_bg(x, y))]. (3)</p>
        <p>
          As prescribed in earlier approaches [
          <xref ref-type="bibr" rid="ref12 ref14 ref7 ref8">7, 8, 12, 14</xref>
          ], a regularizing prototype alignment loss
is also adopted, whereby the roles of the support and query images are exchanged, i.e. the
predicted query mask guides segmentation of the support image:
L_align = −(1/HW) Σ_{x,y} [y^s_fg(x, y) log(ŷ^s_fg(x, y)) + y^s_bg(x, y) log(ŷ^s_bg(x, y))]. (4)
        </p>
        <p>Another loss term is also added to minimise the threshold: L_T = T. The total
loss is hence evaluated as L = L_seg + L_align + L_T.</p>
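Putting the pieces together, the losses of Eqs. (3)-(4) and the threshold term can be sketched as follows (NumPy; the direct form L_T = T is our reading of the text):

```python
import numpy as np

def bce(y_fg, yhat_fg, eps=1e-7):
    """Binary cross-entropy over foreground/background masks (Eqs. 3-4)."""
    yhat_fg = np.clip(yhat_fg, eps, 1.0 - eps)
    return -np.mean(y_fg * np.log(yhat_fg) + (1 - y_fg) * np.log(1 - yhat_fg))

def adnet_loss(y_q, yhat_q, y_s, yhat_s, T):
    """Total ADNet loss L = L_seg + L_align + L_T."""
    L_seg = bce(y_q, yhat_q)      # query segmentation loss, Eq. (3)
    L_align = bce(y_s, yhat_s)    # alignment loss with roles swapped, Eq. (4)
    L_T = T                       # encourages a low learned threshold
    return L_seg + L_align + L_T
```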
      </sec>
      <sec id="sec-2-4">
        <title>2.4. The proposed SG-ADNet</title>
        <p>
          Whereas the ADNet approach involves extracting one prototype representative of each class,
we are interested in extracting multiple foreground-only prototypes, while still excluding the
background. To achieve this, we take inspiration from [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] to design our proposed ASGM,
for our framework. Figure 1 depicts the ASGM. The ASGM coupled with the ADNet constitutes
the proposed SG-ADNet. An overview of the framework is presented in Figure 2.
        </p>
        <p>Training of the proposed strategy is conducted in an end-to-end manner. The first steps
involve feature extraction (encoding) with a ResNet-101 backbone. The backbone feature
extractor, with shared weights, embeds the support and query data into deep feature maps.
Metric-based learning is then used to perform segmentation in the embedding space, with the goal
of best utilising the support information. In our approach, masked GAP is first performed
over all support foreground pixels to generate the initial support prototype. This prototype is
then input to our ASGM along with the original support mask. The ASGM produces two new
prototypes, primary and auxiliary, based on the true positive and false negative pixels of the
predicted support mask, respectively. The primary prototype preserves the main support data
and gathers the True Positive (TP) predictions. The auxiliary prototype collects the ‘lost’ key
information not accounted for by the primary prototype, i.e. the False Negative (FN) predictions
that could not be predicted using the initial support prototype vector. Thus, by aggregating
both the primary and auxiliary prototypes we aim to leverage more comprehensive information
that could be useful in segmenting the query images.</p>
        <p>After the incorporation of our proposed ASGM, a second threshold and a corresponding loss
term L_T2 are added to the overall loss function. Also, as we now have two query predictions,
one from the original prototype and one from the self-guided prototype, we have two segmentation
losses, L_seg1 and L_seg2, respectively. The new, complete loss is therefore
L = L_seg1 + L_seg2 + L_align + L_T1 + L_T2. Here, L_T1 and L_T2 refer to threshold losses
corresponding to the learned thresholds T1 and T2 for thresholding the anomaly scores of the
original and the new, self-guided prototypes, respectively. Notably, after
the ASGM, the term L_align can utilise two predicted query masks (from the original single-prototype
and the new, self-guided prototype approaches, respectively) to compute prototypes
for support image segmentation. Note, we only consider the “new” prediction from the combined
prototypes, and not the original prototype prediction, for evaluation.</p>
        <p>We investigated different approaches to aggregate the information from the two prototypes
(primary and auxiliary). This is represented in the purple box in Figure 2. For our proposed
framework, we achieved the best segmentation performance with a weighted sum of the two
(primary and auxiliary) prototypes. With this approach, the ASGM determines whether the
sums of all evaluated True Positive (TP) and False Negative (FN) pixels exceed a
selected threshold value, τ. Based upon this, four different scenarios are possible, with four
different ASGM outputs; this is summarised in Figure 3.</p>
        <p>
          The ASGM outputs were determined for each of the four cases as follows:
1. Case 1: both the sum of TP pixels and the sum of FN pixels are below τ. Output: the original prototype as in
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
2. Case 2: the sum of TP pixels is above τ but the sum of FN pixels is not. Output: the primary
prototype only.
        </p>
        <p>3. Case 3: the sum of FN pixels is above τ but the sum of TP pixels is not. Output: the auxiliary
prototype only.
4. Case 4: both sums are above τ. Output: the weighted sum of the primary and auxiliary prototypes.</p>
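Under this four-case rule, the ASGM can be sketched as follows (NumPy; the equal 0.5/0.5 weighting in the both-above-threshold case is an illustrative assumption, since the text only specifies a weighted sum):

```python
import numpy as np

def asgm(features, gt_mask, pred_mask, tau=1):
    """Sketch of the adaptive self-guided module: split the support
    foreground into covered (TP) and missed (FN) pixels, pool a primary /
    auxiliary prototype from each, and select the output according to
    the four cases, thresholding the pixel counts at tau."""
    tp = (gt_mask == 1) & (pred_mask == 1)   # covered foreground
    fn = (gt_mask == 1) & (pred_mask == 0)   # missed foreground

    def pool(mask):
        # masked GAP over the given pixel set
        m = mask[..., None].astype(features.dtype)
        return (features * m).sum(axis=(0, 1)) / np.maximum(m.sum(), 1.0)

    if tp.sum() <= tau and fn.sum() <= tau:   # case 1: fall back to
        return pool(gt_mask == 1)             # the original prototype
    if tp.sum() > tau and fn.sum() <= tau:    # case 2
        return pool(tp)                       # primary prototype only
    if fn.sum() > tau and tp.sum() <= tau:    # case 3
        return pool(fn)                       # auxiliary prototype only
    return 0.5 * pool(tp) + 0.5 * pool(fn)    # case 4: weighted sum
```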
        <p>We also investigated different settings where certain loss terms were excluded from the total
loss, to determine the optimal configuration for our SG-ADNet. We found the optimal setting
to be L = L_seg1 + L_seg2 + L_align + L_T2. Here, the L_T1 term is excluded as we do
not want to constrain the learning of the model too much. In addition, the threshold T1 and
the corresponding original prototype are not used in our final predictions. The term L_align is
evaluated as the sum of two losses, obtained with the original and combined predicted query
masks guiding the support image segmentations, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        Our framework is evaluated on the MS-CMRSeg (bSSFP fold) dataset from the MICCAI 2019
Multi-sequence Cardiac MRI Segmentation Challenge [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This dataset contains 35 clinical 3D
cardiac MRI scans with labels for the 3 classes Right-Ventricle (RV), Left-Ventricle blood-pool
(LV-BP) and Left-Ventricle myocardium (LV-MYO). This is a much-used benchmark dataset in
FSS for medical images.
      </p>
      <p>For all experiments conducted, self-supervised training is performed and evaluation is done
with a 5-fold cross-validation. Per fold, support images are sampled from one subject and the
rest are selected as query images.</p>
      <p>To account for the stochasticity in the model and optimization, each fold is repeated thrice.
Therefore, per fold, the baseline model is repeated thrice and then each baseline model is used
to train three runs of SG-ADNet respectively.</p>
      <p>
        Some strategies other than a weighted sum to aggregate the information from the primary
and auxiliary prototypes included: stacking the prototypes and similar to [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], concatenating
the anomaly scores of the primary and auxiliary prototypes with the query features, on which
predictions were then made using convolutional layers. However, as mentioned above, a weighted sum of
the two (primary and auxiliary) prototypes is a good choice in our experience. Investigations
were also performed with and without pre-training of the model with the single-prototype
method (ADNet) as prescribed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], considered to be the baseline method for this work. The
key parameters affected by pre-training included the learning rate and the model threshold values;
the threshold value learned with the single prototype (T) was used to initialise both thresholds
(T1 and T2) in the multi-prototype approach.
      </p>
      <p>
        Note, additionally, motivated by successful utilisation of intermediate layer outputs in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
we explored using the outputs of different layers of the ResNet, i.e. the final layer output only, the
output of layer 4 (the penultimate layer), and a combination of layers 3 and 4 only. Ultimately, the
standard setting with the final layer output only was adopted.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Implementation details</title>
        <p>In the proposed SG-ADNet approach, a weighted sum of the self-guided prototypes is used, τ is set
to 1 to avoid prototypes computed from only one pixel, L_align is determined using both the original
and new predictions, and L_T1 is excluded.</p>
        <p>
          The implementation of the proposed framework is based on the PyTorch implementation of
the ADNet [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The feature extractor backbone architecture chosen is ResNet-101 pre-trained
on MS-COCO [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The proposed approach is pre-trained with a baseline model, as in ADNet,
with a single prototype. We use a stochastic gradient descent optimiser with a momentum of
0.9. The proposed model’s optimiser is initialised with the learning rate obtained at the end of
pre-training the baseline model for 50 epochs.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation metrics</title>
        <p>
          We use two standard evaluation metrics, mean dice scores and Intersection over Union (IoU)
scores as per prior approaches [
          <xref ref-type="bibr" rid="ref12 ref7 ref8">7, 8, 12</xref>
          ]. For two segmentation masks A and B, the dice score is
Dice(A, B) = 2|A ∩ B| / (|A| + |B|), and the IoU score is IoU(A, B) = |A ∩ B| / |A ∪ B|.
Both dice and IoU scores range from 0 to 1, with 0 indicating no overlap and 1 indicating full
overlap between A and B. The two metrics are comparable, and both are commonly used for
evaluating few-shot approaches.
        </p>
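Both metrics can be computed directly from binary masks:

```python
import numpy as np

def dice(a, b):
    """Dice(A, B) = 2|A ∩ B| / (|A| + |B|) for binary masks a, b."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def iou(a, b):
    """IoU(A, B) = |A ∩ B| / |A ∪ B| for binary masks a, b."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union
```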
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        For quantitative results, we first report the mean dice and IoU scores obtained over 3 runs
performed for each of the 5 folds of MS-CMRSeg data. These are compared to the implementation
of the baseline approach in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], with a single prototype. The results are summarized in Tables
1 and 2.
      </p>
      <p>From Tables 1 and 2, we observe that the proposed approach achieves somewhat higher
mean dice and IoU scores compared to the baseline approach. For the classes LV-BP and RV,
the results with SG-ADNet show an improvement over the ADNet. However, for the LV-MYO
structure the ADNet performs better. Thus, it appears that the incorporation of ASGM improves
performance for 2 classes for this dataset, namely RV and LV-BP, at the expense of performance
for LV-MYO. However, the SG-ADNet provides an increase in the overall mean dice and IoU
results over the three classes.</p>
      <p>Since the ADNet is the state-of-the-art method in the literature for this benchmark dataset,
we consider this to be promising.</p>
      <p>The results may indicate that for classes RV and LV-BP, the foreground representation benefits
from a more fine-grained, self-guided approach to prototypes, and that the opposite effect
observed on LV-MYO is not strong enough to hinder SG-ADNet from performing well on the
overall mean as compared to ADNet in this case.</p>
      <p>For the benefit of the reader, we include qualitative results from one representative example
image slice from the cardiac MRI dataset, illustrated in Figure 4. The image is visualised
at the mid-slice level. Each prediction made is visualised overlaid on the Ground Truth (GT)
labels. The predictions are outlined with a green border and filled in with a red colour. Where the
GT and prediction intersect, the colour observed is a pale red, and where the prediction exceeds
the GT (i.e. over-segmentation occurred) a dark red colour is seen. From Figure 4, for all 3
regions of interest, the proposed model’s predictions resemble the GT labels more closely than the
predictions made with a single prototype. The proposed model's predictions appear to rectify
some over-segmentation in the predictions made with a single prototype. The evaluated dice
scores for RV, LV-BP and LV-MYO for this example image are 0.714, 0.883 and 0.631, respectively.</p>
      <p>For completeness, we provide further details and results of our investigations in the Appendix,
with the aim to: i) compare the different strategies we explored for aggregating the data from the
primary and auxiliary prototypes, and ii) assess the effect of design choices such as pre-training,
different final loss terms and the choice of hyperparameters in our proposed model.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we have proposed SG-ADNet, a novel self-guided anomaly detection-inspired
few-shot segmentation framework utilising multiple foreground prototypes. We have particularly
investigated the proposed approach for cardiac MRI segmentation. Qualitative results showed
the effectiveness of the SG-ADNet in correcting over-segmentation, relative to the baseline, for
all structures segmented.</p>
      <p>Notably, the quantitative results showed improvements in dice and IoU scores for two classes,
RV and LV-BP, over the baseline approach. However, this improvement was at the expense
of a reduced score for the LV-MYO class, relative to the baseline. Further research should be
conducted to better understand these results and the effects of self-guidance on cardiac
segmentation performance (i.e. why the results improved for only RV and LV-BP while they
declined for LV-MYO). In future work, we aim to explore additional medical applications and
datasets to assess SG-ADNet’s full potential, and will analyse how it generalises to other imaging
modalities.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by The Research Council of Norway (RCN), through its Centre for
Research-based Innovation funding scheme [grant number 309439] and Consortium Partners;
RCN FRIPRO [grant number 315029]; RCN IKTPLUSS [grant number 303514]; and the UiT
Thematic Initiative.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <p>The details of the experiments conducted to determine the optimal strategy for aggregating data
from the prototypes produced by our ASGM are presented here.</p>
      <p>First, the relative performances of the methods we considered for aggregating information
from the new primary and auxiliary prototypes obtained with our ASGM are summarised in
Table 3. The methods compared include: i) stacking the prototypes with τ = 1, ii) concatenating
the anomaly scores of the query predictions achieved with each prototype (primary and auxiliary)
and passing them through two 3 × 3 convolutional layers, and iii) the weighted sum of prototypes as
in our proposed approach. The reported metrics are dice scores for fold 1 of the MS-CMRSeg
cardiac MRI data. Our proposed weighted-sum strategy was chosen as it had overall
good performance across all three classes: it had the best performance, amongst the methods
compared, for the LV-BP and RV classes, and the second-best performance for LV-MYO
compared to the baseline.</p>
      <p>Next, Table 4 summarises mean dice scores over 5 folds obtained with different combinations
of the final loss terms and thresholds applied. In each investigated approach, the weighted
sum of primary and auxiliary prototypes, as described previously, was adopted. Three settings
of L_align were explored: using the original prediction only, using the new prediction only, and using
both predictions. The effects of whether or not to: i) initialise the learning rate with pre-training, ii)
include L_seg1, iii) include L_T1, iv) include L_T2, and v) keep T1 fixed to its pre-trained value were
investigated.</p>
      <p>Comparing the dice score results reported, the most appropriate approach determined was:
initialisation of the learning rate with pre-training, L_align evaluated with both the original and
new predictions, and the L_T1 term excluded.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Milletari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-A.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <article-title>V-net: Fully convolutional neural networks for volumetric medical image segmentation</article-title>
          ,
          <source>in: 2016 fourth international conference on 3D vision (3DV)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with dsm</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Letters</source>
          <volume>15</volume>
          (
          <year>2018</year>
          )
          <fpage>474</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Çiçek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdulkadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Lienkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <article-title>3d u-net: learning dense volumetric segmentation from sparse annotation</article-title>
          ,
          <source>in: International conference on medical image computing and computer-assisted intervention</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>424</fpage>
          -
          <lpage>432</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Prototypical networks for few-shot learning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>4078</fpage>
          -
          <lpage>4088</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Liew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          , Panet:
          <article-title>Few-shot image semantic segmentation with prototype alignment</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Part-aware prototype network for few-shot semantic segmentation</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12354 LNCS</source>
          (
          <year>2020</year>
          )
          <fpage>142</fpage>
          -
          <lpage>158</lpage>
          . doi:10.1007/978-3-030-58545-7_9.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jampani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sevilla-Lara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Adaptive prototype learning and allocation for few-shot segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8334</fpage>
          -
          <lpage>8343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tajbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Terzopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>A location-sensitive local prototype network for few-shot medical image segmentation</article-title>
          ,
          <source>in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cermelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mancini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Caputo</surname>
          </string-name>
          ,
          <article-title>A few guidelines for incremental few-shot segmentation</article-title>
          , arXiv e-prints (
          <year>2020</year>
          ) arXiv-2012.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biffi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rueckert</surname>
          </string-name>
          ,
          <article-title>Self-supervision with superpixels: Training few-shot medical image segmentation without annotation</article-title>
          , volume
          <volume>12374</volume>
          LNCS, Springer Science and Business Media Deutschland GmbH,
          <year>2020</year>
          , pp.
          <fpage>762</fpage>
          -
          <lpage>780</lpage>
          . doi:10.1007/978-3-030-58526-6_45.
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Felzenszwalb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Huttenlocher</surname>
          </string-name>
          ,
          <article-title>Efficient graph-based image segmentation</article-title>
          ,
          <source>International journal of computer vision</source>
          <volume>59</volume>
          (
          <year>2004</year>
          )
          <fpage>167</fpage>
          -
          <lpage>181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jenssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kampfmeyer</surname>
          </string-name>
          ,
          <article-title>Anomaly detection-inspired few-shot medical image segmentation through self-supervision with supervoxels</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>78</volume>
          (
          <year>2022</year>
          )
          <fpage>102385</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <article-title>Self-guided and cross-guided learning for few-shot segmentation</article-title>
          ,
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2021</year>
          )
          <fpage>8312</fpage>
          -
          <lpage>8321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          , et al.,
          <article-title>Matching networks for one shot learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <article-title>Multivariate mixture model for cardiac segmentation from multi-sequence mri</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>581</fpage>
          -
          <lpage>588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco: Common objects in context</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>