<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Similarity-based positional encoding for enhanced classification in medical images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Leonardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Portinale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Santomauro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Institute, DiSIT, Università del Piemonte Orientale</institution>
          ,
          <addr-line>V.le T. Michel 11, Alessandria, 15121</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces a novel similarity-based positional encoding method aimed at improving the classification of medical images using Vision Transformers (ViTs). Traditional positional encoding methods focus primarily on spatial information, but they may not adequately capture the complex geometric patterns characteristic of medical images. To address this, we propose a method that utilizes convolution operations to extract geometric features, followed by a similarity matrix based on cosine similarity between image patches. This encoding is then incorporated into the ViT model, enabling it to learn more meaningful relationships beyond basic spatial positioning. The effectiveness of this method is shown through experiments on six medical imaging datasets from MedMNIST, where our approach consistently outperforms the conventional learned positional encoding. This is particularly true in datasets with prominent geometric structures like PneumoniaMNIST and BloodMNIST. The results indicate that similarity-based encoding can significantly enhance medical image classification accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Image Classification</kwd>
        <kwd>Positional Encoding</kwd>
        <kwd>Vision Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Vision Transformers (ViTs) have revolutionized the field of computer vision, achieving state-of-the-art
performance on various tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A crucial component of these models is the positional encoding,
which provides spatial information to the otherwise position-agnostic self-attention mechanism [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
While initially designed to encode absolute or relative positions of image patches, recent studies suggest
that learned positional encodings in ViTs may be capturing more than just spatial locations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In this paper, we propose a novel perspective on learned positional encodings in Vision Transformers.
We argue that these encodings are not merely learning "positions" in the traditional sense, but rather
capturing high-level relationships and patterns within the visual data. This insight challenges the
conventional understanding of positional encodings and opens up new avenues for improving ViT
architectures. In particular, by considering complex medical images such as X-rays, histological or
dermatological ones, we can notice that particular diseases or findings are usually visible in the form of
geometric patterns, which can be properly extracted using convolutions.</p>
      <p>Building on this observation, we introduce a new encoding method based on similarity measures
between image patches, after applying convolution operations. Our approach leverages the inherent
structure and relationships within visual data, allowing the model to capture more meaningful
representations that go beyond simple spatial positions. We show that this similarity-based encoding leads
to improved performance and generalization across various computer vision tasks. In particular,
we demonstrate the effectiveness of our approach through extensive experiments on medical benchmark
datasets, showing improved performance compared to traditional positional encodings.</p>
      <p>This work not only advances our understanding of Vision Transformers but also paves the way for
more efficient and effective visual representation learning in deep neural networks applied to medical
image interpretation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works on positional encoding</title>
      <p>
        Positional encoding is a key component of the Transformer model, introduced by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with the purpose of
providing spatial information to the model, which is inherently permutation-invariant. There are
four main types of positional encoding:
• fixed positional encoding: fixed encodings that rely on predefined functions; for
example, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] uses sine and cosine functions and combines the encodings
of different frequencies to form the final positional encoding.
• learned positional encoding: introduced by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], it is the most widely used; it consists of a
random matrix of fixed dimensions combined with the patches and optimized during the training
phase through gradient descent.
• learned relative positional encoding (RPE): introduced by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], it focuses on the relative distances
between tokens, providing more flexible and adaptive encoding. In vision transformers, RPE can
help to better capture spatial correlations and context, especially for tasks like image segmentation,
where the relative positions between pixels are critical.
• task-specific positional encoding: for specific application tasks (such as medical image analysis),
standard encoding techniques may fail to capture crucial contextual information. In fact, for
medical image classification and segmentation some proposed methods have shown superiority
over classical methods. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors introduce a positional encoding that incorporates
volumetric tokens from 3D medical images such as CT and MRI scans, focusing on enhancing
segmentation through long-range contextual learning. Tang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] use a hierarchical
positional encoding adapted for volumetric medical images. The hierarchical structure improves the
transformer’s ability to capture both local and global features for medical image segmentation.
Wang et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose 3D inductive positional embeddings (3D IPE) that encode both relational
and absolute position information for 3D medical images. This encoding method allows the
transformer to capture context effectively in 3D segmentation tasks. Yu &amp; Triesch [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] propose
Circle Relationship Embedding (CRE), which simplifies positional encoding by utilizing the spatial
arrangement of image patches in a circular manner, improving the transformer’s performance on
medical images.
      </p>
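As an illustration of the first category above, the fixed sinusoidal encoding of [2] can be sketched in a few lines (a minimal NumPy sketch; the dimensions shown are illustrative, not those used in our experiments):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Fixed sin/cos positional encoding from 'Attention Is All You Need'."""
    positions = np.arange(n_positions)[:, None]                      # (n, 1)
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(positions * div)   # even dimensions: sine
    pe[:, 1::2] = np.cos(positions * div)   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(196, 64)           # one row per image patch
```

Each row is a fixed vector depending only on the patch index, so this encoding requires no training.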
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        The Vision Transformer (ViT) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a neural network architecture designed for computer vision tasks.
Figure 1 shows the ViT architecture. Unlike traditional CNNs, which process images using convolutional
layers, the Vision Transformer pre-processes input images through the following steps:
• patch extraction: the input image is divided into a grid of non-overlapping patches. Each patch
is typically a small square region of the image. For example, an image of size 224x224 pixels
with a patch size of 16x16 yields 196 patches (a 14x14 grid).
• flattening and linear projection: each patch is then flattened into a one-dimensional vector.
      </p>
      <p>This means that the spatial information within each patch is encoded into a linear sequence of
values.
An extra learnable parameter, called the class token, is prepended to the sequence of patch embeddings;
both the image patch projections and the class token have the same dimensionality.</p>
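The pre-processing steps above can be sketched as follows (a minimal NumPy sketch; the embedding dimension and random projection are illustrative stand-ins for the learned linear projection):

```python
import numpy as np

def extract_patches(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    n_h, n_w = h // patch_size, w // patch_size
    patches = image[:n_h * patch_size, :n_w * patch_size].reshape(
        n_h, patch_size, n_w, patch_size, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = extract_patches(image)              # (196, 768): a 14x14 grid of patches
d_model = 192                                 # illustrative embedding dimension
w_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
embeddings = patches @ w_proj                 # linear projection of each patch
cls_token = np.zeros((1, d_model))            # learnable class token (zero-init here)
tokens = np.vstack([cls_token, embeddings])   # (197, 192) sequence fed to the encoder
```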
      <p>The transformer encoder block in Vision Transformers (ViTs) is a fundamental component responsible
for processing input patches and capturing global dependencies within the image. It consists of several
layers, each composed of two main sub-modules: the multi-head self-attention mechanism and the
position-wise feed-forward neural network. In the multi-head self-attention mechanism, the input
patches are transformed into query, key, and value representations. These representations are linearly
projected to multiple attention heads, which independently compute attention scores capturing the
relationships between different patches. The attention scores are then weighted and combined to
generate context-aware representations for each patch. Simultaneously, the position-wise feed-forward
neural network applies a non-linear transformation to each patch’s representation, independently. This
transformation enhances the model’s ability to capture complex patterns and features within the input
patches.</p>
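A single attention head of the mechanism described above can be sketched as follows (a minimal NumPy illustration; real ViTs use multiple heads, learned projections and an output projection):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single attention head: pairwise scores between patches weight the values."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n, n) patch-to-patch weights
    return scores @ v                                 # context-aware representations

rng = np.random.default_rng(1)
tokens = rng.standard_normal((197, 64))   # class token + 196 patch embeddings
proj = lambda: rng.standard_normal((64, 64)) * 0.05
out = self_attention(tokens, proj(), proj(), proj())
```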
      <p>After processing through the attention mechanism and the feed-forward network, the output
representations of the patches are passed through residual connections and layer normalization, facilitating
stable training and improved gradient flow. Finally, the output of the transformer encoder blocks is fed
into subsequent layers or used directly for downstream tasks such as image classification. In the context
of image classification, our architecture feeds the class token into a classification
head, using the cross-entropy loss as the objective function:</p>
      <p>ℒ = − (1/N) ∑_{i=1}^{N} ∑_{j=1}^{C} y_{i,j} log(ŷ_{i,j})
(1)
where N is the number of samples, C is the number of classes, y_{i,j} is the true label of the i-th sample
for the j-th class (either 0 or 1) and ŷ_{i,j} is the predicted probability that the i-th sample belongs to the
j-th class.</p>
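Equation (1) can be checked numerically (a minimal NumPy sketch; the probabilities below are made-up examples, not model outputs):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy: -(1/N) * sum_i sum_j y_ij * log(yhat_ij)."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Two samples, three classes; each row of y_pred is a softmax distribution.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = cross_entropy(y_true, y_pred)   # -(ln 0.7 + ln 0.8) / 2
```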
      <sec id="sec-3-1">
        <title>3.1. Positional Encoding</title>
        <p>
          One of the crucial concepts in ViTs is the positional encoding. Positional encoding was introduced
in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to provide spatial or sequential information to the model; in fact, self-attention mechanisms are
inherently permutation-invariant, meaning they treat input elements as a set, without any inherent
order. However, both text and images have a logical ordering between words and patches, respectively.
Positional encoding is used to maintain this ordering between elements: in ViTs it helps the model
understand the 2D spatial relationships between image patches. This is crucial because the image is
split into patches and flattened before being processed.
        </p>
        <p>In section 2 we discussed the state-of-the-art for positional encoding. While initially designed to
encode absolute or relative positions of image patches, recent studies suggest that learned positional
encodings in ViTs may be capturing more than just spatial locations and outperform the fixed positional
encoding techniques. In the next section we propose a novel similarity-based encoding method that
explicitly models relationships between image patches through convolution.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed method</title>
      <p>Positional encodings are a class of methods which give spatial or sequential information to the
self-attention mechanisms used by Transformer models. Positional encodings are broadly studied in the
literature (see Section 2) for the standard Transformer model (i.e., the Transformer for natural language
processing). While there are few proposals for ViTs' positional encoding, most of them adopt
learned positional encoding, which seems to outperform the other techniques.</p>
      <p>Learned positional encoding allows the model to implicitly capture the relationships between patches
in a flexible way, adapting the encoding to the task at hand and discovering representations that go
beyond explicit spatial coordinates, such as semantic relationships and interactions that are spatially
informed. The learned positional encoding aims to capture how different parts of the image contribute
to the overall context, learning patterns such as "closeness" in pixel space or object part arrangements.
This can be crucial for tasks like classification and segmentation.</p>
      <p>In medical image analysis, however, geometrical patterns (such as shapes, boundaries, and specific
anatomical structures) often play a more crucial role than the general patch-to-patch correlation captured
by Vision Transformers (ViTs), which do not explicitly model such geometrical patterns. Based on this
observation, we propose a new method for positional encoding, based on convolution operations and
cosine similarity between extracted features. The overall ViT architecture does not need to be changed, and
one can exploit pre-trained models for a specific task and fine-tune just the positional encoding blocks.</p>
      <p>Let X represent the input image, divided into N patches; we embed each patch and we denote by
E ∈ R^{N×D} the matrix representing the patch embeddings extracted from the image X (N is the number
of patches and D is the dimensionality of each patch embedding). Let F ∈ R^{N×D} be the feature map
computed by convolutions on the image X. The goal is to compute a similarity matrix based on the
feature map, and to use this similarity matrix as a positional encoding.</p>
      <p>The cosine similarity between two vectors f_i and f_j, representing the i-th and j-th patch
feature vectors, is given by:
S_{ij} = (f_i · f_j) / (||f_i|| · ||f_j||)
(2)
where S_{ij} is the cosine similarity between patches i and j, f_i · f_j is the dot product
between the feature vectors of patches i and j, and ||f_i|| and ||f_j|| are the Euclidean norms
(magnitudes) of the vectors f_i and f_j. The complete similarity matrix S ∈ R^{N×N} for all
features is computed as:</p>
      <p>S = (F / ||F||_2) · (F / ||F||_2)^T
(3)
where F^T is the transpose of F and ||F||_2 normalizes each row of the feature matrix, so that
the cosine similarities are computed row-wise. Next, we apply a linear transformation to the similarity
matrix to map it to the same dimensionality as the original patch embeddings. Let us define this
transformation using a learnable weight matrix W ∈ R^{N×D}:
P_sim = S W
(4)
where P_sim ∈ R^{N×D} is the positional encoding matrix based on similarity between patches, S ∈ R^{N×N}
is the cosine similarity matrix and W ∈ R^{N×D} is a learnable projection matrix that transforms the
similarity information to match the dimensionality D of the patch embeddings.</p>
      <p>Since E ∈ R^{N×D} represents the patch embeddings, the positional encoding P_sim can be added to E to
produce the final input to the transformer:
E′ = E + P_sim
(5)
where E′ ∈ R^{N×D} is the modified patch embedding, which now includes the similarity-based positional
encoding.</p>
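The full encoding pipeline can be sketched as follows (a minimal NumPy illustration; random matrices stand in for the convolutional feature map and the learnable projection, which are trained in the real model):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 196, 64                      # number of patches, embedding dimension
E = rng.standard_normal((N, D))     # patch embeddings
F = rng.standard_normal((N, D))     # stand-in for the convolutional feature map

# Row-wise cosine similarity between patch feature vectors.
F_norm = F / np.linalg.norm(F, axis=1, keepdims=True)
S = F_norm @ F_norm.T               # (N, N) cosine-similarity matrix

# Project the similarities to the embedding dimension.
W = rng.standard_normal((N, D)) * 0.02   # learnable in the real model
P_sim = S @ W                            # (N, D) similarity-based encoding

# Add the encoding to the patch embeddings to form the transformer input.
E_prime = E + P_sim
```

Since the result has the same shape as the patch embeddings, it can replace the learned positional encoding without any other change to the ViT architecture.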
      <p>The learned convolutional filters extract geometrical information from the input images, which is
essential in medical image analysis. By using these geometrical features as encoding, we
help the Transformer architecture learn patterns which can be useful for the specific task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <p>In order to compare learned positional encoding (which is the state-of-the-art encoding for ViTs) with
our proposed similarity-based encoding, we have defined an architecture with fixed hyperparameters
(e.g., number of transformer layers, number of heads, embedding dimension, etc.) and we have trained the
network on specific medical datasets with both encoding techniques, comparing the results.</p>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>
          We use six different datasets which are part of MedMNIST[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ][
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a collection of standardized,
lightweight, and preprocessed biomedical image datasets designed for evaluating machine learning
algorithms, especially in the medical field. MedMNIST includes several image classification tasks, primarily
focusing on diverse types of medical data such as internal pathology, dermatology, ophthalmology, and
more. It is inspired by the well-known MNIST dataset (a handwritten digit classification benchmark),
but tailored for biomedical images. The images in MedMNIST can be downloaded in different sizes, and
we chose the 224x224 format to remain consistent with the standard input size of the ViT introduced in
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Specifically, we used:
• PneumoniaMNIST, a dataset of chest X-ray images, specifically designed to detect and classify
pneumonia. The dataset was created by extracting data from the NIH’s ChestX-ray14 database,
which contains X-rays labeled with various pulmonary conditions. Task: binary classification.
• PathMNIST, a dataset based on pathology images of breast cancer tissue slides, collected from
the CAMELYON16 challenge dataset. The objective is to classify different subtypes of breast
cancer from microscopy images. Task: multi-class classification.
• DermaMNIST, consisting of dermatology images aimed at classifying different types of skin
lesions. This dataset is derived from the HAM10000 dataset, which is commonly used for the
classification of skin diseases, such as melanomas and benign lesions. Task: multi-class classification.
• BreastMNIST, a dataset derived from ultrasound images for the classification of breast cancer.
        </p>
        <p>The images represent diferent conditions, including benign, malignant, and normal tissue. Task:
binary or multi-class classification.
• TissueMNIST, a dataset that focuses on histological images from human tissue samples. It
contains tissue images from various organs, extracted from the HuBMAP dataset. Each image
represents a section of tissue that can be classified into different cell types. Task: multi-class
classification.
• BloodMNIST, a dataset of blood smear images for the classification of blood cells. It is derived
from the Atlas of Blood dataset and contains images of different blood cell types, which is useful
for tasks related to hematology. Task: multi-class classification.</p>
        <p>Figure 2: example images from (a) BloodMNIST, (b) PneumoniaMNIST and (c) PathMNIST.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>In this section we compare the results obtained with the standard learned positional
encoding and with our similarity-based positional encoding. Table 1 shows these results in terms of accuracy. It can
be seen that the proposed similarity positional encoding leads to better accuracy on every
benchmark dataset we have used.</p>
        <p>For BloodMNIST, PneumoniaMNIST and PathMNIST the accuracy gap between our positional
encoding and the learned positional encoding is larger than for the other datasets. This may be explained by
the fact that these datasets contain more geometrical structure than the others; since our positional
encoding is designed to give the model the ability to recognize geometric shapes, this behaviour is
consistent with our hypotheses. Figure 2 shows an example image from BloodMNIST (2a),
PneumoniaMNIST (2b) and PathMNIST (2c). In contrast, on datasets such as DermaMNIST, which
lack such geometrical features, the resulting performance is slightly improved but overall
comparable.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and future works</title>
      <p>In this work we have proposed a new technique for positional encoding in medical image analysis.
Starting from the rationale that medical images contain geometrical structures, we have developed
a new encoding technique that employs convolutions to extract geometrical information from the
input images and computes cosine similarity between the extracted features, so that this
information can be used as positional encoding. We have shown that, in terms of accuracy, this new positional
encoding outperforms the standard learned positional encoding, especially on images featuring geometric
structures.</p>
      <p>As future work we want to focus on two different tasks:
• comparing the attention masks generated by the two different encodings, in order to evaluate whether the
explainability of the model increases when geometrical features are introduced;
• comparing our positional encoding with the other proposed methods listed in Section 2.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Andrea Santomauro’s PhD research is co-financed by ARLANIS REPLY. We also acknowledge the use
of the Chameleon Cloud testbed (https://chameleoncloud.org/) that has allowed us to perform
all the experiments described in the present work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2010.11929. arXiv:2010.11929.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Murtadha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Roformer: Enhanced transformer with rotary position embedding</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2104.09864. arXiv:2104.09864.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Deberta: Decoding-enhanced bert with disentangled attention</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2006.03654. arXiv:2006.03654.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Press</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Train short, test long: Attention with linear biases enables input length extrapolation</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2108.12409. arXiv:2108.12409.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <article-title>Self-attention with relative position representations</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1803.02155. arXiv:1803.02155.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <article-title>Rethinking and improving relative position encoding for vision transformer</article-title>
          , in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV),
          <year>2021</year>
          , pp. 10013-10021. doi:10.1109/ICCV48922.2021.00988.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Unetformer: A unified vision transformer model and pre-training framework for 3d medical image segmentation</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2204.00631. arXiv:2204.00631.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Landman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatamizadeh</surname>
          </string-name>
          ,
          <article-title>Self-supervised pre-training of Swin transformers for 3d medical image analysis</article-title>
          ,
          <source>in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>20698</fpage>
          -
          <lpage>20708</lpage>
          . doi:10.1109/CVPR52688.2022.02007.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Multi-scale hierarchical transformer structure for 3d medical image segmentation</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1542</fpage>
          -
          <lpage>1545</lpage>
          . doi:10.1109/BIBM52615.2021.9669799.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Triesch</surname>
          </string-name>
          ,
          <article-title>CRE: Circle relationship embedding of patches in vision transformer</article-title>
          ,
          <source>ESANN 2023 proceedings</source>
          (
          <year>2023</year>
          ). doi:10.14428/esann/2023.ES2023-75.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <article-title>MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis</article-title>
          ,
          <source>in: IEEE 18th International Symposium on Biomedical Imaging (ISBI)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <article-title>MedMNIST v2 - a large-scale lightweight benchmark for 2d and 3d biomedical image classification</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>41</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>