<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visualization and Analysis of Transformer Attention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Calderaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giosué Lo Bosco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Rizzo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Vella</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, Università degli Studi di Palermo</institution>
          ,
          <addr-line>via Archirafi 34, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of High Performance Computing and Networking, National Research Council of Italy CNR</institution>
          ,
          <addr-line>via Ugo La Malfa 153, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The capability to select the relevant portion of the input is a key feature for limiting the sensory input and focusing on its most informative part. The transformer architecture is among the best-performing deep neural network architectures thanks to the attention mechanism. Attention allows relevant connections between portions of an image to be spotted and highlighted. Since the model is complex, it is not easy to determine which these connections and the important areas are. We discuss a technique to show these areas and highlight the regions most relevant for label attribution.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A first process “reads” raw data (such as words in an input sentence) and converts them into a
representation, with one feature vector associated with each word position. A second component
stores the reader’s output and can be considered a “memory” containing a sequence of facts.
A final process “exploits” the content of the memory to perform a sequential task. At each
time step, this process can put attention on the content of one or a few memory elements [3]
[4].</p>
      <p>The representation used in the first process can be derived from an encoder-decoder
architecture. The model represents the input in the embedding space and processes all the input
items. The list of these representations, coupled with the decoder’s hidden states, is used to
select which inputs will be used to generate the output. The input, the previous hidden states
and the encoded vectors are used to evaluate scores that indicate how well each input aligns with
the output. Typically, a softmax is used to normalize the scores and interpret them as weights.
The encoded vectors are scaled by the obtained weights and are used to generate a context
vector. This context is given to the decoder portion of the architecture and is used to generate
the output.</p>
      <p>According to Lindsay [5], “This type of artificial attention is thus a form of iterative
reweighting. Specifically, it dynamically highlights different components of a pre-processed input
as they are needed for output generation. This makes it flexible and context-dependent, like
biological attention”.</p>
      <p>We are interested in “selective” attention, which is the capability to focus on a limited portion
of the input, filtering out a huge quantity of details.</p>
      <p>The transformer, proposed by Vaswani et al. [6], evaluates an attention function with a
neural model and was originally used to build a context for the words to be translated. Here,
we consider the evolution of transformers used to process visual input, and we are interested in
visualizing the attention inside the image. In image classification, visualizing attention helps
to highlight the regions of the image that contributed to the production of a given label.
According to Vaswani et al. [6], the areas most significant for creating the context are those that
let the model associate the label with the input. By considering which details are the most
informative, the salient and relevant parts of the image can be identified, and the
trustworthiness of the classification can be assessed. If the classification is performed with attention
to details that a domain expert considers relevant, confidence in the model and its choices increases.
If the attention highlights the border of the image or homogeneous areas, it can be inferred
that overfitting may be present. The correct evaluation and comparison of attention is, therefore,
beneficial for assessing machine learning systems. Recent works on explainability [7], [8], [9],
[10] testify to the importance of understanding what is relevant in order to tune and assess the
classification results of machine learning models. Other works focus on the visualization
of relevant image areas with transformer models, such as [11], [12]. These works concentrate
on the attention flow across the network layers. We adopt an alternative stance: the
visualization of the last attention layer is very informative, since the activation of the
last layer is used to associate the final label with the given input. Section 2
describes the vision transformer, and Section 3 the technique we use to visualize attention. In
Section 4, the experimental part is described and some results are shown. Conclusions are drawn in
Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Vision Transformer</title>
      <p>The Vision Transformers (ViT) [13] are customized versions of the original Transformer [6].
The original architecture is an encoder-decoder designed for word sequence processing; its blocks,
each combining a self-attention layer with a multilayer perceptron (MLP), are stacked in layers,
and it is more accurate than traditional recurrent networks. It is characterized by the so-called
self-attention layer. The ViT (see Figure 1) inherits the transformer’s ability to take into account
the long-term relationships in the input data. It should be noted that the ViT relies solely on an
encoder that incorporates multi-head self-attention (MHA) and does not contain a decoder part.</p>
      <p>
        A tokenized input sequence $X = [x_1, \dots, x_N]$, where $x_i \in \mathbb{R}^D$, is the input for a
single self-attention layer and is used to compute a hidden representation $[h_1, \dots, h_N]$ by the
following formula:
        <disp-formula id="eq-1"><label>(1)</label><tex-math>h_i = [W_v x_1, \dots, W_v x_N]\,\mathrm{softmax}\left[\frac{(W_q x_i \cdot W_k x_1)}{\sqrt{D}}, \dots, \frac{(W_q x_i \cdot W_k x_N)}{\sqrt{D}}\right]</tex-math></disp-formula>
where $W_q$, $W_k$, $W_v$ denote the query, key, and value matrices, respectively, i.e., the
learning parameters of the self-attention layer. The hidden component $h_i$ is the relevance of
the token $x_i$ for generating a corresponding target. The MHA is (1) the parallel computation of
several single self-attentions, (2) the concatenation of the corresponding hidden representations,
and (3) their processing by an MLP. Applying the MHA enables us to derive different kinds of
relationships between tokens.
      </p>
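      <p>The single-head self-attention described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the identity projection matrices and the three toy tokens are invented for the example, and the function names are hypothetical.</p>
      <preformat>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens, w_q, w_k, w_v):
    """Return (hidden, weights) for one self-attention head.

    weights[i][j] is the softmax weight that token i assigns to token j,
    and hidden[i] is the corresponding weighted sum of the value vectors.
    """
    dim = len(tokens[0])  # token dimension used for the scaling factor
    queries = [matvec(w_q, x) for x in tokens]
    keys = [matvec(w_k, x) for x in tokens]
    values = [matvec(w_v, x) for x in tokens]
    hidden, weights = [], []
    for q in queries:
        row = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        weights.append(row)
        hidden.append([sum(w * v[d] for w, v in zip(row, values))
                       for d in range(len(values[0]))])
    return hidden, weights


# Toy example (assumed values): identity projections, three 2-D tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden, weights = self_attention(tokens, identity, identity, identity)
print([round(w, 3) for w in weights[0]])
```
      </preformat>
      <p>Each row of the returned weights sums to one, so it can be read as a distribution of relevance over the input tokens for the corresponding query.</p>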
      <p>Sequence processing suffers from the permutation invariance of self-attention, so a
positional encoding is used to spatially contextualize a symbol in a sequence with a relative and
an absolute position. In particular, a positional encoding vector is added to each element of
the tokenized input sequence $[x_1, \dots, x_N]$.</p>
      <p>In the context of ViT, an image X with $H$ rows, $W$ columns, and $C$ channels is tiled into a set of
$P \times P$ square patches, each one representing a token of the transformer input sequence. The
linearization of each $C$-channel $P \times P$ patch is a token $x_i$, and the length of the sequence is set
to be $N = \frac{H \times W}{P^2}$. Each layer of the ViT receives and returns vectors of the same dimension $D$. For
this reason, the embedding of the patches must project a linearized patch of length $P^2 \times C$ into
a vector of dimension $D$.</p>
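      <p>The tiling arithmetic can be checked with a short sketch. It assumes an image stored as nested Python lists (rows, columns, channels) and uses the 224 × 224 input size and 16 × 16 patches reported in the experimental section; the helper names are hypothetical.</p>
      <preformat>
```python
def patch_grid(height, width, patch):
    """Return (rows, cols) of the grid of square patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def tokenize(image, patch):
    """Split an image (rows x cols x channels nested lists) into flat tokens."""
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    tokens = []
    for r in range(rows):
        for c in range(cols):
            token = []
            for y in range(r * patch, (r + 1) * patch):
                for x in range(c * patch, (c + 1) * patch):
                    token.extend(image[y][x])  # append all channel values
            tokens.append(token)
    return tokens


# Dummy 224 x 224 RGB image (all zeros), used only to check the shapes.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = tokenize(image, patch=16)
print(len(tokens), len(tokens[0]))
```
      </preformat>
      <p>With these sizes the sketch yields 196 tokens of length 768 (16 × 16 × 3), which the patch embedding then projects to the model dimension.</p>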
      <p>The input sequence of the ViT is $Z_0 = [x_{class}, x_1 E, x_2 E, \dots, x_N E] + E_{pos}$, where
$E \in \mathbb{R}^{(P^2 \times C) \times D}$ is a learned embedding matrix, and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is the positional encoding
matrix. This input sequence starts with an additional component $x_{class}$ used to capture a
representation of the whole sequence, such as a weighted average of the tokens in the sequence.</p>
      <p>The distinguishing feature of ViT is an encoder portion made up of $L$ encoding blocks
stacked together, each consisting of an MHA followed by an MLP with one hidden layer. Layer
normalisation (LN) is applied to the input of each of these, and a residual connection is added at
the output. The MLP employs the Gaussian Error Linear Unit activation function. In formulas:
        <disp-formula id="eq-2"><label>(2)</label><tex-math>\begin{gathered} Z'_{\ell} = \mathrm{MHA}(\mathrm{LN}(Z_{\ell-1})) + Z_{\ell-1} \\ Z_{\ell} = \mathrm{MLP}(\mathrm{LN}(Z'_{\ell})) + Z'_{\ell} \\ \ell = 1, \dots, L \end{gathered}</tex-math></disp-formula>
      </p>
      <p>For image classification with deep models, transfer learning is generally advantageous, and it
becomes necessary for datasets with few images. In the case of ViT, a model pre-trained on a large
dataset is fine-tuned on a particular task using a dataset with fewer examples. An MLP head is
applied in both the pre-training and the fine-tuning phases. In each of the two cases, $Z_L^0$, i.e.
the vector corresponding to $x_{class}$ after the $L$ encoder blocks, is used as its input. One MLP is
pre-trained using a huge dataset, such as ImageNet. Another MLP, which returns a vector with the
same size as the number of classes, is used for fine-tuning.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Visualization and Analysis of the Attention</title>
      <p>The method discussed here is applied to perform the visualization of the attention.</p>
      <p>
        Since a ViT is composed of numerous concatenated blocks, the visualization of the attention
is complex. Considering that the final block has the highest level of abstraction,
only this block is considered in the visualization paradigm. There are $h$ separate attention heads
in each block, and each one evaluates $N + 1$ distinct attentions, one for each patch and one for
the token $x_{class}$. In the visualization, we decided to consider, for each head $j$, the softmax
attention $a_j$ obtained with $x_{class}$ as the query and the embeddings of the patches as the keys, so described:
        <disp-formula id="eq-3"><label>(3)</label><tex-math>a_j = \mathrm{softmax}\left[\frac{(W_q x_{class} \cdot W_k x_1)}{\sqrt{D}}, \dots, \frac{(W_q x_{class} \cdot W_k x_N)}{\sqrt{D}}\right]</tex-math></disp-formula>
where $W_q$ and $W_k$ are the query and key matrices of that head.
      </p>
      <p>
        One softmax attention vector is thus obtained for each attention head. To obtain a single
vector $a$ of $N$ components, one associated with each token and hence with each patch, the softmax
attention vectors of the heads are aggregated. We use different aggregation functions: the mean
and the maximum, and we propose the ratio between the mean and the standard deviation
defined in the following equation:
        <disp-formula id="eq-4"><label>(4)</label><tex-math>f(a) = \frac{\mathrm{mean}[a_1, a_2, \dots, a_h]}{\mathrm{std}[a_1, a_2, \dots, a_h]}</tex-math></disp-formula>
where $h$ is the number of heads. This function is useful to analyze the contribution of the
single heads. While the mean and the maximum typically show the result of the one head that has
produced the strongest response, the function in Eq. (4) shows the points where there is the maximum
agreement among all the heads. If all the heads produce the same or similar activation
values at a given point, the standard deviation is small and the ratio takes a
larger value. If the heads produce, for the same point, values spread over a large range, the
standard deviation is large, the ratio is low, and the point will not be evident in the final attention map.
      </p>
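      <p>The three aggregation functions can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the population standard deviation is used, a small constant <monospace>eps</monospace> is added to avoid division by zero, and the three head vectors are invented toy values.</p>
      <preformat>
```python
import statistics


def aggregate_heads(head_maps, mode="mean_over_std", eps=1e-8):
    """Combine per-head attention vectors into one vector, patch by patch.

    head_maps: list of h lists, each holding N attention weights.
    eps: small constant (an assumption of this sketch) avoiding division by zero.
    """
    combined = []
    for values in zip(*head_maps):  # values of all heads for one patch
        if mode == "mean":
            combined.append(statistics.fmean(values))
        elif mode == "max":
            combined.append(max(values))
        elif mode == "mean_over_std":
            mean = statistics.fmean(values)
            std = statistics.pstdev(values)  # population std (assumed choice)
            combined.append(mean / (std + eps))
        else:
            raise ValueError("unknown mode: " + mode)
    return combined


# Three hypothetical heads over four patches.
heads = [
    [0.30, 0.10, 0.50, 0.10],
    [0.30, 0.60, 0.05, 0.05],
    [0.31, 0.05, 0.05, 0.59],
]
ratio = aggregate_heads(heads)
strongest = aggregate_heads(heads, mode="max")
print(ratio.index(max(ratio)), strongest.index(max(strongest)))
```
      </preformat>
      <p>In this toy case the ratio peaks at the first patch, where the heads agree, whereas the maximum peaks at the second patch, where a single head responds strongly.</p>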
      <p>The attention map to be superimposed onto the input image of the ViT was created by
rearranging these $N$ values in a grid of $H/P$ rows and $W/P$ columns and scaling it to a resolution of $H \times W$
(see Figures 2, 3, and 4 for examples). To generate a sufficiently smooth map, bilinear interpolation
was employed in the resizing process, followed by a median filter with a structuring element of
radius $P/2$ and a Gaussian blur with $\sigma = P/4$.</p>
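      <p>The rearrangement and upscaling step can be sketched in plain Python. The sketch covers only the grid reshaping and a bilinear interpolation with aligned pixel centres (an assumed convention); the subsequent median filter and Gaussian blur are left out, and the ramp of attention values is invented for the example.</p>
      <preformat>
```python
def to_grid(values, rows, cols):
    """Rearrange a flat list of rows * cols attention values into a 2D grid."""
    if len(values) != rows * cols:
        raise ValueError("number of values does not match the grid size")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def bilinear_resize(grid, out_rows, out_cols):
    """Upscale a 2D grid with bilinear interpolation (pixel centres aligned)."""
    in_rows, in_cols = len(grid), len(grid[0])
    out = []
    for y in range(out_rows):
        # Map the output pixel centre back to source coordinates and clamp.
        sy = min(max((y + 0.5) * in_rows / out_rows - 0.5, 0.0), in_rows - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, in_rows - 1)
        fy = sy - y0
        row = []
        for x in range(out_cols):
            sx = min(max((x + 0.5) * in_cols / out_cols - 0.5, 0.0), in_cols - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, in_cols - 1)
            fx = sx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Hypothetical attention vector: a ramp over 196 patches (14 x 14 grid).
attention = [i / 195 for i in range(196)]
heat = bilinear_resize(to_grid(attention, 14, 14), 224, 224)
print(len(heat), len(heat[0]))
```
      </preformat>
      <p>The resulting map has the same resolution as the input image and can be blended with it as an overlay.</p>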
      <sec id="sec-3-1">
        <title>4. Experiments and Results</title>
        <p>For the experimental activities, we considered the models implemented in PyTorch Image
Models (a.k.a. Timm) [14], involving different kinds of transformers: the “Base” model (ViT-B)
introduced by [13], and the further configurations proposed in [15], namely the “Tiny” (ViT-T)
and “Small” (ViT-S) models. In particular, ViT-T has $5 \times 10^6$ parameters organized in 12 layers,
3 attention heads, $D = 192$, and adopts a one-hidden-layer MLP with 768 units; ViT-S has $22 \times 10^6$
parameters, 12 layers, 6 attention heads, $D = 384$, and an MLP with 1536 units in its single hidden
layer; ViT-B has $86 \times 10^6$ parameters, 12 layers, 12 attention heads, $D = 768$, and a one-hidden-layer MLP
with 3072 units. The images are rescaled to 224 × 224 to fit the input dimension of the vision
transformer. The number of patches is $N = 196$, so the size of a single patch
is 16 × 16. The experiments were carried out on three datasets to show the method’s
reliability across multiple domains: the ImageNet dataset [16], the FruitsGB dataset [17],
and the Cassava dataset [18].</p>
        <sec id="sec-3-1-1">
          <title>4.1. ImageNet</title>
          <p>The ImageNet dataset [16] is the collection of data for the Large Scale Visual Recognition
Challenge (ILSVRC), and it was proposed to evaluate algorithms for object detection and image
classification. A rationale behind this dataset is to provide a benchmark for researchers to compare
progress in detecting a wide range of objects. Since it is widely used, it serves to measure
the progress of large-scale image indexing in the tasks of retrieval and annotation [16]. All the
models from the Timm library [14] were pre-trained on the ImageNet dataset; for these experiments, we
added to the classic transformer architecture a further layer with 10 classes and fine-tuned the
network. A selection of images from the ImageNet dataset is shown in Figure 2.</p>
        </sec>
        <sec>
          <title>4.2. FruitsGB Dataset</title>
          <p>The Fruits Good/Bad (FruitsGB) dataset [17] comprises 12000 images of 12 different classes of
fruits: Bad Apple, Good Apple, Bad Banana, Good Banana, Bad Guava, Good Guava, Bad Lime,
Good Lime, Bad Orange, Good Orange, Bad Pomegranate, and Good Pomegranate. The dataset
is balanced: each class contains 1000 images. The images have a 256 × 256 resolution and were
acquired using a mobile phone’s rear camera with different angles, backgrounds, and lighting
[17].</p>
          <p>For the FruitsGB dataset, the ViT-tiny model, pre-trained on the ImageNet dataset, was employed.
The model was fine-tuned on the samples of the FruitsGB dataset for 20 epochs.
The Adam optimiser was chosen, with a learning rate of 0.001 and a minibatch size of
32. The average accuracy is 0.96. Some examples are shown in Figure 3.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>4.3. Cassava Dataset</title>
          <p>Cassava is a food crop grown by small-holder farmers in Africa because it is a source of
carbohydrates. It can be cultivated despite harsh conditions, and it is an important source of food.
These plants are affected by viral diseases that cause poor yields. The Cassava dataset [18] is
composed of 9,436 cassava leaf images affected by four diseases, annotated by experts
at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab
at Makerere University, Kampala. A fifth category for healthy leaves has been added. The
dataset was used for the Fine-Grained Visual Categorization workshop (FGVC6) at CVPR 2019.</p>
          <p>For the experiment with the Cassava dataset, the base transformer model from Timm [14] was
used. Training was performed with an SGD optimizer with a learning rate of 0.001 and
momentum 0.9. The number of iterations was set to 100. The final accuracy was 0.861.
Some examples of the attention for images of this dataset are shown in Figure 4.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>5. Conclusions</title>
        <p>A method for visualizing attention in transformers has been shown and discussed. This technique
shows the patterns most relevant to the deep model in the classification process, and it can be
useful for multiple purposes. The analysis of these areas allows an interested viewer to focus
on these regions and inspect them more deeply. It is also a good starting point, even
without knowledge of the specific domain, to assess whether the patterns found are textured and lie in
the central part of the image or fall, as happens in some cases, in homogeneous areas along the
image border. Multiple considerations about the relevant characteristics of the input samples
and the proper training of the model can be drawn with this methodology.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgments</title>
      <p>The authors acknowledge the contribution of Giuseppe Marino in integrating the software module
for attention and implementing the training and test procedures.</p>
      <p>[5] G. W. Lindsay, Attention in psychology, neuroscience, and machine learning, Frontiers in computational neuroscience 14 (2020) 29.</p>
      <p>[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).</p>
      <p>[7] S. Calderaro, G. Lo Bosco, R. Rizzo, F. Vella, Deep metric learning for transparent classification of covid-19 x-ray images, in: 2022 16th International Conference on Signal-Image Technology &amp; Internet-Based Systems (SITIS), IEEE, 2022, pp. 300–307.</p>
      <p>[8] C. Molnar, Interpretable machine learning, Lulu.com, 2020.</p>
      <p>[9] D. Amato, S. Calderaro, G. Lo Bosco, R. Rizzo, F. Vella, Metric learning in histopathological image classification: Opening the black box, Sensors 23 (2023). URL: https://www.mdpi.com/1424-8220/23/13/6003. doi:10.3390/s23136003.</p>
      <p>[10] S. Calderaro, G. Lo Bosco, R. Rizzo, F. Vella, Deep metric learning for histopathological image classification, 2022, pp. 57–64.</p>
      <p>[11] H. Chefer, S. Gur, L. Wolf, Transformer interpretability beyond attention visualization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 782–791.</p>
      <p>[12] S. Abnar, W. Zuidema, Quantifying attention flow in transformers, arXiv preprint arXiv:2005.00928 (2020).</p>
      <p>[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021. arXiv:2010.11929.</p>
      <p>[14] R. Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.</p>
      <p>[15] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers &amp; distillation through attention, in: International conference on machine learning, PMLR, 2021, pp. 10347–10357.</p>
      <p>[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.</p>
      <p>[17] V. Meshram, K. Thanomliang, S. Ruangkan, P. Chumchu, K. Patil, Fruitsgb: Top indian fruits with quality, 2020. URL: https://dx.doi.org/10.21227/gzkn-f379. doi:10.21227/gzkn-f379.</p>
      <p>[18] T. G. ErnestMwebaze, Cassava disease classification, 2019. URL: https://kaggle.com/competitions/cassava-disease.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>James</surname>
          </string-name>
          , The Principles of Psychology, Henry Holt and Company,
          <year>1890</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Itti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <article-title>Computational modelling of visual attention</article-title>
          ,
          <source>Nature reviews neuroscience 2</source>
          (
          <year>2001</year>
          )
          <fpage>194</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          , Deep Learning, MIT Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cristina</surname>
          </string-name>
          , What is attention?,
          <year>2022</year>
          . URL: https://machinelearningmastery.com/what-is-attention/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>