VisualT5: Multitasking Caption and Concept Prediction with Pre-trained ViT, T5 and Customized Spatial Attention in Radiological Images
Notebook for the Medical Image Computing Lab at CLEF 2024

Diedre Carmo1,*, Letícia Rittner1 and Roberto Lotufo1
1 School of Electrical and Computer Engineering, Universidade Estadual de Campinas, Campinas, Brazil

Abstract
The development of more explainable and general deep learning-based predictive and generative models is of interest to the medical image processing field, largely due to the “black box” and often specialized nature of current models. This paper describes our participation in the ImageCLEF Caption Prediction and Concept Detection challenges with a multitasking, multimodal and explainable architecture named VisualT5. VisualT5 couples the embedding power of a frozen pre-trained Vision Transformer (ViT) with the clinical text generation capabilities of the pre-trained ClinicalT5. Moreover, we propose a modified spatial attention module that weights our visual encoder features in the token dimension, showcasing the spatial importance of each ViT token and permitting more interpretability regarding which parts of the image have more impact on the model’s conclusions. VisualT5-base-clinical, as a single multitasking model, achieved 0.61 BERTScore and 0.58 F1-score in the caption prediction and concept detection tasks, respectively, ranking 6/11 in the caption leaderboard and 6/9 in the concept leaderboard.

Keywords
vision transformer, t5, image captioning, image classification, medical imaging

1. Introduction

The success of deep learning for the creation of predictive and generative models is evident [1, 2], both in academic research and, more recently, in integration into real products such as ChatGPT [3] and other platform-based LLMs [4]. Deep learning models have also been applied to medical image classification and caption generation [5]. However, the translation of such models to real applications in medicine is lagging behind, due to the complex nature of medical diagnosis and related signal processing. Some research has raised potential problems of bias and other factors that make many deep learning-based methods unfeasible to translate into real clinical practice [6, 7].

Medical information that leads to a diagnosis or disease understanding is presented in many modalities: different types of image acquisitions, structured and free text, and even 1D signals such as electrocardiograms. Moreover, the tasks involved in the pipeline of medical processes cannot be summarized into isolated academic tasks such as direct image classification, segmentation, or caption generation. Finally, explainability of the key factors that led to decision making is paramount in the medical field [8]. This context has led current research to consider multimodality [9], multitasking [10], and explainability [11] as important aspects of automated medical image processing. In terms of model architecture, current approaches for medical image classification mostly consist of CNNs with fully connected layers or the vision transformer, a state-of-the-art transformer for image classification [12].
In the context of image-to-caption generation, three methodologies are commonly used: encoder-decoder models, where an encoder generates image features which are decoded into text either by LSTMs or transformers [5]; visual language models, where transformer input tokens mix ViT-like image representations with text tokens [13]; and finally CLIP-like approaches, where image and text embeddings are aligned during training, and that alignment is later used to perform various multimodal tasks [14].

In this preliminary work, we explore multitasking medical image classification and caption generation over various modalities of radiological images from two ImageCLEF [15] challenges at the same time: the medical imaging caption prediction and concept detection challenges [16]. Our participation in these challenges is a first step into exploring multitasking, multimodality and explainability in medical image processing for better generalization and usability in practice. Our proposal involves an encoder-decoder model marrying strong image representations from pre-trained ViT models with pre-trained T5 as a text decoder for caption generation, including innovative uses of spatial attention to promote visual explainability.

2. Methodology

The proposed VisualT5 is an image-to-text encoder-decoder architecture coupling a vision transformer with an encoder-decoder T5 text transformer. VisualT5 is trained and evaluated using the ImageCLEF dataset, ROCOv2.

2.1. Dataset

Radiology Objects in COntext version 2 (ROCOv2) [17] is the main dataset used by both ImageCLEF challenges, providing caption prediction and concept detection labels. In summary, the authors of the dataset used a semi-automatic pipeline to extract valid caption and radiological image pairs from publicly available medical papers. In this year’s version of the dataset, the training set consists of 70108 radiology images, with 9972 more for validation and 17237 for testing, with testing labels hidden from the participants. Concept classification is multilabel, and the primary ground truth in the challenge uses concepts automatically extracted from captions, represented by 1934 Unified Medical Language System [18] Concept Unique Identifiers (CUIs). In addition, concepts are reduced to a manually curated subset containing only modality and body region CUIs for a secondary evaluation.

2.2. Architecture

In VisualT5 (Fig. 1), the frozen pre-trained ViT encoder from MEDSam [19, 21] is used to generate latent representations. To use its ViT-base [12] architecture, images are bilinearly interpolated to 1024x1024 while keeping the aspect ratio with zero padding, using the provided image processing pipeline1. The resulting embedding, of shape [1, 4096, 768] with batch size 1, reveals a hidden size of 768 and a 16x16 patch size, given that the sequence length is 4096, the number of 16x16 patches that fit in a 1024x1024 image. The last hidden state of the same shape is used as a latent representation and weighted by a modified spatial attention mechanism.
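To make the preprocessing and the token arithmetic concrete, the sketch below shows one way to perform aspect-ratio-preserving bilinear resizing with zero padding to 1024x1024 in PyTorch. This is only a hedged approximation of MEDSam's provided pipeline, not the original code; the function name and defaults are ours.

import torch
import torch.nn.functional as F


def resize_and_pad(image: torch.Tensor, target: int = 1024) -> torch.Tensor:
    """Approximate preprocessing: image is [3, H, W], output is [1, 3, 1024, 1024]."""
    _, h, w = image.shape
    scale = target / max(h, w)                       # keep the aspect ratio
    new_h, new_w = round(h * scale), round(w * scale)
    resized = F.interpolate(image[None], size=(new_h, new_w),
                            mode="bilinear", align_corners=False)
    # zero-pad the shorter side so the output is exactly 1024x1024
    return F.pad(resized, (0, target - new_w, 0, target - new_h), value=0.0)


# A 16x16 patch size on a 1024x1024 input yields (1024 / 16)^2 = 64 * 64 = 4096 tokens,
# hence the [1, 4096, 768] last hidden state with ViT-base's hidden size of 768.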
Instead of using convolutional layers as in the 2D spatial attention of Górriz et al. [20], multiple linear layers with bias and LeakyReLU non-linear activations are used in the same fashion to compress the 768 hidden size into a single-channel array of 4096 sigmoid-activated values. Given that each of the 4096 values corresponds to one of the 64x64 patches, these values are used to weight (multiply) the contribution of each token, i.e., the importance of each region of the input image. These 4096 values can be visualized as a heatmap after reshaping to 64x64 and bilinear interpolation to 1024x1024. Finally, the weighted latent space is used as the visual encoder features for the subsequent tasks.

For concept detection, the visual encoder features are averaged in the sequence dimension and projected through a trained linear layer into 1934 sigmoid-activated neurons for multilabel concept detection, with each output neuron representing a CUI. The corresponding CUI strings are included in the prediction based on a multilabel activation threshold of 0.5.

1 https://huggingface.co/flaviagiammarino/medsam-vit-base

Figure 1: Overview of our final and best-performing architecture, VisualT5-base-clinical. The Vision Encoder is frozen with MEDSam’s [19] ViT-base weights. Its last hidden state is weighted by our implementation of spatial attention [20] and used as visual encoder features, from which both concept detection and caption prediction are derived. The T5 text decoder is initialized with ClinicalT5 weights.

At the same time, for caption prediction, pre-trained ClinicalT5-base [22, 23] is used, including its text encoder, decoder and tokenizer. Note that the text encoder and decoder are not frozen, and are adjusted by our training. Our visual encoder features replace the input embeddings of ClinicalT5’s text encoder. We reduce the sequence length from 4096 to 128 using average pooling, due to limited GPU memory. To continue training and promote the alignment of our visual encoder features as input embeddings for ClinicalT5, we follow the original T5 training procedure [24]. Text generation during evaluation consists of computing the visual encoder features once, followed by a Seq2Seq greedy decoding strategy with a 128-token maximum sequence length. Attempts at using only the ClinicalT5 decoder, with visual encoder features as "encoder outputs" for T5’s encoder-decoder attention, resulted in degraded quantitative performance with little computational efficiency benefit. With the multilabel concept detection and the generated caption both derived from the same visual encoder features, VisualT5 performs both tasks with a single model and is trained on both simultaneously.

2.3. Implementation Details

For implementation we used the Hugging Face Transformers library, PyTorch [25] and PyTorch Lightning [26]. MEDSam’s ViT pre-trained weights were also sourced from Hugging Face1.
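A condensed sketch of the modules described above is given below. It is not the authors' released code: the class names are ours, the generic t5-base checkpoint stands in for ClinicalT5-base, and loss weighting details are omitted. It only illustrates how the spatial attention, the concept head, and the pooled visual embeddings feeding the T5 text encoder fit together.

import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration


class SpatialAttention(nn.Module):
    """Compress the 768-dim tokens into one sigmoid weight each (a 64x64 heatmap after reshaping)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, 384), nn.LeakyReLU(),
            nn.Linear(384, 192), nn.LeakyReLU(),
            nn.Linear(192, 1), nn.Sigmoid(),
        )

    def forward(self, feats):                               # feats: [B, 4096, 768]
        weights = self.net(feats)                           # [B, 4096, 1]
        return feats * weights, weights                     # token-weighted features + heatmap values


class VisualT5Sketch(nn.Module):
    def __init__(self, n_concepts: int = 1934, text_seq_len: int = 128):
        super().__init__()
        self.spatial = SpatialAttention()
        self.concept_head = nn.Linear(768, n_concepts)      # multilabel CUI logits
        self.pool = nn.AdaptiveAvgPool1d(text_seq_len)      # sequence compression 4096 -> 128
        self.t5 = T5ForConditionalGeneration.from_pretrained("t5-base")  # ClinicalT5-base in the paper

    def forward(self, vit_feats, caption_labels=None):      # vit_feats: frozen ViT last hidden state [B, 4096, 768]
        feats, attn_map = self.spatial(vit_feats)
        concept_logits = self.concept_head(feats.mean(dim=1))          # average over the sequence dimension
        embeds = self.pool(feats.transpose(1, 2)).transpose(1, 2)      # [B, 128, 768] input embeddings
        t5_out = self.t5(inputs_embeds=embeds, labels=caption_labels)  # T5 cross-entropy when labels are given
        return concept_logits, t5_out, attn_map

At inference, captions could be produced with something like model.t5.generate(inputs_embeds=embeds, max_new_tokens=128) for greedy decoding, and the predicted CUIs read off wherever torch.sigmoid(concept_logits) exceeds the 0.5 threshold.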
Note that acquiring ClinicalT5-base weights for T5-base weight initialization required credentialing and ethics training through the PhysioNet platform [23]2. For validation, we employed the evaluation code provided by the ImageCLEF organizers, which computes BERTScore [27] and ROUGE [28] for caption prediction and multilabel F1-scores for concept detection. BERTScore and the F1-score over all provided concepts were the primary metrics used by the challenge organizers for ranking. During their test evaluation, they added and reported additional metrics [16]. Both tasks are optimized at the same time using an RTX 4090 24 GB GPU, with a batch size of 5, the AdamW optimizer with a 1e-5 initial learning rate and 1e-5 weight decay, training for up to 100 epochs with an early stopping patience of 10 epochs since the last validation BERTScore improvement.

2 https://physionet.org/content/clinical-t5/1.0.0/

3. Results and Discussion

After early experiments defining some hyperparameters, four main experiments were performed and submitted to the ImageCLEF evaluation platform for testing. Results from the test phase were only revealed after the end of the challenge. These experiments aimed to evaluate the impact of design variations over the previously described architecture (Tab. 1).

Table 1: Description of the four main experiments submitted to the challenge test phase.

VisualT5-small: custom, unfrozen ViT-small trained from scratch, fine-tuning the original T5-small pre-trained language model [24].
VisualT5-small-cls: adds a CLS token to ViT-small to derive concept results, instead of a linear projection of the average of the visual encoder features in the sequence dimension.
VisualT5-base: frozen pre-trained ViT-base from MEDSam [19], fine-tuning the original T5-base pre-trained language model [24].
VisualT5-base-clinical: frozen pre-trained ViT-base from MEDSam [19], fine-tuning pre-trained ClinicalT5-base [22].

Since ViT-small is not defined in the original ViT publication [12], we designed it with a hidden size of 512, an image size of 256x256, a patch size of 16x16, 8 heads and 8 layers, and an MLP dimension of 1024. With these parameters, ViT-small produces 256 tokens (patches), providing full input embedding alignment with a 256-token sequence length T5-small, without the need for sequence length compression through average pooling. VisualT5-small trains the ViT-small visual encoder from scratch, in contrast with VisualT5-base, where the pre-trained ViT is kept frozen due to memory limitations. Experimental results showcase the variations in performance resulting from these differences in VisualT5 design (Tab. 2).

Table 2: Primary metrics for caption prediction (BERTScore) and multilabel concept detection (F1-score) for each proposed multitasking VisualT5 model variation, with the respective Run IDs for the submissions on the challenge platform.

Run IDs (Caption/Concept)   VisualT5 model           Validation BERTScore   Validation F1-score   Test BERTScore   Test F1-score
274/275                     VisualT5-small           0.61                   0.52                  0.59             0.53
676/679                     VisualT5-small-cls       0.61                   0.50                  0.61             0.53
677/680                     VisualT5-base            0.61                   0.52                  0.37             0.56
678/681                     VisualT5-base-clinical   0.61                   0.54                  0.61             0.58

It is noticeable that caption prediction performance did not change significantly during validation according to BERTScore. Using a CLS token strategy for concept detection resulted in the worst validation F1-score, with the full VisualT5-base-clinical method being the best overall.
This also translated to the test phase computed by the challenge organizers, where the full base models with related pre-trained weights performed best. Of note is the apparent lack of generalization to the test set of VisualT5-base, which used a general-domain T5-base text decoder. This overfitting did not happen when training from the ClinicalT5-base text decoder weights, suggesting that using pre-trained encoders and decoders from the medical domain is beneficial. In the overall test leaderboard [16], our multitask method placed 6/9 in concept detection and 6/11 in caption prediction.

In addition to quantitative performance, qualitative evaluation through random visual inspection of around a hundred test cases reveals that the model can ascertain modality and anatomical information well in the generated captions and concepts. However, the model is often unable to predict associated symptoms and diagnostic-related details, which are sometimes present in the target. Those are commonly related to clinical context or the reason for the examination, information outside of the image scope (Fig. 2). We believe that including more clinical information, such as the reason for the image acquisition, as input to these types of methods would lead to improved performance in these tasks.

Figure 2: Validation samples with target, prediction, and spatial attention from VisualT5-base-clinical.

The proposed spatial attention scheme seems to work well empirically when rendering the generated 4096 sigmoid weights as heatmaps using the Turbo colormap (Fig. 3). The ViT tokens related to foreground parts of the image are weighted more than background regions. This type of layer has the potential to improve the readability of ViT-derived transformers, which are notable for having difficult-to-visualize output attentions [29]. Note, however, that there is no specific highlight of the abnormal region. Our spatial attention seems to converge to a state where most foreground tokens are “important”, with values close to 1. More exploration of this type of module in future work might lead to improved contrast and more specific indication of abnormality localization on the generated heatmaps.
Possibilities include experimenting with different activations and colormaps for visualization.

Figure 3: Some selected VisualT5-base-clinical test outputs showcasing the highlight of the most important tokens by our proposed spatial attention.

4. Conclusion

We proposed VisualT5, an encoder-decoder model based on coupling pre-trained Vision Transformers with pre-trained T5 transformers. Better performance in multitasking the ImageCLEF Caption Prediction and Concept Detection tasks was observed when using models pre-trained on the medical domain. The same multitasking weights placed in the middle of the leaderboard for both tasks in the challenge’s test phase. Moreover, the proposed modified spatial attention successfully highlighted areas of medical interest. Future work will experiment with more general promptable visual language models including prior information outside of the scope of the radiological acquisition, adding more tasks and modalities, towards a lightweight, open-source, multitasking, multimodal, and explainable model.

Acknowledgments

D. Carmo was partially supported by São Paulo Research Foundation (FAPESP) grant #2019/21964-4. R. Lotufo is partially supported by CNPq (The Brazilian National Council for Scientific and Technological Development) under grant 313047/2022-7. L. Rittner is partially supported by CNPq grant 317133/2023-3, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) grant 506728/2020-00.

References

[1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[2] S. Feuerriegel, J. Hartmann, C. Janiesch, P. Zschech, Generative AI, Business & Information Systems Engineering 66 (2024) 111–126.
[3] OpenAI, ChatGPT, 2024. URL: https://chat.openai.com/chat, accessed: 2024-06-18.
[4] R. Pires, H. Abonizio, T. S. Almeida, R. Nogueira, Sabiá: Portuguese large language models, in: Brazilian Conference on Intelligent Systems, Springer, 2023, pp. 226–240.
[5] D.-R. Beddiar, M. Oussalah, T. Seppanen, Automatic captioning for medical imaging (MIC): a rapid review of literature, Artificial Intelligence Review 56 (2023) 4019–4076.
[6] L. Wynants, B. Van Calster, G. S. Collins, R. D. Riley, G. Heinze, E. Schuit, E. Albu, B. Arshi, V. Bellou, M. M. Bonten, et al., Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal, BMJ 369 (2020).
[7] M. Roberts, D. Driggs, M. Thorpe, J. Gilbey, M. Yeung, S. Ursprung, A. I. Aviles-Rivero, C. Etmann, C. McCague, L. Beer, et al., Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans, Nature Machine Intelligence 3 (2021) 199–217.
[8] N. Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317.
[9] L. Heiliger, A. Sekuboyina, B. Menze, J. Egger, J. Kleesiek, Beyond medical imaging: a review of multimodal deep learning in radiology, Authorea Preprints (2023).
[10] Y. Zhao, X. Wang, T. Che, G. Bao, S. Li, Multi-task deep learning for medical image computing and analysis: A review, Computers in Biology and Medicine 153 (2023) 106496.
[11] T. Dhar, N. Dey, S. Borra, R. S. Sherratt, Challenges of deep learning in medical image analysis—improving explainability and trust, IEEE Transactions on Technology and Society 4 (2023) 68–75.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, International Conference on Learning Representations abs/2010.11929 (2020).
[13] K. Zhang, J. Yu, Z. Yan, Y. Liu, E. Adhikarla, S. Fu, X. Chen, C. Chen, Y. Zhou, X. Li, et al., BiomedGPT: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks, arXiv preprint arXiv:2305.17100 (2023).
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[15] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science (LNCS), Grenoble, France, 2024.
[16] J. Rückert, A. Ben Abacha, A. G. Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, B. Bracke, H. Damm, T. M. G. Pakull, C. S. Schmidt, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2024 – Caption Prediction and Concept Detection, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[17] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B. Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset, Scientific Data (2024). URL: https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6.
[18] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270.
[19] J. Ma, Y. He, F. Li, L. Han, C. You, B. Wang, Segment anything in medical images, Nature Communications 15 (2024) 654.
[20] M. Górriz, J. Antony, K. McGuinness, X. Giró-i Nieto, N. E. O’Connor, Assessing knee OA severity with CNN attention-based end-to-end architectures, in: International Conference on Medical Imaging with Deep Learning, PMLR, 2019, pp. 197–214.
[21] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
[22] E. Hernandez, D. Mahajan, J. Wulff, M. J. Smith, Z. Ziegler, D. Nadler, P. Szolovits, A. Johnson, E. Alsentzer, et al., Do we still need clinical language models?, in: Conference on Health, Inference, and Learning, PMLR, 2023, pp. 578–597.
[23] E. Lehman, A. Johnson, Clinical-T5: Large language models built using MIMIC clinical text, PhysioNet (2023).
[24] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[25] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation, in: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24), ACM, 2024. URL: https://pytorch.org/assets/pytorch2-2.pdf. doi:10.1145/3620665.3640366.
[26] W. Falcon, The PyTorch Lightning team, PyTorch Lightning, 2024. URL: https://github.com/Lightning-AI/lightning. doi:10.5281/zenodo.10779019.
[27] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, International Conference on Learning Representations abs/1904.09675 (2019).
[28] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Proceedings of the Workshop on Text Summarization Branches Out, 2004.
[29] T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision transformers need registers, arXiv preprint arXiv:2309.16588 (2023).