UIT-2Q2T at ImageCLEFmedical 2024 Caption: Multimodal medical image captioning using Bootstrapping Language-Image Pre-training
Notebook for the UIT-2Q2T Team at CLEF 2024
Thien V. Phan1,2, Trinh K. Nguyen1,2, Quang A.D.D. Hoang1,2, Quan T. Phan1,2 and Thien B. Nguyen-Tat1,2,*
1 University of Information Technology, Ho Chi Minh City, Vietnam
2 Vietnam National University, Ho Chi Minh City, Vietnam

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
21522628@gm.uit.edu.vn (T. V. Phan); 21522717@gm.uit.edu.vn (T. K. Nguyen); 21522509@gm.uit.edu.vn (Q. A.D.D. Hoang); 21522502@gm.uit.edu.vn (Q. T. Phan); thienntb@uit.edu.vn (T. B. Nguyen-Tat)

Abstract
Introduction: Medical image captioning is an important AI task in healthcare, automating the generation of textual descriptions to support the management and interpretation of medical images. Our team, UIT-2Q2T, participated in the second subtask (caption prediction) of the ImageCLEFmedical 2024 Caption challenge, using the ROCOv2 dataset and the Bootstrapping Language-Image Pre-training (BLIP) approach.
Methods: Our approach leveraged the BLIP architecture for multimodal medical image captioning. This architecture employs a Vision Transformer (ViT) as the image encoder and a Bidirectional Encoder Representations from Transformers (BERT) model as the text model.
Results: We ranked 5th according to BERTScore and placed 3rd with ROUGE, BLEURT, and RefCLIPScore. Additionally, we achieved 2nd place for BLEU-1, METEOR, and CIDEr. Notably, we obtained the top position with a CLIPScore of 0.827074, demonstrating the effectiveness of our approach in medical image captioning.
Conclusion: Our participation in the ImageCLEFmedical 2024 Caption challenge demonstrated the effectiveness of the BLIP architecture for medical image captioning, achieving a high CLIPScore of 0.827074. This result demonstrates the model's potential to generate accurate and informative textual descriptions from medical images, thereby aiding diagnosis and helping non-experts understand medical imagery.

Keywords
CLEF 2024, Medical image processing, Image captioning, BERT, Pre-trained models, BLIP

1. Introduction
Image captioning, a well-established field in artificial intelligence (AI), finds applications across diverse domains. In healthcare, the increasing availability of medical imaging equipment and the efficiency of diagnosis based on visual data have fueled the popularity of image-based patient diagnosis. Medical image captioning models address this need by automatically analyzing and describing medical images. The generated textual descriptions assist doctors in diagnosing diseases and understanding physiological processes, and enable non-experts to interpret medical imagery. This field integrates computer vision and natural language processing, demanding an understanding of image components and their relationships [1]. Various models, such as Show-Attend-Tell, GPT-3, and BioLinkBERT-Large, have been used to generate comprehensive and descriptive captions for medical images, including radiological scans and histopathological specimens [2, 3]. Transformer-based approaches, such as the Global-Local Visual Extractor (GLVE) and the Cross Encoder-Decoder Transformer (CEDT), have shown promise in capturing both global and local features of images, enhancing the accuracy of generated captions [4]. These advancements in medical image captioning not only facilitate
clinical workflows and decision-making but also contribute significantly to medical education by providing quantitative indicators and assessments for learning outcomes [5]. To successfully deploy image captioning in healthcare, it is essential to integrate effective algorithms and to use a sufficiently large and diverse training dataset.
Our team participated in ImageCLEF 2024 [6], specifically the ImageCLEFmedical 2024 Caption task [7], which consists of two subtasks: concept detection and caption prediction. We focus mainly on the latter, in which participants are required to automatically generate captions for given medical images of various modalities, such as ultrasound, X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), etc. Our approach for the caption prediction subtask is based on the Bootstrapping Language-Image Pre-training (BLIP) [8] architecture with a Vision Transformer (ViT) image encoder [9].
In this paper, Section 2 outlines the task and dataset descriptions. Section 3 describes our proposed methodology. Section 4 details the implementation and results of the experiments. Finally, in Section 5, we conclude by summarizing the results, discussing the weaknesses, and outlining potential improvements for the future.

2. Task and Dataset Descriptions
At ImageCLEFmedical 2024, we participated in the image captioning task, now in its 8th edition. In this section, we introduce the ImageCLEFmedical 2024 Caption task and the dataset used for this challenge.

2.1. Task Description
ImageCLEFmedical 2024 Caption [7] is one of the ImageCLEFmedical tasks and aims to create descriptive captions for visual content. It comprises two subtasks:
1. Concept detection: Based on the visual image content, this subtask provides the foundation for the scene-understanding step by identifying the individual elements from which the annotation is generated.
2. Caption prediction: The core task is to create descriptive captions for given images. Leveraging identified concepts and contextual understanding, the models are tasked with generating concise and informative textual descriptions that accurately reflect the visual content depicted in the image.
In this study, we focus on the second subtask, using the provided ROCOv2 dataset [10].

2.2. Dataset Descriptions
The dataset for this task is ROCOv2 [10], an extended version of ROCO [11]. It is a multimodal dataset consisting of radiological images with associated medical concepts and captions extracted from the PubMed Open Access subset. Every image in the dataset is accompanied by a caption, which forms the label for the caption prediction task. Each caption was pre-processed by removing links. The dataset is split as follows:
• Training Set: 70,108 radiology images.
• Validation Set: 9,972 radiology images.
• Test Set: 17,237 radiology images.
As shown in Figure 1, the majority of captions in the dataset range from 50 to 150 words in length. Similarly, Figure 2 illustrates that, among the six imaging modalities represented in the dataset, CT scans and X-rays are predominant, accounting for 24,227 and 19,363 samples in the training set, respectively.
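The dataset statistics summarized above (and visualized in Figures 1 and 2) can be reproduced with a short script along the following lines. This is a minimal sketch: the CSV file name and the "Caption"/"Modality" column names are assumptions about how the ROCOv2 annotations are stored, not part of the official release tooling.

```python
# Minimal sketch for reproducing the dataset statistics behind Figures 1 and 2.
# Assumptions: the training captions are available as a CSV with "ID" and
# "Caption" columns, plus an optional "Modality" column; adjust the names to
# the actual ROCOv2 files.
import pandas as pd

captions = pd.read_csv("train_captions.csv")

# Caption length in words (Figure 1).
word_counts = captions["Caption"].str.split().str.len()
print(word_counts.describe())

# Imaging modality counts (Figure 2), if such a column is present.
if "Modality" in captions.columns:
    print(captions["Modality"].value_counts())
```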
Figure 1: Distribution of caption lengths in the Training Set (left) and Validation Set (right).

Figure 2: Distribution of image modalities in the Training and Validation Sets.

3. Methods
In this study, the BLIP model was employed to tackle the image captioning task. Our approach involves fine-tuning the BLIP model on the competition dataset, which consists of diverse and challenging image-caption pairs. The pipeline of our method is illustrated in Figure 3, showcasing the steps involved in adapting the BLIP model to our specific image captioning task.

3.1. Models
Bootstrapping Language-Image Pre-training (BLIP) [8] is a Vision-Language Pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. BLIP makes effective use of noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones. The model uses a Vision Transformer (ViT) [9], which divides the input image into patches and encodes them as a sequence of embeddings, with an additional [CLS] token representing the global image feature. As the authors note, ViT is more computation-friendly than region-based visual feature extractors and has been adopted by many recent methods.

Figure 3: Pre-training model architecture and objectives of BLIP (components with the same parameters share the same color). The figure shows the image encoder, the unimodal text encoder trained with the ITC objective, the image-grounded text encoder trained with the ITM objective, and the image-grounded text decoder trained with the LM objective. Example image and caption adapted from Bajracharya et al. (2021), CC BY-NC.

The authors propose a multimodal mixture of encoder-decoder, a unified vision-language model that can operate in one of three functionalities: (1) the unimodal encoder is trained with an image-text contrastive (ITC) loss to align the vision and language representations; (2) the image-grounded text encoder uses additional cross-attention layers to model vision-language interactions and is trained with an image-text matching (ITM) loss to distinguish between positive and negative image-text pairs; (3) the image-grounded text decoder replaces the bidirectional self-attention layers with causal self-attention layers, shares the same cross-attention layers and feed-forward networks as the encoder, and is trained with a language modeling (LM) loss to generate captions given images.
To train or pre-train the model for both understanding and generation tasks, this multimodal mixture of encoder and decoder integrates three functionalities and three objectives, as illustrated in Figure 3. The functionalities are:
• Unimodal Encoder: Encodes either the image or the text separately, without considering the other modality. This helps in understanding individual representations.
• Image-grounded Text Encoder: Encodes text while being conditioned on the image, allowing the model to capture relationships between visual and textual information.
• Image-grounded Text Decoder: Generates text based on the given image, useful for tasks like image captioning where the output is text describing the input image.
The objectives are:
• Image-Text Contrastive Loss (ITC): Ensures that paired image and text representations are closer together in the embedding space than unpaired ones. This helps the model learn strong associations between images and their corresponding texts.
• Image-Text Matching Loss (ITM): Assesses whether a given image and text pair match, promoting accurate image-text alignment in the embedding space.
• Language Modeling Loss (LM): Focuses on generating coherent and contextually accurate text based on the given inputs, improving the model's language generation capabilities.
These functionalities and objectives together enable the BLIP model to perform both vision-language understanding and generation tasks effectively.
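For concreteness, the sketch below loads one of the public BLIP captioning checkpoints through the Hugging Face transformers library and generates a caption for a single image. It illustrates the encoder-decoder usage described above; the image path is a placeholder, and this is not our exact training code.

```python
# Illustrative sketch: caption one image with a pre-trained BLIP checkpoint.
# The checkpoint name matches the base model fine-tuned in Section 4.1;
# "example_radiology_image.png" is a placeholder path.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

image = Image.open("example_radiology_image.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

# The ViT encoder embeds the image patches; the image-grounded text decoder
# then generates the caption autoregressively (the LM objective at pre-training).
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```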
3.2. Evaluation Metrics
Following the guidelines provided by the competition organizers, we employed two main metrics: BERTScore [12] and ROUGE [13]. To calculate BERTScore, we use the "microsoft/deberta-xlarge-mnli" model, which is available on the Hugging Face Model Hub (https://huggingface.co/microsoft/deberta-xlarge-mnli, last accessed May 17, 2024). Additionally, other metrics such as BLEU-1 [14], BLEURT [15], METEOR [16], CIDEr [17], CLIPScore [18], RefCLIPScore [19], ClinicalBLEURTScore [20], and MedBERTScore [20] were also applied for evaluation.
Following the organizers' instructions, the captions underwent three preprocessing steps: conversion to lowercase, replacement of numbers with a special token, and removal of punctuation. This preprocessing standardizes the text inputs and improves the reliability of the evaluation results.

4. Experiments
In this section, we present our experimental setup and results for evaluating the BLIP model in the ImageCLEFmedical 2024 Caption challenge. The experiments were designed to test the model's performance across different configurations and metrics, with the aim of generating accurate and informative captions for medical images. We describe the setup in detail and discuss the results obtained from the various test scenarios.

4.1. Experimental Setup
In our experiments, we employed the BLIP base and BLIP large models, initialized from pre-trained checkpoints. For BLIP base, weights were taken from the checkpoint "Salesforce/blip-image-captioning-base" (https://huggingface.co/Salesforce/blip-image-captioning-base, last accessed May 17, 2024). Training was conducted over 15 epochs with an initial learning rate of 1e-5, and a StepLR scheduler decreased the learning rate by a factor of 10 every 3 epochs. For BLIP large, weights were taken from the checkpoint "Salesforce/blip-image-captioning-large" (https://huggingface.co/Salesforce/blip-image-captioning-large, last accessed May 17, 2024). Training was conducted over 5 epochs with an initial learning rate of 1e-5 and was stopped when the loss ceased to decrease. Throughout all experiments, the AdamW optimizer [21] was used. Input images were resized to 224×224, and the maximum length of the text input was set to 200 tokens. Training was performed on a single NVIDIA A100 PCIe 40 GB GPU.
For each model, experiments were conducted with five different generation settings, all using no_repeat_ngram_size = 3 to prevent the model from repeating any 3-gram within the generated text. The generation settings, which also appear in the code sketch below, are as follows:
(1) Greedy Search: selects the token with the highest probability at each step, ensuring a straightforward and fast generation process but potentially missing more diverse or optimal sequences.
(2) Beam Search with beam_size = 3: keeps the three most probable sequences at each generation step, allowing more exploration of candidate sequences than greedy search.
(3) Beam Search with beam_size = 4: similar to the previous setting, but with a beam size of 4, balancing exploration and computational efficiency.
(4) Beam Search with beam_size = 5: further increases the beam size to 5, allowing more comprehensive exploration while remaining computationally manageable.
(5) Beam Search with beam_size = 10: with a beam size of 10, this setting aims for the broadest exploration of candidate sequences, potentially improving caption quality at the cost of higher computational resources.
The source code for our experiments is available on GitHub (https://github.com/QuangHoang059/DS312).
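A condensed sketch of this fine-tuning and generation configuration is given below, using the Hugging Face transformers BLIP classes. The hyperparameters mirror the BLIP base setup described above, while ROCOCaptionDataset, the image paths, and the batch size are hypothetical stand-ins rather than details taken from our released code.

```python
# Sketch of the BLIP-base fine-tuning configuration described above (15 epochs,
# AdamW, lr 1e-5, StepLR decaying the rate 10x every 3 epochs, 224x224 inputs,
# captions truncated to 200 tokens). ROCOCaptionDataset and the file paths are
# hypothetical stand-ins, not part of the transformers library or ROCOv2.
import torch
from PIL import Image
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, Dataset
from transformers import BlipProcessor, BlipForConditionalGeneration


class ROCOCaptionDataset(Dataset):
    """Hypothetical wrapper around a list of (image_path, caption) pairs."""

    def __init__(self, pairs, processor):
        self.pairs, self.processor = pairs, processor

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        enc = self.processor(images=Image.open(path).convert("RGB"), text=caption,
                             padding="max_length", truncation=True, max_length=200,
                             return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}


device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
processor.image_processor.size = {"height": 224, "width": 224}  # 224x224 inputs
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# Placeholder training pairs; in practice these come from the ROCOv2 training split.
train_pairs = [("images/train_000001.jpg", "Axial CT image of the abdomen.")]
loader = DataLoader(ROCOCaptionDataset(train_pairs, processor),
                    batch_size=16, shuffle=True)

optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = StepLR(optimizer, step_size=3, gamma=0.1)  # divide lr by 10 every 3 epochs

model.train()
for epoch in range(15):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the LM loss
        loss = model(pixel_values=batch["pixel_values"], input_ids=batch["input_ids"],
                     attention_mask=batch["attention_mask"], labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()

# Generation settings (2)-(5): beam search with no_repeat_ngram_size = 3.
model.eval()
pixel_values = processor(images=Image.open(train_pairs[0][0]).convert("RGB"),
                         return_tensors="pt").pixel_values.to(device)
with torch.no_grad():
    output_ids = model.generate(pixel_values=pixel_values, num_beams=4,
                                no_repeat_ngram_size=3, max_length=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Swapping the checkpoint name for "Salesforce/blip-image-captioning-large" and reducing the number of epochs to 5 would correspond to the BLIP large configuration described above.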
4.2. Experimental Results
To evaluate the effectiveness of our approach using the BLIP model on the image captioning task, we conducted a series of experiments on the competition dataset. This section presents the results, highlighting the model's performance in generating captions. We compare the experimental results of the model under different configurations and also benchmark our model's performance against that of other teams.

Example 1 (CC BY, Sato et al. (2022)):
Ground truth: Computed tomography (CT) shows floating thrombosis (white arrow).
Prediction with greedy search: contrast - enhanced computed tomography image of the aortic arch ( white arrow ).
Prediction with beam search (beam_size = 3, 4, 5, and 10): contrast - enhanced computed tomography image of the aortic arch ( white arrow ).

Example 2 (CC BY-NC, Trowbridge et al. (2022)):
Ground truth: Early sagittal T2-weighted MRI.
Prediction with greedy search: sagittal t2 - weighted mri of the thoracic spine.
Prediction with beam search (beam_size = 3): sagittal t2 - weighted magnetic resonance image of the cervical spine.
Prediction with beam search (beam_size = 4): sagittal t2 - weighted magnetic resonance image of the cervical spine.
Prediction with beam search (beam_size = 5): sagittal t2 - weighted magnetic resonance image of the cervical spine.
Prediction with beam search (beam_size = 10): sagittal t2 - weighted mri of the thoracic spine.

Figure 4: Two examples of predicted results and ground truths from the validation set of the caption prediction task.

4.2.1. Results on Validation Set

Table 1
Evaluation results of the BLIP base and large models on the validation set under the five generation configurations.

Model (config)    ROUGE       BERTScore   BLEU
BLIP base (1)     0.263178    0.659321    0.291905
BLIP base (2)     0.264012    0.658852    0.300932
BLIP base (3)     0.264665    0.659548    0.299855
BLIP base (4)     0.264674    0.659648    0.297638
BLIP base (5)     0.263178    0.659321    0.291905
BLIP large (1)    0.269548    0.666101    0.285273
BLIP large (2)    0.274387    0.667651    0.295454
BLIP large (3)    0.274497    0.667971    0.295484
BLIP large (4)    0.272249    0.667263    0.292144
BLIP large (5)    0.269548    0.666101    0.285273

As shown in Table 1, the BLIP large model outperforms the BLIP base model. Both models generate reasonable captions, with beam search generally outperforming greedy search on ROUGE, BERTScore, and BLEU. Specifically, the BLIP base model achieves its highest BERTScore and ROUGE with a beam size of 5, and its best BLEU score with a beam size of 3. The BLIP large model attains its best results on all three metrics with a beam size of 4. Additionally, as illustrated by the two examples in Figure 4, the model accurately identifies objects and colors (white arrow) as well as different imaging modalities (CT and sagittal T2-weighted MRI).
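As an illustration of how such validation scores can be computed, the sketch below applies the preprocessing described in Section 3.2 and then evaluates BERTScore and a ROUGE variant with the bert-score and rouge-score Python packages. The token used to replace numbers and the exact ROUGE configuration are assumptions, since the official evaluation scripts define those details; the two example pairs are taken from Figure 4.

```python
# Sketch of the caption scoring pipeline: preprocessing (lowercase, number token,
# punctuation removal) followed by BERTScore and ROUGE-1. The "number" token and
# the ROUGE-1 choice are assumptions about the official evaluation configuration.
import re
import string

from bert_score import score as bert_score
from rouge_score import rouge_scorer


def preprocess(caption: str) -> str:
    caption = caption.lower()                                   # 1) lowercase
    caption = re.sub(r"\d+", "number", caption)                 # 2) numbers -> special token
    caption = caption.translate(str.maketrans("", "", string.punctuation))  # 3) drop punctuation
    return " ".join(caption.split())


references = [
    "Computed tomography (CT) shows floating thrombosis (white arrow).",
    "Early sagittal T2-weighted MRI.",
]
predictions = [
    "contrast-enhanced computed tomography image of the aortic arch (white arrow).",
    "sagittal t2-weighted mri of the thoracic spine.",
]
refs = [preprocess(c) for c in references]
preds = [preprocess(c) for c in predictions]

# BERTScore with the model specified by the organizers.
_, _, f1 = bert_score(preds, refs, model_type="microsoft/deberta-xlarge-mnli", lang="en")
print("BERTScore F1:", f1.mean().item())

# Mean ROUGE-1 F-measure over the set.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
rouge1 = sum(scorer.score(r, p)["rouge1"].fmeasure for r, p in zip(refs, preds)) / len(refs)
print("ROUGE-1 F:", rouge1)
```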
4.2.2. Results on Test Set
According to the private test results announced by the organizing committee, partially presented in Table 2, our team ranked 5th on the BERTScore metric. We achieved 3rd place on the ROUGE, BLEURT, and RefCLIPScore metrics, and 2nd place on BLEU-1, METEOR, and CIDEr.

Table 2
Results on the private test set for the top 5 teams based on the primary score.

Team            BERTScore   ROUGE      BLEU-1     BLEURT     METEOR     CIDEr      CLIPScore
pclmed          0.629913    0.272626   0.268994   0.337626   0.113264   0.268133   0.823614
CS_Morgan       0.628059    0.250801   0.209298   0.317385   0.092682   0.245029   0.821262
DarkCow         0.626720    0.245228   0.195044   0.306005   0.088897   0.224250   0.818440
auebnlpgroup    0.621112    0.204883   0.111034   0.289907   0.068022   0.176923   0.804067
2Q2T            0.617814    0.247755   0.221252   0.313942   0.098590   0.220037   0.827074

Notably, we attained 1st place on CLIPScore with a value of 0.827074. These results confirm that the model performs as expected and is competitive with the other participating teams.

5. Conclusion and Future Work
In this paper, we implemented and experimented with the BLIP model for the task of medical image captioning in the ImageCLEFmedical 2024 Caption challenge. The experimental results across various configurations were promising, with the model achieving a CLIPScore of 0.827074 on the test set of the ROCOv2 dataset. Despite these achievements, there is still room for improvement in our research. The primary weakness of the model is that it was pre-trained on data that differ significantly from the medical domain, resulting in considerable bias. Moving forward, we aim to enhance the model's accuracy by utilizing models pre-trained on datasets that are more closely aligned with medical and diagnostic domains. Additionally, we plan to apply preprocessing methods tailored to different types of medical images to further improve performance. Exploring domain-specific augmentation techniques and integrating more diverse medical datasets could also provide substantial gains. By addressing these areas, we hope to develop a more robust and accurate medical image captioning model that can be a valuable tool in clinical settings, aiding diagnosis and assisting non-experts in understanding medical imagery.
Furthermore, future research will involve a detailed analysis of the model's errors to understand the underlying reasons for its mispredictions. This analysis will guide the development of more effective strategies for fine-tuning and enhancing the model's capabilities. Our ultimate goal is to contribute to the advancement of AI in healthcare by providing reliable and interpretable models that can support medical professionals and improve patient outcomes.

Acknowledgment
This research is funded by the University of Information Technology - Vietnam National University Ho Chi Minh City under grant number D4-2024-01.

References
[1] Y. Lin, K. Lai, W. Chang, Skin medical image captioning using multi-label classification and siamese network, IEEE Access 11 (2023) 23447–23454. doi:10.1109/ACCESS.2023.3249462.
[2] S. Elbedwehy, T. Medhat, T. Hamza, M. Alrahmawy, Enhanced descriptive captioning model for histopathological patches, Multimedia Tools and Applications 83 (2023) 1–20. doi:10.1007/s11042-023-15884-y.
[3] A. Selivanov, O. Rogov, D.
Chesakov, A. Shelmanov, I. Fedulova, D. Dylov, Medical image captioning via generative pretrained transformers, Scientific Reports 13 (2023). doi:10.1038/s41598-023-31223-5.
[4] H. Lee, H. Cho, J. Park, J. Chae, J. Kim, Cross encoder-decoder transformer with global-local visual extractor for medical image captioning, Sensors 22 (2022) 1429. doi:10.3390/s22041429.
[5] D.-R. Beddiar, M. Oussalah, T. Seppänen, Automatic captioning for medical imaging (MIC): a rapid review of literature, Artif. Intell. Rev. 56 (2022) 4019–4076. URL: https://doi.org/10.1007/s10462-022-10270-w. doi:10.1007/s10462-022-10270-w.
[6] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France, 2024.
[7] J. Rückert, A. Ben Abacha, A. G. Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, B. Bracke, H. Damm, T. Pakull, C. S. Schmidt, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2024 – Caption Prediction and Concept Detection, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[8] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 12888–12900. URL: https://proceedings.mlr.press/v162/li22n.html.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.
[10] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B. Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset, Scientific Data (2024). URL: https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6.
[11] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. Friedrich, Radiology Objects in COntext (ROCO): A multimodal image dataset, in: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings, 2018, pp. 180–189. doi:10.1007/978-3-030-01364-6_20.
[12] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[13] C.-Y.
Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.
[14] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[15] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust metrics for text generation, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–7892. URL: https://aclanthology.org/2020.acl-main.704. doi:10.18653/v1/2020.acl-main.704.
[16] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–72. URL: https://aclanthology.org/W05-0909.
[17] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575. doi:10.1109/CVPR.2015.7299087.
[18] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, Y. Choi, CLIPScore: A reference-free evaluation metric for image captioning, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528. doi:10.18653/v1/2021.emnlp-main.595.
[19] L. Jin, G. Luo, Y. Zhou, X. Sun, G. Jiang, A. Shu, R. Ji, RefCLIP: A universal teacher for weakly supervised referring expression comprehension, in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 01–10. doi:10.1109/CVPR52729.2023.00263.
[20] A. Ben Abacha, W.-w. Yim, G. Michalopoulos, T. Lin, An investigation of evaluation methods in automatic medical note generation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2575–2588. URL: https://aclanthology.org/2023.findings-acl.161. doi:10.18653/v1/2023.findings-acl.161.
[21] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7.