<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum (CLEF)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>VisualT5: Multitasking Caption and Concept Prediction with Pre-trained ViT, T5 and Customized Spatial Attention in Radiological Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diedre Carmo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Letícia Rittner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Lotufo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electrical and Computer Engineering, Universidade Estadual de Campinas</institution>
          ,
          <addr-line>Campinas</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>The development of more explainable and general deep learning-based predictive and generative models is of interest to the medical imaging processing field, largely due to the “black box” and often specialized nature of current models. This paper describes our participation in the ImageCLEF Caption Prediction and Concept Detection challenges with a multitasking, multimodal and explainable architecture named VisualT5. VisualT5 couples the embedding power of a frozen pre-trained Vision Transformer (ViT) with the clinical text generation capabilities of the pre-trained ClinicalT5. Moreover, we propose a modified spatial attention module that weights our visual encoder features in the token dimension, showcasing the spatial importance of each ViT token and permitting more interpretability regarding what parts of the image have more impact on the model's conclusions. VisualT5-base-clinical as a single multitasking model achieved 0.61 BERTScore and 0.58 F1-score in the caption prediction and concept detection tasks, respectively, ranking 6/11 in the caption leaderboard and 6/9 in the concept leaderboard.</p>
      </abstract>
      <kwd-group>
        <kwd>vision transformer</kwd>
        <kwd>t5</kwd>
        <kwd>image captioning</kwd>
        <kwd>image classification</kwd>
        <kwd>medical imaging</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The success of deep learning for the creation of predictive and generative models is evident [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], with
success both in academic research and, recently, in integration into real products such as ChatGPT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
and other platform-based LLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Deep learning models have also been applied to medical imaging
classification and caption generation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, the translation of such models to real applications in
medicine lags behind, due to the complex nature of medical diagnosis and related signal processing.
Some research has raised potential problems of bias and other factors that render many deep learning-based
methods unfeasible to translate to real clinical practice [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Medical information that leads to
a diagnosis or disease understanding is presented in many modalities: different types of image
acquisitions, structured and free text, and even 1D signals such as electrocardiograms. Moreover, the
number of tasks involved in the pipeline of medical processes cannot be summarized into isolated academic
tasks such as direct image classification, segmentation, or caption generation. Finally, explainability of
the key factors that led to decision making is paramount in the medical field [8]. This context has led current
research to consider multimodality [9], multitasking [10], and explainability [11] as important
aspects of automated medical imaging processing.
      </p>
      <p>
        In terms of model architecture, current approaches for medical imaging classification mostly consist of
CNNs with fully connected layers or the vision transformer (ViT), a state-of-the-art transformer
for image classification [12]. In the context of image-to-caption generation, three methodologies are
commonly used: encoder-decoder models, where an encoder generates image features that are
decoded into text by either LSTMs or transformers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; visual language models, where transformer input
tokens mix ViT-like image representations with text tokens [13]; and finally CLIP-like approaches, where
image and text embeddings are aligned, and the embedding alignment is used after training to perform
various multimodal tasks [14].
      </p>
      <p>In this preliminary work, we explore multitasking medical image classification and caption
generation over various modalities of radiological images from two ImageCLEF [15] challenges at the same
time: the medical imaging caption prediction and concept detection challenges [16]. Our participation
in these challenges is a first step toward exploring multitasking, multimodality and explainability
in medical imaging processing for better generalization and usability in practice. Our proposal is
an encoder-decoder model marrying strong image representations from pre-trained ViT models
with a pre-trained T5 text decoder for caption generation, including innovative uses of spatial
attention to promote visual explainability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>The proposed VisualT5 is an image-to-text encoder-decoder architecture coupling a vision transformer
with an encoder-decoder T5 text transformer. VisualT5 is trained and evaluated using the ImageCLEF
dataset, ROCOv2.</p>
        <p>Radiology Objects in COntext (ROCOv2) [17] is the main dataset used by both ImageCLEF challenges
for caption prediction and concept detection labels. In summary, the authors of the dataset performed a
semi-automatic pipeline to extract valid caption and radiological image pairs from publicly available
medical papers. In this year's version of the dataset, the training set consists of 70108 radiology images,
with 9972 more for validation and 17237 for testing; the testing labels are hidden from the participants.</p>
        <p>Concept classification is multilabel, and the primary ground truth used in the challenge
consists of concepts automatically extracted from captions, represented by 1934 Unified Medical Language
System [18] Concept Unique Identifiers (CUIs). In addition, the concepts are reduced to a manually
curated subset containing only modality and body-region CUIs for a secondary evaluation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Architecture</title>
        <p>
          In VisualT5 (Fig. 1), the frozen pre-trained ViT encoder from MedSAM [19, 21] is used to generate latent
representations. To use its ViT-base [12] architecture, images are bilinearly interpolated to 1024x1024
while keeping the aspect ratio through zero padding, using the provided image processing pipeline
(https://huggingface.co/flaviagiammarino/medsam-vit-base). The
resulting embedding of shape [1, 4096, 768] for batch size 1 reveals a hidden size of 768 and a 16x16 patch
size, given that the sequence length of 4096 is the number of 16x16 patches that fit in a 1024x1024 image.
The last hidden state of the same shape is used as a latent representation and weighted by a modified spatial
attention mechanism. Instead of using convolutional layers as in Górriz et al.’s [20] 2D spatial attention,
multiple linear layers with bias and LeakyReLU non-linear activations are used in the same fashion
to compress the 768 hidden size into a single-channel array of 4096 sigmoid-activated values. Given
that each of the 4096 values corresponds to one position in the 64x64 patch grid, these values are used to weight
(multiply) the contribution of each token, i.e., the importance of each region of the input image. The
4096 values can be visualized as a heatmap after reshaping to 64x64 and bilinear interpolation to
1024x1024. Finally, the weighted latent space is used as visual encoder features for the subsequent
tasks.
        </p>
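        <p>For illustration, the following is a minimal PyTorch sketch of this token-wise spatial attention. The exact number and widths of the linear layers are not specified in the text, so the two-layer bottleneck (768 to 256 to 1) is an assumption:</p>
        <preformat>
import torch
import torch.nn as nn

# Token-wise spatial attention: compress the 768-dimensional hidden state of
# each ViT token into a single sigmoid-activated weight, as described above.
class TokenSpatialAttention(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_size, bottleneck, bias=True),
            nn.LeakyReLU(),
            nn.Linear(bottleneck, 1, bias=True),  # one scalar per token
        )

    def forward(self, tokens):
        # tokens: [batch, 4096, 768], the frozen ViT's last hidden state
        weights = torch.sigmoid(self.score(tokens))  # [batch, 4096, 1]
        return tokens * weights, weights.squeeze(-1)

# The 4096 weights can be rendered as a heatmap: reshape to the 64x64 patch
# grid and bilinearly interpolate to the 1024x1024 input resolution.
attention = TokenSpatialAttention()
weighted, w = attention(torch.randn(1, 4096, 768))
heatmap = nn.functional.interpolate(
    w.view(1, 1, 64, 64), size=(1024, 1024), mode="bilinear", align_corners=False
)
        </preformat>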
        <p>For concept detection, the visual encoder features are averaged in the sequence dimension and we
train a projection through a linear layer into 1934 sigmoid activated neurons for multilabel concept
detection, with each output neuron representing a CUI. The corresponding CUI strings are included in
the prediction based on a multilabel activation threshold of 0.5. At the same time, for caption prediction,
pre-trained ClinicalT5-base [22, 23] is used, including its text encoder, decoder and tokenizer. Note
that the text encoder and decoder are not frozen, and are adjusted by our training. Our visual encoder
features replace the input embeddings of ClinicalT5’s text encoder. We reduce the sequence length from
4096 to 128 using average pooling, due to limited GPU memory. To continue training and promote
the alignment of our visual encoder features as input embeddings for ClinicalT5, we follow the original
T5 training procedure [24]. Text generation during evaluation consists of computing the
visual encoder features once, followed by a Seq2Seq greedy decoding strategy with a maximum
sequence length of 128 tokens. Attempts at using only the ClinicalT5 decoder with visual encoder
features as "encoder outputs" for T5’s encoder-decoder attention resulted in degraded quantitative
performance with little computational efficiency benefit. With the multilabel concept detection and
generated caption prediction derived from the same visual encoder features, VisualT5 performs
both tasks with the same model and is trained on both simultaneously.</p>
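          <p>A hedged sketch of how the two heads could share the weighted visual features follows, using the Hugging Face T5 API. The names (concept_head, visual) are placeholders, and t5-base stands in for ClinicalT5-base, whose weights require PhysioNet credentialing:</p>
          <preformat>
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

# t5-base stands in for ClinicalT5-base (same d_model of 768)
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
concept_head = nn.Linear(768, 1934)  # one sigmoid neuron per CUI

visual = torch.randn(1, 4096, 768)   # weighted visual encoder features

# Concept detection: average over the token (sequence) dimension,
# then threshold the sigmoid outputs at 0.5 to select CUI strings
concept_probs = torch.sigmoid(concept_head(visual.mean(dim=1)))
predicted = concept_probs > 0.5      # boolean mask over the 1934 CUIs

# Caption prediction: average-pool the sequence from 4096 to 128 tokens
# (kernel size 32) and feed the result as the T5 encoder's input embeddings
pooled = nn.functional.avg_pool1d(visual.transpose(1, 2), 32).transpose(1, 2)
caption_ids = t5.generate(inputs_embeds=pooled, max_length=128)  # greedy
          </preformat>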
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Implementation Details</title>
        <p>For implementation we used the Hugging Face Transformers library, PyTorch [25] and PyTorch
Lightning [26]. MedSAM’s ViT pre-trained weights were also sourced from Hugging Face
(https://huggingface.co/flaviagiammarino/medsam-vit-base). Note that
acquiring ClinicalT5-base weights for T5-base weight initialization required credentialing and ethics
training through the PhysioNet platform [23] (https://physionet.org/content/clinical-t5/1.0.0/).</p>
        <p>For validation, we employed the evaluation code provided by the ImageCLEF organizers, which computes
BERTScore [27] and ROUGE [28] for caption prediction and multilabel F1-scores for concept detection.
BERTScore and the F1-score over all provided concepts were the primary metrics used by the challenge organizers
for ranking. During their test evaluation, they added and reported additional metrics [16]. Both
tasks are optimized at the same time using a single NVIDIA RTX 4090 24 GB GPU, with a batch size of 5, an AdamW optimizer
with a 1e-5 initial learning rate and 1e-5 weight decay, and training for 100 epochs with an early stopping
patience of 10 epochs without validation BERTScore improvement.</p>
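        <p>A minimal sketch of this training configuration with PyTorch Lightning follows; the module name VisualT5Module and the logged metric name val_bertscore are illustrative assumptions:</p>
        <preformat>
import torch
import pytorch_lightning as pl

class VisualT5Module(pl.LightningModule):
    # forward / training_step combining the caption and concept losses are
    # omitted here; only the reported hyperparameters are shown.
    def configure_optimizers(self):
        # AdamW with 1e-5 initial learning rate and 1e-5 weight decay
        return torch.optim.AdamW(self.parameters(), lr=1e-5, weight_decay=1e-5)

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu",
    devices=1,
    callbacks=[pl.callbacks.EarlyStopping(
        monitor="val_bertscore", mode="max", patience=10)],
)
# trainer.fit(VisualT5Module(), train_loader, val_loader)  # batch size 5
        </preformat>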
        <sec id="sec-2-3-1">
          <title>2https://physionet.org/content/clinical-t5/1.0.0/</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Discussion</title>
      <p>After early experiments defining some hyperparameters, four main experiments were performed and
submitted to the ImageCLEF evaluation platform for testing. Results from the test phase were only
revealed after the end of the challenge. These experiments aimed to evaluate the impact of design
variations on the previously described architecture (Tab. 1).</p>
      <p>Since ViT-small is not defined in the original ViT publication [12], we designed it with a 512 hidden
size, an image size of 256x256, a patch size of 16x16, 8 heads, 8 layers, and an MLP dimension of 1024. With
these parameters, ViT-small analyzes 256 tokens (patches), providing full input embedding alignment
with a 256 sequence length T5-small, without the need for sequence length compression through
average pooling. VisualT5-small trains the ViT-small visual encoder from scratch, in contrast with
VisualT5-base, where the pre-trained ViT is kept frozen due to memory limitations. Experimental results
showcase the variations in performance resulting from these differences in VisualT5 design (Tab. 2).</p>
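      <p>For reference, a sketch of this ViT-small configuration using the Hugging Face ViTConfig class; that the configuration was instantiated this way is an assumption:</p>
      <preformat>
from transformers import ViTConfig, ViTModel

# ViT-small as defined above: (256/16)^2 = 256 tokens, matching the
# 256 sequence length of T5-small without average pooling
vit_small_config = ViTConfig(
    hidden_size=512,
    image_size=256,
    patch_size=16,
    num_attention_heads=8,
    num_hidden_layers=8,
    intermediate_size=1024,  # MLP dimension
)
vit_small = ViTModel(vit_small_config)  # trained from scratch in VisualT5-small
      </preformat>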
      <p>It is noticeable that caption prediction performance did not change significantly during validation
according to BERTScore. Using a CLS token strategy for concept detection resulted in the worst F1-score
in validation, with the full VisualT5-base-clinical method being the best overall. This also translated
to the testing computed by the challenge organizers, where the full base models with related pre-trained
weights performed best. Notably, VisualT5-base, which used a general T5-base text decoder, apparently
failed to generalize to the test set. This overfitting did not happen when training from the
ClinicalT5-base text decoder weights, suggesting that using pre-trained encoders and decoders from the
medical domain is beneficial. In the overall test leaderboard [16], our multitask method placed 6/9 in
concept detection and 6/11 in caption prediction.</p>
      <p>In addition to quantitative performance, qualitative evaluation through random visual inspection of
around a hundred test cases reveals that the model can ascertain modality and anatomical information
well in the generated captions and concepts. However, the model is often unable to predict associated
symptoms and diagnostic-related details, which are sometimes present in the target. Those are commonly
related to clinical context or the reason for the examination, information outside of the image scope
(Fig. 2). We believe including more clinical information such as the reason for the image acquisition as
input to these types of methods would lead to improved performance in these tasks.</p>
      <p>The proposed spatial attention scheme seems to work well empirically when rendering the generated
4096 sigmoid weights as heatmaps using the Turbo colormap (Fig. 3; a rendering sketch is given below). The ViT tokens related to
foreground parts of the image are weighted more than background regions. This type of layer
has the potential to improve the readability of ViT-derived transformers, which are notable for having
difficult-to-visualize output attentions [29]. Note, however, that there is no specific highlight of the
abnormal region. Our spatial attention seems to converge to a state where most foreground tokens are
“important”, with values close to 1. More exploration of this type of module in future work might lead to
improved contrast and a more specific indication of abnormality localization on the generated heatmaps.
Possibilities include experimenting with different activations and colormaps for visualization.</p>
      <p>Figure image attributions: CC BY [Edelbach et al. (2023)]; CC BY [Cobilinschi et al. (2023)].</p>
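      <p>A hedged sketch of the heatmap rendering described above (the use of numpy and matplotlib here is an assumption):</p>
      <preformat>
import numpy as np
import matplotlib.pyplot as plt

weights = np.random.rand(4096)     # stand-in for the 4096 sigmoid weights
heatmap = weights.reshape(64, 64)  # back to the 64x64 patch grid
plt.imshow(heatmap, cmap="turbo", vmin=0.0, vmax=1.0)  # Turbo colormap (Fig. 3)
plt.axis("off")
plt.savefig("attention_heatmap.png", bbox_inches="tight")
      </preformat>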
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We proposed VisualT5, an encoder-decoder model coupling pre-trained Vision Transformers
with pre-trained T5 transformers. Better performance in multitasking the ImageCLEF Caption Prediction
and Concept Detection tasks was observed when using models pre-trained on the medical domain. The
same multitasking weights placed in the middle of the leaderboard for both tasks in the challenge’s test
phase. Moreover, the proposed modified spatial attention successfully highlighted foreground regions
of medical interest, although without localizing abnormalities specifically. Future work will experiment with more general promptable visual language models
that include prior information outside of the scope of the radiological acquisition, and will add more tasks and modalities,
towards a lightweight, open-source, multitasking, multimodal, and explainable model.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>D. Carmo was partially supported by the São Paulo Research Foundation (FAPESP) under grant #2019/21964-4. R.
Lotufo is partially supported by CNPq (the Brazilian National Council for Scientific and Technological
Development) under grant 313047/2022-7. L. Rittner is partially supported by CNPq grant 317133/2023-3
and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) grant 506728/2020-00.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Y. Bengio, G. Hinton,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feuerriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Janiesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zschech</surname>
          </string-name>
          , Generative AI,
          <source>Business &amp; Information Systems Engineering</source>
          <volume>66</volume>
          (
          <year>2024</year>
          )
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] OpenAI, ChatGPT,
          <year>2024</year>
          . URL: https://chat.openai.com/chat, accessed: 2024-06-18.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abonizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          , Sabiá:
          <article-title>Portuguese large language models</article-title>
          ,
          <source>in: Brazilian Conference on Intelligent Systems</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.-R.</given-names>
            <surname>Beddiar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oussalah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seppanen</surname>
          </string-name>
          ,
          <article-title>Automatic captioning for medical imaging (mic): a rapid review of literature</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>4019</fpage>
          -
          <lpage>4076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wynants</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Calster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Riley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heinze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schuit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Albu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Arshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Bonten</surname>
          </string-name>
          , et al.,
          <article-title>Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal</article-title>
          , BMJ
          <volume>369</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Driggs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thorpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilbey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ursprung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Aviles-Rivero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Etmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McCague</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beer</surname>
          </string-name>
          , et al.,
          <article-title>Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>3</volume>
          (
          <year>2021</year>
          )
          <fpage>199</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] N. Burkart, M. F. Huber, A survey on the explainability of supervised machine learning, Journal of Artificial Intelligence Research 70 (2021) 245–317.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Heiliger, A. Sekuboyina, B. Menze, J. Egger, J. Kleesiek, Beyond medical imaging – a review of multimodal deep learning in radiology, Authorea Preprints (2023).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Zhao, X. Wang, T. Che, G. Bao, S. Li, Multi-task deep learning for medical image computing and analysis: A review, Computers in Biology and Medicine 153 (2023) 106496.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Dhar, N. Dey, S. Borra, R. S. Sherratt, Challenges of deep learning in medical image analysis – improving explainability and trust, IEEE Transactions on Technology and Society 4 (2023) 68–75.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, International Conference on Learning Representations abs/2010.11929 (2020).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] K. Zhang, J. Yu, Z. Yan, Y. Liu, E. Adhikarla, S. Fu, X. Chen, C. Chen, Y. Zhou, X. Li, et al., BiomedGPT: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks, arXiv preprint arXiv:2305.17100 (2023).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France, 2024.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Rückert, A. Ben Abacha, A. G. Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, B. Bracke, H. Damm, T. M. G. Pakull, C. S. Schmidt, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2024 – Caption Prediction and Concept Detection, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B. Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset, Scientific Data (2024). URL: https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] J. Ma, Y. He, F. Li, L. Han, C. You, B. Wang, Segment anything in medical images, Nature Communications 15 (2024) 654.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Górriz, J. Antony, K. McGuinness, X. Giró-i Nieto, N. E. O’Connor, Assessing knee OA severity with CNN attention-based end-to-end architectures, in: International Conference on Medical Imaging with Deep Learning, PMLR, 2019, pp. 197–214.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] E. Hernandez, D. Mahajan, J. Wulf, M. J. Smith, Z. Ziegler, D. Nadler, P. Szolovits, A. Johnson, E. Alsentzer, et al., Do we still need clinical language models?, in: Conference on Health, Inference, and Learning, PMLR, 2023, pp. 578–597.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] E. Lehman, A. Johnson, Clinical-T5: Large language models built using MIMIC clinical text, PhysioNet (2023).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, S. Chintala, PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation, in: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24), ACM, 2024. URL: https://pytorch.org/assets/pytorch2-2.pdf. doi:10.1145/3620665.3640366.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] W. Falcon, The PyTorch Lightning team, PyTorch Lightning, 2024. URL: https://github.com/Lightning-AI/lightning. doi:10.5281/zenodo.10779019.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, International Conference on Learning Representations abs/1904.09675 (2019).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Proceedings of the Workshop on Text Summarization Branches Out, 2004.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] T. Darcet, M. Oquab, J. Mairal, P. Bojanowski, Vision transformers need registers, arXiv preprint arXiv:2309.16588 (2023).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>