<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Medical Image Report Generation through Standard Language Models: Leveraging the Power of LLMs in Healthcare</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Leonardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Portinale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Santomauro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Institute, DISIT, Università del Piemonte Orientale</institution>
          ,
          <addr-line>Alessandria</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In recent years, Artificial Intelligence has witnessed a deep transformation, primarily driven by advancements in deep learning architectures. Among these, the Transformer architecture has emerged as a pivotal milestone, revolutionizing natural language processing and several other tasks and domains. The Transformer's ability to capture contextual dependencies across sequences, paired with its parallelizable design, has made it exceptionally versatile. This plays a fundamental role in the healthcare field, where the ability to integrate and process data from various modalities, such as medical images, clinical notes and patient records, is of paramount importance in order to enable AI models to provide more informed answers. This complexity raises the demand for models that can integrate information from multiple modalities, such as text, images and audio: multimodal transformers, sophisticated architectures able to process and fuse information across different modalities. Furthermore, an important goal in the healthcare domain is to focus on pre-trained models, given the scarcity of large datasets in this field and the need to minimise computational resources, since healthcare organizations are typically not equipped with high-performance computation devices. This paper presents a methodology for harnessing pre-trained large language models based on the transformer architecture, in order to facilitate the integration of different data sources, with a specific focus on the fusion of radiological images and textual reports. The ensuing approach involves the fine-tuning of pre-existing textual models, enabling their seamless extension into diverse domains.</p>
      </abstract>
      <kwd-group>
<kwd>Multimodal machine learning</kwd>
        <kwd>Large language models</kwd>
        <kwd>Automated radiology report generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>In recent years, the field of artificial intelligence has witnessed a profound transformation,
primarily driven by advancements in deep learning architectures. Among these, the Transformer
architecture has emerged as a pivotal milestone, revolutionizing natural language processing and
numerous other domains. A Transformer is remarkably effective at capturing contextual dependencies
across sequences; this feature, together with its parallelizable design, has rendered this deep
learning architecture exceptionally versatile. However, as the complexity of AI tasks continues
to evolve, so too does the demand for models able to integrate information from
multiple modalities, such as text, images, and audio. Multimodal transformers are a potential
answer to these issues; they are of great significance in the healthcare field due to their
ability to integrate and process data from various modalities, such as medical images, clinical
notes, and patient records. This multimodal approach holds great potential for enhancing
diagnosis, treatment, and healthcare research in general. However, in this context, several
challenges have to be addressed:
• Computational Complexity: large and complex architectures impose substantial computational
demands and require significant computing resources. These requirements are often
cost-prohibitive, hindering the widespread adoption of such models.
• Data Scarcity: the availability of sufficient data for training is often limited, making
customized transformer training a non-trivial effort. The lack of data can lead to overfitting
when attempting to train complex Transformer models, as the available data may not be
sufficient to generalize efficiently.
• Lack of Technical Transparency: a noteworthy concern arises from the paucity of
comprehensive, public, and open technical specifications. Many pioneering works in the literature
refrain from publicly disclosing complete architectural details; instead, they merely
provide cursory insights into the overall structure while withholding finer-grained specifics.
This opacity complicates efforts to replicate and build upon prior research, hampering
the scientific community's ability to advance the field with precision [1][2].</p>
<p>This paper introduces a robust methodology that stands out by emphasizing the deliberate
reuse of pre-trained large language models, setting it apart from the ad-hoc approaches of other
methodologies. Our approach is dedicated to the integration of diverse data sources, with
a special emphasis on merging radiological images and textual reports. In particular, we
focus on the definition and experimentation of a multimodal architecture to automatically
generate natural language radiological reports on the basis of radiological (X-ray) images. A
key peculiarity lies in a principled fine-tuning of pre-existing textual models, ensuring their
effective extension into a specific healthcare domain.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed architecture</title>
<p>The base model we used is GPT-2, a well-known decoder-only transformer. A
decoder-only transformer is a neural network architecture derived from the original model
introduced in [3]. The Transformer architecture has become a fundamental building block in
natural language processing (NLP) and has been adapted for various sequence-to-sequence
tasks, including machine translation, text generation, and many more. Among the various
available transformer architectures, we excluded all encoder-only models, as the ultimate
objective is text generation, which necessitates decoder-only architectures. GPT-2 was selected
for its ease of use at the time of the investigation, since the most recent architectures
were not open source.</p>
<p>A traditional Transformer model contains both an encoder and a decoder structure. The
encoder processes the input sequence (e.g., a source sentence in machine translation), while the
decoder generates the output sequence (e.g., a target sentence in machine translation). A
decoder-only transformer, on the other hand, is composed of:
• Input Embedding: like the original Transformer, the decoder-only Transformer starts
by embedding the input tokens (e.g., words or subwords) into continuous vector
representations.
• Positional Encoding: to provide information about the position of each token in the
sequence, positional encodings (modeled through sine and cosine functions of different
frequencies) are added to the input embeddings.
• Multi-Head Self-Attention Mechanism: the core component of the decoder is the
multi-head self-attention mechanism, which allows the model to attend to different parts
of the input sequence and capture contextual information. The decoder attends to the
previously generated tokens in an autoregressive manner; that is, it generates one token
at a time and uses the generated tokens as context for generating subsequent tokens.
• Masked Self-Attention: in the decoder, a mask is applied to the self-attention mechanism
to ensure that tokens cannot attend to future tokens. This is important for autoregressive
generation, because each token should only depend on the tokens generated before it.
• Feedforward Layers: the multi-head self-attention mechanism is usually
followed by feedforward neural networks applied at each token position. These feedforward
networks can have multiple layers.
• Layer Normalization and Residual Connections: layer normalization and residual
connections are applied after each sub-layer (e.g., self-attention and feedforward layers)
to stabilize training and facilitate the flow of gradients.
• Output Layer: the output of the decoder-only Transformer is typically projected to the
target vocabulary size through a linear layer followed by a softmax activation function.</p>
      <p>This allows the model to generate probability distributions over the possible next tokens
in the sequence.</p>
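As an illustration of the masked self-attention component described above, the following sketch implements a single attention head with a causal mask in NumPy. It is a minimal sketch under stated assumptions: one head, no dropout, and externally supplied projection matrices (real decoders use several heads and an output projection).

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal (lower-triangular) mask.

    x: (seq_len, d_model) token embeddings; w_q, w_k, w_v: (d_model, d_head)
    learned projections. Illustrative only: real decoders use several heads,
    dropout, and an output projection.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])              # (seq_len, seq_len)
    # Mask future positions: token i may attend only to tokens 0..i.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf
    # Row-wise softmax over the unmasked scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                   # (seq_len, d_head)
```

Because of the mask, perturbing a later token cannot change the output at earlier positions, which is exactly the property autoregressive generation relies on.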
<p>
The decoder is trained on a sequence-to-sequence task with teacher forcing, where the
ground-truth target sequence is used as input during training to predict the next token in the
sequence. The loss function used in this particular setting is the cross-entropy loss; given a
corpus of tokens U = {u_1, ..., u_n}, we use a standard language modeling objective to maximize
the following likelihood:

L(U) = ∑_i log P(u_i | u_{i−k}, ..., u_{i−1}; Θ)   (1)

where k is the size of the context window, and the conditional probability P is modeled using a
neural network with parameters Θ. During inference (generation), the decoder generates tokens
one by one, using its own previously generated tokens as context. Greedy decoding or beam
search can be used to select the next token. Decoder-only Transformers are commonly used in
text generation tasks, and this was the reason behind the choice of this architecture. Figure 1
shows the classic decoder-only transformer architecture.
</p>
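The autoregressive inference loop just described can be sketched as follows for the greedy case; `next_token_logits` is a hypothetical stand-in for a trained decoder-only model such as GPT-2, and `eos_id` for its end-of-sequence token.

```python
import numpy as np

def greedy_decode(next_token_logits, prompt, max_new_tokens, eos_id=None):
    """Autoregressive greedy decoding: at each step, feed the tokens generated
    so far and pick the argmax of the model's next-token distribution.

    next_token_logits: callable mapping a token list to a logits vector
    (a stand-in for a trained decoder-only model).
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        nxt = int(np.argmax(logits))       # greedy choice
        tokens.append(nxt)                 # generated token becomes context
        if eos_id is not None and nxt == eos_id:
            break
    return tokens
```

Beam search replaces the single argmax with the top-k continuations of several partial hypotheses kept in parallel; the surrounding loop is the same.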
<p>In the context of the multi-modal framework we are interested in, the aforementioned
architectural configuration proves to be insufficient. Indeed, our objective is to employ a
generative model in such a way that, when provided with an image as contextual input, it produces
a textual report, as in the task of image captioning. Consequently, an extension of the model is
needed, in order to add the capability to effectively process visual information. Figure 2 depicts
the model adaptation for the multi-modal setting. The modification consists in the addition
of an image encoder, leaving all other components of the transformer unchanged. The input
embedding, in this architecture, is composed of two steps:
• Image embedding: depending on which type of encoder is used, it can involve
different steps. The goal is to transform the input image into a sequence of embedding
vectors whose length matches the size of the image context.</p>
      <p>• Text embedding: as before.</p>
<p>
The two different embeddings (image and text) are then concatenated to obtain a single vector
provided as input to the decoder-only Transformer. Instead of using a corpus of tokens U, we
consider a set of pairs C = {(x_1, t_1), ..., (x_n, t_n)}, where the x_j are the images and the t_j are the
associated textual descriptions. The loss function is again the cross-entropy loss, having the
image as fixed context, trying to predict the next token based on the previous context:

L(C) = ∑_i log P(t_i | x, t_0, ..., t_{i−1}; Θ)   (2)
</p>
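The concatenation of image and text embeddings into the single sequence fed to the decoder can be sketched as below. Shapes are illustrative assumptions (d_model = 768 as in GPT-2); in practice both parts would be produced by the image encoder and the token embedding layer respectively.

```python
import numpy as np

def build_multimodal_input(image_tokens, text_embeddings):
    """Concatenate image-derived embeddings with text token embeddings to
    form the single sequence fed to the decoder-only transformer.

    image_tokens: (n_img, d_model) vectors from the image encoder, acting
    as a fixed prefix (the visual context); text_embeddings: (n_txt, d_model)
    ordinary token embeddings. Shapes are illustrative (d_model = 768).
    """
    assert image_tokens.shape[1] == text_embeddings.shape[1]
    return np.concatenate([image_tokens, text_embeddings], axis=0)
```

Placing the image tokens first means the causal mask lets every text token attend to the whole image, while the image acts as a fixed context, matching equation (2).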
<p>Note that we can use pre-trained models for both the entire transformer and the image
encoder.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related works</title>
      <p>Large language models have provided significant advancements in the training of specialized
models tailored for medical applications. Specific models that have emerged in this domain
include BioBERT[4], ClinicalBERT[5], PubMedBERT[6], BioGPT[7], and Med-PaLM[8]. Notably, a
recent addition to this landscape is the Med-Flamingo model[2][9][10], which has demonstrated
remarkable performance.</p>
      <p>These aforementioned models share three common characteristics:
• they are ad hoc models, trained from scratch.
• they necessitate a substantial volume of data for training.</p>
      <p>• they demand a considerable amount of computational power for training and inference.</p>
<p>In contrast, our proposal offers an alternative approach by using pre-trained models. This
approach enables fine-tuning of the model on a smaller dataset and reduces the computational
resources required. In particular, our model is trained using a single RTX6000 GPU, highlighting
its efficiency in comparison to the resource-intensive nature of the aforementioned models.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and results</title>
<p>For our experiments we use a publicly available dataset called MIMIC-CXR [11]. The MIMIC-CXR
(Medical Information Mart for Intensive Care - Chest X-Ray) dataset is a large and widely
used dataset in the field of medical imaging and healthcare research. It consists of chest X-ray
images and associated clinical metadata, including textual reports. The dataset is composed of
more than 300,000 X-ray images, but it is strongly unbalanced; indeed, about 33% of the clinical
studies represent normal chest X-rays (i.e., no acute cardiopulmonary diseases are noted). In the
remaining 67% there is a further imbalance between the different pathologies, with very frequent
pathologies such as cardiomegaly and pulmonary edema and very rare clinical situations such
as rib fractures (see Table 1). This type of imbalance is quite common in clinical datasets and
can lead to low model performance.</p>
<p>Table 1: Clinical pathology distribution.</p>
<p>We also pre-processed the MIMIC-CXR dataset with the tools CXR-RePaiR [12] and
CXR-ReDonE [13]. Regarding CXR-RePaiR, we used its data preprocessing component to extract
salient information from the textual reports. Specifically, within the MIMIC-CXR dataset, the
reports are organized in a way similar to complete Electronic Health Records (EHRs), and the
tool facilitated the extraction of the findings section related to the radiological images.
Furthermore, we employed the CXR-ReDonE tool to systematically exclude any references to
prior, unspecified reports. We removed these references because it was not possible to link
the current examination with the one referred to, due to the anonymization of the reports.
Consequently, the removal of these comparative segments is crucial to mitigate the potential
occurrence of erroneous associations or interpretations. We performed a rebalancing of the
dataset by applying a downsampling, specifically for the normal X-ray images, obtaining about
30,000 paired data points. The transformer we use is a pre-trained version of GPT-2 [14], from Hugging
Face (huggingface.co). As discussed in Section 2, the main difference between a standard
decoder-only transformer and our architecture is the image encoder. We tested two different
architectures as image encoder:
• CheXNet
• ViT input embedder</p>
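The downsampling-based rebalancing described above might be sketched as follows; the triple format for the paired data is a hypothetical illustration, not the actual preprocessing code.

```python
import random

def downsample_normals(pairs, keep_fraction, seed=0):
    """Rebalance (image, report, is_normal) triples by randomly keeping only
    a fraction of the 'normal' studies, as in the downsampling step above.
    The triple format is a hypothetical illustration of the paired data.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [p for p in pairs if not p[2] or rng.random() < keep_fraction]
```

All abnormal studies are kept, so only the over-represented normal class shrinks.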
      <sec id="sec-4-1">
        <title>4.1. CheXNet</title>
        <p>CheXNet is a deep neural network architecture designed for the detection of thoracic diseases,
particularly chest X-ray interpretation. It was introduced in the paper [15], and it aims to assist
medical professionals in diagnosing common thoracic diseases, with a focus on pneumonia
detection. Figure 3 shows the CheXNet architecture.</p>
<p>We used the last convolutional layer as image embedding, which consists of a matrix with
shape 32x7x7, where 32 is the convolution depth. We then apply a flattening, obtaining a
32x49 matrix. GPT-2's embedding size is 768 (i.e., each "token" must have this dimensionality); for this
reason we project the 32x49 matrix into a new 32x768 matrix, having a fixed context of 32 image
tokens.</p>
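A minimal sketch of this flatten-and-project step, assuming a learned projection matrix `w_proj` (trained jointly with the rest of the model):

```python
import numpy as np

def project_cnn_features(feature_map, w_proj):
    """Turn the last convolutional activations into 'image tokens' for GPT-2.

    feature_map: (32, 7, 7) activations (32 = convolution depth);
    w_proj: (49, 768) projection, assumed to be learned during fine-tuning.
    Returns (32, 768): a fixed context of 32 image tokens.
    """
    depth = feature_map.shape[0]
    flat = feature_map.reshape(depth, -1)  # flatten 7x7 spatial grid -> (32, 49)
    return flat @ w_proj                   # project to GPT-2's embedding size
```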
      </sec>
      <sec id="sec-4-2">
        <title>4.2. ViT input embedder</title>
        <p>The Visual Transformer[16][17], often referred to as the ViT (Vision Transformer), is a neural
network architecture designed for computer vision tasks. Unlike traditional convolutional
neural networks (CNNs) that process images using convolutional layers, the Visual Transformer
preprocesses input images in the following way:
• Patch Extraction: the input image is divided into a grid of non-overlapping patches. Each
patch is typically a small square region of the image. For example, if you have an image
of size 224x224 pixels and use a patch size of 16x16, you would have 196 patches (14x14
grid).
• Flattening and Linear Projection: each patch is then flattened into a one-dimensional
vector. This means that the spatial information within each patch is encoded into a
linear sequence of values. These patch embeddings now serve as the input tokens to the
transformer model, in this case with a dimensionality of 768.</p>
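The patch-extraction and flattening steps above can be sketched as follows; for a 224x224x3 image with 16x16 patches this yields 196 patches of dimension 768 (= 16·16·3), which matches GPT-2's embedding size (in ViT a learned linear projection would normally follow).

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image into non-overlapping flattened patches, as in ViT.

    image: (H, W, C) array; returns (num_patches, patch_size*patch_size*C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = (image
               # split both spatial axes into (blocks, within-block) pairs
               .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
               # group the two block axes together, then the within-block axes
               .transpose(0, 2, 1, 3, 4)
               # flatten each patch into a single vector
               .reshape(-1, patch_size * patch_size * c))
    return patches
```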
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>In order to evaluate our results we used a metric called BERTScore. BERTScore [18] is an
evaluation metric for assessing the quality of machine-generated text, such as machine
translation, text summarization, and more. It is designed to address some limitations of traditional
evaluation metrics like BLEU and ROUGE, which often do not correlate well with human
judgment of text quality. Here’s a brief overview of how BERTScore works (see Figure 4):
• Pretrained BERT Model: BERTScore utilizes a pretrained BERT model [20], in order to
capture contextual information from text.
• Sentence Embeddings: BERTScore tokenizes the reference and generated sentences into
subword units and feeds them through the BERT model to obtain contextual embeddings
for each token.
• Cosine Similarity: BERTScore calculates the cosine similarity between the embeddings
of the reference and generated sentences. Cosine similarity measures the similarity in
direction between two vectors and ranges from -1 (completely dissimilar) to 1 (identical).</p>
<p>
In this case, higher similarity scores indicate better quality.
• Token-Level Scoring: BERTScore computes the cosine similarity for each token in the
reference and generated sentences and then computes the geometric mean of these
token-level scores. This geometric mean is taken to account for the order and structure of words
in the sentences.
• Aggregation: BERTScore can be aggregated at the sentence level to obtain a single score
for the entire sentence. This is typically done by averaging the token-level scores.
One notable advantage of BERTScore is its ability to capture semantic and contextual
information, which makes it more aligned with human judgment. Additionally, it does not require exact
matches and is more robust to variations in word choice and word order; it has become a popular
metric for evaluating the quality of text generation tasks and has been used in various natural
language processing applications. It provides a more reliable and interpretable assessment of
generated text compared to traditional metrics. In addition, classical NLP metrics, such as the
BLEU score, are not effective in the context of our application, since the final goal is to obtain
generated reports semantically close to the real reports. In Table 2 we can see some examples
of real and predicted reports; for example, by considering the case in (
          <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
          ), if we evaluate the
predicted report with respect to the real one using the BLEU score, we get a pretty low score since
there are no matching n-grams; however, the two reports indicate the same clinical situation.
The same argument can be made for the case in (
          <xref ref-type="bibr" rid="ref1 ref4">1,4</xref>
          ).
        </p>
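The cosine-similarity computation at the heart of BERTScore can be sketched as below; the input matrices stand in for the contextual BERT embeddings of the reference and candidate tokens (the full metric then matches tokens across the two sentences and aggregates the per-token scores).

```python
import numpy as np

def cosine_similarity_matrix(ref_emb, cand_emb):
    """Pairwise cosine similarities between reference and candidate token
    embeddings, the core quantity BERTScore builds on. The inputs stand in
    for the contextual BERT embeddings of the two sentences.
    """
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    return ref @ cand.T  # entry (i, j) lies in [-1, 1]
```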
<p>Table 3 shows the BERTScore for the different architectures using different generation
decoding types. The best results are obtained using the ViT encoder and GPT-2, generating the text
with a beam search decoder. All the results are obtained using AdamW as optimizer, with a
learning rate equal to 5e-4, epsilon equal to 1e-8 and weight decay equal to 1e-2, and using early
stopping on the validation loss in order to avoid overfitting.</p>
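The early-stopping criterion used above can be written as a small framework-agnostic sketch (the AdamW hyperparameters just listed would configure the optimizer in any standard framework; the `patience` value here is an assumption, not taken from the paper):

```python
class EarlyStopping:
    """Stop training when the validation loss stops improving.

    patience: number of consecutive non-improving epochs tolerated
    (an illustrative default, not the paper's actual setting).
    """
    def __init__(self, patience=3, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience
```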
<p>Examples reported in Table 2 are obtained using this combination; each row of the table pairs an X-ray image ("X-ray" column) with its real and predicted reports ("caption" column).</p>
        <sec id="sec-4-3-1">
<title>Real: Right upper lobe pneumonia or mass.</title>
          <p>Recommend chest ct with
intravenous contrast for further
assessment. Dr. XXX communicated
the above results to dr. XXX at
8:55 am on XXX by telephone.</p>
        </sec>
        <sec id="sec-4-3-2">
<title>Predicted: Right middle lobe pneumonia.</title>
          <p>
            Recommend followup to
resolution. Recommendation(s):
followup to resolution. Notification:
the findings were discussed by dr.
XXX (
            <xref ref-type="bibr" rid="ref1 ref1">1,1</xref>
            )
          </p>
        </sec>
        <sec id="sec-4-3-3">
<title>Real: AP and lateral chest: normal</title>
          <p>heart, lungs, hila, mediastinum and
pleural surfaces.</p>
        </sec>
        <sec id="sec-4-3-4">
<title>Predicted: Cardiac and mediastinal contours stable.</title>
          <p>
            Lungs appear
well inflated without evidence of
focal airspace consolidation,
pleural effusions, pulmonary edema or
pneumothorax (
            <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
            )
          </p>
        </sec>
        <sec id="sec-4-3-5">
<title>Real: Mild bibasilar atelectasis.</title>
          <p>No signs of free air below the right
hemidiaphragm.</p>
        </sec>
        <sec id="sec-4-3-6">
<title>Predicted: Mild bibasilar atelectasis.</title>
          <p>
            No evidence of free air beneath
the diaphragms. No free air under
the diaphragms. (
            <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
            )
          </p>
        </sec>
        <sec id="sec-4-3-7">
          <title>Real: 1. Upper lobe collapse and</title>
          <p>some lower lobe atelectasis around
a large obstructing left hilar mass.
2. Probable small bilateral pleural
effusions.</p>
        </sec>
        <sec id="sec-4-3-8">
          <title>Predicted: There is a left pleural</title>
          <p>
effusion. There is no pneumothorax.
There is atelectasis at the left lung
base. (
            <xref ref-type="bibr" rid="ref1 ref4">1,4</xref>
            )
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussions and future works</title>
<p>In this paper, we presented a multimodal architecture to automatically generate natural language
reports on the basis of radiological images. In particular, the embeddings of chest X-ray images
and their textual reports form the multimodal base to train a transformer-based model, able to
generate new reports describing the findings detected in the X-ray images given as query.</p>
      <p>
An important goal of this work was to focus on pre-trained models to tackle two main
problems affecting the healthcare domain: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) the problem of scarcity of data, which is common
in this field, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
) to minimise the computational resources required by our system, since
the power of the machines usually available in healthcare organizations does not allow for
heavy computation. For the sake of comparison, our models are trained on a single NVIDIA
RTX6000 GPU, while ad-hoc complex models are instead trained on multiple GPUs, typically
NVIDIA A100 cards, which have a large amount of VRAM and high performance in terms of FLOPS,
but at a very high price, beyond the budget of most healthcare organizations. The
average training time per epoch is about 54 minutes, while the average inference time is about
45 seconds using greedy search decoding and about two and a half minutes using beam search decoding.
      </p>
<p>We performed experiments combining different types of models and decoding types. These
experiments show very promising results, especially when combining the ViT input embedder and
GPT-2, with an F1-score of 0.78 in the case of beam search. We would like to emphasise that the
performance of this model can be increased by refining the dataset in different ways, such as
collecting more data from different sources and testing different sampling strategies. Indeed,
we employ a downsampling procedure on the entire dataset to rectify the imbalance in disease
occurrences. Conversely, the utilization of data augmentation techniques, such as targeted
cropping of specific regions in X-ray images, can bring advantages in terms of both raw
performance metrics (e.g., BERTScore) and the model's generalization capacity. Furthermore,
errors committed by the system in localizing some of the findings (e.g., a clinical condition
located in the upper lobe of a lung is described by the system as "lower lobe") could be reduced
by enlarging the dataset using the aforementioned cropping method.</p>
<p>As a prospect for further research, we intend to submit our computer-generated reports
to human experts, in order to validate them through evaluation templates, such as SUS
questionnaires [21], aimed at assessing their practical usability within clinical settings as
supportive tools for healthcare professionals.
[8] A. Singhal, A. Banerjee, R. Sood, Med-palm: A large multimodal pre-trained language
model for medical applications, arXiv preprint arXiv:2104.03495 (2021).
[9] A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y. Hanafy, W. Zhu, K. Marathe, Y. Bitton, S. Gadre,
J. Jitsev, S. Kornblith, P. W. Koh, G. Ilharco, M. Wortsman, L. Schmidt, Openflamingo, 2023.</p>
      <p>URL: https://doi.org/10.5281/zenodo.7733589. doi:10.5281/zenodo.7733589.
[10] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K.
Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M.
Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski,
R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a visual language model for
few-shot learning, ArXiv abs/2204.14198 (2022).
[11] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Y. Deng,
R. G. Mark, S. Horng, Mimic-cxr: A large publicly available database of labeled chest
radiographs, arXiv preprint arXiv:1901.07042 (2019).
[12] M. Endo, R. Krishnan, V. Krishna, A. Y. Ng, P. Rajpurkar, Retrieval-based chest x-ray report
generation using a pre-trained contrastive language-image model, in: Proceedings of
Machine Learning for Health, volume 158 of Proceedings of Machine Learning Research,
2021, pp. 209–219.
[13] P. R. Vignav Ramesh, Nathan Andrew Chi, Improving radiology report generation systems
by removing hallucinated references to non-existent priors, in: arXiv:2210.06340, 2022.
[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
unsupervised multitask learners (2019).
[15] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz,
K. Shpanskaya, et al., Chexnet: Radiologist-level pneumonia detection on chest x-rays
with deep learning, arXiv preprint arXiv:1711.05225 (2017).
[16] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer,
P. Vajda, Visual transformers: Token-based image representation and processing for
computer vision, 2020. arXiv:2006.03677.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical
image database, in: 2009 IEEE conference on computer vision and pattern recognition,
Ieee, 2009, pp. 248–255.
[18] T. Zhang, V. Kishore, F. Wu, K. Weinberger, Y. Artzi, Bertscore: Evaluating text generation
with bert, in: Proc. 8th International Conference on Learning Representations (ICLR20),
2020. URL: https://openreview.net/pdf?id=SkeHuCVFDr.
[19] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text
generation with bert, in: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics (ACL), 2020.
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, 2019. arXiv:1810.04805.
[21] J. Brooke, Sus: A quick and dirty usability scale, Usability Eval. Ind. 189 (1995).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , GPT-4
          <source>technical report, arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zakka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dalmia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
<article-title>Med-flamingo: A multimodal medical few-shot learner</article-title>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2307.15189, arXiv:
          <fpage>2307</fpage>
          .
          <fpage>15189</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          ,
          <volume>36</volume>
          (4) (
          <year>2020</year>
          ), pp.
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altosaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranganath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>ClinicalBERT: Modeling clinical notes and predicting hospital readmission</article-title>
          ,
          <source>arXiv preprint arXiv:1904.05342</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-A.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Walker</surname>
          </string-name>
          , et al.,
          <article-title>PubMedBERT: A pretrained language model for biomedical text mining</article-title>
          ,
          <source>arXiv preprint arXiv:2105.07774</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>BioGPT: A general purpose language model fine-tuned on biomedical text</article-title>
          ,
          <source>arXiv preprint arXiv:2201.05493</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>