<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Exploring the Missing Medical Context in Generated Radiology Reports</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karan Bania</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshvardhan Mestha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanmay Tulsidas Verlekar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSIS, BITS Pilani, K. K. Birla, Goa Campus</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of EE, BITS Pilani, K. K. Birla, Goa Campus</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Recent advances in multimodal LLMs have enabled their use in radiology, where a report can be generated automatically from an X-ray image. The state of the art explores LLMs for this task through prompting or fine-tuning. The emphasis in such cases is on improving the syntax and context of the report with respect to natural language. This paper argues that LLMs are unable to understand the medical context in images and in reports. Thus, the accuracy of the medical context in generated reports requires further evaluation. This paper uses a pre-trained GPT-4o, Qwen2-VL-7B, and a fine-tuned LLaMA-2 model to demonstrate that these LLMs can identify the input images as chest X-rays but are poor at identifying pathologies, even when fine-tuned. It then attempts to address this problem by proposing a pipeline that gives the LLMs access to a discriminative model that can classify the pathologies present in a chest X-ray. The additional context of pathology labels allows the LLMs to generate more accurate reports. A second contribution of the paper concerns the assessment of the generated reports. It illustrates that traditional metrics, such as the BERTScore, are ineffective in assessing generated medical reports. It then presents an LLM-as-a-judge that successfully compares generated and ground truth reports.</p>
      </abstract>
      <kwd-group>
        <kwd>Radiology Report</kwd>
        <kwd>LLM-as-a-judge</kwd>
        <kwd>Report Generation</kwd>
        <kwd>LLM for health</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The field of medicine has traditionally employed discriminative models, such as convolutional neural
networks (CNNs), for tasks such as pathology classification using X-ray images [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With the recent
advancements in generative AI, Large Language Models (LLMs) can now process text and images,
understand complex instructions and generate detailed responses in natural language [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. Thus,
there is a growing interest in the research community in performing tasks such as medical report
generation, medical visual question answering, and disease identification using LLMs [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The field
of medicine has been challenging for LLMs, as they need to distinguish subtle differences in images or
even parts of images to provide accurate text descriptions. A second issue is that the
natural language on which LLMs are trained can use a broad range of statements to communicate the same
meaning. Medical language, on the other hand, is highly specific and comprehensive. Thus, the
mappings learned between natural images and texts are ineffective in the medical domain, where, for instance,
medical reports are generated from chest X-rays.
      </p>
      <p>
        Recently, LLMs such as Flamingo [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have been assessed for their ability to learn from a few
examples in real time for the task of medical visual question answering. The results indicated that while
the model achieved a good BERT score, its exact match with the ground truth was extremely poor. The ability of
a multimodal LLM, such as GPT-4o, to answer image-rich diagnostic radiology exam questions through
prompting is assessed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The results suggest that there is a large variability in the answers obtained
from GPT-4o, which highlights its poor medical image interpretation abilities. They also demonstrate that
GPT-4o is far more effective in answering text-only questions.
      </p>
      <p>
        XrayGPT [9] explored the route of fine-tuning an LLM on medical data and aligning it with a
vision encoder. This allowed the model to answer open-ended questions about chest X-ray images.
However, the results indicated that it lacks the ability to identify specific pathologies. The work
presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also fine-tuned the LLaMA-2 language model on heterogeneous radiological images,
encompassing X-rays, CT scans, and MRIs, for tasks such as disease identification, medical visual
question answering, and the generation of medical reports. It presented a grounding technique that allows
spatial locations to be integrated into the text fed to the language model. While the results appear
promising, they were evaluated using the BERT score, which is ineffective in comparing the generated
reports with the ground truth.
      </p>
      <p>Current state-of-the-art models struggle to capture subtle differences in medical images,
which appear homogeneous when compared with natural images. Thus, to generate reports that
accurately state the pathologies present in chest X-rays, the paper proposes
a two-step pipeline. To capture the subtle differences in chest X-ray images associated with
pathologies, the proposal uses a discriminative model, DenseNet 121, which performs
multi-label pathology classification. The classified pathology labels are then communicated to the
LLM through prompting, along with the chest X-ray. The ability of LLMs to generate natural language
reports, coupled with the discriminative model’s ability to classify pathologies accurately, leads to more
accurate reports. The paper also demonstrates that the reports generated using state-of-the-art models, while
syntactically correct, are poor at communicating medical context. It discusses the drawbacks of
BERT scores in capturing medical context. To address this issue, the paper presents the use of an LLM
as a judge, which evaluates the generated report in a structured manner by comparing it with the
ground truth report.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed pipeline</title>
      <p>
        In the default pipeline, an LLM is directly prompted to generate a report for a given chest X-ray
image. While the generated report appears natural, it misses much of the medical
context. It is observed that the missing context usually concerns pathologies that affect a small
part of the whole image (see Section 3). Since the LLMs cannot implicitly capture the context of the
pathologies, the paper proposes a pipeline where the possible pathologies are explicitly communicated
to the LLM as additional context (see Figure 1). This is done by giving the LLM access to the output of a
classifier that can classify chest X-ray images according to the pathologies present in them.
This leads to the generation of reports that communicate the medical diagnosis more accurately.
To evaluate the pipeline, the paper considers three LLMs, GPT-4o [10], Qwen2-VL [11] and
MiniGPT-Med [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and a discriminative model, DenseNet 121 [12].
      </p>
      <sec>
        <title>2.1. GPT-4o</title>
        <p>The Generative Pre-trained Transformer (GPT)-4o is a decoder-only transformer model trained to
predict the next token in a document [10]. It is also multimodal and can process text, images, and
audio. It achieves this through a post-training alignment process. It can follow a variety of instructions,
making it useful for classification, generation and logical analysis. To perform classification and report
generation in the proposed pipeline, all parameters are left at their default values except max_tokens,
which is set to 500. When GPT-4o is used as the LLM-as-a-judge for evaluation, the seed is set to 42
and the temperature is set to 0.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. MiniGPT-Med</title>
        <p>MiniGPT-Med is composed of a visual backbone, a linear projection layer, and an LLM. The visual
backbone is a specialized Contrastive Language-Image Pretraining (CLIP) model called EVA-CLIP [13]
that is kept frozen during the training of MiniGPT-Med, while the LLM is fine-tuned.
This results in an association between visual concepts and text. MiniGPT-Med uses
Llama 2-Chat (7B) [14] for classifying pathologies and generating reports. It is trained on extensive medical
knowledge and is fine-tuned for dialogue use cases. To perform classification, the mode is set to vqa
with the temperature set to 0.0 and top_p set to 1.0. For report generation, the mode is set to caption
with a temperature of 0.9 and a top_p of 0.9, as instructed in [13].</p>
        <p>[Figure: (a) Report generation using a generative classifier. (b) Report generation using a discriminative classifier.]</p>
        <sec>
          <title>2.3. Qwen2-VL</title>
          <p>Qwen2-VL-7B-Instruct [11] is an open-source LLM capable of image-text understanding and
reasoning. The language component of Qwen2-VL-7B is the Qwen2-7B base model. Its vision encoder
is a vision transformer that converts images into a variable number of visual tokens that
the LLM can process. This allows the model to process images of arbitrary resolution. While this paper
uses only the model’s image processing capabilities, it can also handle videos through its multimodal rotary
position embedding. To evaluate the ability of Qwen2-VL-7B-Instruct in pathology classification
and report generation, its temperature is set to 0.7, top_p to 0.8 and max_tokens to 512.</p>
        </sec>
        <sec>
          <title>DenseNet 121</title>
          <p>DenseNet 121 is a CNN presented in [12] that classifies images
across 1000 different categories. It consists of 121 layers organised into four dense blocks separated
by transition layers. Within a dense block, the feature maps of a layer are concatenated with the feature
maps of all preceding layers to maximise information flow. This paper considers a DenseNet 121 model
that is pre-trained on the CheXpert [15] dataset. CheXpert is a large public dataset consisting of 224,316
chest X-ray images of 65,240 patients labelled across 13 different pathologies. Since these pathologies
differ from those in the Radiopaedia dataset [16] considered for evaluation, the pre-trained
DenseNet 121 model is fine-tuned on the Radiopaedia dataset to perform multi-label classification
across the 10 pathologies present in it.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>To evaluate the proposed pipeline and compare it with the state of the art, the paper samples 200 X-ray
images along with the corresponding reports from the Radiopaedia dataset [16]. Each report in the dataset
communicates the presence of one or more of the following pathologies: Atelectasis, Cardiomegaly,
Calcifications, Pleural Effusion, COPD, Lung Nodules, Mesothelioma, Tuberculosis, Pneumothorax, and
Pneumonia, making this a multi-label classification problem. The subset is selected such that
class imbalance is minimal, as illustrated in Figure 2.</p>
      <p>For fine-tuning the DenseNet 121 model, five-fold cross-validation is adopted over the 200 chest
X-ray images. The model is fine-tuned for 75 epochs with a learning rate of 0.0005 and a batch size
of 8, using the Adam optimizer and binary cross-entropy loss. Thus, each image is used for validation
exactly once across the five folds. The experiments are performed on a single NVIDIA A100 GPU with 40 GB of
memory.</p>
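      <p>The cross-validation protocol and loss described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' training code: the helper names, the shuffling seed and the way folds are cut are assumptions, and the actual DenseNet 121 fine-tuning loop is omitted.</p>
      <preformat>
```python
import math
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every index is validated exactly once.

    The seed and the strided fold assignment are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over independent per-pathology outputs."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


# 200 images, as in the evaluation subset: five folds of 40 validation images.
splits = list(five_fold_splits(200))
validated = sorted(i for _, val in splits for i in val)
```
      </preformat>
      <p>Because the folds partition the 200 indices, collecting the validation indices of all five folds recovers every image exactly once, which is what allows a prediction to be reported for each image.</p>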
      <sec id="sec-3-1">
        <title>3.1. Multi-label classification results</title>
        <p>Both generative and discriminative models are evaluated for multi-label classification. When using the
discriminative model, the Classifier outputs a set of labels representing pathologies. The generative
models are prompted for the presence of each of the ten pathologies (see Algorithm 1). The models respond
with a ‘yes’ or a ‘no’ for each pathology. The list of pathologies answered with ‘yes’ makes up
the output of the Classifier.</p>
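        <p>One way to turn the per-pathology ‘yes’/‘no’ answers into a label set is sketched below. The function names and the parsing rule (an answer counts as positive if it starts with "yes", and a missing answer counts as "no") are assumptions made for illustration; the text does not specify how responses are parsed.</p>
        <preformat>
```python
# The ten pathologies considered in the evaluation.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Calcifications", "Pleural Effusion", "COPD",
    "Lung Nodules", "Mesothelioma", "Tuberculosis", "Pneumothorax", "Pneumonia",
]


def responses_to_labels(responses):
    """Map per-pathology 'yes'/'no' answers to the list of predicted pathologies.

    Hypothetical parsing rule: missing answers are treated as 'no'.
    """
    labels = []
    for name in PATHOLOGIES:
        answer = responses.get(name, "no").strip().lower()
        if answer.startswith("yes"):
            labels.append(name)
    return labels


def labels_to_vector(labels):
    """Encode a label list as a 0/1 vector ordered like PATHOLOGIES."""
    present = set(labels)
    return [1 if name in present else 0 for name in PATHOLOGIES]


example = {"Cardiomegaly": "Yes.", "Pneumonia": "no", "Pleural Effusion": " yes"}
predicted = responses_to_labels(example)
vector = labels_to_vector(predicted)
```
        </preformat>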
        <p>The performance of the considered models is reported in Table 1. To obtain the results with
MiniGPT-Med, Qwen2-VL-7B and GPT-4o, the models are prompted to classify the pathologies using the prompt in
Figure 3. The accuracy, F1 score and exact match ratio reported in Table 1 suggest that LLMs are
not effective in capturing the context of pathologies in the chest X-ray images.</p>
        <p>MiniGPT-Med and GPT-4o indicate the presence of most pathologies in every X-ray image.
Meanwhile, Qwen2-VL-7B reports the absence of pathologies in most X-ray images, resulting in an
accuracy of 81.87% but an F1 score of 0.10. DenseNet 121 has the best F1 score, 0.2466. To further
highlight the advantage of using DenseNet 121 in the proposed pipeline over the LLMs for the task
of multi-label classification, Table 1 also reports the exact match ratio for each model. The exact match
ratio counts only those samples that have all their labels correctly classified. DenseNet 121’s exact
match ratio is 4.5 times higher than that of the considered LLMs.</p>
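        <p>The gap between accuracy, F1 score and exact match ratio can be reproduced on a toy example. The sketch below is illustrative only: the data are invented, and the F1 score is micro-averaged, which is an assumption since the averaging scheme is not stated in the text.</p>
        <preformat>
```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted correctly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def label_accuracy(y_true, y_pred):
    """Fraction of individual (sample, pathology) decisions that are correct."""
    correct = sum(ti == pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = sum(len(t) for t in y_true)
    return correct / total


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all labels (averaging scheme is an assumption)."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for ti, pi in zip(t, p):
            tp += ti == 1 and pi == 1
            fp += ti == 0 and pi == 1
            fn += ti == 1 and pi == 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: three images, four pathologies, one positive label each.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
# A model that always answers 'no', similar in spirit to the behaviour described above.
all_no = [[0, 0, 0, 0] for _ in y_true]

acc = label_accuracy(y_true, all_no)
f1 = micro_f1(y_true, all_no)
emr = exact_match_ratio(y_true, all_no)
```
        </preformat>
        <p>In this toy case the always-negative model reaches an accuracy of 0.75 while its F1 score and exact match ratio are both 0, which is why accuracy alone is a misleading indicator when most labels are negative.</p>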
        <p>It should be noted that the comparisons in Table 1 do not suggest that DenseNet 121 is objectively
a better classifier than the LLMs considered in this paper. However, the results do support the use of
DenseNet 121 for the multi-label classification task in the proposed pipeline, as it provides the most
accurate medical context for report generation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Report generation results</title>
        <p>The second step of the proposed pipeline uses the pathology labels obtained from the Classifier as
additional context to prompt the Generator for report generation (see Algorithm 1). The Generator
is an LLM that is instructed to produce a report using the prompt defined in Figure 3.</p>
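        <p>The two-step flow can be sketched as below. The prompt wording is hypothetical and does not reproduce the prompt in Figure 3, and the classifier and generator are passed in as plain callables; the stubs at the end stand in for DenseNet 121 and an LLM purely for illustration.</p>
        <preformat>
```python
def build_report_prompt(pathologies):
    """Compose a report-generation prompt that carries classifier labels as context.

    The wording is an illustrative assumption, not the prompt used in the paper.
    """
    if pathologies:
        context = ("A classifier suggests the following pathologies: "
                   + ", ".join(pathologies) + ".")
    else:
        context = "A classifier suggests no pathologies from the considered list."
    return ("You are given a chest X-ray image. " + context
            + " Write a radiology report for the image.")


def run_pipeline(image, classifier, generator):
    """Step 1: classify pathologies. Step 2: generate a report with them as context."""
    labels = classifier(image)
    prompt = build_report_prompt(labels)
    return generator(image, prompt)


# Stubs standing in for the discriminative model and the LLM (hypothetical).
def stub_classifier(image):
    return ["Cardiomegaly", "Pleural Effusion"]


def stub_generator(image, prompt):
    return "PROMPT: " + prompt


report = run_pipeline("xray.png", stub_classifier, stub_generator)
```
        </preformat>
        <p>Keeping the classifier behind a callable makes it straightforward to swap a generative classifier for a discriminative one without touching the report-generation step.</p>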
        <sec id="sec-3-2-1">
          <title>3.2.1. Evaluation using BERT similarity score</title>
          <p>The generated report is evaluated by matching it with the ground truth using BERT embeddings [17].
The embeddings are obtained by passing a report to a Bidirectional Encoder Representations from
Transformers (BERT) encoder-only model, which is capable of capturing deep bidirectional context
between words. Thus, for evaluation, the generated and ground truth reports are converted
into embeddings, and a similarity score is measured between them using cosine similarity as follows:
<disp-formula><tex-math>\mathrm{similarity} = \frac{[\mathrm{CLS}]_{A} \cdot [\mathrm{CLS}]_{B}^{T}}{\operatorname{norm}([\mathrm{CLS}]_{A}) \odot \operatorname{norm}([\mathrm{CLS}]_{B})}</tex-math></disp-formula>
where A and B represent the generated and ground truth reports, respectively, and [CLS] denotes the embedding of BERT’s [CLS] token for the corresponding report.</p>
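          <p>The similarity computation itself is a plain cosine similarity between two vectors. The sketch below uses short made-up vectors as stand-ins for the [CLS] embeddings; obtaining real embeddings from a BERT model is outside its scope.</p>
          <preformat>
```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for the [CLS] embeddings of a generated and a ground truth report.
cls_generated = [0.2, 0.7, 0.1]
cls_reference = [0.4, 1.4, 0.2]  # same direction, different magnitude
score = cosine_similarity(cls_generated, cls_reference)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```
          </preformat>
          <p>Since the score depends only on the direction of the two vectors, reports whose embeddings point the same way score close to 1 regardless of whether they agree on the presence of a pathology.</p>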
          <p>The BERT similarity score reported in Table 2 indicates that all the models have similar performance,
with a similarity score of approximately 0.7 with the ground truth. However, it can be empirically
assessed that while the BERT similarity score captures the semantics and grammatical consistencies, it
fails to capture the logic and the medical context. As illustrated in Figure 4, a generated report predicting
the presence of pathology when compared with the ground truth reporting its absence gets a high
BERT embedding similarity score of 0.8850.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Evaluation using GPT-4o as a Judge</title>
          <p>To address the limitations of the BERT similarity score, the paper proposes to use GPT-4o [10] as a
judge to evaluate the generated reports. Recent work in [18] reports that GPT-4o has a strong logical
reasoning ability and a comprehensive natural language understanding. It also performs better than
most other LLMs in logical reasoning tasks using datasets that are out-of-distribution for the GPT-4o.
This motivates the use of GPT-4o to compare the generated reports with the ground truth. To evaluate
the similarity between the generated report and ground truth, the GPT-4o is prompted with four
questions, to which the GPT-4o responds with either a ‘yes’ or a ‘no’ response. The ground truth was
compared with the generated reports manually, and it was observed that the ground truth significantly
deviated from the generated reports when reporting: number of pathologies, consistency with the
pathology description, the location and number of ailments. These observations motivated the questions
for the LLM. The prompt for the LLM is presented in Figure 5. It follows the Tree-of-Thought [19]
prompting strategy, where the depth of the tree is one. A normalised score is then calculated for the
responses, reported in Table 2.</p>
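          <p>A possible way to aggregate the judge’s answers is sketched below. The text does not define how the normalised score is computed, so the rule used here (the fraction of ‘yes’ answers, averaged per question and then over questions) is an assumption, as are the function names and the example answers.</p>
          <preformat>
```python
def normalised_judge_score(answers):
    """answers maps a question id to 'yes' or 'no'; returns the fraction of 'yes'."""
    votes = [1 if a.strip().lower().startswith("yes") else 0 for a in answers.values()]
    return sum(votes) / len(votes)


def dataset_scores(all_answers):
    """Return the per-question mean score and the mean over questions."""
    questions = sorted(all_answers[0])
    per_question = {
        q: sum(normalised_judge_score({q: r[q]}) for r in all_answers) / len(all_answers)
        for q in questions
    }
    overall = sum(per_question.values()) / len(per_question)
    return per_question, overall


# Invented judge outputs for two reports and four questions (A to D).
judged = [
    {"A": "yes", "B": "no", "C": "yes", "D": "yes"},
    {"A": "no", "B": "no", "C": "Yes", "D": "yes"},
]
per_question, overall = dataset_scores(judged)
```
          </preformat>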
          <p>MiniGPT-Med, a fine-tuned LLM, performs poorly in comparison to GPT-4o. Its poor
performance can be attributed to the CLIP model, which is frozen during the fine-tuning
process. When MiniGPT-Med is prompted with its own classification labels, the results are poorer still:
the poor classification labels mislead the model into generating bad reports. When it is prompted with
the output of DenseNet 121, the results improve, suggesting that the discriminative model can
correct the LLM in the proposed pipeline.</p>
          <p>The proprietary GPT-4o model performs much better,
especially when GPT-4o itself serves as the classifier in the proposed pipeline. This allows the
LLM to break the problem down into first classifying the pathologies and then using them as context to
generate the report, which improves the results over the direct use of GPT-4o by 0.018. Using DenseNet
121 as the classifier further improves the results by 0.06. The Qwen2-VL-7B model sees a similar
improvement when using DenseNet 121 as the classifier in the proposed pipeline. When
Qwen2-VL-7B uses another Qwen2-VL-7B model as the classifier, the
generated reports appear to be worse than those generated without the context of the pathologies. This
further supports the observation that Qwen2-VL-7B is a poor classifier despite its good reported accuracy
in Table 1.</p>
          <p>Table 2 reports a question-wise evaluation of the two pipelines. To assess the capability
of GPT-4o as a judge, the ground truth is matched with itself using GPT-4o. For questions A,
B and D, GPT-4o returns a score of 1.0, which suggests that it understands the context of these
questions and scores them correctly. Question C returns a score of ≈ 0.7, which suggests that GPT-4o
does not fully understand the context of this question and that the question may require further refinement. For a preliminary assessment,
however, this is considered adequate. The question-wise scores for the proposed pipeline suggest that
the context obtained from the discriminative model is meaningful in improving the quality of the
generated reports. However, the low absolute scores for each question suggest that the LLMs require
significant improvement before the reports can be trusted for their medical credibility.</p>
          <p>The work presented in this paper is available at https://github.com/Harshvardhan-Mestha/DETerGENt/.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The paper assesses a pre-trained GPT-4o, a fine-tuned MiniGPT-Med model and the open-source
Qwen2-VL-7B LLM for their ability to generate reports for chest X-ray images. It concludes that the poor report
generation can be attributed to the inability of the models to capture the subtle context of pathologies
in the X-ray images. To address this issue, the paper proposes a pipeline that combines an LLM with a
discriminative model. The discriminative model identifies the pathologies with a better F1 score and
exact match ratio than the LLMs. The classification results are used by the LLMs as added context to generate
the report. The generated reports are first compared with the ground truth using the BERT embedding
similarity score. While this score checks for semantic similarity, it fails to assess the logic and medical context in the
reports. Thus, the paper proposes to use an LLM as a judge. It uses GPT-4o to evaluate the generated
reports across four criteria to produce a normalised score: whether the pathologies match the
ground truth, whether they are located in the same place, whether the number of pathologies is consistent,
and whether they match their descriptions. The results suggest that the proposed two-step pipeline generates
better reports than the default pipeline.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[8] D. L. Payne, K. Purohit, W. M. Borrero, K. Chung, M. Hao, M. Mpoy, M. Jin, P. Prasanna, V. Hill, Performance of GPT-4 on the American College of Radiology in-training examination: Evaluating accuracy, model drift, and fine-tuning, Acad. Radiol. 31 (2024) 3046–3054.</p>
      <p>[9] O. Thawkar, A. Shaker, S. S. Mullappilly, H. Cholakkal, R. M. Anwer, S. Khan, J. Laaksonen, F. S. Khan, XrayGPT: Chest radiographs summarization using medical vision-language models, 2023. URL: https://arxiv.org/abs/2306.07971. arXiv:2306.07971.</p>
      <p>[10] OpenAI, GPT-4o system card, 2023. arXiv:2303.08774.</p>
      <p>[11] P. Wang, S. Bai, S. Tan, et al., Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution, arXiv preprint arXiv:2409.12191 (2024).</p>
      <p>[12] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, 2018. URL: https://arxiv.org/abs/1608.06993. arXiv:1608.06993.</p>
      <p>[13] Q. Sun, Y. Fang, L. Wu, X. Wang, Y. Cao, EVA-CLIP: Improved training techniques for CLIP at scale, 2023. URL: https://arxiv.org/abs/2303.15389. arXiv:2303.15389.</p>
      <p>[14] H. Touvron, L. Martin, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. URL: https://arxiv.org/abs/2307.09288. arXiv:2307.09288.</p>
      <p>[15] J. Irvin, P. Rajpurkar, M. Ko, et al., CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, 2019. URL: https://arxiv.org/abs/1901.07031. arXiv:1901.07031.</p>
      <p>[16] Radiopaedia, A collaborative educational web resource for radiology, 2007. Retrieved October 23, 2024, from https://radiopaedia.org.</p>
      <p>[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
      <p>[18] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, Y. Zhang, Evaluating the logical reasoning ability of ChatGPT and GPT-4, 2023. URL: https://arxiv.org/abs/2304.03439. arXiv:2304.03439.</p>
      <p>[19] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. R. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, in: Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL: https://openreview.net/forum?id=5Xc1ecxO1h.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Viviano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bertin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Morrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torabian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guarrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaudhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hashir</surname>
          </string-name>
          , H. Bertrand,
          <article-title>TorchXRayVision: A library of chest X-ray datasets and models</article-title>
          ,
          <source>in: Medical Imaging with Deep Learning</source>
          ,
          <year>2022</year>
          . URL: https://github.com/mlmed/torchxrayvision.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI,
          <article-title>GPT-4 technical report</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.08485. arXiv:2304.08485.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <article-title>Introducing the next generation of Claude: Claude 3 model family</article-title>
          ,
          <year>2024</year>
          . URL: https://www.anthropic.com/news/claude-3-family, accessed: 2024-10-21.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>LLM-CXR: Instruction-finetuned LLM for CXR image understanding and generation</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2305.11490. arXiv:2305.11490.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alkhaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alnajim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alabdullatef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alyahya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alsinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <article-title>MiniGPT-Med: Large language model as a general interface for radiology diagnosis</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.04106. arXiv:2407.04106.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          , P. Luc, et al.,
          <article-title>Flamingo: a visual language model for few-shot learning</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2204.14198. arXiv:2204.14198.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Borrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mpoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prasanna</surname>
          </string-name>
          , V. Hill,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>