<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>YuanAI at MEDIQA-MAGIC 2024: Improving Medical VQA Performance through Parameter-Efficient Fine-Tuning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hsian-Hong Fu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsien-Cheng Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yuan Ze University</institution>
          ,
          <addr-line>Taoyuan</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>In our participation in the MEDIQA-MAGIC[1] 2024 workshop at CLEF, we employed the mediqa-m3g dataset[2] for fine-tuning and one-shot sampling. Our primary models for inference were Llama3 and Gemini, where Gemini served as the Vision-Language Pre-training (VLP) model and Llama3 was utilized for downstream tasks. We focused on parameter-efficient fine-tuning of Llama3 using Low-Rank Adaptation (LoRA). Our approach achieved notable results, including a DELTA-BLEU score of 4.461 and a BERTScore of 0.855, the highest in the task competition. This study underscores the efficacy of parameter-efficient fine-tuning techniques in enhancing medical visual question answering (VQA) performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models (LLM)</kwd>
        <kwd>Large Multimodal Model (LMM)</kwd>
        <kwd>Llama3</kwd>
        <kwd>Gemini</kwd>
        <kwd>Parameter-Efficient Fine-Tuning (PEFT)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Large Language Models</title>
        <p>
          Since the release of transformers, numerous large language models (LLMs) have been introduced,
including classic models like BERT and the GPT[
          <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
          ] series. Starting from GPT-3[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], many models
have gained widespread attention for their exceptional performance in few-shot and zero-shot learning
scenarios. Among them, the recently released Llama3[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] by Meta has achieved remarkable results
across various datasets compared to other open-source LLMs.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Parameter-Efficient Fine-Tuning (PEFT)</title>
        <p>Today’s large language models (LLMs) require high-quality and extensive datasets, so fine-tuning
an entire model without additional medical data is impractical. Parameter-Efficient Fine-Tuning (PEFT)
can achieve more desirable responses even with a small amount of data. PEFT encompasses various
techniques, such as "Reparameterized, Additive, Partial and Hybrid Fine-Tuning"[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
each offering diverse fine-tuning methods.
        </p>
        <p>
          In Reparameterized Fine-Tuning, notable methods like LoRA[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] introduce additional learnable
low-rank matrices into linear or attention layers.
        </p>
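        <p>As a rough illustration of the reparameterization idea (a minimal sketch of the published LoRA formulation, not code from the methods cited here), the frozen weight W is left untouched while a trainable low-rank product B·A, scaled by alpha/r, is added to the layer output:</p>
        <preformat>
import numpy as np

d, k, r, alpha = 768, 768, 16, 16        # layer shape, LoRA rank and scaling (illustrative)
W = np.random.randn(d, k) * 0.02         # frozen pre-trained weight
A = np.random.randn(r, k) * 0.01         # trainable low-rank factor A
B = np.zeros((d, r))                     # trainable factor B, zero-initialized

def lora_linear(x):
    """Forward pass of a LoRA-adapted linear layer: W x + (alpha / r) * B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(k)
y = lora_linear(x)                       # equals W @ x until A and B are trained
        </preformat>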
        <p>
          Additive Fine-Tuning involves fixing the LLM parameters and adding learnable parameters in front of
the original prompt. Examples include prefix-tuning[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which trains new prefixes to provide context
and guide LLMs to produce appropriate answers, and Parameter-Efficient Prompt Tuning[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which
modifies the input layer context without changing the core model parameters. Adapter-based methods,
such as AdapterDrop[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and Hadamard Adapters[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], introduce adapters to adapt the model to new
tasks.
        </p>
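        <p>To make the additive idea concrete, the following minimal sketch (assumed dimensions and a generic bottleneck design, not code from the cited adapter papers) shows an adapter whose small down- and up-projections are the only trainable parameters, added residually to a frozen layer’s output:</p>
        <preformat>
import numpy as np

d_model, d_bottleneck = 768, 32          # illustrative sizes

W_down = np.random.randn(d_model, d_bottleneck) * 0.01  # trainable down-projection
W_up = np.zeros((d_bottleneck, d_model))                # trainable up-projection, zero-init

def adapter(h):
    """Residual bottleneck adapter applied to a frozen layer output h."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up       # ReLU bottleneck added residually

h = np.random.randn(d_model)
h_adapted = adapter(h)                   # equals h until the adapter is trained
        </preformat>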
        <p>
          Hybrid Fine-Tuning methods, such as MAM Adapter[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and AutoPEFT[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], establish connections
between the aforementioned techniques. These methods allow for effective model fine-tuning to adapt
to downstream tasks even without access to high-quality and extensive datasets.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Vision-Language Models</title>
        <p>
          Vision language models encompass a wide range of domains, involving both text and images. These
models integrate distinct features from both modalities to perform various downstream tasks, such
as automatic subtitle generation for videos and visual question answering (VQA). The vision
component of these models can be traced back to the field of image classification, where labels are used to
categorize each image, often with multiple labels per image. Backbone networks such as ResNet[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ],
EfficientNet[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], ViT[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], and Swin[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] are employed to extract features for classification purposes.
Subsequently, tasks like image captioning have emerged, where backbone networks extract image
features and a language model generates simple text descriptions of the image content. Notable
examples of this approach include the CLIP[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and BLIP series[
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ]. In the context of our task
presented in this paper—MEDIQA-MAGIC[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]—we focus on the VQA domain. Here, we extract features
from images or convert them into textual descriptions, which are then processed by large language
models (LLMs) to produce the desired descriptions.
        </p>
        <p>
          Visual Question Answering (VQA) has long been an established field. Historically, models like R-CNN
or ResNet were used to extract image features, which were then aligned with text using models like
BERT or transformers to generate answers, as seen in models like ViLBERT[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], VL-BERT[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], and
Pixel-BERT[
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>
          One mainstream method involves using an image encoder and a text encoder, followed by a
transformer to combine the two features. Notable examples of this approach include models like LLaVA[
          <xref ref-type="bibr" rid="ref25">25</xref>
          ],
BEiTv3[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], and InternVL[
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overall Approach</title>
        <p>
          In this paper, we describe the techniques we’ve employed in the MEDIQA-MAGIC[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] task, focusing
on data processing, model fine-tuning, and inference. To efficiently extract information from images
without re-training a multimodal model, we utilized a Vision-Language Pre-trained (VLP) model to
obtain image information. This information was then combined with the text extracted from the images.
Subsequently, only the downstream Large Language Model (LLM) required fine-tuning. This approach
minimizes GPU memory usage during fine-tuning and enhances overall efficiency. In this task, we used
only a single T4 GPU with 16 GB of memory on Colab for both fine-tuning and inference.
        </p>
        <p>
          In this section, we will provide a detailed explanation of the steps involved, including data
preprocessing, model pretraining, model inference, and post-processing. As illustrated in Figure 3.1, we adopted
the approach of converting images to text using a VLP model to bridge the gap between images and
text. However, VLP models without specific pretraining may not effectively convert medical images.
Therefore, without using additional training datasets, we employed the more universally effective
Gemini[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] for this role. For the LLM, we used the latest Llama3-8b[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] model.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Image2Text Processing</title>
        <p>
          We used Gemini[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] for data preprocessing to convert images into text. Gemini offers better image
descriptions compared to VLP models like CLIP[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and BLIP[
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ].
        </p>
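        <p>A minimal sketch of this preprocessing step is given below, assuming the google-generativeai Python client and a vision-capable Gemini model; the prompt wording, model name, and file path are illustrative rather than a transcript of our actual calls:</p>
        <preformat>
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")                # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")     # vision-capable Gemini model

def image_to_text(image_path):
    """Ask Gemini for a detailed textual description of a dermatology image."""
    prompt = ("Describe this medical image in detail, including the visible skin "
              "condition, its location, color, shape, and size.")
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text.strip()

# Example usage (path is illustrative):
# print(image_to_text("mediqa-m3g/images/encounter_0001.jpg"))
        </preformat>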
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Pretraining on the Text2Text Model</title>
        <p>
          Using the LoRA method provided by UnSloth, we incorporated LoRA[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] into Llama3-8b[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] with their
default parameters: r=16, dropout=0, alpha=16. Subsequently, we employed SFT (Supervised
Fine-Tuning) to optimize LoRA so that outputs match the style of the correct answers. During fine-tuning, we did
not utilize the query_content_en since most of the training data lacks substantial content. Therefore,
we only used query_title_en as inputs. Additionally, for all instructions, we uniformly used "Give me a
medical advice" as the prompt, then input the patient context inside query_title_en.
        </p>
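        <p>The fine-tuning step can be sketched as follows. For readability the sketch uses the Hugging Face peft and trl libraries rather than UnSloth’s wrappers, but with the same LoRA hyperparameters (r=16, alpha=16, dropout=0) and the same fixed instruction; apart from query_title_en, the field names and the toy sample are illustrative placeholders:</p>
        <preformat>
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

base = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", load_in_4bit=True)

# LoRA with the defaults we used: r=16, alpha=16, dropout=0.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

# Toy stand-in for the mediqa-m3g training split; only query_title_en is a real field name.
samples = [{"query_title_en": "Itchy red rash on my forearm for two weeks",
            "answer": "This looks like contact dermatitis; avoid the likely irritant ..."}]

def to_text(s):
    # Fixed instruction; only query_title_en is used as the patient context.
    return {"text": f"Instruction: Give me a medical advice\n"
                    f"Patient: {s['query_title_en']}\nAnswer: {s['answer']}"}

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=Dataset.from_list(samples).map(to_text),
    dataset_text_field="text", max_seq_length=2048,
    args=TrainingArguments(output_dir="llama3-lora-mediqa",
                           per_device_train_batch_size=1, num_train_epochs=3))
trainer.train()
        </preformat>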
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Text Prompt Engineering</title>
        <p>Initially, we used a fixed training example for one-shot learning, which meant we selected a single,
consistent example from our training data. However, this approach proved ineffective due to the
varying suitability of examples for different tasks. To improve this, we switched to randomly selecting
an example from our training data for each one-shot attempt. This enhancement allowed us to better
match the diversity of tasks and improve the model’s adaptability.</p>
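        <p>The change itself is small; a sketch of the random per-query sampling is shown below, with a hypothetical answer field name:</p>
        <preformat>
import random

def pick_shots(training_samples, n_shots):
    """Randomly select n_shots (question, answer) pairs from the training data for each query."""
    shots = random.sample(training_samples, n_shots)
    return "\n\n".join(f"Patient: {s['query_title_en']}\nAnswer: {s['answer']}"  # 'answer' is illustrative
                       for s in shots)

# A fresh random sample is drawn for every question instead of reusing one fixed example.
        </preformat>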
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Model Inference (LLM) &amp; Post-Processing (Output)</title>
        <p>During the inference phase, we not only utilize the query_title_en of the current question but also
incorporate zero, one, two, or four shots randomly selected from the training data. We refrain from
using more shots due to the limitations imposed by max_length. Initially, during testing, we included
the text generated from image conversion in the one-shot input. However, experimental results showed
a decrease in performance on validation due to the max_length constraint. Therefore, in the final
evaluation, only the text generated from the current question’s image conversion is included as part of
the input.</p>
        <p>After generating the text, leading and trailing whitespace is removed from the output. However, due
to the limitation of max_new_tokens, it’s not guaranteed that &lt;end_token&gt; will appear in every output.
Thus, there’s no assurance of complete answers in the actual response every time.</p>
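        <p>Putting these pieces together, inference can be sketched as follows: randomly sampled shots without image text, the Gemini description of the current question’s image only, generation bounded by max_new_tokens, and whitespace stripping of the decoded output. Helper and field names other than query_title_en are illustrative placeholders:</p>
        <preformat>
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"           # LoRA weights would be loaded on top of this
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

def build_prompt(query_title_en, image_text, shots):
    header = "Instruction: Give me a medical advice\n\n"
    examples = "\n\n".join(f"Patient: {s['query_title_en']}\nAnswer: {s['answer']}"
                           for s in shots)              # shots carry no image text
    current = (f"\n\nPatient: {query_title_en}\n"
               f"Image description: {image_text}\nAnswer:")
    return header + examples + current

def answer(query_title_en, image_text, training_samples, n_shots=2):
    prompt = build_prompt(query_title_en, image_text,
                          random.sample(training_samples, n_shots))
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]   # keep only the generated continuation
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        </preformat>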
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Evaluation Methods</title>
        <p>In Table 1, we observe a total of 310 training samples, out of which 271 include image data. For the
validation set, there are 50 samples, with 44 containing images. The test set comprises 93 samples, all of
which include images.</p>
        <p>
          Regarding evaluation metrics, we employ two scoring methods: DeltaBLEU[
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] and BERTScore[
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
BERTScore[
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] leverages contextual embeddings from BERT pre-trained models to compute the F1
score based on the maximum cosine similarity between two sentences, with a maximum score of
100. The actual algorithm for BERTScore is shown in Equation 3. DeltaBLEU[
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] is derived
from BLEU and incorporates weights derived from human qualitative judgments, as well as the
maximum number of n-gram matches across multiple reference answers, yielding a maximum score of
1. The actual algorithm for DeltaBLEU is shown in Equation 4.
        </p>
        <p>
          Precision(X, Y) = (1 / |X|) ∑_{x ∈ X} max_{y ∈ Y} sim(x, y)   (1)
        </p>
        <p>
          Recall(X, Y) = (1 / |Y|) ∑_{y ∈ Y} max_{x ∈ X} sim(x, y)   (2)
        </p>
        <p>
          BertF1Score(X, Y) = 2 · Precision(X, Y) · Recall(X, Y) / (Precision(X, Y) + Recall(X, Y))   (3)
        </p>
        <p>
          ΔBLEU = ∑_n ∑_{g ∈ n-grams(h)} max_{r : g ∈ r} { w_r · #_g(h, r) } / ∑_n ∑_{g ∈ n-grams(h)} max_r { w_r · #_g(h) }   (4)
        </p>
        <p>
          Here X and Y are the sets of contextual token embeddings of the candidate and reference sentences, sim(·, ·) is cosine similarity, h is the hypothesis, r ranges over the reference answers with human-judgment weights w_r, and #_g(·) counts occurrences of the n-gram g.
        </p>
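        <p>To make Equations 1–3 concrete, the short example below computes the greedy-matching precision, recall, and F1 from a toy cosine-similarity matrix; it illustrates the BERTScore formulation only and is not the official scoring code used in the shared task:</p>
        <preformat>
import numpy as np

# Toy cosine-similarity matrix sim[i, j] between candidate token i and reference token j.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4]])

precision = sim.max(axis=1).mean()   # each candidate token matched to its best reference token
recall = sim.max(axis=0).mean()      # each reference token matched to its best candidate token
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.85 0.7 0.768
        </preformat>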
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Result</title>
        <p>Table 2 shows a significant difference in scores before and after fine-tuning. Without fine-tuning, the
performance of one-shot learning heavily depends on the quality of the example used. Since we used
random sampling from the training data to assist LLM predictions during one-shot tasks, if the sampled
data were of low quality, the scores might not improve even with multiple samples. After applying
LoRA fine-tuning, the model showed a trend where more shots generally resulted in higher scores.
However, due to the max_token limitation of LLMs, we could not indefinitely add more shots to enhance
performance.</p>
        <p>The following three descriptions are examples of Gemini’s image-to-text output (cf. Table 4):</p>
        <p>The image shows a close-up of a person’s left eye, focusing on the lower eyelid. There appears to be
a small, yellow bump or pustule near the inner corner of the eye. The surrounding skin is slightly red
and inflamed. The person’s eye is open, but their expression and the context of the image (e.g., setting,
posture) are unclear. It is difficult to determine the person’s age and gender based on the image alone.</p>
        <p>The image shows a close-up of a fingertip with a small, irregular-shaped, white-yellowish lump. The
lump appears dry, crumbly, and slightly raised. There are no other visible skin conditions, injuries, or
discolorations on the finger or surrounding area. The background suggests a well-lit, clean environment,
possibly a medical setting. The focus on the isolated lump suggests concern about its nature and origin.
Due to the limited information, it is difficult to determine the patient’s age, gender, and overall health
condition, as well as the context surrounding the appearance of the lump.</p>
        <p>The image depicts a close-up view of a skin lesion on the back of a person’s arm, just below the elbow.
The lesion is roughly circular, approximately 1 centimeter in diameter, and raised with a well-defined
border. It exhibits a red or pink coloration and has a scaly, whitish surface with some crusting. The
surrounding skin appears normal in color and texture. There is no visible hair within the lesion, although
hair is present on the surrounding skin. The setting and the individual’s age and gender are unclear, as is
their expression or posture, making it difficult to assess their overall condition or level of discomfort.</p>
          <p>
            Table 4 shows the results of converting randomly selected images into text using both Gemini[
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]
and BLIP[
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]. It is evident that BLIP[
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] often fails to successfully generate text descriptions and,
when it does, the descriptions tend to be overly brief. In contrast, Gemini[
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] consistently uses more
detailed text to describe the images. We believe that in the medical domain, providing more detailed
descriptions is crucial. Therefore, we conclude that using Gemini is more effective in assisting Llama3[
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]
in answering questions.
          </p>
          <p>
            Table 3 compares the different results obtained from testing img2Text data generated by various
img2Text models. We can see that "no image text" performed well in the zero-shot setting. This is
because our model was fine-tuned using LoRA without image text, and increasing the number of shots
without image text did not yield higher scores. Although BLIP[
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] performed well in one-shot and
four-shot settings, it did not perform well in the two-shot setting. We believe this is because BLIP
often fails to effectively convert images to text, so random sampling does not guarantee better results
with more samples. For the results generated by Gemini[
            <xref ref-type="bibr" rid="ref28">28</xref>
            ], we observed a clear trend of improved
performance with more shots. We believe this is because Gemini provides more stable and reliable
results compared to BLIP[
            <xref ref-type="bibr" rid="ref21">21</xref>
            ].
          </p>
        <p>Finally, to give a better understanding of our method, we provide an example of the input
and output in Appendix Table 5.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>After analyzing the results, we identified several limitations in our approach. First, when diagnosing,
the model often mistakes the disease for another, similar disease and may therefore provide incorrect
advice. Although the advice is sometimes still correct owing to the similarity of the diseases, more than
half of the time it is incorrect and may cause harm to the patient. Therefore, it would be irresponsible to
use our model in real-world medical applications.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>
        Our current model has limitations in accurately distinguishing between similar diseases, often leading
to incorrect advice despite sometimes being correct due to disease similarities. To address this, we
propose a new method for future implementation that combines Chain of Thought (CoT)[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] reasoning
with Retrieval-Augmented Generation (RAG)[
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
      </p>
      <p>In this method, we aim to leverage the patient’s provided symptoms to narrow down the potential
diseases. The Chain of Thought approach will help the model logically deduce and refine the possible
diagnoses step by step. By integrating CoT, the model will follow a structured reasoning process,
improving its ability to distinguish between similar diseases.</p>
      <p>Furthermore, we will combine this with RAG to enhance the model’s access to relevant medical
information. RAG will allow the model to retrieve and incorporate external medical knowledge dynamically,
providing more context and supporting the CoT reasoning process.</p>
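      <p>As a rough illustration of what we have in mind, the sketch below assembles a retrieval-augmented, chain-of-thought style prompt; the retrieved passages and helper name are hypothetical placeholders, since this pipeline has not been implemented yet:</p>
      <preformat>
def build_cot_rag_prompt(symptoms, image_text, retrieved_passages):
    """Hypothetical prompt combining retrieved medical knowledge with step-by-step reasoning."""
    knowledge = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "You are assisting with a dermatology question.\n"
        f"Patient symptoms: {symptoms}\n"
        f"Image description: {image_text}\n"
        f"Relevant medical knowledge:\n{knowledge}\n"
        "Think step by step: list the candidate diagnoses, compare each against the "
        "symptoms and the knowledge above, rule out poor matches, then give your advice."
    )

# retrieved_passages would come from a retriever over a medical corpus (not yet implemented).
      </preformat>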
      <p>
        By combining these approaches, we expect to improve the accuracy and reliability of our model’s
medical advice. Future work will involve implementing and testing this combined method, utilizing
more advanced models like Llama3 70b[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and GPT-4o[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. With these enhancements, we believe our
model will better support healthcare professionals in making informed decisions.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This study demonstrated an efficient approach to the MEDIQA-MAGIC task using the Llama3 and Gemini
models. By fine-tuning Llama3 with LoRA and leveraging Gemini for image-to-text conversion, we
significantly improved performance, achieving a high DELTA-BLEU score. Our results highlight
the effectiveness of parameter-efficient fine-tuning methods and the importance of detailed image
descriptions in medical AI applications.</p>
      <p>However, our observations from the BERTScore and manual review indicate that the model often
misidentifies diseases, leading to inaccuracies in the provided answers. This poses a significant limitation
in medical question answering or diagnostic assistance, as the accuracy of the information is crucial.
Providing inaccurate information could mislead healthcare professionals, potentially causing harm.
Therefore, despite the model’s fluency in generating sentences, the issue of model hallucination remains
unresolved. This necessitates further improvements in the model’s ability to accurately identify diseases
before it can be considered viable for commercial use in the medical field.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
      <sec id="sec-8-1">
        <title>A.1. Online Resources</title>
        <p>Our sample code for model fine-tuning and inference is available online at the following links:
• Training Sample Colab,
• Testing Sample Colab.</p>
        <p>The appendix example (Table 5) lists, for sample encounters, the question title, the user text input, the Gemini img2text result, and the model output. The recoverable contents of that table are reproduced below.</p>
        <p>Output example: Some group of people has a genetic predisposition to these lines. It is more common
in the dominant arm. Research studies are limited on this condition but the more fat that you carry on
your arms, the more likely creases are to form on your skin. If you don’t want this crease, then first lose
some body fat to rule out that cause. If the crease still persists, then you’ll know that they’re a permanent
fixture on your arms due to your genetics. If they’re not causing you any pain or if they don’t look too
abnormal for your liking, then you don’t need to worry about them.</p>
        <p>Gemini img2text example: The image shows a close-up of the back of a person’s hairy leg. The
individual’s age and gender are indeterminate from the image. The skin appears to have a slightly reddish
hue and several small, raised bumps. These bumps could be insect bites, folliculitis, or another type of
skin irritation. There is no visible evidence of injury, swelling, or medical equipment in the image. The
setting and context of the image are unclear, as it only shows a close-up of the person’s leg. More
information about the individual’s symptoms and medical history would be needed to make a definitive
diagnosis.</p>
        <p>Output example: It is a case of eczema called dyshidrotic eczema. It is aggravated by atopic dermatitis,
excessive sweating, sun exposure, and smoking. Avoid aggravating factors where possible. Potassium
permanganate soaks may be useful in the acute phase. Apply topical treatments like topical steroids
(usually potent or ultrapotent), pimecrolimus and tacrolimus, and regular use of emollients and
moisturizers. I have also added a cream for faster healing to be applied twice daily for 2 weeks. Use white
toothpaste.</p>
        <p>Output example: Avoid direct contact with food items. Take multivitamins once daily. Rest and I/V
fluids are must. Antibiotics are given in case of systemic infection. The lesion can be surgically removed.
Referral to a dermatologist is recommended for dermoscopic examination and skin biopsy. The 5-year
survival rate is &gt;95%.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2024: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2024 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Dermavqa: A multilingual visual question answering dataset for dermatology</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. N. Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by generative pre-training</article-title>
          ,
          <year>2018</year>
          . URL: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Alec</surname>
          </string-name>
          <string-name>
            <surname>Radford</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <year>2019</year>
          . URL: https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. M. Tom B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>metaAI</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Introducing meta llama 3: The most capable openly available llm to date, 2024</article-title>
          . URL: https://ai.meta.com/blog/meta-llama-3/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. X.</given-names>
            <surname>Lingling</surname>
          </string-name>
          <string-name>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Parameter-eficient fine-tuning methods for pretrained language models: A critical review and assessment, 2023</article-title>
          . URL: https://arxiv.org/abs/2312.12148.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          , Lora:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685. arXiv:
          <volume>2106</volume>
          .
          <fpage>09685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Xiang Lisa</surname>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Prefix-tuning: Optimizing continuous prompts for generation, 2021</article-title>
          . URL: https://arxiv.org/abs/2101.00190.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Brian Lester</surname>
          </string-name>
          , Rami Al-Rfou,
          <article-title>The power of scale for parameter-eficient prompt tuning</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2104.08691.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          , G. Geigle,
          <string-name>
            <given-names>M.</given-names>
            <surname>Glockner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfeifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>AdapterDrop: On the eficiency of adapters in transformers</article-title>
          , in: M.
          <article-title>-</article-title>
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>7930</fpage>
          -
          <lpage>7946</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .emnlp-main.
          <volume>626</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . emnlp-main.
          <volume>626</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lou</surname>
          </string-name>
          , S. Han,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Hadamard adapter: An extreme parameter-eficient adapter tuning method for pre-trained language models</article-title>
          ,
          <source>in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management</source>
          , CIKM '23,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>276</fpage>
          -
          <lpage>285</lpage>
          . URL: https://doi.org/10.1145/3583780.3614904. doi:
          <volume>10</volume>
          .1145/3583780.3614904.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>X. M. Junxian He</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chunting Zhou</surname>
          </string-name>
          ,
          <article-title>Towards a unified view of parameter-eficient transfer learning</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2110.04366.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Vulić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          , Autopeft:
          <article-title>Automatic configuration search for parametereficient fine-tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2301.12132</source>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2301.12132.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1512.03385. arXiv:
          <volume>1512</volume>
          .
          <fpage>03385</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Eficientnet:
          <article-title>Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1905.11946. arXiv:1905.11946.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
          <year>2010</year>
          .11929. arXiv:
          <year>2010</year>
          .11929.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.14030. arXiv:
          <volume>2103</volume>
          .
          <fpage>14030</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>J. W. K. Alec Radford</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D. L. Junnan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022</article-title>
          . URL: https://arxiv.org/abs/2201.12086.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D. L. Junnan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2301.12597.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>D. B. Jiasen Lu</surname>
          </string-name>
          ,
          <article-title>Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-andlanguage tasks</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1908</year>
          .02265.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X. Z.</given-names>
            <surname>Weijie</surname>
          </string-name>
          <string-name>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Vl-bert: Pre-training of generic visual-linguistic representations</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1908</year>
          .08530.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Z. Z.</given-names>
            <surname>Zhicheng</surname>
          </string-name>
          <string-name>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Pixel-bert: Aligning image pixels with text by deep multi-modal transformers</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>2004</year>
          .00849.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>C. L. Haotian</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Visual instruction tuning,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.08485.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>H. B. Wenhui</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <article-title>Image as a foreign language: Beit pretraining for all vision and vision-language tasks</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2208.10442.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Zhe</surname>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Internvl: Scaling up vision foundation models and aligning for generic visuallinguistic tasks</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.14238.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gemini</surname>
          </string-name>
          <string-name>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Gemini: A family of highly capable multimodal models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.11805. arXiv:
          <volume>2312</volume>
          .
          <fpage>11805</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brockett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sordoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quirk</surname>
          </string-name>
          , M. Mitchell,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Dolan,</surname>
          </string-name>
          <article-title>deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets</article-title>
          , in: C.
          <string-name>
            <surname>Zong</surname>
          </string-name>
          , M. Strube (Eds.),
          <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Beijing, China,
          <year>2015</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>450</lpage>
          . URL: https://aclanthology.org/P15-2073. doi:
          <volume>10</volume>
          .3115/v1/
          <fpage>P15</fpage>
          -2073.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          , Bertscore:
          <article-title>Evaluating text generation with bert</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>1904</year>
          .09675. arXiv:
          <year>1904</year>
          .09675.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain of thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>CoRR abs/2201</source>
          .11903 (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2201.11903. arXiv:
          <volume>2201</volume>
          .
          <fpage>11903</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>OpenAI</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Achiam</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Adler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
          </string-name>
          , Gpt-4
          <source>technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.08774. arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>