<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">YuanAI at MEDIQA-MAGIC 2024: Improving Medical VQA Performance through Parameter-Efficient Fine-Tuning Notebook for the ImageCLEF Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hsian-Hong</forename><surname>Fu</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Yuan Ze University</orgName>
								<address>
									<settlement>Taoyuan</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hsien-Cheng</forename><surname>Huang</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Yuan Ze University</orgName>
								<address>
									<settlement>Taoyuan</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">YuanAI at MEDIQA-MAGIC 2024: Improving Medical VQA Performance through Parameter-Efficient Fine-Tuning Notebook for the ImageCLEF Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">077B81702501211FE3E214EF40C2AA87</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models (LLM)</term>
					<term>Large Multimodal Model (LMM)</term>
					<term>Llama3</term>
					<term>Gemini</term>
					<term>Parameter-Efficient Fine-Tuning (PEFT)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In our participation in the MEDIQA-MAGIC[1] 2024 workshop at CLEF, we employed the mediqa-m3g-dataset[2] for fine-tuning and one-shot sampling. Our primary models for inference were Llama3 and Gemini, where Gemini served as the Vision-Language Pre-training (VLP) model and Llama3 was utilized for downstream tasks. We focused on parameter-efficient fine-tuning of Llama3 using Low-Rank Adaptation (LoRA). Our approach achieved notable results, including a DELTA-BLEU score of 4.461 and a BERTScore of 0.855, the highest in the task competition. This study underscores the efficacy of parameter-efficient fine-tuning techniques in enhancing medical visual question answering (VQA) performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In this paper, we describe the techniques used in the MEDIQA-MAGIC <ref type="bibr" target="#b0">[1]</ref> task, including pre-processing, fine-tuning, and inference. We used the mediqa-m3g-dataset <ref type="bibr" target="#b1">[2]</ref> to train, validate, and test our model. The dataset consists of medical questions and images, with the goal of generating medical advice based on the input. To effectively extract information from images without re-training a multimodal model, we opted to use a Vision-Language Pre-Training (VLP) model to obtain image information first. We then combined the medical question with the text generated from the images. That way, we only need to fine-tune the subsequent LLM to meet the needs of our downstream tasks. This approach reduces the GPU memory needed for fine-tuning and increases efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Large Language Models</head><p>Since the release of transformers, numerous large language models (LLMs) have been introduced, including classic models like BERT and the GPT <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref> series. Starting from GPT-3 <ref type="bibr" target="#b4">[5]</ref>, many models have gained widespread attention for their exceptional performance in few-shot and zero-shot learning scenarios. Among them, the recently released Llama3 <ref type="bibr" target="#b5">[6]</ref> by Meta has achieved remarkable results across various datasets compared to other open-source LLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Parameter-Efficient Fine-Tuning (PEFT)</head><p>Today's large language models (LLMs) require high-quality and extensive datasets, so fine-tuning an entire model without additional medical datasets is impractical. Parameter-Efficient Fine-Tuning (PEFT) can achieve more desirable responses even with a small amount of data. PEFT encompasses various technique families such as "Reparameterized, Additive, Partial and Hybrid Fine-Tuning" <ref type="bibr" target="#b6">[7]</ref>, each offering diverse fine-tuning methods.</p><p>In Reparameterized Fine-Tuning, notable methods like LoRA <ref type="bibr" target="#b7">[8]</ref> introduce additional learnable variables in linear layers or attention layers.</p><p>Additive Fine-Tuning involves fixing the LLM parameters and adding learnable parameters in front of the original prompt. Examples include prefix-tuning <ref type="bibr" target="#b8">[9]</ref>, which trains new prefixes to provide context and guide LLMs to produce appropriate answers, and Parameter-Efficient Prompt Tuning <ref type="bibr" target="#b9">[10]</ref>, which modifies the input-layer context without changing the core model parameters. Adapter-based methods, such as AdapterDrop <ref type="bibr" target="#b10">[11]</ref> and Hadamard Adapters <ref type="bibr" target="#b11">[12]</ref>, introduce adapters to adapt the model to new tasks.</p><p>Hybrid Fine-Tuning methods, such as MAM Adapter <ref type="bibr" target="#b12">[13]</ref> and AutoPEFT <ref type="bibr" target="#b13">[14]</ref>, establish connections between the aforementioned techniques. These methods allow for effective fine-tuning to adapt a model to downstream tasks even without access to high-quality and extensive datasets.</p></div>
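To make the LoRA idea above concrete, the sketch below applies the low-rank reparameterization W_eff = W + (alpha/r) · B · A to a frozen weight matrix, where only A and B would be trained. This is a minimal, framework-free sketch for intuition only; all names and the tiny matrices are illustrative, not part of our actual training code.

```python
# Minimal illustration of the LoRA reparameterization: the frozen weight W
# is augmented by a low-rank update (alpha/r) * B @ A, and only A and B
# would receive gradients during fine-tuning.

def matmul(a, b):
    """Naive matrix multiply for small illustrative matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass through W_eff = W + (alpha/r) * B @ A.

    x: (batch x d_in), W: (d_in x d_out),
    A: (r x d_out), B: (d_in x r) -- shapes chosen so B @ A matches W.
    """
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update, rank <= r
    W_eff = [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul(x, W_eff)
```

With A and B initialized to zero (as LoRA initializes B), the layer behaves exactly like the frozen W, and training only moves the low-rank term.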
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Vision Language Models</head><p>Vision language models encompass a wide range of domains, involving both text and images. These models integrate distinct features from both modalities to perform various downstream tasks, such as automatic subtitle generation for videos and visual question answering (VQA). The vision component of these models can be traced back to the field of image classification, where labels are used to categorize each image, often with multiple labels per image. Backbone networks such as ResNet <ref type="bibr" target="#b14">[15]</ref>, EfficientNet <ref type="bibr" target="#b15">[16]</ref>, ViT <ref type="bibr" target="#b16">[17]</ref>, and Swin <ref type="bibr" target="#b17">[18]</ref> are employed to extract features for classification purposes. Subsequently, tasks like image captioning have emerged, where backbone networks extract image features and a language model generates simple text descriptions of the image content. Notable examples of this approach include CLIP <ref type="bibr" target="#b18">[19]</ref> and the BLIP series <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>. In the context of our task in this paper, MEDIQA-MAGIC <ref type="bibr" target="#b0">[1]</ref>, we focus on the VQA domain: we extract features from images or convert them into textual descriptions, which are then processed by large language models (LLMs) to produce the desired descriptions.</p><p>Visual Question Answering (VQA) has long been an established field. 
Historically, models like RCNN or ResNet were used to extract image features, which were then aligned with text using models like BERT or transformers to generate answers, as seen in models like ViLBERT <ref type="bibr" target="#b21">[22]</ref>, VL-BERT <ref type="bibr" target="#b22">[23]</ref>, and Pixel-BERT <ref type="bibr" target="#b23">[24]</ref>.</p><p>One mainstream method involves using an image encoder and a text encoder, followed by a transformer to combine the two features. Notable examples of this approach include models like LLaVA <ref type="bibr" target="#b24">[25]</ref>, BEiTv3 <ref type="bibr" target="#b25">[26]</ref>, and InternVL <ref type="bibr" target="#b26">[27]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Overall Approach</head><p>In this paper, we describe the techniques we employed in the MEDIQA-MAGIC <ref type="bibr" target="#b0">[1]</ref> task, focusing on data processing, model fine-tuning, and inference. To efficiently extract information from images without re-training a multimodal model, we utilized a Vision-Language Pre-trained (VLP) model to obtain image information. This information was then combined with the medical question text. Subsequently, only the downstream Large Language Model (LLM) required fine-tuning. This approach minimizes GPU memory usage during fine-tuning and enhances overall efficiency. For this task, we used only a single T4 GPU with 16 GB of memory on Colab for both fine-tuning and inference.</p><p>In this section, we provide a detailed explanation of the steps involved, including data pre-processing, model fine-tuning, model inference, and post-processing. As illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, we adopted the approach of converting images to text using a VLP model to bridge the gap between images and text. However, VLP models without specific pretraining may not effectively convert medical images. Therefore, without using additional training datasets, we employed the more universally effective Gemini <ref type="bibr" target="#b27">[28]</ref> for this role. For the LLM, we used the latest Llama3-8b <ref type="bibr" target="#b5">[6]</ref> model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Image2Text Processing</head><p>We used Gemini <ref type="bibr" target="#b27">[28]</ref> for data preprocessing to convert images into text. Gemini offers better image descriptions than VLP models such as CLIP <ref type="bibr" target="#b18">[19]</ref> and BLIP <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Fine-Tuning the Text2Text Model</head><p>Using the LoRA method provided by Unsloth, we incorporated LoRA <ref type="bibr" target="#b7">[8]</ref> into Llama3-8b <ref type="bibr" target="#b5">[6]</ref> with the default parameters: r=16, dropout=0, alpha=16. We then employed supervised fine-tuning (SFT) to optimize the LoRA weights to match the style of the reference answers. During fine-tuning, we did not utilize query_content_en, since most of the training data lacks substantial content in that field; we used only query_title_en as input. Additionally, for all instructions we uniformly used "Give me a medical advice" as the prompt, followed by the patient context from query_title_en.</p></div>
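The prompt assembly described above can be sketched as follows. Only the fixed instruction string and the use of query_title_en come from our setup; the Alpaca-style template and the helper name are assumptions for illustration, not the exact format used in training.

```python
# Hypothetical sketch of how each SFT example might be assembled: a fixed
# instruction, the patient's query_title_en as input, and the reference
# answer as the training target. Template layout is an assumption.
INSTRUCTION = "Give me a medical advice"

def build_sft_example(query_title_en, reference_answer):
    """Concatenate instruction, patient context, and target answer."""
    prompt = (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{query_title_en}\n\n"
        f"### Response:\n"
    )
    return prompt + reference_answer
```

During SFT, the loss would be computed on the response portion so the LoRA weights learn the answer style without the model memorizing the template.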
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Text Prompt Engineering</head><p>Initially, we used a fixed training example for one-shot learning, meaning we selected a single, consistent example from our training data. However, this approach proved ineffective because the suitability of an example varies across tasks. To improve this, we switched to randomly selecting an example from our training data for each one-shot attempt. This change allowed us to better match the diversity of tasks and improve the model's adaptability.</p></div>
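The switch from a fixed example to random sampling amounts to drawing each demonstration from the training pool at call time. A minimal sketch, where `train_pool` and its (question, answer) tuples are illustrative stand-ins for the dataset records:

```python
import random

def sample_shots(train_pool, n_shots, rng=random):
    """Draw n_shots distinct demonstrations at random from the training
    pool; capped at the pool size so small pools never raise."""
    return rng.sample(train_pool, k=min(n_shots, len(train_pool)))
```

Because each inference call resamples, no single unrepresentative example can dominate the results, which matches the behavior we observed.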
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Model Inference (LLM) &amp; Post-Processing (Output)</head><p>During the inference phase, we use not only the query_title_en of the current question but also zero, one, two, or four shots randomly selected from the training data. We refrain from using more shots due to the limitations imposed by max_length. Initially, during testing, we included the text generated from image conversion in the one-shot input; however, experimental results showed a decrease in validation performance due to the max_length constraint. Therefore, in the final evaluation, only the text generated from the current question's image conversion is included as part of the input. After generating the text, leading and trailing whitespace is removed from the output. However, due to the max_new_tokens limit, there is no guarantee that &lt;end_token&gt; will appear in every output, so complete answers cannot be guaranteed in every response.</p></div>
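The post-processing step above can be sketched as a small helper that strips surrounding whitespace and flags outputs where generation hit max_new_tokens before emitting the end token, meaning the answer may be truncated. The literal token string is illustrative, not the model's actual special token.

```python
# Hedged sketch of the output post-processing: trim whitespace and detect
# whether the end-of-answer token appeared (absence implies the generation
# was cut off by max_new_tokens and may be incomplete).
END_TOKEN = "<end_token>"  # placeholder for the model's actual end token

def postprocess(raw_output):
    """Return (clean_text, is_complete)."""
    complete = END_TOKEN in raw_output
    text = raw_output.replace(END_TOKEN, "").strip()
    return text, complete
```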
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Evaluation Methods</head><p>In Table <ref type="table" target="#tab_0">1</ref>, we observe a total of 310 training samples, of which 271 include image data. The validation set contains 50 samples, 44 of which include images. The test set comprises 93 samples, all of which include images.</p><p>For evaluation, we employ two scoring methods: DeltaBLEU <ref type="bibr" target="#b28">[29]</ref> and BERTScore <ref type="bibr" target="#b29">[30]</ref>. BERTScore <ref type="bibr" target="#b29">[30]</ref> leverages contextual embeddings from pre-trained BERT models to compute an F1 score from the maximum cosine similarity between the tokens of two sentences, with a maximum score of 100. The BERTScore computation is shown in Equation <ref type="formula" target="#formula_0">3</ref>. DeltaBLEU <ref type="bibr" target="#b28">[29]</ref> is derived from BLEU and incorporates weights obtained from human qualitative judgments, together with the maximum n-gram match counts across multiple reference answers, yielding a maximum score of 1. The DeltaBLEU computation is shown in Equation <ref type="formula" target="#formula_1">4</ref>.</p><formula xml:id="formula_0">Precision(X, Y) = (1/𝑚) ∑︁_{𝑥_𝑖 ∈ X} max_{𝑦_𝑗 ∈ Y} sim(𝑥_𝑖, 𝑦_𝑗) (1) Recall(X, Y) = (1/𝑛) ∑︁_{𝑦_𝑗 ∈ Y} max_{𝑥_𝑖 ∈ X} sim(𝑦_𝑗, 𝑥_𝑖) (2) BertF1Score(X, Y) = 2 · Precision(X, Y) · Recall(X, Y) / (Precision(X, Y) + Recall(X, Y))<label>(3)</label></formula><formula xml:id="formula_1">∆BLEU = [∑︀_𝑖 ∑︀_{𝑔 ∈ n-grams(ℎ_𝑖)} max_{𝑗: 𝑔 ∈ 𝑟_{𝑖,𝑗}} {𝑤_{𝑖,𝑗} · #_𝑔(ℎ_𝑖, 𝑟_{𝑖,𝑗})}] / [∑︀_𝑖 ∑︀_{𝑔 ∈ n-grams(ℎ_𝑖)} max_𝑗 {𝑤_{𝑖,𝑗} · #_𝑔(ℎ_𝑖)}]<label>(4)</label></formula></div>
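Equations (1)-(3) can be made concrete with a toy implementation that operates on a precomputed token-similarity matrix. Real BERTScore derives sim(·,·) from BERT contextual embeddings; here the matrix is supplied directly, purely to illustrate the greedy max-matching and the harmonic mean.

```python
# Toy illustration of the BERTScore F1 in Equations (1)-(3):
# sim[i][j] is the similarity between candidate token x_i and reference
# token y_j (in real BERTScore, the cosine of their BERT embeddings).

def bert_f1(sim):
    m = len(sim)     # number of candidate tokens (rows)
    n = len(sim[0])  # number of reference tokens (columns)
    # Eq. (1): each candidate token greedily matches its best reference token
    precision = sum(max(row) for row in sim) / m
    # Eq. (2): each reference token greedily matches its best candidate token
    recall = sum(max(sim[i][j] for i in range(m)) for j in range(n)) / n
    if precision + recall == 0:
        return 0.0
    # Eq. (3): harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```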
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Result</head><p>Table <ref type="table" target="#tab_1">2</ref> shows a significant difference in scores before and after fine-tuning. Without fine-tuning, the performance of one-shot learning depends heavily on the quality of the example used. Since we used random sampling from the training data to assist LLM predictions during one-shot tasks, if the sampled data were of low quality, the scores might not improve even with multiple samples. After applying LoRA fine-tuning, the model showed a trend where more shots generally resulted in higher scores. However, due to the max_token limitation of LLMs, we could not indefinitely add more shots to enhance performance. 
Table <ref type="table">4</ref> shows the results of converting randomly selected images into text using both Gemini <ref type="bibr" target="#b27">[28]</ref> and BLIP <ref type="bibr" target="#b20">[21]</ref>. It is evident that BLIP <ref type="bibr" target="#b20">[21]</ref> often fails to generate text descriptions and, when it does, the descriptions tend to be overly brief. In contrast, Gemini <ref type="bibr" target="#b27">[28]</ref> consistently uses more detailed text to describe the images. We believe that in the medical domain, providing more detailed descriptions is crucial. Therefore, we conclude that Gemini is more effective in assisting Llama3 <ref type="bibr" target="#b5">[6]</ref> in answering questions.</p><p>Table <ref type="table" target="#tab_2">3</ref> compares the results obtained when testing img2Text data generated by different img2Text models. We can see that "no image text" performed well in the zero-shot setting. This is because our model was fine-tuned using LoRA without image text, so increasing the number of shots without image text did not yield higher scores. Although BLIP <ref type="bibr" target="#b20">[21]</ref> performed well in the one-shot and four-shot settings, it did not perform well in the two-shot setting. We believe this is because BLIP often fails to effectively convert images to text, so random sampling does not guarantee better results with more samples. For the results generated by Gemini <ref type="bibr" target="#b27">[28]</ref>, we observed a clear trend of improved performance with more shots. We believe this is because Gemini provides more stable and reliable results than BLIP <ref type="bibr" target="#b20">[21]</ref>.</p><p>Finally, to give a better understanding of our method, we provide an example of the input and output in Appendix Table <ref type="table" target="#tab_3">5</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>After analyzing the results, we identified several limitations in our approach. When diagnosing, the model often mistakes the disease for another, similar disease, and may therefore provide incorrect advice. Although the advice is sometimes still correct because of the similarity between the diseases, more than half of the time it is incorrect and could harm the patient. It would therefore be irresponsible to use our model in real-world medical applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Future Work</head><p>Our current model has limitations in accurately distinguishing between similar diseases, often leading to incorrect advice despite sometimes being correct due to disease similarities. To address this, we propose a new method for future implementation that combines Chain of Thought (CoT) <ref type="bibr" target="#b30">[31]</ref> reasoning with Retrieval-Augmented Generation (RAG) <ref type="bibr" target="#b31">[32]</ref>.</p><p>In this method, we aim to leverage the patient's provided symptoms to narrow down the potential diseases. The Chain of Thought approach will help the model logically deduce and refine the possible diagnoses step by step. By integrating CoT, the model will follow a structured reasoning process, improving its ability to distinguish between similar diseases.</p><p>Furthermore, we will combine this with RAG to enhance the model's access to relevant medical information. RAG will allow the model to retrieve and incorporate external medical knowledge dynamically, providing more context and supporting the CoT reasoning process.</p><p>By combining these approaches, we expect to improve the accuracy and reliability of our model's medical advice. Future work will involve implementing and testing this combined method, utilizing more advanced models like Llama3-70b <ref type="bibr" target="#b5">[6]</ref> and GPT-4o <ref type="bibr" target="#b32">[33]</ref>. With these enhancements, we believe our model will better support healthcare professionals in making informed decisions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>This study demonstrated an efficient approach to the MEDIQA-MAGIC task using Llama3 and Gemini models. By fine-tuning Llama3 with LoRA and leveraging Gemini for image-to-text conversion, we significantly improved performance, achieving high DELTA-BLEU metrics. Our results highlight the effectiveness of parameter-efficient fine-tuning methods and the importance of detailed image descriptions in medical AI applications.</p><p>However, our observations from the BERTScore and manual review indicate that the model often misidentifies diseases, leading to inaccuracies in the provided answers. This poses a significant limitation in medical question answering or diagnostic assistance, as the accuracy of the information is crucial. Providing inaccurate information could mislead healthcare professionals, potentially causing harm. Therefore, despite the model's fluency in generating sentences, the issue of model hallucination remains unresolved. This necessitates further improvements in the model's ability to accurately identify diseases before it can be considered viable for commercial use in the medical field.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Output</head><p>It is a case of eczema called dyshidrotic eczema. It is aggravated by atopic dermatitis, excessive sweating, sun exposure, and smoking. Avoid aggravating factors where possible. Potassium permanganate soaks may be useful in the acute phase. Apply topical treatments like topical steroids (usually potent or ultrapotent), pimecrolimus and tacrolimus, and regular use of emollients and moisturizers. I have also added a cream for faster healing to be applied twice daily for 2 weeks. Use white toothpaste. Avoid direct contact with food items. Take multivitamins once daily. Rest and I/V fluids are must. Antibiotics are given in case of systemic infection. The lesion can be surgically removed. Referral to a dermatologist is recommended for dermoscopic examination and skin biopsy. The 5-year survival rate is &gt;95%.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Our workflow visualized</figDesc><graphic coords="3,72.00,65.61,451.27,242.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Data Distribution</cell><cell></cell><cell></cell></row><row><cell>Dataset</cell><cell cols="2">All Data number Contain Image</cell></row><row><cell>Train</cell><cell>310</cell><cell>271</cell></row><row><cell>Validation</cell><cell>50</cell><cell>44</cell></row><row><cell>Test</cell><cell>93</cell><cell>93</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Comparison of test-set performance with and without fine-tuning</figDesc><table><row><cell cols="2">Model (DeltaBLEU/BERTScore) Zero Shot</cell><cell>One Shot</cell><cell>Two Shot</cell><cell>Four Shot</cell></row><row><cell>No Fine-Tune</cell><cell cols="4">1.393/0.820 1.462/0.826 0.991/0.820 1.611/0.826</cell></row><row><cell>LoRA Fine-Tune</cell><cell cols="4">2.073/0.829 3.467/0.831 5.522/0.836 6.673/0.835</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Comparison of validation-set performance with different img2Text models</figDesc><table><row><cell cols="2">Image2Text Model (DeltaBLEU/BERTScore) Zero Shot</cell><cell>One Shot</cell><cell>Two Shot</cell><cell>Four Shot</cell></row><row><cell>no image text</cell><cell cols="4">5.494/0.848 2.050/0.850 2.998/0.846 2.361/0.844</cell></row><row><cell>BLIP</cell><cell cols="4">2.202/0.846 6.861/0.847 1.631/0.846 6.376/0.850</cell></row><row><cell>Gemini</cell><cell cols="4">2.073/0.829 3.467/0.831 5.522/0.836 6.673/0.835</cell></row><row><cell>Table 4</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="4">BLIP and Gemini Text Generated from the Same Images</cell></row><row><cell>Image ID</cell><cell>BLIP</cell><cell cols="2">Gemini</cell></row><row><cell>11mk4th</cell><cell>The image is of a woman's eye</cell><cell cols="2">The image shows a close-up of a person's left eye, focusing on the lower eyelid. There appears to be a small, yellow bump or pustule near the inner corner of the eye. The surrounding skin is slightly red and inflamed. The person's eye is open, but their expression and the context of the image (e.g., setting, posture) are unclear. It's difficult to determine the person's age and gender based on the image alone.</cell></row><row><cell>11wvv7y</cell><cell>The image is of a small white spot on the tip of a finger</cell><cell cols="2">The image shows a close-up of a fingertip with a small, irregular-shaped, white-yellowish lump. The lump appears dry, crumbly, and slightly raised. There are no other visible skin conditions, injuries, or discolorations on the finger or surrounding area. The background suggests a well-lit, clean environment, possibly a medical setting. The focus on the isolated lump suggests concern about its nature and origin. Due to the limited information, it is difficult to determine the patient's age, gender, and overall health condition, as well as the context surrounding the appearance of the lump.</cell></row><row><cell>11rk8rd</cell><cell>No Image describe</cell><cell cols="2">The image depicts a close-up view of a skin lesion on the back of a person's arm, just below the elbow. The lesion is roughly circular, approximately 1 centimeter in diameter, and raised with a well-defined border. It exhibits a red or pink coloration and has a scaly, whitish surface with some crusting. The surrounding skin appears normal in color and texture. There is no visible hair within the lesion, although hair is present on the surrounding skin. The setting and the individual's age and gender are unclear, as is their expression or posture, making it difficult to assess their overall condition or level of discomfort.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5</head><label>5</label><figDesc>An example of the input and output of our method. User text input: "Some group of people has a genetic predisposition to these lines. It is more common in the dominant arm. Research studies are limited on this condition but the more fat that you carry on your arms, the more likely creases are to form on your skin. If you don't want this crease, then first lose some body fat to rule out that cause. If the crease still persists, then you'll know that they're a permanent fixture on your arms due to your genetics. If they're not causing you any pain or if they don't look too abnormal for your liking, then you don't need to worry about them." Gemini img2text result: "The image shows a close-up of the back of a person's hairy leg. The individual's age and gender are indeterminate from the image. The skin appears to have a slightly reddish hue and several small, raised bumps. These bumps could be insect bites, folliculitis, or another type of skin irritation. There is no visible evidence of injury, swelling, or medical equipment in the image. The setting and context of the image are unclear, as it only shows a close-up of the person's leg. More information about the individual's symptoms and medical history would be needed to make a definitive diagnosis."</figDesc><table /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the mediqa-magic task at imageclef 2024: Multimodal and generative telemedicine in dermatology</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF 2024 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Dermavqa: A multilingual visual question answering dataset for dermatology</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
			<publisher>CoRR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Improving language understanding by generative pre-training</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<ptr target="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<ptr target="https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Introducing meta llama 3: The most capable openly available llm to date</title>
		<author>
			<persName><surname>Metaai Team</surname></persName>
		</author>
		<ptr target="https://ai.meta.com/blog/meta-llama-3/" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment</title>
		<author>
			<persName><forename type="first">L</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xie</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2312.12148" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Lora: Low-rank adaptation of large language models</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2106.09685" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Prefix-tuning: Optimizing continuous prompts for generation</title>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2101.00190" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">The power of scale for parameter-efficient prompt tuning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Al-Rfou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2104.08691" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">AdapterDrop: On the efficiency of adapters in transformers</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rücklé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Geigle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Glockner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Beck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pfeiffer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.626</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.626" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Online and Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7930" to="7946" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Hadamard adapter: An extreme parameter-efficient adapter tuning method for pre-trained language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-G</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<idno type="DOI">10.1145/3583780.3614904</idno>
		<ptr target="https://doi.org/10.1145/3583780.3614904" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM &apos;23</title>
				<meeting>the 32nd ACM International Conference on Information and Knowledge Management, CIKM &apos;23<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="276" to="285" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Towards a unified view of parameter-efficient transfer learning</title>
		<author>
			<persName><forename type="first">Junxian</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chunting</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xuezhe</forename><surname>Ma</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2110.04366" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vulić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.12132</idno>
		<ptr target="https://arxiv.org/abs/2301.12132" />
		<title level="m">AutoPEFT: Automatic configuration search for parameter-efficient fine-tuning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1512.03385" />
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Efficientnet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1905.11946" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">An image is worth 16x16 words: Transformers for image recognition at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2010.11929" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Swin transformer: Hierarchical vision transformer using shifted windows</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2103.14030" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2103.00020" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2201.12086" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2301.12597" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1908.02265" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Su</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1908.08530" />
		<title level="m">Vl-bert: Pre-training of generic visual-linguistic representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Pixel-bert: Aligning image pixels with text by deep multi-modal transformers</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2004.00849" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2304.08485" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Image as a foreign language: Beit pretraining for all vision and vision-language tasks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2208.10442" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2312.14238" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Gemini: A family of highly capable multimodal models</title>
		<author>
			<persName><surname>Gemini Team, Google</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2312.11805" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Galley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Brockett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sordoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Quirk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dolan</surname></persName>
		</author>
		<idno type="DOI">10.3115/v1/P15-2073</idno>
		<ptr target="https://aclanthology.org/P15-2073" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</title>
		<title level="s">Short Papers</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Strube</surname></persName>
		</editor>
		<meeting>the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="445" to="450" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Bertscore: Evaluating text generation with bert</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1904.09675" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Chain of thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">H</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<idno>CoRR abs/2201.11903</idno>
		<ptr target="https://arxiv.org/abs/2201.11903" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Retrieval-augmented generation for knowledge-intensive nlp tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="9459" to="9474" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Adler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<title level="m">GPT-4</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">technical report</note>
</biblStruct>


				</listBibl>
			</div>
		</back>
	</text>
</TEI>
