<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Multimodal Language Models with Olfactory Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Murathan Kurfalı</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas K. Olofsson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Hörberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sensory-Cognitive Interaction Lab, Department of Psychology, Stockholm University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper explores the incorporation of olfactory data into multimodal language models, a relatively under-explored area in computational linguistics. We tackled the challenge of detecting olfactory stimuli in text and images, with a particular emphasis on multilingual contexts. Our approach involved enhancing the Large Language and Vision Assistant (LLaVA) model, through fine-tuning with a specialized dataset of around 2500 image-text pairs. By leveraging the open-source nature of LLaVA and the resource-eficient ifne-tuning techniques such as Low-rank Adapter (LoRA), our study aims to contribute to the broader exploration of adapting language models to previously under-researched sensory modalities, such as olfaction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The field of multimodal machine learning has predominantly concentrated on text and image
data. This focus is primarily because eficient representation techniques are readily available
for these modalities. Conversely, other sensory dimensions, such as olfaction and gustation,
have received less attention. This can be attributed partly to the challenge of incorporating
their complex chemical structures into machine learning frameworks. However, it is important
to note that these less-explored modalities are also implicitly present in both text and images,
a facet that has been largely overlooked. The Multimodal Understanding of Smells in Texts
and Images (MUSTI) addresses this gap by encouraging research in the detection of olfactory
sources through texts and images in a multilingual context.</p>
      <p>
        In this paper, we outline our contribution to MUSTI 2023 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which focuses on evaluating
the capabilities of current open-source multimodal language models in identifying the sources
of olfactory stimuli. To this end, we utilized the Large Language and Vision Assistant (LLaVA)
model [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], one of the most prominent multimodal open-source models available. Our research
examines both the standard capabilities of the LLaVA model and its potential for fine-tuning
with olfactory data. We observed significant performance improvements in the model when
ifne-tuned with a very limited dataset of approximately 2000 image-text pairs. We believe that
the investigation of under-researched modalities in such models has great potential to advance
the field. Successful results would showcase the efectiveness of currently available multimodal
(text, image) language models in identifying olfactory sources and reveal the possibilities of
using information from various senses to improve the comprehension of olfaction in these
models. Therefore, in addition to developing more compact language models, such exploration
is also related to broader questions of cognitive science, such as the interplay of diferent senses.
      </p>
      <p>Prompt
Determine if the following text and image share common elements, with a specific focus
on smell sources. Look for entities such as objects, animals, fruits, or any other elements
that could be potential sources of smells. Answer YES or NO. Image: &lt;image&gt; Text:
&lt;text&gt;
Determine if the following text and image share common elements, with a specific focus
on smell sources. Look for entities such as objects, animals, fruits, or any other elements
that could be potential sources of smells. If you identify any such common elements,
please list them in the same language as the provided text. If no common elements are
found, simply respond with ’No common elements identified.’ Image: &lt;image&gt; Text:
&lt;text&gt;</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Prior research in olfaction through natural language processing is limited. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] developed a
FrameNet-like taxonomy to account for diferent aspects of the olfactory situations to facilitate
more NLP-oriented research. Building on this, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] developed a multilingual benchmark with
manual annotations for these situations. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] successfully trained a token classification model
with this benchmark that can accurately identify olfactory elements even in modern
outof-domain texts like perfume reviews. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used word embeddings for an odor vocabulary in
English, mapping odor descriptors and their olfactory-semantic organization. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] further applied
this to analyze sensory descriptors in wine, perfume, and food. Complementing text-based
research, image-based olfactory reference extraction has advanced: [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] used CNNs for
odorobject localization, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] created an art dataset for olfactory recognition, and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] developed a
dataset for identifying smell gestures in historical artworks.
      </p>
      <p>
        The closest line of research is last year’s shared task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] evaluated the then
state-of-theart multimodal models, ViLBERT and mUNITER, for detecting common olfactory references
in multilingual text and images. The researchers formulated the task as a visual entailment
problem and demonstrated significant performance improvements through model fine-tuning.
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] addressed the challenge by constructing a unified text-image object representation method
for olfactory information where Yolov51 is used to represent image data and multilingual BERT
for texts.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>
        3.1. Model
Our approach employs LLaVA, which is a general-domain multimodal conversation model[
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ],
as the starting point and fine-tunes to the olfactory domain. LLaVA has a rather straightforward
architecture, consisting of a vision encoder and a language model which are integrated through
a linear projection layer. The various versions of LLaVA, namely 7B and 13B, are named based
on the size of the Vicuna language model[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used as the text encoder.
      </p>
      <p>To optimize LLaVA for the designated subtasks, we transform the existing data into a format
suitable for instruction-tuning. The specific prompts used in our final model is provided in Table
1. Despite the related nature of these subtasks, we ensured each prompt was self-contained.
Therefore, from each text-image pair, two training instances were created. An important finding
during this phase was the need to guide the model to list objects in the same language as the
input text, as it tended to default to English otherwise.</p>
      <p>
        During the fine-tuning, instead of updating the entire model, we used LoRA which greatly
reduces the number of learnable parameters by freezing the model and learning much smaller
projection matrices between layers [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>For our study, we followed the LLaVA model’s oficial GitHub hyperparameters 2,
experimenting with various LoRA settings but found no significant performance diferences. Consequently,
we chose a rank and alpha value of 16. We fine-tuned the LLaVA-13B model for three epochs on
our dataset, noting no further gains with extended training, likely due to dataset size limitations.
The fine-tuning was completed in under 3 hours using a 40GB A100 GPU at a batch size of 4.
3.2. Data
The MUSTI 2023 dataset comprises pairs of texts and images, selected to evoke olfactory
experiences and sourced from historical archives. The dataset encompasses four languages:
English (EN), German (DE), French (FR), and Italian (IT), and contains a total of 2,374 image-text
pairs. However, the data is unbalanced, with only 593 pairs annotated as positive, meaning
the existence of at least one common smell source. We created an in-house development set,
constituting 10% of the total data, ensuring a similar distribution of positive and negative
examples across all languages. The development set is primarily used for hyperparameter
tuning and prompt tuning. Table 2 details the distribution of these pairs in the training and
development sets, highlighting the number of positive and negative samples in each language.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>In this section, we present and discuss the results of our models on the in-house development set
allocated during the training phase, as well as the oficial results on the test sets. We participated
in all three subtasks, namely i) identification of whether or not text passages and images evoke
the same smell source, ii) listing them, and also iii) performing the same tasks for another
language in a zero-shot setting.</p>
      <p>During the development phase, we evaluated both the LLaVA-7B and LLaVA-13B models,
before and after fine-tuning, as shown in Table 3. These models were assessed across multiple
languages. The F1-macro scores revealed significant enhancements following fine-tuning.
Notably, the fine-tuned LLaVA-13B model achieved a remarkable overall F1-macro score of
0.882, with particular improvements in recognizing positive samples. This suggests that the
base models initially lacked the necessary knowledge for detecting potential smell sources. The
LLaVA-7B model also showed competitive performance, especially considering its smaller size.
2https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh
Thanks to LoRA, the computational cost of fine-tuning was minimal, making the switch to a
larger model have almost no tangible efect on the resources needed.</p>
      <p>However, the most significant improvement was observed in subtask 2. On the development
data, we noted that the plain LLaVA models scored almost 0 F-score for positive examples, i.e.
when the evaluation ignored the correct classification of no common objects. This was because
the models either failed to provide any list or listed all objects in the images. After fine-tuning,
the performance of LLaVA-13B drastically improved to an F-score of 0.61, indicating that the
model learned to discern which objects needed to be detected.</p>
      <p>In the oficial results (Table 4), the fine-tuned LLaVA-13B model showed balanced macro
precision and recall on both test and test-zero sets, performing well in identifying negative
samples. The F1-Scores reached 0.776 for Subtask 1 and 0.698 for Subtask 2 in the test set.
In the zero-shot scenario, performance dropped to 0.65 and 0.538 for Subtask 2, respectively.
However, this is still promising, especially considering that Slovenian was not included in
the pre-training or fine-tuning phases. These results highlight the fine-tuned LLaVA model’s
eficacy in recognizing olfactory data, marking a notable advancement in the capabilities of
multimodal language models.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In our research, we delved into the less-explored territory of integrating olfactory data into
multimodal language models. Using the LLaVA model, we focused on recognizing olfactory cues
in a diverse range of texts and images. By fine-tuning LLaVA with around 2500 image-text pairs
and employing the Low-rank Adapter (LoRA) method, we achieved notable enhancements in the
model’s ability to detect olfactory stimuli. We believe that our findings highlight the potential
of multimodal language models in processing sensory information beyond conventional texts
and visuals.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          , I. Novalija,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          , The MUSTI challenge @
          <article-title>MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam,
          <source>the Netherlands and Online, 1-2 February</source>
          <year>2024</year>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2304.08485</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2310.03744</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          , S. Menini,
          <article-title>FrameNet-like annotation of olfactory information in texts</article-title>
          , in: S. DegaetanoOrtlieb,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazantseva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reiter</surname>
          </string-name>
          , S. Szpakowicz (Eds.),
          <source>Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage</source>
          ,
          <source>Social Sciences, Humanities and Literature</source>
          , Association for Computational Linguistics, Punta Cana,
          <source>Dominican Republic (online)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .latechclfl-
          <volume>1</volume>
          .2. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .latechclfl-
          <volume>1</volume>
          .2.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Tekiroglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <article-title>Building a multilingual taxonomy of olfactory terms with timestamps</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4030</fpage>
          -
          <lpage>4039</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurfalı</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hörberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Olofsson</surname>
          </string-name>
          ,
          <article-title>Automatic detection of olfactory context elements, in: 15TH PANGBORN SENSORY SCIENCE SYMPOSIUM-MEETING NEW CHALLENGES IN A CHANGING WORLD</article-title>
          (PSSS
          <year>2023</year>
          ), volume
          <volume>6</volume>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hörberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Olofsson</surname>
          </string-name>
          ,
          <article-title>The semantic organization of the english odor vocabulary</article-title>
          ,
          <source>Cognitive science 46</source>
          (
          <year>2022</year>
          )
          <article-title>e13205</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hörberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurfalı</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Olofsson</surname>
          </string-name>
          ,
          <article-title>Odor and flavor vocabulary in wine, perfume and food product reviews: insights from language modeling, Food Quality and Preference (under review).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Seeing is smelling: Localizing odor-related objects in images</article-title>
          ,
          <source>in: Proceedings of the 9th Augmented Human International Conference</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Odor: The icpr2022 odeuropa challenge on olfactory object recognition</article-title>
          ,
          <source>in: 2022 26th International Conference on Pattern Recognition (ICPR)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>4989</fpage>
          -
          <lpage>4994</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hussian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Snifyart: The dataset of smelling persons</article-title>
          ,
          <source>in: Proceedings of the 5th Workshop on analySis, Understanding and proMotion of heritAge Contents</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Erp</surname>
          </string-name>
          ,
          <article-title>MUSTI - multimodal understanding of smells in texts and images at mediaeval 2022</article-title>
          , in: S. Hicks,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Langguth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lommatzsch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Andreadis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hürriyetoglu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>T. S.</given-names>
          </string-name>
          <string-name>
            <surname>Nordmo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vuillemot</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Larson</surname>
          </string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online,
          <volume>12</volume>
          -
          <fpage>13</fpage>
          January
          <year>2023</year>
          , volume
          <volume>3583</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3583</volume>
          /paper50.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Multimodal and Multilingual Understanding of Smells using VilBERT and mUNITER</article-title>
          , in: MediaEval Benchmarking Initiative for Multimedia Evaluation,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Multilingual Text-Image Olfactory Object Matching Based on Object Detection, in: MediaEval Benchmarking Initiative for Multimedia Evaluation</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>W.-L. Chiang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Sheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , L. Zheng,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalez</surname>
          </string-name>
          , et al.,
          <article-title>Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality</article-title>
          , See https://vicuna. lmsys.
          <source>org (accessed 14 April</source>
          <year>2023</year>
          ) (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Lora:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2106.09685</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>