<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Handle the problem of ample label space by using the Image-guided Feature Extractor on the MUSTI dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Le Ngoc-Duc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Minh-Hung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dinh Quang-Vinh</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ho Chi Minh City University of Science</institution>
          ,
          <addr-line>Vietnam</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnamese-German University</institution>
          ,
          <addr-line>Vietnam</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Among multimodal tasks, olfactory perception remains a largely unexplored field. Two difficulties stand out. First, the label space is large while the data set is generally too small. Second, the labels in the data set are imbalanced. In this paper, we develop and evaluate our model on the task of predicting the congruence of olfactory experiences between an image and a corresponding text passage on the MUSTI dataset. To address the label imbalance problem and to improve feature extraction from images and text with large feature spaces, we propose a model that selects text features based on image features. By selecting the text features that need attention, our model outperforms existing baselines on the training and testing data sets. Code available at: https://github.com/Haru-Lab-Space/MMM2024.git.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This paper aims to improve multimodal text-image retrieval through ensemble
learning with image extraction models that have been pre-trained on large datasets. In addition, the
Image-guided Feature Extractor helps the model encode text more effectively by indicating
which features deserve more attention. Because the combined image extraction models
produce diverse, complementary features, the text encoder receives a more informative
summary. The information extracted from images and text is then brought into a
shared space, where their similarity is evaluated based on cosine distance. We evaluate our model
on the task of predicting the congruence of olfactory experiences between an image and a
corresponding text passage on the MUSTI dataset, and it outperforms the baselines on all
metrics. Models like this could help museums and galleries improve the
experience of visitors with special needs, such as those with visual impairments. They could
also open new opportunities for the perfume industry by helping it recreate the history of its
formation and development [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews related work.
Section 3 describes our proposed model and the motivation for its development. Section 4
presents and analyzes the results of our experiments, comparing the performance of our proposed
model against established baselines. Section 5 explores several model variants and the
differences in their performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Multimodal image-text retrieval aims to retrieve matching image-text pairs efficiently. It is
a difficult task because images and text are represented in different feature spaces.
To overcome this, Y. He et al. used two convolutional neural networks to learn features
from images and text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. After bringing both modalities into the same representation space, the model
evaluates their similarity through cosine similarity. Y. Zhang and H. Lu construct a multimodal projection
and evaluate the difference between pairs of input images and text [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A. Baldrati et al. synthesize
the features of the two phases to find the differential features
between query and target images [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. H. Dong et al. proposed a model that leverages text information
to teach the image encoder to retrieve features based on a graph [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Although there have
been many efforts to improve the joint understanding of images and text, previous
methods extract information in a fragmented way. The separate
encoding and extraction units produce information-rich vectors, which can unintentionally
confuse the model when it chooses where to pay attention among countless potential options.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Cross-Feature Encoder</title>
        <p>Through problem analysis, we realized that only some of the features found in images and
text are needed. Information-rich feature embeddings can complicate the inference process
and confuse the model when it makes a prediction. We therefore built a model consisting of three
main components: Image Encoder, Image-guided Feature Extractor, and Text Encoder.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Image Encoder</title>
          <p>
            The image encoder is an ensemble of the encoders of the Vision Transformer [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] and ResNet-34 [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] models.
We feed the image to both of these encoders. Because the output size of each
model is different, we use linear layers to project their outputs to the same size as the text embedding.
          </p>
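          <p>The projection step can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the helper names (make_linear, linear), the toy dimensions, and the random weights are all assumptions, and real encoder outputs would replace the constant stand-in vectors.</p>
          <preformat>
```python
import random


def make_linear(in_dim, out_dim, seed):
    """Create a randomly initialised linear layer as a (weights, bias) pair."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


# Toy sizes chosen for illustration only (assumptions, not values from the paper).
VIT_DIM, RESNET_DIM, TEXT_DIM = 12, 8, 6

# One projection per image encoder, each mapping to the text embedding size.
vit_proj = make_linear(VIT_DIM, TEXT_DIM, seed=0)
resnet_proj = make_linear(RESNET_DIM, TEXT_DIM, seed=1)

# Stand-ins for the pooled outputs of the two image encoders.
vit_features = [0.5] * VIT_DIM
resnet_features = [1.0] * RESNET_DIM

vit_embedding = linear(vit_features, vit_proj)
resnet_embedding = linear(resnet_features, resnet_proj)
```
          </preformat>
          <p>After projection, both image embeddings have the same length as the text embedding, so they can be combined with it in the later blocks.</p>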
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Image-guided Feature Extractor (IGFE)</title>
          <p>The idea behind this processing block is that we need only a small subset of the features
that the text encoder can produce. Therefore, we use the output of the image
encoder to steer the text embedding toward the needed features. This helps the text encoder
pay more attention to essential features. We do this by appending the text
embedding to the output of each image encoder in turn. A linear layer then decides how
much of the image encoder’s information is retained to become the "instructor".</p>
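          <p>One possible reading of this block is sketched below, under stated assumptions: the image and text embeddings are concatenated and passed through a single linear layer whose output has the text embedding size. The function name image_guided_instructor, the toy vectors, and the random weights are hypothetical, and the actual layer layout may differ.</p>
          <preformat>
```python
import random


def make_linear(in_dim, out_dim, seed):
    """Create a randomly initialised linear layer as a (weights, bias) pair."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def image_guided_instructor(image_embedding, text_embedding, layer):
    """Append the text embedding to the image embedding, then mix with a linear layer."""
    joined = image_embedding + text_embedding  # list concatenation
    return linear(joined, layer)


DIM = 4  # toy embedding size (assumption)
text_embedding = [0.2, -0.1, 0.4, 0.0]
image_embeddings = {
    "vit": [0.3, 0.1, -0.2, 0.5],
    "resnet34": [-0.4, 0.6, 0.0, 0.1],
}

# One linear layer per image encoder; each sees the 2 * DIM concatenated vector.
instructors = {}
for index, (name, embedding) in enumerate(image_embeddings.items()):
    layer = make_linear(2 * DIM, DIM, seed=index)
    instructors[name] = image_guided_instructor(embedding, text_embedding, layer)
```
          </preformat>
          <p>Each image encoder yields one instructor vector of the same size as the text embedding.</p>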
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Text Encoder</title>
          <p>
            The instructors and text embeddings are then added together and normalized before being fed
into the text encoder. In this setting, we use BERT’s multilingual text encoder, a text processing
model that consists only of encoder layers [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
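          <p>The add-and-normalize step might look like the sketch below. The text does not state which normalization is used, so layer normalization without learned scale or shift is assumed here; the function names and toy vectors are hypothetical.</p>
          <preformat>
```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def fuse(text_embedding, instructors):
    """Add every instructor to the text embedding, then normalise the sum."""
    fused = list(text_embedding)
    for instructor in instructors:
        fused = [f + i for f, i in zip(fused, instructor)]
    return layer_norm(fused)


text_embedding = [0.2, -0.1, 0.4, 0.0]
instructors = [
    [0.1, 0.3, -0.2, 0.5],   # e.g. instructor from the Vision Transformer branch
    [0.0, -0.4, 0.1, 0.2],   # e.g. instructor from the ResNet-34 branch
]

# This vector would be the input to the text encoder.
encoder_input = fuse(text_embedding, instructors)
```
          </preformat>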
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prediction</title>
        <p>We score each image-text pair with a cosine distance measure and combine the scores of the
individual models through weighted soft aggregation, where the weights come from a linear layer.
Because these weights are flexible, we can allocate the decision influence
of each model accordingly.</p>
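        <p>A minimal sketch of this prediction step follows. It assumes that each branch produces one cosine score and that a one-output linear layer followed by a sigmoid turns the scores into a probability; the sigmoid, the 0.5 threshold, the weight values, and the toy embeddings are assumptions for illustration.</p>
        <preformat>
```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (cosine distance is 1 minus this)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def soft_vote(scores, weights, bias=0.0):
    """Weighted sum of per-model scores (a one-output linear layer) passed through a sigmoid."""
    logit = sum(w * s for w, s in zip(weights, scores)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy image and text embeddings for the two branches (hypothetical values).
vit_image, vit_text = [0.9, 0.1, 0.3], [0.8, 0.2, 0.4]
resnet_image, resnet_text = [0.2, 0.7, 0.1], [0.6, 0.1, 0.5]

scores = [
    cosine_similarity(vit_image, vit_text),
    cosine_similarity(resnet_image, resnet_text),
]

# Hypothetical weights; in training they would be learned with the rest of the model.
vote_weights = [1.5, 0.5]
probability = soft_vote(scores, vote_weights, bias=-1.0)
prediction = "YES" if probability >= 0.5 else "NO"
```
        </preformat>
        <p>Unequal weights let a stronger branch dominate the decision without discarding the weaker one, which is the difference from averaging the two scores.</p>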
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>
        We found that our model outperformed the baselines on every metric when evaluated
on the test dataset. We took the results of the baseline models
directly from the authors’ articles [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        On the test set, our model improves
accuracy by about 13 to 14 percentage points (74.41% for the proposed model compared to 60.33% for YOLOv5 +
BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and 61.76% for mUNITER-SNLI-MUSTI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). This improvement comes from four
main sources. First, we use Focal Loss [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as the loss function of the model, which
helps us overcome the problem of imbalanced datasets. We also tested data augmentation
and removing NO labels until the class counts were equal. However, this approach produces a
large dataset and requires ten times more training time, and it did not improve accuracy
at all. Second, we use a feature selection unit, which helps the model focus on what
is needed and reduces the model’s confusion when predicting. Third, we apply an ensemble
model to take advantage of the strengths of each model. At the final voting step, we use a
linear layer as a soft voting method instead of giving equal weight to the predictions
of the two individual models. This way, we can benefit from both models without being
wholly affected by the weaknesses or suboptimal cases of either one. Finally, we use BERT’s
multilingual text encoding model, which allows us to learn better text embeddings.
      </p>
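      <p>For reference, the binary form of Focal Loss can be sketched as below. The default values of gamma and alpha are the ones reported in Section 5; assigning alpha to the positive class and 1 - alpha to the negative class follows the original Focal Loss formulation and is an assumption about how it was applied here.</p>
      <preformat>
```python
import math


def focal_loss(prob_positive, label, gamma=2.0, alpha=0.3):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    # p_t is the probability the model assigns to the true class.
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    # alpha_t re-weights the two classes (assumed: alpha for positives).
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct example contributes far less than a badly wrong one.
easy = focal_loss(0.9, label=1)
hard = focal_loss(0.1, label=1)
```
      </preformat>
      <p>The factor (1 - p_t) raised to gamma shrinks the loss of well-classified examples, so training is driven by the hard and rare ones instead of the majority class.</p>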
      <p>Based on the results described in Tables 1 and 2, we conclude that combining
ensemble learning, Focal Loss, and the Image-guided Feature Extractor helps
the model achieve good results. However, choosing a suboptimal set of coefficients (in this
setting, γ and α of Focal Loss) and overusing dropout layers can degrade model performance.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Ablation Study</title>
      <p>Apart from models that cannot handle imbalanced data, we find that no specific model is
absolutely superior. However, models that use an Image-guided Feature Extractor
consistently achieve better results than baseline models that do not use the proposed unit.</p>
      <p>
        Furthermore, in terms of ensemble learning, we can partly predict the results of a
combined model from its components. Specifically, we use combinations of Vision
Transformer, ResNet-34, and SwinTransformer in this study. Evaluated individually,
ResNet and SwinTransformer give the best and worst results, respectively. Although
SwinTransformer has demonstrated excellent results in various tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it falls short in this particular
task. As a result, model
combinations that include SwinTransformer consistently underperform
variants with the same number of sub-components. Furthermore, the combined model based on
all three image encoders gives lower results than the model that includes only Vision Transformer
and ResNet-34. This shows that adding more components to an ensemble does not by itself
improve accuracy. Instead, choosing suitable combinations requires testing and
experience. In addition, through testing, we found that Focal Loss gamma
and alpha values of 2 and 0.3, respectively, give the model the best accuracy
on the test set while maintaining the necessary simplicity.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Outlook</title>
      <p>We have proposed a method for feature encoding in text-image integration that uses
the Image-guided Feature Extractor. This component lets the
model concentrate its attention on the objects detected in the input image. The model
is effective at predicting the congruence of olfactory
experiences between an image and a corresponding text passage, as assessed on the MUSTI dataset.
Nonetheless, the dissimilarity between the embedding spaces of images and text remains an
unresolved challenge for the careful selection of congruent features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          , I. Novalija,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>The MUSTI challenge @ MediaEval 2023 - multimodal understanding of smells in texts and images with zero-shot evaluation</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2023 Workshop</source>
          , Amsterdam, the Netherlands and Online, 1-2 February 2024,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Cross-modal retrieval via deep and bidirectional representation learning</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>18</volume>
          (
          <year>2016</year>
          )
          <fpage>1363</fpage>
          -
          <lpage>1377</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/TMM.2016.2558463</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Lu,
          <article-title>Deep cross-modal projection learning for image-text matching</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baldrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Uricchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>del Bimbo</surname>
          </string-name>
          ,
          <article-title>Composed image retrieval using contrastive learning and task-oriented CLIP-based features</article-title>
          ,
          <year>2023</year>
          . arXiv:2308.11485.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , G. Sapiro,
          <article-title>Using text to teach image retrieval</article-title>
          ,
          <year>2020</year>
          . arXiv:2011.09928.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv:2010.11929,
          <year>2020</year>
          . URL: https://api.semanticscholar.org/CorpusID:225039882.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <year>2015</year>
          . arXiv:1512.03385.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Multilingual text-image olfactory object matching based on object detection</article-title>
          , in:
          <string-name><given-names>S.</given-names> <surname>Hicks</surname></string-name>,
          <string-name><given-names>A. G. S.</given-names> <surname>de Herrera</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Langguth</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lommatzsch</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Andreadis</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Dao</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Martin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Hürriyetoglu</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Thambawita</surname></string-name>,
          <string-name><given-names>T. S.</given-names> <surname>Nordmo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Vuillemot</surname></string-name>,
          <string-name><given-names>M. A.</given-names> <surname>Larson</surname></string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper15.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Akdemir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hürriyetoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zinnen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christlein</surname>
          </string-name>
          ,
          <article-title>Multimodal and multilingual understanding of smells using vilbert and muniter</article-title>
          , in:
          <string-name><given-names>S.</given-names> <surname>Hicks</surname></string-name>,
          <string-name><given-names>A. G. S.</given-names> <surname>de Herrera</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Langguth</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lommatzsch</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Andreadis</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Dao</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Martin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Hürriyetoglu</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Thambawita</surname></string-name>,
          <string-name><given-names>T. S.</given-names> <surname>Nordmo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Vuillemot</surname></string-name>,
          <string-name><given-names>M. A.</given-names> <surname>Larson</surname></string-name>
          (Eds.),
          <source>Working Notes Proceedings of the MediaEval 2022 Workshop</source>
          , Bergen, Norway and Online, 12-13 January 2023, volume
          <volume>3583</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3583/paper36.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <year>2018</year>
          . arXiv:1708.02002.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Bui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>C2t-net: Channel-aware cross-fused transformer-style networks for pedestrian attribute recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>351</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>