<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Topic Modelling for NewsImage Task at MediaEval 2021</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lidia Pivovarova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elaine Zosa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
<p>We present the Visual Topic Model (VTM), a model that generates a topic distribution for an image without using any text during inference. The model is applied to the image-text matching task at MediaEval 2021. Although the results for this specific task are negative (the model performs worse than the baseline), we demonstrate that VTM produces meaningful results and can be used in other applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>We present a novel approach for Visual Topic Modelling (VTM), i.e.
assigning a topic distribution to an image, such that 2-3 topics are the
most probable ones. A topic is represented as a list of words, so an
image is effectively labeled with a set of predefined keywords.</p>
      <p>
        VTM is an extension of Contextualized Topic Models (CTM) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
For training, it requires pairs of images and texts. During inference,
it takes only an image as input. Thus, the model is capable of
assigning topics to an image without any textual description.
      </p>
      <p>
        In this paper, we apply VTM for MediaEval 2021 NewsImage
Task 1, i.e. matching news articles with corresponding images [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Our approach consists of training two aligned topic models: one
takes text as input, the other an image, and both output a topic
distribution over a common set of topics. During training, we use
aligned texts and images and train the models so that they produce
similar output distributions. During inference, to find the images
corresponding to a given text, we apply the visual and text models
independently and then sort the images by the similarity of their
topic distributions to the text's topic distribution. Since each
topic can be represented as a set of keywords, the results are
interpretable.
      </p>
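      <p>As an illustration of the ranking step, the following sketch scores each image's topic distribution against the text's topic distribution using Jensen-Shannon divergence; the function names and toy distributions are ours, not taken from the paper's code.</p>

```python
import numpy as np

def js_divergence(p, q):
    """Symmetric divergence between two topic distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_images(text_topics, image_topics):
    """Return image indices sorted from best to worst match."""
    scores = [js_divergence(text_topics, img) for img in image_topics]
    return np.argsort(scores)  # lower divergence = better match

# Toy example: three candidate images, four topics.
text = [0.7, 0.1, 0.1, 0.1]
images = [[0.1, 0.7, 0.1, 0.1],    # mass on the wrong topic
          [0.65, 0.15, 0.1, 0.1],  # close match
          [0.25, 0.25, 0.25, 0.25]]  # uninformative
print(rank_images(text, images))  # best match first
```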
      <p>
        To train aligned visual and text topic models we use knowledge
distillation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], i.e. we first train a teacher model and then train a student model
that should produce an output similar to that produced by the
teacher.
      </p>
      <p>
        Our experiments with text-to-image matching produced negative
results: the VTM-based solution performs worse than a baseline
based on cosine similarity between out-of-the-box text and image
embeddings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Nevertheless, we believe that topic modelling for
images can have many other applications. It may also be possible
to improve the current solution with hyperparameter tuning or by
using a larger training set.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHOD</title>
      <p>
        VTM is an extension of CTM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. CTM is a family of neural topic
models trained to take text embeddings as input and to produce a
bag-of-words reconstruction as output. The model trains an inference
network to estimate the parameters of the topic distribution of the
input. During inference, this topic distribution is used as the model
output to describe texts unseen during training.
      </p>
      <p>Thus, each training instance has two parts: text embeddings
and a bag-of-words (BoW) representation. Our main contribution is
that we replace text embeddings with visual embeddings and
demonstrate that they can be used to train a topic model.
The ZeroShot CTM model uses the BoW representation only to
compute the loss, i.e. this information is not needed at inference
time. Since our training set consists of aligned text-image pairs,
we can use the texts to produce the BoW representation and use it
to train the model.</p>
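      <p>The construction of training instances can be sketched as follows; the vocabulary, embedding size, and helper names are illustrative, while the actual model is built on CTM [1].</p>

```python
import numpy as np

def bag_of_words(text, vocabulary):
    """BoW count vector over a fixed vocabulary, built from the paired text."""
    index = {word: i for i, word in enumerate(vocabulary)}
    bow = np.zeros(len(vocabulary))
    for token in text.lower().split():
        if token in index:
            bow[index[token]] += 1
    return bow

vocabulary = ["court", "trial", "judge", "match", "goal"]

# Each training instance pairs a visual embedding (model input) with the
# BoW of the aligned article text (reconstruction target for the loss).
def make_instance(image_embedding, article_text):
    return image_embedding, bag_of_words(article_text, vocabulary)

emb = np.random.randn(512)  # stand-in for a CLIP image embedding
x, target = make_instance(emb, "The judge opened the trial at the court")
```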
      <p>
        To obtain image embeddings we use CLIP—a pretrained model
that produces text and image embeddings in the same space [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
CLIP representations for text and images are already aligned.
However, this is not a requirement for VTM: in our preliminary
experiments we used ViT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for images and German BERT for texts
(https://huggingface.co/bert-base-german-cased). The results obtained
with non-aligned embeddings were only slightly worse than those with
CLIP embeddings. The topic models converge to similar results because
they use the same BoW to compute the loss; alignment of the embeddings
simplifies this process but is not necessary.
      </p>
      <p>This basic procedure, i.e. training the image and text models
independently, produces similar but not aligned topic models. The topics
may differ slightly, and even similar topics are organized
in a different (random) order. To increase the similarity between the text
and image models, we use knowledge distillation. In this approach,
the student model uses a different input than the teacher, e.g. an image
instead of text, but should produce the same result.</p>
      <p>CTM uses a sum of two losses: a reconstruction loss and a divergence
loss. The reconstruction loss ensures that the reconstructed BoW
representation is not far from the true one. The divergence loss,
measured as the KL-divergence between priors and posteriors, ensures
a diversity property that is desired for any topic model: only a few
words have large probabilities for a given topic, and only a few topics
have high probabilities for a given document.</p>
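      <p>As a toy numerical illustration of these two terms (the standard ELBO terms used by CTM-style models; the shapes and values below are made up), the reconstruction loss is a cross-entropy against the true BoW, and the divergence term has the familiar closed form for a diagonal Gaussian posterior against a standard normal prior.</p>

```python
import numpy as np

def reconstruction_loss(bow, decoded_logits):
    """Cross-entropy between the true BoW counts and the decoded word distribution."""
    log_probs = decoded_logits - np.log(np.sum(np.exp(decoded_logits)))
    return -np.sum(bow * log_probs)

def prior_kl(mu, log_var):
    """Closed-form KL between a diagonal Gaussian posterior and a N(0, I) prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

bow = np.array([2.0, 1.0, 0.0, 1.0])      # word counts for one document
logits = np.array([0.5, 0.2, -1.0, 0.1])  # decoder output over the vocabulary
mu, log_var = np.array([0.3, -0.2]), np.array([-0.1, 0.2])

loss = reconstruction_loss(bow, logits) + prior_kl(mu, log_var)
```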
      <p>In the knowledge distillation approach, we leave the
reconstruction loss intact but replace the divergence loss with the
KL-divergence with respect to the teacher output. The assumption here is
that, since the teacher model is already trained to be diverse and the
student model is trained to mimic the teacher, the student does not need
the priors. Our experiments supported this assumption.</p>
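      <p>A minimal sketch of this replacement, assuming both models output discrete topic distributions (the function and variable names are ours):</p>

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete topic distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return np.sum(p * np.log(p / q))

def student_loss(recon_loss, teacher_topics, student_topics):
    # The reconstruction term is kept; the prior KL is replaced by the
    # KL between the teacher's and the student's topic distributions.
    return recon_loss + kl_divergence(teacher_topics, student_topics)

teacher = [0.6, 0.3, 0.1]  # topic distribution from the trained teacher
student = [0.5, 0.3, 0.2]  # current output of the student model
loss = student_loss(1.25, teacher, student)
```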
      <p>We use knowledge distillation in two versions: joint model and
text-teacher. In the joint approach we first train a joint model that
takes as an input a concatenation of text and image embeddings,
then train two student models for image and text separately. In
the second approach, we first train a text model and then train an
image model as a student.</p>
      <p>We try 60 and 120 topics with both the joint and text-teacher
approaches. Preliminary experiments showed that the more topics are
used, the higher the model's performance in text-image matching.</p>
      <p>As a baseline, we use raw cosine similarities between CLIP
embeddings, without any domain adaptation for the text. We use the
implementation provided as part of the Sentence-BERT package (https://
www.sbert.net/examples/applications/image-search/README.html).</p>
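      <p>The baseline reduces to ranking images by cosine similarity in the shared embedding space; a minimal sketch with made-up vectors (the real baseline uses CLIP embeddings [5]):</p>

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_by_cosine(text_emb, image_embs):
    """Return image indices sorted from most to least similar."""
    scores = [cosine_similarity(text_emb, e) for e in image_embs]
    return np.argsort(scores)[::-1]

text_emb = np.array([1.0, 0.0, 1.0])
image_embs = [np.array([0.0, 1.0, 0.0]),  # orthogonal to the text
              np.array([1.0, 0.1, 0.9]),  # close to the text
              np.array([0.5, 0.5, 0.5])]
print(rank_by_cosine(text_emb, image_embs))  # best match first
```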
    </sec>
    <sec id="sec-3">
      <title>RESULTS</title>
      <p>The results are presented in Table 1. As can be seen from the table,
the best results are obtained with CLIP embeddings, which are used
without any fine-tuning to the training set. They find the
correct image in 1225 cases out of 1915 and yield a Mean Reciprocal
Rank (MRR) of 0.17. The best VTM model finds the correct image in
816 cases out of 1915 and yields an MRR of 0.03.</p>
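      <p>For reference, MRR averages the reciprocal rank of the correct image over all queries; a minimal sketch with illustrative ranks:</p>

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct image for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries: correct image ranked 1st, 4th, and 2nd.
print(mean_reciprocal_rank([1, 4, 2]))  # (1 + 0.25 + 0.5) / 3
```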
      <p>
        These results correspond, to some extent, to our previous
experiments, where we showed that topic modelling does not work well
for document linking [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A probable explanation is
that topic modelling produces a sparse representation of the
data. While CLIP embeddings are continuous vectors that can
represent an almost unlimited amount of information, in topic modelling
the dimensions are not independent, due to the diversity requirement
described above. It can be seen from Table 1 that models with
more topics yield better performance.
      </p>
      <p>Another interesting observation is that models that use the text
model as a teacher for the visual model work better than joint models.
This is an unexpected result, since one would expect that a model
with access to the full information would serve as a better teacher.
It is possible that text carries less noise: the text model uses the same
text for both the contextual and BoW representations, while an image can
be completely irrelevant to the corresponding article.</p>
      <p>The fact that embeddings and topic modelling work on different
principles is illustrated in Figure 1, where we reproduce the images
found by the models for a text about the Anna Semenova trial.
The CLIP model finds photos of Anna Semenova, probably due to the
huge text and image base used to train the embeddings. VTM
returns images with a statue of Themis, a personification of justice,
which represent the topic of the text rather than specific facts. Although,
according to our results, CLIP embeddings outperform VTM, the
ability to illustrate the topic of a text may be a desirable property for
some applications, as is topic interpretability.</p>
      <p>Our code is available at https://github.com/lmphcs/media_eval_vctm.</p>
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work has been partly supported by the European Union’s
Horizon 2020 research and innovation programme under grants
770299 (NewsEye) and 825153 (EMBEDDIA).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Federico</given-names>
            <surname>Bianchi</surname>
          </string-name>
          , Silvia Terragni, Dirk Hovy, Debora Nozza, and
          <string-name>
            <given-names>Elisabetta</given-names>
            <surname>Fersini</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Cross-lingual Contextualized Topic Models with Zero-shot Learning</article-title>
          .
          <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</source>
          . Association for Computational Linguistics, Online,
          <fpage>1676</fpage>
          -
          <lpage>1683</lpage>
          . https://www.aclweb.org/anthology/2021.eacl-main.143
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          , Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and others.
          <year>2020</year>
          .
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          . In International Conference on Learning Representations.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jianping</given-names>
            <surname>Gou</surname>
          </string-name>
          , Baosheng Yu,
          <string-name>
            <given-names>Stephen J</given-names>
            <surname>Maybank</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dacheng</given-names>
            <surname>Tao</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Knowledge distillation: A survey</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>129</volume>
          ,
          <issue>6</issue>
          (
          <year>2021</year>
          ),
          <fpage>1789</fpage>
          -
          <lpage>1819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Kille</surname>
          </string-name>
          , Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and
          <string-name>
            <given-names>Duc-Tien</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>News Images in MediaEval 2021</article-title>
          . CEUR Workshop Proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and others.
          <article-title>Learning Transferable Visual Models From Natural Language Supervision</article-title>
          .
          <source>Technical Report</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Elaine</given-names>
            <surname>Zosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Granroth-Wilding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lidia</given-names>
            <surname>Pivovarova</surname>
          </string-name>
          .
          <article-title>A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval</article-title>
          .
          <source>LREC 2020 Language Resources and Evaluation Conference</source>
          , 11-16 May
          <year>2020</year>
          ,
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>