    Visual Topic Modelling for NewsImage Task at MediaEval 2021
                                                             Lidia Pivovarova, Elaine Zosa
                                                                University of Helsinki, Finland
                                                                     first.last@helsinki.fi

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’21, December 13-15 2021, Online.
ABSTRACT
We present the Visual Topic Model (VTM), a model that generates a topic distribution for an image without using any text during inference. The model is applied to an image-text matching task at MediaEval 2021. Although the results for this specific task are negative (the model performs worse than a baseline), we demonstrate that VTM produces meaningful results and can be used in other applications.

1 INTRODUCTION
We present a novel approach for Visual Topic Modelling (VTM), i.e. assigning a topic distribution to an image, where 2-3 topics are the most probable ones. A topic is represented as a list of words, so an image is labeled with a set of predefined keywords.

VTM is an extension of Contextualized Topic Models (CTM) [1]. For training it requires pairs of images and texts. During inference, it takes only an image as input. Thus, the model is capable of assigning topics to an image without any textual description.

In this paper, we apply VTM to MediaEval 2021 NewsImage Task 1, i.e. matching news articles with corresponding images [4]. Our approach consists of training two aligned topic models: one model takes a text as input, the other takes an image, and both output a topic distribution over a common set of topics. During training, we use aligned texts and images and train the models in such a way that they produce similar output distributions. During inference, to find the images corresponding to a given text, we apply the visual and text models independently and then sort the images by the similarity of their topic distributions to the text topic distribution. Since each topic can be represented as a set of keywords, the results are interpretable.
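A minimal sketch of this ranking step is shown below; the similarity measure (cosine between topic distributions) and the helper name are illustrative assumptions, as the exact measure is not fixed here.

```python
import numpy as np

def rank_images_by_topic_similarity(text_topics, image_topics):
    """Rank candidate images for one article by topic-distribution similarity.

    text_topics:  (n_topics,) topic distribution of the article (text model).
    image_topics: (n_images, n_topics) distributions of the candidate images
                  (visual model). Returns image indices, best match first.
    """
    # Cosine similarity between the text distribution and each image distribution;
    # another distribution similarity (e.g. Jensen-Shannon) could be swapped in.
    text_norm = text_topics / np.linalg.norm(text_topics)
    image_norm = image_topics / np.linalg.norm(image_topics, axis=1, keepdims=True)
    return np.argsort(-(image_norm @ text_norm))
```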
To train aligned visual and text topic models we use knowledge distillation [3], i.e. we first train a teacher and then train a student model that should produce an output similar to that produced by the teacher.

Our experiments with text-to-image matching produced negative results: a solution based on VTM performs worse than a baseline based on cosine similarity between out-of-the-box text and image embeddings [5]. Nevertheless, we believe that topic modelling for images can have many other applications. It may also be possible to improve the current solution with hyperparameter tuning or by using a larger training set.

2 METHOD
VTM is an extension of CTM [1]. CTM is a family of neural topic models trained to take text embeddings as input and to produce a bag-of-words reconstruction as output. The model trains an inference network to estimate the parameters of the topic distribution of the input. During inference this topic distribution is used as the model output to describe texts unseen during training.

Thus, to train a model, each input instance has two parts: text embeddings and a bag-of-words representation (BoW). Our main contribution is that we replace the text embeddings with visual embeddings and demonstrate that they can be used to train a topic model. The ZeroShot CTM model uses the BoW representation only to compute the loss, i.e. this information is not needed at inference time. Since we have a training set that consists of aligned text and image pairs, we can use the texts to produce the BoW representation and use it to train the model.
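A minimal sketch of the kind of model we build on: an inference network maps an input embedding (text or image) to the parameters of a logistic-normal topic distribution, and a decoder reconstructs the BoW from the sampled topic vector. The hidden size and the single-layer encoder below are illustrative choices, not the exact configuration of [1] or of our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingTopicModel(nn.Module):
    """ProdLDA/CTM-style topic model: embedding in, BoW reconstruction out."""

    def __init__(self, embedding_dim, vocab_size, n_topics, hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(embedding_dim, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)         # posterior mean
        self.log_sigma = nn.Linear(hidden, n_topics)  # posterior log-variance
        self.beta = nn.Linear(n_topics, vocab_size)   # topic-word decoder

    def forward(self, embedding):
        h = self.encoder(embedding)
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        # Reparameterised sample from the logistic-normal posterior.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_sigma)
        theta = F.softmax(z, dim=-1)                  # document-topic distribution
        word_logits = self.beta(theta)                # BoW reconstruction logits
        return word_logits, theta, mu, log_sigma

def reconstruction_loss(word_logits, bow):
    # Negative log-likelihood of the observed word counts.
    return -(bow * F.log_softmax(word_logits, dim=-1)).sum(-1).mean()
```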
To obtain image embeddings we use CLIP, a pretrained model that produces text and image embeddings in the same space [5]. CLIP representations for text and image are already aligned. However, this is not a requirement for VTM: in our preliminary experiments we used ViT [2] for images and German BERT (https://huggingface.co/bert-base-german-cased) for texts. The results obtained using non-aligned embeddings were only slightly worse than those with CLIP embeddings. The topic models converge to similar results because they use the same BoW to compute the loss; alignment of the embeddings simplifies this process but is not necessary.
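A minimal sketch of how such non-aligned embeddings can be extracted with the transformers library; the ViT checkpoint name, the mean pooling of the BERT outputs, and the file path are illustrative assumptions, not the exact setup of our preliminary experiments.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, ViTImageProcessor, ViTModel

# German BERT for texts and a plain ViT for images; the ViT checkpoint
# below is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
text_model = AutoModel.from_pretrained("bert-base-german-cased")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

with torch.no_grad():
    # Mean-pooled token embeddings as the text representation (a simple choice).
    tokens = tokenizer("Example news article text", return_tensors="pt")
    text_embedding = text_model(**tokens).last_hidden_state.mean(dim=1)

    # Pooled [CLS] representation of the image patches.
    pixels = image_processor(Image.open("news_image.jpg"), return_tensors="pt")
    image_embedding = image_model(**pixels).pooler_output
```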
This basic procedure, i.e. training the image and text models independently, produces similar but not aligned topic models. The topics can differ slightly, and even similar topics are organized in a different (random) order. To increase the similarity between the text and image models we use knowledge distillation. In this approach a student model uses a different input than the teacher, e.g. an image instead of a text, but should produce the same result.

CTM uses a sum of two losses: a reconstruction loss and a divergence loss. The reconstruction loss ensures that the reconstructed BoW representation is not far from the true one. The divergence loss, measured as the KL-divergence between priors and posteriors, ensures a diversity property that is desired for any topic model: only a few words have large probabilities for a given topic and only a few topics have high probabilities for a given document.

In the knowledge distillation approach we leave the reconstruction loss intact but replace the divergence loss with the KL-divergence with respect to the teacher output. The assumption here is that, since the teacher model is already trained to be diverse and the student model is trained to mimic the teacher, the student does not need priors. Our experiments supported this assumption.
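A minimal sketch of the student objective under this scheme: the BoW reconstruction term is kept, and the prior KL term is replaced by a KL-divergence between the student's and the teacher's document-topic distributions. The unweighted sum of the two terms is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def student_loss(word_logits, bow, student_theta, teacher_theta, eps=1e-8):
    """Distillation objective: BoW reconstruction + KL to the teacher's topics.

    student_theta / teacher_theta are document-topic distributions that already
    sum to one; the equal weighting of the two terms is an illustrative choice.
    """
    # Negative log-likelihood of the observed word counts under the reconstruction.
    recon = -(bow * F.log_softmax(word_logits, dim=-1)).sum(-1).mean()
    # KL(student || teacher) replaces the usual KL to the prior.
    kl_to_teacher = (
        student_theta
        * (torch.log(student_theta + eps) - torch.log(teacher_theta + eps))
    ).sum(-1).mean()
    return recon + kl_to_teacher
```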
We use knowledge distillation in two versions: joint model and text-teacher. In the joint approach we first train a joint model that takes a concatenation of the text and image embeddings as input, then train two student models for image and text separately. In the second approach, we first train a text model and then train an image model as its student.
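To make the pipeline concrete, the sketch below outlines the text-teacher variant, reusing the hypothetical EmbeddingTopicModel, reconstruction_loss and student_loss from the sketches above; the optimiser, number of epochs, standard-normal prior, and placeholder data are illustrative assumptions rather than our actual training configuration. The joint variant differs only in that the teacher consumes a concatenation of the text and image embeddings.

```python
import torch

# Placeholder data standing in for the real training set:
# text/image embeddings and the BoW built from the article texts.
text_emb = torch.randn(256, 512)
image_emb = torch.randn(256, 512)
bow = torch.randint(0, 3, (256, 2000)).float()

teacher = EmbeddingTopicModel(embedding_dim=512, vocab_size=2000, n_topics=120)
student = EmbeddingTopicModel(embedding_dim=512, vocab_size=2000, n_topics=120)

# Stage 1: train the text teacher with a standard VAE-style topic-model objective
# (reconstruction + KL to a standard-normal prior, a simplification of CTM's prior).
opt = torch.optim.Adam(teacher.parameters(), lr=2e-3)
for epoch in range(50):
    logits, theta, mu, log_sigma = teacher(text_emb)
    prior_kl = 0.5 * (mu.pow(2) + log_sigma.exp() - 1 - log_sigma).sum(-1).mean()
    loss = reconstruction_loss(logits, bow) + prior_kl
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: train the image student to reconstruct the BoW while mimicking
# the teacher's topic distribution for the paired text.
opt = torch.optim.Adam(student.parameters(), lr=2e-3)
for epoch in range(50):
    with torch.no_grad():
        _, teacher_theta, _, _ = teacher(text_emb)
    logits, student_theta, _, _ = student(image_emb)
    loss = student_loss(logits, bow, student_theta, teacher_theta)
    opt.zero_grad()
    loss.backward()
    opt.step()
```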

                                                               Table 1: Results

          Model                        Correct in Top100       MRR@100    Recall@5       Recall@10      Recall@50       Recall@100
          baseline (CLIP)                      1225               0.169       0.22           0.30           0.53             0.64
          joint 120 topics                      767               0.043       0.06           0.09           0.26             0.40
          joint 60 topics                       698               0.030       0.04           0.07           0.24             0.36
          text teacher 120 topics               816               0.042       0.05           0.09           0.30             0.43
          text teacher 60 topics                757               0.037       0.05           0.08           0.26             0.39




[Four image panels: (a) CLIP 1st, (b) VTM 1st, (c) CLIP 2nd, (d) VTM 2nd]

Figure 1: Images closest to the story about the trial of Anna Semenova according to the baseline (a, c) and VTM (b, d) models.


We try 60 and 120 topics with both the joint and text-teacher approaches. Preliminary experiments showed that the more topics are used, the higher the model performance in text-image matching.

As a baseline, we use raw cosine similarities between CLIP embeddings, without any domain adaptation for the text. We use the implementation provided as part of the Sentence-BERT package (https://www.sbert.net/examples/applications/image-search/README.html).
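A minimal sketch of this baseline, following the Sentence-BERT image-search example; the checkpoint name (clip-ViT-B-32) and the file names are illustrative assumptions.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # CLIP checkpoint name is an assumption

# Encode the article text and the candidate images into the shared CLIP space.
article_embedding = clip.encode("Headline and body of the news article")
image_embeddings = clip.encode([Image.open(p) for p in ["img_001.jpg", "img_002.jpg"]])

# Rank candidate images by cosine similarity to the article embedding.
scores = util.cos_sim(article_embedding, image_embeddings)[0]
ranking = scores.argsort(descending=True)
```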
3 RESULTS
The results are presented in Table 1. As can be seen from the table, the best results are obtained with the CLIP embeddings, which are used without any fine-tuning on the training set. They find the correct image in 1225 cases out of 1915 and achieve a Mean Reciprocal Rank (MRR) of 0.17. The best VTM model finds the correct image in 816 cases out of 1915 and yields an MRR of 0.04.

These results correspond, to some extent, to our previous experiments, where we showed that topic modelling does not work well for document linking [6]. A probable explanation is that topic modelling produces a sparse representation of the data. While CLIP embeddings are continuous vectors and can represent an almost unlimited amount of information, in topic modelling the dimensions are not independent due to the diversity requirement described above. It can be seen from Table 1 that models with more topics yield better performance.

Another interesting observation is that models that use the text model as a teacher for the visual model work better than the joint models. This is an unexpected result, since one would expect that a model with access to the full information would serve as a better teacher. It is possible that text bears less noise: the text model uses the same text for the contextual and BoW representations, while an image can be completely irrelevant to the corresponding article.

The fact that embeddings and topic modelling work on different principles is illustrated in Figure 1, where we reproduce the images found by each model for the text about the Anna Semenova trial. The CLIP model finds photos of Anna Semenova, probably due to the huge text and image base used to train the embeddings. VTM returns images with a statue of Themis, a personification of Justice, which represent the topic of the text rather than specific facts. Although, according to our results, CLIP embeddings outperform VTM, the ability to illustrate the topic of a text might be a desirable property for some applications, as might topic interpretability.

Our code is available at https://github.com/lmphcs/media_eval_vctm.

ACKNOWLEDGMENTS
This work has been partly supported by the European Union’s Horizon 2020 research and innovation programme under grants 770299 (NewsEye) and 825153 (EMBEDDIA).


REFERENCES
[1] Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, and Elis-
    abetta Fersini. 2021. Cross-lingual Contextualized Topic Models with
    Zero-shot Learning. In Proceedings of the 16th Conference of the Euro-
    pean Chapter of the Association for Computational Linguistics: Main
    Volume. Association for Computational Linguistics, Online, 1676–1683.
    https://www.aclweb.org/anthology/2021.eacl-main.143
[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis-
    senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani,
    Matthias Minderer, Georg Heigold, Sylvain Gelly, and others. 2020.
    An Image is Worth 16x16 Words: Transformers for Image Recognition
    at Scale. In International Conference on Learning Representations.
[3] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao.
    2021. Knowledge distillation: A survey. International Journal of Com-
    puter Vision 129, 6 (2021), 1789–1819.
[4] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi,
    and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021.
    CEUR Workshop Proceedings.
[5] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel
    Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela
    Mishkin, Jack Clark, and others. 2021. Learning Transferable Visual Models
    From Natural Language Supervision. Technical Report.
[6] Elaine Zosa, Mark Granroth-Wilding, and Lidia Pivovarova. 2020. A Com-
    parison of Unsupervised Methods for Ad hoc Cross-Lingual Document
    Retrieval. In LREC 2020 Language Resources and Evaluation Conference,
    11–16 May 2020. 32.