=Paper=
{{Paper
|id=Vol-3181/paper37
|storemode=property
|title=Visual Topic Modelling for NewsImage Task at MediaEval 2021
|pdfUrl=https://ceur-ws.org/Vol-3181/paper37.pdf
|volume=Vol-3181
|authors=Lidia Pivovarova,Elaine Zosa
|dblpUrl=https://dblp.org/rec/conf/mediaeval/PivovarovaZ21
}}
==Visual Topic Modelling for NewsImage Task at MediaEval 2021==
Lidia Pivovarova, Elaine Zosa
University of Helsinki, Finland
first.last@helsinki.fi

ABSTRACT

We present the Visual Topic Model (VTM), a model able to generate a topic distribution for an image without using any text during inference. The model is applied to an image–text matching task at MediaEval 2021. Though the results for this specific task are negative (the model works worse than a baseline), we demonstrate that VTM produces meaningful results and can be used in other applications.

1 INTRODUCTION

We present a novel approach for Visual Topic Modelling (VTM), i.e. assigning to an image a topic distribution in which 2-3 topics are the most probable ones. A topic is represented as a list of words, so an image is labelled with a set of predefined keywords.

VTM is an extension of Contextualized Topic Models (CTM) [1]. For training it requires pairs of images and texts. During inference, it takes only an image as input. Thus, the model is capable of assigning topics to an image without any textual description.

In this paper, we apply VTM to MediaEval 2021 NewsImage Task 1, i.e. matching news articles with their corresponding images [4]. Our approach consists of training two aligned topic models: one model takes text as input, the other takes an image, and both produce as output a topic distribution over a common set of topics. During training, we use aligned texts and images and train the models in such a way that they produce similar output distributions. During inference, to find the images corresponding to a given text, we apply the visual and text models independently and then sort the images by the similarity of their topic distributions to the topic distribution of the text. Since each topic can be represented as a set of keywords, the results are interpretable.

To train aligned visual and text topic models we use knowledge distillation [3], i.e. we first train a teacher and then train a student model that should produce an output similar to that produced by the teacher.

Our experiments with text-to-image matching produced negative results: a solution based on VTM works worse than a baseline based on cosine similarity between out-of-the-box text and image embeddings [5]. Nevertheless, we believe that topic modelling for images can have many other applications. It may also be possible to improve the current solution with hyperparameter tuning or by using a larger training set.

2 METHOD

VTM is an extension of CTM [1]. CTM is a family of neural topic models trained to take text embeddings as input and to produce a bag-of-words reconstruction as output. The model trains an inference network to estimate the parameters of the topic distribution of the input. During inference this topic distribution is used as the model output to describe texts unseen during training. Thus, to train a model, each input instance has two parts: text embeddings and a bag-of-words representation (BoW). Our main contribution is that we replace the text embeddings with visual embeddings and demonstrate that they can be used to train a topic model.

The ZeroShot CTM model uses the BoW representation only to compute the loss, i.e. this information is not needed at inference time. Since we have a training set that consists of aligned text and image pairs, we can use the texts to produce the BoW representation and use it to train the model.

To obtain image embeddings we use CLIP, a pretrained model that produces text and image embeddings in the same space [5]. CLIP representations for text and images are already aligned. However, this is not a requirement for VTM: in our preliminary experiments we used ViT [2] for images and German BERT for texts (https://huggingface.co/bert-base-german-cased). The results obtained with non-aligned embeddings were only slightly worse than those with CLIP embeddings. The topic models converge to similar results because they use the same BoW to compute the loss; alignment of the embeddings simplifies this process but is not necessary.
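For illustration, a minimal sketch of obtaining such embeddings with the Sentence-BERT wrapper around CLIP is given below; the model name, example inputs, and file path are illustrative placeholders rather than the exact setup used in our experiments.

<syntaxhighlight lang="python">
# Minimal sketch: CLIP text and image embeddings via sentence-transformers.
# The model name and the inputs below are placeholders, not the exact setup
# used in the experiments.
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # text and image encoders share one embedding space

texts = ["Ein Beispieltext aus einem Nachrichtenartikel."]  # placeholder article text
images = [Image.open("example_news_image.jpg")]             # placeholder news image

text_embeddings = clip.encode(texts)    # one 512-dimensional vector per text
image_embeddings = clip.encode(images)  # one 512-dimensional vector per image
</syntaxhighlight>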
This basic procedure, i.e. training the image and text models independently, produces similar but not aligned topic models. The topics can differ slightly, and even similar topics are organised in a different (random) order. To increase the similarity between the text and image models we use knowledge distillation. In this approach a student model uses a different input than the teacher, e.g. an image instead of a text, but should produce the same result.

CTM uses the sum of two losses: a reconstruction loss and a divergence loss. The reconstruction loss ensures that the reconstructed BoW representation is not far from the true one. The divergence loss, measured as the KL-divergence between the priors and the posteriors, ensures a diversity property that is desirable for any topic model: only a few words have large probabilities for a given topic, and only a few topics have high probabilities for a given document.

In the knowledge distillation approach we leave the reconstruction loss intact but replace the divergence loss with the KL-divergence with respect to the teacher output. The assumption here is that, since the teacher model is already trained to be diverse and the student model is trained to mimic the teacher, the student does not need priors. Our experiments supported this assumption.

We use knowledge distillation in two versions: joint model and text-teacher. In the joint approach we first train a joint model that takes as input a concatenation of the text and image embeddings, and then train two student models, for images and for texts, separately. In the second approach, we first train a text model and then train an image model as its student.
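The resulting student objective can be sketched as follows; this is a schematic PyTorch formulation of the loss described above, with placeholder tensor names, not our exact implementation.

<syntaxhighlight lang="python">
# Schematic PyTorch sketch of the student objective described above:
# the CTM reconstruction loss is kept, while the divergence with respect to
# the priors is replaced by a KL-divergence towards the teacher's topic
# distribution. Tensor names are placeholders.
import torch.nn.functional as F

def student_loss(bow_true, bow_logits, student_topic_logits, teacher_topics):
    # bow_true:             (batch, vocab)    true bag-of-words counts
    # bow_logits:           (batch, vocab)    student's BoW reconstruction logits
    # student_topic_logits: (batch, n_topics) unnormalised student topic scores
    # teacher_topics:       (batch, n_topics) teacher topic distribution (rows sum to 1)

    # Reconstruction loss: negative log-likelihood of the true BoW under the
    # reconstructed word distribution.
    log_word_probs = F.log_softmax(bow_logits, dim=-1)
    reconstruction = -(bow_true * log_word_probs).sum(dim=-1).mean()

    # Distillation loss: KL(teacher || student) over topic distributions,
    # replacing the original prior/posterior divergence term.
    student_log_topics = F.log_softmax(student_topic_logits, dim=-1)
    distillation = F.kl_div(student_log_topics, teacher_topics, reduction="batchmean")

    return reconstruction + distillation
</syntaxhighlight>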
We try 60 and 120 topics with both the joint and the text-teacher approaches. Preliminary experiments showed that the more topics are used, the higher the model performance in text–image matching.

As a baseline, we use raw cosine similarities between CLIP embeddings, without any domain adaptation for the text. We use the implementation provided as part of the Sentence-BERT package (https://www.sbert.net/examples/applications/image-search/README.html).
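A minimal sketch of such a cosine-similarity baseline, continuing the embedding snippet above, is given below; the variable names are placeholders and this is not our exact code.

<syntaxhighlight lang="python">
# Minimal sketch of the cosine-similarity baseline over CLIP embeddings,
# continuing the embedding snippet above (variable names are placeholders).
from sentence_transformers import util

# text_embeddings: (n_texts, dim), image_embeddings: (n_images, dim)
similarities = util.cos_sim(text_embeddings, image_embeddings)  # (n_texts, n_images)

# For each text, rank the candidate images by decreasing cosine similarity
# and keep the 100 best-matching images, as required by the task.
top100 = similarities.argsort(dim=-1, descending=True)[:, :100]
</syntaxhighlight>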
3 RESULTS

The results are presented in Table 1. As can be seen from the table, the best results are obtained with the CLIP embeddings, which are used without any fine-tuning on the training set. They find the correct image in 1225 cases out of 1915 and yield a Mean Reciprocal Rank (MRR) of 0.17. The best VTM model finds the correct image in 816 cases out of 1915 and yields an MRR of 0.04.

{| class="wikitable"
|+ Table 1: Results
! Model !! Correct in Top 100 !! MRR@100 !! Recall@5 !! Recall@10 !! Recall@50 !! Recall@100
|-
| baseline (CLIP) || 1225 || 0.169 || 0.22 || 0.30 || 0.53 || 0.64
|-
| joint, 120 topics || 767 || 0.043 || 0.06 || 0.09 || 0.26 || 0.40
|-
| joint, 60 topics || 698 || 0.030 || 0.04 || 0.07 || 0.24 || 0.36
|-
| text teacher, 120 topics || 816 || 0.042 || 0.05 || 0.09 || 0.30 || 0.43
|-
| text teacher, 60 topics || 757 || 0.037 || 0.05 || 0.08 || 0.26 || 0.39
|}

These results to some extent correspond to our previous experiments, in which we showed that topic modelling does not work well for document linking [6]. A probable explanation is that topic modelling produces a sparse representation of the data. While CLIP embeddings are continuous vectors that can represent an almost infinite amount of information, in topic modelling the dimensions are not independent because of the diversity requirement described above. It can be seen from Table 1 that models with more topics yield better performance.

Another interesting observation is that the models that use the text model as a teacher for the visual model work better than the joint models. This is an unexpected result, since one would expect a model that has access to the full information to serve as a better teacher. It is possible that text carries less noise: a text model uses the same text for both the contextual and the BoW representation, while an image can be completely irrelevant to its corresponding article.

The fact that embeddings and topic modelling work on different principles is illustrated in Figure 1, which shows the images found by each model for a story about the trial of Anna Semenova. The CLIP model finds photos of Anna Semenova, probably thanks to the huge text and image base used to train the embeddings. VTM returns images of a statue of Themis, a personification of Justice, which represent the topic of the text rather than specific facts. Though, according to our results, CLIP embeddings outperform VTM, the ability to illustrate the topic of a text might be a desirable property for some applications, as is topic interpretability.

Figure 1: Images closest to the story about the trial of Anna Semenova according to the baseline (a, c) and VTM (b, d) models. Panels: (a) CLIP 1st, (b) VTM 1st, (c) CLIP 2nd, (d) VTM 2nd. [Images not reproduced here.]

Our code is available at https://github.com/lmphcs/media_eval_vctm.

ACKNOWLEDGMENTS

This work has been partly supported by the European Union's Horizon 2020 research and innovation programme under grants 770299 (NewsEye) and 825153 (EMBEDDIA).

REFERENCES

[1] Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, and Elisabetta Fersini. 2021. Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 1676–1683. https://www.aclweb.org/anthology/2021.eacl-main.143

[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and others. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.

[3] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. 2021. Knowledge Distillation: A Survey. International Journal of Computer Vision 129, 6 (2021), 1789–1819.

[4] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. CEUR Workshop Proceedings.

[5] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and others. 2021. Learning Transferable Visual Models From Natural Language Supervision. Technical Report.

[6] Elaine Zosa, Mark Granroth-Wilding, and Lidia Pivovarova. 2020. A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval. In LREC 2020 Language Resources and Evaluation Conference, 11–16 May 2020, 32.