=Paper=
{{Paper
|id=Vol-3181/paper26
|storemode=property
|title=Image-Text Re-Matching Using Swin Transformer and DistilBERT
|pdfUrl=https://ceur-ws.org/Vol-3181/paper26.pdf
|volume=Vol-3181
|authors=Yuta Fukatsu,Masaki Aono
|dblpUrl=https://dblp.org/rec/conf/mediaeval/FukatsuA21
}}
==Image-Text Re-Matching Using Swin Transformer and DistilBERT==
Yuta Fukatsu and Masaki Aono
Department of Computer Science & Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan
fukatsu.yuta.ye@tut.jp, aono@tut.jp

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, 13-15 December 2021, Online.

ABSTRACT

In recent years, the news media has become multimodal, and the relationship between text and images in news is complex and needs to be understood. In this paper, we work on Image-Text Re-Matching to understand the relationship between images and text, and we apply and improve the image retrieval method ADAPT. Our improvements come from reconsidering the feature extraction methods used for image retrieval: we employ Swin Transformer for image feature extraction and DistilBERT for text feature extraction. According to the report from the organizers, our runs achieved an MRR@100 score of 0.0789 and a Recall@100 score of 0.5781 on the test set.

1 INTRODUCTION

Online news articles in recent years mix components, consisting of texts and images. Images are often added to text articles to attract attention and to help readers understand the articles intuitively. Research on multimedia and recommendation systems usually assumes a simple relationship between images and text. For example, in studies of image captioning [1], the caption is assumed to be a literal description of the depicted scene. However, news-specific studies have pointed out a more complex relationship [2]. The NewsImages task of MediaEval 2021 investigates this relationship to understand its impact on journalism and news personalization. Our team (KDEval 2021) participated in subtask 1, Image-Text Re-Matching, in which the links between a series of articles and their images have been removed and must be re-established.

At MediaEval 2020 [3], metric learning was introduced for this problem. We thus adopt a metric-learning-based method inspired by ADAPT [4] and perform re-matching as text-based image retrieval. We reconsider and experiment with the image and text feature extraction of ADAPT [4] for NewsImages. After reconsidering the feature extraction methods, we confirmed that the best results are obtained by using Swin Transformer for image feature extraction and DistilBERT for text feature extraction.

2 RELATED WORK

2.1 ADAPT

ADAPT is an image-to-text (and text-to-image) alignment model used for cross-modal retrieval. It takes a text (image) as input, searches for the closest image (text), and outputs it. In ADAPT, the features of the input modality are used to recalculate the features of the other modality.

2.2 DistilBERT

DistilBERT [5] is a distilled version of BERT (Bidirectional Encoder Representations from Transformers), a language model that captures context in both directions and has been pre-trained at large scale. BERT has the disadvantage that the model is large relative to its performance; DistilBERT obtains a lighter and faster model through knowledge distillation.

2.3 Swin Transformer

Swin Transformer [7] is a type of Vision Transformer, an image recognition model that brings the Transformer architecture, which has been successful in natural language processing, to images. Vision Transformer benefits from the Transformer by dividing images into patches and treating them like words in NLP. Swin Transformer addresses a shortcoming of Vision Transformer, namely that fixed-size patches are insufficient for recognizing objects of various sizes.
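To make the two feature extractors concrete, below is a minimal sketch of how sentence-level and image-level features could be obtained with DistilBERT and Swin Transformer. It assumes the Hugging Face transformers and timm libraries; the checkpoint names, the mean pooling, and the feature dimensions are illustrative assumptions rather than the exact configuration used in our runs.

<pre>
# Minimal feature-extraction sketch (assumption: Hugging Face transformers + timm;
# checkpoint names and pooling are illustrative, not the exact setup of our runs).
import torch
import timm
from transformers import AutoTokenizer, AutoModel

# Text side: a publicly available German DistilBERT checkpoint (assumed, may differ
# from the checkpoint used in the submitted runs).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")
text_encoder = AutoModel.from_pretrained("distilbert-base-german-cased")

def encode_text(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**batch)
    # Mean-pool token embeddings over non-padding positions (one of several options).
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * mask).sum(1) / mask.sum(1)   # (B, 768)

# Image side: Swin Transformer pre-trained on ImageNet-21K, used as a feature extractor.
image_encoder = timm.create_model(
    "swin_base_patch4_window7_224_in22k", pretrained=True, num_classes=0)

def encode_images(images):  # images: (B, 3, 224, 224) tensor, already normalized
    with torch.no_grad():
        return image_encoder(images)   # pooled features, (B, 1024) for the base model
</pre>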
3 APPROACH

As described in the next sections, we reconsider the feature extraction methods used in ADAPT for the specific case of news articles and explain our method.

3.1 Reconsidering Text Feature Extraction

ADAPT uses GloVe embeddings and a bi-directional GRU to extract text features that take contextual information into account. However, even with context-aware methods such as a bi-directional GRU, there is a limit to how well contextual information between distant words can be maintained, and news articles in particular tend to be long. We therefore adopted DistilBERT as our text feature extraction method: it can handle longer texts and can obtain better features thanks to its rich pre-training. DistilBERT is also lighter than plain BERT, which is more practical for applications such as real-time search and recommendation.

3.2 Reconsidering Image Feature Extraction

The image feature extraction in ADAPT is based on a Faster R-CNN pre-trained on the Visual Genome dataset [6]. This method uses the 36 objects detected with the highest confidence in the image as features. However, the images in news articles are often abstract or only loosely evocative of the article content. In such cases we either cannot obtain useful features or have to extract 36 objects at extremely low confidence thresholds. We therefore reconsidered how to acquire useful features while retaining the advantage of ADAPT, which is more efficient than attention-based methods because it uses spatial-level features. To deal with this problem, we adopted the Swin Transformer, which makes it possible to obtain spatial-level and meaningful features.

3.3 Training and Submitted Runs

In subtask 1, Batch1 to Batch3, covering three periods, are provided by the organizers as training data, and Batch4 is provided as test data. We used Batch1 and Batch2 as training data and Batch3 as validation data. Predictions for the test data were made by extracting features from all the test data and then computing cosine similarities to obtain the top 100 candidates, as sketched below. In Run 1, we used DistilBERT pre-trained on German text for text feature extraction and a Faster R-CNN trained on the Visual Genome dataset for image feature extraction. In Run 2, we changed the image feature extraction to a Swin Transformer pre-trained on ImageNet-21K [8]. In Run 3, we additionally changed the batch size from 105 to 32.
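As a rough illustration of the retrieval step described above, the sketch below ranks the candidate images for each article by cosine similarity between text and image embeddings and computes MRR@100 and Recall@k. It assumes the embeddings have already been projected into a shared space by the trained model; the variable names and the simplified evaluation are our own and not the official task evaluation code.

<pre>
# Cosine-similarity ranking and MRR@100 / Recall@k, assuming text_emb and img_emb
# are (N, d) tensors in a shared embedding space, where row i of img_emb is the
# ground-truth image for the article in row i of text_emb.
import torch
import torch.nn.functional as F

def rank_images(text_emb, img_emb, top_k=100):
    # Normalizing first makes the dot product equal to cosine similarity.
    sims = F.normalize(text_emb, dim=1) @ F.normalize(img_emb, dim=1).T   # (N, N)
    return sims.topk(top_k, dim=1).indices          # (N, top_k) image indices per article

def mrr_and_recall(ranking, ks=(5, 10, 50, 100)):
    n = ranking.size(0)
    gt = torch.arange(n).unsqueeze(1)               # ground-truth image index = row index
    hits = ranking.eq(gt)                           # True where the ground truth appears
    found = hits.any(dim=1)
    ranks = hits.float().argmax(dim=1) + 1          # 1-based rank of the ground truth, if found
    rr = torch.where(found, 1.0 / ranks.float(), torch.zeros(n))
    recalls = {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
    return rr.mean().item(), recalls

# Example with random stand-in embeddings (placeholders for the projected features).
text_emb, img_emb = torch.randn(500, 256), torch.randn(500, 256)
ranking = rank_images(text_emb, img_emb, top_k=100)
mrr100, recalls = mrr_and_recall(ranking)
print(f"MRR@100 = {mrr100:.4f}", recalls)
</pre>

Computing the full similarity matrix with a single matrix multiplication over normalized embeddings keeps the candidate ranking cheap, which is one reason spatial-level features without per-pair attention are attractive here.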
4 RESULTS AND ANALYSIS

The results of the submitted runs are summarized in Table 1. The left column shows the name of each run. The evaluation metrics shown are MRR@100, Recall@5, Recall@10, Recall@50, and Recall@100; in the tables, Recall@k is written as R@k for brevity.

Table 1: Submission results

        MRR@100   R@5      R@10     R@50     R@100
Run 1   0.0466    0.0642   0.1081   0.3159   0.4637
Run 2   0.0738    0.0971   0.1629   0.4318   0.5749
Run 3   0.0789    0.1044   0.1687   0.4371   0.5781

Table 2 compares the evaluation metrics on Batch3, treated as validation data, with those on the test data (Batch4). The results suggest that there is no significant difference in distribution between the data provided by the organizers for training and the test data. This led us to perform several analyses on Batch3.

Table 2: Comparison between Batch3 as validation data and Batch4 as test data

         MRR@100   R@5      R@10     R@50     R@100
Batch3   0.0695    0.0956   0.1653   0.4014   0.5357
Batch4   0.0789    0.1044   0.1687   0.4371   0.5781

Figures 1 and 2 show the word frequencies for articles whose image is found in the top 5 search results and for articles whose image is not found in the top 100, respectively. The words shown are limited to nouns (lemmas) extracted from the text using Tree Tagger [9]. Comparing the two figures, we see no significant difference in the words that appear frequently in the success and failure cases. The performance of our method on articles with similar content is therefore considered to be low.

Figure 1: Top 10 most frequent words found in the top 5.

Figure 2: Top 10 most frequent words not found in the top 100.

5 CONCLUSION AND FUTURE WORK

We changed the image feature extraction in ADAPT to Swin Transformer and the text feature extraction to DistilBERT. With these changes, we achieved an MRR@100 score of 0.07885 and a Recall@100 score of 0.57807, which means that our retrieval method finds the matching image within the top 100 candidates for over half of the articles. Looking at the word frequencies of successful and unsuccessful search results, the same words appear frequently in both cases, so we need to improve our search method for articles with similar content.

REFERENCES

[1] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Comput. Surv. 51, 6, Article 118 (Feb. 2019). https://doi.org/10.1145/3295748
[2] Nelleke Oostdijk, Hans van Halteren, Erkan Başar, and Martha Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. 4343-4351.
[3] Quang-Thuc Nguyen, Tuan-Duy Nguyen, Thang-Long Nguyen-Ho, Anh-Kiet Duong, Xuan-Nhat Hoang, Vinh-Thuyen Nguyen-Truong, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching. In Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020.
[4] Jonatas Wehrmann, Camila Kolling, and Rodrigo C. Barros. 2020. Adaptive Cross-Modal Embeddings for Image-Text Alignment. In Proceedings of the AAAI Conference on Artificial Intelligence 34, 07 (Apr. 2020), 12313-12320. https://doi.org/10.1609/aaai.v34i07.6915
[5] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
[6] Ranjay Krishna, Yuke Zhu, Oliver Groth, et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123, 32-73. https://doi.org/10.1007/s11263-016-0981-7
[7] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10012-10022.
[8] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. 2021. ImageNet-21K Pretraining for the Masses. arXiv:2104.10972.
[9] Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT Workshop, 47-50.