Image-Text Re-Matching Using Swin Transformer and DistilBERT

Yuta Fukatsu
Department of Computer Science & Engineering
Toyohashi University of Technology
Toyohashi, Aichi, Japan
fukatsu.yuta.ye@tut.jp

Masaki Aono
Department of Computer Science & Engineering
Toyohashi University of Technology
Toyohashi, Aichi, Japan
aono@tut.jp




ABSTRACT
In recent years, news media have become multimodal. The relationship between text and images in news articles is complex and needs to be understood. In this paper, we work on Image-Text Re-Matching to understand the relationship between images and text, and we apply and improve the image retrieval method ADAPT. The improvements are made by reconsidering the feature extraction methods used for image retrieval: we employ Swin Transformer for image feature extraction and DistilBERT for text feature extraction. According to the report from the organizers, our best run achieved an MRR@100 score of 0.0789 and a Recall@100 score of 0.5781 on the test set.

1    INTRODUCTION
Online news articles in recent years have mixed components, consisting of text and images. Images are often added to text articles to attract attention and to help readers understand the articles intuitively. Research on multimedia and recommendation systems usually assumes a simple relationship between images and text. For example, in image captioning [1], the caption is assumed to be a literal description of the scene shown in the image. However, news-specific studies have pointed out a more complex relationship [2]. The NewsImages task of MediaEval 2021 investigates this relationship to understand its impact on journalism and news personalization. Our team (KDEval 2021) participated in subtask 1, Image-Text Re-Matching, in which the links between a set of articles and their images have been removed.
    In MediaEval 2020, metric learning was introduced for this problem [3]. We thus adopt a metric-learning-based method inspired by ADAPT [4] and perform Re-Matching with a text-based image retrieval method. We reconsider and experiment with the image feature extraction and the text feature extraction of ADAPT [4] for NewsImages. After reconsidering the feature extraction methods, we confirmed that the best results are obtained by using Swin Transformer for image feature extraction and DistilBERT for text feature extraction.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, 13-15 December 2021, Online

2    RELATED WORK

2.1    ADAPT
ADAPT is an image-to-text (text-to-image) alignment model used for cross-modal retrieval. It takes a text (image) as input, searches for the closest image (text), and outputs it. In ADAPT, the features of the input modality are used to adaptively recalculate the features of the other modality.
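The core of ADAPT is this adaptive recalculation: the query modality predicts per-dimension scaling and shifting vectors that modulate the candidate modality's features before a similarity score is computed. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the module name, dimensions, and pooling choice are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveEmbedding(nn.Module):
    """Sketch of ADAPT-style adaptation: the query vector (e.g. a pooled text
    embedding) predicts scale and shift parameters that are applied to the
    candidate features (e.g. image region or patch features)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)  # gamma(query)
        self.to_shift = nn.Linear(dim, dim)  # beta(query)

    def forward(self, query: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # query: (B, D) pooled query embedding
        # candidates: (B, N, D) set of candidate features
        gamma = self.to_scale(query).unsqueeze(1)   # (B, 1, D)
        beta = self.to_shift(query).unsqueeze(1)    # (B, 1, D)
        adapted = gamma * candidates + beta         # condition candidates on the query
        pooled = adapted.mean(dim=1)                # (B, D) pooled candidate vector
        return F.cosine_similarity(pooled, query, dim=-1)  # (B,) similarity scores

Because the adaptation is an element-wise operation rather than full cross-attention, scoring many candidates per query stays cheap, which is the efficiency advantage of ADAPT referred to in Section 3.2.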
2.2    DistilBERT
DistilBERT [5] is a distilled version of BERT (Bidirectional Encoder Representations from Transformers), a language model that understands context in both directions and is pre-trained on a large corpus. BERT has the disadvantage that the model is very large relative to its performance, so DistilBERT distills it into a lighter and faster model.

2.3    Swin Transformer
Swin Transformer [7] is a type of Vision Transformer, an image recognition model that brings the Transformer architecture, which has been successful in natural language processing, to images. Vision Transformer benefits from the Transformer by dividing an image into patches and treating them like words in NLP. Swin Transformer addresses a shortcoming of Vision Transformer: fixed-size patches are insufficient for recognizing objects of various sizes.

3    APPROACH
As described in the following sections, we reconsider the feature extraction methods used in ADAPT for the specific case of news articles and explain our method.

3.1    Reconsidering Text Feature Extraction
In ADAPT, GloVe embeddings and a bi-directional GRU are used to extract text features that take contextual information into account. However, even with context-aware methods based on a bi-directional GRU, there is a limit to how well context involving distant words can be maintained, and news texts in particular tend to be long. We therefore newly adopted DistilBERT as our text feature extraction method: it can handle longer texts and yields better features thanks to its rich pre-training. DistilBERT is also lighter than plain BERT, which makes it more practical for real-time search and recommendation applications.
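As a concrete illustration, text features of this kind can be obtained with the Hugging Face transformers library. The sketch below is an assumption about the setup rather than the exact pipeline used in our runs; the checkpoint name distilbert-base-german-cased is one plausible choice for the German-language articles mentioned in Section 3.3, and mean pooling is one of several reasonable pooling strategies.

import torch
from transformers import AutoTokenizer, AutoModel

# A German DistilBERT checkpoint; the exact model used in our runs may differ.
MODEL_NAME = "distilbert-base-german-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def text_features(texts: list[str]) -> torch.Tensor:
    """Encode a batch of article texts into fixed-size vectors by
    mean-pooling the token embeddings of the last hidden layer."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # masked mean pooling, (B, 768)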
3.2    Reconsidering Image Feature Extraction
The image feature extraction in ADAPT is based on a Faster R-CNN pre-trained on the Visual Genome dataset [6]. This method uses the 36 objects detected with the highest confidence in an image as features. However, the images attached to news articles are often abstract or only loosely evocative of the article content. In such cases we cannot obtain useful object features, or we are forced to keep 36 objects detected at extremely low confidence thresholds. We therefore reconsidered how to acquire useful features while retaining the advantage of ADAPT, which is more efficient than attention-based methods because it operates on spatial-level features. To deal with this problem, we adopted the Swin Transformer, which makes it possible to obtain spatial-level and semantically meaningful features.
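For illustration, spatial patch features can be taken from a Swin Transformer pre-trained on ImageNet-21K [8] via the timm library. This is a sketch under assumptions: the checkpoint name swin_base_patch4_window7_224_in22k is a stand-in for the variant actually used, and the shape returned by forward_features depends on the timm version.

import timm
import torch
from PIL import Image
from timm.data import resolve_data_config, create_transform

# ImageNet-21K pre-trained Swin Transformer; the exact variant is an assumption.
model = timm.create_model("swin_base_patch4_window7_224_in22k",
                          pretrained=True, num_classes=0).eval()
transform = create_transform(**resolve_data_config({}, model=model))

def image_features(path: str) -> torch.Tensor:
    """Return spatial-level patch features, playing the role of the 36
    Faster R-CNN region features in the original ADAPT pipeline."""
    x = transform(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        feats = model.forward_features(x)
    # Depending on the timm version this is (1, H, W, C) or (1, N, C);
    # reshape to a flat sequence of patch vectors (1, N, C).
    return feats.reshape(1, -1, feats.shape[-1])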

3.3    Training and Submitted Runs
In subtask 1, Batch1 to Batch3, covering three periods, are provided by the organizers as training data, and Batch4 serves as the test data. We therefore used Batch1 and Batch2 as training data and Batch3 as validation data. Predictions for the test data were made by extracting features from all of the test data and then computing cosine similarity between the features to obtain the top 100 candidate images for each article.
    In Run1, we used DistilBERT pre-trained on German text for text feature extraction and a Faster R-CNN trained on the Visual Genome dataset for image feature extraction. In Run2, we changed the image feature extraction to a Swin Transformer pre-trained on ImageNet-21K [8]. In Run3, we changed the batch size from 105 to 32.
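The prediction step described above reduces to ranking all candidate images by cosine similarity with each article's text embedding and keeping the 100 best. A minimal sketch, assuming the feature extractors sketched earlier produce one pooled vector per item, is shown below.

import torch
import torch.nn.functional as F

def rank_candidates(text_feats: torch.Tensor,
                    image_feats: torch.Tensor,
                    k: int = 100) -> torch.Tensor:
    """Return the indices of the top-k candidate images for every article.

    text_feats:  (num_articles, D) pooled text embeddings
    image_feats: (num_images, D) pooled image embeddings
    """
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    sims = t @ v.T                         # cosine similarity matrix
    return sims.topk(k, dim=-1).indices    # (num_articles, k)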


4    RESULTS AND ANALYSIS
The results of the submitted runs are summarized in Table 1. The left column gives the name of each run. The evaluation metrics are MRR@100, Recall@5, Recall@10, Recall@50, and Recall@100; in the tables, Recall@k is written as R@k for brevity.
    Table 2 compares the evaluation metrics on Batch3, which we treated as validation data, with those on the test data. The results show no significant difference in distribution between the data provided by the organizers for training and the test data. This led us to perform several analyses on Batch3.
    Figures 1 and 2 show the word frequencies for articles whose image was found within the top 5 search results and for articles whose image was not found within the top 100, respectively. The words shown are limited to noun lemmas extracted from the text with TreeTagger [9]. Comparing the two figures, we see no significant difference in the words that frequently appear in the success and failure cases. We therefore expect the performance of our method to be low for articles with similar content.

Table 1: Submission results
         MRR@100   R@5      R@10     R@50     R@100
Run 1    0.0466    0.0642   0.1081   0.3159   0.4637
Run 2    0.0738    0.0971   0.1629   0.4318   0.5749
Run 3    0.0789    0.1044   0.1687   0.4371   0.5781

Table 2: Comparison between Batch3 as validation data and Batch4 as test data
         MRR@100   R@5      R@10     R@50     R@100
Batch3   0.0695    0.0956   0.1653   0.4014   0.5357
Batch4   0.0789    0.1044   0.1687   0.4371   0.5781

Figure 1: Top 10 most frequent words found in the top 5
Figure 2: Top 10 most frequent words not found in the top 100
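The noun lemmas behind Figures 1 and 2 can be obtained from TreeTagger [9]. The sketch below uses the third-party treetaggerwrapper package as an assumed interface (the original analysis may have called TreeTagger directly) and the STTS noun tags NN/NE for German.

from collections import Counter
import treetaggerwrapper  # assumes a local TreeTagger installation

tagger = treetaggerwrapper.TreeTagger(TAGLANG="de")

def noun_lemma_counts(texts: list[str]) -> Counter:
    """Count noun lemmas (STTS tags NN/NE) across a set of article texts."""
    counts: Counter = Counter()
    for text in texts:
        tags = treetaggerwrapper.make_tags(tagger.tag_text(text))
        counts.update(t.lemma for t in tags
                      if getattr(t, "pos", "").startswith("N"))
    return counts

# e.g. noun_lemma_counts(success_texts).most_common(10) for Figure 1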
5    CONCLUSION AND FUTURE WORK
We changed the image feature extraction in ADAPT to Swin Transformer and the text feature extraction to DistilBERT. With these changes, we achieved an MRR@100 score of 0.07885 and a Recall@100 score of 0.57807, meaning that our retrieval method places the matching image among the top 100 candidates for more than half of the articles. Looking at the word frequencies for successful and unsuccessful search results, the same words appear frequently in both cases, so we need to improve our search method for articles with similar content.

REFERENCES
[1] MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Comput. Surv. 51, 6, Article 118 (Feb. 2019). https://doi.org/10.1145/3295748
[2] Nelleke Oostdijk, Hans van Halteren, Erkan Başar, and Martha Larson. 2020. The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. 4343-4351.
[3] Quang-Thuc Nguyen, Tuan-Duy Nguyen, Thang-Long Nguyen-Ho, Anh-Kiet Duong, Xuan-Nhat Hoang, Vinh-Thuyen Nguyen-Truong, Hai-Dang Nguyen, and Minh-Triet Tran. 2020. HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching. In Proceedings of the MediaEval 2020 Workshop, Online, 14-15 December 2020.
[4] J. Wehrmann, C. Kolling, and R. C. Barros. 2020. Adaptive Cross-Modal Embeddings for Image-Text Alignment. Proceedings of the AAAI Conference on Artificial Intelligence 34, 07 (Apr. 2020), 12313-12320. https://doi.org/10.1609/aaai.v34i07.6915
[5] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
[6] R. Krishna, Y. Zhu, O. Groth, et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123, 32-73 (2017). https://doi.org/10.1007/s11263-016-0981-7
[7] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, 10012-10022.
[8] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. 2021. ImageNet-21K Pretraining for the Masses. arXiv:2104.10972.
[9] Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT-Workshop, 47-50.