=Paper= {{Paper |id=Vol-3181/paper45 |storemode=property |title=Methods for Text-Image-Rematching using Pair-wise Similarity and Canonical Similarity Analysis |pdfUrl=https://ceur-ws.org/Vol-3181/paper45.pdf |volume=Vol-3181 |authors=Kani Abdul,Kiran Kiran,Max Rudat,Alexandros Vasileiou,Andreas Lommatzsch |dblpUrl=https://dblp.org/rec/conf/mediaeval/AbdulKRVL21 }} ==Methods for Text-Image-Rematching using Pair-wise Similarity and Canonical Similarity Analysis== https://ceur-ws.org/Vol-3181/paper45.pdf
 Methods for Text-Image-Rematching using Pair-wise Similarity
               and Canonical Similarity Analysis
                  Kani Abdul, Kiran Kiran, Max Rudat, Alexandros Vasileiou, Andreas Lommatzsch
                                                   Technische Universität Berlin, Germany
                              {kani.abdul,k.kiran,rudat,a.vasileiou,andreas.lommatzsch}@campus.tu-berlin.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, 13-15 December 2021, Online

ABSTRACT
Matching images to text plays an important role in cross-media retrieval, and research
has shown this to be an underestimated challenge. This problem is addressed by the
MediaEval 2021 NewsImages Challenge with the goal of gaining more insights into the
real-world relationship of news articles and images. We develop models for
re-establishing the connection of a news article to its corresponding image using
datasets of a German news publisher (“task 1”). Our approaches follow the idea of
pairwise similarity learning and are optimized by algorithmic hill climbing.
Additionally, we employ Canonical Correlation Analysis as an approach using joint
embedding learning. The evaluation shows that our approaches produce good results for
the underlying image-text rematching task, yet require further optimization to yield
stable prediction performance.

1    INTRODUCTION
Multimedia content accompanies our everyday life. News articles are one form of
multimedia that is characterized by textual content accompanied by imagery. The
assumption that a simple relationship underlies this connection has frequently turned
out to be oversimplified in research [2]. The MediaEval 2021 NewsImages task aims at
addressing this challenge by investigating the real-world relationship of news and
images. The challenge provides a dataset consisting of three batches of training data
and one batch for the evaluation. The performance of the participants’ algorithms is
evaluated on a test batch. The evaluation metrics Recall@k and Mean Reciprocal Rank
are used [3].
   The challenge of image-text retrieval has been addressed broadly in research around
multimedia analysis [8]. Deep image-text matching serves as one frequently used
approach for this scenario. Zhang and Lu [9] classify the main approaches based on deep
learning into two categories: pairwise similarity learning and joint embedding
learning. For pairwise similarity learning, the main idea is to learn a similarity
network for predicting the score of image-text pairs [8]. In the other category, joint
embedding learning, a joint latent space is defined in which the vectors of texts and
images can be compared directly. The learning methods typically used in this category
are canonical correlation analysis (CCA) and bi-directional ranking loss [9].
   Based on the existing methods, we develop two text-image matching strategies
optimized for the specific requirements of the NewsImages task. In Sec. 2 we explain
the preprocessing and the steps of our strategies in detail. Sec. 3 presents the
evaluation results. Finally, the overall findings are discussed in Sec. 4.

2    APPROACH
We develop two approaches addressing the text-image rematching task. This section
discusses the steps of our approaches.

   Data Preprocessing. We preprocess the provided data to compute similarity scores
efficiently. Firstly, we translate the image labels (computed by VGG-19 trained on
ImageNet) from English to German using Google Translate [7]. We decided to translate
the image labels instead of the article snippets due to the smaller volume to
translate. In addition, we enhance the dataset by extracting the item category (e.g.
‘koeln’, ‘panorama’, ‘wirtschaft’, ‘politik’) from the article URL. In the next step,
we normalize the dataset by removing stop words, punctuation marks, spaces, special
characters, and digits. After removing these tokens, we employ part-of-speech (POS)
tagging to identify the nouns in the dataset. Finally, we perform morphological
processing (“lemmatization”) to create the dataset.
   In addition to the standard preprocessing, we integrate Open-de-WordNet [6],
enabling us to consider synonyms when computing the similarity score. Moreover, we
implement an outlier detection and removal strategy based on the z-score for the terms
derived from the textual description. If a word’s z-score is larger than 3.0 (more
than 3 standard deviations away from the mean), it is considered an outlier and is
removed from the dataset. Adding these two derived fields to our dataset enables us to
investigate whether additional preprocessing improves the performance.
   Pairwise similarity learning & algorithmic hill climbing. Our first approach follows
the idea of pairwise similarity learning. The objective is to compute a similarity
score for each image-article pair, which is then used to construct 1-to-1 matches. We
implement this using the similarity function of the natural language processing (NLP)
library spaCy [4]. spaCy offers two methods to find the similarity between words: one
based on context-sensitive tensors and another one based on word vectors [1]. We
utilize the latter method and generate a similarity matrix containing the similarity
scores for each image-text pair. Our preprocessed dataset offers different options for
setting the input for computing the similarity scores. For example, we tested
including only the words within the article text or only the words within the article
title. Analogously for the images, we have data from 10 different labeler
configurations, each generating at least 2 image labels with different label
probabilities.
   For computing the best parameter configuration, we make use of algorithmic hill
climbing. Starting with an (arbitrary) initial configuration, the solution is
incrementally adapted [5]. We implement this by first initializing the parameters
(e.g. set labelprobability to 0.0,
number of considered labels per image to nine and the remaining parameters to False).
Then we go iteratively through the parameter space and compute for each parameter the
(locally) optimal value. For the optimization of each parameter, we randomly select
1,000 samples.
   The parameter configurations are evaluated using matrices containing pairwise
similarity scores, with the columns representing the image IDs, the rows the article
IDs, and the correct matches being located along the diagonal of the matrix. The score
is computed using the variables Row counter, Column counter, and Total counter.
   For each generated parameter configuration, the performance is evaluated as follows:
if the similarity score for the diagonal value is higher than the other scores within
its row and column, then the total counter is increased by 1. If this condition is not
met, then it is checked whether the diagonal value is higher than the other scores
within its row; if so, the row counter is increased by 1. If both conditions are not
met, the column counter is increased by 1. Once the values of all three counting
variables are set, the performance is calculated by comparing the values of the row and
column counters: if the row counter is larger than the column counter, then the row
counter is divided by the number of pairs (n) and returned. Otherwise the column
counter divided by the number of pairs (n) is returned as the final score and
evaluated by our hill climbing algorithm.

   Canonical Correlation Analysis. As a second approach, we apply the sklearn Canonical
Correlation Analysis. On the preprocessed dataset, the spaCy implementation of Word2Vec
is used for computing a vector representation (having 300 dimensions) of the dataset.
We use a random sample of 1,500 data points, utilizing the article text and image
labels columns. Then we split this set into a train set (2/3) and a validation set
(1/3). Initial tests of CCA showed poor performance, which is why we adapted the
method. We applied Kernel PCA (kPCA) to transform the data through a Radial Basis
Function (RBF) expansion, limited to at most 700 generated data dimensions. The kPCA
transformation is applied on both the train and test data to keep compatible and
comparable dimensions. Then, we train a CCA instance on the train data set and evaluate
the model on both the train and validation set. The evaluation is performed based on
the predicted vector for each of the article text vectors. Since the CCA-predicted
vectors most likely do not correspond to actual word2vec image label transformations,
we compare each CCA prediction to all of the word2vec image label vectors using the
cosine similarity measure, thus constructing a similarity matrix. This allows us to
construct 1-to-1 mappings from article texts to image labels.

3    EVALUATION
The evaluation (by the task organizers) shows that our pairwise similarity-based
approach reaches Recall@100 = 0.21, while the CCA-based approach reaches
Recall@100 = 0.24. Even though CCA outperforms the pairwise similarity-based approach
for long result lists (k=50 and k=100), the recall scores for k=5 and k=10 are higher
with hill climbing, indicating that low-level semantics are found more effectively
using the straightforward method of pairwise similarity learning. The better evaluation
scores for higher k observed for CCA show that CCA performs better with regard to
high-level semantic similarity.

Table 1: Evaluation results. Both developed approaches outperform the “random”
baseline. Hill climbing provides better results than the CCA-based method for short
result lists (k < 50). The CCA-based approach outperforms hill climbing on Recall@50
and Recall@100.

      Metric          CCA-based      Hill-climbing    Random
      MRR@100           0.018            0.019         0.003
      Recall@5          0.018            0.021         0.002
      Recall@10         0.034            0.041         0.005
      Recall@50         0.136            0.125         0.026
      Recall@100        0.236            0.206         0.051

   We analyze the parameter settings that maximize the performance of the pairwise
similarity learning approach. We find that the selected configuration considers for
each image the 8 labels with the highest score (no label probability threshold has been
applied). This indicates that a detailed image description is crucial for the
text-image rematching task. Furthermore, we find that considering the title in addition
to the article snippet does not improve the performance. Moreover, our analysis shows
that the replacement of words with the first word of their synsets as well as the
removal of outliers and duplicates are not activated for the final evaluation.
   Analyzing the parameters used by CCA, we find that considering lemmatized article
texts and image labels does not yield the optimal results. Using Kernel PCA with an RBF
kernel, the R² score, which indicates how well the regression model fits the observed
data, was greatly increased for the train set (achieving values up to 99.8% with
k=100). The performance on the test set reached an R² score of 38.5% (k=100); thus this
method outperformed the hill-climbing optimized pairwise similarity learning approach,
which reached an R² score of 32.6%.

4    CONCLUSION
The evaluation shows that both approaches yield robust results for the image-text
rematching. The results slightly outperform the best results from MediaEval NewsImages
2020. The CCA-based approach reaches a Recall@100 score of 23.6% on the evaluation set;
the hill climbing approach based on pairwise similarity learning yields a Recall@100
score of 20.6%. Due to limited resources, we have tested only a restricted set of
parameter configurations; we think that further parameter optimization will improve
the performance. For our CCA model, we observe a performance difference between the
training and testing set, indicating the presence of overfitting. The overfitting could
be tackled by applying regularization or adding a dropout layer (eliminating features
with a low impact). Furthermore, a penalty-based component could be used for boosting
articles considering the margin-based error. The ranking of the top-k images for an
article could then be optimized significantly.
   Furthermore, an alternative image labeling component should be considered to obtain
a more detailed image description that could be matched with the article text. This is
based on our observation of better performance when considering more image labels.
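
   The metrics reported in Table 1 (Recall@k and MRR@100) can be derived from a
similarity matrix of the kind described in Sec. 2, where rows are articles, columns are
images, and the correct matches lie on the diagonal. The following is a minimal sketch
under stated assumptions, not the organizers' evaluation code: the function names
match_ranks, recall_at_k and mrr_at_k are hypothetical, ties are resolved in favour of
the correct image, and the toy matrix contains made-up scores.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def match_ranks(sim: Matrix) -> List[int]:
    """Return the 1-based rank of the correct image for every article.

    sim[i][j] is the similarity of article i and image j; the correct
    image of article i is assumed to sit on the diagonal (j == i).
    Ties are resolved in favour of the correct image (an assumption).
    """
    return [1 + sum(1 for s in row if s > row[i]) for i, row in enumerate(sim)]


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of articles whose correct image is among the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mrr_at_k(ranks: Sequence[int], k: int) -> float:
    """Mean reciprocal rank, counting ranks beyond k as zero."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)


# Toy 3x3 example with made-up similarity scores (not data from the paper).
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
]
ranks = match_ranks(toy)
print("ranks:", ranks)
print("Recall@1:", recall_at_k(ranks, 1))
print("MRR@100:", mrr_at_k(ranks, 100))
```

   Computing the ranks once and reusing them for every cut-off k keeps the metric
functions trivial; a different tie-breaking rule would only change match_ranks.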


REFERENCES
[1] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy:
    Industrial-strength Natural Language Processing in Python. (2020).
    https://doi.org/10.5281/zenodo.1212303
[2] Benjamin Kille, Andreas Lommatzsch, and Özlem Özgöbek. 2020. NewsImages: The role
    of images in online news. In Proceedings of the MediaEval Benchmarking Initiative
    for Multimedia Evaluation 2020. CEUR Workshop Proceedings.
    http://ceur-ws.org/Vol-2882/
[3] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien
    Dang-Nguyen. 2021. News Images in MediaEval 2021. In Proceedings of the MediaEval
    Benchmarking Initiative for Multimedia Evaluation 2021. CEUR Workshop Proceedings.
    http://ceur-ws.org/Vol-2882/
[4] Fouad Omran and Christoph Treude. 2017. Choosing an NLP Library for Analyzing
    Software Documentation: A Systematic Literature Review and a Series of Experiments.
    (05 2017). https://doi.org/10.1109/MSR.2017.42
[5] Dimitris Papadias. 2000. Hill Climbing Algorithms for Content-Based Retrieval of
    Similar Configurations. In Proceedings of the 23rd Annual International ACM SIGIR
    Conference on Research and Development in Information Retrieval. Association for
    Computing Machinery, 240–247. https://doi.org/10.1145/345508.345587
[6] Melanie Siegel and Francis Bond. 2021. OdeNet: Compiling a German Wordnet from
    other Resources. In Proceedings of the 11th Global Wordnet Conference (GWC 2021).
    192–198. https://www.aclweb.org/anthology/2021.gwc-1.22
[7] Google Translator. 2020. (2020). https://pypi.org/project/googletrans/
[8] Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, and Heng Tao Shen. 2020.
    Cross-Modal Attention With Semantic Consistence for Image–Text Matching. IEEE
    Transactions on Neural Networks and Learning Systems 31, 12 (2020), 5412–5425.
    https://doi.org/10.1109/TNNLS.2020.2967597
[9] Ying Zhang and Huchuan Lu. 2018. Deep Cross-Modal Projection Learning for
    Image-Text Matching. Springer International Publishing, Cham, 707–723.