=Paper= {{Paper |id=Vol-3181/paper45 |storemode=property |title=Methods for Text-Image-Rematching using Pair-wise Similarity and Canonical Similarity Analysis |pdfUrl=https://ceur-ws.org/Vol-3181/paper45.pdf |volume=Vol-3181 |authors=Kani Abdul,Kiran Kiran,Max Rudat,Alexandros Vasileiou,Andreas Lommatzsch |dblpUrl=https://dblp.org/rec/conf/mediaeval/AbdulKRVL21 }} ==Methods for Text-Image-Rematching using Pair-wise Similarity and Canonical Similarity Analysis== https://ceur-ws.org/Vol-3181/paper45.pdf

Methods for Text-Image-Rematching using Pair-wise Similarity
and Canonical Similarity Analysis
Kani Abdul, Kiran Kiran, Max Rudat, Alexandros Vasileiou, Andreas Lommatzsch
Technische Universität Berlin, Germany
{kani.abdul,k.kiran,rudat,a.vasileiou,andreas.lommatzsch}@campus.tu-berlin.de

ABSTRACT
Matching images to text plays an important role in cross-media retrieval, and research has shown this to be an underestimated challenge. This problem is addressed by the MediaEval 2021 NewsImages Challenge with the goal of gaining more insight into the real-world relationship between news articles and images. We develop models for re-establishing the connection of a news article to its corresponding image using datasets of a German news publisher (“task 1”). Our approaches follow the idea of pairwise similarity learning and are optimized by algorithmic hill climbing. Additionally, we employ Canonical Correlation Analysis as an approach using joint embedding learning. The evaluation shows that our approaches produce good results for the underlying image-text rematching task, yet require further optimization to yield stable prediction performance.

1 INTRODUCTION
Multimedia content accompanies our everyday life. News articles are one form of multimedia, characterized by textual content accompanied by imagery. The assumption that a simple relationship underlies this connection has frequently turned out to be oversimplified in research [2]. The MediaEval 2021 NewsImages task aims at addressing this challenge by investigating the real-world relationship of news and images. The challenge provides a dataset consisting of three batches of training data and one batch for the evaluation. The performance of the participants’ algorithms is evaluated on a test batch. The evaluation metrics Recall@𝑘 and Mean Reciprocal Rank are used [3].

The challenge of image-text retrieval has been addressed broadly in research around multimedia analysis [8]. Deep image-text matching serves as one frequently used approach for this scenario. Zhang and Lu [9] classify the main approaches based on deep learning into two categories: pairwise similarity learning and joint embedding learning. For pairwise similarity learning, the main idea is to learn a similarity network for predicting the score of image-text pairs [8]. In joint embedding learning, a joint latent space is defined in which the vectors of texts and images can be compared directly. The learning methods typically used in this category are canonical correlation analysis (CCA) and bi-directional ranking loss [9].

Based on the existing methods, we develop two text-image matching strategies optimized for the specific requirements of the NewsImages task. In Sec. 2 we explain the preprocessing and the steps of our strategies in detail. Sec. 3 presents the evaluation results. Finally, the overall findings are discussed in Sec. 4.

2 APPROACH
We develop two approaches addressing the text-image rematching task. This section discusses the steps of our approaches.

Data Preprocessing. We preprocess the provided data for efficiently computing similarity scores. First, we translate the image labels (computed by VGG-19 trained on ImageNet) from English to German using Google Translate [7]. We decided to translate the image labels instead of the article snippets due to the smaller volume to translate. In addition, we enhance the dataset by extracting the item category (e.g. ‘koeln’, ‘panorama’, ‘wirtschaft’, ‘politik’) from the article URL. In the next step, we normalize the dataset by removing stop words, punctuation marks, spaces, special characters, and digits. After removing these tokens, we employ part-of-speech (POS) tagging to identify the nouns in the dataset. Finally, we perform morphological processing (“lemmatization”) to create the dataset.

In addition to the standard preprocessing, we integrate Open-de-WordNet [6], enabling us to consider synonyms when computing the similarity score. Moreover, we implement an outlier detection and removal strategy based on the z-score for the terms derived from the textual description. If a word’s z-score is larger than 3.0 (more than 3 standard deviations away from the mean), it is considered an outlier and removed from the dataset. Adding these two derived fields to our dataset enables us to investigate whether additional preprocessing improves the performance.

Pairwise similarity learning & algorithmic hill climbing. Our first approach follows the idea of pairwise similarity learning. The objective is to compute a similarity score for each image-article pair, to be used to construct 1-to-1 matches. We implement this using the similarity function of the natural language processing (NLP) library spaCy [4]. spaCy offers two methods to find the similarity between words: one based on context-sensitive tensors and another based on word vectors [1]. We utilize the latter method and generate a similarity matrix containing the similarity scores for each image-text pair. Our preprocessed dataset offers different options for setting the input for computing the similarity scores. For example, we tested including only the words within the article text or only the words within the article title. Analogously for the images, we have data from 10 different labeler configurations, each generating at least 2 image labels with different label probabilities.
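
The similarity matrix and the 1-to-1 matching described for the pairwise approach can be sketched as follows. This is a minimal illustration and not the authors' implementation: random toy vectors stand in for the spaCy word vectors, and the greedy assignment rule as well as the function names cosine_similarity_matrix and greedy_one_to_one are assumptions made for this example.

```python
import numpy as np


def cosine_similarity_matrix(article_vecs, image_vecs):
    """Return S with S[i, j] = cosine similarity of article i and image j."""
    a = article_vecs / np.linalg.norm(article_vecs, axis=1, keepdims=True)
    b = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    return a @ b.T


def greedy_one_to_one(sim):
    """Greedily pair articles and images, highest similarity first.

    The greedy rule is an illustrative assumption; the paper does not
    specify how the 1-to-1 matches are constructed from the scores.
    """
    sim = sim.astype(float).copy()
    matches = {}
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        matches[int(i)] = int(j)
        sim[i, :] = -np.inf  # article i is already matched
        sim[:, j] = -np.inf  # image j is already matched
    return matches


# Toy stand-ins for 300-dimensional text and label vectors (hypothetical data).
rng = np.random.default_rng(0)
image_vecs = rng.normal(size=(4, 300))
article_vecs = image_vecs + 0.1 * rng.normal(size=(4, 300))  # noisy copies

sim = cosine_similarity_matrix(article_vecs, image_vecs)
matches = greedy_one_to_one(sim)
print(matches)
```

Rows correspond to articles and columns to images, so a correct rematching places the largest scores on the diagonal. With real data, the toy vectors would be replaced by the vectors of the preprocessed article text and the translated image labels.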
For computing the best parameter configuration, we make use of algorithmic hill climbing. Starting with an (arbitrary) initial configuration, the solution is incrementally adapted [5]. We implement this by first initializing the parameters (e.g. setting labelprobability to 0.0, the number of considered labels per image to nine, and the remaining parameters to False). Then we iterate through the parameter space and compute for each parameter the (locally) optimal value. For the optimization of each parameter, we randomly select 1,000 samples.

The parameter configurations are evaluated using matrices containing pairwise similarity scores, with the columns representing the image IDs, the rows the article IDs, and the correct matches located along the diagonal of the matrix. The score is computed using the variables Row counter, Column counter, and Total counter. For each generated parameter configuration, the performance is evaluated as follows: if the similarity score for the diagonal value is higher than the other scores within its row and column, then the total counter is increased by 1. If this condition is not met, it is checked whether the diagonal value is higher than the other scores within its row and, if so, the row counter is increased by 1. If neither condition is met, the column counter is increased by 1. Once the values of all three counting variables are set, the performance is calculated by comparing the values of the row and column counters: if the row counter is larger than the column counter, then the row counter is divided by the number of pairs (n) and returned. Otherwise, the column counter divided by the number of pairs (n) is returned as the final score and evaluated by our hill climbing algorithm.

Canonical Correlation Analysis. As a second approach, we apply the sklearn Canonical Correlation Analysis. On the preprocessed dataset, the spaCy implementation of Word2Vec is used for computing a vector representation (having 300 dimensions) of the dataset. We use a random sample of 1,500 data points, utilizing the article text and image labels columns. Then we split this set into a train set (2/3) and a validation set (1/3). Initial tests of CCA showed poor performance, which is why we adapted the method. We applied Kernel PCA (kPCA) to transform the data through a Radial Basis Function (RBF) expansion, limited to at most 700 generated data dimensions. The kPCA transformation is applied to both the train and test data to keep compatible and comparable dimensions. Then, we train a CCA instance on the train data set and evaluate the model on both the train and validation set. The evaluation is performed based on the predicted vector for each of the article text vectors. Since the CCA-predicted vectors most likely do not correspond to actual word2vec image label transformations, we compare each CCA prediction to all of the word2vec image label vectors using the cosine similarity measure, thus constructing a similarity matrix. This allows us to make 1-to-1 mappings from article texts to image labels.

3 EVALUATION
The evaluation (by the task organizers) shows that our pairwise similarity-based approach reaches Recall@100 = 0.21, and the CCA-based approach reaches Recall@100 = 0.24. Even though CCA outperforms the pairwise similarity-based approach in general, the recall scores for k=5 and k=10 are higher with hill climbing, indicating that low-level semantics are found more effectively using the straightforward method of pairwise similarity learning. The better evaluation scores for higher 𝑘 observed for CCA show that CCA performs better with regard to high-level semantic similarity.

Table 1: The table shows the evaluation results. Both developed approaches outperform the baseline “random”. Hill climbing provides better results than the CCA-based method for short result lists (𝑘 < 50). The CCA-based approach outperforms Hill climbing based on Recall@50 and Recall@100.

Strategy      CCA-based   Hill-climbing   Random
MRR@100       0.018       0.019           0.003
Recall@5      0.018       0.021           0.002
Recall@10     0.034       0.041           0.005
Recall@50     0.136       0.125           0.026
Recall@100    0.236       0.206           0.051

We analyze the parameter settings for maximizing the performance for the pairwise similarity learning approach. We find that the used configuration considers for each image the 8 labels with the highest score (no label probability score has been applied). This indicates that a detailed image description is crucial for the text-image rematching task. Furthermore, we find that considering the title in addition to the article snippet does not improve the performance. Moreover, our analysis shows that the replacement of words with the first word of their synsets as well as the removal of outliers and duplicates are not activated for the final evaluation.

Analyzing the parameters used by CCA, we find that considering lemmatized article texts and image labels does not yield the optimal results. Using Kernel PCA with an RBF kernel, the R2 score, which indicates how well the regression model fits the observed data, was greatly increased for the train set (achieving values up to 99.8% with k=100). The performance on the test set reached an R2 score of 38.5% (k=100); thus this method outperformed the hill-climbing optimized pairwise similarity learning approach, which reached an R2 score of 32.6%.

4 CONCLUSION
The evaluation shows that both approaches yield robust results for the image-text rematching. The results slightly outperform the best results from MediaEval NewsImages 2020. The CCA-based approach reaches a Recall@100 score of 23.6% on the evaluation set; the hill climbing approach based on pairwise similarity learning yields a Recall@100 score of 20.6%. Due to limited resources, we have tested only a restricted set of parameter configurations; we think that further parameter optimization will improve the performance. For our CCA model, we observe a performance difference between training and testing set, indicating the presence of overfitting. The overfitting could be tackled by applying regularization or adding a dropout layer (eliminating features with a low impact). Furthermore, a penalty-based component could be used for boosting articles considering the margin-based error. The ranking of the top-k images for an article could then be optimized significantly.

Furthermore, an alternative image labeling component should be considered to obtain a more detailed image description that could be matched with the article text. This is based on the observation that performance improved when considering more image labels.

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, 13-15 December 2021, Online
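
For readers who want to relate the reported Recall@k and MRR scores to a similarity matrix, the following minimal sketch shows one way to compute them. It assumes, as in the description of the scoring matrices, that rows are articles, columns are images, and the correct image of each article lies on the diagonal. The function names and the tie handling (only strictly higher scores push the correct image down) are assumptions of this example; the official evaluation script of the task organizers may differ.

```python
import numpy as np


def rank_of_correct_image(sim):
    """1-based rank of the diagonal (correct) image within each article row."""
    diag = np.diag(sim)[:, None]
    # Count the images that score strictly higher than the correct one.
    return (sim > diag).sum(axis=1) + 1


def recall_at_k(sim, k):
    """Fraction of articles whose correct image is among the top k."""
    return float(np.mean(rank_of_correct_image(sim) <= k))


def mrr_at_k(sim, k):
    """Mean reciprocal rank; ranks beyond k contribute 0."""
    ranks = rank_of_correct_image(sim)
    reciprocal = np.where(ranks <= k, 1.0 / ranks, 0.0)
    return float(reciprocal.mean())


# Hypothetical 3x3 similarity matrix for illustration only.
sim = np.array([
    [0.9, 0.1, 0.3],  # correct image ranked 1st
    [0.8, 0.5, 0.2],  # correct image ranked 2nd
    [0.7, 0.6, 0.1],  # correct image ranked 3rd
])

print(recall_at_k(sim, 1), recall_at_k(sim, 2), mrr_at_k(sim, 3))
```

In this toy matrix one of three articles has its correct image at rank 1, so Recall@1 is 1/3, while Recall@3 is 1 because every correct image appears within the first three positions.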

REFERENCES
[1] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. (2020). https://doi.org/10.5281/zenodo.1212303
[2] Benjamin Kille, Andreas Lommatzsch, and Özlem Özgöbek. 2020. NewsImages: The role of images in online news. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2020. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2882/
[3] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2882/
[4] Fouad Omran and Christoph Treude. 2017. Choosing an NLP Library for Analyzing Software Documentation: A Systematic Literature Review and a Series of Experiments. (05 2017). https://doi.org/10.1109/MSR.2017.42
[5] Dimitris Papadias. 2000. Hill Climbing Algorithms for Content-Based Retrieval of Similar Configurations. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 240–247. https://doi.org/10.1145/345508.345587
[6] Melanie Siegel and Francis Bond. 2021. OdeNet: Compiling a German Wordnet from other Resources. In Proceedings of the 11th Global Wordnet Conference (GWC 2021). 192–198. https://www.aclweb.org/anthology/2021.gwc-1.22
[7] Google Translator. 2020. (2020). https://pypi.org/project/googletrans/
[8] Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, and Heng Tao Shen. 2020. Cross-Modal Attention With Semantic Consistence for Image–Text Matching. IEEE Transactions on Neural Networks and Learning Systems 31, 12 (2020), 5412–5425. https://doi.org/10.1109/TNNLS.2020.2967597
[9] Ying Zhang and Huchuan Lu. 2018. Deep Cross-Modal Projection Learning for Image-Text Matching. Springer International Publishing, Cham, 707–723.