=Paper=
{{Paper
|id=Vol-3181/paper45
|storemode=property
|title=Methods for Text-Image-Rematching using Pair-wise Similarity and Canonical
Similarity Analysis
|pdfUrl=https://ceur-ws.org/Vol-3181/paper45.pdf
|volume=Vol-3181
|authors=Kani Abdul,Kiran Kiran,Max Rudat,Alexandros Vasileiou,Andreas
Lommatzsch
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AbdulKRVL21
}}
==Methods for Text-Image-Rematching using Pair-wise Similarity and Canonical Similarity Analysis==
Kani Abdul, Kiran Kiran, Max Rudat, Alexandros Vasileiou, Andreas Lommatzsch

Technische Universität Berlin, Germany

{kani.abdul,k.kiran,rudat,a.vasileiou,andreas.lommatzsch}@campus.tu-berlin.de

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’21, 13-15 December 2021, Online

===Abstract===
Matching images to text plays an important role in cross-media retrieval, and research has proven this to be an underestimated challenge. This problem is addressed by the MediaEval 2021 NewsImages Challenge with the goal of gaining more insight into the real-world relationship between news articles and images. We develop models for re-establishing the connection of a news article to its corresponding image using datasets of a German news publisher (“task 1”). Our approaches follow the idea of pairwise similarity learning and are optimized by algorithmic hill climbing. Additionally, we employ Canonical Correlation Analysis as an approach using joint embedding learning. The evaluation shows that our approaches produce good results for the underlying image-text rematching task, yet require further optimization to yield stable prediction performance.

===1 Introduction===
Multimedia content accompanies our everyday life. News articles are one form of multimedia that is characterized by textual content accompanied by imagery. The assumption that a simple relationship underlies this connection has frequently turned out to be oversimplified in research [2]. The MediaEval 2021 NewsImages task aims at addressing this challenge by investigating the real-world relationship of news and images. The challenge provides a dataset consisting of three batches of training data and one batch for the evaluation. The performance of the participants’ algorithms is evaluated on a test batch. The evaluation metrics Recall@k and Mean Reciprocal Rank are used [3].

The challenge of image-text retrieval has been addressed broadly in research around multimedia analysis [8]. Deep image-text matching serves as one frequently used approach for this scenario. Zhang and Lu [9] classify the main approaches based on deep learning into two categories: pairwise similarity learning and joint embedding learning. For pairwise similarity learning, the main idea is to learn a similarity network for predicting the score of image-text pairs [8]. In the other category, joint embedding learning, a joint latent space is defined in which the vectors of texts and images can be compared directly. The learning methods typically used in this category are canonical correlation analysis (CCA) and bi-directional ranking loss [9].

Based on the existing methods, we develop two text-image matching strategies optimized for the specific requirements of the NewsImages task. In Sec. 2 we explain the preprocessing and the steps of our strategies in detail. Sec. 3 presents the evaluation results. Finally, the overall findings are discussed in Sec. 4.

===2 Approach===
We develop two approaches addressing the text-image rematching task. This section discusses the steps of our approaches.

'''Data Preprocessing.''' We preprocess the provided data for efficiently computing similarity scores. Firstly, we translate the image labels (computed by VGG-19 trained on ImageNet) from English to German using Google Translate [7]. We decided to translate the image labels instead of the article snippets due to the smaller volume to translate. In addition, we enhanced the dataset by extracting the item category (e.g. ‘koeln’, ‘panorama’, ‘wirtschaft’, ‘politik’) from the article URL. In the next step, we normalize the dataset by removing stop words, punctuation marks, spaces, special characters, and digits. After removing these tokens, we employ part-of-speech (POS) tagging to identify the nouns in the dataset. Finally, we perform morphological processing (“lemmatization”) for creating the dataset.

In addition to the standard preprocessing, we integrate Open-de-WordNet [6], enabling us to consider synonyms when computing the similarity score. Moreover, we implement an outlier detection and removal strategy based on the Z-score for the terms derived from the textual description. If a word’s z-score is larger than 3.0 (3 standard deviations away from the mean), then it is considered an outlier and is removed from the dataset. Adding these two derived fields to our dataset enables us to investigate whether additional preprocessing improves the performance.

'''Pairwise similarity learning & algorithmic hill climbing.''' Our first approach follows the idea of pairwise similarity learning. The objective is to compute a similarity score for each image-article pair to be used to construct 1-to-1 matches. We implement this using the similarity function of the natural language processing (NLP) library spaCy [4]. spaCy offers two methods to find the similarity between words: one based on context-sensitive tensors and another one based on word vectors [1]. We utilize the latter method and generate a similarity matrix containing the similarity scores for each image-text pair. Our pre-processed dataset gives different options for setting the input for computing the similarity scores. For example, we tested including only the words within the article text or only the words within the article title. Analogously for the images, we have data from 10 different labeler configurations, each generating at least 2 image labels with different label probabilities.

For computing the best parameter configuration, we make use of algorithmic hill climbing. Starting with an (arbitrary) initial configuration, the solution is incrementally adapted [5]. We implement this by first initializing the parameters (e.g. setting the label probability to 0.0, the number of considered labels per image to nine, and the remaining parameters to False). Then we go iteratively through the parameter space and compute for each parameter the (locally) optimal value. For the optimization of each parameter, we randomly select 1,000 samples.

The parameter configurations are evaluated using matrices containing pairwise similarity scores, with the columns representing the image IDs, the rows the article IDs, and the correct matches located along the diagonal of the matrix. The score is computed using the variables Row counter, Column counter, and Total counter. For each generated parameter configuration, the performance is evaluated as follows: if the similarity score for the diagonal value is higher than the other scores within its row and column, then the total counter is increased by 1. If this condition is not met, then it is checked whether the diagonal value is higher than the other scores within its row and, if so, the row counter is increased by 1. If both conditions are not met, the column counter is increased by 1. Once the values of all 3 counting variables are set, the performance is calculated by comparing the values of the row and column counters: if the row counter is larger than the column counter, then the row counter is divided by the number of pairs (n) and returned. Otherwise the column counter divided by the number of pairs (n) is returned as the final score and evaluated by our hill climbing algorithm.

'''Canonical Correlation Analysis.''' As a second approach, we apply the sklearn Canonical Correlation Analysis. On the preprocessed dataset, the spaCy implementation of Word2Vec is used for computing a vector representation (having 300 dimensions) of the dataset. We use a random sample of 1,500 data points, utilizing the article text and image labels columns. Then we split this set into a train set (2/3) and a validation set (1/3). Initial tests of CCA showed poor performance, which is why we adapted the method. We applied Kernel PCA (kPCA) to transform the data through a Radial Basis Function (RBF) expansion, limited to at most 700 generated data dimensions. The kPCA transformation is applied to both the train and test data to keep compatible and comparable dimensions. Then, we train a CCA instance on the train set and evaluate the model on both the train and validation set. The evaluation is performed based on the predicted vector for each of the article text vectors. Since the CCA-predicted vectors most likely do not correspond to actual word2vec image label transformations, we compare each CCA prediction to all of the word2vec image label vectors using the cosine similarity measure, thus constructing a similarity matrix. This allows us to create 1-to-1 mappings from article texts to image labels.

===3 Evaluation===
The evaluation (by the task organizers) shows that our pairwise similarity-based approach reaches Recall@100 = 0.21, while the CCA-based approach reaches Recall@100 = 0.24. Even though CCA outperforms the pairwise similarity-based approach in general, the recall score for k=5 and k=10 is higher with hill climbing, indicating that low-level semantics are found more effectively using the straightforward method of pairwise similarity learning. The better evaluation scores for higher k observed for CCA show that CCA performs better with regard to high-level semantic similarity.

{| class="wikitable"
|+ Table 1: Evaluation results. Both developed approaches outperform the baseline “random”. Hill climbing provides better results than the CCA-based method for short result lists (k < 50). The CCA-based approach outperforms hill climbing based on Recall@50 and Recall@100.
! Strategy !! CCA-based !! Hill-climbing !! Random
|-
| MRR@100 || 0.018 || 0.019 || 0.003
|-
| Recall@5 || 0.018 || 0.021 || 0.002
|-
| Recall@10 || 0.034 || 0.041 || 0.005
|-
| Recall@50 || 0.136 || 0.125 || 0.026
|-
| Recall@100 || 0.236 || 0.206 || 0.051
|}

We analyze the parameter settings for maximizing the performance of the pairwise similarity learning approach. We find that the used configuration considers for each image the 8 labels with the highest score (no label probability score has been applied). This indicates that a detailed image description is crucial for the text-image rematching task. Furthermore, we find that considering the title in addition to the article snippet does not improve the performance. Moreover, our analysis shows that the replacement of words with the first word of their synsets as well as the removal of outliers and duplicates are not activated for the final evaluation.

Analyzing the parameters used by CCA, we find that considering lemmatized article texts and image labels does not yield the optimal results. Using Kernel PCA with an RBF kernel, the R2 score, which indicates how well the regression model fits the observed data, was greatly increased for the train set (achieving values up to 99.8% with k=100). The performance on the test set reached an R2 score of 38.5% (k=100); thus this method outperformed the hill-climbing optimized pairwise similarity learning approach, which reached an R2 score of 32.6%.

===4 Conclusion===
The evaluation shows that both approaches yield robust results for the image-text rematching. The results slightly outperform the best results from MediaEval NewsImages 2020. The CCA-based approach reaches a Recall@100 score of 23.6% on the evaluation set; the hill climbing approach based on pairwise similarity learning yields a Recall@100 score of 20.6%. Due to limited resources, we have tested only a restricted set of parameter configurations; we think that further parameter optimization will improve the performance. For our CCA model, we observe a performance difference between the training and testing set, indicating the presence of overfitting. The overfitting could be tackled by applying regularization or adding a dropout layer (eliminating features with a low impact). Furthermore, a penalty-based component could be used for boosting articles considering the margin-based error. The ranking of the top-k images for an article could then be optimized significantly.

Furthermore, an alternative image labeling component should be considered to obtain a more detailed image description that could be matched with the article text. This is based on the observation that performance was better when more image labels were considered.

===References===
[1] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. (2020). https://doi.org/10.5281/zenodo.1212303

[2] Benjamin Kille, Andreas Lommatzsch, and Özlem Özgöbek. 2020. NewsImages: The role of images in online news. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2020. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2882/

[3] Benjamin Kille, Andreas Lommatzsch, Özlem Özgöbek, Mehdi Elahi, and Duc-Tien Dang-Nguyen. 2021. News Images in MediaEval 2021. In Proceedings of the MediaEval Benchmarking Initiative for Multimedia Evaluation 2021. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2882/

[4] Fouad Omran and Christoph Treude. 2017. Choosing an NLP Library for Analyzing Software Documentation: A Systematic Literature Review and a Series of Experiments. (05 2017). https://doi.org/10.1109/MSR.2017.42

[5] Dimitris Papadias. 2000. Hill Climbing Algorithms for Content-Based Retrieval of Similar Configurations. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 240–247. https://doi.org/10.1145/345508.345587

[6] Melanie Siegel and Francis Bond. 2021. OdeNet: Compiling a German Wordnet from other Resources. In Proceedings of the 11th Global Wordnet Conference (GWC 2021). 192–198. https://www.aclweb.org/anthology/2021.gwc-1.22

[7] Google Translator. 2020. (2020). https://pypi.org/project/googletrans/

[8] Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, and Heng Tao Shen. 2020. Cross-Modal Attention With Semantic Consistence for Image–Text Matching. IEEE Transactions on Neural Networks and Learning Systems 31, 12 (2020), 5412–5425. https://doi.org/10.1109/TNNLS.2020.2967597

[9] Ying Zhang and Huchuan Lu. 2018. Deep Cross-Modal Projection Learning for Image-Text Matching. Springer International Publishing, Cham, 707–723.
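
The Recall@k and Mean Reciprocal Rank metrics named in the paper can be illustrated with a minimal sketch. It assumes a square similarity matrix whose rows are articles and whose columns are images, with the correct match on the diagonal, as in the matrix layout described in Sec. 2. The function names and the tie-handling rule (ties count in favor of the correct image) are assumptions made for illustration; the official evaluation was carried out by the task organizers and may differ in detail.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def ranks_of_correct_matches(sim: Matrix) -> List[int]:
    """Return, for each article (row), the 1-based rank of its correct image.

    The correct image of row i is assumed to sit in column i (the diagonal).
    Assumption: only strictly larger scores push the correct image down,
    so ties are resolved in favor of the correct image.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        better = sum(1 for j, score in enumerate(row) if j != i and score > target)
        ranks.append(better + 1)
    return ranks


def recall_at_k(sim: Matrix, k: int) -> float:
    """Fraction of articles whose correct image is ranked within the top k."""
    ranks = ranks_of_correct_matches(sim)
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mrr_at_k(sim: Matrix, k: int) -> float:
    """Mean reciprocal rank, counting only correct images ranked within the top k."""
    ranks = ranks_of_correct_matches(sim)
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)


# Tiny made-up example: 3 articles x 3 images.
example = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.4, 0.6, 0.1],
]
print(recall_at_k(example, 1), recall_at_k(example, 3), mrr_at_k(example, 3))
```

In this toy matrix only the first article has its correct image ranked first, so Recall@1 is one third while Recall@3 is 1.0.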
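
The parameter search described in Sec. 2 (start from an initial configuration, then find a locally optimal value for one parameter at a time) can be sketched as a coordinate-wise hill climb. This is only an illustration under stated assumptions: the parameter names (`label_probability`, `n_labels`, `use_title`), the candidate values, and the `toy_score` function are hypothetical stand-ins. In the paper the score is derived from the pairwise similarity matrix computed on 1,000 randomly selected samples.

```python
from typing import Callable, Dict, Sequence, Tuple

Config = Dict[str, object]


def hill_climb(
    initial: Config,
    search_space: Dict[str, Sequence[object]],
    score: Callable[[Config], float],
    max_rounds: int = 10,
) -> Tuple[Config, float]:
    """Coordinate-wise hill climbing: vary one parameter at a time and keep
    a change only if it strictly improves the score."""
    best = dict(initial)
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for name, candidates in search_space.items():
            for value in candidates:
                if value == best[name]:
                    continue  # current value already evaluated
                trial = dict(best)
                trial[name] = value
                trial_score = score(trial)
                if trial_score > best_score:
                    best, best_score, improved = trial, trial_score, True
        if not improved:
            break  # local optimum: no single-parameter change helps
    return best, best_score


def toy_score(cfg: Config) -> float:
    """Hypothetical objective, NOT the paper's score: peaks at 8 labels,
    no title, and no label-probability threshold."""
    penalty = abs(cfg["n_labels"] - 8) + cfg["label_probability"]
    if cfg["use_title"]:
        penalty += 0.5
    return -float(penalty)


# Initial configuration mirroring the example in the text
# (label probability 0.0, nine labels, boolean switches off).
INITIAL = {"label_probability": 0.0, "n_labels": 9, "use_title": False}
SEARCH_SPACE = {
    "label_probability": [0.0, 0.25, 0.5],
    "n_labels": list(range(1, 11)),
    "use_title": [False, True],
}

best_config, best_value = hill_climb(INITIAL, SEARCH_SPACE, toy_score)
print(best_config, best_value)
```

Because each step accepts only strict improvements, the search stops at a local optimum, which matches the "(locally) optimal value" wording in the text. A different starting configuration may end at a different result.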