UMONS @ MediaEval 2017: Diverse Social Images Retrieval

Omar Seddati, Nada Ben Lhachemi, Stéphane Dupont, Saïd Mahmoudi
University of Mons, Belgium
{omar.seddati,nada.ben-lhachemi,stephane.dupont,said.mahmoudi}@umons.ac.be

ABSTRACT
This paper presents the results achieved during our participation in the MediaEval 2017 Retrieving Diverse Social Images Task. The proposed unsupervised multimodal approach exploits visual and textual information in a fashion that prioritizes both relevance and diversification. As features, we use a modified version of the RMAC (Regional Maximum Activation of Convolutions) descriptor for visual information and word2vec-based weighted averaging for textual information. In order to provide an adaptive unsupervised solution, we combine these features with the DBSCAN (density-based spatial clustering of applications with noise) clustering algorithm. Our system achieved promising results, reaching an F1@20 of 0.6554.

1 INTRODUCTION
Over the past decades, available image collections have grown steadily thanks to the easily accessible capture devices that we now use on a daily basis. These huge multimedia collections have motivated researchers to look for efficient approaches to image retrieval. However, most approaches in this field primarily aim at improving the relevance of the results, commonly neglecting the diversity aspect. The goal of the Retrieving Diverse Social Images Task [14] is to encourage researchers to propose new solutions that offer a good relevance-diversity balance. Participants are provided with several queries and, for each query, up to 300 results retrieved with the Flickr search engine. Each participating system is expected to return a list of up to 50 ranked images per query that are both relevant and diversified. In addition to the images and the Flickr ranking, several metadata are provided, such as username, credibility, etc. Both visual information and metadata have been exploited in several ways by participants in previous editions of the task [2, 11, 13]. The most commonly used text-based features are Term Frequency-Inverse Document Frequency (TF-IDF) [9], Latent Dirichlet Allocation (LDA) [1], and word embeddings such as word2vec [12]. For visual information, the most commonly used features are based on Convolutional Neural Networks (CNNs). Several clustering algorithms have been explored, such as k-means [13], X-means [13], and agglomerative hierarchical clustering (AHC) [11]. In our work, we use word2vec-based weighted averages as text-based features, an improvement of the RMAC descriptor [10] built on CNN features for visual information, and DBSCAN [4] as the clustering algorithm.

2 APPROACH
In this work, we combine visual and/or textual descriptors with the DBSCAN algorithm at two different stages. In the first stage, we re-rank the provided list of results in order to remove some irrelevant images, while in the second stage we aim to improve diversity.

Our visual features are based on the work of Tolias et al. [10], who discarded the fully connected layers of a pre-trained CNN (VGG16) and used the resulting fully convolutional CNN for feature extraction. Assume we have an input image I of size (W_I × H_I); the output feature maps (FMs) form a 3D tensor of shape C × W × H, where C is the number of channels and (W, H) are the width and height of the FMs. If we represent this 3D tensor as a set of 2D feature maps X = {X_c}, c = 1...C, we can compute the MAC (Maximum Activations of Convolutions) descriptor as

    f = [f_1 ... f_c ... f_C],  with  f_c = max_{x ∈ X_c} x    (1)

In order to compute the RMAC descriptor, Tolias et al. proposed a simple approach that samples R = {R_i}, a set of square regions within X, and computes the MAC of each region. The sum-aggregation of the resulting vectors after l2-normalization yields the RMAC descriptor (for more details, please refer to the original paper [10]). In [5], Gordo et al. proposed two simple modifications that bring significant improvements to the RMAC representation: 1) using ResNet101 instead of VGG16; 2) feeding three resolutions of the input image to the network, computing and l2-normalizing the RMAC descriptor of each resolution separately, then summing the three vectors and l2-normalizing the result. In this work, we use ResNet50 [6] and the publicly available Torch toolbox [3] to extract the multi-resolution RMAC descriptor. However, instead of computing the RMAC descriptor separately for each resolution, we rescale the output feature maps of the three resolutions to the same (highest) resolution and sum them. We then compute the RMAC descriptor with sum-aggregation followed by l2-normalization (more information on this approach can be found in [7]). The RMAC descriptor has the advantage of preserving the aspect ratio of the inputs and of encoding spatial information efficiently, while keeping the size of the descriptor independent of the input resolution (it depends only on the number of channels of the layer selected for feature extraction, which can be used as a parameter of the method).
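To make the descriptor computation concrete, the following PyTorch sketch illustrates the MAC of Eq. (1), a simplified RMAC, and our multi-resolution fusion. It is only an illustration, not the Torch toolbox [3] implementation used in our experiments: the region-sampling grid is simplified and the PCA-whitening step of [10] is omitted.

    import torch
    import torch.nn.functional as F

    def mac(fmaps: torch.Tensor) -> torch.Tensor:
        """MAC of Eq. (1): per-channel maximum over a C x W x H activation tensor."""
        return fmaps.flatten(1).max(dim=1).values

    def rmac(fmaps: torch.Tensor, levels=(1, 2, 3)) -> torch.Tensor:
        """Simplified RMAC: sample square regions at a few scales, take the MAC of
        each region, l2-normalize, and sum-aggregate (PCA-whitening omitted)."""
        C, W, H = fmaps.shape
        desc = fmaps.new_zeros(C)
        for level in levels:
            side = max(1, round(2 * min(W, H) / (level + 1)))  # region size at this scale
            xs = torch.linspace(0, W - side, level).round().long().tolist()
            ys = torch.linspace(0, H - side, level).round().long().tolist()
            for x in xs:
                for y in ys:
                    region = fmaps[:, x:x + side, y:y + side]
                    desc = desc + F.normalize(mac(region), dim=0)
        return F.normalize(desc, dim=0)

    def multires_rmac(fmaps_per_scale) -> torch.Tensor:
        """Our variant: upsample the feature maps of every input scale to the largest
        spatial size, sum them, and compute a single RMAC on the fused maps."""
        target = (max(fm.shape[1] for fm in fmaps_per_scale),
                  max(fm.shape[2] for fm in fmaps_per_scale))
        fused = sum(F.interpolate(fm.unsqueeze(0), size=target, mode="bilinear",
                                  align_corners=False).squeeze(0)
                    for fm in fmaps_per_scale)
        return rmac(fused)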
3 EXPERIMENTAL RESULTS
In this section, we detail the five runs submitted by our team. Then, we briefly present the results obtained with the proposed approach on the development and test sets.

Run 1: In the first run, only visual features are allowed. Since the query is textual, we use the initial Flickr ranking and assume that the first three results (top 3) are relevant and can be used to generate a visual representation of the query. In order to re-rank the initial list, we extract the RMAC features from each image using three different input scales (where S is the largest side of the input and S ∈ {550, 800, 1050}). Then, we perform a first clustering with the DBSCAN algorithm and proceed as follows:
1. For each cluster, we find the feature vector closest to the cluster's center (V_cl).
2. We select the n clusters that contain the top-3 images (n ≤ 3).
3. We compute the distance between each of the n cluster centers and the remaining r clusters; for each of these r clusters, we keep as representative distance the minimal distance to one of the n clusters and use it to re-rank the list of results.
4. We remove the clusters at the bottom of the re-ranked list, while making sure to keep enough clusters to retain at least 150 images.

This first stage enables us to remove some irrelevant images. In the second stage, we run another DBSCAN clustering and sort the resulting clusters by the initial Flickr rank of their centroids. Then, we select one image per cluster until we obtain the required number of result images; if the last cluster is reached, we start again from the beginning. Finally, we group the images that belong to the same cluster and present the results in cluster order (based on the rank of the centroids).

Note: In order to use the DBSCAN algorithm properly, we must carefully define the maximum radius ϵ. In our case, for each query, we compute a vector with n elements, where n is the number of available results and each element e_i, i ∈ {1, ..., n}, is the minimal distance between image i and any other image. We then use the median of this vector as ϵ, one as the minimum number of points, and the Manhattan distance as the metric.
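As a minimal sketch of this adaptive parameter choice and of the second-stage selection, the snippet below uses scikit-learn and SciPy as stand-ins for the toolbox actually used in our experiments; it derives ϵ from the median nearest-neighbour (Manhattan) distance and then picks one image per cluster in round-robin fashion. The cluster ordering here approximates the centroid rank by the Flickr rank of each cluster's best-ranked member, and the final regrouping of the selected images by cluster is omitted.

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import DBSCAN

    def adaptive_dbscan(features):
        """Adaptive DBSCAN: eps is the median nearest-neighbour (Manhattan) distance
        over the available results, with min_samples = 1."""
        dists = cdist(features, features, metric="cityblock")
        np.fill_diagonal(dists, np.inf)            # ignore self-distances
        eps = float(np.median(dists.min(axis=1)))  # median nearest-neighbour distance
        return DBSCAN(eps=eps, min_samples=1, metric="cityblock").fit_predict(features)

    def diversify(labels, flickr_rank, k=50):
        """Second-stage selection: order clusters by the Flickr rank of their
        best-ranked member (a proxy for the centroid rank), then pick one image
        per cluster in round-robin fashion until k images are selected."""
        clusters = {}
        for idx in np.argsort(flickr_rank):        # best Flickr rank first
            clusters.setdefault(labels[idx], []).append(idx)
        ordered = list(clusters.values())          # dict keeps first-seen cluster order
        selection, round_idx = [], 0
        while len(selection) < k and any(round_idx < len(c) for c in ordered):
            for members in ordered:
                if round_idx < len(members) and len(selection) < k:
                    selection.append(members[round_idx])
            round_idx += 1
        return selection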
Run 2: The second run uses the provided word2vec semantic vectors (dimensionality 300) for English terms, trained on Wikipedia. Unlike TF-IDF or LDA, word2vec vectors do not rely on raw word co-occurrence counts; they have the advantage of capturing various kinds of similarity between words, both syntactic and semantic. In order to select the textual information to use, we examined the devset queries and noticed that tags are syntactically and semantically more significant than the other textual fields (e.g., title and description). For each image, we compute the weighted average vector representation of its tags (as described in [8]). Then, we cluster with the DBSCAN algorithm and sort the clusters by the distance between the query representation and the representation of the centroid of each cluster. Finally, we re-rank the images following the same approach as in the last step of Run 1.

Run 3 & 4: In the third run, we concatenate the RMAC feature vector of Run 1 with the textual feature vector of Run 2 and follow the same steps as in Run 1. In addition, right after the second clustering, we group the images uploaded by the same user and make sure that, when picking images for the final ranking, we choose images from the different user groups of a given cluster. In Run 4, we follow the same steps as in Run 3, but we use only the RMAC descriptor as the feature vector together with the username-grouping technique.

Run 5: In the fifth run, we first remove stop words from the queries. Then, we use each query to retrieve 10 images with the Google Images engine. We extract the RMAC features from these images and use them as a visual representation of the query, as in Run 1. Next, we follow the same steps to re-rank the Flickr list. Since the Google Images results match the queries better, we can expect better visual representations, which allows us to exploit the RMAC descriptor more effectively. In addition, as in Runs 3 & 4, we use the username-grouping approach to further improve diversity.

Note: In order to retrieve enough results from Google Images and to enhance diversity, we expand the query as follows: assuming a query with five words w_1, w_2, w_3, w_4, w_5, we use w_1 + w_2 + w_3 + w_4 + w_5 + w_1_w_2_w_3_w_4_w_5 for image crawling. For example, if the query is "animal at zoo", the query used for Google Images (after stop-word removal) is animal + zoo + animal_zoo.
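A minimal sketch of this query expansion (the stop-word list below is purely illustrative):

    STOP_WORDS = {"a", "an", "at", "in", "of", "on", "the"}

    def build_crawl_query(query):
        """Join the remaining keywords with ' + ' and append all of them fused
        into a single underscore-separated term."""
        words = [w for w in query.lower().split() if w not in STOP_WORDS]
        return " + ".join(words + ["_".join(words)])

    print(build_crawl_query("animal at zoo"))  # -> animal + zoo + animal_zoo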
All results are reported in Table 1. As we can see, the approach based on visual features (Run 1) gives better results than the one based on textual features (Run 2). This confirms that the assumption made about the visual representation of the query (using the RMAC descriptors of the top-3 images) is reasonable. The comparison of Runs 3 and 4 shows that using the tags (with the proposed approach) does not bring significant improvements over the simple combination of visual features with username grouping. Finally, using images retrieved with the Google Images engine (Run 5) significantly outperforms Run 3 (visual + textual). This result makes us reflect on the effectiveness of the proposed text-based approach: in order to come close to the results of Run 5, we need a better solution for text analysis, since no image query is available. Our future developments will therefore mainly focus on exploiting different approaches to improve image retrieval based on metadata.

Table 1: Results on the development and test sets (P: precision, CR: cluster recall, F1: their harmonic mean, all at a cutoff of 20 images).

                  Devset                       Testset
    Run      P@20     CR@20    F1@20      P@20     CR@20    F1@20
    Run 1    0.6327   0.409    0.4722     0.6780   0.5599   0.5789
    Run 2    0.5595   0.4148   0.4581     0.5702   0.5834   0.5521
    Run 3    0.6359   0.4222   0.4827     0.6643   0.5780   0.5886
    Run 4    0.6373   0.4196   0.4825     0.6690   0.5649   0.5809
    Run 5    0.7386   0.4467   0.5253     0.8071   0.5856   0.6554

4 CONCLUSION
In this paper, we presented a detailed description of the approach we proposed to address the task of retrieving diverse social images. The approach achieves promising results and shows the potential of automatic techniques for improving both precision and diversity. The comparison of the different runs shows that, contrary to what we expected, textual information is outperformed by visual information. This observation raises questions about the proposed text-based approach and the quality of the provided metadata. We plan to investigate these questions in more detail and to bring new solutions in future work.

REFERENCES
[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[2] Bogdan Boteanu, Ionut Mironica, and Bogdan Ionescu. 2016. LAPI @ 2015 Retrieving Diverse Social Images Task: A Pseudo-Relevance Feedback Diversification Perspective. In MediaEval.
[3] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
[4] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and others. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96. 226–231.
[5] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. 2016. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision (2016), 1–18.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[7] Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, and Mahnaaz Pariyaan. 2017. Towards Good Practices for Image Retrieval Based on CNN Features. In International Conference on Computer Vision Workshops (ICCVW).
[8] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of ICLR 2017.
[9] Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21.
[10] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2015. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015).
[11] Sabrina Tollari. 2016. UPMC at MediaEval 2016 Retrieving Diverse Social Images Task. In MediaEval 2016 Workshop.
[12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[13] Maia Zaharieva. 2016. An Adaptive Clustering Approach for the Diversification of Image Retrieval Results. In MediaEval.
[14] Maia Zaharieva, Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L.T. Santos, and Henning Müller. 2017. Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13-15, 2017.