<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieving Social Images using Relevance Filtering and Diverse Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taruna Agrawal</string-name>
          <email>tagrawal@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahul Gupta</string-name>
          <email>guptarah@usc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrikanth Narayanan</string-name>
          <email>shri@sipi.usc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ming Hsieh Department of Electrical Engineering, University of Southern California</institution>
          ,
          <addr-line>Los Angeles</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Signal Analysis and Interpretation Lab (SAIL), University of Southern California</institution>
          ,
          <addr-line>Los Angeles</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>Retrieving relevant and diverse images from a large set of images is a problem of interest in social media. Given a set of images pertaining to a location or a concept, a subset of diverse images can summarize the attributes of the corresponding location/concept. In this work, we present a two-step image retrieval model involving relevance filtering followed by diverse selection. Based on the visual features, textual descriptions and Flickr rank, relevance filtering initially determines a subset of images that correspond to a topic of interest. Subsequently, diverse selection determines a smaller subset of images to provide a diverse perspective of the concept. We obtain an F1 score of .509 on a test set containing 139 concepts, computed over the top 20 images output by our system. We analyze the outcomes of our system and investigate the utility of image metadata (reviews, Flickr content) when combined with visual descriptors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        "Deluge of information" is a term prevalent in present-day
social media [1-4], often attributed to advances in
technology and social connectivity. Compact representation of
relevant information is a major challenge posed by the growth
of social media. The Retrieving Diverse Social Images task at
the MediaEval 2015 challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] addresses this problem in the
domain of images on social media such as Flickr. The goal is to
design a query-based social image retrieval engine, focusing
on obtaining relevant images while covering diverse aspects
of the query, for instance, various sub-topics of the query.
Potential information sources include image attributes as
well as image metadata such as image description, view
count and image rank on social media.
      </p>
      <p>
        Various previous works [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] have focused on
knowledge-based methods for relevant image selection and/or
clustering-based methods for diversification. The relevance
selection is usually based on image attributes such as the
presence of people [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], image quality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and similarity to a
standard source of images like Wikipedia [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this work, we
adopt a combination of supervised and unsupervised schemes
for relevance filtering, followed by clustering for diverse
selection. Through our methods, we show the promise of using
supervised learning methods in addition to existing
knowledge-based methods in such retrieval tasks. In the next
section, we describe our methodology in detail, followed by
the results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY DESCRIPTION</title>
      <p>Our system for retrieving diverse social images consists of
two steps: (i) relevance filtering, and (ii) diverse selection.
Relevance filtering removes images that have little or no
relation to the concept of interest, and diverse selection
provides a subset of images that differ from each other. We
provide a detailed description of the two steps below.</p>
    </sec>
    <sec id="sec-3">
      <title>Relevance filtering</title>
      <p>We perform relevance filtering to remove images
unrelated to a concept. The 2015 MediaEval challenge data
provides a set of visual and textual descriptors over 153
concepts for model development and 139 concepts for
evaluation. Given the visual descriptors, textual information and
Flickr metadata, we train several supervised and knowledge-based
filtering schemes. We describe these models below.</p>
      <sec id="sec-3-1">
        <title>Supervised methods</title>
        <p>
          K-nearest neighbor classi er on visual descriptors:
The 2015 MediaEval challenge data set provides a set of
general purpose visual descriptors such as color, texture and
feature information along with a binary label indicating if
an image is relevant/irrelevant to the concept under
consideration [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We train a K-nearest neighbor (KNN) classifier
on these visual descriptors using these labels. The features
are z-normalized before training, and K is tuned on the
development set using 3-fold cross-validation.
        </p>
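        <p>The KNN step above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact implementation: the choice of k and the majority-vote threshold are assumptions.</p>

```python
import numpy as np

def knn_relevance(train_X, train_y, test_X, k=5):
    """Classify each test image as relevant (1) or irrelevant (0) by
    majority vote among its k nearest z-normalized training descriptors."""
    # z-normalize using the training-set statistics
    mu = train_X.mean(axis=0)
    sigma = train_X.std(axis=0) + 1e-9            # guard against zero variance
    train_Z = (train_X - mu) / sigma
    test_Z = (test_X - mu) / sigma
    preds = []
    for z in test_Z:
        dists = np.linalg.norm(train_Z - z, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]               # indices of k closest
        frac = train_y[nearest].mean()                # fraction of relevant neighbours
        preds.append(int(frac + 0.5))                 # majority vote (ties count as relevant)
    return np.array(preds)
```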
        <p>Maximum entropy model on textual descriptors: The
textual descriptors are extracted from sources such as the photo
title, the description provided by the author, and the photo tags
on Flickr. We extract features from these sources using the
following steps:
1. Feature standardization: This step is performed to train
a universal model for all the concepts instead of concept-specific
models. We replace any word related to a concept by
a keyword. For instance, if the query is "The great wall of
china", words such as "great wall", "wall of china" and "great
wall china" occurring anywhere in the textual descriptions are
replaced by a single keyword, "Place of interest". The list of
words to be replaced is created based on the query title and
contains various combinations of the words in the query.
2. Feature selection: Given the set of standardized features,
we retain the words within the top 10% of word frequencies.
This step is performed to reduce the feature dimensionality
while training the model.
3. Model training: Given the set of selected features, we
train a maximum entropy model to predict the binary
labels (relevant/irrelevant).</p>
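        <p>The three steps above can be sketched as follows. This is a simplified, hypothetical reconstruction: the phrase list, the keyword string, and the plain gradient-descent trainer for the maximum entropy (logistic regression) model are our assumptions, not the exact pipeline.</p>

```python
import re
from collections import Counter
import numpy as np

def standardize(text, query_phrases, keyword="placeofinterest"):
    """Step 1: replace any query-related phrase by a single shared keyword."""
    for phrase in sorted(query_phrases, key=len, reverse=True):
        text = re.sub(re.escape(phrase), keyword, text, flags=re.IGNORECASE)
    return text

def select_vocab(docs, keep_frac=0.10):
    """Step 2: keep only the top fraction of words by corpus frequency."""
    counts = Counter(word for doc in docs for word in doc.split())
    n_keep = max(1, int(len(counts) * keep_frac))
    return [word for word, _ in counts.most_common(n_keep)]

def train_maxent(X, y, lr=0.5, steps=500):
    """Step 3: a binary maximum entropy model is logistic regression;
    here it is fitted with plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))   # predicted relevance probability
        w += lr * X.T.dot(y - p) / len(y)     # gradient ascent on the log-likelihood
    return w
```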
      </sec>
      <sec id="sec-3-2">
        <title>Unsupervised methods</title>
        <p>
          Removal of images with people in focus: Relevant
images typically do not have a person as the subject of focus. We
incorporate this observation by using the facedetect software [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to
filter out images containing people as the main subjects.
Relevance filtering based on Flickr rank: As a final
relevance filtering scheme, we remove images above a certain
threshold (&gt;200) on Flickr rank. The motivation behind this
scheme is that poorly ranked images are less likely to be
associated with the concept in question.
        </p>
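        <p>A minimal sketch of the two unsupervised filters, assuming each image carries a face-detection flag and a Flickr rank in its metadata (the dictionary key names here are our invention):</p>

```python
import operator

def unsupervised_filter(images, rank_threshold=200):
    """Keep images that neither have a person as the main subject
    nor fall above the Flickr rank threshold."""
    kept = []
    for img in images:
        # flag assumed to come from the facedetect output
        person_in_focus = img.get("face_is_main_subject", False)
        # keep only images at or below the rank threshold
        rank_ok = operator.le(img["flickr_rank"], rank_threshold)
        if rank_ok and not person_in_focus:
            kept.append(img)
    return kept
```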
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Diverse selection</title>
      <p>After obtaining the set of images based on relevance
filtering, we use image clustering for diverse selection. Given
a query size of K^ images, we perform K^-means clustering on
the visual descriptors. We hypothesize that similar images
fall into a single cluster and retain only one image per cluster.
We select the image closest to the cluster centroid as the
cluster representative.</p>
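        <p>The clustering step can be sketched with a small K-means loop (a simplified stand-in for whatever clustering implementation was actually used): partition the descriptors into K^ groups and keep, per cluster, the image closest to the centroid.</p>

```python
import numpy as np

def diverse_select(X, k_hat, iters=20, seed=0):
    """Return indices of one representative image per K-means cluster."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen descriptors
    centers = X[rng.choice(len(X), size=k_hat, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k_hat):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    chosen = []
    for c in range(k_hat):
        members = np.where(labels == c)[0]
        if len(members):
            # representative: the image closest to the cluster centroid
            d = np.linalg.norm(X[members] - centers[c], axis=1)
            chosen.append(int(members[d.argmin()]))
    return sorted(chosen)
```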
      <p>In order to compute the selection score for each image, we
use the output of the KNN classifier, the maxent model, and the
distance of the image from its cluster centroid. The score is given by
an unweighted sum of the ratio of relevant images amongst the
closest K images, the maxent output probability of the image
being relevant, and the inverse of the Euclidean distance of the image
from its cluster centroid. The last term is added based on the
assumption that images closer to centroids are more
representative of the cluster. In the next section we present our
results and discussion.</p>
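        <p>The unweighted selection score can be written directly from the description above; the small epsilon guarding against zero distance is our addition:</p>

```python
import numpy as np

def selection_score(knn_relevant_fraction, maxent_prob, descriptor, centroid):
    """Unweighted sum of: fraction of relevant images among the closest K,
    the maxent relevance probability, and the inverse Euclidean distance
    of the image descriptor from its cluster centroid."""
    dist = np.linalg.norm(np.asarray(descriptor) - np.asarray(centroid))
    inv_dist = 1.0 / (dist + 1e-9)   # epsilon avoids division by zero
    return knn_relevant_fraction + maxent_prob + inv_dist
```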
    </sec>
    <sec id="sec-5">
      <title>RESULTS</title>
      <p>In run 1, we only use the relevance filtering model
developed on the visual descriptors (K-nearest neighbor classifier)
and face detection. In run 2, we add filtering using the
maximum entropy model on the textual descriptors. Finally, run 5
uses all the relevance filtering schemes (visual, face
detection, text and Flickr rank based). Note that in all three
runs diverse selection is based on visual descriptors only.
The evaluation metrics are cluster recall (CR) and precision
(P) for the top X ranked images as predicted by the system.
We show CR@X and P@X along with the corresponding
F-score F1@X for X = 5, 10, 20, 30, 40, 50 in Figure 1. All
these outcomes are based on clustering with K^ set to 50. Also,
in the 2015 challenge, separate metrics were reported for
concepts which share images with other concepts
(multi-concept) along with single-concept images. We report the
official CR, P and F1 scores @X=20 for the multi- and
single-concept images in Table 1.</p>
      <p>From the results, we observe that for low values of X the
combined system (visual + face detection + text + Flickr
rank) marginally (although insignificantly) outperforms the
system using only the visual cues. However, the performance
degrades significantly at higher values of X. Note that this
decrease in performance is not due to the additional filtering
schemes performing poorly. Instead, it is due to the fact
that additional filtering reduces the number of data points
available for diverse selection. Therefore we had to reduce
the number of clusters in diverse selection, sometimes to the
extent that our model returned fewer than 50 images. However,
the better performance at lower X (e.g. X = 20 in Table 1)
shows the promise of using additional modalities. In Table 1,
we observe minor improvements in F1@20 after adding each
subsequent relevance filtering scheme. One interesting
observation is that when using Flickr ranks, F1 decreases for
multi-concept images, whereas it increases for single-concept
images. This indicates that Flickr ranks are more reliable in
the case of single-concept images than multi-concept images.
This factor can be considered in future system designs.</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>
        In this work, we present a two-stage system for social
image retrieval. In the first stage, we perform relevance
filtering to remove irrelevant images, and in the second stage
we perform diverse selection using clustering in the visual
descriptor space. Our relevance filtering system involves a
combination of supervised and unsupervised methods. In
the future, we can extend the work presented here by
exploring other methods (filtering, clustering) under a similar
system development paradigm. We can also reformulate the
problem as diverse system development, drawing inspiration
from several existing works [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. Finally,
we would also like to use additional metadata such as Flickr user
credibility [
        <xref ref-type="bibr" rid="ref13 ref5">5, 13</xref>
        ] and other image properties (CNN features) to
further improve our system.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Holly M</given-names>
            <surname>Bik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Miriam C</given-names>
            <surname>Goldstein</surname>
          </string-name>
          .
          <article-title>An introduction to social media for scientists</article-title>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sophia B</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Trends in distributed curatorial technology to manage data deluge in a networked world</article-title>
          .
          <source>The European Journal for the Informatics Professional</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C</given-names>
            <surname>Szongott</surname>
          </string-name>
          ,
          <string-name>
            <surname>Benjamin Henne</surname>
          </string-name>
          , G von Voigt, et al.
          <article-title>Big data privacy issues in public social media</article-title>
          .
          <source>In Digital Ecosystems Technologies (DEST)</source>
          ,
          <year>2012</year>
          6th IEEE International Conference on, pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Duc-Tien</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          , Luca Piras, Giorgio Giacinto, Giulia Boato, and
          <string-name>
            <given-names>F De</given-names>
            <surname>Natale</surname>
          </string-name>
          .
          <article-title>Retrieval of diverse images by pre-filtering and hierarchical clustering</article-title>
          .
          <source>MediaEval Benchmarking Initiative for Multimedia Evaluation</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Alexandru Lucian Gînscă, Bogdan Boteanu, Adrian Popescu, Mihai Lupu, and Henning Müller.
          <article-title>Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          . In MediaEval 2015 Workshop, Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Adrian Popescu, Mihai Lupu, Alexandru Lucian Gînscă, and Henning Müller.
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation</article-title>
          . In MediaEval 2014 Workshop, Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Brodsky</surname>
          </string-name>
          .
          <article-title>Relevant image detection in a camera, recorder, or video streaming device</article-title>
          , April 4, 2006. US Patent App. 11/397,780.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alexandru Lucian</given-names>
            <surname>Gînscă</surname>
          </string-name>
          , Adrian Popescu, and
          <string-name>
            <given-names>Navid</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          .
          <article-title>CEA LIST's participation at the MediaEval 2014 Retrieving Diverse Social Images task</article-title>
          .
          <source>In Proceedings of the MediaEval Multimedia Benchmark Workshop</source>
          , CEUR-WS.org, volume
          <volume>1263</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Maia</given-names>
            <surname>Zaharieva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Schwab</surname>
          </string-name>
          .
          <article-title>A unified framework for retrieving diverse social images</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Frischholz</surname>
          </string-name>
          .
          <article-title>The face detection homepage</article-title>
          . https://facedetection.com/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Rahul</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Kartik Audhkhasi, and
          <string-name>
            <given-names>Shrikanth</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>A mixture of experts approach towards intelligibility classification of pathological speech</article-title>
          .
          <source>In Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2015</year>
          IEEE International Conference on, pages
          <fpage>1986</fpage>
          -
          <lpage>1990</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Rahul</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Kartik Audhkhasi, and
          <string-name>
            <given-names>Shrikanth</given-names>
            <surname>Narayanan</surname>
          </string-name>
          .
          <article-title>Training ensemble of diverse classifiers on feature subsets</article-title>
          .
          <source>In Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2014</year>
          IEEE International Conference on, pages
          <fpage>2927</fpage>
          -
          <lpage>2931</lpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Adrian Popescu, Mihai Lupu, Alexandru Lucian Gînscă, Bogdan Boteanu, and Henning Müller.
          <article-title>Div150Cred: A social image retrieval result diversification with user tagging credibility dataset</article-title>
          .
          <source>ACM Multimedia Systems-MMSys</source>
          , Portland, Oregon, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>