Retrieving Social Images using Relevance Filtering and Diverse Selection

Taruna Agrawal1, Rahul Gupta2, Shrikanth Narayanan2
1 Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, USA
2 Signal Analysis and Interpretation Lab (SAIL), University of Southern California, Los Angeles, USA
tagrawal@usc.edu, guptarah@usc.edu, shri@sipi.usc.edu

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
Retrieving relevant and diverse images from a large set of images is a problem of interest in social media. Given a set of images pertaining to a location or a concept, a subset of diverse images can summarize the attributes of the corresponding location/concept. In this work, we present a two-step image retrieval model involving relevance filtering followed by diverse selection. Based on the visual features, textual descriptions and Flickr rank, relevance filtering first determines a subset of images that correspond to a topic of interest. Subsequently, diverse selection determines a smaller subset of images to provide a diverse perspective of the concept. We obtain an F1 score of .509 on a test set containing 139 concepts, when computed over the top 20 images output by our system. We analyze the outcomes of our system and investigate the utility of image metadata (reviews, Flickr content) when combined with visual descriptors.

1. INTRODUCTION
“Deluge of information” is a term prevalent in present-day social media [1–4], often attributed to advances in technology and social connectivity. Compact representation of relevant information is a major challenge posed by the growth of social media. The Retrieving Diverse Social Images task at the MediaEval 2015 challenge [5] addresses this problem in the domain of images on social media such as Flickr. The goal is to design a query-based social image retrieval engine that returns relevant images while covering diverse aspects of the query, for instance, various sub-topics of the query. Potential information sources include image attributes as well as image metadata such as image description, view count and image rank on social media.

Various previous works [6, 7] have focused on knowledge-based methods for relevant image selection and/or clustering-based methods for diversification. The relevance selection is usually based on image attributes such as presence of people [8], image quality [7] and similarity to a standard source of images like Wikipedia [9]. In this work, we adopt a combination of supervised and unsupervised schemes for relevance filtering. After filtering out irrelevant images, we use clustering for diverse selection of images. Through our methods, we show the promise of using supervised learning methods in addition to existing knowledge-based methods in such retrieval tasks. In the next section, we describe our methodology in detail, followed by the results.

2. METHODOLOGY DESCRIPTION
Our system for retrieving diverse social images consists of two steps: (i) Relevance filtering, and (ii) Diverse selection. Relevance filtering removes images that have little or no relation to the concept of interest, and diverse selection provides a subset of images that are different from each other. We provide a detailed description of the two systems below.

2.1 Relevance filtering
We perform relevance filtering to filter out images unrelated to a concept. The 2015 MediaEval challenge data provides a set of visual and textual descriptors over 153 concepts for model development and 139 concepts for evaluation. Given the visual descriptors, textual information and Flickr metadata, we train several supervised and knowledge-based filtering schemes. We describe these models below.

2.1.1 Supervised methods
K-nearest neighbor classifier on visual descriptors: The 2015 MediaEval challenge data set provides a set of general purpose visual descriptors such as color, texture and feature information, along with a binary label indicating if an image is relevant/irrelevant to the concept under consideration [5]. We train a K-nearest neighbor (KNN) classifier on these visual descriptors using these labels. The features are z-normalized before training and K is tuned on the development set using 3-fold cross-validation.

Maximum entropy model on textual descriptors: The textual descriptors are extracted from sources such as the photo title, the description provided by the author and the photo tags on Flickr. We extract features from these sources using the following steps:

1. Feature standardization: This step is performed to train a universal model for all the concepts instead of concept-specific models. We replace any word related to a concept by a keyword. For instance, if the query is “The great wall of china”, words such as “great wall”, “wall of china” and “great wall china” occurring anywhere in the textual descriptions are replaced by a single keyword “Place of interest”. The list of words to be replaced is created based on the query title and contains various combinations of words in the query.
2. Feature selection: Given the set of standardized features, we retain the words within the top 10% of word frequencies. This step is performed to reduce the feature dimensionality while training the model.
3. Model training: Given the set of selected features, we train a maximum entropy model to predict the binary labels (relevant/irrelevant).

2.1.2 Unsupervised methods
Removal of images with people in focus: Relevant images do not have a person as the subject of focus. We incorporate this fact by using the facedetect software [10] to filter out images containing people as the main subjects.

Relevance filtering based on Flickr rank: As a final relevance filtering scheme, we remove images whose Flickr rank is above a certain threshold (>200). The motivation behind this scheme is that images low in the ranking are less likely to be associated with the concept in question.

2.2 Diverse selection
After obtaining the set of images based on relevance filtering, we use image clustering for diverse selection. Given a query size of K̂ images, we perform K̂-means clustering on the visual descriptors. We hypothesize that similar images fall into a single cluster and retain only one image per cluster. We select the image closest to the cluster centroid as the cluster representative.

In order to compute the selection score for each image, we use the output of the KNN classifier, the maxent model and the distance of the image from the cluster centroid. The score is given by an unweighted sum of the ratio of relevant images amongst the closest K images, the maxent output probability of the image being relevant and the inverse of the Euclidean distance of the image from the cluster centroid. The last term is added based on the assumption that images closer to centroids are more representative of the cluster. In the next section we present our results and discussion.

3. RESULTS
In run 1, we only use the relevance filtering model developed on visual descriptors (K-nearest neighbor classifier) and face detection. In run 3, we add filtering using the maximum entropy model on textual descriptors. Finally, run 5 uses all the relevance filtering schemes (visual, face detection, text and Flickr rank based). Note that in all three runs diverse selection is based on visual descriptors only. The evaluation metrics are cluster recall (CR) and precision (P) for the top X ranked images as predicted by the system. We show CR@X and P@X along with the corresponding F-score F1@X for X = 5, 10, 20, 30, 40, 50 in Figure 1. All these outcomes are based on clustering with K̂ set to 50. Also, in the 2015 challenge, separate metrics were reported for concepts which share images with other concepts (multi-concept) along with single-concept images. We report the official scores of CR, P and F1 @X = 20 for the multi- and single-concept images in Table 1.

Run #   Relevance filter    Single concept    Multi concept     All
                            F1/P/CR           F1/P/CR           F1/P/CR
1       Visual desc.        .492/.664/.408    .514/.700/.426    .504/.682/.417
3       + Textual desc.     .497/.677/.410    .517/.708/.426    .507/.692/.418
5       + Flickr rank       .512/.702/.421    .507/.708/.411    .509/.705/.416

Table 1: Results (F1 score/Precision/Cluster recall) for the proposed system @X = 20.

Figure 1: Results for the proposed system at different number of retrieved images (X).

From the results, we observe that for low values of X the combined system (visual + face detection + text + Flickr rank) marginally (although insignificantly) outperforms the system using only the visual cues. However, the performance degrades significantly at higher values of X. Note that this decrease in performance is not due to the additional filtering schemes performing poorly. Instead, it is due to the fact that additional filtering decreases the number of data points available for diverse selection. Therefore we had to reduce the number of clusters in diverse selection, sometimes to the extent that our model returned fewer than 50 images. However, better performance at lower X (e.g. X = 20 in Table 1) shows the promise of using additional modalities. In Table 1, we observe minor improvements in F1@20 after adding subsequent relevance filtering schemes. One interesting observation is that when Flickr ranks are used, F1 decreases for multiple concepts but increases for single concepts. This indicates that Flickr ranks are more reliable for single-concept images than for multiple-concept images. This factor can be taken into account in future system designs.

4. CONCLUSION
In this work, we present a two-stage system for social image retrieval. In the first stage, we perform relevance filtering to remove irrelevant images, and in the second stage we perform diverse selection using clustering in the visual descriptor space. Our relevance filtering system involves a combination of supervised and unsupervised methods. In the future, we can extend the work presented by exploring other methods (filtering, clustering) under a similar system development paradigm. We can also reformulate the problem as one of developing a diverse system, drawing inspiration from existing work [11, 12]. Finally, we would also like to use additional metadata such as Flickr user credibility [5, 13] and other image properties (CNN features) to further improve our system.

5. REFERENCES
[1] Holly M Bik and Miriam C Goldstein. An introduction to social media for scientists. 2013.
[2] Sophia B Liu. Trends in distributed curatorial technology to manage data deluge in a networked world. The European Journal for the Informatics Professional, 11(4):18–24, 2010.
[3] C Szongott, Benjamin Henne, G von Voigt, et al. Big data privacy issues in public social media. In Digital Ecosystems Technologies (DEST), 2012 6th IEEE International Conference on, pages 1–6. IEEE, 2012.
[4] Duc-Tien Dang-Nguyen, Luca Piras, Giorgio Giacinto, Giulia Boato, and F De Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. MediaEval Benchmarking Initiative for Multimedia Evaluation, 2014.
[5] Bogdan Ionescu, Alexandru L Gînsca, Bogdan Boteanu, Adrian Popescu, Mihai Lupu, and Henning Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[6] Bogdan Ionescu, Adrian Popescu, Mihai Lupu, Alexandru L Gînsca, and Henning Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation. In MediaEval 2014 Workshop, Barcelona, Spain, 2014.
[7] Tomas Brodsky. Relevant image detection in a camera, recorder, or video streaming device, April 4 2006. US Patent App. 11/397,780.
[8] Alexandru Lucian Ginsca, Adrian Popescu, and Navid Rekabsaz. CEA LIST's participation at the MediaEval 2014 retrieving diverse social images task. In Proceedings of the MediaEval Multimedia Benchmark Workshop, CEUR-WS.org, volume 1263, pages 1613–0073, 2014.
[9] Maia Zaharieva and Patrick Schwab. A unified framework for retrieving diverse social images. 2014.
[10] Robert Frischholz. The face detection homepage. https://facedetection.com/.
[11] Rahul Gupta, Kartik Audhkhasi, and Shrikanth Narayanan. A mixture of experts approach towards intelligibility classification of pathological speech. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 1986–1990. IEEE, 2015.
[12] Rahul Gupta, Kartik Audhkhasi, and Shrikanth Narayanan. Training ensemble of diverse classifiers on feature subsets. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2927–2931. IEEE, 2014.
[13] Bogdan Ionescu, Adrian Popescu, Mihai Lupu, Alexandru Lucian Gînsca, Bogdan Boteanu, and Henning Müller. Div150cred: A social image retrieval result diversification with user tagging credibility dataset. ACM Multimedia Systems-MMSys, Portland, Oregon, USA, 2015.
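
As an illustration of the K-nearest neighbor relevance filter in Section 2.1.1, the following is a minimal sketch, not the code used for the submitted runs. It z-normalizes the visual descriptors with statistics from the training images and returns, for each query image, the ratio of relevant images amongst its K closest training images (the quantity that also enters the selection score in Section 2.2). The function names `z_normalize` and `knn_relevance_ratio` are hypothetical, and the tuning of K by 3-fold cross-validation is left out.

```python
import math
from statistics import mean, pstdev


def z_normalize(train, rows):
    """Scale `rows` to zero mean / unit variance using statistics of `train`."""
    cols = list(zip(*train))
    mu = [mean(c) for c in cols]
    # Assumption: constant features get a unit scale to avoid division by zero.
    sigma = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in rows]


def knn_relevance_ratio(train_x, train_y, query_x, k):
    """Fraction of relevant (label 1) images among the k nearest training images."""
    train_n = z_normalize(train_x, train_x)
    query_n = z_normalize(train_x, query_x)
    ratios = []
    for q in query_n:
        # Rank training images by Euclidean distance in the normalized space.
        order = sorted(range(len(train_n)), key=lambda i: math.dist(q, train_n[i]))
        ratios.append(sum(train_y[i] for i in order[:k]) / k)
    return ratios
```

One possible decision rule, which is an assumption here rather than something stated in the paper, is to keep an image as relevant when its ratio exceeds 0.5.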
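
The feature standardization and feature selection steps for the textual descriptors (Section 2.1.1) can be sketched as follows. This is only one possible reading of those steps: the placeholder token `place_of_interest`, the stop-word list, the rule used to generate query variants (word n-grams with at least two content words) and the interpretation of "top 10%" as the most frequent 10% of the vocabulary are all assumptions made for illustration.

```python
import re
from collections import Counter

PLACEHOLDER = "place_of_interest"  # hypothetical single-token keyword
STOP_WORDS = {"the", "of", "a", "an", "in", "at"}  # assumed, not from the paper


def query_variants(query):
    """Word n-grams of the query (with and without stop words), longest first."""
    words = query.lower().split()
    content = [w for w in words if w not in STOP_WORDS]
    variants = set()
    for seq in (words, content):
        for i in range(len(seq)):
            for j in range(i + 2, len(seq) + 1):
                gram = seq[i:j]
                # Require two content words so that e.g. "of china" is skipped.
                if sum(w not in STOP_WORDS for w in gram) >= 2:
                    variants.add(" ".join(gram))
    return sorted(variants, key=len, reverse=True)


def standardize(text, query):
    """Replace every query variant in the text by the placeholder keyword."""
    text = text.lower()
    for variant in query_variants(query):
        text = re.sub(r"\b" + re.escape(variant) + r"\b", PLACEHOLDER, text)
    return text


def select_vocabulary(documents, keep_fraction=0.10):
    """Keep the most frequent `keep_fraction` of the distinct words."""
    counts = Counter(word for doc in documents for word in doc.split())
    n_keep = max(1, int(len(counts) * keep_fraction))
    return [word for word, _ in counts.most_common(n_keep)]
```

The retained vocabulary would then define the bag-of-words features on which the maximum entropy model is trained; that training step is not shown.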
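
The diverse selection step and the selection score of Section 2.2 can be sketched in the same spirit. The code below runs a plain K-means (Lloyd's algorithm), keeps the image closest to each centroid as the cluster representative, and scores it with the unweighted sum of the KNN relevance ratio, the maxent probability and the inverse Euclidean distance to the centroid. Ranking the representatives by this score, the small constant `eps` that guards against a zero distance, and the random initialization are assumptions of this sketch; the paper does not specify them.

```python
import math
import random


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        assignment = assign(points, centroids)
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def diverse_selection(features, knn_ratio, maxent_prob, k, eps=1e-6):
    """One representative per cluster, ranked by the unweighted score sum."""
    centroids, assignment = kmeans(features, k)
    best = {}  # cluster id -> (distance to centroid, image index)
    for idx, (x, c) in enumerate(zip(features, assignment)):
        d = math.dist(x, centroids[c])
        if c not in best or d < best[c][0]:
            best[c] = (d, idx)
    scored = [
        (knn_ratio[idx] + maxent_prob[idx] + 1.0 / (d + eps), idx)
        for d, idx in best.values()
    ]
    return [idx for _, idx in sorted(scored, reverse=True)]
```

Because only one image per cluster is returned, the number of retrieved images is bounded by the number of non-empty clusters, which is consistent with the observation in Section 3 that heavier filtering can leave fewer than 50 images.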