=Paper=
{{Paper
|id=None
|storemode=property
|title=SocialSensor: Finding Diverse Images at MediaEval 2013
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_24.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/CorneyMGXPKAT13
}}
==SocialSensor: Finding Diverse Images at MediaEval 2013==
David Corney, Carlos Martin, Ayse Göker (IDEAS Research Institute, Robert Gordon University, Aberdeen; [d.p.a.corney|c.j.martin-dancausa|a.s.goker]@rgu.ac.uk)

Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Yiannis Kompatsiaris (Information Technologies Institute, CERTH, Thessaloniki, Greece; [espyromi|papadop|ikom]@iti.gr)

Luca Aiello, Bart Thomee (Yahoo! Research Barcelona, 08018 Barcelona, Spain; [alucca|bthomee]@yahoo-inc.com)

ABSTRACT

We describe the participation of the SocialSensor team in the Retrieving Diverse Social Images Task of MediaEval 2013. We submitted entries for all five runs after developing independent algorithms for visual features, text features and internet features (including local weather data). Our best CR@10 results came in the visual-only run, while the visual-text fusion run produced a slightly higher precision.

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

1. INTRODUCTION

The goal here is to produce a ranked list of images that are both relevant and diverse in response to a location-based query [3]. Throughout our work, we aimed to maximise the CR@10 score based on leave-one(-location)-out cross-validation results from the 50 devset locations. Below, we describe our methods for the five runs in turn before briefly summarising and discussing the results.

2. APPROACHES

2.1 Run 1: Visual-only features

For the visual-only run, each image is represented using optimized VLAD+SURF vectors. Compared to standard VLAD+SURF vectors [6], these vectors include multiple vocabulary aggregation (four visual vocabularies with k = 128 centroids each) and joint dimensionality reduction (to 1024 dimensions) with PCA and whitening [4].

Relevance & Diversity Method: Given a set of images I = {im_1, ..., im_N}, we developed an algorithm that selects a fixed-size set S ⊂ I that is (approximately) optimal with respect to both relevance to the query location and diversity within S. We define the utility U of a set of images S with respect to a query location l as

    U(S|l) = \sum_{im_i \in S} \left[ w \cdot R(im_i|l) + (1 - w) \cdot D(im_i|S) \right]

where R(im|l) is the relevance score of im given the location and D(im|S) is the diversity score of im within S. The same joint criterion, which we call Relevance & Diversity (RD), was used in [2]. However, we use definitions of R(im|l) and D(im|S) that are more suitable for this task. While relevance in [2] is defined using a similarity measure between each image and a given query image, we use the ground truth data to train a classifier whose prediction for an image is used as the relevance score, taking all relevant images as positive examples and all irrelevant images as negative examples. Diversity in [2] is defined as

    D(im_i|S, l) = \frac{1}{|S|} \sum_{im_j \in S,\, j \neq i} d(im_i, im_j)

where d(im_i, im_j) is a dissimilarity measure between im_i and im_j. We found that this definition is not ideal, because a single image im_j in S with a high similarity to im_i reduces the diversity of the whole set. Instead, we define diversity as

    D(im_i|S, l) = \min_{j \neq i} d(im_i, im_j)

i.e. the dissimilarity of im_i to the most similar image in S. As the dissimilarity measure we use the Euclidean distance between the VLAD vectors representing each image.

Optimization & Experiments: To find a set S that approximately optimizes U, we use the greedy optimization algorithm of [2]. This algorithm first adds to S the image with the highest relevance score and then sequentially adds the remaining image with the highest RD score. We experimented with several types of relevance classifiers in the RD method, using the Area Under the ROC Curve (AUC) for model selection via cross-validation. We then applied the greedy optimization algorithm with the best-performing classifier for several values of the weight w, and chose the parameters that gave the best CR@10 (≈ 0.56) on the devset to produce the test set predictions.
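As an illustration, here is a minimal Python sketch of this greedy RD optimization under the min-distance diversity definition above; the function name, parameter defaults and toy data are our own assumptions, not the authors' implementation.

<pre>
import numpy as np

def greedy_rd(features, relevance, size=50, w=0.5):
    """Greedy RD optimization: start with the most relevant image, then
    repeatedly add the candidate with the highest w*R + (1-w)*D score,
    where D is the Euclidean distance to its nearest selected image."""
    selected = [int(np.argmax(relevance))]
    candidates = set(range(len(features))) - set(selected)

    def rd_score(i):
        # Diversity is re-evaluated against the current S at each step,
        # matching the sequential nature of the greedy algorithm.
        dists = np.linalg.norm(features[selected] - features[i], axis=1)
        return w * relevance[i] + (1 - w) * dists.min()

    while candidates and len(selected) < size:
        best = max(candidates, key=rd_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage: 100 images as 1024-d VLAD-like vectors with classifier scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1024))
rel = rng.uniform(size=100)
print(greedy_rd(X, rel, size=10, w=0.5))
</pre>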
2.2 Run 2: Text-only features

To predict the relevance of an image, we built a forest of 100 random decision trees [1] using most of the textual descriptors available in the datasets. The descriptors used for classification were: the number of comments and views; the Flickr ranking; and the author name. We also derived features from the description, tags and title fields separately: the number of words in the field; the normalised sums of the tf-idf, social tf-idf and probabilistic values of each word (as provided by the organisers); the normalised sum of tf-idf values of each keyword, where each value is the tf-idf value of the word in the Wikipedia page of the corresponding location, using the remaining locations as the full corpus; and the average of the previous four values. We discretized the continuous variables; the Flickr ranking and author were already discrete.

Independently, we used hierarchical clustering to find 15 clusters for each location. Within each cluster, we ranked the images by the relevance predicted by the random forest. We then stepped through the clusters, iteratively selecting the most relevant remaining image, until (up to) 50 images had been selected (see the sketch at the end of this subsection).

We found some cases where groups of images had identical text features but different ground truth labels. These include casual holiday pictures where the Flickr user provided the same tags, descriptions etc. for a whole set of images, despite their diversity. Any deterministic text-only approach will fail to label all such images correctly.
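A minimal sketch of this cluster-based selection step, assuming a trained scikit-learn random forest and one row of text features per image; the choice of agglomerative (Ward) clustering and the cluster visiting order are our assumptions, not details fixed by the paper.

<pre>
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier

def select_by_cluster(text_features, forest, n_clusters=15, limit=50):
    """Visit clusters round-robin, each time taking the most relevant
    remaining image, until `limit` images have been selected."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(text_features)
    relevance = forest.predict_proba(text_features)[:, 1]  # P(relevant)
    # Each cluster holds its images sorted by descending predicted relevance.
    queues = [sorted(np.where(labels == c)[0], key=lambda i: -relevance[i])
              for c in range(n_clusters)]
    selected = []
    while len(selected) < limit and any(queues):
        for queue in queues:
            if queue and len(selected) < limit:
                selected.append(int(queue.pop(0)))
    return selected

# Toy usage with random features and binary relevance labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 30)), rng.integers(0, 2, size=200)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(select_by_cluster(X, forest)[:10])
</pre>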
2.3 Run 3: Visual-text fusion

In order to leverage both visual and textual information, we developed a simple late fusion scheme that combines the outputs of the visual and textual approaches described in the previous subsections. This is done by taking the union of the images returned for each location by the two approaches and ordering them by ascending average rank, i.e. the average of the ranks that they receive from each approach. Preliminary experiments indicated that early fusion (i.e. taking the individual features derived from each aspect of the data and combining them before making any decisions about relevance or diversity) was less effective.
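A minimal sketch of this late fusion by average rank. How images returned by only one approach are ranked is not specified in the text, so treating them as ranked just beyond the end of the other list is our assumption.

<pre>
def late_fusion(visual_ranking, textual_ranking):
    """Merge two ranked lists of image ids by ascending average rank."""
    def ranks(ordering, universe):
        # 1-based ranks; images absent from this list fall just past its end.
        pos = {im: r for r, im in enumerate(ordering, start=1)}
        return {im: pos.get(im, len(ordering) + 1) for im in universe}

    universe = set(visual_ranking) | set(textual_ranking)
    rv = ranks(visual_ranking, universe)
    rt = ranks(textual_ranking, universe)
    # Sort by average rank; break ties by id for a deterministic order.
    return sorted(universe, key=lambda im: ((rv[im] + rt[im]) / 2, im))

# Toy usage: image ids as ranked by the two independent approaches.
print(late_fusion(["a", "b", "c", "d"], ["c", "a", "e"]))
# -> ['a', 'c', 'b', 'd', 'e']
</pre>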
2.4 Run 4: Human-machine hybrid approach

We developed a very simple approach to combining human and computer responses, in an attempt to make use of people's natural visual processing abilities and their ability to make rapid judgements from incomplete data. The test set comprised a total of 38,300 images from 346 locations. Obtaining any form of human response requires either a large number of people (e.g. through crowd-sourcing) or a substantial reduction in the number of images. We chose the latter, presenting participants with computer-generated short-lists of images and asking them to improve them. Specifically, we used the text-only methods (Section 2.2) to list the top 15 relevant and diverse images. The human participant then had to select five of these 15 as being either poor-quality images or images that (nearly) duplicate any of the remaining set. Participants were not expected to be familiar with any of the locations, nor did they consult other sources. The final submission for each location consisted of the 10 remaining images, followed by the 5 "rejected" images. Two participants carried out the annotation on a total of 46 locations, around 12% of the total test set.

2.5 Run 5: Device and local weather data

Multimedia objects captured with modern cameras and smartphones are labeled with Exif metadata generated directly by the device at the time the photo or video is taken. For this task, among all the available data we consider: i) the date and time the photo was taken, generally reliable at the granularity of one day; ii) the f-stop (aperture size) and the exposure time (shutter speed), which can be combined as EV = f-stop² · exposure, used previously to differentiate indoor from outdoor pictures [5]; and iii) the geo-location of the device when the photo was taken, from which we compute the angle and distance to the photographed landmark. We also query a public database of historical weather data (www.ncdc.noaa.gov) for the day the picture was taken, which indicates the main weather conditions (e.g. sun, fog, rain, snow, haze, thunderstorm, tornado).

We combine all these data sources to get pictures that are diverse in terms of distance from the landmark, angle of the shot, weather conditions and time of day. We input the features to the k-means algorithm (k = 10). Inside each cluster, when multiple candidate photos are available, we select the photo with the highest number of Flickr favourites. We verified that including the number of favourites as an additional k-means feature is beneficial for the selection of diverse images.

3. RESULTS AND DISCUSSION

Table 1 summarises the results when returning the top 10 images per location, compared against both the expert and the crowd-sourced ground truth. Our strongest results came from the visual features (run 1); a slight improvement in precision came when these were combined with text features (run 3). Our results are close for all five runs, despite the variety of features and algorithms used. This could indicate that the inherent signal/noise ratio of the data is a limiting factor, although further algorithmic development and optimisation could also improve matters. Future work includes the use of concept detection algorithms to improve diversity by explicitly including images matching different concepts (e.g. exterior, detail, night-time).

Table 1: Results on the test set for the top 10 results, using the expert and crowd-sourced ground truth sets.

          |        Expert          |     Crowd-sourced
  Method  |  P@10   CR@10   F1@10  |  P@10   CR@10   F1@10
  Run 1   |  0.733  0.429   0.521  |  0.729  0.764   0.723
  Run 2   |  0.732  0.390   0.491  |  0.702  0.760   0.691
  Run 3   |  0.785  0.405   0.510  |  0.800  0.763   0.753
  Run 4   |  0.750  0.408   0.508  |  0.725  0.738   0.698
  Run 5   |  0.733  0.406   0.504  |  0.702  0.696   0.672

4. ACKNOWLEDGEMENTS

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.

5. REFERENCES

[1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In ACM CIVR '09, New York, NY, USA, 2009. ACM.
[3] B. Ionescu, M. Menéndez, H. Müller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[4] H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In ECCV, 2012.
[5] B. N. Lee, W.-Y. Chen, and E. Y. Chang. A scalable service for photo annotation, sharing, and search. In ACM MULTIMEDIA '06, pages 699–702, Santa Barbara, CA, USA, 2006. ACM.
[6] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. An empirical study on the combination of SURF features with VLAD vectors for image search. In WIAMIS, 2012.