=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_1
|storemode=property
|title=Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_1.pdf
|volume=Vol-1984
|authors=Maia Zaharieva,Bogdan Ionescu,Alexandru Lucian Gînscă,Rodrygo L.T. Santos,Henning Müller
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ZaharievaIGSM17
}}
==Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation==
Maia Zaharieva1, Bogdan Ionescu2, Alexandru Lucian Gînscă3,
Rodrygo L.T. Santos4, Henning Müller5
1 TU Wien, Austria
2 University Politehnica of Bucharest, Romania
3 CEA LIST, France
4 Universidade Federal de Minas Gerais, Brazil
5 University of Applied Sciences Western Switzerland, Switzerland
maia.zaharieva@tuwien.ac.at, bionescu@imag.pub.ro, alexandru.ginsca@cea.fr,
rodrygo@dcc.ufmg.br, henning.mueller@hevs.ch
ABSTRACT
This paper provides an overview of the Retrieving Diverse Social Images task, organized as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation. The task addresses the challenge of visual diversification of image retrieval results, where images, metadata, user tagging profiles, and content and text models are available for processing. We present the task challenges, the employed dataset and ground truth information, the required runs, and the considered evaluation metrics.

Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

1 INTRODUCTION
An efficient image retrieval system should present results that are both relevant to the provided query and that cover its different visual aspects. Diversifying image search results can considerably increase the probability that a system addresses a broad range of user information needs. In general, diversification is an actively researched problem in various domains, ranging from web search and query result diversification [9, 15, 19] to recommender systems [16, 17] and summarization [13, 14]. With the growing amount of publicly available images, the importance of diversifying image data is steadily increasing. The task is especially challenging for real-world queries, which are often complex and consist of multiple concepts.
The 2017 Retrieving Diverse Social Images task is a follow-up of the 2016 edition [8] and fosters the development of new techniques for improving both the relevance and the visual diversification of image search results. The task is designed to support the evaluation and comparison of approaches from a wide range of research fields, such as information retrieval (text, vision, and multimedia communities), machine learning, relevance feedback, and natural language processing.

2 TASK DESCRIPTION
The task is built around the use case of a general ad-hoc image retrieval system that provides the user with visually diversified representations of query results (see, for instance, Google Image Search, https://images.google.com/). Given a ranked list of up to 300 query-related images retrieved from Flickr (https://www.flickr.com) using text-based queries, participants are required to refine the results by providing a set of images that are relevant to the query and, at the same time, represent a visually diversified summary of it. The queries include complex, general-purpose, multi-concept queries (e.g., "dancing on the street", "trees reflected in water", "sailing boat"). The queries in the development set result from a broad user study and are constructed around the data of the MediaEval 2016 Retrieving Diverse Social Images task [8]. The queries in the test set were collected using Google Trends (http://trends.google.com/) for image search (worldwide, last 5 years: 2012-2017).
The goal of the task is to refine the image set retrieved for a given text-based query by providing a ranked list of up to 50 photos that are both relevant and visually diversified representations of the query, according to the following definitions:
Relevance: an image is considered relevant for the query if it is a common visual representation of the query topics (all at once). Bad-quality photos (e.g., severely blurred, out of focus) are not considered relevant in this scenario;
Diversity: a set of images is considered diverse if it depicts different visual characteristics of the query topics and subtopics with a certain degree of complementarity, i.e., most of the perceived visual information differs from one image to another.

3 DATA DESCRIPTION
The data consists of a development set (devset) with 110 queries (32,487 images) and a test set (testset) with 84 queries (24,986 images). An additional dataset (credibilityset) provides credibility estimations for ca. 685 users and metadata for more than 3.5M images. We also provide semantic vectors for general English terms computed on top of the English Wikipedia (https://en.wikipedia.org/), referred to as wikiset, which could help participants to develop advanced text models.
Each query is accompanied by the following information: the query text formulation (the actual query used on Flickr to retrieve the data), a ranked list of up to 300 images in JPEG format retrieved from Flickr using Flickr’s default "relevance" algorithm (all images are redistributable under Creative Commons licenses, http://creativecommons.org/), an
XML file containing Flickr metadata for the retrieved images, and ground truth for both relevance and diversity.
To facilitate participation from various communities, we also provide the following content-based descriptors:
- general-purpose, visual descriptors extracted using the LIRE library (http://www.lire-project.net/) [10]: auto color correlogram (ACC) [12], color and edge directivity descriptor (CEDD) [5], fuzzy color and texture histogram (FCTH) [6], Gabor texture, joint composite descriptor (JCD) [4], several MPEG-7 features including color layout, edge histogram, and scalable color [11], pyramid of histograms of orientation gradients (PHOG) [2], and speeded-up robust features (SURF) [1];
- convolutional neural network (CNN)-based descriptors, extracted from the last fully connected layer (fc7) of the reference model provided with the Caffe framework (http://caffe.berkeleyvision.org/);
- text-based features, including term frequency and document frequency information and their ratio (TF-IDF), computed on a per-image, per-query, and per-user basis;
- user annotation credibility descriptors, which provide an estimation of the quality of the users’ tag-image content relationships. The following descriptors are provided: visualScore (a measure of user image relevance), faceProportion (the percentage of a user’s images that contain faces), tagSpecificity (average specificity of a user’s tags, where tag specificity is the percentage of users having annotated with that tag in a large Flickr corpus), locationSimilarity (average similarity between a user’s geotagged photos and a probabilistic model of a surrounding cell), photoCount (total number of images a user shared), uniqueTags (proportion of unique tags), uploadFrequency (average time between two consecutive uploads), bulkProportion (the proportion of bulk taggings in a user’s stream, i.e., of tag sets that appear identical for at least two distinct photos), meanPhotoViews (mean number of times a user’s images have been seen by other members of the community), meanTitleWordCounts (mean number of words in the titles associated with a user’s photos), meanTagsPerPhoto (mean number of tags a user puts on their images), meanTagRank (mean rank of a user’s tags in a list sorted in descending order by the number of appearances in a large subsample of Flickr images), and meanImageTagClarity (an adaptation of the Image Tag Clarity from [18] using a TF-IDF language model as the individual tag language model).

4 GROUND TRUTH
Both relevance and diversity annotations were carried out by 17 human annotators. The data were distributed among the annotators such that each query was labeled by three different annotators. For relevance, annotators were asked to label each image (one at a time) as relevant to the underlying query (value 1), non-relevant (0), or "don’t know" (−1). The final relevance ground truth was determined using a majority voting scheme. For diversity, only the images judged relevant in the previous step were considered. For each query, annotators were provided with a thumbnail list of all relevant images. After getting familiar with their contents, they were asked to re-group the images into clusters with similar visual appearance (up to 25 clusters in total). In contrast to the single relevance score for each query, for diversity we consider all three annotations as correct (ground truth), as they typically depict different possibilities to group the images, representing different points of view.

5 RUN DESCRIPTION
Participants were allowed to submit up to five runs. The first three are required (dedicated) runs: run1 – an automated run using visual information only; run2 – an automated run using text information only; and run3 – an automated run using both visual and text information. For the generation of run1 to run3, only information that can be extracted from the provided data (e.g., the provided descriptors, descriptors of the participants’ own, etc.) may be used. The last two runs, run4 and run5, are general ones, i.e., any approach is allowed, e.g., human-based or hybrid human-machine approaches, including the use of data from external sources, such as the Internet or pre-trained models obtained from external datasets related to this task.

6 EVALUATION
Performance is assessed for both diversity and relevance using cluster recall at X (CR@X), precision at X (P@X), and their harmonic mean F1@X. CR@X is the ratio of the number of clusters from the ground truth that are represented in the top X results; thus, it reflects the diversification quality of a given image result set. We compute CR@X for each of the available ground truth diversity annotations and, for each query, select the one that maximizes CR@X. Since the clusters in the ground truth contain relevant images only, the relevance of the top X results is implicitly measured by CR@X. Nevertheless, P@X provides a more precise view of the relevance of a particular image set, since it directly measures the relevance among the top X images. We consider various cut-off points, i.e., X = {5, 10, 20, 30, 40, 50}. Additionally, we consider two further evaluation metrics that are well established in the information retrieval community: the intent-aware expected reciprocal rank (ERR-IA@X) [3] and the α-normalized discounted cumulative gain (α-nDCG@X) [7]. The official ranking metric is F1@20, which gives equal importance to diversity (via CR@20) and relevance (via P@20). This metric simulates the content of a single page of a typical Web image search engine and reflects user behavior, i.e., inspecting the first page of results with priority.

7 CONCLUSION
The 2017 Retrieving Diverse Social Images task provides participants with a comparative and collaborative evaluation benchmark for social image retrieval approaches focusing on visual-based diversification. The task explores diversification in the context of a challenging, ad-hoc image retrieval system that should be able to tackle complex and general-purpose multi-concept queries. This year, we explicitly accounted for the possibility of multiple different views on a given retrieval result, which might all be subjectively correct. This allows for an investigation of the aspect of subjectivity in the perception of diversification in a next step. Details on the methods and results of the participating teams can be found in the working note papers of the MediaEval 2017 workshop proceedings.
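To make the required-run setup concrete, the following is a minimal sketch of the kind of fully automated, visual-only approach permitted for run1: a greedy reranking that trades an estimated relevance score against visual novelty. All names here (the diversify function, its feature and relevance inputs) are illustrative assumptions; the task does not prescribe any particular method.

```python
import numpy as np

def diversify(candidates, features, relevance, k=50, alpha=0.7):
    """Greedy reranking: trade off query relevance against visual novelty.

    candidates: image ids in the initial Flickr ranking.
    features:   dict id -> L2-normalized visual descriptor (e.g., CNN fc7),
                so that a dot product acts as cosine similarity.
    relevance:  dict id -> estimated relevance score in [0, 1].
    Returns a ranked list of up to k images (the required output size is 50).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(img):
            # Novelty = distance to the closest already-selected image.
            if not selected:
                novelty = 1.0
            else:
                sims = [float(np.dot(features[img], features[s])) for s in selected]
                novelty = 1.0 - max(sims)
            return alpha * relevance[img] + (1.0 - alpha) * novelty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The alpha parameter balances relevance against diversity; tuning it on the devset against CR@20 and P@20 would be the natural next step.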
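The evaluation measures described in Section 6 (P@X, CR@X, and their harmonic mean F1@X) can be sketched per query as follows; averaging over queries is omitted for brevity, and the function names are illustrative rather than the official evaluation tool.

```python
def precision_at(ranked, relevant, x):
    """P@X: fraction of the top-X results that are relevant."""
    return sum(1 for img in ranked[:x] if img in relevant) / x

def cluster_recall_at(ranked, annotations, x):
    """CR@X: fraction of ground-truth clusters covered by the top-X results.

    annotations: list of {image id -> cluster id} mappings (one per
    annotator, relevant images only); per the task protocol, the
    annotation that maximizes CR@X is selected for each query.
    """
    best = 0.0
    for clusters in annotations:
        covered = {clusters[img] for img in ranked[:x] if img in clusters}
        best = max(best, len(covered) / len(set(clusters.values())))
    return best

def f1_at(ranked, relevant, annotations, x=20):
    """Harmonic mean of P@X and CR@X; x=20 is the official ranking metric."""
    p = precision_at(ranked, relevant, x)
    cr = cluster_recall_at(ranked, annotations, x)
    return 2 * p * cr / (p + cr) if p + cr > 0 else 0.0
```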
REFERENCES
[1] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-Up
Robust Features (SURF). Computer Vision and Image Understanding 110, 3 (2008),
346–359. https://doi.org/10.1016/j.cviu.2007.09.014
[2] Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Representing Shape
with a Spatial Pyramid Kernel. In ACM International Conference on Image and
Video Retrieval (CIVR). ACM, New York, NY, USA, 401–408. https://doi.org/10.
1145/1282280.1282340
[3] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected
Reciprocal Rank for Graded Relevance. In ACM Conference on Information and
Knowledge Management (CIKM). ACM, New York, NY, USA, 621–630. https:
//doi.org/10.1145/1645953.1646033
[4] Savvas A. Chatzichristofis, Yiannis S. Boutalis, and Mathias Lux. 2009. Selection of
the proper compact composite descriptor for improving content based image
retrieval. In Signal Processing, Pattern Recognition and Applications (SPPRA). ACTA
Press, 134–140.
[5] Savvas A. Chatzichristofis and Yiannis S. Boutalis. 2008. CEDD: Color and Edge
Directivity Descriptor: A Compact Descriptor for Image Indexing and Retrieval.
In International Conference on Computer Vision Systems (ICVS). Springer-Verlag,
Berlin, Heidelberg, 312–322. https://doi.org/10.1007/978-3-540-79547-6_30
[6] S. A. Chatzichristofis and Y. S. Boutalis. 2008. FCTH: Fuzzy Color and Texture
Histogram - A Low Level Feature for Accurate Image Retrieval. In Int. Workshop
on Image Analysis for Multimedia Interactive Services (WIAMIS). IEEE Computer
Society, Washington, DC, USA, 191–196. https://doi.org/10.1109/WIAMIS.2008.24
[7] Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova,
Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity
in Information Retrieval Evaluation. In International ACM SIGIR Conference on
Research and Development in Information Retrieval. ACM, New York, NY, USA,
659–666. https://doi.org/10.1145/1390334.1390446
[8] Bogdan Ionescu, Alexandru Lucian Ginsca, Maia Zaharieva, Bogdan Boteanu,
Mihai Lupu, and Henning Müller. 2016. Retrieving Diverse Social Images at
MediaEval 2016: Challenge, Dataset and Evaluation. In MediaEval 2016 Multime-
dia Benchmark Workshop, Vol. 1739. CEUR-WS.org.
[9] Bogdan Ionescu, Adrian Popescu, Anca-Livia Radu, and Henning Müller. 2016.
Result diversification in social image retrieval: a benchmarking framework.
Multimedia Tools and Applications 75, 2 (2016), 1301–1331. https://doi.org/10.
1007/s11042-014-2369-4
[10] Mathias Lux. 2011. Content Based Image Retrieval with LIRe. In ACM In-
ternational Conference on Multimedia. ACM, New York, NY, USA, 735–738.
https://doi.org/10.1145/2072298.2072432
[11] B.S. Manjunath, J.-R. Ohm, V.V. Vasudevan, and A. Yamada. 2001. Color and
texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology
11, 6 (2001), 703–715. https://doi.org/10.1109/76.927424
[12] Mandar Mitra, Ramin Zabih, Jing Huang, Wei-Jing Zhu, and S. Ravi Kumar. 1997.
Image Indexing Using Color Correlograms. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). IEEE Computer Society, Washington, DC,
USA, 762–768. https://doi.org/10.1109/CVPR.1997.609412
[13] Yanwei Pang, Qiang Hao, Yuan Yuan, Tanji Hu, Rui Cai, and Lei Zhang. 2011.
Summarizing Tourist Destinations by Mining User-generated Travelogues and
Photos. Computer Vision and Image Understanding 115, 3 (2011), 352–363. https:
//doi.org/10.1016/j.cviu.2010.10.010
[14] S. Rudinac, A. Hanjalic, and M. Larson. 2013. Generating Visual Summaries of
Geographic Areas Using Community-Contributed Images. IEEE Transactions on
Multimedia 15, 4 (2013), 921–932. https://doi.org/10.1109/TMM.2013.2237896
[15] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search result
diversification. Foundations and Trends in Information Retrieval 9, 1 (2015), 1–90.
https://doi.org/10.1561/1500000040
[16] Markus Schedl and David Hauger. 2015. Tailoring Music Recommendations to
Users by Considering Diversity, Mainstreaminess, and Novelty. In International
ACM SIGIR Conference on Research and Development in Information Retrieval.
ACM, New York, NY, USA, 947–950. https://doi.org/10.1145/2766462.2767763
[17] Yue Shi, Xiaoxue Zhao, Jun Wang, Martha Larson, and Alan Hanjalic. 2012. Adap-
tive Diversification of Recommendation Results via Latent Factor Portfolio. In
International ACM SIGIR Conference on Research and Development in Information
Retrieval. ACM, New York, NY, USA, 175–184. https://doi.org/10.1145/2348283.
2348310
[18] Aixin Sun and Sourav S. Bhowmick. 2009. Image Tag Clarity: In Search of Visual-
representative Tags for Social Images. In SIGMM Workshop on Social Media. ACM,
New York, NY, USA, 19–26. https://doi.org/10.1145/1631144.1631150
[19] Kaiping Zheng, Hongzhi Wang, Zhixin Qi, Jianzhong Li, and Hong Gao. 2016. A
survey of query result diversification. Knowledge and Information Systems 51, 1
(2016), 1–36. https://doi.org/10.1007/s10115-016-0990-4