=Paper=
{{Paper
|id=Vol-1263/paper2
|storemode=property
|title=A Unified Framework for Retrieving Diverse Social Images
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_2.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/ZaharievaS14
}}
==A Unified Framework for Retrieving Diverse Social Images==
Maia Zaharieva<sup>1,2</sup> and Patrick Schwab<sup>1</sup>

<sup>1</sup> Multimedia Information Systems, Faculty of Computer Science, University of Vienna, Austria

<sup>2</sup> Interactive Media Systems, Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria

maia.zaharieva@[univie|tuwien].ac.at, a0927193@unet.univie.ac.at

ABSTRACT

In this paper we explore the performance of a generic, unified framework for the retrieval of relevant and diverse images from social photo collections. The approach allows for the easy evaluation of different visual and textual image descriptions, clustering algorithms, and similarity metrics. Preliminary results show a strong dependence of the achieved performance on the choice of underlying technology and similarity metric.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

1. INTRODUCTION

The immense daily growth of publicly shared media bears both potential and challenges for automated media analysis and understanding. Currently, image access and retrieval is usually based on user-provided and camera-generated metadata. Although increasingly available, such metadata often suffers from limitations such as imprecise capture time and GPS information or misleading and ambiguous textual descriptions. Additionally, the large number of shared items often results in a high number of visually very similar images. This challenge is the focus of the MediaEval 2014 Retrieving Diverse Social Images Task [2]. The aim of the task is the refinement of location images retrieved from Flickr while taking into consideration both their relevance and diversity.

Previous work in the context of this task shows a broad field of possible approaches, ranging from re-ranking and clustering to greedy optimization and graph representations [3]. Several authors propose different systems for different feature types (e.g., [1][5]), which impedes reasoning about the selection of an approach or of particular features. Furthermore, some methods build upon assumptions that hold true in a limited setting only (e.g., that the relevance of an image is related to the number of views or the length of the descriptions [4]). While most of the presented approaches employ a combination of a re-ranking method (for relevance improvement) and a clustering method (for ensuring diversification), we build a unified framework that allows for a thorough evaluation of various textual and visual features, clustering algorithms, and similarity metrics.

2. APPROACH

We employ a multi-stage approach for the retrieval of diverse social images. The workflow passes through three main stages: 1) relevance ranking of input images, 2) image clustering for diversification, and 3) final image selection. The initial set of input images may optionally be pre-processed in order to filter potentially irrelevant images, such as images with a human as the main subject.

In the first stage, relevance ranking, each image of the input set is first represented by a feature vector <math>\vec{v}</math>, where <math>\vec{v}</math> is a concatenation of the standardized z-scores of the feature descriptors <math>d_1, \ldots, d_n</math>:

<math>\vec{v} = \mathrm{zscore}(d_1, \ldots, d_n) \qquad (1)</math>

Since the provided Wikipedia photos are by definition representative [2], we additionally compute a representative feature vector <math>\vec{v}_r</math> for each referenced Wikipedia image. Subsequently, the relevance score <math>s</math> of an image is defined as the smallest distance between its feature vector <math>\vec{v}</math> and all <math>\vec{v}_r</math> from the set of representative feature vectors <math>W</math>:

<math>s = \min_{\vec{v}_r \in W} \mathrm{distance}(\vec{v}, \vec{v}_r) \qquad (2)</math>

The aim of the second stage, image clustering, is to find groups of similar images that can be used to diversify the final retrieval results. Note that the distance measures and image features used at this step are not necessarily the same ones employed for relevance ranking.

The third and last stage of the approach, final image selection, combines the results of the previous steps to retrieve images that are both relevant and diverse with respect to the initial image set. For this stage we use a round-robin algorithm. We start by selecting the image with the best relevance score from each cluster. These images, sorted in ascending order of their relevance score (a smaller distance in Eq. (2) indicating higher relevance), constitute the <math>m</math> highest ranked results, where <math>m</math> is the number of detected clusters. The selected images are removed from their corresponding clusters and the selection process is repeated until the required number of retrieved results is reached.

In general, the clustering algorithm, the metric used to compare the feature vectors, and the underlying image features (for both image ranking and image clustering) can be chosen freely. In our experiments we tested different clustering algorithms (k-means, Adaptive Hierarchical Clustering (AHC), MeanShift, and Lingo), several comparison metrics (Euclidean, city-block, χ², cosine, correlation, Mahalanobis, Spearman, Hamming, and Jaccard), and all visual and textual features provided by the organizers [2]: term frequency-inverse document frequency (TF-IDF), Color Naming Histogram (CN), Histogram of Oriented Gradients (HOG), Color Moments (CM), Local Binary Patterns (LBP), Statistics of Gray Level Run Length Matrix (GLRLM), and
the corresponding spatial pyramid representations (3x3) in addition to Bag-of-Visual Words (BoVW) of dense SIFT descriptors.

{| class="wikitable"
|+ Table 1: Best feature-metric combinations for AHC.
! Rank !! Relevance ranking !! Image clustering
|-
| 1 || CM3x3 (Euclidean) || CM (χ²), CM3x3 (Euclidean)
|-
| 2 || TF-IDF (Spearman), SIFT (Euclidean) || HOG (cosine)
|-
| 3 || CM (χ²), LBP (χ²), LBP3x3 (χ²) || GLRLM3x3 (χ²), LBP3x3 (χ²)
|-
| 4 || HOG (cosine), GLRLM (χ²), GLRLM3x3 (χ²) || GLRLM (Euclidean), LBP (χ²), SIFT (Euclidean)
|-
| 5 || CSD (cosine), CN (correlation) || CSD (cosine), CN (Euclidean), CN3x3 (Euclidean)
|-
| 6 || CN3x3 (Euclidean) || TF-IDF (Euclidean)
|}

{| class="wikitable"
|+ Table 2: Official runs configurations (V: visual, T: textual descriptors employed).
! Run !! Relevance ranking !! Image clustering
|-
| run1 (V) || CM3x3 || SIFT
|-
| run2 (T) || TF-IDF || TF-IDF
|-
| run3 (VT) || TF-IDF || CSD
|-
| run5 (V) || CM3x3 || CSD
|}

{| class="wikitable"
|+ Table 3: Evaluation results.
! rowspan="2" | Run !! colspan="3" | Development dataset !! colspan="3" | Test dataset
|-
! CR@20 !! P@20 !! F1@20 !! CR@20 !! P@20 !! F1@20
|-
| run1 || 0.4426 || 0.7600 || 0.5552 || 0.3901 || 0.6646 || 0.4863
|-
| run2 || 0.4132 || 0.7250 || 0.5188 || 0.3909 || 0.6809 || 0.4888
|-
| run3 || 0.4484 || 0.7567 || 0.5559 || 0.3982 || 0.6732 || 0.4949
|-
| run5 || 0.4369 || 0.7617 || 0.5499 || 0.3915 || 0.6752 || 0.4897
|}
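
The relevance ranking and final image selection stages described in Section 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Euclidean distance stands in for any of the tested metrics, the z-score is assumed to be computed per feature dimension over all vectors (using the population standard deviation), and the cluster labels are assumed to come from whichever clustering algorithm is chosen for the second stage.

```python
import math
from collections import defaultdict


def zscore_columns(rows):
    """Standardize every feature dimension to zero mean and unit variance.

    Assumption: statistics are computed over all given vectors; dimensions
    with zero variance are left centered but unscaled.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
        for j in range(dims)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(dims)] for r in rows]


def euclidean(a, b):
    """Euclidean distance, used here as one example metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relevance_scores(vectors, wiki_vectors, distance=euclidean):
    """Eq. (2): smallest distance from each image to any Wikipedia vector."""
    return [min(distance(v, w) for w in wiki_vectors) for v in vectors]


def round_robin_select(scores, labels, k):
    """Pick the best remaining image of each cluster, round after round."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    for members in clusters.values():
        members.sort(key=lambda i: scores[i])  # smallest distance first
    selected = []
    while len(selected) < k and any(clusters.values()):
        # One round: remove the best remaining image from every non-empty cluster.
        picks = [members.pop(0) for members in clusters.values() if members]
        picks.sort(key=lambda i: scores[i])  # ascending relevance score
        selected.extend(picks)
    return selected[:k]


if __name__ == "__main__":
    # Toy 2-D descriptors; the last row stands in for a Wikipedia reference image.
    raw = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [9.0, 0.5], [0.1, 0.0]]
    standardized = zscore_columns(raw)
    images, wiki = standardized[:-1], standardized[-1:]
    scores = relevance_scores(images, wiki)
    labels = [0, 0, 1, 1, 2]  # assumed output of the clustering stage
    print(round_robin_select(scores, labels, k=3))
```

Because each round contributes at most one image per cluster, the first m results cover all m clusters before any cluster is visited a second time, which is what provides the diversification.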

3. EXPERIMENTS AND RESULTS

In our first experiments we compared the performance of the different clustering algorithms. Results on the development dataset showed that AHC significantly outperforms k-means, MeanShift, and Lingo for all explored features (significance t-test, p < 0.001). Thus, we employed AHC in all follow-up experiments.

We conducted a thorough evaluation of the performance of the employed features at the two main stages of our approach: relevance ranking and image clustering. Table 1 summarizes the results by means of ranked feature lists. The reported feature rankings and the selection of the corresponding best performing distance measures are the product of significance t-tests with overall p < 0.003. While the Color Naming Histograms (CN and CN3x3) are usually outperformed by any other feature, the Color Moments (CM and CM3x3) show robust performance in both the ranking and the clustering tasks. In contrast to the ranking, which is clearly dominated by the performance of CM3x3, TF-IDF, and SIFT, image clustering using AHC is more robust and the difference in the performance of global and local features decreases to a large extent.

Eventually, we submitted four runs for the final evaluation (see Table 2 for the configurations). Table 3 shows the results of the submitted runs for both the development and the test dataset. The best performance in terms of CR@20 and F1@20 is achieved by the combination of textual and visual information (run3). However, on the test dataset the differences between the performances of the different runs vanish. Overall, cluster recall (CR) remains relatively low due to the large number of irrelevant images building noisy clusters. In general, the achieved results outline the limitations of the available textual (and visual) information in assessing image relevance. This is mainly because user-provided textual descriptions on social media sites often contain ambiguous or irrelevant information. A possible way to improve the results is to consider the occasionally available GPS data and to employ external resources as an additional source of information.

4. CONCLUSION

In this paper we presented a generic, unsupervised framework for the evaluation of various visual and textual features, similarity metrics, and clustering approaches for the retrieval of diverse social images. The performed experiments aim at evaluating the potential and limitations of the provided visual and textual descriptions; thus, we refrain from employing any assumptions or external sources of information. Although there are significant differences in the performances of single features, the top performing features prove to be highly interchangeable. The achieved results indicate that, for the given datasets, the crucial part of the process is not so much the diversification as the assessment of image relevance.

Acknowledgment

This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010.

5. REFERENCES

[1] D. Corney, C. Martin, A. Göker, E. S. Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. M. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval 2013 Workshop, 2013.

[2] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscâ, and H. Müller. Retrieving diverse social images at MediaEval 2014: Challenge, dataset, and evaluation. In MediaEval 2014 Workshop, 2014.

[3] B. Ionescu, A. Popescu, H. Müller, M. Menéndez, and A.-L. Radu. Benchmarking result diversification in social image retrieval. In IEEE International Conference on Image Processing, 2014.

[4] N. Jain, J. Hare, S. Samangooei, J. Preston, J. Davies, D. Dupplaw, and P. H. Lewis. Experiments in diversifying Flickr result sets. In MediaEval 2013 Multimedia Benchmark Workshop, 2013.

[5] B. Vandersmissen, A. Tomar, F. Godin, W. D. Neve, and R. V. de Walle. Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-based hierarchical clustering. In MediaEval 2013 Workshop, 2013.