=Paper= {{Paper |id=Vol-1739/MediaEval_2016_paper_12 |storemode=property |title=An Adaptive Clustering Approach for the Diversification of Image Retrieval Results |pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_12.pdf |volume=Vol-1739 |dblpUrl=https://dblp.org/rec/conf/mediaeval/Zaharieva16 }} ==An Adaptive Clustering Approach for the Diversification of Image Retrieval Results== https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_12.pdf
                    An Adaptive Clustering Approach for the
                    Diversification of Image Retrieval Results

                                                         Maia Zaharieva∗
                       Interactive Media Systems Group, Vienna University of Technology, Austria
                          Multimedia Information Systems Group, University of Vienna, Austria
                                             maia.zaharieva@tuwien.ac.at

ABSTRACT
In this paper, we explore the application of an adaptive clus-
tering approach for the diversification of image retrieval re-
sults in the context of the MediaEval 2016 Retrieving Di-
verse Social Images Task. The proposed approach exploits
available textual descriptions, the visual content of the im-
ages, and a set of common clustering techniques to select               (a) sailing boats on sea      (b) trees reflected in water
the best combination for each image query individually and
in an unsupervised manner.                                           Figure 1: Retrieval results for two image queries.
                                                                     Each line corresponds to a desired image groupings.
1.   INTRODUCTION
   The immense amount on publicly available media con-               to efficiently describe the retrieval results in terms of diver-
tent commonly challenges end users in making use of the              sification of the final image set. Figure 1 shows examples for
broad variety of accessible data. As a result, a lot of recent       desired image groupings for two queries: sailing boats on sea
research focuses on the optimization of retrieval results in         and trees reflected in water. While the results of the first
terms of improved relevance estimation and increased diver-          query indicate potential relevance of the overall composi-
sification [3, 6, 13, 21]. The MediaEval Retrieving Diverse          tion and edge-based descriptors, the second query suggests
Social Images task fosters the development and comparabil-           the use of color-based descriptors. On the contrary, color
ity of algorithms in this context [12]. The goal of the task         information is less meaningful for the first query since the
in 2016 is to refine a set of images retrieved from Flickr as        retrieved images exhibit common color settings. Similarly,
result of a general (and often multi-topic) query.                   edges do not provide enough discriminative power to support
   Increasing the relevance commonly leads to decreased di-          the building of the desired image groupings for the second
versity of the underling image set and vice versa. The mag-          query. Therefore, in our study we consider a set of commonly
nitude of this reciprocal effect is difficult to estimate for dif-   employed visual- and text-based descriptors to represent the
ferent data settings. Therefore, in order to exploit the real        image data. Next to the two convolutional neural network
potential of a clustering-based approach for data diversifica-       (CNN)-based descriptors provided by the organizers [12], we
tion, we consider all images as relevant for the given query.        additionally employ the first 36 coefficients of the discrete
To address a general image retrieval scenario with arbitrary         cosine transform (DCT) [1], intensity histogram (IH) [10],
queries, we only consider commonly available textual infor-          KANSEI shape descriptor [15], and six MPEG-7 visual de-
mation (title, description, tags) and the visual content of          scriptors [4, 18]: color layout (CL), color structure (CS),
the images. The proposed approach does not make any as-              edge histogram (EH), homogeneous texture (HT), region-
sumptions about the initial image query or data characteris-         based shape (RS), and scalable color (SC). As text-based
tics. Moreover, the approach autonomously selects the best           features we consider the well-established term frequency-
combination of image descriptions and clustering approach            inverse document frequency (TF-IDF) [14]. We compute the
for each query individually and, thus, it is fully adaptive for      TF-IDF vector for each image using the available textual de-
different queries and data settings. Preliminary experiments         scription (title, tags, and descriptions) in combination and
demonstrate the generalization ability of the proposed ap-           individually. The textual descriptions are first preprocessed
proach and both its potentials and limitations.                      to increase their expressiveness, i.e., we remove potential
                                                                     occurrences of the corresponding user name, web links, and
2.   APPROACH                                                        stopwords and we additionally stem all remaining terms.
  The fundamental assumption behind the proposed ap-                    The unsupervised detection of existing groupings in a data-
proach is that different queries require for different features      set is commonly performed by means of a clustering algo-
                                                                     rithm. However, the choice of a clustering approach is not a
∗
  This work has been partly funded by the Vienna Science             trivial decision. Different clustering approaches commonly
and Technology Fund (WWTF) through project ICT12-010.                address different data characteristics. Furthermore, poten-
                                                                     tial clustering parameters usually require for an additional
Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum,             parameter tuning process. In this context, clustering in-
Netherlands.                                                         ternal validation indices are commonly applied in order to
Table 1: Optimal solution: visual features. Num-                                Table 2: Experiments on the development dataset.
bers in the cells correspond to topic IDs.                                           Approach                           P@20     CR@20     F1@20
                                                                                     Baseline: Flickr                   0.6979    0.3717    0.4674
                                              P
 Feature        AP   EM                  KM                   XM
 CNN ad         58   11, 25              29                   38, 51        6        Optimal solution:
 CNN gen                                 4, 33, 39, 59        17, 54        6                         visual features   0.8179    0.6575    0.7122
 DCT            48   19, 60, 67          50                                 5                           text features   0.8136    0.6453    0.7043
 IH             45   2                   28, 53               6, 65         6                          visual + text    0.8250    0.6634    0.7186
 KANSEI shape        9, 20, 26, 36, 46   13, 57                             7
                                                                                     Best performing fixed settings:
 MPEG7 CL            22, 47, 55          5, 31, 61            12, 18, 21    9
 MPEG7 CS            15                  7, 62, 70                          4                         visual features   0.6657    0.4453    0.5237
 MPEG7 EH            32, 49, 68          14, 34, 40, 42, 52                 8                           text features   0.6636    0.4274    0.5045
 MPEG7 HT            1, 44               8, 27, 63            24            6                          visual + text    0.6657    0.4453    0.5237
 MPEG7 RS            23, 69              10, 41               35, 64, 66    7        Proposed adaptive approach:
 MPEG7 SC P          3, 43, 56                                16, 30, 37    6                         visual features   0.6500    0.4398    0.5123
                 3   25                  26                   16           70                           text features   0.6729    0.4230    0.5029
                                                                                                         visual+text    0.6286    0.4061    0.4803

assess the quality of a clustering solution in terms of com-
pactness and/or separability of the detected clusters [16].                          Table 3: MediaEval 2016 benchmark results.
We exploit the performance of several, broadly employed
validation indices covering different aspects of a clustering                    Run     Configuration                       P@20    CR@20     F1@20
solutions: 1) compactness in terms of sum of squares and                         run 1   Adaptive, visual features          0.5141    0.4024   0.4292
                                                                                 run 2   Adaptive, text features            0.5406    0.4130   0.4463
C-index [11], 2) separability by means of single linkage dis-                    run 3   Adaptive, visual+text features     0.5430    0.4130   0.4471
tance between two clusters, 3) combination of compactness                        run 4   Fixed settings, visual features    0.4969    0.3603   0.4006
and separability in terms of Calinski-Harabasz [5], Davies-
Bouldin [7], and Silhouette [20], and 4) consistency compar-
ison measures by means of Gamma and Tau indices [2]. We                         Additionally, the achieved performance in terms of F1@20-
employ the clustering validation measures to select the best                    score significantly outperforms the baseline defined by the
combination of image feature and clustering algorithm for                       original Flickr result. The best performing clustering solu-
each query individually. As clustering methods we inves-                        tion considers the DCT feature in combination with the KM
tigate two model-based approaches: Affinity Propagation                         clustering approach and the C validation index. This result
(AP) [9] and expectation maximization (EM) [8], and two                         is notably lower than the achievable performance. Addition-
partitional approaches: k-means (KM) [17] and X-means                           ally, the combination is selected based on experiments on a
(XM) [19]. The clustering methods were selected for their                       single dataset and, thus, overfitting the data. Experiments
efficiency. Additionally, the only parameter tuning concerns                    with the performance of the clustering validation indices in-
the potential specification of the expected number of clus-                     dicated the C-index as the best performing validation mea-
ters, k. In this case, we perform clustering for various set-                   sure stressing the importance of compactness for the final
tings, k = {5, 10, 20, 30, 40, 50}, and consider each clustering                clustering solution. The results achieved using the proposed
solution individually. The final selection of a clustering so-                  adaptive approach in combination with the C-index are pre-
lution for a given query is based on the quality assessment                     sented in Table 2. The results are slightly lower than the
of all possible combinations between the considered cluster-                    best performing fixed setting yet not overfitting the data.
ing approaches and the employed image features. The final                          Table 3 summarizes the results on the test set. Runs 1 −
selection of images from the clusters follows a Round-Robin                     3 employ the adaptive approach while run 4 exploits the
approach according to the Flickr-provided relevance scores.                     best performing fixed settings on the development set as
We start by selecting the image with the best relevance score                   proof-of-concept for its overfitting. The proposed approach
from each cluster. These images, sorted in ascending order,                     adapts well to the new dataset indicated by the difference in
constitute the m highest ranked results, where m is the num-                    the selected features and clustering approaches. The fixed
ber of detected clusters. The selected images are removed                       selection of image feature and clustering approach in run 4
from their clusters and the selection process is repeated until                 performs lowest as expected. Overall, the results are lower
the required number of retrieved results is achieved.                           in comparison to the development dataset. However, this
                                                                                can only be evaluated in relation to the baseline which is
                                                                                not available for the test set yet.
3.   EXPERIMENTAL RESULTS
   In our first experiment we investigate whether or not dif-
ferent datasets require for different features. For this pur-                   4.    CONCLUSION
pose we perform clustering on the development dataset us-                          In this paper we presented an initial study of the applica-
ing all possible combinations between the considered clus-                      bility of an unsupervised and adaptive clustering-based ap-
tering approaches (including potential clustering configura-                    proach for the diversification of image retrieval results. The
tions) and the employed visual- and text-based features.                        results indicate that different image queries favor different
The best clustering solution is selected using the ground                       image representations and different clustering methods. The
truth information in terms of highest F1@20-score. Due to                       optimal solution for a dataset achieves an outstanding per-
space limitations Table 1 shows an overview of the optimal                      formance. However, the considered validation indices could
clustering combinations for each topic (query) using the vi-                    not reflect the optimal solution. This might be due to the
sual features only. The achieved results are presented in Ta-                   fact that different clustering solutions require for different
ble 2 (see Optimal solution). The balanced distribution of                      validations. Our future work will exploit the potential of
the clustering solutions across the employed visual features                    combining clustering validation indices in order to approach
indicate that different image queries favor different features.                 the best achievable performance.
5.   REFERENCES                                                      A. Yamada. Color and texture descriptors. IEEE
 [1] N. Ahmed, T. Natarajan, and K. R. Rao. Discrete                 Transactions on Circuits and Systems for Video
     cosine transform. IEEE Transactions on Computers,               Technology, 11(6):703–715, Jun 2001.
     C-23(1):90–93, 1974.                                       [19] D. Pelleg and A. W. Moore. X-means: Extending
 [2] F. B. Baker and L. J. Hubert. Measuring the power of            k-means with efficient estimation of the number of
     hierarchical cluster analysis. Journal of the American          clusters. In International Conference on Machine
     Statistical Association, 70(349):31–38, 1975.                   Learning (ICML), pages 727–734, 2000.
 [3] G. Boato, D.-T. Dang-Nguyen, O. Muratov,                   [20] P. Rousseeuw. Silhouettes: A graphical aid to the
     N. Alajlan, and F. G. B. De Natale. Exploiting visual           interpretation and validation of cluster analysis.
     saliency for increasing diversity of image retrieval            Journal of Computational and Applied Mathematics,
     results. Multimedia Tools and Applications,                     20(1):53–65, 1987.
     75(10):5581–5602, 2016.                                    [21] E. Spyromitros-Xioufis, S. Papadopoulos, A. L.
 [4] M. Bober. Mpeg-7 visual shape descriptors. IEEE                 Ginsca, A. Popescu, Y. Kompatsiaris, and I. Vlahavas.
     Transactions on Circuits and Systems for Video                  Improving diversity in image search via supervised
     Technology, 11(6):716–719, Jun 2001.                            relevance scoring. In ACM International Conference
 [5] T. Calinski and J. Harabasz. A dendrite method for              on Multimedia Retrieval, pages 323–330, 2015.
     cluster analysis. Communications in Statistics,
     3(1):1–27, 1974.
 [6] D. T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato,
     and F. G. B. D. Natale. A hybrid approach for
     retrieving diverse social images of landmarks. In IEEE
     International Conference on Multimedia and Expo
     (ICME), pages 1–6, June 2015.
 [7] D. L. Davies and D. W. Bouldin. A cluster separation
     measure. IEEE Transactions on Pattern Analysis and
     Machine Intelligence, PAMI-1(2):224–227, 1979.
 [8] A. P. Dempster, N. M. Laird, and D. B. Rubin.
     Maximum Likelihood from Incomplete Data via the
     EM Algorithm. Journal of the Royal Statistical
     Society. Series B (Methodological), 39(1):1–38, 1977.
 [9] B. J. Frey and D. Dueck. Clustering by passing
     messages between data points. Science,
     315(5814):972–976, 2007.
[10] R. C. Gonzalez and R. E. Woods. Digital Image
     Processing. Prentice-Hall, Inc., 3rd edition, 2006.
[11] L. J. Hubert and J. R. Levin. A general statistical
     framework for assessing categorical clustering in free
     recall. Psychological Bulletin, 83(6):1072–1080, 1978.
[12] B. Ionescu, A. L. Ginsca, M. Zaharieva, B. Boteanu,
     M. Lupu, and H. Müller. Retrieving diverse social
     images at MediaEval 2016: Challenge, dataset and
     evaluation. In MediaEval 2016 Multimedia Benchmark
     Workshop, 2016.
[13] B. Ionescu, A. Popescu, A.-L. Radu, and H. Müller.
     Result diversification in social image retrieval: a
     benchmarking framework. Multimedia Tools and
     Applications, 75(2):1301–1331, 2016.
[14] K. S. Jones. A statistical interpretation of term
     specificity and its application in retrieval. Journal of
     Documentation, 28(1):11–21, 1972.
[15] H. Kobayashi, Y. Okouchi, and S. Ota. Image retrieval
     system using KANSEI features. In Pacific Rim
     International Conference on Artificial Intelligence:
     Topics in Artificial Intelligence, pages 626–635, 1998.
[16] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu.
     Understanding of internal clustering validation
     measures. In IEEE International Conference on Data
     Mining (ICDM), pages 911–916, 2010.
[17] S. Lloyd. Least squares quantization in PCM. IEEE
     Transactions on Information Theory, 28(2):129–137,
     1982.
[18] B. Manjunath, J.-R. Ohm, V. Vasudevan, and