=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_38
|storemode=property
|title=Exploiting Visual-based Intent Classification for Diverse Social Image Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_38.pdf
|volume=Vol-1984
|authors=Bo Wang,Martha Larson
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WangL17
}}
==Exploiting Visual-based Intent Classification for Diverse Social Image Retrieval==
Bo Wang¹, Martha Larson¹,²
¹ Delft University of Technology, Netherlands
² Radboud University, Netherlands
b.wang-6@student.tudelft.nl, m.a.larson@tudelft.nl

MediaEval'17, 13-15 September 2017, Dublin, Ireland. Copyright held by the owner/author(s).

ABSTRACT
In the 2017 MediaEval Retrieving Diverse Social Images task, we (the TUD-MMC team) propose a novel method, an intent-based approach, for social image search result diversification. The underlying assumption is that the visual appearance of social images is shaped by the underlying photographic act, i.e., why the images were taken. A better understanding of the rationale behind the photographic act could therefore benefit social image search result diversification. To investigate this idea, we employ a manual content analysis approach to create a taxonomy of intent classes. Our experiments show that a CNN-based classifier is able to capture the visual differences between the classes in the intent taxonomy. We cluster images of the Flickr baseline based on predicted intent class and generate a re-ranked list by alternating images from different clusters. Our results reveal that, compared to conventional diversification strategies, intent-based search result diversification brings a considerable improvement in terms of cluster recall, along with several additional practical benefits.

1 INTRODUCTION
Recent advances in deep learning, especially convolutional neural networks, have been successfully applied to various computer vision and multimedia tasks such as object recognition and scene labeling [4]. However, recognition of the literally depicted content of multimedia documents (i.e., what is visible in the image) has absorbed most of the research attention. In contrast, less research has focused on the social, affective and subjective properties of data, for example, why an image was taken.

In this paper, we focus on user intent, i.e., the goals that users are pursuing when they take photos. We assume that intent has visual reflexes that can be captured by automatic visual classifiers, and that intent classes can then be applied to search result diversification. The goals of the photographer provide a simple, easily understandable explanation for the differences observed between photos [7]. However, given the lack of intent taxonomies (definitions of intent classes) and of data sets annotated with intent labels, we start by creating a taxonomy of intent classes, which we turn to next.

2 INTENT DISCOVERY
2.1 Data Set Generation
The intent taxonomy was created using a manual content analysis [5] approach on the basis of YFCC100M [10], the largest social image collection released to date. Since we are interested in building a taxonomy of intent classes at a higher level of abstraction that goes beyond concept detection, we use the 81 NUS-WIDE concepts [2] as queries to retrieve images from the YFCC100M data set with a tag-based retrieval system. For each query, we collect the top-200 relevant images; if fewer than 200 images are found, we use the entire results list. After querying for all NUS-WIDE concepts, we arrive at a data set containing 15,618 images.
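The collection step above can be pictured as a simple loop over the NUS-WIDE concepts against a tag index of YFCC100M metadata. The sketch below is only illustrative: the actual tag-based retrieval system and its ranking function are not specified in the paper, so the metadata schema and the view-count ranking used here are assumptions.

```python
# Illustrative sketch of the data set generation loop (Section 2.1).
# The tag index, metadata schema, and ranking by view count are assumptions;
# the paper only states that a tag-based retrieval system over YFCC100M was used.
from collections import defaultdict

NUS_WIDE_CONCEPTS = ["airport", "animal", "beach"]  # ... 81 concepts in total

def build_tag_index(photos):
    """photos: iterable of (photo_id, tags, view_count) records (hypothetical schema)."""
    index = defaultdict(list)
    for photo_id, tags, views in photos:
        for tag in tags:
            index[tag].append((views, photo_id))
    return index

def collect_intent_dataset(photos, top_k=200):
    index = build_tag_index(photos)
    dataset = set()
    for concept in NUS_WIDE_CONCEPTS:
        # Keep at most the top-200 matches per concept; shorter lists are kept in full.
        ranked = sorted(index.get(concept, []), reverse=True)[:top_k]
        dataset.update(photo_id for _, photo_id in ranked)
    return dataset  # the paper reports 15,618 images over all 81 concepts
```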
2.2 Intent Labeling
The intent taxonomy and labeled data set were produced by an expert annotator, who examined each image in turn. The manual content analysis approach used by the annotator consists of several steps. For each image, the annotator first assigns a preliminary intent label. Each new image is then judged as either belonging to an existing intent class or requiring the creation of a new intent class. Before introducing a new class, the annotator returns to the previously annotated images to ensure that it is not possible to accommodate the new image by updating the description of an existing class. If no existing class can be extended to incorporate the new image, a new intent class is introduced. The final taxonomy of 14 intent classes is described in [11].

3 INTENT CLASSIFICATION
We adopt a conventional transfer learning scheme to predict the intent class of an image. Transfer learning trains models on one task and leverages them for a different, but related, task [6]. In our case, we used VGGNet [9], originally trained on ImageNet [3], to extract visual content features from our images. The last fully connected layer (between 2048 neurons and 1000 class scores) was removed, and the rest of the network serves as a feature extractor. We then retrained a Softmax classifier with a cross-entropy loss on our image data set annotated with 14 intent classes. We used 70% of the data for training and held out 25% for validation (the remaining 5% is not used here). Before training, we re-sized all images to 224x224 pixels and applied data augmentation (random horizontal flipping, cropping and re-scaling). Our model achieved 71% accuracy on the validation set, suggesting that intent classes are visually stable enough to allow a classifier to generalize over them.
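A minimal transfer-learning sketch of this setup, assuming PyTorch and torchvision: the convolutional base of a pre-trained VGG network is frozen, the final fully connected layer is replaced by a new 14-way layer trained with cross-entropy loss, and the augmentation roughly mirrors the resizing, flipping, and cropping described above. Data loading, exact layer dimensions, and optimizer settings are assumptions rather than details from the paper.

```python
# Sketch of the intent classifier (Section 3): VGG as a fixed feature extractor
# with a retrained softmax head for 14 intent classes. Optimizer settings and
# layer dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_INTENT_CLASSES = 14

# Pre-processing and augmentation roughly matching the paper:
# resize/crop to 224x224 with random cropping and horizontal flipping.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = models.vgg16(pretrained=True)
for p in model.features.parameters():
    p.requires_grad = False  # keep the convolutional base as a fixed feature extractor

# Replace the final fully connected layer with a new head for the 14 intent classes.
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, NUM_INTENT_CLASSES)

criterion = nn.CrossEntropyLoss()  # cross-entropy over the softmax scores
optimizer = torch.optim.SGD(model.classifier[-1].parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """One optimization step on a batch of (augmented) images and intent labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```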
4 DIVERSIFICATION
Intent-based search result diversification works as follows. The first step is to create a refined initial ranked list by re-ranking the Flickr baseline using textual features (a vector space model with tf-idf weights), with the aim of increasing precision. After that, the top N images in the re-ranked list are classified by our intent classifier; in our case, N is 50. To generate the final results list, we apply a round-robin approach: we consider each intent class to be a cluster of images and pick the top-ranked photo from each intent cluster (without replacement) in turn. This approach applies the assumption that the intent clusters reflect diversity as captured by the photographer's intent.

In addition to the intent-based approach, we also submitted three runs for search result diversification: visual (run1), text-rerank+text (run2) and text-rerank+visual (run3). The intent-based approach is designated text-rerank+intent (run4). For visual (run1), we directly apply k-means clustering to the CNN-based descriptors provided by the task organizers [12]. We employ a heuristic approach to choose k: treating k as a variable in (1, n], we apply k-means clustering once for each candidate value, evaluate each clustering with silhouette analysis [8], and select the k that achieves the best silhouette score. Our text-rerank+visual (run3) approach adopts the same general strategy as the visual approach; the difference is that instead of directly applying k-means clustering, we first re-rank the Flickr baseline with tf-idf weights and then cluster.

For our text-rerank+text (run2) approach, we again first re-rank the Flickr baseline with tf-idf weights. Since in this case we are not allowed to use visual descriptors, the most critical issue is to learn a good representation for each "short document" consisting of title, description and tags. To achieve this, we adopted the idea of weighted word embedding aggregation proposed by De Boom et al. [1]. More concretely, for each term associated with an image, we use its 50-dimensional word embedding vector (the word embedding vectors were supplied by the organizers). Each image is thus represented as a set of vectors: for an image with m terms, we have a set of m 50-dimensional vectors. To model an image, we take the coordinate-wise maximum and minimum of the set of m vectors and concatenate the two resulting vectors to arrive at a 100-dimensional vector, which is our final text-based image representation. For each query, we have a set of 300 image vectors, to which we apply k-means clustering with silhouette analysis.
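To make the run 2 representation and the heuristic choice of k concrete, here is a minimal sketch assuming NumPy and scikit-learn; the embedding lookup table stands in for the 50-dimensional vectors supplied by the organizers, and the candidate range for k is illustrative.

```python
# Sketch of the run 2 text representation (min/max word embedding aggregation)
# and the silhouette-based selection of k (Section 4). The embedding table and
# candidate range for k are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

EMB_DIM = 50

def embed_image_text(terms, embeddings):
    """Min/max aggregation of the 50-d term embeddings -> one 100-d image vector."""
    vectors = np.array([embeddings[t] for t in terms if t in embeddings])
    if len(vectors) == 0:
        return np.zeros(2 * EMB_DIM)  # fallback for images with no known terms
    return np.concatenate([vectors.min(axis=0), vectors.max(axis=0)])

def cluster_with_best_k(X, max_k):
    """Run k-means for each candidate k and keep the labels with the best silhouette score."""
    best_score, best_labels = None, None
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if best_score is None or score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```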
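The intent-based run (run 4) instead groups the top-ranked images by their predicted intent class and interleaves the groups, as described at the beginning of this section. A minimal sketch, in which predict_intent stands in for the classifier of Section 3:

```python
# Sketch of the round-robin, intent-based diversification of run 4 (Section 4).
# `predict_intent` is a stand-in for the Section 3 classifier.
from collections import OrderedDict

def diversify_by_intent(reranked_ids, predict_intent, n=50):
    """reranked_ids: image ids ordered by the tf-idf re-ranking (best first)."""
    top = reranked_ids[:n]
    # Group the top-N images into intent clusters, preserving their rank order.
    clusters = OrderedDict()
    for img in top:
        clusters.setdefault(predict_intent(img), []).append(img)
    # Round-robin: repeatedly take the best remaining image from each cluster in turn.
    final = []
    while any(clusters.values()):
        for members in clusters.values():
            if members:
                final.append(members.pop(0))
    return final + reranked_ids[n:]  # images beyond the top N keep their original order
```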
5 RESULTS AND ANALYSIS
Table 1 reports the results in terms of the official MediaEval 2017 evaluation metrics P@20, CR@20 and F1@20.

Table 1: Precision, cluster recall and F1 score at rank 20 for the four runs on the Dev and Test sets.

Data set   Metric   visual (run1)   text-rerank+text (run2)   text-rerank+visual (run3)   text-rerank+intent (run4)
Dev Set    P@20     61.52%          67.72%                    67.72%                      67.69%
Dev Set    CR@20    49.29%          52.36%                    53.61%                      55.61%
Dev Set    F1@20    54.73%          59.05%                    59.83%                      61.07%
Test Set   P@20     66.01%          70.36%                    70.71%                      72.62%
Test Set   CR@20    56.98%          61.42%                    58.09%                      61.25%
Test Set   F1@20    58.30%          63.43%                    61.21%                      64.62%

In general, higher precision is usually associated with relatively higher cluster recall and F1 scores, because non-relevant images have no associated diversity cluster label. This phenomenon can be clearly observed by comparing visual and text-rerank+visual. What is surprising is that the text-based image representation achieves a better clustering result on the test set than the visual CNN representation. The text-based approach text-rerank+text and our intent-based strategy text-rerank+intent perform comparably on the test set. The intent-based approach appears to give a boost to relevance as measured by P@20 and F1@20.

[Figure 1: Comparison between text-rerank+intent (run4, above) and text-rerank+text (run2, below) over all query ids (x-axis); purple: P@20, red: CR@20.]

Figure 1 shows that both metrics fluctuate widely across queries. We measured the Pearson coefficient between P@20 and CR@20 for text-rerank+intent (run4) (0.41) and for text-rerank+text (run2) (0.35), which reveals that the intent-based approach is more sensitive to the precision of the initial ranking. The standard deviations are comparable: σ = 0.17 for text-rerank+text and σ = 0.18 for text-rerank+intent.

We point out three other aspects of the intent-based diversification approach that make it practically useful. First, intent-based diversification has the advantage of better understandability, since the classification result directly provides a user-interpretable indication of the reason behind the ranking; the retrieval system can give the user an explanation for its prioritization of search results. Second, once the model has been trained, we do not need to fine-tune clustering hyperparameters, i.e., the position at which to cut the dendrogram (for hierarchical clustering) or the initial k (for k-means clustering). Third, intent labels are generated off-line at indexing time, so no clustering step, which would increase system response time, is necessary at query time.

REFERENCES
[1] Cedric De Boom, Steven Van Canneyt, Thomas Demeester, and Bart Dhoedt. 2016. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters 80 (2016), 150–156.
[2] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the 8th ACM International Conference on Image and Video Retrieval, CIVR 2009, Santorini Island, Greece, July 8-10, 2009.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. 248–255.
[4] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, and Gang Wang. 2015. Recent Advances in Convolutional Neural Networks. CoRR abs/1512.07108 (2015).
[5] Kimberly A. Neuendorf. 2016. The Content Analysis Guidebook. Sage.
[6] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 22, 10 (2010), 1345–1359.
[7] Michael Riegler, Martha Larson, Mathias Lux, and Christoph Kofler. 2014. How 'How' Reflects What's What: Content-based Exploitation of How Users Frame Social Images. In Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03-07, 2014. 397–406.
[8] Peter J. Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[9] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[10] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: the new data in multimedia research. Commun. ACM 59, 2 (2016), 64–73.
[11] Bo Wang and Martha Larson. 2017. Beyond Concept Detection: The Potential of User Intent for Image Retrieval. In Proceedings of the ACM MM'17 Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes (MUSA'17). To appear.
[12] Maia Zaharieva, Bogdan Ionescu, Alexandru-Lucian Gînsca, Rodrygo L. T. Santos, and Henning Müller. 2017. Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation. In Working Notes Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.