Your Click Matters: Enhancing Click-based Image Retrieval performance
                    through Collaborative Filtering

        Deepanwita Datta                   Manajit Chakraborty                                Aveek Biswas
              NTNU                     Università della Svizzera italiana                        UCSD
        Trondheim, Norway                   Lugano, Switzerland                              California, USA
    ddatta.rs.cse13@itbhu.ac.in                    chakrm@usi.ch                           a4biswas@ucsd.edu


                       Abstract                               tion to bridge the semantic gap between the high-
                                                              level information needs of users and commonly
     Image retrieval has been an active research              employed low-level features, it continues to be a
     area since the early days of computing.                  major challenge. The existing state-of-the-art so-
     While ensemble, multimodal and hybrid                    lutions to this challenge is two-pronged. Few of
     methods coupled with machine learning                    the existing works (Feng et al., 2014; Zhen and Ye-
     has seen an upward surge replacing uni-                  ung, 2012) stress on learning mapping functions
     modal, heuristic-based methods; a rather                 whereas rest of the works explore the high-level
     new offshoot has been to identify new fea-               semantic representation of modalities (Karpathy
     tures associated with images on the web.                 and Fei-Fei, 2015; Reed et al., 2016). Among
     One such feature is the ‘click count’ based              these, semantic representation based approaches
     on the clicks an image or its correspond-                and deep learning based approaches have gained
     ing text gets in response to a query. Previ-             reasonable success. Deep Convolutional Neural
     ous state-of-the-art methods have tried to               Networks (CNNs) are used to learn the latent fea-
     exploit this feature by using its raw count              tures and these learned features are utilized as vi-
     and machine learning. In this paper, we                  sual and textual semantic representation in these
     build on this idea and propose a new col-                models.
     laborative filtering based technique to em-              In ACM Multimedia 2015 MSR-Bing Image Re-
     ploy the click-log of users from the web to              trieval Challenge1 , it was argued that the massive
     better identify and associate images in re-              amount of click data from commercial search en-
     sponse to either a text or an image query.               gines provide a data set that is unique in bridging
     Experiments performed on a large scale                   the semantic and intent gap. Millions of click data
     publicly available standard dataset having               i.e. clicked image-query pairs, generated from
     genuine click logs from actual users cor-                search engines, were collected and released pub-
     roborate the efficacy and significant in-                licly as a new large-scale real-world image click
     crease in efficiency of our approach.                    data (Clickture) to investigate how to effectively
                                                              leverage this click-count based data to mitigate the
1     Introduction
                                                              semantic gap. This click data is stored in a large ta-
Cross-media retrieval has proven to be an effective           ble with multiple rows and three tuples (I, Q, C),
solution to search through enormous multi-varied              indicating that the image I was clicked C times
datasets. A fairly common example of such cross-              against the search results of a given textual query
media retrieval is when an image is searched us-              Q. Wu et al. (Wu et al., 2016a) view the entire
ing a text query. Here the textual description of             dataset as a bipartite graph which has two types
the image content acts as the text query. However,            of vertices, queries and images respectively, and
it is not a trivial task to illustrate a non-textual vi-      the edge’s weight is assigned according to the to-
sual content using only text. Hence, a semantic               tal number of clicks from all the users. The au-
gap is introduced between the user needs and the              thors learn a common representation for both im-
given (existing) descriptions. Although in litera-            age and text query from the perspective of encod-
ture, a significant amount of work has been de-
voted to correlating textual and visual informa-                  1
                                                                      http://press.liacs.nl/mmgrand/microsoft.pdf


    Copyright c 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 Interna-
tional (CC BY 4.0).
ing the explicit/implicit relevance relationship be-    2   Related Works
tween the vertices in the click graph. The common
representation is obtained as well as any unseen        A fundamental image retrieval technique is to
query or image is dealt with by reducing the trun-      search for images by textual queries. The con-
cated random walk loss and the distance between         ventional image search engines leverage the ben-
the learned representation of vertices and their cor-   efits of associated or surrounding text which are
responding deep neural network output.                  generally collected from the data like image cap-
Thus, the relevance relationship between a text         tions, tags, comments etc. To train such systems
query and its resulting image is obtained purely by     by labeled text-image pairs human intervention is
measuring the distance between their correspond-        necessary. However, such human labeling is ex-
ing learned high-level feature representation. In       pensive, time-consuming and quite cumbersome.
other words, the cross-modal retrieval on the un-       These labeled data is unreliable as quite often they
seen queries and images is dealt with here in a         suffer from noise. Expressing an image entirely
content-based fashion. As these content-based           through a concise set of keywords, keyphrases
systems operate solely on feature representations,      or free text is a non-trivial task even for humans
the definition of similarity in these systems is fre-   let alone a system. The problem is compounded
quently ad-hoc and not explicitly optimized or          when the user has limited to no knowledge of how
generalized for any particular task i.e. here cross-    the search system or IR work. To alleviate these
modal retrieval. Frequently, the optimization of        problems of cross-view learning, the use of click-
similarity for ranking affects the quantity of in-      through data have gained momentum (Pan et al.,
terest. Thus the retrieved items often become           2014). Cross-View learning creates a common la-
coarsely abstracted or potentially irrelevant. To       tent subspace where the data from different modal-
overcome this shortfall, we try to capture relevant     ities like text, image etc., can be compared with
similarity information expressed by collaborative       each other easily.
filtering.                                              On the other hand, the click-through data is avail-
The motivation behind using collaborative filter-       able in huge amount and is relatively easy to ac-
ing (CF) is that this method produces user-specific     cess. Also, this click-through data helps in bet-
recommendations of items based on patterns of           ter understanding a query. In the work by Pan et
ratings or usage without the need for exogenous         al. (Pan et al., 2014), the distance between map-
information about either items or users (Koren and      pings of query and image in the latent subspace
Bell, 2015). Recommendation by collaborative fil-       is reduced and the inherent structure is preserved
tering relies on explicit or implicit feedback from     back to each original space. Once the mapping is
the user or indirectly obtained through observing       done, and the latent representations are acquired,
user behavior respectively. In our scenario, for any    the next step is to compute the distance among
unseen query, aside from relying on the similarity      these representations. Hence, choosing an ap-
score of the learned features, we can exploit some      propriate similarity function becomes crucial as
previous knowledge i.e. implicit feedback of the        it is the key to make the cross-modal similarity
user. We consider click counts as the implicit feed-    tractable. He et al. (He et al., 2016) propose
back from the user. Stemming from this observa-         a deep and bidirectional representation learning
tion, we predict the similarity structure encoded       model to address the issue of imagetext cross-
by collaborative filtering data. Finally, we use col-   modal retrieval. The authors adopt two convo-
laborative filtering for generating a ranked list for   lutional deep neural networks to extract seman-
cross-modal retrieval. To the best of our knowl-        tic representation from both raw image and text
edge, ours is the first attempt in using collabora-     data and calculate cosine distance among those.
tive filtering in conjunction with Deep Learning        Subsequently, a bidirectional network is learned
towards cross-modal image retrieval. A rigorous         from the matched and unmatched image-text pairs
experiment is carried out over the Clickture dataset    for training to capture the property of the cross-
and the experimental results validate our claim.        modal retrieval. This learning framework uses
Our method outperforms the current state of the         maximum likelihood criterion and optimizes the
art on the learning and content-based methods.          network through backpropagation and stochastic
                                                        gradient descent.
                                                        Similarly, in the paper by Wang et al. (Wang et al.,
2015), the authors propose a supervised frame-          modal ranking over the new images and queries
work based on a deep neural network which cap-          that are not involved in the training click graph.
tures the intra-modal and inter-modal relationships
efficiently. The proposed model requires only a         3.1      Obtaining feature vector representations
little prior knowledge to exploring high-level se-               of query and documents
mantic correlation and also it can tackle the situ-     Multimodal objects from different feature spaces
ation if any modality is missing. While most of         are present in the click graph. So, the first step of
the recent works focus on learning semantic rep-        our model consists of projecting a feature vector
resentation, the work by Wu et al. (Wu et al.,          representation of multimodal data into a common
2016b) concentrates on distance metric learning         dimensional space. Here, image and text are the
which is essential to improve similarity search for     two different sources of information where the di-
content-based retrieval. Usually, single modal dis-     mension of an image depends on the pixel inten-
tance metric learning methods suffer from some          sity, and the number of pixels present in the im-
critical issues such as choosing the dominant fea-      age and vocabulary size of bag-of-words denotes
ture from diverse feature representation, learning a    the dimension of the text. Let us say if M is the
distance metric on the combined high-dimensional        dimension of image feature and N is the dimen-
feature space which is very time-consuming etc..        sion of text feature then our objective is to come
To overcome these issues, the authors proposed          up with a common latent vector space of dimen-
a multi-modal distance metric learning scheme           sion D for both the image and the text.
called online multi-modal distance metric learning      We obtain a common representation through Con-
(OMDML), which can learn an optimized distance          volutional Neural Network (CNN). Some pre-
metric on each individual feature space and learn       trained model such as inception-v3 model2 can be
to find an optimal combination of diverse types of      used to learn the proper representation of input
features.                                               data. The learned representations account for the
Our work in this paper adopts the approach of us-       variations associated with the features. Any estab-
ing convolutional neural networks as suggested by       lished CNN model consists of layers like convolu-
He et al. (He et al., 2016). The reason for choos-      tional filtering, local contrast normalization, max-
ing this over other learning methods is that while      pooling and finally fully connected neural network
training stage might take longer than some other        layers. For our model, we eliminate the last fully
methods, CNN usually supersedes others when it          connected layer of the CNN (the output layer) and
comes to the accuracy of learning. It should be         retrieve the image vectors from the penultimate
noted that we have modified the settings of CNN         layer. This is done since we are not interested in
to fit our problem and adapted it to our needs.         the classification of the images but instead in the
                                                        generated embeddings of the images. The raw im-
3   Methodology                                         age features (embeddings) are directly fed into the
                                                        model to get a latent representation. However, the
Our model consists of two phases, training and          text queries are represented in vector space model
testing, as is the case with any learning based tech-   (bag-of-words). So, all words present in the vo-
nique. A click graph is used as labeled training        cabulary are inserted into a vector lookup table.
data where the number of click count is treated as      Finally, a D dimensional representation is learned
a label between a text query-image pair i.e. if any     for each word from the lookup table. Test query
click is present between any text query and image,      usually consists of multiple words. So, the entire
the image must be relevant to the text query. This      query can be represented by summing up all their
assumption stems from the fact that a user usually      corresponding word vectors. The obtained latent
clicks on an image against a text query only if she     representations are used for learning purpose as
finds it relevant and useful. Here, the number of       depicted in the next subsection.
click counts reinforces how strongly relevant the
image is against the query or vice versa. Thus          3.2      Learning from labeled click data
a labeled query-image pair is learned through the
                                                        In Recommendation System, Collaborative Filter-
click count. In the testing phase, relevant images
                                                        ing (CF) models capture the interaction between
from the test set are retrieved against any given
test query and ranked. Hence, we perform cross-            2
                                                               https://www.kaggle.com/google-brain/inception-v3
users and item based on the rating. A rating indi-     3.4      Calculating similarity score
cates the preference of an individual user towards
                                                       By incorporating predicted click count between
a particular item. High values of rating indicate a
                                                       the unseen query and each image in the dataset,
stronger preference of the user towards the partic-
                                                       we calculate the similarity between each pair of
ular item. The rating values are by nature either
                                                       the images. Let us consider, the predicted click
implicit or explicit feedback provided by the user
                                                       count for the two images iu and iv against the nth
or collected from user behavior or history. Per-
                                                       query qn as Rn,u and Rn,v respectively. Then the
ceiving some resemblance with the inherent nature
                                                       similarity measure between any two images, Su,v ,
of collaborative filtering with the characteristics
                                                       can be calculated by using the following Equation
of our dataset, we hypothesize that learning from
                                                       3:
clicked data through collaborating filtering may
increase the retrieval performance. In this sce-                        P                  ¯              ¯
                                                                           n∈I (Rn,u − Rn )(Rn,v − Rn )
nario, the text query and the images corresponding     Su,v = qP                              qP
                                                                                      ¯   2                   ¯ 2
to the query play the role of user and item respec-                   n∈I (Rn,u − Rn )           n∈I (Rn,v − Rn )
tively. We treat click-count of each query-image                                                        (3)
pair as an implicit rating and train our model from    where R¯n is the average click-count for those im-
these labeled query-image pairs and the rating ma-     ages and I denotes the entire image set. It can be
trix.                                                  observed from the above equation that the similar-
                                                       ity measure depends on how much the click-count
3.3 Prediction of click count for unseen query         for a pair of images deviates from the average rat-
                                                       ing for those images. So, the similarity measure
Model-based collaborative filter predicts users’       is purely dependent on the predicted click-count.
rating of unrated items. CF engines are more ver-      As stated earlier, we calculate all the similarities
satile, in the sense that they can be applied to any   between each pair of the images present in the
domain, and with some care could also provide          dataset and based on the similarity score we rank
cross-domain recommendations. Also, CF works           the images against the each query. Thus a final
best when the user space is large, which is the        ranked list is prepared and we select top-most im-
case for image searching where thousands of users      ages as the most relevant retrieved ones.
are looking for images over the web every second.
Taking a cue from this fact, we choose a model-        4       Experimental Setup
based collaborative approach to predict the click
count for an unseen query. The collaborative filter    In this section we list the possible requirements for
tries to predict ratings or click counts by charac-    the experiment. To run Convolutional Neural Net-
terizing both the query and image. Let us consider     work for learning the latent representations over
that the learned latent vector for any query q is Vq   the large set of images, we have used cloud com-
and the learned latent vector for any image i is Vi    puting services of Google Cloud TPU 3 through 10
such that for a given query q, R measure the ex-       different instances. The learning is done by a pre-
tent of relevance the query has with images that are   trained model through ImageNet4 i.e. inception-v3
highly relevant. The interaction R between query       model5 . The last fully connected layer of the Con-
q and image i can be captured by the following         voluted Neural Network, i.e. the penultimate layer
Equation 1:                                            of the CNN, is extracted using TensorFlow6 . The
                                                       dimension of all the learnt image vectors are kept
                    R = VqT Vi                  (1)
                                                       to 2048. The other libraries which aid this process
where, the dot product between two vectors x, y ∈      are NumPy7 , SciPy8 , scikit-learn9 , pickle10 etc.
Rf is defined as in Equation 2.                            3
                                                            https://cloud.google.com/tpu/
                                                           4
                                                            www.image-net.org/
                                                          5
                          f
                          X                                 https://cloud.google.com/tpu/docs/inception-v3-
                 xT y =         xk yk           (2)    advanced
                                                          6
                          k=1                               https://www.tensorflow.org/
                                                          7
                                                            https://www.numpy.org/
                                                          8
                                                            https://www.scipy.org/
Thus, the predicted click count becomes R which           9
                                                            https://scikit-learn.org/stable/
is calculated using the Equation 1.                      10
                                                            https://docs.python.org/3/library/pickle.html
                         Figure 1: An example of the subgraph of the click graph


Dataset Our experiments are performed over an           5        Results and Analysis
established real world dataset, Clickture 2014 (Mi-
crosoft, 2014), released by Microsoft as part of        In this work, each image is ranked by its respec-
an Image Retrieval Challenge in 2015. Com-              tive Discounted Cumulated Gain (DCG) measure
mercial image search engines like Google, Bing          against the test queries. To calculate DCG, we sort
record clicks against queries to capture the user       the images against each query based on the final
behaviour. Insightful usage of the recorded click-      similarity score, obtained from our process. DCG
logs may lead to better cross-modal retrieval. The      for each query is computed as depicted in the fol-
dataset comprises of two parts: (a) the training        lowing Equation 4. This metric was the official
dataset and (b) the testing dataset or Dev dataset.     metric for the MSR- Bing Image Retrieval Chal-
The training dataset, consists of 1 million images      lenge 2014 and 201511 .
and 11.7 million unique queries, is a sample of
                                                                                          25
                                                                                          X
user click log which is a large table consisting of                                          2reli − 1
                                                                       DCG25 = Z25                                 (4)
text queries, its associated images and number of                                         i=1
                                                                                                log2 (i + 1)
of clicks for each query-image pair. An example
of a subgraph of the Clickture dataset is depicted      where reli = {Excelent = 3; Good = 2; Bad =
in Figure 1.                                            0} is the manually judged relevance for each im-
                                                        age with respect to the query, and Z25 = 0.01757
   The click count between an image and a query is      is a normalizer to make the score for 25 Excellent
calculated from different users at different times.     results. Here, we report the final evaluation metric
There are at least 23.1 million query-image pairs       as the average of for all queries present in the test
which have click count equal or more than 1.            set. We choose the comparative methods (base-
The Dev Dataset, which has 79, 926 query-image          lines) against our proposed system from the base
pairs generated from 1, 000 queries, is composed        paper (Wu et al., 2016a). The comparative meth-
to have consistent query distribution, judgment         ods are as follows:
guidelines and quality of a test dataset. For perfor-
mance evaluation, manually annotated relevance              1. Bag-of-Words similarity               based     ranking
measurement which is purely qualitative (labeled               method (BoWDNN-R)
as Excellent, Good and Bad) is provided with the
dataset.                                                    11
                                                                 http://press.liacs.nl/mmgrand/microsoft.pdf
                                          Model                     DCG
                                        BoWDNN-R                    50.89
                                           CCL                      50.59
                                          PAMIR                     50.17
                                            PSI                     49.91
                                         CMRNN                      50.71
                                         MRW-NN                     51.04
                               Our Proposed Model (CNN+CF)         79.88p

                       Table 1: A comparison of various image retrieval methods


    2. Click-through-based Cross-view Learning         rior performance when compared to unimodal re-
       (CCL)                                           trieval systems, the problem of capturing the user
                                                       information need ideally continues to be a chal-
    3. Passive-Aggressive Model for Image Re-          lenge. Click counts offer a new dimension to aid in
       trieval (PAMIR)                                 better understanding user’s information need con-
                                                       cerning images and when used judiciously can sig-
    4. Polynomial Semantic Indexing (PSI)
                                                       nificantly improve the corresponding IR’s perfor-
    5. Cross-Model Ranking Neural Network (CM-         mance. In this paper, we have applied the knowl-
       RNN) and                                        edge embedded within the clicks by using a col-
                                                       laborative filtering technique as an implicit feed-
    6. Multimodal Random Walk Neural Network           back mechanism to enhance the latent representa-
       (MRW-NN) respectively.                          tion based similarity computation. Our proposed
                                                       technique performs superlatively against the state-
From Table 1, we can conclude that there is            of-the-art baseline over a real-world dataset.
marked improvement in terms of retrieval per-          As part of our future work, we would like to ad-
formance when compared to other state-of-the-art       dress the irregularities in the retrieval mechanism
techniques. The gain in terms of DCG is also sta-      in the absence of modalities and would like to
tistically significant (indicated with a superscript   explore and suggest techniques to handle com-
p). Hence, we can safely conclude that consid-         mon problems associated with collaborative fil-
ering click-counts as ratings and formulating the      tering methods. We would also like to study
problem of image retrieval as item recommenda-         our method’s effectiveness and scalability for real-
tion yields significantly better performance. The      time data. This work is a part of a larger project
possible reason why our system performs better         where we aim to integrate the retrieval model with
than others could be attributed to the fact that we    image classification and automatic image anno-
did not rely on either latent semantic representa-     tation techniques proposed by us in our earlier
tion based learning or collaborative filtering in-     works.
dividually. Instead, we proposed a new model
that incorporates the learned feature representa-
tions from CNN as user and items, which possibly       References
negated the shortfall of both techniques.              Fangxiang Feng, Xiaojie Wang, and Ruifan Li. 2014.
                                                         Cross-modal Retrieval with Correspondence Au-
6    Conclusion and Future Work                          toencoder.     In Proceedings of the 22Nd ACM
                                                         International Conference on Multimedia. ACM,
Image retrieval has been one of the focal points         New York, NY, USA, MM ’14, pages 7–16.
of information retrieval systems since the early         https://doi.org/10.1145/2647868.2654902.
days of computing. Recent techniques have fo-          Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan.
cused on various learning techniques to minimize         2016.       Cross-Modal Retrieval via Deep and
the semantic gap between the query intent of users       Bidirectional Representation Learning.    IEEE
and the actual information retrieved by IRs. The         Transactions on Multimedia 18(7):1363–1377.
                                                         https://doi.org/10.1109/TMM.2016.2558463.
same applies to image retrieval as well. While hy-
brid and multi-modal systems have shown supe-          Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-
  Semantic Alignments for Generating Image De-          2015 IEEE 27th International Conference on Tools
  scriptions. In The IEEE Conference on Computer        with Artificial Intelligence (ICTAI). pages 234–241.
  Vision and Pattern Recognition (CVPR).                https://doi.org/10.1109/ICTAI.2015.45.
Yehuda Koren and Robert Bell. 2015. Advances in
  Collaborative Filtering, Springer US, Boston, MA,   Fei Wu, Xinyan Lu, Jun Song, Shuicheng Yan,
  pages 77–118. https://doi.org/10.1007/978-1-4899-     Zhongfei Mark Zhang, Yong Rui, and Yuet-
  7637-6 3.                                             ing Zhuang. 2016a.            Learning of Multi-
                                                        modal Representations With Random Walks
Microsoft. 2014. Clickture project. https://            on the Click Graph.          IEEE transactions on
  www.microsoft.com/en-us/research/                     image processing :          a publication of the
  project/clickture/. Accessed: 2019-04-22.             IEEE Signal Processing Society 25(2):630642.
                                                        https://doi.org/10.1109/tip.2015.2507401.
Yingwei Pan, Ting Yao, Tao Mei, Houqiang Li,
  Chong-Wah Ngo, and Yong Rui. 2014. Click-
  through-based Cross-view Learning for Image         P. Wu, S. C. H. Hoi, P. Zhao, C. Miao, and Z. Y.
  Search. In Proceedings of the 37th International       Liu. 2016b.         Online Multi-Modal Distance
  ACM SIGIR Conference on Research &#38; De-             Metric Learning with Application to Image
  velopment in Information Retrieval. ACM, New           Retrieval.       IEEE Transactions on Knowl-
  York, NY, USA, SIGIR ’14, pages 717–726.               edge and Data Engineering 28(2):454–467.
  https://doi.org/10.1145/2600428.2609568.               https://doi.org/10.1109/TKDE.2015.2477296.

Scott Reed, Zeynep Akata, Honglak Lee, and Bernt
  Schiele. 2016. Learning Deep Representations of     Yi Zhen and Dit-Yan Yeung. 2012.           A Prob-
  Fine-Grained Visual Descriptions. In The IEEE         abilistic Model for Multimodal Hash Function
  Conference on Computer Vision and Pattern Recog-      Learning.      In Proceedings of the 18th ACM
  nition (CVPR).                                        SIGKDD International Conference on Knowl-
                                                        edge Discovery and Data Mining. ACM, New
C. Wang, H. Yang, and C. Meinel. 2015. Deep Se-         York, NY, USA, KDD ’12, pages 940–948.
  mantic Mapping for Cross-Modal Retrieval. In          https://doi.org/10.1145/2339530.2339678.