Your Click Matters: Enhancing Click-based Image Retrieval performance through Collaborative Filtering Deepanwita Datta Manajit Chakraborty Aveek Biswas NTNU Università della Svizzera italiana UCSD Trondheim, Norway Lugano, Switzerland California, USA ddatta.rs.cse13@itbhu.ac.in chakrm@usi.ch a4biswas@ucsd.edu Abstract tion to bridge the semantic gap between the high- level information needs of users and commonly Image retrieval has been an active research employed low-level features, it continues to be a area since the early days of computing. major challenge. The existing state-of-the-art so- While ensemble, multimodal and hybrid lutions to this challenge is two-pronged. Few of methods coupled with machine learning the existing works (Feng et al., 2014; Zhen and Ye- has seen an upward surge replacing uni- ung, 2012) stress on learning mapping functions modal, heuristic-based methods; a rather whereas rest of the works explore the high-level new offshoot has been to identify new fea- semantic representation of modalities (Karpathy tures associated with images on the web. and Fei-Fei, 2015; Reed et al., 2016). Among One such feature is the ‘click count’ based these, semantic representation based approaches on the clicks an image or its correspond- and deep learning based approaches have gained ing text gets in response to a query. Previ- reasonable success. Deep Convolutional Neural ous state-of-the-art methods have tried to Networks (CNNs) are used to learn the latent fea- exploit this feature by using its raw count tures and these learned features are utilized as vi- and machine learning. In this paper, we sual and textual semantic representation in these build on this idea and propose a new col- models. laborative filtering based technique to em- In ACM Multimedia 2015 MSR-Bing Image Re- ploy the click-log of users from the web to trieval Challenge1 , it was argued that the massive better identify and associate images in re- amount of click data from commercial search en- sponse to either a text or an image query. gines provide a data set that is unique in bridging Experiments performed on a large scale the semantic and intent gap. Millions of click data publicly available standard dataset having i.e. clicked image-query pairs, generated from genuine click logs from actual users cor- search engines, were collected and released pub- roborate the efficacy and significant in- licly as a new large-scale real-world image click crease in efficiency of our approach. data (Clickture) to investigate how to effectively leverage this click-count based data to mitigate the 1 Introduction semantic gap. This click data is stored in a large ta- Cross-media retrieval has proven to be an effective ble with multiple rows and three tuples (I, Q, C), solution to search through enormous multi-varied indicating that the image I was clicked C times datasets. A fairly common example of such cross- against the search results of a given textual query media retrieval is when an image is searched us- Q. Wu et al. (Wu et al., 2016a) view the entire ing a text query. Here the textual description of dataset as a bipartite graph which has two types the image content acts as the text query. However, of vertices, queries and images respectively, and it is not a trivial task to illustrate a non-textual vi- the edge’s weight is assigned according to the to- sual content using only text. Hence, a semantic tal number of clicks from all the users. The au- gap is introduced between the user needs and the thors learn a common representation for both im- given (existing) descriptions. Although in litera- age and text query from the perspective of encod- ture, a significant amount of work has been de- voted to correlating textual and visual informa- 1 http://press.liacs.nl/mmgrand/microsoft.pdf Copyright c 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 Interna- tional (CC BY 4.0). ing the explicit/implicit relevance relationship be- 2 Related Works tween the vertices in the click graph. The common representation is obtained as well as any unseen A fundamental image retrieval technique is to query or image is dealt with by reducing the trun- search for images by textual queries. The con- cated random walk loss and the distance between ventional image search engines leverage the ben- the learned representation of vertices and their cor- efits of associated or surrounding text which are responding deep neural network output. generally collected from the data like image cap- Thus, the relevance relationship between a text tions, tags, comments etc. To train such systems query and its resulting image is obtained purely by by labeled text-image pairs human intervention is measuring the distance between their correspond- necessary. However, such human labeling is ex- ing learned high-level feature representation. In pensive, time-consuming and quite cumbersome. other words, the cross-modal retrieval on the un- These labeled data is unreliable as quite often they seen queries and images is dealt with here in a suffer from noise. Expressing an image entirely content-based fashion. As these content-based through a concise set of keywords, keyphrases systems operate solely on feature representations, or free text is a non-trivial task even for humans the definition of similarity in these systems is fre- let alone a system. The problem is compounded quently ad-hoc and not explicitly optimized or when the user has limited to no knowledge of how generalized for any particular task i.e. here cross- the search system or IR work. To alleviate these modal retrieval. Frequently, the optimization of problems of cross-view learning, the use of click- similarity for ranking affects the quantity of in- through data have gained momentum (Pan et al., terest. Thus the retrieved items often become 2014). Cross-View learning creates a common la- coarsely abstracted or potentially irrelevant. To tent subspace where the data from different modal- overcome this shortfall, we try to capture relevant ities like text, image etc., can be compared with similarity information expressed by collaborative each other easily. filtering. On the other hand, the click-through data is avail- The motivation behind using collaborative filter- able in huge amount and is relatively easy to ac- ing (CF) is that this method produces user-specific cess. Also, this click-through data helps in bet- recommendations of items based on patterns of ter understanding a query. In the work by Pan et ratings or usage without the need for exogenous al. (Pan et al., 2014), the distance between map- information about either items or users (Koren and pings of query and image in the latent subspace Bell, 2015). Recommendation by collaborative fil- is reduced and the inherent structure is preserved tering relies on explicit or implicit feedback from back to each original space. Once the mapping is the user or indirectly obtained through observing done, and the latent representations are acquired, user behavior respectively. In our scenario, for any the next step is to compute the distance among unseen query, aside from relying on the similarity these representations. Hence, choosing an ap- score of the learned features, we can exploit some propriate similarity function becomes crucial as previous knowledge i.e. implicit feedback of the it is the key to make the cross-modal similarity user. We consider click counts as the implicit feed- tractable. He et al. (He et al., 2016) propose back from the user. Stemming from this observa- a deep and bidirectional representation learning tion, we predict the similarity structure encoded model to address the issue of imagetext cross- by collaborative filtering data. Finally, we use col- modal retrieval. The authors adopt two convo- laborative filtering for generating a ranked list for lutional deep neural networks to extract seman- cross-modal retrieval. To the best of our knowl- tic representation from both raw image and text edge, ours is the first attempt in using collabora- data and calculate cosine distance among those. tive filtering in conjunction with Deep Learning Subsequently, a bidirectional network is learned towards cross-modal image retrieval. A rigorous from the matched and unmatched image-text pairs experiment is carried out over the Clickture dataset for training to capture the property of the cross- and the experimental results validate our claim. modal retrieval. This learning framework uses Our method outperforms the current state of the maximum likelihood criterion and optimizes the art on the learning and content-based methods. network through backpropagation and stochastic gradient descent. Similarly, in the paper by Wang et al. (Wang et al., 2015), the authors propose a supervised frame- modal ranking over the new images and queries work based on a deep neural network which cap- that are not involved in the training click graph. tures the intra-modal and inter-modal relationships efficiently. The proposed model requires only a 3.1 Obtaining feature vector representations little prior knowledge to exploring high-level se- of query and documents mantic correlation and also it can tackle the situ- Multimodal objects from different feature spaces ation if any modality is missing. While most of are present in the click graph. So, the first step of the recent works focus on learning semantic rep- our model consists of projecting a feature vector resentation, the work by Wu et al. (Wu et al., representation of multimodal data into a common 2016b) concentrates on distance metric learning dimensional space. Here, image and text are the which is essential to improve similarity search for two different sources of information where the di- content-based retrieval. Usually, single modal dis- mension of an image depends on the pixel inten- tance metric learning methods suffer from some sity, and the number of pixels present in the im- critical issues such as choosing the dominant fea- age and vocabulary size of bag-of-words denotes ture from diverse feature representation, learning a the dimension of the text. Let us say if M is the distance metric on the combined high-dimensional dimension of image feature and N is the dimen- feature space which is very time-consuming etc.. sion of text feature then our objective is to come To overcome these issues, the authors proposed up with a common latent vector space of dimen- a multi-modal distance metric learning scheme sion D for both the image and the text. called online multi-modal distance metric learning We obtain a common representation through Con- (OMDML), which can learn an optimized distance volutional Neural Network (CNN). Some pre- metric on each individual feature space and learn trained model such as inception-v3 model2 can be to find an optimal combination of diverse types of used to learn the proper representation of input features. data. The learned representations account for the Our work in this paper adopts the approach of us- variations associated with the features. Any estab- ing convolutional neural networks as suggested by lished CNN model consists of layers like convolu- He et al. (He et al., 2016). The reason for choos- tional filtering, local contrast normalization, max- ing this over other learning methods is that while pooling and finally fully connected neural network training stage might take longer than some other layers. For our model, we eliminate the last fully methods, CNN usually supersedes others when it connected layer of the CNN (the output layer) and comes to the accuracy of learning. It should be retrieve the image vectors from the penultimate noted that we have modified the settings of CNN layer. This is done since we are not interested in to fit our problem and adapted it to our needs. the classification of the images but instead in the generated embeddings of the images. The raw im- 3 Methodology age features (embeddings) are directly fed into the model to get a latent representation. However, the Our model consists of two phases, training and text queries are represented in vector space model testing, as is the case with any learning based tech- (bag-of-words). So, all words present in the vo- nique. A click graph is used as labeled training cabulary are inserted into a vector lookup table. data where the number of click count is treated as Finally, a D dimensional representation is learned a label between a text query-image pair i.e. if any for each word from the lookup table. Test query click is present between any text query and image, usually consists of multiple words. So, the entire the image must be relevant to the text query. This query can be represented by summing up all their assumption stems from the fact that a user usually corresponding word vectors. The obtained latent clicks on an image against a text query only if she representations are used for learning purpose as finds it relevant and useful. Here, the number of depicted in the next subsection. click counts reinforces how strongly relevant the image is against the query or vice versa. Thus 3.2 Learning from labeled click data a labeled query-image pair is learned through the In Recommendation System, Collaborative Filter- click count. In the testing phase, relevant images ing (CF) models capture the interaction between from the test set are retrieved against any given test query and ranked. Hence, we perform cross- 2 https://www.kaggle.com/google-brain/inception-v3 users and item based on the rating. A rating indi- 3.4 Calculating similarity score cates the preference of an individual user towards By incorporating predicted click count between a particular item. High values of rating indicate a the unseen query and each image in the dataset, stronger preference of the user towards the partic- we calculate the similarity between each pair of ular item. The rating values are by nature either the images. Let us consider, the predicted click implicit or explicit feedback provided by the user count for the two images iu and iv against the nth or collected from user behavior or history. Per- query qn as Rn,u and Rn,v respectively. Then the ceiving some resemblance with the inherent nature similarity measure between any two images, Su,v , of collaborative filtering with the characteristics can be calculated by using the following Equation of our dataset, we hypothesize that learning from 3: clicked data through collaborating filtering may increase the retrieval performance. In this sce- P ¯ ¯ n∈I (Rn,u − Rn )(Rn,v − Rn ) nario, the text query and the images corresponding Su,v = qP qP ¯ 2 ¯ 2 to the query play the role of user and item respec- n∈I (Rn,u − Rn ) n∈I (Rn,v − Rn ) tively. We treat click-count of each query-image (3) pair as an implicit rating and train our model from where R¯n is the average click-count for those im- these labeled query-image pairs and the rating ma- ages and I denotes the entire image set. It can be trix. observed from the above equation that the similar- ity measure depends on how much the click-count 3.3 Prediction of click count for unseen query for a pair of images deviates from the average rat- ing for those images. So, the similarity measure Model-based collaborative filter predicts users’ is purely dependent on the predicted click-count. rating of unrated items. CF engines are more ver- As stated earlier, we calculate all the similarities satile, in the sense that they can be applied to any between each pair of the images present in the domain, and with some care could also provide dataset and based on the similarity score we rank cross-domain recommendations. Also, CF works the images against the each query. Thus a final best when the user space is large, which is the ranked list is prepared and we select top-most im- case for image searching where thousands of users ages as the most relevant retrieved ones. are looking for images over the web every second. Taking a cue from this fact, we choose a model- 4 Experimental Setup based collaborative approach to predict the click count for an unseen query. The collaborative filter In this section we list the possible requirements for tries to predict ratings or click counts by charac- the experiment. To run Convolutional Neural Net- terizing both the query and image. Let us consider work for learning the latent representations over that the learned latent vector for any query q is Vq the large set of images, we have used cloud com- and the learned latent vector for any image i is Vi puting services of Google Cloud TPU 3 through 10 such that for a given query q, R measure the ex- different instances. The learning is done by a pre- tent of relevance the query has with images that are trained model through ImageNet4 i.e. inception-v3 highly relevant. The interaction R between query model5 . The last fully connected layer of the Con- q and image i can be captured by the following voluted Neural Network, i.e. the penultimate layer Equation 1: of the CNN, is extracted using TensorFlow6 . The dimension of all the learnt image vectors are kept R = VqT Vi (1) to 2048. The other libraries which aid this process where, the dot product between two vectors x, y ∈ are NumPy7 , SciPy8 , scikit-learn9 , pickle10 etc. Rf is defined as in Equation 2. 3 https://cloud.google.com/tpu/ 4 www.image-net.org/ 5 f X https://cloud.google.com/tpu/docs/inception-v3- xT y = xk yk (2) advanced 6 k=1 https://www.tensorflow.org/ 7 https://www.numpy.org/ 8 https://www.scipy.org/ Thus, the predicted click count becomes R which 9 https://scikit-learn.org/stable/ is calculated using the Equation 1. 10 https://docs.python.org/3/library/pickle.html Figure 1: An example of the subgraph of the click graph Dataset Our experiments are performed over an 5 Results and Analysis established real world dataset, Clickture 2014 (Mi- crosoft, 2014), released by Microsoft as part of In this work, each image is ranked by its respec- an Image Retrieval Challenge in 2015. Com- tive Discounted Cumulated Gain (DCG) measure mercial image search engines like Google, Bing against the test queries. To calculate DCG, we sort record clicks against queries to capture the user the images against each query based on the final behaviour. Insightful usage of the recorded click- similarity score, obtained from our process. DCG logs may lead to better cross-modal retrieval. The for each query is computed as depicted in the fol- dataset comprises of two parts: (a) the training lowing Equation 4. This metric was the official dataset and (b) the testing dataset or Dev dataset. metric for the MSR- Bing Image Retrieval Chal- The training dataset, consists of 1 million images lenge 2014 and 201511 . and 11.7 million unique queries, is a sample of 25 X user click log which is a large table consisting of 2reli − 1 DCG25 = Z25 (4) text queries, its associated images and number of i=1 log2 (i + 1) of clicks for each query-image pair. An example of a subgraph of the Clickture dataset is depicted where reli = {Excelent = 3; Good = 2; Bad = in Figure 1. 0} is the manually judged relevance for each im- age with respect to the query, and Z25 = 0.01757 The click count between an image and a query is is a normalizer to make the score for 25 Excellent calculated from different users at different times. results. Here, we report the final evaluation metric There are at least 23.1 million query-image pairs as the average of for all queries present in the test which have click count equal or more than 1. set. We choose the comparative methods (base- The Dev Dataset, which has 79, 926 query-image lines) against our proposed system from the base pairs generated from 1, 000 queries, is composed paper (Wu et al., 2016a). The comparative meth- to have consistent query distribution, judgment ods are as follows: guidelines and quality of a test dataset. For perfor- mance evaluation, manually annotated relevance 1. Bag-of-Words similarity based ranking measurement which is purely qualitative (labeled method (BoWDNN-R) as Excellent, Good and Bad) is provided with the dataset. 11 http://press.liacs.nl/mmgrand/microsoft.pdf Model DCG BoWDNN-R 50.89 CCL 50.59 PAMIR 50.17 PSI 49.91 CMRNN 50.71 MRW-NN 51.04 Our Proposed Model (CNN+CF) 79.88p Table 1: A comparison of various image retrieval methods 2. Click-through-based Cross-view Learning rior performance when compared to unimodal re- (CCL) trieval systems, the problem of capturing the user information need ideally continues to be a chal- 3. Passive-Aggressive Model for Image Re- lenge. Click counts offer a new dimension to aid in trieval (PAMIR) better understanding user’s information need con- cerning images and when used judiciously can sig- 4. Polynomial Semantic Indexing (PSI) nificantly improve the corresponding IR’s perfor- 5. Cross-Model Ranking Neural Network (CM- mance. In this paper, we have applied the knowl- RNN) and edge embedded within the clicks by using a col- laborative filtering technique as an implicit feed- 6. Multimodal Random Walk Neural Network back mechanism to enhance the latent representa- (MRW-NN) respectively. tion based similarity computation. Our proposed technique performs superlatively against the state- From Table 1, we can conclude that there is of-the-art baseline over a real-world dataset. marked improvement in terms of retrieval per- As part of our future work, we would like to ad- formance when compared to other state-of-the-art dress the irregularities in the retrieval mechanism techniques. The gain in terms of DCG is also sta- in the absence of modalities and would like to tistically significant (indicated with a superscript explore and suggest techniques to handle com- p). Hence, we can safely conclude that consid- mon problems associated with collaborative fil- ering click-counts as ratings and formulating the tering methods. We would also like to study problem of image retrieval as item recommenda- our method’s effectiveness and scalability for real- tion yields significantly better performance. The time data. This work is a part of a larger project possible reason why our system performs better where we aim to integrate the retrieval model with than others could be attributed to the fact that we image classification and automatic image anno- did not rely on either latent semantic representa- tation techniques proposed by us in our earlier tion based learning or collaborative filtering in- works. dividually. Instead, we proposed a new model that incorporates the learned feature representa- tions from CNN as user and items, which possibly References negated the shortfall of both techniques. Fangxiang Feng, Xiaojie Wang, and Ruifan Li. 2014. Cross-modal Retrieval with Correspondence Au- 6 Conclusion and Future Work toencoder. In Proceedings of the 22Nd ACM International Conference on Multimedia. ACM, Image retrieval has been one of the focal points New York, NY, USA, MM ’14, pages 7–16. of information retrieval systems since the early https://doi.org/10.1145/2647868.2654902. days of computing. Recent techniques have fo- Y. He, S. Xiang, C. Kang, J. Wang, and C. Pan. cused on various learning techniques to minimize 2016. Cross-Modal Retrieval via Deep and the semantic gap between the query intent of users Bidirectional Representation Learning. IEEE and the actual information retrieved by IRs. The Transactions on Multimedia 18(7):1363–1377. https://doi.org/10.1109/TMM.2016.2558463. same applies to image retrieval as well. While hy- brid and multi-modal systems have shown supe- Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual- Semantic Alignments for Generating Image De- 2015 IEEE 27th International Conference on Tools scriptions. In The IEEE Conference on Computer with Artificial Intelligence (ICTAI). pages 234–241. Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/ICTAI.2015.45. Yehuda Koren and Robert Bell. 2015. Advances in Collaborative Filtering, Springer US, Boston, MA, Fei Wu, Xinyan Lu, Jun Song, Shuicheng Yan, pages 77–118. https://doi.org/10.1007/978-1-4899- Zhongfei Mark Zhang, Yong Rui, and Yuet- 7637-6 3. ing Zhuang. 2016a. Learning of Multi- modal Representations With Random Walks Microsoft. 2014. Clickture project. https:// on the Click Graph. IEEE transactions on www.microsoft.com/en-us/research/ image processing : a publication of the project/clickture/. Accessed: 2019-04-22. IEEE Signal Processing Society 25(2):630642. https://doi.org/10.1109/tip.2015.2507401. Yingwei Pan, Ting Yao, Tao Mei, Houqiang Li, Chong-Wah Ngo, and Yong Rui. 2014. Click- through-based Cross-view Learning for Image P. Wu, S. C. H. Hoi, P. Zhao, C. Miao, and Z. Y. Search. In Proceedings of the 37th International Liu. 2016b. Online Multi-Modal Distance ACM SIGIR Conference on Research & De- Metric Learning with Application to Image velopment in Information Retrieval. ACM, New Retrieval. IEEE Transactions on Knowl- York, NY, USA, SIGIR ’14, pages 717–726. edge and Data Engineering 28(2):454–467. https://doi.org/10.1145/2600428.2609568. https://doi.org/10.1109/TKDE.2015.2477296. Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016. Learning Deep Representations of Yi Zhen and Dit-Yan Yeung. 2012. A Prob- Fine-Grained Visual Descriptions. In The IEEE abilistic Model for Multimodal Hash Function Conference on Computer Vision and Pattern Recog- Learning. In Proceedings of the 18th ACM nition (CVPR). SIGKDD International Conference on Knowl- edge Discovery and Data Mining. ACM, New C. Wang, H. Yang, and C. Meinel. 2015. Deep Se- York, NY, USA, KDD ’12, pages 940–948. mantic Mapping for Cross-Modal Retrieval. In https://doi.org/10.1145/2339530.2339678.