CuratorNet: Visually-aware Recommendation of Art Images

Pablo Messina, Manuel Cartagena, Patricio Cerda, Felipe del Rio, Denis Parra
Pontificia Universidad Católica, Santiago, Chile
pamessina@uc.cl, micartagena@uc.cl, pcerdam@uc.cl, fidelrio@uc.cl, dparra@ing.puc.cl
Pablo Messina, Felipe del Rio and Denis Parra are also with the Millennium Institute Foundational Research on Data, IMFD.
Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Although there are several visually-aware recommendation models in domains like fashion or even movies, the art domain lacks the same level of research attention, despite the recent growth of the online artwork market. To reduce this gap, in this article we introduce CuratorNet, a neural network architecture for visually-aware recommendation of art images. CuratorNet is designed at the core with the goal of maximizing generalization: the network has a fixed set of parameters that only need to be trained once, and thereafter the model is able to generalize to new users or items never seen before, without further training. This is achieved by leveraging visual content: items are mapped to item vectors through visual embeddings, and users are mapped to user vectors by aggregating the visual content of items they have consumed. Besides the model architecture, we also introduce novel triplet sampling strategies to build a training set for rank learning in the art domain, resulting in more effective learning than naive random sampling. With an evaluation over a real-world dataset of physical paintings, we show that CuratorNet achieves the best performance among several baselines, including the state-of-the-art model VBPR. CuratorNet is motivated and evaluated in the art domain, but its architecture and training scheme could be adapted to recommend images in other areas.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Machine learning approaches; • Applied computing → Media arts.

KEYWORDS
recommender systems, neural networks, visual art
1 INTRODUCTION
The big revolution of deep convolutional neural networks (CNNs) in the area of computer vision for tasks such as image classification [18, 27, 41], object recognition [1], image segmentation [3] or scene identification [40] has reached the area of image recommender systems in recent years [19, 20, 22, 29, 30, 32]. These works use neural visual embeddings to improve recommendation performance compared to previous approaches for image recommendation based on ratings and text [2], social tags [39], context [5] and manually crafted visual features [43]. Regarding application domains of recent image recommendation methods using neural visual embeddings, to the best of our knowledge most of them focus on fashion recommendation [20, 22, 30], a few on art recommendation [19, 32] and photo recommendation [29]. He et al. [19] proposed Vista, a model combining neural visual embeddings, collaborative filtering, and temporal and social signals for digital art recommendation. However, digital art projects can differ significantly from physical art (paintings and photographs). Messina et al. [32] study recommendation of paintings in an online art store using a simple k-NN model based on neural visual features and metadata. Although memory-based models perform fairly well, model-based methods using neural visual features report better performance [19, 20] in the fashion domain, indicating room for improvement in this area, considering the growing sales in the global online artwork market (https://www.artsy.net/article/artsy-editorial-global-art-market-reached-674-billion-2018-6).

The most popular model-based method for image recommendation using neural visual embeddings is VBPR [20], a state-of-the-art model that integrates implicit feedback collaborative filtering with neural visual embeddings into a Bayesian Personalized Ranking (BPR) learning framework [33]. VBPR performs well, but it has some drawbacks. VBPR learns a latent embedding for each user and for each item, so new users cannot receive suggestions and new items cannot be recommended until re-training is carried out. An alternative is training a model such as YouTube's Deep Neural Recommender [8], which allows recommending to new users with little preference feedback and without additional model training. However, YouTube's model was trained on millions of user transactions and with large amounts of profile and contextual data, so it does not easily fit datasets that are small, with little user feedback or with little contextual and profile data.

In this work, we introduce a neural network for visually-aware recommendation of images focused on visual art, named CuratorNet, whose general structure can be seen in Figure 1. CuratorNet leverages neural image embeddings such as those obtained from CNNs [18, 27, 41] pre-trained on the ImageNet dataset (ILSVRC [36]). We train CuratorNet for ranking with triplets (P_u, i+, j−), where P_u is the history of image preferences of a user u, whereas i+ and j− are a pair of images with higher and lower preference, respectively. CuratorNet draws inspiration from VBPR [20] and YouTube's Recommender System [8]. VBPR [20] inspired us to leverage pre-trained image embeddings and to optimize the model for ranking as in BPR [33]. From the work of Covington et al. [8] we took the idea of designing a deep neural network that can generalize to new users without introducing new parameters or further training (unlike VBPR, which needs to learn a latent user vector for each new user). As a result, CuratorNet can recommend to new users with very little feedback and without additional training. CuratorNet's deep neural network is trained for personalized ranking using triplets, and the architecture contains a set of layers with shared weights, inspired by models using triplet loss for non-personalized image ranking [38, 44]. In those works a single image represents the input query, but in our case the input query is a set of images representing a user preference history, P_u. In summary, compared to previous works, our main contributions are:

• a novel neural-based visually-aware architecture for image recommendation,
• a set of sampling guidelines for the creation of the training dataset (triplets), which improve the performance of CuratorNet as well as VBPR with respect to random negative sampling, and
• a thorough evaluation of the method against competitive state-of-the-art methods (VisRank [22, 32] and VBPR [20]) on a dataset of purchases of physical art (paintings and photographs).

We also share the dataset of user transactions (https://drive.google.com/drive/folders/1Dk7_BRNtN_IL8r64xAo6GdOYEycivtLy), with hashed user and item IDs due to privacy requirements, as well as visual embeddings of the paintings' image files. One aspect to highlight about this research is that although the triplet sampling guidelines to build the BPR training set apply specifically to visual art, the architecture of CuratorNet can be used in other visual domains for image recommendation.
2 RELATED WORK
In this section we provide an overview of relevant related work, considering Artwork Recommender Systems (2.1) and Visually-aware Recommender Systems (2.2), as well as highlights of what differentiates our work from the existing literature (2.3).

2.1 Artwork Recommender Systems
With respect to artwork recommender systems, one of the first contributions was the CHIP Project [2]. The aim of the project was to build a recommendation system for the Rijksmuseum. The project used traditional techniques such as content-based filtering based on metadata provided by experts, as well as collaborative filtering based on users' ratings. Another similar but non-personalized system was m4art by Van den Broek et al. [43], who used color histograms to retrieve similar art images given a painting as input query.

Another important contribution is the work by Semeraro et al. [39], who introduced an artwork recommender system called FIRSt (Folksonomy-based Item Recommender syStem), which utilizes social tags given by experts and non-experts on over 65 paintings of the Vatican picture gallery. They did not employ visual features among their methods. Benouaret et al. [5] improved the state-of-the-art in artwork recommender systems using context obtained through a mobile application, with the aim of making museum tour recommendations more useful. Their content-based approach used ratings given by the users during the tour and metadata from the artworks rated, e.g., title or artist names.

Finally, the most recent works use neural image embeddings [19, 32]. He et al. [19] propose the system Vista, which addresses digital artwork recommendation based on pre-trained deep neural visual features, as well as temporal and social data. On the other hand, Messina et al. [32] address the recommendation of one-of-a-kind physical paintings, comparing the performance of metadata, manually-curated visual features, and neural visual embeddings. Messina et al. [32] recommend to users by computing a simple k-NN-based similarity score between users' purchased paintings and the paintings in the dataset, a method that Kang et al. [22] call VisRank.
2.2 Visually-aware Image Recommender Systems
In this section we survey works using visual features to recommend images. We also cite a few works using visual information to recommend non-image items, but these are less relevant for the present research.

Manually-engineered visual features extracted from images (texture, sharpness, brightness, etc.) have been used in several information filtering tasks, such as retrieval [28, 35, 43] and ranking [37]. More recently, interesting results have been shown for the use of low-level handcrafted stylistic visual features automatically extracted from video frames for content-based video recommendation [11]. Even better results are obtained when both stylistic visual features and annotated metadata are combined in a hybrid recommender, as shown in the work of Elahi et al. [13]. In a visually-aware setting not related to recommending images, Elsweiler et al. [14] used manually-crafted attractiveness visual features [37] in order to recommend healthy food recipes to users.

Another branch of visually-aware image recommender systems focuses on using neural embeddings to represent images [19, 20, 22, 29, 32]. The computer vision community has a long track record of successful systems based on neural networks for several tasks [1, 3, 18, 27, 40, 41]. This trend started from the outstanding performance of AlexNet [27] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC [36]), but the most notable implication is that neural image embeddings have shown impressive performance for transfer learning, i.e., for tasks different from the original one [10, 26]. Usually these neural image embeddings are obtained from CNN models such as AlexNet [27], VGG [41] and ResNet [18], among others. Motivated by these results, McAuley et al. [30] introduced an image-based recommendation system based on styles and substitutes for clothing, using visual embeddings pre-trained on a large-scale dataset obtained from Amazon.com. Later, He et al. [20] went further in this line of research and introduced a visually-aware matrix factorization approach that incorporates visual signals (from a pre-trained CNN) into predictors of people's opinions, called VBPR. Their training model is based on Bayesian Personalized Ranking (BPR), a model previously introduced by Rendle et al. [33]. The next work by He et al. [19] deals with visually-aware digital art recommendation, building a model called Vista which combines ratings, temporal and social signals, and visual features.

Another relevant work is the research by Lei et al. [29], who introduced comparative deep learning for hybrid image recommendation. In this work, they use a siamese neural network architecture for making recommendations of images using user information (such as demographics and social tags) as well as images in pairs (one liked, one disliked) in order to build a ranking model. The approach is interesting, but they work with Flickr photos, not artwork images, and use social tags, which are not present in our problem setting. The work by Kang et al. [22] expands VBPR, but they focus on generating images using generative adversarial networks [17] rather than recommending, with an application in the fashion domain. Finally, Messina et al. [32] was already mentioned, but we can add that their neural image embeddings outperformed other visual (manually-extracted) and metadata features for ranking, with the exception of the metadata given by the user's favorite artist, which predicted even better than neural embeddings for top-k recommendation.

2.3 Differences to Previous Research
Almost all the surveyed articles on artwork recommendation have in common that they used standard techniques such as collaborative filtering and content-based filtering, as well as manually-curated visual image features; only the most recent works have exploited visual features extracted from CNNs [19, 32]. In comparison to these works, we introduce a model-based approach (unlike the memory-based VisRank method by Messina et al. [32]) which recommends to cold-start items and users without additional model training (unlike [19]). With regard to more general work on visually-aware image recommender systems, almost all of the surveyed articles have focused on tasks different from art recommendation, such as fashion recommendation [20, 22, 30], photo [29] and video recommendation [13]. Only Vista, the work by He et al. [19], resembles ours in terms of the topic (art recommendation) and the use of visual features. Unlike them, we evaluate our proposed method, CuratorNet, on a dataset of physical paintings and photographs, not only digital art. Moreover, Vista uses social and temporal metadata which we do not have and many other datasets might not have either. Compared to all this previous research, and to the best of our knowledge, CuratorNet is the first architecture for image recommendation that takes advantage of shared weights in a triplet loss setting, an idea inspired by the results of Wang et al. [44] and Schroff et al. [38], but here adapted to the personalized image recommendation domain.
3 CURATORNET
3.1 Problem Formulation
We approach the problem of recommending art images from user positive-only feedback (e.g., purchase history, likes, etc.) upon visual items (paintings, photographs, etc.). Let U and I be the set of users and items in a dataset, respectively. We assume exactly one image per item i ∈ I. Considering either user purchases or likes, the set of items for which a user u has expressed positive preference is defined as I_u^+. In this work, we consider purchases to be positive feedback from the user. Our goal is to generate for each user u ∈ U a personalized ranked list of the items for which the user has not yet expressed preference, i.e., for I \ I_u^+.

Table 1: Notation for CuratorNet.

Symbol          Description
U, I            user set, item set
u               a specific user
i, j            specific items
i+, j−          a positive item and a negative item (resp.)
I_u^+ or P_u    set of all items for which user u has expressed a positive preference (full history)
I_{u,k}^+       set of all items for which user u has expressed a positive preference up to his k-th purchase basket (inclusive)
P_{u,k}         set of all items for which user u has expressed a positive preference in his k-th purchase basket

3.2 Preference Predictor
The preference predictor in CuratorNet is inspired by VBPR [20], a state-of-the-art visual recommender model. However, CuratorNet has some important differences. First, we do not use non-visual latent factors, so we remove the traditional user and item non-visual latent embeddings. Second, we do not learn a specific embedding per user as VBPR does; instead, we learn a joint model that, given a user's purchase/like history, outputs a single embedding which can be used to rank the unobserved artworks in the dataset, similar to YouTube's Deep Learning network [8]. Another important difference between VBPR and CuratorNet is that the former has a single matrix E to project a visual item embedding f_i into the user latent space, whereas in CuratorNet we learn a neural network Φ(·) to perform that projection, which receives as input either a single image embedding f_i or a set of image embeddings representing a user's purchase/like history P_u = {f_1, ..., f_N}. Given all these aspects, the preference predictor of CuratorNet is given by:

    x_{u,i} = α + β_u + Φ(P_u)^T Φ(f_i)    (1)

where α is an offset, β_u represents a user bias, Φ(·) represents the CuratorNet neural network, and P_u represents the set of visual embeddings of the images in user u's history. After some experiments we found no difference between using or not using a variable for item bias β_i, so we dropped it in order to decrease the number of parameters (Occam's razor). Finally, since we calculate the model parameters using BPR [33], the parameters α and β_u cancel out (details in the coming subsection) and our final preference predictor is simply

    x_{u,i} = Φ(P_u)^T Φ(f_i)    (2)
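As a rough illustration of how Equation (2) is used at recommendation time, the following sketch ranks candidate items by the dot product between the projected user history and the projected item embeddings. This is not the authors' released code; the function name and array shapes are our assumptions.

```python
import numpy as np

def score_items(user_vec: np.ndarray, item_vecs: np.ndarray) -> np.ndarray:
    """Equation (2): x_{u,i} = Phi(P_u)^T Phi(f_i).

    user_vec  -- Phi(P_u), shape (d,), the user's history projected by CuratorNet
    item_vecs -- Phi(f_i) for all candidate items, shape (n_items, d)
    Returns one preference score per candidate item.
    """
    return item_vecs @ user_vec

# Rank the items the user has not interacted with yet (I \ I_u^+) by
# descending score; user_vec and candidate_vecs are assumed to come from
# a trained CuratorNet network Phi.
# ranking = np.argsort(-score_items(user_vec, candidate_vecs))
```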
3.3 Model Learning via BPR
We use the Bayesian Personalized Ranking (BPR) framework [33] to learn the model parameters. Our goal is to optimize ranking by training a model which orders triples of the form (u, i, j) ∈ D_S, where u denotes a user, i an item with positive feedback from u, and j an item with non-observed feedback from u. The training set of triples D_S is defined as:

    D_S = {(u, i, j) | u ∈ U ∧ i ∈ I_u^+ ∧ j ∈ I \ I_u^+}    (3)

Table 1 shows that I_u^+ denotes the set of all items with positive feedback from u, while I \ I_u^+ denotes the items without such positive feedback. Considering our previously defined preference predictor x_{u,i}, we would expect a larger preference score of u over i than over j; BPR then defines the difference between scores

    x_{u,i,j} = x_{u,i} − x_{u,j}    (4)

and aims at finding the parameters Θ which optimize the objective function

    argmax_Θ Σ_{D_S} ln σ(x_{u,i,j}) − λ_Θ ||Θ||^2    (5)

where σ(·) is the sigmoid function, Θ includes all model parameters, and λ_Θ is a regularization hyperparameter.

In CuratorNet, unlike BPR-MF [33] and VBPR [20], we use a sigmoid cross-entropy loss, considering that we can interpret the decision over triplets as a binary classification problem, where x_{u,i,j} > 0 represents class c = 1 (triple well ranked, since x_{u,i} > x_{u,j}) and x_{u,i,j} ≤ 0 represents class c = 0 (triple wrongly ranked, since x_{u,i} ≤ x_{u,j}). The CuratorNet loss can then be expressed as:

    L = − Σ_{D_S} [ c ln(σ(x_{u,i,j})) + (1 − c) ln(1 − σ(x_{u,i,j})) ] + λ_Θ ||Θ||^2    (6)

where c ∈ {0, 1} is the class, Θ includes all model parameters, λ_Θ is a regularization hyperparameter, and σ(x_{u,i,j}) is the probability that user u really prefers i over j, P(i >_u j | Θ) [33], calculated with the sigmoid function, i.e.,

    P(i >_u j | Θ) = σ(x_{u,i,j}) = 1 / (1 + e^{−(x_{u,i} − x_{u,j})})    (7)

We perform the optimization of the parameters that reduce the loss function L by stochastic gradient descent with the Adam optimizer [23], using the implementation in TensorFlow (a reference CuratorNet implementation may be found at https://github.com/ialab-puc/CuratorNet). During each iteration of stochastic gradient descent, we sample a user u, a positive item i ∈ I_u^+, a negative item j ∈ I \ I_u^+, and the user's purchase/like history with item i removed, i.e., P_u \ {i}.
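Since every sampled triple treats i as the preferred item, Equation (6) is applied with c = 1, and the loss reduces to −ln σ(x_{u,i} − x_{u,j}) plus L2 regularization. A minimal TensorFlow sketch under that assumption (the function name, learning rate and regularization handling are ours, not taken from the reference implementation):

```python
import tensorflow as tf

def curatornet_triplet_loss(pos_scores, neg_scores, params, l2_reg=1e-4):
    """Sigmoid cross-entropy over triplets (Eq. 6) with c = 1 for every
    sampled triple, i.e. -log(sigmoid(x_ui - x_uj)) + L2 regularization."""
    x_uij = pos_scores - neg_scores                      # x_{u,i,j} = x_{u,i} - x_{u,j}
    log_likelihood = tf.reduce_sum(tf.math.log_sigmoid(x_uij))
    l2 = tf.add_n([tf.nn.l2_loss(p) for p in params])
    return -log_likelihood + l2_reg * l2

# One gradient step with Adam, as in Section 3.3 (the learning rate is an assumption):
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```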
3.4 Model Architecture
The architecture of the CuratorNet neural network is presented in Figure 1.

Figure 1: Architecture of CuratorNet showing in detail the layers with shared weights for training.

For training, each input instance is expected to be a triple (P_u, i, j), where P_u is the set of images in user u's history (purchases, likes) with a single item i removed from the set, i is an item with positive preference, and j is an item with assumed negative user preference. The negative user preference is assumed since item j is sampled from the list of images with which u has not interacted yet. Each image (i, j and all images in P_u) goes through a ResNet [18] (pre-trained with ImageNet data), which outputs a visual image embedding in R^2048. The ResNet weights are fixed during CuratorNet's training. Then, the network has two layers with scaled exponential linear units (hereinafter, SELU [24]), with 200 neurons each, which reduce the dimensionality of each image embedding. Notice that these two layers work similarly to a siamese [7] or triplet loss architecture [38, 44], i.e., they have shared weights. Each image is represented at the output of this section of the network by a vector in R^200. Then, for the case of the images in P_u, their embeddings are both averaged (average pooling [6]) and max-pooled per dimension (max pooling [6]), and next concatenated into a resulting vector in R^400. Finally, three consecutive SELU layers of 300, 200 and 200 neurons, respectively, end up with an output representation for P_u in R^200. The final part of the network is a ranking layer which evaluates a loss such that Φ(P_u) · Φ(i) > Φ(P_u) · Φ(j), where, replacing in Equation (2), we have x_{u,i} > x_{u,j}. There are several options of loss functions, but given the good results of the cross-entropy loss in similar architectures with shared weights [25], and since alternatives such as the hinge loss would require optimizing an additional margin parameter m, we chose the sigmoid cross-entropy for CuratorNet.

Notice that in this article we used a pre-trained ResNet [18] to obtain the image visual features, but the model could use other CNNs such as AlexNet [27], VGG [41], etc. We chose ResNet since it has performed the best in transfer learning tasks [10, 26].
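To make the data flow of Section 3.4 concrete, here is a minimal Keras sketch of Φ as we read it from the text: precomputed ResNet embeddings in R^2048 pass through two shared SELU layers of 200 units; for the history P_u the per-image outputs are average- and max-pooled, concatenated into R^400, and passed through three SELU layers of 300, 200 and 200 units. Layer names, initializers and the handling of variable-length histories are our assumptions, not details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Two SELU layers with shared weights, applied to every image embedding:
# the candidate items i and j, and every image in the history P_u.
shared_tower = tf.keras.Sequential([
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
], name="shared_image_tower")

# Layers that turn the pooled history representation (R^400) into Phi(P_u) in R^200.
user_head = tf.keras.Sequential([
    layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
], name="user_head")

def phi_item(item_emb):
    """Phi(f_i): (batch, 2048) ResNet embeddings -> (batch, 200)."""
    return shared_tower(item_emb)

def phi_user(history_emb):
    """Phi(P_u): (batch, n_images, 2048) -> (batch, 200) via shared layers,
    average + max pooling per dimension, and the 300/200/200 SELU head."""
    h = shared_tower(history_emb)                                # (batch, n, 200)
    pooled = tf.concat([tf.reduce_mean(h, axis=1),
                        tf.reduce_max(h, axis=1)], axis=-1)      # (batch, 400)
    return user_head(pooled)

# Preference score of Equation (2):
# x_ui = tf.reduce_sum(phi_user(P_u) * phi_item(f_i), axis=-1)
```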
3.5 Data Sampling for Training
The original BPR article [33] suggests creating training triples (u, i+, j−) simply by, given a user u, randomly sampling a positive element i+ among those consumed, as well as sampling a negative feedback element j− among those not consumed. However, later research has shown that there are more effective ways to create these training triples [12]. In our case, we define guidelines to sample triples for the training set based on analyses from previous studies indicating features which provide signals of user preference. For instance, Messina et al. [32] showed that people are very likely to buy several artworks with similar visual themes, as well as from the same artist, so we used visual clusters and the user's favorite artists to define some of these sampling guidelines.

Creating Visual Clusters. Some of the sampling guidelines are based on the visual similarity of the items, and although we have some metadata for the images in the dataset, there is a significant number of missing values: only 45% of the images have information about subject (e.g., architecture, nature, travel) and 53% about style (e.g., abstract, surrealism, pop art). For this reason, we conduct a clustering of images based on their visual representation, in such a way that items with visual embeddings that are too similar will not be used to sample positive/negative pairs (i+, j−). To obtain these visual clusters, we followed this procedure: (i) conduct a Principal Component Analysis to reduce the dimensionality of the image embedding vectors from R^2048 to R^200; (ii) perform k-means clustering with 100 clusters (we ran k-means 20 times, each time computing the Silhouette coefficient [34], an intrinsic metric of clustering quality, and kept the clustering with the highest Silhouette value); and (iii) assign each image the label of its respective visual cluster. Samples of our clusters in a 2-dimensional projection map of images, built with the UMAP method [31], can be seen in Figure 2.

Figure 2: Examples of visual clusters automatically generated to sample triples for the training set.

Guidelines for sampling triples. We generate the training set D_S as the union of multiple disjoint training sets, each one generated with a different strategy in mind. (Theoretically, these training sets are not perfectly disjoint, but in practice we hash all training triples and make sure no two training triples have the same hash; this prevents duplicates from being added to the final training set.) These strategies and their corresponding training sets are:
(1) Remove an item from a purchase basket, and predict this missing item.
(2) Sort items purchased sequentially, and then predict the next purchase in the basket.
(3) Recommend visually similar artworks from the favorite artists of a user.
(4) Recommend profile items from the same user profile.
(5) Create an artificial user profile with a single purchased item, and recommend profile items given this artificially created user profile.
(6) Create an artificial profile with a single item, then recommend visually similar items from the same artist.

Finally, the training set D_S is formally defined as:

    D_S = ∪_{i=1}^{6} D_{S_i}    (8)

In practice, we sample about 10 million training triples, distributed uniformly among the six training sets D_{S_i}. Likewise, we sample about 300,000 validation triples. To avoid sampling identical triples, we hash them and compare the hashes to check for potential collisions. Before sampling the training and validation sets, we hide the last purchase basket of each user, using these baskets later on for testing.
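A sketch of the visual-cluster construction described in this section, using scikit-learn; the stated choices (PCA to 200 dimensions, k-means with 100 clusters, 20 runs scored by the Silhouette coefficient) come from the text, while the random seeds and any remaining settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def build_visual_clusters(embeddings: np.ndarray, n_clusters: int = 100,
                          n_runs: int = 20, pca_dim: int = 200) -> np.ndarray:
    """(i) PCA from R^2048 to R^200, (ii) k-means with 100 clusters repeated
    20 times keeping the run with the highest Silhouette coefficient,
    (iii) return one cluster label per image."""
    reduced = PCA(n_components=pca_dim).fit_transform(embeddings)
    best_labels, best_score = None, -1.0
    for run in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, random_state=run).fit_predict(reduced)
        score = silhouette_score(reduced, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```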
4 EXPERIMENTS
4.1 Datasets
For our experiments we used a dataset where the user preference is in the form of purchases of physical art (paintings and pictures). This private dataset was collected and shared by an online art store. The dataset (available at https://drive.google.com/drive/folders/1Dk7_BRNtN_IL8r64xAo6GdOYEycivtLy) consists of 2,378 users, 6,040 items (paintings and photographs) and 5,336 purchases. On average, each user bought 2-3 items. One important aspect of this dataset is that paintings are one-of-a-kind, i.e., there is a single instance of each item, and once it is purchased it is removed from the inventory. Since most of the items in the dataset are one-of-a-kind paintings (78%) and most purchase transactions have been made over these items (81.7%), a method relying on a collaborative filtering model might suffer in performance, since user co-purchases are only possible on photographs. Another notable aspect of the dataset is that each item has a single creator (artist). In this dataset there are 573 artists, who have uploaded 10.54 items on average to the online art store. The dataset with transaction tuples (user, item), as well as the tuples used for testing (the last purchase of each user with at least two purchases), are available for replicating our results as well as for training other models. Due to copyright restrictions we cannot share the original image files, but we share the embeddings of the images obtained with ResNet50 [18].

4.2 Evaluation Methodology
In order to build and test the models, we split the data into train, validation and test sets. To make sure that we could make recommendations for all cases in the test set, and thus make a fair comparison among recommendation methods, we check that every user considered in the test set was also present in the training set. All baseline methods were trained on the training set with hyperparameters tuned on the validation set. Next, the trained models are used to report performance over different metrics on the test set. The test set consists of the last transaction of every user that purchased at least twice; the rest of the previous purchases are used for training and validation.

Metrics. To measure the results we used several metrics: AUC (also used in [19, 20, 22]), normalized discounted cumulative gain (nDCG@k) [21], as well as Precision@k and Recall@k [9]. Although it might seem counter-intuitive, we calculate these metrics for a low (k = 20) as well as a high value of k (k = 100). Most research on top-k recommendation systems focuses on the very top of the recommendation list (k = 5, 10, 20). However, Valcarce et al. [42] showed that top-k ranking metrics measured at higher values of k (k = 100, 200) are especially robust to biases such as sparsity and popularity biases. The sparsity bias refers to the lack of known relevance for all the user-item pairs, while the popularity bias is the tendency of popular items to receive more user feedback, so missing user-item interactions are not missing at random. We are especially interested in preventing popularity bias, since we want to recommend not only from the artists that each user commonly purchases from: we aim at promoting novelty as well as discovery of relevant art from newcomer artists.

4.3 Baselines
The methods used in the evaluation are the following:
(1) CuratorNet: The method described in this paper. We test it with four regularization values, λ = {0, .01, .001, .0001}.
(2) VBPR [20]: The state-of-the-art model. We used the same embedding size as in CuratorNet (200), we optimized it until convergence on the training set, and we also tested the four regularization values λ = {0, .01, .001, .0001}.
(3) VisRank [22, 32]: A simple memory-based content filtering method that ranks a candidate painting i for a user u based on the maximum cosine similarity with some existing item in the user profile j ∈ P_u, i.e.,

    score(u, i) = max_{j ∈ P_u} cosine(i, j)    (9)
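Equation (9) can be computed directly on the visual embeddings. A small NumPy sketch of the VisRank score, assuming embeddings are stored as rows and cosine similarity is obtained by L2-normalizing them:

```python
import numpy as np

def visrank_scores(profile_embs: np.ndarray, candidate_embs: np.ndarray) -> np.ndarray:
    """score(u, i) = max_{j in P_u} cosine(i, j)  (Equation 9).

    profile_embs   -- embeddings of the items in the user profile P_u, shape (m, d)
    candidate_embs -- embeddings of the candidate items, shape (n, d)
    """
    p = profile_embs / np.linalg.norm(profile_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    cos = c @ p.T                 # (n, m) cosine similarities
    return cos.max(axis=1)        # best match over the profile
```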
5 RESULTS AND DISCUSSION
In Table 2 we can see the results comparing all methods. As a reference, the top row presents an oracle (perfect ranking) and the bottom row a random recommender. Notice that the AUC of a random recommender should theoretically be 0.5 (sorting pairs of items given a user), so the obtained AUC = .4973 serves as a sanity check. In terms of AUC, Recall@100, and Precision@100, CuratorNet with a small regularization (λ = .0001) is the top model among all methods. We highlight the following points from these results:

Table 2: Results for all methods, sorted by AUC performance. The top five results are highlighted for each metric. For reference, the bottom row presents a random recommender, while the top row presents the results of a perfect Oracle.

Method      λ (L2 Reg.)  AUC     R@20    P@20    nDCG@20  R@100   P@100   nDCG@100
Oracle      –            1.0000  1.0000  .0655   1.0000   1.0000  .0131   1.0000
CuratorNet  .0001        .7204   .1683   .0106   .0966    .3200   .0040   .1246
CuratorNet  .001         .7177   .1566   .0094   .0895    .2937   .0037   .1160
VisRank     –            .7151   .1521   .0093   .0956    .2765   .0034   .1195
CuratorNet  0            .7131   .1689   .0100   .0977    .3048   .0038   .1239
CuratorNet  .01          .7125   .1235   .0075   .0635    .2548   .0032   .0904
VBPR        .0001        .6641   .1368   .0081   .0728    .2399   .0030   .0923
VBPR        0            .6543   .1287   .0078   .0670    .2077   .0026   .0829
VBPR        .001         .6410   .0830   .0047   .0387    .1948   .0024   .0620
VBPR        .01          .5489   .0101   .0005   .0039    .0506   .0006   .0118
Random      –            .4973   .0103   .0006   .0041    .0322   .0005   .0098

• CuratorNet, with a small regularization λ = .0001, outperforms the other methods in five metrics (AUC, Precision@20, Recall@100, Precision@100 and nDCG@100), while it stands second in Recall@20 and nDCG@20 against the non-regularized version of CuratorNet. This implies that CuratorNet overall ranks very well at top positions, and is especially robust against sparsity and popularity bias [42]. In addition, CuratorNet seems robust to changes in the regularization hyperparameter.
• Compared to VBPR, CuratorNet is better in all seven metrics (AUC, Recall@20, Precision@20, nDCG@20, Recall@100, Precision@100 and nDCG@100). Notably, it is also more robust to the regularization hyperparameter λ than VBPR. We think that this is explained in part by the characteristics of the dataset: VBPR exploits non-visual co-occurrence patterns, but in our dataset this signal provides rather little preference information, since almost 80% of items and transactions are one-of-a-kind.
• VisRank presents very competitive results, especially in terms of AUC, nDCG@20 and nDCG@100, performing better than VBPR in this highly one-of-a-kind dataset. However, CuratorNet performs better than VisRank in all metrics. This provides evidence that the model-based approach of CuratorNet, which aggregates user preferences into a single embedding, is better than the heuristic-based scoring of VisRank.

5.1 Effect of Sampling Guidelines
We studied the effect of using our sampling guidelines for building the training set D_S, compared to the traditional BPR setting where negative samples j are sampled uniformly at random from the set of items not observed by the user, i.e., I \ I_u^+. In the case of CuratorNet we use all six sampling guidelines (D_{S_1} – D_{S_6}), while for VBPR we only used two sampling guidelines (D_{S_3} and D_{S_4}), since VBPR has no notion of sessions or purchase baskets in its original formulation, and it has more parameters than CuratorNet to model collaborative non-visual latent preferences. We tested AUC for both CuratorNet and VBPR, under their best-performing regularization parameter λ, with and without our sampling guidelines. Notice that the results in Table 2 all consider the use of our sampling guidelines. After conducting pairwise t-tests, we found a significant improvement in CuratorNet and VBPR, as shown in Figure 3.

Figure 3: The sampling guidelines had a positive effect on AUC compared to random negative sampling for building the BPR training set.

CuratorNet with sampling guidelines (AUC = .7204) had a significant improvement over CuratorNet with random negative sampling (AUC = .6602), p = 7.7·10^-5. Likewise, VBPR with guidelines (AUC = .6641) had a significant improvement compared with VBPR with random sampling (AUC = .5899), p = 1.6·10^-6. With this result, we conclude that the proposed sampling guidelines help in selecting better triplets for more effective learning in our art image recommendation setting.
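As a hedged illustration of this evaluation: with a single held-out positive item per user, the per-user AUC is the fraction of non-interacted items ranked below that item, and method comparisons like the one above can be checked with a paired t-test over per-user AUC values. The paper does not fully specify its statistical setup, so the following SciPy-based sketch is only our reading of it.

```python
import numpy as np
from scipy import stats

def user_auc(pos_score: float, neg_scores: np.ndarray) -> float:
    """Per-user AUC with one held-out positive item: fraction of negative
    items ranked strictly below it (ties counted as half)."""
    wins = (pos_score > neg_scores).sum() + 0.5 * (pos_score == neg_scores).sum()
    return wins / len(neg_scores)

# Paired comparison of two methods over the same users, e.g. CuratorNet with
# and without the sampling guidelines (aucs_a, aucs_b are per-user arrays):
# t_stat, p_value = stats.ttest_rel(aucs_a, aucs_b)
```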
6 CONCLUSION
In this article we have introduced CuratorNet, an art image recommender system based on neural networks. The learning model of CuratorNet is inspired by VBPR [20], but it incorporates additional aspects such as layers with shared weights, and it works especially well in situations with one-of-a-kind items, i.e., items which disappear from the inventory once consumed, making it difficult to use traditional collaborative filtering. Notice that an important contribution of this article is the data shared, since we could not find on the internet any other dataset of user transactions over physical paintings. We have anonymized the user and item IDs and we have provided ResNet visual embeddings to help other researchers build and validate models with these data. Our model outperforms the state-of-the-art VBPR as well as other simple but strong baselines such as VisRank [22, 32]. We also introduce a series of guidelines for sampling triples for the BPR training set, and we show significant improvements in performance of both CuratorNet and VBPR versus traditional random sampling of negative instances.

Future Work. Among our ideas for future work, we will test our neural architecture using end-to-end learning, in a similar fashion to [22], who used a light model called CNN-F to replace the pre-trained AlexNet visual embeddings. Another idea we will test is to create explanations for our recommendations based on low-level (textures) and high-level (objects) visual features which some recent research is able to identify from CNNs, such as the Network Dissection approach by Bau et al. [4]. Also, we will explore ideas from the research on image style transfer [15, 16], which might help us to identify styles and then use this information as context to produce style-aware recommendations. Another interesting idea for future work is integrating multitask learning into our framework, as in the recently published paper on the newest YouTube recommender [45]. Finally, from a methodological point of view, we will test other datasets with likes rather than purchases, since we aim at understanding how the model behaves under a different type of user relevance feedback.

ACKNOWLEDGMENTS
This work has been supported by the Millennium Institute for Foundational Research on Data (IMFD) and by the Chilean research agency ANID, FONDECYT grant 1191791.

REFERENCES
[1] S. Akçay, M. E. Kundegorski, M. Devereux, and T. P. Breckon. 2016. Transfer learning using convolutional neural networks for object classification within X-ray baggage security imagery. In Proceedings of the IEEE International Conference on Image Processing (ICIP). 1057–1061.
[2] LM Aroyo, Y Wang, R Brussee, Peter Gorgels, LW Rutledge, and N Stash. 2007. Personalized museum experience: The Rijksmuseum use case. In Proceedings of Museums and the Web.
[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 12 (2017), 2481–2495.
[4] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6541–6549.
[5] Idir Benouaret and Dominique Lenne. 2015. Personalizing the museum experience through context-aware recommendations. In 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 743–748.
[6] Y-Lan Boureau, Jean Ponce, and Yann LeCun. 2010. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 111–118.
[7] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. 2005. Learning a similarity metric discriminatively, with application to face verification. In CVPR (1). 539–546.
[8] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[9] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 39–46.
[10] Felipe del Rio, Pablo Messina, Vicente Dominguez, and Denis Parra. 2018. Do Better ImageNet Models Transfer Better... for Image Recommendation?. In 2nd Workshop on Intelligent Recommender Systems by Knowledge Transfer and Learning. https://arxiv.org/abs/1807.09870
[11] Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and Massimo Quadrana. 2016. Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics 5, 2 (2016), 99–113.
[12] Jingtao Ding, Fuli Feng, Xiangnan He, Guanghui Yu, Yong Li, and Depeng Jin. 2018. An improved sampler for Bayesian personalized ranking by leveraging view data. In Companion Proceedings of The Web Conference 2018. International World Wide Web Conferences Steering Committee, 13–14.
[13] Mehdi Elahi, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Leonardo Cella, Stefano Cereda, and Paolo Cremonesi. 2017. Exploring the Semantic Gap for Movie Recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems (Como, Italy) (RecSys '17). 326–330.
[14] David Elsweiler, Christoph Trattner, and Morgan Harvey. 2017. Exploiting food choice biases for healthier recipe recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 575–584.
[15] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2414–2423.
[16] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. 2017. Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830 (2017).
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[19] Ruining He, Chen Fang, Zhaowen Wang, and Julian McAuley. 2016. Vista: A Visually, Socially, and Temporally-aware Model for Artistic Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys '16). 309–316.
[20] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 144–150.
[21] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[22] Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. 2017. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 207–216.
[23] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
[24] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems. 971–980.
[25] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
[26] Simon Kornblith, Jonathon Shlens, and Quoc V Le. 2018. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974 (2018).
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems 25 (NIPS). 1097–1105.
[28] Marco La Cascia, Saratendu Sethi, and Stan Sclaroff. 1998. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries. 24–28.
[29] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. 2016. Comparative Deep Learning of Hybrid Representations for Image Recommendations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2545–2553.
[30] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[31] L. McInnes, J. Healy, and J. Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints (Feb. 2018). arXiv:1802.03426 [stat.ML]
[32] Pablo Messina, Vicente Dominguez, Denis Parra, Christoph Trattner, and Alvaro Soto. 2018. Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features. User Modeling and User-Adapted Interaction (2018), 1–40.
[33] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
[34] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[35] Yong Rui, Thomas S Huang, Michael Ortega, and Sharad Mehrotra. 1998. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 8, 5 (1998), 644–655.
[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
[37] Jose San Pedro and Stefan Siersdorfer. 2009. Ranking and Classifying Attractiveness of Photos in Folksonomies. In Proceedings of the 18th International Conference on World Wide Web (Madrid, Spain) (WWW '09). 771–780.
[38] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 815–823.
[39] Giovanni Semeraro, Pasquale Lops, Marco De Gemmis, Cataldo Musto, and Fedelucio Narducci. 2012. A folksonomy-based recommender system for personalized access to digital artworks. Journal on Computing and Cultural Heritage (JOCCH) 5, 3 (2012), 11.
[40] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 806–813.
[41] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[42] Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-N recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 260–268.
[43] Egon L van den Broek, Thijs Kok, Theo E Schouten, and Eduard Hoenkamp. 2006. Multimedia for art retrieval (m4art). In Multimedia Content Analysis, Management, and Retrieval 2006, Vol. 6073. International Society for Optics and Photonics, 60730Z.
[44] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1386–1393.
[45] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending What Video to Watch Next: A Multitask Ranking System. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys '19). ACM, New York, NY, USA, 43–51. https://doi.org/10.1145/3298689.3346997