CuratorNet: Visually-aware Recommendation of Art Images

Pablo Messina, Manuel Cartagena, Patricio Cerda, Felipe del Rio, Denis Parra
Pontificia Universidad Católica, Santiago, Chile
pamessina@uc.cl, micartagena@uc.cl, pcerdam@uc.cl, fidelrio@uc.cl, dparra@ing.puc.cl
Pablo Messina, Felipe del Rio and Denis Parra are also with the Millennium Institute Foundational Research on Data, IMFD.
Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Although there are several visually-aware recommendation models in domains like fashion or even movies, the art domain lacks the same level of research attention, despite the recent growth of the online artwork market. To reduce this gap, in this article we introduce CuratorNet, a neural network architecture for visually-aware recommendation of art images. CuratorNet is designed at the core with the goal of maximizing generalization: the network has a fixed set of parameters that only need to be trained once, and thereafter the model is able to generalize to new users or items never seen before, without further training. This is achieved by leveraging visual content: items are mapped to item vectors through visual embeddings, and users are mapped to user vectors by aggregating the visual content of items they have consumed. Besides the model architecture, we also introduce novel triplet sampling strategies to build a training set for rank learning in the art domain, resulting in more effective learning than naive random sampling. With an evaluation over a real-world dataset of physical paintings, we show that CuratorNet achieves the best performance among several baselines, including the state-of-the-art model VBPR. CuratorNet is motivated and evaluated in the art domain, but its architecture and training scheme could be adapted to recommend images in other areas.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Machine learning approaches; • Applied computing → Media arts.

KEYWORDS
recommender systems, neural networks, visual art
1 INTRODUCTION
The big revolution of deep convolutional neural networks (CNNs) in the area of computer vision for tasks such as image classification [18, 27, 41], object recognition [1], image segmentation [3] or scene identification [40] has reached the area of image recommender systems in recent years [19, 20, 22, 29, 30, 32]. These works use neural visual embeddings to improve recommendation performance compared to previous approaches for image recommendation based on ratings and text [2], social tags [39], context [5] and manually crafted visual features [43]. Regarding application domains of recent image recommendation methods using neural visual embeddings, to the best of our knowledge most of them focus on fashion recommendation [20, 22, 30], a few on art recommendation [19, 32] and photo recommendation [29]. He et al. [19] proposed Vista, a model combining neural visual embeddings, collaborative filtering, and temporal and social signals for digital art recommendation. However, digital art projects can differ significantly from physical art (paintings and photographs). Messina et al. [32] study recommendation of paintings in an online art store using a simple k-NN model based on neural visual features and metadata. Although memory-based models perform fairly well, model-based methods using neural visual features report better performance [19, 20] in the fashion domain, indicating room for improvement in this area, considering the growing sales in the global online artwork market (https://www.artsy.net/article/artsy-editorial-global-art-market-reached-674-billion-2018-6).

The most popular model-based method for image recommendation using neural visual embeddings is VBPR [20], a state-of-the-art model that integrates implicit feedback collaborative filtering with neural visual embeddings into a Bayesian Personalized Ranking (BPR) learning framework [33]. VBPR performs well, but it has some drawbacks. VBPR learns a latent embedding for each user and for each item, so new users cannot receive suggestions and new items cannot be recommended until re-training is carried out. An alternative is training a model such as YouTube's Deep Neural Recommender [8], which allows recommending to new users with little preference feedback and without additional model training. However, YouTube's model was trained on millions of user transactions and with large amounts of profile and contextual data, so it does not easily fit datasets that are small, with little user feedback or with little contextual and profile data.

In this work, we introduce a neural network for visually-aware recommendation of images focused on visual art, named CuratorNet, whose general structure can be seen in Figure 1. CuratorNet leverages neural image embeddings such as those obtained from CNNs [18, 27, 41] pre-trained on the ImageNet dataset (ILSVRC [36]). We train CuratorNet for ranking with triplets (P_u, i+, j−), where P_u is the history of image preferences of a user u, whereas i+ and j− are a pair of images with higher and lower preference, respectively. CuratorNet draws inspiration from VBPR [20] and YouTube's Recommender System [8]. VBPR [20] inspired us to leverage pre-trained image embeddings and to optimize the model for ranking as in BPR [33]. From the work of Covington et al. [8] we took the idea of designing a deep neural network that can generalize to new users without introducing new parameters or further training (unlike VBPR, which needs to learn a latent user vector for each new user). As a result, CuratorNet can recommend to new users with very little feedback and without additional training. CuratorNet's deep neural network is trained for personalized ranking using triplets, and the architecture contains a set of layers with shared weights, inspired by models using triplet loss for non-personalized image ranking [38, 44]. In those works a single image represents the input query, but in our case the input query is a set of images representing a user preference history, P_u. In summary, compared to previous works, our main contributions are:

• a novel neural-based visually-aware architecture for image recommendation,
• a set of sampling guidelines for the creation of the training dataset (triplets), which improve the performance of CuratorNet as well as VBPR with respect to random negative sampling, and
• a thorough evaluation of the method against competitive state-of-the-art methods (VisRank [22, 32] and VBPR [20]) on a dataset of purchases of physical art (paintings and photographs).

We also share the dataset of user transactions (https://drive.google.com/drive/folders/1Dk7_BRNtN_IL8r64xAo6GdOYEycivtLy), with hashed user and item IDs due to privacy requirements, as well as visual embeddings of the paintings' image files. One aspect to highlight about this research is that although the triplet sampling guidelines to build the BPR training set apply specifically to visual art, the architecture of CuratorNet can be used in other visual domains for image recommendation.
2 RELATED WORK
In this section we provide an overview of relevant related work, considering Artwork Recommender Systems (2.1) and Visually-aware Recommender Systems (2.2), as well as highlights of what differentiates our work from the existing literature (2.3).

2.1 Artwork Recommender Systems
With respect to artwork recommender systems, one of the first contributions was the CHIP Project [2]. The aim of the project was to build a recommendation system for the Rijksmuseum. The project used traditional techniques such as content-based filtering based on metadata provided by experts, as well as collaborative filtering based on users' ratings. Another similar but non-personalized system was m4art by Van den Broek et al. [43], who used color histograms to retrieve similar art images given a painting as input query.

Another important contribution is the work by Semeraro et al. [39], who introduced an artwork recommender system called FIRSt (Folksonomy-based Item Recommender syStem), which utilizes social tags given by experts and non-experts on over 65 paintings of the Vatican picture gallery. They did not employ visual features among their methods. Benouaret et al. [5] improved the state-of-the-art in artwork recommender systems using context obtained through a mobile application, with the aim of making museum tour recommendations more useful. Their content-based approach used ratings given by the users during the tour and metadata from the artworks rated, e.g., title or artist names.

Finally, the most recent works use neural image embeddings [19, 32]. He et al. [19] propose the system Vista, which addresses digital artwork recommendation based on pre-trained deep neural visual features, as well as temporal and social data. On the other hand, Messina et al. [32] address the recommendation of one-of-a-kind physical paintings, comparing the performance of metadata, manually-curated visual features, and neural visual embeddings. Messina et al. [32] recommend to users by computing a simple k-NN-based similarity score between users' purchased paintings and the paintings in the dataset, a method that Kang et al. [22] call VisRank.
2.2 Visually-aware Image Recommender Systems
In this section we survey works using visual features to recommend images. We also cite a few works using visual information to recommend non-image items, but these are less relevant for the present research.

Manually-engineered visual features extracted from images (texture, sharpness, brightness, etc.) have been used in several information filtering tasks, such as retrieval [28, 35, 43] and ranking [37]. More recently, interesting results have been shown for the use of low-level handcrafted stylistic visual features automatically extracted from video frames for content-based video recommendation [11]. Even better results are obtained when both stylistic visual features and annotated metadata are combined in a hybrid recommender, as shown in the work of Elahi et al. [13]. In a visually-aware setting not related to recommending images, Elsweiler et al. [14] used manually-crafted attractiveness visual features [37] in order to recommend healthy food recipes to users.

Another branch of visually-aware image recommender systems focuses on using neural embeddings to represent images [19, 20, 22, 29, 32]. The computer vision community has a long track record of successful systems based on neural networks for several tasks [1, 3, 18, 27, 40, 41]. This trend started from the outstanding performance of AlexNet [27] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC [36]), but the most notable implication is that neural image embeddings have shown impressive performance for transfer learning, i.e., for tasks different from the original one [10, 26]. Usually these neural image embeddings are obtained from CNN models such as AlexNet [27], VGG [41] and ResNet [18], among others. Motivated by these results, McAuley et al. [30] introduced an image-based recommendation system based on styles and substitutes for clothing, using visual embeddings pre-trained on a large-scale dataset obtained from Amazon.com. Later, He et al. [20] went further in this line of research and introduced a visually-aware matrix factorization approach that incorporates visual signals (from a pre-trained CNN) into predictors of people's opinions, called VBPR. Their training model is based on Bayesian Personalized Ranking (BPR), a model previously introduced by Rendle et al. [33]. The next work by He et al. [19] deals with visually-aware digital art recommendation, building a model called Vista which combines ratings, temporal and social signals, and visual features.

Another relevant work is the research by Lei et al. [29], who introduced comparative deep learning for hybrid image recommendation. In this work, they use a siamese neural network architecture for making recommendations of images using user information (such as demographics and social tags) as well as images in pairs (one liked, one disliked) in order to build a ranking model. The approach is interesting, but they work with Flickr photos, not artwork images, and use social tags, which are not present in our problem setting. The work by Kang et al. [22] expands VBPR, but they focus on generating images using generative adversarial networks [17] rather than recommending, with an application in the fashion domain. Finally, Messina et al. [32] was already mentioned, but we can add that their neural image embeddings outperformed other visual (manually-extracted) and metadata features for ranking, with the exception of the metadata given by the user's favorite artist, which predicted even better than neural embeddings for top-k recommendation.

2.3 Differences to Previous Research
Almost all the surveyed articles on artwork recommendation have in common that they used standard techniques such as collaborative filtering and content-based filtering, as well as manually-curated visual image features; only the most recent works have exploited visual features extracted from CNNs [19, 32]. In comparison to these works, we introduce a model-based approach (unlike the memory-based VisRank method by Messina et al. [32]) which recommends to cold-start items and users without additional model training (unlike [19]). With regard to more general work on visually-aware image recommender systems, almost all of the surveyed articles have focused on tasks different from art recommendation, such as fashion recommendation [20, 22, 30], photo [29] and video recommendation [13]. Only Vista, the work by He et al. [19], resembles ours in terms of the topic (art recommendation) and the use of visual features. Unlike them, we evaluate our proposed method, CuratorNet, on a dataset of physical paintings and photographs, not only digital art. Moreover, Vista uses social and temporal metadata which we do not have and many other datasets might not have either. Compared to all this previous research, and to the best of our knowledge, CuratorNet is the first architecture for image recommendation that takes advantage of shared weights in a triplet loss setting, an idea inspired by the results of Wang et al. [44] and Schroff et al. [38], but here adapted to the personalized image recommendation domain.
3 CURATORNET
3.1 Problem Formulation
We approach the problem of recommending art images from user positive-only feedback (e.g., purchase history, likes, etc.) upon visual items (paintings, photographs, etc.). Let U and I be the set of users and items in a dataset, respectively. We assume exactly one image per item i ∈ I. Considering either user purchases or likes, the set of items for which a user u has expressed positive preference is defined as I_u^+. In this work, we consider purchases to be positive feedback from the user. Our goal is to generate for each user u ∈ U a personalized ranked list of the items for which the user has not yet expressed preference, i.e., for I \ I_u^+.

Table 1: Notation for CuratorNet.

Symbol          Description
U, I            user set, item set
u               a specific user
i, j            specific items
i+, j−          a positive item and a negative item (resp.)
I_u^+ or P_u    set of all items for which user u has expressed a positive preference (full history)
I_{u,k}^+       set of all items for which user u has expressed a positive preference up to his k-th purchase basket (inclusive)
P_{u,k}         set of all items for which user u has expressed a positive preference in his k-th purchase basket

3.2 Preference Predictor
The preference predictor in CuratorNet is inspired by VBPR [20], a state-of-the-art visual recommender model. However, CuratorNet has some important differences. First, we do not use non-visual latent factors, so we remove the traditional user and item non-visual latent embeddings. Second, we do not learn a specific embedding per user as VBPR does; instead, we learn a joint model that, given a user's purchase/like history, outputs a single embedding which can be used to rank the unobserved artworks in the dataset, similar to YouTube's Deep Learning network [8]. Another important difference between VBPR and CuratorNet is that the former has a single matrix E to project a visual item embedding f_i into the user latent space, whereas in CuratorNet we learn a neural network Φ(·) to perform that projection, which receives as input either a single image embedding f_i or a set of image embeddings representing a user's purchase/like history P_u = {f_1, ..., f_N}. Given all these aspects, the preference predictor of CuratorNet is given by:

    x_{u,i} = α + β_u + Φ(P_u)^T Φ(f_i)    (1)

where α is an offset, β_u represents a user bias, Φ(·) represents the CuratorNet neural network, and P_u represents the set of visual embeddings of the images in user u's history. After some experiments we found no difference between using or not using a variable for item bias β_i, so we dropped it in order to decrease the number of parameters (Occam's razor). Finally, since we calculate the model parameters using BPR [33], the parameters α and β_u cancel out (details in the coming subsection) and our final preference predictor is simply

    x_{u,i} = Φ(P_u)^T Φ(f_i)    (2)
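As a rough illustration of how Equation (2) is used at recommendation time, the following sketch ranks candidate items by the dot product between the projected user history and the projected item embeddings. This is not the authors' released code; the function name and array shapes are our assumptions.

```python
import numpy as np

def score_items(user_vec: np.ndarray, item_vecs: np.ndarray) -> np.ndarray:
    """Equation (2): x_{u,i} = Phi(P_u)^T Phi(f_i).

    user_vec  -- Phi(P_u), shape (d,), the user's history projected by CuratorNet
    item_vecs -- Phi(f_i) for all candidate items, shape (n_items, d)
    Returns one preference score per candidate item.
    """
    return item_vecs @ user_vec

# Rank the items the user has not interacted with yet (I \ I_u^+) by
# descending score; user_vec and candidate_vecs are assumed to come from
# a trained CuratorNet network Phi.
# ranking = np.argsort(-score_items(user_vec, candidate_vecs))
```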
3.3 Model Learning via BPR
We use the Bayesian Personalized Ranking (BPR) framework [33] to learn the model parameters. Our goal is to optimize ranking by training a model which orders triples of the form (u, i, j) ∈ D_S, where u denotes a user, i an item with positive feedback from u, and j an item with non-observed feedback from u. The training set of triples D_S is defined as:

    D_S = {(u, i, j) | u ∈ U ∧ i ∈ I_u^+ ∧ j ∈ I \ I_u^+}    (3)

Table 1 shows that I_u^+ denotes the set of all items with positive feedback from u, while I \ I_u^+ denotes the items without such positive feedback. Considering our previously defined preference predictor x_{u,i}, we would expect a larger preference score of u over i than over j; BPR then defines the difference between scores

    x_{u,i,j} = x_{u,i} − x_{u,j}    (4)

and aims at finding the parameters Θ which optimize the objective function

    argmax_Θ Σ_{D_S} ln σ(x_{u,i,j}) − λ_Θ ||Θ||^2    (5)

where σ(·) is the sigmoid function, Θ includes all model parameters, and λ_Θ is a regularization hyperparameter.

In CuratorNet, unlike BPR-MF [33] and VBPR [20], we use a sigmoid cross-entropy loss, considering that we can interpret the decision over triplets as a binary classification problem, where x_{u,i,j} > 0 represents class c = 1 (triple well ranked, since x_{u,i} > x_{u,j}) and x_{u,i,j} ≤ 0 represents class c = 0 (triple wrongly ranked, since x_{u,i} ≤ x_{u,j}). The CuratorNet loss can then be expressed as:

    L = − Σ_{D_S} [ c ln(σ(x_{u,i,j})) + (1 − c) ln(1 − σ(x_{u,i,j})) ] + λ_Θ ||Θ||^2    (6)

where c ∈ {0, 1} is the class, Θ includes all model parameters, λ_Θ is a regularization hyperparameter, and σ(x_{u,i,j}) is the probability that user u really prefers i over j, P(i >_u j | Θ) [33], calculated with the sigmoid function, i.e.,

    P(i >_u j | Θ) = σ(x_{u,i,j}) = 1 / (1 + e^{−(x_{u,i} − x_{u,j})})    (7)

We perform the optimization of the parameters that reduce the loss function L by stochastic gradient descent with the Adam optimizer [23], using the implementation in TensorFlow (a reference CuratorNet implementation may be found at https://github.com/ialab-puc/CuratorNet). During each iteration of stochastic gradient descent, we sample a user u, a positive item i ∈ I_u^+, a negative item j ∈ I \ I_u^+, and the user's purchase/like history with item i removed, i.e., P_u \ {i}.
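Since every sampled triple treats i as the preferred item, Equation (6) is applied with c = 1, and the loss reduces to −ln σ(x_{u,i} − x_{u,j}) plus L2 regularization. A minimal TensorFlow sketch under that assumption (the function name, learning rate and regularization handling are ours, not taken from the reference implementation):

```python
import tensorflow as tf

def curatornet_triplet_loss(pos_scores, neg_scores, params, l2_reg=1e-4):
    """Sigmoid cross-entropy over triplets (Eq. 6) with c = 1 for every
    sampled triple, i.e. -log(sigmoid(x_ui - x_uj)) + L2 regularization."""
    x_uij = pos_scores - neg_scores                      # x_{u,i,j} = x_{u,i} - x_{u,j}
    log_likelihood = tf.reduce_sum(tf.math.log_sigmoid(x_uij))
    l2 = tf.add_n([tf.nn.l2_loss(p) for p in params])
    return -log_likelihood + l2_reg * l2

# One gradient step with Adam, as in Section 3.3 (the learning rate is an assumption):
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```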
3.4 Model Architecture
The architecture of the CuratorNet neural network is presented in Figure 1.

Figure 1: Architecture of CuratorNet showing in detail the layers with shared weights for training.

For training, each input instance is expected to be a triple (P_u, i, j), where P_u is the set of images in user u's history (purchases, likes) with a single item i removed from the set, i is an item with positive preference, and j is an item with assumed negative user preference. The negative user preference is assumed since item j is sampled from the list of images with which u has not interacted yet. Each image (i, j and all images in P_u) goes through a ResNet [18] (pre-trained with ImageNet data), which outputs a visual image embedding in R^2048. The ResNet weights are fixed during CuratorNet's training. Then, the network has two layers with scaled exponential linear units (hereinafter, SELU [24]), with 200 neurons each, which reduce the dimensionality of each image embedding. Notice that these two layers work similarly to a siamese [7] or triplet loss architecture [38, 44], i.e., they have shared weights. Each image is represented at the output of this section of the network by a vector in R^200. Then, for the case of the images in P_u, their embeddings are both averaged (average pooling [6]) and max-pooled per dimension (max pooling [6]), and next concatenated into a resulting vector in R^400. Finally, three consecutive SELU layers of 300, 200 and 200 neurons, respectively, end up with an output representation for P_u in R^200. The final part of the network is a ranking layer which evaluates a loss such that Φ(P_u) · Φ(i) > Φ(P_u) · Φ(j), where, replacing in Equation (2), we have x_{u,i} > x_{u,j}. There are several options of loss functions, but given the good results of the cross-entropy loss in similar architectures with shared weights [25], and since alternatives such as the hinge loss would require optimizing an additional margin parameter m, we chose the sigmoid cross-entropy for CuratorNet.

Notice that in this article we used a pre-trained ResNet [18] to obtain the image visual features, but the model could use other CNNs such as AlexNet [27], VGG [41], etc. We chose ResNet since it has performed the best in transfer learning tasks [10, 26].
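To make the data flow of Section 3.4 concrete, here is a minimal Keras sketch of Φ as we read it from the text: precomputed ResNet embeddings in R^2048 pass through two shared SELU layers of 200 units; for the history P_u the per-image outputs are average- and max-pooled, concatenated into R^400, and passed through three SELU layers of 300, 200 and 200 units. Layer names, initializers and the handling of variable-length histories are our assumptions, not details from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Two SELU layers with shared weights, applied to every image embedding:
# the candidate items i and j, and every image in the history P_u.
shared_tower = tf.keras.Sequential([
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
], name="shared_image_tower")

# Layers that turn the pooled history representation (R^400) into Phi(P_u) in R^200.
user_head = tf.keras.Sequential([
    layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
    layers.Dense(200, activation="selu", kernel_initializer="lecun_normal"),
], name="user_head")

def phi_item(item_emb):
    """Phi(f_i): (batch, 2048) ResNet embeddings -> (batch, 200)."""
    return shared_tower(item_emb)

def phi_user(history_emb):
    """Phi(P_u): (batch, n_images, 2048) -> (batch, 200) via shared layers,
    average + max pooling per dimension, and the 300/200/200 SELU head."""
    h = shared_tower(history_emb)                                # (batch, n, 200)
    pooled = tf.concat([tf.reduce_mean(h, axis=1),
                        tf.reduce_max(h, axis=1)], axis=-1)      # (batch, 400)
    return user_head(pooled)

# Preference score of Equation (2):
# x_ui = tf.reduce_sum(phi_user(P_u) * phi_item(f_i), axis=-1)
```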
3.5 Data Sampling for Training
The original BPR article [33] suggests creating training triples (u, i+, j−) simply by, given a user u, randomly sampling a positive element i+ among those consumed, as well as sampling a negative feedback element j− among those not consumed. However, later research has shown that there are more effective ways to create these training triples [12]. In our case, we define guidelines to sample triples for the training set based on analyses from previous studies indicating features which provide signals of user preference. For instance, Messina et al. [32] showed that people are very likely to buy several artworks with similar visual themes, as well as from the same artist, so we used visual clusters and the user's favorite artists to define some of these sampling guidelines.

Creating Visual Clusters. Some of the sampling guidelines are based on the visual similarity of the items, and although we have some metadata for the images in the dataset, there is a significant number of missing values: only 45% of the images have information about subject (e.g., architecture, nature, travel) and 53% about style (e.g., abstract, surrealism, pop art). For this reason, we conduct a clustering of images based on their visual representation, in such a way that items with visual embeddings that are too similar will not be used to sample positive/negative pairs (i+, j−). To obtain these visual clusters, we followed this procedure: (i) conduct a Principal Component Analysis to reduce the dimensionality of the image embedding vectors from R^2048 to R^200; (ii) perform k-means clustering with 100 clusters (we ran k-means 20 times, each time computing the Silhouette coefficient [34], an intrinsic metric of clustering quality, and kept the clustering with the highest Silhouette value); and (iii) assign each image the label of its respective visual cluster. Samples of our clusters in a 2-dimensional projection map of images, built with the UMAP method [31], can be seen in Figure 2.

Figure 2: Examples of visual clusters automatically generated to sample triples for the training set.

Guidelines for sampling triples. We generate the training set D_S as the union of multiple disjoint training sets, each one generated with a different strategy in mind. (Theoretically, these training sets are not perfectly disjoint, but in practice we hash all training triples and make sure no two training triples have the same hash; this prevents duplicates from being added to the final training set.) These strategies and their corresponding training sets are:
(1) Remove an item from a purchase basket, and predict this missing item.
(2) Sort items purchased sequentially, and then predict the next purchase in the basket.
(3) Recommend visually similar artworks from the favorite artists of a user.
(4) Recommend profile items from the same user profile.
(5) Create an artificial user profile with a single purchased item, and recommend profile items given this artificially created user profile.
(6) Create an artificial profile with a single item, then recommend visually similar items from the same artist.

Finally, the training set D_S is formally defined as:

    D_S = ∪_{i=1}^{6} D_{S_i}    (8)

In practice, we sample about 10 million training triples, distributed uniformly among the six training sets D_{S_i}. Likewise, we sample about 300,000 validation triples. To avoid sampling identical triples, we hash them and compare the hashes to check for potential collisions. Before sampling the training and validation sets, we hide the last purchase basket of each user, using these baskets later on for testing.
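A sketch of the visual-cluster construction described in this section, using scikit-learn; the stated choices (PCA to 200 dimensions, k-means with 100 clusters, 20 runs scored by the Silhouette coefficient) come from the text, while the random seeds and any remaining settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def build_visual_clusters(embeddings: np.ndarray, n_clusters: int = 100,
                          n_runs: int = 20, pca_dim: int = 200) -> np.ndarray:
    """(i) PCA from R^2048 to R^200, (ii) k-means with 100 clusters repeated
    20 times keeping the run with the highest Silhouette coefficient,
    (iii) return one cluster label per image."""
    reduced = PCA(n_components=pca_dim).fit_transform(embeddings)
    best_labels, best_score = None, -1.0
    for run in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, random_state=run).fit_predict(reduced)
        score = silhouette_score(reduced, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```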
4 EXPERIMENTS
4.1 Datasets
For our experiments we used a dataset where the user preference is in the form of purchases of physical art (paintings and pictures). This private dataset was collected and shared by an online art store. The dataset (available at https://drive.google.com/drive/folders/1Dk7_BRNtN_IL8r64xAo6GdOYEycivtLy) consists of 2,378 users, 6,040 items (paintings and photographs) and 5,336 purchases. On average, each user bought 2-3 items. One important aspect of this dataset is that paintings are one-of-a-kind, i.e., there is a single instance of each item, and once it is purchased it is removed from the inventory. Since most of the items in the dataset are one-of-a-kind paintings (78%) and most purchase transactions have been made over these items (81.7%), a method relying on a collaborative filtering model might suffer in performance, since user co-purchases are only possible on photographs. Another notable aspect of the dataset is that each item has a single creator (artist). In this dataset there are 573 artists, who have uploaded 10.54 items on average to the online art store. The dataset with transaction tuples (user, item), as well as the tuples used for testing (the last purchase of each user with at least two purchases), are available for replicating our results as well as for training other models. Due to copyright restrictions we cannot share the original image files, but we share the embeddings of the images obtained with ResNet50 [18].

4.2 Evaluation Methodology
In order to build and test the models, we split the data into train, validation and test sets. To make sure that we could make recommendations for all cases in the test set, and thus make a fair comparison among recommendation methods, we check that every user considered in the test set was also present in the training set. All baseline methods were trained on the training set with hyperparameters tuned on the validation set. Next, the trained models are used to report performance over different metrics on the test set. The test set consists of the last transaction of every user that purchased at least twice; the rest of the previous purchases are used for training and validation.

Metrics. To measure the results we used several metrics: AUC (also used in [19, 20, 22]), normalized discounted cumulative gain (nDCG@k) [21], as well as Precision@k and Recall@k [9]. Although it might seem counter-intuitive, we calculate these metrics for a low (k = 20) as well as a high value of k (k = 100). Most research on top-k recommendation systems focuses on the very top of the recommendation list (k = 5, 10, 20). However, Valcarce et al. [42] showed that top-k ranking metrics measured at higher values of k (k = 100, 200) are especially robust to biases such as sparsity and popularity biases. The sparsity bias refers to the lack of known relevance for all the user-item pairs, while the popularity bias is the tendency of popular items to receive more user feedback, so missing user-item interactions are not missing at random. We are especially interested in preventing popularity bias, since we want to recommend not only from the artists that each user commonly purchases from: we aim at promoting novelty as well as discovery of relevant art from newcomer artists.

4.3 Baselines
The methods used in the evaluation are the following:
(1) CuratorNet: The method described in this paper. We test it with four regularization values, λ = {0, .01, .001, .0001}.
(2) VBPR [20]: The state-of-the-art model. We used the same embedding size as in CuratorNet (200), we optimized it until convergence on the training set, and we also tested the four regularization values λ = {0, .01, .001, .0001}.
(3) VisRank [22, 32]: A simple memory-based content filtering method that ranks a candidate painting i for a user u based on the maximum cosine similarity with some existing item in the user profile j ∈ P_u, i.e.,

    score(u, i) = max_{j ∈ P_u} cosine(i, j)    (9)
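Equation (9) can be computed directly on the visual embeddings. A small NumPy sketch of the VisRank score, assuming embeddings are stored as rows and cosine similarity is obtained by L2-normalizing them:

```python
import numpy as np

def visrank_scores(profile_embs: np.ndarray, candidate_embs: np.ndarray) -> np.ndarray:
    """score(u, i) = max_{j in P_u} cosine(i, j)  (Equation 9).

    profile_embs   -- embeddings of the items in the user profile P_u, shape (m, d)
    candidate_embs -- embeddings of the candidate items, shape (n, d)
    """
    p = profile_embs / np.linalg.norm(profile_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    cos = c @ p.T                 # (n, m) cosine similarities
    return cos.max(axis=1)        # best match over the profile
```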
5 RESULTS AND DISCUSSION
In Table 2 we can see the results comparing all methods. As a reference, the top row presents an oracle (perfect ranking) and the bottom row a random recommender. Notice that the AUC of a random recommender should theoretically be 0.5 (sorting pairs of items given a user), so the obtained AUC = .4973 serves as a sanity check. In terms of AUC, Recall@100, and Precision@100, CuratorNet with a small regularization (λ = .0001) is the top model among all methods. We highlight the following points from these results:

Table 2: Results for all methods, sorted by AUC performance. The top five results are highlighted for each metric. For reference, the bottom row presents a random recommender, while the top row presents the results of a perfect Oracle.

Method      λ (L2 Reg.)  AUC     R@20    P@20    nDCG@20  R@100   P@100   nDCG@100
Oracle      –            1.0000  1.0000  .0655   1.0000   1.0000  .0131   1.0000
CuratorNet  .0001        .7204   .1683   .0106   .0966    .3200   .0040   .1246
CuratorNet  .001         .7177   .1566   .0094   .0895    .2937   .0037   .1160
VisRank     –            .7151   .1521   .0093   .0956    .2765   .0034   .1195
CuratorNet  0            .7131   .1689   .0100   .0977    .3048   .0038   .1239
CuratorNet  .01          .7125   .1235   .0075   .0635    .2548   .0032   .0904
VBPR        .0001        .6641   .1368   .0081   .0728    .2399   .0030   .0923
VBPR        0            .6543   .1287   .0078   .0670    .2077   .0026   .0829
VBPR        .001         .6410   .0830   .0047   .0387    .1948   .0024   .0620
VBPR        .01          .5489   .0101   .0005   .0039    .0506   .0006   .0118
Random      –            .4973   .0103   .0006   .0041    .0322   .0005   .0098

• CuratorNet, with a small regularization λ = .0001, outperforms the other methods in five metrics (AUC, Precision@20, Recall@100, Precision@100 and nDCG@100), while it stands second in Recall@20 and nDCG@20 against the non-regularized version of CuratorNet. This implies that CuratorNet overall ranks very well at top positions, and is especially robust against sparsity and popularity bias [42]. In addition, CuratorNet seems robust to changes in the regularization hyperparameter.
• Compared to VBPR, CuratorNet is better in all seven metrics (AUC, Recall@20, Precision@20, nDCG@20, Recall@100, Precision@100 and nDCG@100). Notably, it is also more robust to the regularization hyperparameter λ than VBPR. We think that this is explained in part by the characteristics of the dataset: VBPR exploits non-visual co-occurrence patterns, but in our dataset this signal provides rather little preference information, since almost 80% of items and transactions are one-of-a-kind.
• VisRank presents very competitive results, especially in terms of AUC, nDCG@20 and nDCG@100, performing better than VBPR in this highly one-of-a-kind dataset. However, CuratorNet performs better than VisRank in all metrics. This provides evidence that the model-based approach of CuratorNet, which aggregates user preferences into a single embedding, is better than the heuristic-based scoring of VisRank.

5.1 Effect of Sampling Guidelines
We studied the effect of using our sampling guidelines for building the training set D_S, compared to the traditional BPR setting where negative samples j are sampled uniformly at random from the set of items not observed by the user, i.e., I \ I_u^+. In the case of CuratorNet we use all six sampling guidelines (D_{S_1} – D_{S_6}), while for VBPR we only used two sampling guidelines (D_{S_3} and D_{S_4}), since VBPR has no notion of sessions or purchase baskets in its original formulation, and it has more parameters than CuratorNet to model collaborative non-visual latent preferences. We tested AUC for both CuratorNet and VBPR, under their best-performing regularization parameter λ, with and without our sampling guidelines. Notice that the results in Table 2 all consider the use of our sampling guidelines. After conducting pairwise t-tests, we found a significant improvement in CuratorNet and VBPR, as shown in Figure 3.

Figure 3: The sampling guidelines had a positive effect on AUC compared to random negative sampling for building the BPR training set.

CuratorNet with sampling guidelines (AUC = .7204) had a significant improvement over CuratorNet with random negative sampling (AUC = .6602), p = 7.7·10^-5. Likewise, VBPR with guidelines (AUC = .6641) had a significant improvement compared with VBPR with random sampling (AUC = .5899), p = 1.6·10^-6. With this result, we conclude that the proposed sampling guidelines help in selecting better triplets for more effective learning in our art image recommendation setting.
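As a hedged illustration of this evaluation: with a single held-out positive item per user, the per-user AUC is the fraction of non-interacted items ranked below that item, and method comparisons like the one above can be checked with a paired t-test over per-user AUC values. The paper does not fully specify its statistical setup, so the following SciPy-based sketch is only our reading of it.

```python
import numpy as np
from scipy import stats

def user_auc(pos_score: float, neg_scores: np.ndarray) -> float:
    """Per-user AUC with one held-out positive item: fraction of negative
    items ranked strictly below it (ties counted as half)."""
    wins = (pos_score > neg_scores).sum() + 0.5 * (pos_score == neg_scores).sum()
    return wins / len(neg_scores)

# Paired comparison of two methods over the same users, e.g. CuratorNet with
# and without the sampling guidelines (aucs_a, aucs_b are per-user arrays):
# t_stat, p_value = stats.ttest_rel(aucs_a, aucs_b)
```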
6 CONCLUSION
In this article we have introduced CuratorNet, an art image recommender system based on neural networks. The learning model of CuratorNet is inspired by VBPR [20], but it incorporates additional aspects such as layers with shared weights, and it works especially well in situations with one-of-a-kind items, i.e., items which disappear from the inventory once consumed, making it difficult to use traditional collaborative filtering. Notice that an important contribution of this article is the data shared, since we could not find on the internet any other dataset of user transactions over physical paintings. We have anonymized the user and item IDs and we have provided ResNet visual embeddings to help other researchers build and validate models with these data. Our model outperforms the state-of-the-art VBPR as well as other simple but strong baselines such as VisRank [22, 32]. We also introduce a series of guidelines for sampling triples for the BPR training set, and we show significant improvements in performance of both CuratorNet and VBPR versus traditional random sampling of negative instances.

Future Work. Among our ideas for future work, we will test our neural architecture using end-to-end learning, in a similar fashion to [22], who used a light model called CNN-F to replace the pre-trained AlexNet visual embeddings. Another idea we will test is to create explanations for our recommendations based on low-level (textures) and high-level (objects) visual features which some recent research is able to identify from CNNs, such as the Network Dissection approach by Bau et al. [4]. Also, we will explore ideas from the research on image style transfer [15, 16], which might help us to identify styles and then use this information as context to produce style-aware recommendations. Another interesting idea for future work is integrating multitask learning into our framework, as in the recently published paper on the newest YouTube recommender [45]. Finally, from a methodological point of view, we will test other datasets with likes rather than purchases, since we aim at understanding how the model behaves under a different type of user relevance feedback.

ACKNOWLEDGMENTS
This work has been supported by the Millennium Institute for Foundational Research on Data (IMFD) and by the Chilean research agency ANID, FONDECYT grant 1191791.

REFERENCES
[1] S. Akçay, M. E. Kundegorski, M. Devereux, and T. P. Breckon. 2016. Transfer learning using convolutional neural networks for object classification within X-ray baggage security imagery. In Proceedings of the IEEE International Conference on Image Processing (ICIP). 1057–1061.
[2] LM Aroyo, Y Wang, R Brussee, Peter Gorgels, LW Rutledge, and N Stash. 2007. Personalized museum experience: The Rijksmuseum use case. In Proceedings of Museums and the Web.
[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 12 (2017), 2481–2495.
[4] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6541–6549.
[5] Idir Benouaret and Dominique Lenne. 2015. Personalizing the museum experience through context-aware recommendations. In 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 743–748.
[6] Y-Lan Boureau, Jean Ponce, and Yann LeCun. 2010. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 111–118.
[7] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. 2005. Learning a similarity metric discriminatively, with application to face verification. In CVPR (1). 539–546.
[8] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[9] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 39–46.
[10] Felipe del Rio, Pablo Messina, Vicente Dominguez, and Denis Parra. 2018. Do Better ImageNet Models Transfer Better... for Image Recommendation?. In 2nd Workshop on Intelligent Recommender Systems by Knowledge Transfer and Learning. https://arxiv.org/abs/1807.09870
[11] Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and Massimo Quadrana. 2016. Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics 5, 2 (2016), 99–113.
[12] Jingtao Ding, Fuli Feng, Xiangnan He, Guanghui Yu, Yong Li, and Depeng Jin. 2018. An improved sampler for Bayesian personalized ranking by leveraging view data. In Companion Proceedings of The Web Conference 2018. International World Wide Web Conferences Steering Committee, 13–14.
[13] Mehdi Elahi, Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Leonardo Cella, Stefano Cereda, and Paolo Cremonesi. 2017. Exploring the Semantic Gap for Movie Recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems (Como, Italy) (RecSys '17). 326–330.
[14] David Elsweiler, Christoph Trattner, and Morgan Harvey. 2017. Exploiting food choice biases for healthier recipe recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 575–584.
[15] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2414–2423.
[16] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. 2017. Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830 (2017).
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[19] Ruining He, Chen Fang, Zhaowen Wang, and Julian McAuley. 2016. Vista: A Visually, Socially, and Temporally-aware Model for Artistic Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys '16). 309–316.
[20] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 144–150.
[21] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[22] Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. 2017. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 207–216.
[23] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
[24] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing neural networks. In Advances in Neural Information Processing Systems. 971–980.
[25] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
[26] Simon Kornblith, Jonathon Shlens, and Quoc V Le. 2018. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974 (2018).
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems 25 (NIPS). 1097–1105.
[28] Marco La Cascia, Saratendu Sethi, and Stan Sclaroff. 1998. Combining textual and visual cues for content-based image retrieval on the world wide web. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries. 24–28.
[29] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. 2016. Comparative Deep Learning of Hybrid Representations for Image Recommendations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2545–2553.
[30] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[31] L. McInnes, J. Healy, and J. Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints (Feb. 2018). arXiv:1802.03426 [stat.ML]
[32] Pablo Messina, Vicente Dominguez, Denis Parra, Christoph Trattner, and Alvaro Soto. 2018. Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features. User Modeling and User-Adapted Interaction (2018), 1–40.
[33] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
[34] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[35] Yong Rui, Thomas S Huang, Michael Ortega, and Sharad Mehrotra. 1998. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 8, 5 (1998), 644–655.
[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
[37] Jose San Pedro and Stefan Siersdorfer. 2009. Ranking and Classifying Attractiveness of Photos in Folksonomies. In Proceedings of the 18th International Conference on World Wide Web (Madrid, Spain) (WWW '09). 771–780.
[38] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 815–823.
[39] Giovanni Semeraro, Pasquale Lops, Marco De Gemmis, Cataldo Musto, and Fedelucio Narducci. 2012. A folksonomy-based recommender system for personalized access to digital artworks. Journal on Computing and Cultural Heritage (JOCCH) 5, 3 (2012), 11.
[40] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 806–813.
[41] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
[42] Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-N recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 260–268.
[43] Egon L van den Broek, Thijs Kok, Theo E Schouten, and Eduard Hoenkamp. 2006. Multimedia for art retrieval (m4art). In Multimedia Content Analysis, Management, and Retrieval 2006, Vol. 6073. International Society for Optics and Photonics, 60730Z.
[44] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1386–1393.
[45] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending What Video to Watch Next: A Multitask Ranking System. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys '19). ACM, New York, NY, USA, 43–51. https://doi.org/10.1145/3298689.3346997