<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CuratorNet: Visually-aware Recommendation of Art Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pablo Messina∗</string-name>
          <email>pamessina@uc.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Cartagena</string-name>
          <email>micartagena@uc.cl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patricio Cerda</string-name>
          <email>pcerdam@uc.cl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felipe del Rio†</string-name>
          <email>ifdelrio@uc.cl</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Parra‡</string-name>
          <email>dparra@ing.puc.cl</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pontificia Universidad Católica</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontificia Universidad Católica</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Pontificia Universidad Católica</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Pontificia Universidad Católica</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Pontificia Universidad Católica</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Also with Millennium Institute Foundational Research on Data, IMFD.</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although there are several visually-aware recommendation models in domains like fashion or even movies, the art domain lacks the same level of research attention, despite the recent growth of the online artwork market. To reduce this gap, in this article we introduce CuratorNet, a neural network architecture for visually-aware recommendation of art images. CuratorNet is designed at the core with the goal of maximizing generalization: the network has a fixed set of parameters that only need to be trained once, and thereafter the model is able to generalize to new users or items never seen before, without further training. This is achieved by leveraging visual content: items are mapped to item vectors through visual embeddings, and users are mapped to user vectors by aggregating the visual content of items they have consumed. Besides the model architecture, we also introduce novel triplet sampling strategies to build a training set for rank learning in the art domain, resulting in more effective learning than naive random sampling. With an evaluation over a real-world dataset of physical paintings, we show that CuratorNet achieves the best performance among several baselines, including the state-of-the-art model VBPR. CuratorNet is motivated and evaluated in the art domain, but its architecture and training scheme could be adapted to recommend images in other areas.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Recommender systems; •
Computing methodologies → Machine learning approaches; •
Applied computing → Media arts.</p>
      <p>Keywords: recommender systems, neural networks, visual art</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        The big revolution of deep convolutional neural networks (CNN)
in the area of computer vision for tasks such as image classification
[
        <xref ref-type="bibr" rid="ref18 ref27 ref41">18, 27, 41</xref>
        ], object recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], image segmentation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or scene
identification [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] has reached the area of image recommender
systems in recent years [
        <xref ref-type="bibr" rid="ref19 ref20 ref22 ref29 ref30 ref32">19, 20, 22, 29, 30, 32</xref>
        ]. These works use neural
visual embeddings to improve the recommendation performance
compared to previous approaches for image recommendation based
on ratings and text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], social tags [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ], context [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and manually
crafted visual features [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ]. Regarding application domains of recent
image recommendation methods using neural visual embeddings,
to the best of our knowledge most of them focus on fashion
recommendation [
        <xref ref-type="bibr" rid="ref20 ref22 ref30">20, 22, 30</xref>
        ], a few on art recommendation [
        <xref ref-type="bibr" rid="ref19 ref32">19, 32</xref>
        ] and
photo recommendation [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. He et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposed Vista, a model
combining neural visual embeddings, collaborative filtering as well
as temporal and social signals for digital art recommendation.
      </p>
      <p>
        However, digital art projects can differ significantly from
physical art (paintings and photographs). Messina et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] study
recommendation of paintings in an online art store using a simple k-NN
model based on neural visual features and metadata. Although
memory-based models perform fairly well, model-based methods
using neural visual features report better performance [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ] in
the fashion domain, indicating room for improvement in this area,
considering the growing sales in the global online artwork market¹.
¹https://www.artsy.net/article/artsy-editorial-global-art-market-reached-674billion-2018-6
      </p>
      <p>
        The most popular model-based method for image
recommendation using neural visual embeddings is VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a state-of-the-art
model that integrates implicit feedback collaborative filtering with
neural visual embeddings into a Bayesian Personalized Ranking
(BPR) learning framework [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. VBPR performs well, but it has
some drawbacks. VBPR learns a latent embedding for each user
and for each item, so new users cannot receive suggestions and
new items cannot be recommended until re-training is carried out.
An alternative is training a model such as Youtube’s Deep Neural
Recommender [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which allows recommending to new users with
little preference feedback and without additional model training.
However, Youtube’s model was trained on millions of user
transactions and with large amounts of profile and contextual data, so it
does not easily fit datasets that are small, with little user feedback
or with little contextual and profile data.
      </p>
      <p>
        In this work, we introduce a neural network for visually-aware
recommendation of images focused on visual art named
CuratorNet, whose general structure can be seen in Figure 1. CuratorNet
leverages neural image embeddings as those obtained from CNNs
[
        <xref ref-type="bibr" rid="ref18 ref27 ref41">18, 27, 41</xref>
        ] pre-trained on the Imagenet dataset (ILSVRC [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]). We
train CuratorNet for ranking with triplets (P_u, i^+, i^-), where P_u is
the history of image preferences of a user u, whereas i^+ and i^- are
a pair of images with higher and lower preference, respectively.
CuratorNet draws inspiration from VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and Youtube’s
Recommender System [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] inspired us to leverage pre-trained
image embeddings as well as optimizing the model for ranking as in
BPR [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. From the work of Covington et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] we took the idea
of designing a deep neural network that can generalize to new users
without introducing new parameters or further training (unlike
VBPR which needs to learn a latent user vector for each new user).
As a result, CuratorNet can recommend to new users with very
little feedback and without additional training. CuratorNet's deep
neural network is trained for personalized ranking using triplets,
and the architecture contains a set of layers with shared weights,
inspired by models using triplet loss for non-personalized image
ranking [
        <xref ref-type="bibr" rid="ref38 ref44">38, 44</xref>
        ]. In these works, a single image represents the input
query, but in our case, the input query is a set of images representing
a user's preference history, P_u. In summary, compared to previous
works, our main contributions are:
• a novel neural-based visually-aware architecture for image
recommendation,
• a set of sampling guidelines for the creation of the training
dataset (triplets), which improve the performance of
CuratorNet as well as VBPR with respect to random negative
sampling, and
• presenting a thorough evaluation of the method against
competitive state-of-the-art methods (VisRank [
        <xref ref-type="bibr" rid="ref22 ref32">22, 32</xref>
        ] and
VBPR[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]) on a dataset of purchases of physical art (paintings
and photographs).
      </p>
      <p>We also share the dataset² of user transactions (with hashed
user and item IDs due to privacy requirements) as well as visual
embeddings of the paintings' image files. One aspect to highlight
about this research is that although the triplets' sampling guidelines
to build the BPR training set apply specifically to visual art, the
architecture of CuratorNet can be used in other visual domains for
image recommendation.
²https://drive.google.com/drive/folders/1Dk7_BRNtN_IL8r64xAo6GdOYEycivtLy</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>In this section we provide an overview of relevant related work,
considering Artwork Recommender Systems (2.1) and Visually-aware
Recommender Systems (2.2), as well as highlights of what
differentiates our work from the existing literature (2.3).</p>
    </sec>
    <sec id="sec-4">
      <title>Artwork Recommender Systems</title>
      <p>
        With respect to artwork recommender systems, one of the first
contributions was the CHIP Project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The aim of the project
was to build a recommendation system for the Rijksmuseum. The
project used traditional techniques such as content-based filtering
based on metadata provided by experts, as well as collaborative
filtering based on users' ratings. Another similar but
non-personalized system was presented by Van den Broek et al. [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ], who used
color histograms to retrieve similar art images given a painting as
input query.
      </p>
      <p>
        Another important contribution is the work by Semeraro et al.
[
        <xref ref-type="bibr" rid="ref39">39</xref>
        ], who introduced an artwork recommender system called FIRSt
(Folksonomy-based Item Recommender syStem) which utilizes
social tags given by experts and non-experts of over 65 paintings of
the Vatican picture gallery. They did not employ visual features
among their methods. Benouaret et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] improved the
state-of-the-art in artwork recommender systems using context obtained
through a mobile application, with the aim of making museum tour
recommendations more useful. Their content-based approach used
ratings given by the users during the tour and metadata from the
artworks rated, e.g. title or artist names.
      </p>
      <p>
        Finally, the most recent works use neural image embeddings
[
        <xref ref-type="bibr" rid="ref19 ref32">19, 32</xref>
        ]. He et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] propose the system Vista, which addresses
digital artwork recommendations based on pre-trained deep neural
visual features, as well as temporal and social data. On the other
hand, Messina et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] address the recommendation of
one-of-a-kind physical paintings, comparing the performance of metadata,
manually-curated visual features, and neural visual embeddings.
Messina et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] recommend to users by computing a simple
K-NN based similarity score among users’ purchased paintings and
the paintings in the dataset, a method that Kang et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] call
VisRank.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Visually-aware Image Recommender Systems</title>
      <p>In this section we survey works using visual features to
recommend images. We also cite a few works using visual information to
recommend non-image items, though these are less relevant to the
present research.</p>
      <p>
        Manually-engineered visual features extracted from images
(texture, sharpness, brightness, etc.) have been used in several tasks
for information filtering, such as retrieval [
        <xref ref-type="bibr" rid="ref28 ref35 ref43">28, 35, 43</xref>
        ] and ranking
[
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. More recently, interesting results have been shown for the
use of low-level handcrafted stylistic visual features automatically
extracted from video frames for content-based video
recommendation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Even better results are obtained when both stylistic
visual features and annotated metadata are combined in a hybrid
recommender, as shown in the work of Elahi et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In a
visually-aware setting not related to recommending images, Elsweiler et al.
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] used manually-crafted attractiveness visual features [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], in
order to recommend healthy food recipes to users.
      </p>
      <p>
        Another branch of visually-aware image recommender systems
focuses on using neural embeddings to represent images [
        <xref ref-type="bibr" rid="ref19 ref20 ref22 ref29 ref32">19, 20,
22, 29, 32</xref>
        ]. The computer vision community has a large track of
successful systems based on neural networks for several tasks
[
        <xref ref-type="bibr" rid="ref1 ref18 ref27 ref3 ref40 ref41">1, 3, 18, 27, 40, 41</xref>
        ]. This trend started from the outstanding
performance of the AlexNet [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] in the Imagenet Large Scale Visual
Recognition challenge (ILSVRC [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]), but the most notable
implication is that the neural image embeddings have shown impressive
performance for transfer learning, i.e., for tasks different from the
original one [
        <xref ref-type="bibr" rid="ref10 ref26">10, 26</xref>
        ]. Usually these neural image embeddings are
obtained from CNN models such as AlexNet [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], VGG [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] and
ResNet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], among others. Motivated by these results, McAuley et
al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] introduced an image-based recommendation system based
on styles and substitutes for clothing using visual embeddings
pretrained on a large-scale dataset obtained from Amazon.com. Later,
He et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] went further in this line of research and introduced
a visually-aware matrix factorization approach that incorporates
visual signals (from a pre-trained CNN) into predictors of people’s
opinions, called VBPR. Their training model is based on Bayesian
Personalized Ranking (BPR), a model previously introduced by
Rendle et al. [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ].
      </p>
      <p>
        The next work by He et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] deals with visually-aware digital
art recommendation, building a model called Vista which combines
ratings, temporal and social signals and visual features.
      </p>
      <p>
        Another relevant work was the research by Lei et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] who
introduced comparative deep learning for hybrid image
recommendation. In this work, they use a siamese neural network architecture
for making recommendations of images using user information
(such as demographics and social tags) as well as images in pairs
(one liked, one disliked) in order to build a ranking model. The
approach is interesting, but they work with Flickr photos, not artwork
images, and use social tags, not present in our problem setting. The
work by Kang et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] expands VBPR but they focus on
generating images using Generative adversarial networks [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] rather than
recommending, with an application in the fashion domain. Finally,
Messina et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] was already mentioned, but we can add that their
neural image embeddings outperformed other visual
(manually-extracted) and metadata features for ranking, with the exception of
the metadata given by the user's favorite artist, which predicted even
better than neural embeddings for top-k recommendation.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Differences from Previous Research</title>
      <p>
        Almost all the surveyed articles on artwork recommendation have
in common that they used standard techniques such as
collaborative filtering and content-based filtering, as well as
manually-curated visual image features, but only the most recent works have
exploited visual features extracted from CNNs [
        <xref ref-type="bibr" rid="ref19 ref32">19, 32</xref>
        ]. In
comparison to these works, we introduce a model-based approach (unlike
the memory-based VisRank method by Messina et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ])
which can recommend cold-start items and serve cold-start users without additional
model training (unlike [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]). With regards to more general work
on visually-aware image recommender systems, almost all of the
surveyed articles have focused on tasks different from art
recommendation, such as fashion recommendation [
        <xref ref-type="bibr" rid="ref20 ref22 ref30">20, 22, 30</xref>
        ], photo
[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and video recommendation [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Only Vista, the work by He et
al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], resembles ours in terms of the topic (art recommendation)
and the use of visual features. Unlike them, we evaluate our
proposed method, CuratorNet, in a dataset of physical paintings and
photographs, not only digital art. Moreover, Vista uses social and
temporal metadata which we do not have and many other datasets
might not have either. Compared to all this previous research, and
to the best of our knowledge, CuratorNet is the first architecture
for image recommendation that takes advantage of shared weights
in a triplet loss setting, an idea inspired by the results of Wang et
al. [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] and Schrof et al. [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], but here adapted to the personalized
image recommendation domain.
      </p>
    </sec>
    <sec id="sec-8">
      <title>CURATORNET</title>
    </sec>
    <sec id="sec-9">
      <title>Problem Formulation</title>
      <p>
        We approach the problem of recommending art images from user
positive-only feedback (e.g., purchase history, likes, etc.) upon visual
items (paintings, photographs, etc.). Let U and I be the sets of users
and items in a dataset, respectively. We assume only one image
per single item i ∈ I. Considering either user purchases or
likes, the set of items for which a user u has expressed positive
preference is defined as I_u^+. In this work, we considered purchases
to be positive feedback from the user. Our goal is to generate for
each user u ∈ U a personalized ranked list of the items for which
the user has not yet expressed preference, i.e., for I \ I_u^+.
The preference predictor in CuratorNet is inspired by VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a
state-of-the-art visual recommender model.
      </p>
      <p>
        However, CuratorNet has some important differences. First,
we do not use non-visual latent factors, so we remove the traditional
user and item non-visual latent embeddings. Second, we do not
learn a specific embedding per user as VBPR does; instead we learn a
joint model that, given a user's purchase/like history, outputs a
single embedding which can be used to rank unobserved artworks
in the dataset, similar to YouTube's Deep Learning network [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Another important difference between VBPR and CuratorNet is that the
former has a single matrix E to project a visual item embedding f_i
into the user latent space. In CuratorNet, we rather learn a neural
network Φ(·) to perform that projection, which receives as input
either a single image embedding f_i or a set of image embeddings
representing a user's purchase/like history P_u = {f_1, ..., f_n}. Given
all these aspects, the preference predictor of CuratorNet is given
by:
      </p>
      <p>x_{u,i} = α + β_u + Φ(P_u)ᵀ Φ(f_i)
where α is an offset, β_u represents a user bias, Φ(·) represents the
CuratorNet neural network, and P_u represents the set of visual
embeddings of the images in user u's history. After some experiments
we found no difference between using or not using a variable for the item bias
β_i, so we dropped it in order to decrease the number of parameters
(Occam's razor).</p>
      <p>
        Finally, since we calculate the model parameters using BPR [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ],
the parameters α and β_u cancel out (details in the coming subsection)
and our final preference predictor is simply
      </p>
      <p>x_{u,i} = Φ(P_u)ᵀ Φ(f_i)</p>
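      <p>To make the simplified predictor x_{u,i} = Φ(P_u)ᵀ Φ(f_i) concrete, below is a minimal sketch (not the authors' released code) of scoring one candidate item once the network is trained; the callable phi and its aggregation behavior are assumptions for illustration only.</p>
      <preformat>
# Minimal sketch of the simplified preference predictor.
# `phi` stands in for the trained CuratorNet projection network, which
# accepts either a set of history embeddings (P_u) or one item embedding.
import numpy as np

def preference_score(phi, history_embeddings, f_i):
    user_vec = phi(history_embeddings)  # aggregate the visual history P_u
    item_vec = phi(f_i)                 # project the candidate item embedding
    return float(np.dot(user_vec, item_vec))

# Ranking for user u: score every item in I \ I_u^+ and sort descending.
      </preformat>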
    </sec>
    <sec id="sec-10">
      <title>Model Learning via BPR</title>
      <p>
        We use the Bayesian Personalized Ranking (BPR) framework [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]
to learn the model parameters. Our goal is to optimize ranking by
training a model which orders triples of the form (u, i, j) ∈ D_S,
where u denotes a user, i an item with positive feedback from u,
and j an item with non-observed feedback from u. The training set
of triples D_S
      </p>
      <p>is defined as:</p>
      <p>D_S = {(u, i, j) | u ∈ U ∧ i ∈ I_u^+ ∧ j ∈ I \ I_u^+}
where I_u^+ contains the items with positive
feedback from u, while I \ I_u^+ contains those items without such positive
feedback. Considering our previously defined preference predictor
x_{u,i}, we would expect a larger preference score of u for i than for
j; BPR thus defines the difference between scores</p>
      <p>x_{u,i,j} = x_{u,i} − x_{u,j}
and then BPR aims at finding the parameters Θ which optimize
the objective function</p>
      <p>argmax_Θ Σ_{(u,i,j) ∈ D_S} ln σ(x_{u,i,j}) − λ_Θ ||Θ||²
where σ(·) is the sigmoid function, Θ includes all model
parameters, λ_Θ is a regularization hyperparameter, and σ(x_{u,i,j}) is the probability
that a user u really prefers i over j, p(i &gt;_u j | Θ) [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], calculated as</p>
      <p>p(i &gt;_u j | Θ) = σ(x_{u,i,j}) = 1 / (1 + e^{−(x_{u,i} − x_{u,j})})</p>
      <p>
        In CuratorNet, unlike BPR-MF [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] and VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], we use a
sigmoid cross-entropy loss, considering that we can interpret the
decision over triplets as a binary classification problem, where
x_{u,i,j} &gt; 0 represents class y = 1 (triple well ranked, since x_{u,i} &gt; x_{u,j})
and x_{u,i,j} ≤ 0 signifies class y = 0 (triple wrongly ranked, since
x_{u,i} ≤ x_{u,j}). Then, the CuratorNet loss can be expressed as:
      </p>
      <p>L = − Σ_{(u,i,j) ∈ D_S} [ y ln(σ(x_{u,i,j})) + (1 − y) ln(1 − σ(x_{u,i,j})) ] + λ_Θ ||Θ||²</p>
      <p>Since a margin-based ranking loss would
need to optimize an additional margin parameter, we chose the
sigmoid cross-entropy for CuratorNet. We
reduce the loss function L by stochastic gradient descent with the
Adam optimizer [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], using the implementation in TensorFlow.
      </p>
      <p>During each iteration of stochastic gradient descent, we sample a
user u, a positive item i ∈ I_u^+ (which is then removed from P_u), a negative item
j ∈ I \ I_u^+, and user u's purchase/like history with item i removed,
i.e., P_u \ {i}.</p>
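      <p>As an illustration, a minimal TensorFlow sketch of this triplet loss follows (not the authors' released code): since every sampled triple has label y = 1, the sigmoid cross-entropy over the score difference reduces to −ln σ(x_{u,i,j}), matching the BPR objective.</p>
      <preformat>
# Sigmoid cross-entropy over triplet score differences (sketch).
import tensorflow as tf

def triplet_loss(x_ui, x_uj):
    """x_ui, x_uj: 1-D tensors of scores for positive and negative items."""
    x_uij = x_ui - x_uj            # score difference per sampled triple
    labels = tf.ones_like(x_uij)   # every sampled triple has class y = 1
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=x_uij))
      </preformat>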
    </sec>
    <sec id="sec-11">
      <title>Model Architecture</title>
      <p>
        The architecture of the CuratorNet neural network is summarized in
Figure 1.
      </p>
      <p>
        Notice that in this article we used a pre-trained ResNet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to
obtain the image visual features, but the model could use other
CNNs such as AlexNet [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], VGG [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ], etc. We chose ResNet since
it has performed the best in transfer learning tasks [
        <xref ref-type="bibr" rid="ref10 ref26">10, 26</xref>
        ].
      </p>
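      <p>For reference, a minimal sketch of extracting such visual embeddings with a pre-trained ResNet50 (2,048-dimensional average-pooled features) is shown below; the image path is hypothetical and this is not the authors' extraction pipeline.</p>
      <preformat>
# Extract a 2,048-d ResNet50 embedding for one image (sketch).
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), 0))
    return model.predict(x)[0]  # shape (2048,)
      </preformat>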
    </sec>
    <sec id="sec-12">
      <title>Data Sampling for Training</title>
      <p>
        The original BPR article [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] suggests the creation of training triples
(u, i^+, i^-) simply by, given a user u, randomly sampling a positive
element i^+ among those consumed, as well as sampling a negative
feedback element i^- among those not consumed. However, subsequent
research has shown that there are more effective ways to create
these training triples [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In our case, we define some guidelines to
sample triples for the training set based on analyses from previous
studies indicating features which provide signals of user preference.
For instance, Messina et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] showed that people are very likely
to buy several artworks with similar visual themes, as well as from
the same artist, so we used visual clusters and the user's favorite artist
to set some of these sampling guidelines.
      </p>
      <p>
        Creating Visual Clusters. Some of the sampling guidelines
are based on visual similarity of the items, and although we have
some metadata for the images in the dataset, there is a significant
number of missing values: only 45% of the images have information
about subject (e.g., architecture, nature, travel) and 53% about style
(e.g., abstract, surrealism, pop art). For this reason, we conduct a
clustering of images based on their visual representation, in such a
way that items with visual embeddings that are too similar will not
be used to sample positive/negative pairs (i^+, i^-). To obtain these
visual clusters, we followed this procedure: (i) conduct
a Principal Component Analysis to reduce the dimensionality of
the image embedding vectors from R^2048 to R^200; (ii) perform k-means
clustering with 100 clusters. We ran k-means clustering 20
times, and each time we calculated the Silhouette coefficient [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]
(an intrinsic metric of clustering quality), keeping the clustering
with the highest Silhouette value. Finally, (iii) we assign
each image the label of its respective visual cluster. Samples of our
clusters in a 2-dimensional projection map of images, built with
the UMAP method [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], can be seen in Figure 2.
      </p>
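      <p>A minimal sketch of this clustering step with scikit-learn follows (an illustration, not the authors' code); the array embeddings of shape (n_items, 2048) is assumed given.</p>
      <preformat>
# PCA to 200 dims, then the best of 20 k-means runs (k = 100) by Silhouette.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

reduced = PCA(n_components=200).fit_transform(embeddings)

best_labels, best_score = None, -1.0
for seed in range(20):
    labels = KMeans(n_clusters=100, random_state=seed).fit_predict(reduced)
    score = silhouette_score(reduced, labels)  # intrinsic clustering quality
    if score > best_score:
        best_labels, best_score = labels, score
# best_labels[i] is the visual-cluster label assigned to image i.
      </preformat>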
      <p>Guidelines for sampling triples. We generate the training
set D_S as the union of multiple disjoint⁴ training sets, each one
generated with a different strategy in mind (a sampling sketch for
guideline (1) follows the list). These strategies and
their corresponding training sets are:
(1) Removing item from purchase basket, and predicting this
missing item.
(2) Sort items purchased sequentially, and then predict next
purchase in basket.
(3) Recommending visually similar artworks from the favorite
artists of a user.
(4) Recommending profile items from the same user profile.
(5) Create an artificial user profile of a single item purchased,
and recommending profile items given this artificially
created user profile.
(6) Create artificial profile with a single item, then recommend
visually similar items from the same artist.</p>
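      <p>A hedged sketch of guideline (1) follows; the data structures (baskets, all_items, cluster_of) are hypothetical stand-ins for illustration, not the authors' implementation.</p>
      <preformat>
# Guideline (1): hide one item from a purchase basket and predict it.
import random

def sample_triple(user, baskets, all_items, cluster_of):
    basket = random.choice(baskets[user])
    pos = random.choice(basket)                # hidden positive item i^+
    profile = [i for i in basket if i != pos]  # user history with i^+ removed
    # negative: an item outside this basket, from a different visual cluster
    candidates = [j for j in all_items
                  if j not in basket and cluster_of[j] != cluster_of[pos]]
    return profile, pos, random.choice(candidates)
      </preformat>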
      <p>Finally, the training set D_S is formally defined as:</p>
      <p>D_S = ⋃_{s=1}^{6} D_S^(s)</p>
      <p>In practice, we uniformly sample about 10 million training triples,
distributed uniformly among the six training sets D_S^(s). Likewise,
we sample about 300,000 validation triples. To avoid sampling
identical triples, we hash them and compare the hashes to check for
potential collisions. Before sampling the training and validation
sets, we hide the last purchase basket of each user, using these baskets later
on for testing.</p>
      <p>For our experiments we used a dataset where the user preference is
in the form of purchases of physical art (paintings and photographs).
This private dataset was collected and shared by an online art store.
The dataset consists of 2,378 users, 6,040 items (paintings and
photographs) and 5,336 purchases. On average, each user bought
2-3 items. One important aspect of this dataset is that paintings are
one-of-a-kind, i.e., there is a single instance of each item, and once it
is purchased, it is removed from the inventory. Since most of the items
in the dataset are one-of-a-kind paintings (78%) and most purchase
transactions have been made over these items (81.7%), a method
relying on a collaborative filtering model might suffer in performance,
since user co-purchases are only possible on photographs. Another
notable aspect of the dataset is that each item has a single creator
(artist). In this dataset there are 573 artists, who have uploaded
10.54 items on average to the online art store.</p>
      <p>
        The dataset⁵ with transaction tuples (user, item), as well as the
tuples used for testing (the last purchase of each user with at least
two purchases), is available for replicating our results as well as
for training other models. Due to copyright restrictions we cannot
share the original image files, but we share the embeddings of the
images obtained with ResNet50 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
⁴Theoretically, these training sets are not perfectly disjoint, but in practice we hash
all training triples and make sure no two training triples have the same hash. This
prevents duplicates from being added to the final training set.
⁵https://drive.google.com/drive/folders/1Dk7_BRNtN_IL8r64xAo6GdOYEycivtLy
In order to build and test the models, we split the data into train,
validation and test sets. To make sure that we could make
recommendations for all cases in the test set, and thus make a fair
comparison among recommendation methods, we check that
every user considered in the test set was also present in the training
set. All baseline methods were trained on the training set with
hyperparameters tuned with the validation set.
      </p>
      <p>Next, the trained models are used to report performance over
different metrics on the test set. For the dataset, the test set consists
of the last transaction from every user that purchased at least twice,
the rest of previous purchases are used for train and validation.</p>
      <p>
        Metrics. To measure the results we used several metrics: AUC
(also used in [
        <xref ref-type="bibr" rid="ref19 ref20 ref22">19, 20, 22</xref>
        ]), normalized discounted cumulative gain
(nDCG@k)[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], as well as Precision@k and Recall@k [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Although
it might seem counter-intuitive, we calculate these metrics for a
low value (k = 20) as well as a high value of k (k = 100). Most research on
top-k recommendation systems focuses on the very top of the
recommendation list (k = 5, 10, 20). However, Valcarce et al. [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] showed
that top-k ranking metrics measured at higher values of k (k = 100,
200) are especially robust to biases such as sparsity and popularity
biases. The sparsity bias refers to the lack of known relevance for
all the user-item pairs, while the popularity bias is the tendency
of popular items to receive more user feedback, so missing
user-item interactions are not missing at random. We are especially interested in
preventing popularity bias since we want to recommend not only
from the artists that each user commonly purchases from. We
aim at promoting novelty as well as the discovery of relevant art from
newcomer artists.
      </p>
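      <p>As a reference for the pairwise metric, a minimal sketch of per-user AUC (the fraction of correctly ordered positive/negative item pairs) is shown below; it is an illustration consistent with the usage in [19, 20, 22], not the authors' evaluation code.</p>
      <preformat>
# Per-user AUC: fraction of (positive, negative) pairs ranked correctly.
import numpy as np

def user_auc(pos_scores, neg_scores):
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return float(np.mean([p > n for p, n in pairs]))
      </preformat>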
    </sec>
    <sec id="sec-13">
      <title>Baselines</title>
      <p>
        The methods used in the evaluation are the following:
(1) CuratorNet: The method described in this paper. We tested
it with four regularization values λ ∈ {0, 0.01, 0.001, 0.0001}.
(2) VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]: The state-of-the-art model. We used the same embedding
size as in CuratorNet (200), we optimized it until convergence on
the training set, and we also tested the four regularization values
λ ∈ {0, 0.01, 0.001, 0.0001}.
(3) VisRank [
        <xref ref-type="bibr" rid="ref22 ref32">22, 32</xref>
        ]: This is a simple memory-based content
filtering method that ranks a candidate painting i for a user u
based on the maximum cosine similarity with some existing
item in the user profile j ∈ P_u, i.e.,
      </p>
      <p>score(u, i) = max_{j ∈ P_u} cos(f_i, f_j)</p>
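      <p>A minimal numpy sketch of this scoring rule follows (an illustration, not the original implementation); f_i is the candidate's visual embedding and profile is the list of embeddings of items in P_u.</p>
      <preformat>
# VisRank: score a candidate by its max cosine similarity to the profile.
import numpy as np

def visrank_score(f_i, profile):
    P = np.stack(profile)  # shape (|P_u|, d)
    sims = (P @ f_i) / (np.linalg.norm(P, axis=1) * np.linalg.norm(f_i))
    return float(sims.max())
      </preformat>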
    </sec>
    <sec id="sec-14">
      <title>RESULTS AND DISCUSSION</title>
      <p>In Table 2, we can see the results comparing all methods. As
reference, at the top rows we present an oracle (perfect ranking), and
in the bottom row a random recommender. Notice that the AUC of a
random recommender should theoretically be 0.5 (sorting pairs of
items given a user), so the AUC = 0.4973 serves as a sanity check. In terms
of AUC, Recall@100, and Precision@100, CuratorNet with a small
regularization (λ = 0.0001) is the top model among all methods.
We highlight the following points from these results:
• CuratorNet, with a small regularization λ = 0.0001, outperforms
the other methods in five metrics (AUC, Precision@20, Recall@100,</p>
      <p>
        Precision@100 and nDCG@100), while it stands second in
Recall@20 and nDCG@20 against the non-regularized version of
CuratorNet. This implies that CuratorNet overall ranks very well
at top positions, and is especially robust against sparsity and
popularity bias [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]. In addition, CuratorNet seems robust to changes
in the regularization hyperparameter.
• Compared to VBPR, CuratorNet is better in all seven metrics (AUC,
Precision@20, Recall@20, nDCG@20, Precision@100, Recall@100 and nDCG@100).
Notably, it is also more robust to the regularization
hyperparameter λ than VBPR. We think that this is explained in part by
the characteristics of the dataset: VBPR exploits non-visual
co-occurrence patterns, but in our dataset this signal provides rather
little preference information, since almost 80% are
one-of-a-kind items and transactions.
• VisRank presents very competitive results, especially in terms of
AUC, nDCG@20 and nDCG@100, performing better than VBPR
in this highly one-of-a-kind dataset. However, CuratorNet performs
better than VisRank in all metrics. This provides evidence that
the model-based approach of CuratorNet that aggregates user
preferences into a single embedding is a better approach than
the heuristic-based scoring of VisRank.
We studied the effect of using our sampling guidelines for
building the training set D_S compared to the traditional BPR setting
where negative samples j are sampled uniformly at random from
the set of items unobserved by the user, i.e., I \ I_u^+. In the case of
CuratorNet we use all six sampling guidelines (D_S^(1) - D_S^(6)), while
in VBPR we only used two sampling guidelines (D_S^(3) and D_S^(4)),
since VBPR has no notion of sessions or purchase baskets in its
original formulation, and it has more parameters than CuratorNet
to model collaborative non-visual latent preferences. We tested
AUC in both CuratorNet and VBPR, under their best performance
with regularization parameter λ, with and without our sampling
guidelines. Notice that the results in Table 2 all consider the use of our
sampling guidelines. After conducting pairwise t-tests, we found
a significant improvement in both CuratorNet and VBPR.
      </p>
    </sec>
    <sec id="sec-15">
      <title>6 CONCLUSION</title>
      <p>
        In this article we have introduced CuratorNet, an art image
recommender system based on neural networks. The learning model
of CuratorNet is inspired by VBPR [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], but it incorporates some
additional aspects such as layers with shared weights, and it works
especially well in situations of one-of-a-kind items, i.e., items which
disappear from the inventory once consumed, making it difficult to
use traditional collaborative filtering. Notice that an important
contribution of this article is the data shared, since we could not
find on the internet any other dataset of user transactions over
physical paintings. We have anonymized the user and item IDs
and we have provided ResNet visual embeddings to help other
researchers build and validate models with these data.
      </p>
      <p>
        Our model outperforms state-of-the-art VBPR as well as other
simple but strong baselines such as VisRank [
        <xref ref-type="bibr" rid="ref22 ref32">22, 32</xref>
        ]. We also
introduce a series of guidelines for sampling triples for the BPR training
set, and we show significant improvements in performance of both
CuratorNet and VBPR versus traditional random sampling for
negative instances.
      </p>
      <p>
        Future Work. Among our ideas for future work, we will test
our neural architecture using end-to-end learning, in a similar
fashion to [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] who used a light model called CNN-F to replace the
pre-trained AlexNet visual embeddings. Another idea we will test is
to create explanations for our recommendations based on low-level
(textures) and high-level (objects) visual features which some
recent research is able to identify from CNNs, such as the Network
Dissection approach by Bau et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Also, we will explore ideas
from the research on image style transfer [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], which might
help us to identify styles and then use this information as context
to produce style-aware recommendations. Another interesting idea
for future work is integrating multitask learning in our framework,
such as the recently published paper on the newest Youtube
recommender [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ]. Finally, from a methodological point-of-view, we will
test other datasets with likes rather than purchases, since we aim at
understanding how the model will behave under a different type of
user relevance feedback.
      </p>
    </sec>
    <sec id="sec-16">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work has been supported by the Millennium Institute for
Foundational Research on Data (IMFD) and by the Chilean research
agency ANID, FONDECYT grant 1191791.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Akçay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Kundegorski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Devereux</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Breckon</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Transfer learning using convolutional neural networks for object classification within Xray baggage security imagery</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Image Processing (ICIP)</source>
          .
          <volume>1057</volume>
          -
          <fpage>1061</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>LM</given-names>
            <surname>Aroyo</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R</given-names>
            <surname>Brussee</surname>
          </string-name>
          , Peter Gorgels, LW Rutledge, and
          <string-name>
            <given-names>N</given-names>
            <surname>Stash</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Personalized museum experience: The Rijksmuseum use case</article-title>
          .
          <source>In Proceedings of Museums and the Web.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Badrinarayanan</surname>
          </string-name>
          , Alex Kendall, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Cipolla</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Segnet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>39</volume>
          , 12 (
          <year>2017</year>
          ),
          <fpage>2481</fpage>
          -
          <lpage>2495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>David</given-names>
            <surname>Bau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Aditya Khosla, Aude Oliva, and Antonio Torralba.
          <year>2017</year>
          .
          <article-title>Network dissection: Quantifying interpretability of deep visual representations</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6541</fpage>
          -
          <lpage>6549</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Idir</given-names>
            <surname>Benouaret</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Lenne</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Personalizing the museum experience through context-aware recommendations</article-title>
          .
          <source>In 2015 IEEE International Conference on Systems, Man, and Cybernetics</source>
          . IEEE,
          <fpage>743</fpage>
          -
          <lpage>748</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y-Lan</given-names>
            <surname>Boureau</surname>
          </string-name>
          , Jean Ponce, and Yann LeCun.
          <year>2010</year>
          .
          <article-title>A theoretical analysis of feature pooling in visual recognition</article-title>
          .
          <source>In Proceedings of the 27th international conference on machine learning (ICML-10)</source>
          .
          <fpage>111</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sumit</given-names>
            <surname>Chopra</surname>
          </string-name>
          , Raia Hadsell,
          <string-name>
            <surname>Yann LeCun</surname>
          </string-name>
          , et al.
          <year>2005</year>
          .
          <article-title>Learning a similarity metric discriminatively, with application to face verification</article-title>
          .
          <source>In CVPR (1)</source>
          .
          <fpage>539</fpage>
          -
          <lpage>546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Covington</surname>
          </string-name>
          , Jay Adams, and
          <string-name>
            <given-names>Emre</given-names>
            <surname>Sargin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep neural networks for youtube recommendations</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems</source>
          .
          <volume>191</volume>
          -
          <fpage>198</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          , Yehuda Koren, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Turrin</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Performance of recommender algorithms on top-n recommendation tasks</article-title>
          .
          <source>In Proceedings of the fourth ACM conference on Recommender systems. ACM</source>
          ,
          <volume>39</volume>
          -
          <fpage>46</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Felipe</given-names>
            <surname>del Rio</surname>
          </string-name>
          , Pablo Messina, Vicente Dominguez, and
          <string-name>
            <given-names>Denis</given-names>
            <surname>Parra</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Do Better ImageNet Models Transfer Better... for Image Recommendation?</article-title>
          .
          <source>In 2nd workshop on Intelligent Recommender Systems by Knowledge Transfer and Learning</source>
          . https://arxiv.org/abs/1807.09870
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Yashar</surname>
            <given-names>Deldjoo</given-names>
          </string-name>
          , Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Quadrana</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Content-based video recommendation system based on stylistic visual features</article-title>
          .
          <source>Journal on Data Semantics</source>
          <volume>5</volume>
          ,
          <issue>2</issue>
          (
          <year>2016</year>
          ),
          <fpage>99</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jingtao</surname>
            <given-names>Ding</given-names>
          </string-name>
          , Fuli Feng, Xiangnan He,
          <string-name>
            <surname>Guanghui Yu</surname>
            ,
            <given-names>Yong</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>and Depeng</given-names>
          </string-name>
          <string-name>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An improved sampler for bayesian personalized ranking by leveraging view data</article-title>
          .
          <source>In Companion of the The Web Conference 2018 on The Web Conference 2018. International World Wide Web Conferences Steering Committee</source>
          ,
          <fpage>13</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mehdi</surname>
            <given-names>Elahi</given-names>
          </string-name>
          , Yashar Deldjoo, Farshad Bakhshandegan Moghaddam, Leonardo Cella, Stefano Cereda, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Exploring the Semantic Gap for Movie Recommendations</article-title>
          .
          <source>In Proceedings of the Eleventh ACM Conference on Recommender Systems (Como</source>
          , Italy) (
          <source>RecSys '17)</source>
          .
          <fpage>326</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>David</given-names>
            <surname>Elsweiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Trattner</surname>
          </string-name>
          , and Morgan Harvey.
          <year>2017</year>
          .
          <article-title>Exploiting food choice biases for healthier recipe recommendation</article-title>
          .
          <source>In Proceedings of the 40th international acm sigir conference on research and development in information retrieval. ACM</source>
          ,
          <volume>575</volume>
          -
          <fpage>584</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Leon</surname>
            <given-names>A Gatys</given-names>
          </string-name>
          ,
          Alexander S. Ecker,
          and
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Bethge</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Image style transfer using convolutional neural networks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>2414</volume>
          -
          <fpage>2423</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Golnaz</surname>
            <given-names>Ghiasi</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Honglak</given-names>
            <surname>Lee</surname>
          </string-name>
          , Manjunath Kudlur, Vincent Dumoulin, and
          <string-name>
            <given-names>Jonathon</given-names>
            <surname>Shlens</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Exploring the structure of a real-time, arbitrary neural artistic stylization network</article-title>
          .
          <source>arXiv preprint arXiv:1705.06830</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Ian</surname>
            <given-names>Goodfellow</given-names>
          </string-name>
          , Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Generative adversarial nets</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <volume>2672</volume>
          -
          <fpage>2680</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Kaiming</surname>
            <given-names>He</given-names>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          , Chen Fang,
          <string-name>
            <given-names>Zhaowen</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Vista: A Visually, Socially, and Temporally-aware Model for Artistic Recommendation</article-title>
          .
          <source>In Proceedings of the 10th ACM Conference on Recommender Systems</source>
          (Boston, Massachusetts, USA) (
          <source>RecSys '16)</source>
          .
          <fpage>309</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>VBPR: Visual Bayesian Personalized Ranking from implicit feedback</article-title>
          .
          <source>In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence</source>
          .
          <fpage>144</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Kalervo</given-names>
            <surname>Järvelin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jaana</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Cumulated gain-based evaluation of IR techniques</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>20</volume>
          ,
          <issue>4</issue>
          (
          <year>2002</year>
          ),
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Wang-Cheng</given-names>
            <surname>Kang</surname>
          </string-name>
          , Chen Fang,
          <string-name>
            <given-names>Zhaowen</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Julian McAuley</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Visually-aware fashion recommendation and design with generative image models</article-title>
          .
          <source>In 2017 IEEE International Conference on Data Mining (ICDM)</source>
          . IEEE,
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings</source>
          , Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Günter</given-names>
            <surname>Klambauer</surname>
          </string-name>
          , Thomas Unterthiner, Andreas Mayr, and
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Self-normalizing neural networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>971</fpage>
          -
          <lpage>980</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Gregory</given-names>
            <surname>Koch</surname>
          </string-name>
          , Richard Zemel, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Siamese neural networks for one-shot image recognition</article-title>
          .
          <source>In ICML deep learning workshop</source>
          , Vol.
          <volume>2</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Simon</given-names>
            <surname>Kornblith</surname>
          </string-name>
          , Jonathon Shlens, and
          <string-name>
            <given-names>Quoc V</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Do better ImageNet models transfer better?</article-title>
          .
          <source>arXiv preprint arXiv:1805.08974</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Proceedings of Advances in neural information processing systems 25 (NIPS)</source>
          .
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Marco</given-names>
            <surname>La Cascia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Saratendu</given-names>
            <surname>Sethi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stan</given-names>
            <surname>Sclaroff</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Combining textual and visual cues for content-based image retrieval on the world wide web</article-title>
          .
          <source>In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries</source>
          .
          <fpage>24</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Chenyi</given-names>
            <surname>Lei</surname>
          </string-name>
          , Dong Liu,
          <string-name>
            <given-names>Weiping</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zheng-Jun</given-names>
            <surname>Zha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Houqiang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Comparative Deep Learning of Hybrid Representations for Image Recommendations</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>2545</fpage>
          -
          <lpage>2553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Julian</surname>
            <given-names>McAuley</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Targett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qinfeng</given-names>
            <surname>Shi</surname>
          </string-name>
          , and Anton Van Den Hengel.
          <year>2015</year>
          .
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction</article-title>
          .
          <source>ArXiv e-prints</source>
          (Feb.
          <year>2018</year>
          ). arXiv:1802.03426 [stat.ML]
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Pablo</surname>
            <given-names>Messina</given-names>
          </string-name>
          , Vicente Dominguez, Denis Parra, Christoph Trattner, and
          <string-name>
            <given-names>Alvaro</given-names>
            <surname>Soto</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          (
          <year>2018</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Rendle</surname>
          </string-name>
          , Christoph Freudenthaler, Zeno Gantner, and
          <string-name>
            <given-names>Lars</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>BPR: Bayesian personalized ranking from implicit feedback</article-title>
          .
          <source>In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence</source>
          .
          <fpage>452</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Peter J</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Silhouettes: a graphical aid to the interpretation and validation of cluster analysis</article-title>
          .
          <source>Journal of Computational and Applied Mathematics</source>
          ,
          <volume>20</volume>
          (
          <year>1987</year>
          ),
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Yong</given-names>
            <surname>Rui</surname>
          </string-name>
          , Thomas S Huang,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Ortega</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sharad</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Relevance feedback: a power tool for interactive content-based image retrieval</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          ,
          <volume>8</volume>
          ,
          <issue>5</issue>
          (
          <year>1998</year>
          ),
          <fpage>644</fpage>
          -
          <lpage>655</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Olga</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          , Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,
          <string-name>
            <given-names>Alexander C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision (IJCV)</source>
          ,
          <volume>115</volume>
          ,
          <issue>3</issue>
          (
          <year>2015</year>
          ),
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          . https://doi.org/10.1007/s11263-015-0816-y
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Jose</given-names>
            <surname>San Pedro</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Siersdorfer</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Ranking and Classifying Attractiveness of Photos in Folksonomies</article-title>
          .
          <source>In Proceedings of the 18th International Conference on World Wide Web</source>
          (Madrid, Spain) (
          <source>WWW '09)</source>
          .
          <fpage>771</fpage>
          -
          <lpage>780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Schroff</surname>
          </string-name>
          , Dmitry Kalenichenko, and
          <string-name>
            <given-names>James</given-names>
            <surname>Philbin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Facenet: A unified embedding for face recognition and clustering</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Semeraro</surname>
          </string-name>
          , Pasquale Lops, Marco De Gemmis, Cataldo Musto, and
          <string-name>
            <given-names>Fedelucio</given-names>
            <surname>Narducci</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A folksonomy-based recommender system for personalized access to digital artworks</article-title>
          .
          <source>Journal on Computing and Cultural Heritage (JOCCH)</source>
          ,
          <volume>5</volume>
          ,
          <issue>3</issue>
          (
          <year>2012</year>
          ),
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Ali Sharif</given-names>
            <surname>Razavian</surname>
          </string-name>
          , Hossein Azizpour, Josephine Sullivan, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Carlsson</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>CNN features off-the-shelf: an astounding baseline for recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>
          .
          <fpage>806</fpage>
          -
          <lpage>813</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Valcarce</surname>
          </string-name>
          , Alejandro Bellogín, Javier Parapar, and
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Castells</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>On the robustness and discriminative power of information retrieval metrics for top-N recommendation</article-title>
          .
          <source>In Proceedings of the 12th ACM Conference on Recommender Systems. ACM</source>
          ,
          <fpage>260</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Egon L</given-names>
            <surname>van den Broek</surname>
          </string-name>
          , Thijs Kok, Theo E Schouten, and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hoenkamp</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Multimedia for art retrieval (m4art)</article-title>
          .
          <source>In Multimedia Content Analysis, Management, and Retrieval</source>
          <year>2006</year>
          , Vol.
          <volume>6073</volume>
          . International Society for Optics and Photonics,
          <fpage>60730Z</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Jiang</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Song</surname>
          </string-name>
          , Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and
          <string-name>
            <given-names>Ying</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning fine-grained image similarity with deep ranking</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>1386</fpage>
          -
          <lpage>1393</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <surname>Zhe</surname>
            <given-names>Zhao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Lichan</given-names>
            <surname>Hong</surname>
          </string-name>
          , Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi.
          <year>2019</year>
          .
          <article-title>Recommending What Video to Watch Next: A Multitask Ranking System</article-title>
          .
          <source>In Proceedings of the 13th ACM Conference on Recommender Systems</source>
          (Copenhagen, Denmark) (
          <source>RecSys '19)</source>
          . ACM, New York, NY, USA,
          <fpage>43</fpage>
          -
          <lpage>51</lpage>
          . https://doi.org/10.1145/3298689.3346997
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>