=Paper=
{{Paper
|id=Vol-1887/paper4
|storemode=property
|title=Rethinking Conventional Collaborative Filtering for Recommending Daily Fashion Outfits
|pdfUrl=https://ceur-ws.org/Vol-1887/paper4.pdf
|volume=Vol-1887
|authors=Anders Kolstad,Özlem Özgöbek,Jon Atle Gulla,Simon Litlehamar
|dblpUrl=https://dblp.org/rec/conf/recsys/KolstadOGL17
}}
==Rethinking Conventional Collaborative Filtering for Recommending Daily Fashion Outfits==
<pdf width="1500px">https://ceur-ws.org/Vol-1887/paper4.pdf</pdf>
<pre>
               Rethinking Conventional Collaborative Filtering for
                     Recommending Daily Fashion Outfits

    Anders Kolstad, Özlem Özgöbek, Jon Atle Gulla                                                Simon Litlehamar
         Norwegian University of Science and Technology                                             Accenture AS
                     Trondheim, Norway                                                            Fornebu, Norway
                    andekol@stud.ntnu.no                                                  simon.litlehamar@accenture.com
            {ozlem.ozgobek,jon.atle.gulla}@ntnu.no

ABSTRACT                                                                         Klepp and Laitala found that 20% of clothes bought by Norwe-
A conventional collaborative filtering approach using a standard              gians were never or rarely used [15]. A reason for this might be that
utility matrix fails to capture the aspect of matching clothing items         they did not actually like the item they bought or that the item did
when recommending daily fashion outfits. Moreover, it is chal-                not match any existing clothing items in their closet. This informa-
lenged by the new user cold-start problem. In this paper, we de-              tion is very valuable to the clothing retailer. With such information,
scribe a novel approach for guiding users in selecting daily fashion          the retailer can map the customer’s taste profile and generate tar-
outfits, by providing outfit recommendations from a system con-               geted ads for the customer, reducing the number of unnecessary
sisting of an Internet of Things wardrobe enabled with RFID tech-             purchases, and increasing the number of satisfied customers.
nology and a corresponding mobile application. We show where a                   Generating such outfit suggestions and targeted ads can be made
conventional collaborative filtering approach comes short when rec-           a reality by recommender systems. A recommender system tries to
ommending fashion outfits, and how our novel approach—powered                 predict the rating value of a user-item combination, where the user
by machine learning algorithms—shows promising results in the do-             has indicated their ratings for other items in the past [1]. The system
main of fashion recommendation. Evaluation of our novel approach              tracks these ratings by receiving user feedback. User feedback
using a real-world dataset demonstrates the system’s effectiveness            is classified into explicit and implicit. Explicit feedback is when
and its ability to provide daily outfit recommendations that are rele-        the user explicitly rates an item on, e.g., a 5-star scale. Implicit
vant to the users. A non-comparable evaluation of the conventional            feedback records other user interactions, e.g., how long a user
approach is also given.                                                       spends on a web page on a certain topic. With the retrieved ratings
                                                                              by user feedback, the recommender system can predict the user’s
CCS CONCEPTS                                                                  ratings of new items, and suggest the items with a high predicted
                                                                              rating. One of the most successful recommendation techniques is
•Information systems →Evaluation of retrieval results; Web
                                                                              called collaborative filtering (CF) [22]. CF recommends items on the
applications; •Computing methodologies →Classification and
                                                                              assumption that users who have interacted in similar ways before,
regression trees;
                                                                              will have common interests in the future as well. Conventional CF
                                                                              bases its recommendations on a matrix called the utility matrix,
KEYWORDS                                                                      which captures every rating value for the user-item combinations
Recommender Systems, Machine Learning, Fashion Recommenda-                    known to the system [17]. Table 1 shows an example of such a
tion, Collaborative Filtering, Internet of Things                             matrix, consisting of user-item combinations of users and movies.
                                                                              A known challenge in CF is called the new user cold-start problem. This
                                                                              challenge is about how to recommend items to new users that have
1    INTRODUCTION                                                             not rated any items yet. Suppose we were to introduce a fourth
Selecting an outfit every morning is a task that many people strug-           user in Table 1. The user-item combinations for this fourth user
gle with, often due to time constraints or the feeling of having              would all have ’?’ as a value. How to then recommend items to this
nothing to wear. In [20], Pruit argues that our selection of an outfit        user is not an easy task.
influences other people’s impressions of us, and that it is of high
importance to our cultural lives. Moreover, the average Norwegian                          Table 1: Example of a utility matrix.
has 359 unique garments in their closets [15]. This suggests that
people need guidance and suggestions for selecting an outfit from                         Titanic    The Godfather     Pulp Fiction    The Notebook
their clothing haystack each morning.
                                                                               Alice         5             2                 ?                ?
RecSysKTL Workshop @ ACM RecSys ’17, August 27, 2017, Como, Italy              Bob           2             ?                 4                1
© 2017 Copyright is held by the author(s).                                     Charlie       4             1                 5                4
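
The utility matrix in Table 1 and the new user cold-start problem can be
illustrated with a minimal Python sketch. The dictionary layout, the helper
names, and the fourth user "Dana" are illustrative assumptions, not part of
the described system; None stands for an unknown rating ('?').

```python
# Utility matrix from Table 1; None marks an unknown rating ('?').
MOVIES = ["Titanic", "The Godfather", "Pulp Fiction", "The Notebook"]
utility = {
    "Alice":   [5, 2, None, None],
    "Bob":     [2, None, 4, 1],
    "Charlie": [4, 1, 5, 4],
}


def rated_items(user):
    """Return the titles for which the user has a known rating."""
    return [m for m, r in zip(MOVIES, utility[user]) if r is not None]


def is_cold_start(user):
    """A user without any known rating gives CF nothing to compare against."""
    return not rated_items(user)


# Hypothetical fourth user with no ratings at all (name is illustrative).
utility["Dana"] = [None] * len(MOVIES)
```

Here is_cold_start("Dana") is true, so a neighborhood-based method has no
interaction pattern to match against the other three users.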


                                                                                 Recommending individual items, such as in Table 1, is what
                                                                              nearly all recommender systems are focusing on. In recent years,
                                                                              recommendations of collections, such as music playlists [12, 13, 23],


have gained a lot of attention. Hansen and Golbeck identified some            that mood is a motivator for selecting outfits, but that users would
key aspects that affect the recommendation of collections [10].               be more invested in the system if it also considered weather.
One aspect that especially applies to outfit recommendation is the               In [19], Limaksornkul et al. also propose a mobile application
co-occurrence interaction effect. Matching clothing items (items              used as a virtual wardrobe. They try to solve the problem of effi-
that go well together) will have a positive interaction effect when           ciently managing closet inventory and guiding users in selecting
they co-occur together, and will therefore generate a more relevant           clothes based on the user’s fashion style, trends, their friends’ styles,
outfit recommendation to the user.                                            weather, and occasion. In the mobile application, the users can man-
   In [16], we proposed Connected Closet, a system consisting of              age their clothes, and receive statistical-based, weather-based, and
an Internet of Things wardrobe enabled with an RFID reader, so                event-based clothing suggestions. The statistical-based recommen-
that clothing items with RFID tags can be checked in and out of               dation engine is preliminary and is the only approach that takes
the closet, generating implicit feedback on clothing items the user           the user’s preferences into account. Moreover, no evaluation of the
likes. Using a mobile application, the user can give explicit feedback        system is given.
on outfits he likes, and receive daily outfit recommendations based              A smart wardrobe system is proposed by Goh et al. in [9]. Here,
on outside temperature and wardrobe inventory. In this paper,                 garments attached to RFID tags can be scanned in the user’s closet.
we describe an implementation of the proposed system. We show                 Using a system application, the user can get clothing recommendations
where a conventional CF approach comes short in terms of the                  based on the user’s mood, preferred color, and occasion.
new user cold-start problem and where it fails to capture the co-                Yu-Chu et al. propose a recommendation system using a modi-
occurrence effect between items. Moreover, we propose a novel                 fied Bayesian network for generating outfit recommendations from
CF approach that mitigates the shortcomings of the conventional               the user’s clothing items enabled with RFID tags stored in a smart
approach and implement the novel approach into the proposed                   wardrobe [24]. By taking weather, season, and occasion into consideration,
system. Evaluations using a real-world dataset are performed on               the system first selects a top, and then finds bottoms which
both approaches.                                                              match the selected top. The process of selecting a bottom depends on
   The main contributions of this paper are:                                  user feedback rating the combination. An experiment on 10 users
     (1) A novel CF approach for recommending daily fashion out-              concluded that the proposed system gave more satisfied users than
         fits.                                                                a baseline using a basic Bayesian network without user feedback.
     (2) An accuracy evaluation of the approach using different                  An important aspect that needs to be mentioned is that virtual
         classification algorithms.                                           wardrobes are heavily dependent on explicit user feedback, while
                                                                              the Internet of Things wardrobes can make use of implicit user
   This work is a joint effort between the Smartmedia program1
                                                                              feedback as well.
at NTNU2 and Accenture Norway3 . The Smartmedia program is
                                                                                 As seen in the works above, most of the recommender systems
researching mobile context-aware recommender systems. In
                                                                              are preliminary and do not contain clear steps for the recommendation
this work, Accenture’s main goal is to research modern technology
                                                                              algorithm. The ones that do have an implemented recommender
for building web-based information systems and to keep track of
                                                                              system only have user studies and lack an accuracy
key technology trends, such as the Internet of Things.
                                                                              evaluation of their recommendations. In this paper, we describe a
   The rest of the paper is structured as follows. In Section 2, we
                                                                              fully implemented prototype, using similar architecture to [9] and
give an overview of related work, followed by a description of the
                                                                              [24], enabled with a novel recommendation approach evaluated on
proposed system in Section 3. Section 4 introduces the concept
                                                                              a real-world dataset. To the best of our knowledge, our novel ap-
of outfit recommendation. The recommendation approaches are
                                                                              proach is a completely unique way of generating recommendations
described in Section 5 and Section 6. Evaluation of the approaches
                                                                              using CF. This is mostly because the majority of CF recommender
is given in Section 7. We conclude with a summary and discuss
                                                                              systems today are heavily based on the utility matrix [22], which
future work in Section 8.
                                                                              is not present in our approach.
2    RELATED WORK                                                             3     SYSTEM OVERVIEW
There are not many systems addressing daily outfit recommen-
                                                                              In this section, we describe the architecture of the smart wardrobe
dations from either an Internet of Things wardrobe or a virtual
                                                                              proposed in [16]. Moreover, we explain how the users receive
wardrobe. In this section, we give an overview of the state of the
                                                                              recommendations through the mobile application which is a part
art, identify gaps in these works, and show where our system differs
                                                                              of the architecture. We built and implemented a prototype of the
from past work and how it complements previous work.
                                                                              whole system and created a short demonstration video available at
   Dumeljic et al. propose a virtual wardrobe implemented as a
                                                                              https://goo.gl/rZBZqo.
mobile application [6]. By explicitly stating the user’s current
mood, the user can add clothing items that best fit the mood, to the
                                                                              3.1     Architecture
virtual inventory. In [6], the outfit recommendation approach is not
described and has not been implemented in the system. Moreover,               Figure 1 shows a high-level view of the architecture. The Closet
a user study of ten people was conducted, where they concluded                is embedded with a Raspberry PI4 connected to an RFID reader.
                                                                              Clothing items that are enabled with an RFID tag and have their ID
1 http://research.idi.ntnu.no/SmartMedia/                                     number stored in the Cloud, are clothing items that are compatible
2 http://www.ntnu.edu/
3 https://www.accenture.com/no-en                                             4 A tiny computer. See https://www.raspberrypi.org/


                                                                               4      OUTFIT RECOMMENDATION
                                                                               We define an outfit, denoted o, as a tuple of two items, c 1 and c 2 ,
                                                                               where c 1 is a top and c 2 is a bottom. Although clothing outfits can
                                                    Weather API
                                                                               also contain more, or less, than two items, the current version of
                                                                               our system only addresses outfits of two items. This is with the
                                                                               assumption that most outfits consist of one top and one bottom.
                              Cloud      MQTT                                  Recommendation of outfits consisting of a one-piece, e.g., a dress,
                                                                               or with additional accessories, is planned for later research.
         RFID tag
                                                                               4.1        Inclusion Criteria
                                                                               To ensure that the user receives outfit recommendations that are
                                                                               relevant for a given day, we define inclusion criteria for the clothing
        RFID                                                                   items that can be part of a recommended outfit. The inclusion
                                Closet          Mobile Application             criteria are defined as follows:
                                                                                    (1) Clothing item must be inside the closet. The status of
                    Figure 1: High level architecture.                                   the item is determined by the latest RFID tag scan.
                                                                                    (2) Clothing item must be suitable for current weather.
                                                                                         Items are stored in a database with a suitable temperature
with the system. Such clothing items can be manually scanned                             range property. This is the range of temperatures in which a
through the RFID reader. When a scan occurs, a message is                                clothing item is comfortable to wear. The outside temperature at
broadcast to multiple services deployed in the Cloud. These                              the time of recommendation must be inside the item’s suitable
services include, among others, a recommender service and an                             temperature range.
inventory service. By communicating with each other and a third-party             All clothing items that are owned by a user ui and fit the inclusion
Weather API, they provide outfit recommendations to the                        criteria are represented as a set I (ui ). All outfit combinations
Mobile Application.                                                            that can be generated from I (ui ) are added to the set O(ui ).
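
The two inclusion criteria and the construction of I(ui) and O(ui) can be
sketched in Python as follows. This is a minimal illustration under stated
assumptions, not the system's actual code: the ClothingItem fields, the
function names, the example closet, and the use of degrees Celsius are all
hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ClothingItem:
    item_id: str      # hypothetical identifier, e.g. the RFID tag id
    kind: str         # "top" or "bottom"
    in_closet: bool   # status according to the latest RFID tag scan
    min_temp: float   # lower bound of the suitable temperature range
    max_temp: float   # upper bound of the suitable temperature range


def included_items(closet, outside_temp):
    """I(ui): items that are in the closet and suit the outside temperature."""
    return [c for c in closet
            if c.in_closet and c.min_temp <= outside_temp <= c.max_temp]


def outfit_combinations(items):
    """O(ui): every (top, bottom) pair that can be formed from the items."""
    tops = [c for c in items if c.kind == "top"]
    bottoms = [c for c in items if c.kind == "bottom"]
    return list(product(tops, bottoms))


# Illustrative closet C(ui); all values are made up.
closet = [
    ClothingItem("tshirt", "top", True, 15.0, 35.0),
    ClothingItem("sweater", "top", True, -10.0, 15.0),
    ClothingItem("jeans", "bottom", True, 0.0, 30.0),
    ClothingItem("shorts", "bottom", False, 18.0, 40.0),
]
I = included_items(closet, 25.0)
O = outfit_combinations(I)
```

At 25 degrees the sweater is outside its temperature range and the shorts
are checked out of the closet, so I contains only the t-shirt and the jeans,
and O contains the single outfit formed by those two items.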

3.2    Mobile Application                                                      4.2        User Ratings
When the user opens the mobile application, he is shown a                      The favored outfits indicated (explicitly or implicitly) by the user,
recommendation for an outfit that suits today’s temperature and                are stored in the system using unary positive-only values. Outfits
is inside the user’s closet. By swiping through a list, the user can           that have not been rated are outfits that the user either does not like
view multiple recommended outfits. Moreover, the user can                      or has not seen or used together from the user’s closet C(ui ).
modify the recommended outfit by using the arrows that correspond              Unrated outfits will be referred to as ’neutral’ outfits in the rest of
to each clothing item. By clicking a Save button, the user                     this paper.
gives explicit positive feedback on the displayed outfit, indicating
that the user has this outfit as one of his favorites.                         4.3        Recommended Outfits
                                                                               The list of recommended outfits that the user receives in the mobile
                                                                               application is generated by the system’s recommender service that
                                                                               returns the set R(ui ) of recommended outfits for the user.

                                                                               4.4        Notation
                                                                               All the notations defined in this section are summarized in Table 2.
                                                                               These notations will be used throughout the paper.

                                                                                                 Table 2: Notations used in this paper.

                                                                                   Notation                        Description
                                                                                   ui                              The ith user (owner) of a closet.
                                                                                   cj                              The jth clothing item.
                                                                                   ok = (c 1 , c 2 )               An outfit of c 1 and c 2 .
                                                                                   C(ui ) = {c 1 , . . . , cl }    Every clothing item the user owns.
                                                                                   I (ui ) = {c 1 , . . . , cm }   Clothes fitting the inclusion criteria.
                                                                                   O(ui ) = {o 1 , . . . , on }    Outfit combinations of items in I (ui ).
                                                                                   R(ui ) = {o 1 , . . . , op }    Outfits recommended to the user.
       Figure 2: Screenshot of the mobile application.


5     RETHINKING CONVENTIONAL CF                                                   of users who have favored an outfit. Using Z and W , we train
In this section, we introduce an approach for outfit recommendation                a classifier using a classification model. Outfits that have been
using a conventional utility matrix for collaborative filtering. We                favored by users and have an associated weight above 0 will be
discuss where this approach comes short, and introduce a novel                     classified as ’positive’, while outfits with an associated weight of
approach for outfit recommendation using an outfit-item matrix.                    0 will be classified as ’neutral’. When the model has been trained,
                                                                                   we generate all the possible outfit combinations O(ui ), of the items
5.1    Conventional CF Approach                                                    that fit the inclusion criteria for the given user ui . By using the
                                                                                   classifier, we can now recommend the outfits that are classified as
An obvious solution to recommending fashion outfits is to map the
                                                                                   ’positive’ to the user as R(ui ).
users’ favorite outfits onto a utility matrix U , consisting of users
                                                                                      The advantages of this approach are that it captures the co-
and outfits. Then, using a neighborhood model, one could predict
                                                                                   occurrence interaction effect between two clothing items. This is
new outfits for users by comparing the user’s interaction pattern
                                                                                   because it considers the clothing items that an outfit is composed of,
with that of users with the same interaction pattern. To recommend the daily
                                                                                   instead of just looking at the outfits as a whole. Moreover, it is not
outfits R(ui ), we need to match the predicted outfits with the items
                                                                                   challenged by the new user cold-start problem because we assume
that fit the inclusion criteria I (ui ), and filter out outfits that do not
                                                                                   that people who own similar clothing items will have the same taste in
contain only such items. The approach is illustrated in Figure 3.
                                                                                   outfits as well. Lastly, this approach has a huge advantage in terms
    The first problem with this approach is that it can only recom-
                                                                                   of user privacy, because it does not need to store the user-item
mend outfits that have been favored by other users. In other words,
                                                                                   combinations in one centralized matrix.
it cannot generate completely new outfits, and therefore fails to
                                                                                      In Figure 5, we give an example of a possible recommendation
capture the co-occurrence effect between individual items. An-
                                                                                   pipeline that can occur in our system using the novel approach. To
other problem with this approach is that it is challenged by the
                                                                                   the left is the set of all the clothing items owned by the user. By
new user cold-start problem. Users who have not favored any out-
                                                                                   inputting this and the current outside temperature at the user’s
fits or checked out any items, cannot receive recommendations.
                                                                                   location, the function f1 filters out and generates possible outfits
Lastly, privacy is becoming a huge concern in recommender sys-
                                                                                   for recommendation with respect to the inclusion criteria. These outfits are
tems [2, 3], and in this approach, we store all the users’ ratings in
                                                                                   then inputted to f2 , which follows the same steps as described in
one centralized matrix, causing a huge risk for the users’ privacy.
                                                                                   Figure 4. At the end of the pipeline, we get the generated set of
                                                                                   recommended outfits that is displayed in the mobile application.
5.2    Novel Outfit-Item Matrix Approach
                                                                                      Although not implemented in our system, this approach could
By basing our recommendations on the idea that users that have                     be easily used by a clothing retailer to generate targeted ads by
similar items in their closets will also have similar taste in outfits,            inputting clothing items from the retailer together with the user’s
we propose a novel approach where we rethink the conventional                      clothing items in C(ui ). Then, the clothing retailer could recom-
approach by completely transforming the utility matrix. In Figure 4,               mend new outfits that the users might want to buy, or individual
we create a matrix Z , where the columns represent outfits, and the                items that would make a great outfit with clothing items already
rows represent the clothing items that compose the outfit. Each                    owned by the user.
outfit is associated with a weight w. This weight is the number

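The construction of Z and W and the 'positive'/'neutral' labeling described
in Section 5.2 can be sketched in Python as follows. This is a minimal
illustration under stated assumptions: the binary membership encoding of Z,
the function names, the toy data, and the hand-written Bernoulli Naive Bayes
(standing in for whichever classification model is chosen) are hypothetical
and not the system's actual implementation.

```python
import math


def build_outfit_item_matrix(items, outfits, favorites_by_user):
    """Return (Z, W).

    Z[j][k] is 1 if item j is part of outfit k (rows: items, columns: outfits).
    W[k] is the number of users who have favored outfit k.
    """
    Z = [[1 if item in outfit else 0 for outfit in outfits] for item in items]
    W = [sum(outfit in favs for favs in favorites_by_user.values())
         for outfit in outfits]
    return Z, W


def column(Z, k):
    """Feature vector of outfit k, i.e. the k-th column of Z."""
    return [row[k] for row in Z]


def train_bernoulli_nb(samples, labels):
    """Fit class priors and Laplace-smoothed per-feature probabilities."""
    model = {}
    for label in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == label]
        prior = len(rows) / len(samples)
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[label] = (prior, probs)
    return model


def predict(model, x):
    """Return the class label with the highest log posterior score."""
    def log_score(prior, probs):
        return math.log(prior) + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, probs))
    return max(model, key=lambda label: log_score(*model[label]))


# Toy data: two tops, two bottoms, two users (all values are made up).
items = ["t1", "t2", "b1", "b2"]
outfits = [("t1", "b1"), ("t1", "b2"), ("t2", "b1"), ("t2", "b2")]
favorites_by_user = {
    "u1": {("t1", "b1"), ("t1", "b2")},
    "u2": {("t1", "b1")},
}

Z, W = build_outfit_item_matrix(items, outfits, favorites_by_user)
labels = ["positive" if w > 0 else "neutral" for w in W]
samples = [column(Z, k) for k in range(len(outfits))]
model = train_bernoulli_nb(samples, labels)
recommended = [o for o, x in zip(outfits, samples)
               if predict(model, x) == "positive"]
```

In the described system the trained classifier would be applied to the
candidate outfits O(ui) of the target user rather than to the training
outfits, as done here for brevity. Note also that a Naive Bayes model over
independent membership features cannot represent every pairwise interaction
between items, so the choice of classification model matters for how much of
the co-occurrence effect is captured.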

[Figure 3: Conventional CF approach using a utility matrix. A utility matrix U (rows: users u1, ..., un; columns: outfits o1, ..., ok) feeds a neighborhood model; a filter function using I(ui) then produces R(ui).]

[Figure 4: Novel approach using an outfit-item matrix. A matrix Z (rows: clothing items c1, ..., cn; columns: outfits o1, ..., ok) and a weight vector W = (w1, ..., wk) are used to train a classification model; the resulting classifier maps O(ui) to R(ui).]


[Figure 5: Example of a possible recommendation pipeline using the novel approach. The closet C(ui) = {c1, c2, c3, c4} and the outside temperature (25 ℃) are input to f1, which yields O(ui) = {o1, o2}; f2 then yields R(ui).]


6     RECOMMENDATION MODEL
In this section, we present the recommendation model for our novel approach using different
classification models. The chosen classification models are widely known and perform well in
many domains [4, 5]. The classification models also include a baseline classifier. Moreover,
we introduce some neighborhood models that are applied with the conventional approach.

6.1    Classification Models
   Naïve Bayes. Assuming that the attributes of a sample are conditionally independent given
the sample’s class label, Naïve Bayes assigns a test sample the class label Y that maximizes
the numerator in this equation [18]:

    P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)},    (1)

where X is a set of d attributes.

   Adaptive Boosting (AdaBoost). In recent years, classification techniques known as ensemble
methods have gained a lot of attention. One of the most popular ones is AdaBoost. It aggregates
over a set of weak learners h_t(x) that tend to perform slightly better than a random
classifier. The final classifier H(x) is then obtained by combining the weak learners in a
weighted majority voting scheme using this equation [7]:

    H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right),    (2)

where \alpha_t is the assigned weight for each weak learner.
   To pick the weak learners, each training sample is associated with a weight indicating its
importance. AdaBoost will then pick its weak learners in a forward stage-wise manner by
focusing on predicting the high-weight samples correctly.

   Gradient Boosting. Another popular ensemble method that relies on a set of weak learners is
called Gradient Boosting. It follows the same fundamental idea as AdaBoost, but instead of
focusing on the sample weights when picking its weak learners, it focuses on gradients [8].

   Uniform. As a baseline, we use a classifier that generates class predictions uniformly at
random.

6.2    Neighborhood Models
To predict the ratings of the user-outfit combinations in the matrix U, given in Figure 3, we
apply the user-based neighborhood model [1]. This model predicts user ratings by finding users
that have rated similar outfits. To find similar users, we can apply different similarity
measures. In our model, we apply Jaccard (JAC) and cosine similarity (COS) as defined by
Equation 3:

    Sim_{JAC}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
    Sim_{COS}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}    (3)

After user similarities have been calculated, we can predict the ratings \hat{r}_{ui} of
unrated outfits using this formula:

    \hat{r}_{ui} = \frac{\sum_{v} Sim(u, v) \, r_{vi}}{\sum_{v} |Sim(u, v)|}    (4)

6.3    Ranking Model
To rank the outfits that are predicted to the user in R(ui) using the novel approach, we
assign each prediction of an outfit o_j a ranking score equal to the classifier’s probability
of the class label being ’positive’, P(w_j > 0 | o_j). It should be noted that this is not a
personalized ranking model, but as seen from our results, it performed well for each
individual user.
   The conventional approach does not use classification models, so the probability of the
predicted class label is not available. Instead, the outfits are ranked according to the
predicted rating calculated using the similarity measures.
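
The neighborhood model of Section 6.2 can be sketched in a few lines of Python. This is a
minimal illustration, assuming that each user's ratings are binary likes stored as a set of
outfit ids, so that r_vi is 1 when user v liked outfit i and 0 otherwise. The function names
and the toy data are hypothetical and are not taken from the authors' implementation.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two sets of liked outfit ids (Equation 3, left)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of two binary like vectors stored as sets (Equation 3, right)."""
    if not a or not b:
        return 0.0
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))


def predict_rating(user, outfit, likes, sim=jaccard):
    """Similarity-weighted rating of `outfit` for `user` (Equation 4).

    `likes` maps a user id to the set of outfit ids that user liked.
    """
    numerator = 0.0
    denominator = 0.0
    for other, other_likes in likes.items():
        if other == user:
            continue
        s = sim(likes[user], other_likes)
        numerator += s * (1.0 if outfit in other_likes else 0.0)
        denominator += abs(s)
    return numerator / denominator if denominator else 0.0


# Toy data: hypothetical users and outfits, not from the Polyvore dataset.
likes = {
    "u1": {"o1", "o2"},
    "u2": {"o1", "o2", "o3"},
    "u3": {"o4"},
    "u4": {"o1", "o5"},
}
score = predict_rating("u1", "o3", likes)
```

Passing sim=cosine switches the similarity measure without changing the prediction step,
which mirrors how the two measures are compared under the same formula.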


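
The weighted majority vote in Equation 2 can be illustrated as follows. The sketch covers only
the final voting step, not the training procedure of AdaBoost. The decision stumps, their
weights, and the rule that a tie counts as +1 are assumptions made for this example.

```python
def sign(value):
    """Return +1 for non-negative input and -1 otherwise (ties go to +1 by assumption)."""
    return 1 if value >= 0 else -1


def ensemble_predict(x, learners, alphas):
    """Weighted majority vote H(x) = sign(sum_t alpha_t * h_t(x)) from Equation 2."""
    total = sum(alpha * h(x) for h, alpha in zip(learners, alphas))
    return sign(total)


def stump(item_index):
    """Hypothetical weak learner: votes +1 if the item is in the outfit, else -1."""
    return lambda x: 1 if x[item_index] == 1 else -1


learners = [stump(0), stump(1), stump(2)]
alphas = [0.9, 0.4, 0.3]  # illustrative weights, not fitted values

outfit = [0, 1, 1]  # binary outfit-item vector: items 1 and 2 present, item 0 absent
label = ensemble_predict(outfit, learners, alphas)
```

Here the single heavily weighted learner outvotes the two lighter ones, which is the behavior
the weights are meant to capture.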
7     EXPERIMENTS
In this section, we describe the setting in which our experiment was performed. We give a
detailed description of the dataset that was used and present the results of the different
models that were evaluated. The main goals of the experiments are to demonstrate the
effectiveness of the system and to compare and select the best classification model for our
system.

7.1     Dataset
The dataset is scraped from Polyvore.com (http://www.polyvore.com/). Polyvore is a social media
site where users can create clothing outfits by matching individual clothing items. Other users
can then ’like’ these outfits by clicking a ’like button’.
    From the available outfits at Polyvore, we first gathered the most liked outfits from the
last 3 months. We filtered these outfits so that they only contained a top and a bottom. Then,
we collected other outfits that these items also were a part of, and filtered them. Lastly, we
gathered all the user likes for each of the outfits we had gathered. Table 3 describes the size
of the dataset.

Table 3: Data statistics on Polyvore dataset.

    # Outfits          # Clothes       # Users    # Likes
    6,186              158             7,093      19,287
    Positive: 260      Tops: 81
    Neutral: 5,917     Bottoms: 87

   From the gathered dataset, we have 260 outfits that are classified as ’positive’ and 5,917
that are classified as ’neutral’. This means that the dataset has an imbalance of approximately
23 to 1.
   In total, there are 158 individual clothing items in the dataset. This means that the
feature vectors used in the classification models will be relatively sparse binary vectors of
158 dimensions.

7.2     Evaluation Methods
To evaluate our novel approach, we iterated through the following procedure for all users with
at least 20 outfit likes. We hide each of the user’s ground-truth favorite outfits from the
system by decreasing the outfits’ corresponding weights in W by 1. Then, we train the
classification model using Z and W. Moreover, with the assumption that a user only owns items
that are among the items the user likes, we generate outfit combinations from the items in
C(ui), assuming that all of them fit the inclusion criteria. We then compare the predicted
class labels of the generated outfit combinations to the true favorite outfits of the user. We
also ran the procedure a second time, but now randomly removing 50% of the user’s tops and
bottoms in C(ui). This was done to simulate outfit recommendations from a half empty closet. In
Table 4, we summarize some statistics for the test sets that were generated by running these
methods. As seen in this table, there are, on average, many outfits being classified for each
user, O(ui), compared to the true number of the user’s favorite outfits, O(ui)_TP.

Table 4: The average properties for the users in the test sets.

    Closet size    avg(|O(ui)|)    avg(|O(ui)_TP|)
    Full           682.5           31.4
    Half empty     164.0           17.6

   To reduce the dimensionality of the samples and to detect items that are interrelated, the
multivariate analysis technique called principal component analysis was applied to the samples
before training the models [14]. The reduction is done by transforming the samples to a new set
of uncorrelated features ordered so that the first ones retain most of the original variation.
   For evaluating the conventional approach using the different neighborhood models, we first
randomly removed 30% of the user likes from the utility matrix. Then, we predicted all outfit
likes for each user, and filtered them with respect to I(ui) using the same assumption as
above. The recommended outfits were then compared to the true outfit likes.

7.3    Evaluation Metrics
If we look at the task of recommending the outfits as retrieving all relevant items (outfits)
from a collection of outfits separated into two classes, relevant and not relevant, we can
apply the popular accuracy metrics from information retrieval systems. In our case, we say that
the relevant outfits are the ones classified as ’positive’, and the not relevant are the
outfits classified as ’neutral’. Then, we can use a popular metric known as Recall. It measures
the ratio of relevant items retrieved to the number of all relevant items available [11]:

    Recall = \frac{|\text{relevant items retrieved}|}{|\text{all relevant items}|}    (5)

   In this paper, we also report Recall@N, which is the Recall in a ranked list considering
only the first N elements. We compute Recall and Recall@N by averaging over the result for each
user ui.
   A way to graphically display the tradeoff between the true positive rate and the false
positive rate is known as a receiver operating characteristic (ROC) curve. The true positive
rate is the same as Recall, and the false positive rate is the ratio of non-relevant items
retrieved to the number of all non-relevant items available. The ROC curve is well suited for
comparing the performance of classifiers, where the best classifiers tend to be located in the
upper left corner of the diagram. The classifiers that perform best on average will have a
large area under the ROC curve (AUC) [11].
   To evaluate the ranking via utility, we sum the utility of an outfit j to a user u over a
ranked recommended list of size L. By summing this value over all users, we obtain the R-score
as follows [1]:

    R\text{-score} = \sum_{u=1}^{m} \sum_{j \in I_u,\, v_j \le L}
                     \frac{\max\{r_{uj}, 0\}}{2^{(v_j - 1)/\alpha}},    (6)

where v_j is the rank of outfit j and r_{uj} is the ground-truth rating of outfit j. \alpha is
the half-life, set to 5 in our experiments. The higher the R-score, the more the true favorite
outfits of each user tend to appear at the top of the ranked list.
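
The metrics of Section 7.3 can be sketched in Python as follows. This is an illustrative
implementation of Equations 5 and 6 under stated assumptions: ranked lists are plain Python
lists ordered from best to worst, ground-truth ratings are stored in dictionaries, and outfits
without a ground-truth rating contribute zero utility. All names and the toy data are
hypothetical.

```python
def recall_at_n(ranked, relevant, n=None):
    """Recall (Equation 5) over the first n entries of a ranked list.

    With n=None the whole list is used, which gives plain Recall.
    """
    if not relevant:
        return 0.0
    top = ranked if n is None else ranked[:n]
    hits = sum(1 for outfit in top if outfit in relevant)
    return hits / len(relevant)


def r_score(ranked_by_user, ratings_by_user, list_size, half_life=5.0):
    """R-score (Equation 6): utility discounted by rank and summed over all users."""
    total = 0.0
    for user, ranked in ranked_by_user.items():
        ratings = ratings_by_user.get(user, {})
        for rank, outfit in enumerate(ranked[:list_size], start=1):
            utility = max(ratings.get(outfit, 0), 0)
            total += utility / 2 ** ((rank - 1) / half_life)
    return total


# Toy data: one hypothetical user with two favorite outfits.
ranked = ["o3", "o1", "o7", "o2"]
relevant = {"o1", "o2"}
recall_top2 = recall_at_n(ranked, relevant, n=2)
recall_full = recall_at_n(ranked, relevant)
score = r_score({"u1": ranked}, {"u1": {"o1": 1, "o2": 1}}, list_size=4)
```

With a half-life of 5, a favorite outfit placed at rank 6 contributes half the utility of one
placed at rank 1, which is why the R-score rewards lists that put favorites near the top.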


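
The closet-based test procedure of Section 7.2 can be illustrated with a short Python sketch.
It assumes, in line with the dataset filtering in Section 7.1, that an outfit is one top
combined with one bottom, and it encodes each outfit as a binary item vector like the ones
used by the classification models. The helper names, the fixed random seed, and the toy closet
are assumptions for illustration only.

```python
import itertools
import random


def generate_outfits(tops, bottoms):
    """All top-bottom combinations that can be built from a closet."""
    return list(itertools.product(tops, bottoms))


def half_empty(items, fraction=0.5, seed=0):
    """Randomly drop `fraction` of the items to simulate a half empty closet."""
    rng = random.Random(seed)
    keep = len(items) - int(len(items) * fraction)
    return rng.sample(list(items), keep)


def to_feature_vector(outfit, item_index):
    """Binary vector with a 1 for every clothing item contained in the outfit."""
    vector = [0] * len(item_index)
    for item in outfit:
        vector[item_index[item]] = 1
    return vector


# Toy closet: hypothetical item ids, far smaller than the 158 items in the dataset.
tops = ["t1", "t2", "t3", "t4"]
bottoms = ["b1", "b2"]
item_index = {item: i for i, item in enumerate(tops + bottoms)}

full_closet = generate_outfits(tops, bottoms)
small_closet = generate_outfits(half_empty(tops), half_empty(bottoms))
vectors = [to_feature_vector(outfit, item_index) for outfit in full_closet]
```

Because the number of combinations is the product of the number of tops and bottoms, halving
both roughly quarters the number of candidate outfits, which is consistent with the drop in
avg(|O(ui)|) reported in Table 4.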
Table 5: Results from evaluation of novel approach.

                        |         Overall          |                      Top-L
    Model               | AUC   Accuracy   Recall  | R-score   Recall@5   Recall@10   Recall@15   Recall@20
    Naïve Bayes         | .704  .870       .756    | 566.2     .091       .183        .287        .382
    Gradient Boosting   | .864  .878       .997    | 851.9     .111       .228        .337        .448
    AdaBoost            | .885  .723       .978    | 872.1     .113       .223        .334        .442
    Uniform             | .500  .500       .493    |  88.0     .024       .052        .077        .095


7.4    Results and Discussion
In this section, we present our results and discuss some insights we obtained while running
the experiments. By the end of this section we will have answered the following questions:

    Q1. How do the different classification models compare using our novel approach?
    Q2. How does closet size affect the recommendation results?
    Q3. To what extent can the conventional approach be used to recommend new outfits to the
         users?

   The evaluation method for the novel approach was performed using the classification models
in Section 6.1. For Naïve Bayes, the best configuration was a prior probability of 0.99 for the
’neutral’ class label and a prior probability of 0.01 for the ’positive’ class. This was mostly
due to the 23 to 1 imbalance in the dataset. AdaBoost gave the best result using decision trees
as weak learners and with a learning rate of 1.0. Gradient Boosting performed best with similar
configurations.
   In Table 5, we report AUC, Accuracy and Recall for the predicted class labels for all of the
outfits that were tested when simulating a full closet. In the right-hand side of the table, we
also report the R-score and Recall@N in a ranked list of L outfits. Because users have
different numbers of clothes in their closets, the length L of the recommended ranked list
varies from user to user. The best performing model in each category is highlighted by
underlining its result. As seen in the table, Gradient Boosting and AdaBoost are the dominating
models in all categories. On average and overall, Gradient Boosting performs best, while in a
top-L ranked list, AdaBoost performs slightly better. For N > 5, Gradient Boosting was, at
maximum, only .006 points better than AdaBoost in Recall@N. In terms of the R-score, AdaBoost
is superior to Gradient Boosting. Because of this, we conclude that AdaBoost is the model
yielding the highest utility to the users.
   In Figure 6, we plot a ROC curve for the different models used to generate a single ranked
list of user-outfit pairs. This type of ROC curve is sometimes referred to as a global ROC
curve [21]. As indicated by the gray dotted line, AdaBoost is the best model at a false
positive rate of 20%, predicting 86% of the users’ favorite outfits. As the false positive rate
increases, Gradient Boosting becomes slightly superior to AdaBoost. On average, Gradient
Boosting and AdaBoost dominate the two other models with an AUC of .864 and .885,
respectively. Naïve Bayes yields a satisfactory AUC of .704, while from the Uniform model we
got an expected AUC of .500.
   The high values of AUC and R-score are a strong indication that the non-personalized ranking
model performs quite well and even better than expected.

[Figure 6 plot: True Positive Rate against False Positive Rate, both from 0.0 to 1.0, with
curves for Naive Bayes, Gradient Boosting, AdaBoost, and Uniform]
Figure 6: Global ROC curves for recommendations from a full closet.

[Figure 7 plot: # Recommendations (axis from 0 to 60) for Favorite outfits and Novel outfits]
Figure 7: Distributions of outfit recommendations using AdaBoost.

   Figure 7 shows the distributions of outfit recommendations in a top-20 list recommended to
the users with at least 20 outfit likes. In total, 196 unique outfits were recommended to the
users, of which 33 were novel outfits, never favored by any users in the past. This shows that
a wide range of outfits end up in the users’ recommended top lists.
   The experiment on a half empty closet resulted in no change in terms of overall Recall and,
at most, a .005 decrease in AUC. For this reason, we do not report any results beyond this.
Besides the fact that fewer clothing items will result in fewer outfit recommendations, we
conclude that closet size has little effect on the recommendations.
   In Table 6, results from the evaluation of the conventional approach are given. The table
shows Recall@N in a ranked list of M outfits. Because M is much lower than L, we only report up
to N = 5 (as


opposed to N = 20 in evaluation of the novel approach). Note that the results are not
comparable to the results in Table 5, as they are derived using an approach that is
fundamentally different. The best performing model is highlighted with underlined results. As
the numbers indicate, the approach generates new outfit recommendations to the users with a
satisfactory accuracy. However, these outfit recommendations are, as argued in Section 5, only
outfits that have been composed and favored by other users in the past. Therefore, we conclude
that this approach is insufficient when it comes to recommending novel and personalized daily
outfits.

Table 6: Evaluation of the conventional approach.

    Model      Recall@1    Recall@5
    Cosine     .077        .366
    Jaccard    .050        .250

8   CONCLUSION AND FUTURE WORK
We have introduced a novel approach for recommending daily fashion outfits from a smart closet.
Our novel approach mitigates a wide range of challenges faced by a conventional approach that
tries to recommend daily fashion outfits. Evaluation of our novel approach demonstrates the
method’s effectiveness and its ability to provide users with accurate and novel outfit
recommendations.
   The results from the evaluation helped us select which model to deploy in the system.
R-score, AUC, and Recall@N are the most useful measures regarding each individual user. Since
AdaBoost achieved the highest R-score and AUC, it was chosen as the main classifier and
implemented with the novel approach in the recommender service deployed in the cloud. It should
be noted that Gradient Boosting achieved slightly better results in Recall@N, but we regard
this difference as insignificant and conclude that AdaBoost is indeed the best fit for our
system.
   A non-comparable evaluation of the conventional approach was performed to see to what extent
it could recommend daily outfits. The accuracy results are acceptable, but due to the
approach’s many challenges, it cannot be considered an efficient method for recommending daily
outfits.
   Although we have demonstrated the system’s performance using a real-world dataset, a full
scale evaluation using data gathered from physical clothes enabled with RFID tags is planned
for future work. The current state of the system should be considered an early prototype and is
premature for such a full scale evaluation. Because of this, these plans are preliminary and we
consider other research topics to be more important at the current stage. These topics include
content-based outfit recommendation and recommendation of garments to be recycled or donated.
With these research topics, we intend to incorporate additional contextual factors such as
season, user’s occasion, and user’s body type.

ACKNOWLEDGMENTS
This work is an extension to a prototype of the proposed system initially developed during an
internship at Accenture. The authors would like to thank everyone involved in the internship
for their contributions prior to this work. The authors would also like to thank everyone at
Accenture who has provided valuable feedback on this research.

REFERENCES
 [1] Charu C. Aggarwal. 2016. Recommender Systems: The Textbook (1st ed.). Springer Publishing
     Company, Incorporated.
 [2] Arnaud Berlioz, Arik Friedman, Mohamed Ali Kaafar, Roksana Boreli, and Shlomo Berkovsky.
     2015. Applying Differential Privacy to Matrix Factorization. In Proceedings of the 9th ACM
     Conference on Recommender Systems (RecSys ’15). ACM, New York, NY, USA, 107–114.
 [3] Smriti Bhagat, Udi Weinsberg, Stratis Ioannidis, and Nina Taft. 2014. Recommending with an
     Agenda: Active Learning of Private Attributes Using Matrix Factorization. In Proceedings
     of the 8th ACM Conference on Recommender Systems (RecSys ’14). ACM, New York, NY, USA,
     65–72.
 [4] Rich Caruana and Alexandru Niculescu-Mizil. 2006. An Empirical Comparison of Supervised
     Learning Algorithms. In Proceedings of the 23rd International Conference on Machine
     Learning (ICML ’06). ACM, New York, NY, USA, 161–168.
 [5] Thomas G. Dietterich. 2000. Ensemble Methods in Machine Learning. In Proceedings of the
     First International Workshop on Multiple Classifier Systems (MCS ’00). Springer-Verlag,
     London, UK, 1–15.
 [6] Bojana Dumeljic, Martha Larson, and Alessandro Bozzon. 2014. Moody Closet: Exploring
     Intriguing New Views on Wardrobe Recommendation. In Proceedings of the First International
     Workshop on Gamification for Information Retrieval (GamifIR ’14). ACM, New York, NY, USA,
     61–62.
 [7] Yoav Freund and Robert E Schapire. 1997. A Decision-Theoretic Generalization of On-Line
     Learning and an Application to Boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119–139.
 [8] Jerome H. Friedman. 2000. Greedy Function Approximation: A Gradient Boosting Machine.
     Annals of Statistics 29 (2000), 1189–1232.
 [9] K. N. Goh, Y. Y. Chen, and E. S. Lin. 2011. Developing a smart wardrobe system. In 2011
     IEEE Consumer Communications and Networking Conference (CCNC). 303–307.
[10] Derek L. Hansen and Jennifer Golbeck. 2009. Mixing It Up: Recommending Collections of
     Items. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
     (CHI ’09). ACM, New York, NY, USA, 1217–1226.
[11] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004.
     Evaluating Collaborative Filtering Recommender Systems. ACM Trans. Inf. Syst. 22, 1
     (Jan. 2004), 5–53.
[12] Kurt Jacobson, Vidhya Murali, Edward Newett, Brian Whitman, and Romain Yon. 2016. Music
     Personalization at Spotify. In Proceedings of the 10th ACM Conference on Recommender
     Systems (RecSys ’16). ACM, New York, NY, USA, 373–373.
[13] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. 2015. Beyond "Hitting the Hits":
     Generating Coherent Music Playlist Continuations with the Right Tracks. In Proceedings of
     the 9th ACM Conference on Recommender Systems (RecSys ’15). ACM, New York, NY, USA,
     187–194.
[14] Ian Jolliffe. 2002. Principal component analysis. Wiley Online Library.
[15] Ingunn Grimstad Klepp and Kirsi Laitala. 2016. Clothing consumption in Norway. Technical
     report 2. Oslo and Akershus University College, Oslo. In Norwegian.
[16] Anders Kolstad, Özlem Özgöbek, Jon Atle Gulla, and Simon Litlehamar. 2017. Connected
     Closet - A Semantically Enriched Mobile Recommender System for Smart Closets. In
     Proceedings of the 13th International Conference on Web Information Systems and
     Technologies (WEBIST 2017). 298–305.
[17] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets
     (2nd ed.). Cambridge University Press, New York, NY, USA.
[18] David D. Lewis. 1998. Naive (Bayes) at Forty: The Independence Assumption in Information
     Retrieval. In Proceedings of the 10th European Conference on Machine Learning (ECML ’98).
     Springer-Verlag, London, UK, 4–15.
[19] Chantima Limaksornkul, Duangkamol Na Nakorn, Onidta Rakmanee, and Wantanee Viriyasitavat.
     2014. Smart Closet: Statistical-based apparel recommendation system. In Student Project
     Conference (ICT-ISPC), 2014 Third ICT International. IEEE, 155–158.
[20] John C Pruit. 2015. Getting Dressed. Popular Culture as Everyday Life (2015).
[21] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. 2002. Methods
     and Metrics for Cold-start Recommendations. In Proceedings of the 25th Annual
     International ACM SIGIR Conference on Research and Development in Information Retrieval
     (SIGIR ’02). ACM, New York, NY, USA, 253–260.
[22] Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering
     Techniques. Adv. in Artif. Intell. 2009, Article 4 (Jan. 2009).
[23] Andreu Vall. 2015. Listener-Inspired Automated Music Playlist Generation. In Proceedings
     of the 9th ACM Conference on Recommender Systems (RecSys ’15). ACM, New York, NY, USA,
     387–390.
[24] Lin Yu-Chu, Yuusuke Kawakita, Etsuko Suzuki, and Haruhisa Ichikawa. 2012. Personalized
     Clothing-Recommendation System Based on a Modified Bayesian Network. In Proceedings of the
     2012 IEEE/IPSJ 12th International Symposium on Applications and the Internet (SAINT ’12).
     IEEE Computer Society, Washington, DC, USA, 414–417.



</pre>