=Paper=
{{Paper
|id=Vol-1887/paper4
|storemode=property
|title=Rethinking Conventional Collaborative Filtering for Recommending Daily Fashion Outfits
|pdfUrl=https://ceur-ws.org/Vol-1887/paper4.pdf
|volume=Vol-1887
|authors=Anders Kolstad,Özlem Özgöbek,Jon Atle Gulla,Simon Litlehamar
|dblpUrl=https://dblp.org/rec/conf/recsys/KolstadOGL17
}}
==Rethinking Conventional Collaborative Filtering for Recommending Daily Fashion Outfits==
Anders Kolstad, Özlem Özgöbek, Jon Atle Gulla: Norwegian University of Science and Technology, Trondheim, Norway (andekol@stud.ntnu.no, {ozlem.ozgobek,jon.atle.gulla}@ntnu.no)

Simon Litlehamar: Accenture AS, Fornebu, Norway (simon.litlehamar@accenture.com)

RecSysKTL Workshop @ ACM RecSys '17, August 27, 2017, Como, Italy. © 2017 Copyright is held by the author(s).

===ABSTRACT===
A conventional collaborative filtering approach using a standard utility matrix fails to capture the aspect of matching clothing items when recommending daily fashion outfits. Moreover, it is challenged by the new user cold-start problem. In this paper, we describe a novel approach for guiding users in selecting daily fashion outfits, by providing outfit recommendations from a system consisting of an Internet of Things wardrobe enabled with RFID technology and a corresponding mobile application. We show where a conventional collaborative filtering approach falls short when recommending fashion outfits, and how our novel approach, powered by machine learning algorithms, shows promising results in the domain of fashion recommendation. Evaluation of our novel approach using a real-world dataset demonstrates the system's effectiveness and its ability to provide daily outfit recommendations that are relevant to the users. A non-comparable evaluation of the conventional approach is also given.

===CCS CONCEPTS===
• Information systems → Evaluation of retrieval results; Web applications; • Computing methodologies → Classification and regression trees;

===KEYWORDS===
Recommender Systems, Machine Learning, Fashion Recommendation, Collaborative Filtering, Internet of Things

===1 INTRODUCTION===
Selecting an outfit every morning is a task that many people struggle with, often due to time constraints or the feeling of having nothing to wear. In [20], Pruit argues that our selection of an outfit influences other people's impressions of us, and that it is of high importance to our cultural lives. Moreover, the average Norwegian has 359 unique garments in their closet [15]. This suggests that people need guidance and suggestions for selecting an outfit from their clothing haystack each morning.

Klepp and Laitala found that 20% of clothes bought by Norwegians were never or rarely used [15]. A reason for this might be that they did not actually like the item they bought or that the item did not match any existing clothing items in their closet. This information is very valuable to the clothing retailer. With such information, the retailer can map the customer's taste profile and generate targeted ads for the customer, reducing the number of unnecessary purchases and increasing the number of satisfied customers.

Generating such outfit suggestions and targeted ads can be made a reality by recommender systems. A recommender system tries to predict the rating value of a user-item combination, where the user has indicated their ratings for other items in the past [1]. The system tracks these ratings by receiving user feedback. User feedback is classified into explicit and implicit. Explicit feedback is when the user explicitly rates an item on, e.g., a 5-star scale. Implicit feedback records other user interactions, e.g., how long a user spends on a web page on a certain topic. With the ratings retrieved from user feedback, the recommender system can predict the user's ratings of new items, and suggest the items with a high predicted rating. One of the most successful recommendation techniques is called collaborative filtering (CF) [22]. CF recommends items on the assumption that users who have interacted in similar ways before will have common interests in the future as well. Conventional CF bases its recommendations on a matrix called the utility matrix, which captures every rating value for the user-item combinations known to the system [17]. Table 1 shows an example of such a matrix, consisting of user-item combinations of users and movies. A known challenge in CF is the new user cold-start problem. This challenge is about how to recommend items to new users that have not rated any items yet. Suppose we were to introduce a fourth user in Table 1. The user-item combinations for this fourth user would all have '?' as a value. How to then recommend items to this user is not an easy task.

{| class="wikitable"
|+ Table 1: Example of a utility matrix.
! !! Titanic !! The Godfather !! Pulp Fiction !! The Notebook
|-
| Alice || 5 || 2 || ? || ?
|-
| Bob || 2 || ? || 4 || 1
|-
| Charlie || 4 || 1 || 5 || 4
|}

Recommending individual items, such as in Table 1, is what nearly all recommender systems are focusing on. In recent years, recommendations of collections, such as music playlists [12, 13, 23], have gained a lot of attention. Hansen and Golbeck identified some key aspects that affect the recommendation of collections [10]. One aspect that especially applies to outfit recommendation is the co-occurrence interaction effect. Matching clothing items (items that go well together) will have a positive interaction effect when they co-occur, and will therefore generate a more relevant outfit recommendation to the user.

In [16], we proposed Connected Closet, a system consisting of an Internet of Things wardrobe enabled with an RFID reader, so that clothing items with RFID tags can be checked in and out of the closet, generating implicit feedback on clothing items the user likes. Using a mobile application, the user can give explicit feedback on outfits they like, and receive daily outfit recommendations based on outside temperature and wardrobe inventory. In this paper, we describe an implementation of the proposed system. We show where a conventional CF approach falls short in terms of the new user cold-start problem and where it fails to capture the co-occurrence effect between items. Moreover, we propose a novel CF approach that mitigates the shortcomings of the conventional approach and implement the novel approach in the proposed system. Evaluations using a real-world dataset are performed on both approaches.

The main contributions of this paper are:
# A novel CF approach for recommending daily fashion outfits.
# An accuracy evaluation of the approach using different classification algorithms.

This work is a joint effort between the Smartmedia program (http://research.idi.ntnu.no/SmartMedia/) at NTNU (http://www.ntnu.edu/) and Accenture Norway (https://www.accenture.com/no-en). The Smartmedia program is researching mobile context-aware recommender systems. In this work, Accenture's main goal is to research modern technology for building web-based information systems and to keep track of key technology trends, such as Internet of Things.

The rest of the paper is structured as follows. In Section 2, we give an overview of related work, followed by a description of the proposed system in Section 3. Section 4 introduces the concept of outfit recommendation. The recommendation approaches are described in Section 5 and Section 6. Evaluation of the approaches is given in Section 7. We conclude with a summary and discuss future work in Section 8.

===2 RELATED WORK===
There are not many systems addressing daily outfit recommendations from either an Internet of Things wardrobe or a virtual wardrobe. In this section, we give an overview of the state of the art, identify gaps in these works, and show where our system differs from past work and how it complements previous work.

Dumeljic et al. propose a virtual wardrobe implemented as a mobile application [6]. By explicitly stating their current mood, the user can add the clothing items that best fit the mood to the virtual inventory. In [6], the outfit recommendation approach is not described and has not been implemented in the system. Moreover, a user study of ten people was conducted, where they concluded that mood is a motivator for selecting outfits, but that users would be more invested in the system if it also considered weather.

In [19], Limaksornkul et al. also propose a mobile application used as a virtual wardrobe. They try to solve the problem of efficiently managing closet inventory and guiding users in selecting clothes based on the user's fashion style, trends, their friends' styles, weather, and occasion. In the mobile application, the users can manage their clothes, and receive statistical-based, weather-based, and event-based clothing suggestions. The statistical-based recommendation engine is preliminary and is the only approach that takes the user's preferences into account. Moreover, no evaluation of the system is given.

A smart wardrobe system is proposed by Goh et al. in [9]. Here, garments attached to RFID tags can be scanned in the user's closet. Using a system application, the user can get clothing recommendations based on the user's mood, preferred color, and occasion.

Yu-Chu et al. propose a recommendation system using a modified Bayesian network for generating outfit recommendations from the user's clothing items enabled with RFID tags stored in a smart wardrobe [24]. By taking weather, season, and occasion into consideration, the system first selects a top, and then finds bottoms which match the selected top. The process of selecting a bottom depends on user feedback rating the combination. An experiment on 10 users concluded that the proposed system gave more satisfied users than a baseline using a basic Bayesian network without user feedback.

An important aspect that needs to be mentioned is that virtual wardrobes are heavily dependent on explicit user feedback, while the Internet of Things wardrobes can make use of implicit user feedback as well.

As seen in the works above, most of the recommender systems are preliminary, and do not contain clear steps for the recommendation algorithm. The ones that do have an implemented recommender system only have user studies and lack an accuracy evaluation of their recommendations. In this paper, we describe a fully implemented prototype, using an architecture similar to [9] and [24], enabled with a novel recommendation approach evaluated on a real-world dataset. To the best of our knowledge, our novel approach is a completely unique way of generating recommendations using CF. This is mostly because the majority of CF recommender systems today are heavily based on the utility matrix [22], which is not present in our approach.

===3 SYSTEM OVERVIEW===
In this section, we describe the architecture of the smart wardrobe proposed in [16]. Moreover, we explain how the users receive recommendations through the mobile application which is a part of the architecture. We built and implemented a prototype of the whole system and created a short demonstration video available at https://goo.gl/rZBZqo.

====3.1 Architecture====
Figure 1 shows a high-level view of the architecture. The Closet is embedded with a Raspberry Pi (a tiny computer, see https://www.raspberrypi.org/) connected to an RFID reader. Clothing items that are enabled with an RFID tag and have their ID number stored in the Cloud are compatible with the system. Such clothing items can be manually scanned through the RFID reader. When a scan occurs, a message is broadcast to multiple services deployed in the Cloud. These services include, among others, a recommender service and an inventory service. By communicating with each other and a third-party Weather API, they provide outfit recommendations to the Mobile Application.

Figure 1: High level architecture. (Components shown: RFID tag, Closet with RFID reader, Cloud reached over MQTT, Weather API, Mobile Application.)

====3.2 Mobile Application====
When the user opens the mobile application, they are shown a recommendation for an outfit that suits today's temperature and is inside the user's closet. By swiping through a list, the user is shown multiple recommended outfits. Moreover, the user can modify the recommended outfit by using the arrows that correspond to each clothing item. By clicking a Save button, the user gives explicit positive feedback on the displayed outfit, indicating that this outfit is one of their favorites.

Figure 2: Screenshot of the mobile application.

===4 OUTFIT RECOMMENDATION===
We define an outfit, denoted <math>o</math>, as a tuple of two items, <math>c_1</math> and <math>c_2</math>, where <math>c_1</math> is a top and <math>c_2</math> is a bottom. Although clothing outfits can also contain more, or fewer, than two items, the current version of our system only addresses outfits of two items. This is with the assumption that most outfits consist of one top and one bottom. Recommendation of outfits consisting of a one-piece, e.g., a dress, or with additional accessories, is planned for later research.

====4.1 Inclusion Criteria====
To ensure that the user receives outfit recommendations that are relevant for a given day, we define inclusion criteria for the clothing items that can be part of a recommended outfit. The inclusion criteria are defined as follows:
# Clothing item must be inside the closet. The status of the item is determined by the latest RFID tag scan.
# Clothing item must be suitable for current weather. Items are stored in a database with a suitable temperature range property. This is the range of temperatures in which a clothing item is comfortable to wear. The outside temperature at the time of recommendation must be inside the item's suitable temperature range.

All clothing items that are owned by a user <math>u_i</math> and fit the inclusion criteria are represented as a set <math>I(u_i)</math>. All outfit combinations that can be generated from <math>I(u_i)</math> are added to the set <math>O(u_i)</math>.

====4.2 User Ratings====
The favored outfits indicated (explicitly or implicitly) by the user are stored in the system using unary positive-only values. Outfits that have not been rated are outfits that the user either does not like, or that have not been seen or used together from the user's closet <math>C(u_i)</math>. Outfits that are not rated will be referred to as 'neutral' outfits in the rest of this paper.

====4.3 Recommended Outfits====
The list of recommended outfits that the user receives in the mobile application is generated by the system's recommender service that returns the set <math>R(u_i)</math> of recommended outfits for the user.

====4.4 Notation====
All the notations defined in this section are summarized in Table 2. These notations will be used throughout the paper.

{| class="wikitable"
|+ Table 2: Notations used in this paper.
! Notation !! Description
|-
| <math>u_i</math> || The ''i''th user (owner) of a closet.
|-
| <math>c_j</math> || The ''j''th clothing item.
|-
| <math>o_k = (c_1, c_2)</math> || An outfit of <math>c_1</math> and <math>c_2</math>.
|-
| <math>C(u_i) = \{c_1, \ldots, c_l\}</math> || Every clothing item the user owns.
|-
| <math>I(u_i) = \{c_1, \ldots, c_m\}</math> || Clothes fitting the inclusion criteria.
|-
| <math>O(u_i) = \{o_1, \ldots, o_n\}</math> || Outfit combinations of items in <math>I(u_i)</math>.
|-
| <math>R(u_i) = \{o_1, \ldots, o_p\}</math> || Outfits recommended to the user.
|}

===5 RETHINKING CONVENTIONAL CF===
In this section, we introduce an approach for outfit recommendation using a conventional utility matrix for collaborative filtering. We discuss where this approach falls short, and introduce a novel approach for outfit recommendation using an outfit-item matrix.

====5.1 Conventional CF Approach====
An obvious solution to recommending fashion outfits is to map the users' favorite outfits onto a utility matrix <math>U</math>, consisting of users and outfits. Then, using a neighborhood model, one could predict new outfits for a user by comparing the user's interaction pattern with those of users with a similar interaction pattern. To recommend the daily outfits <math>R(u_i)</math>, we need to match the predicted outfits with the items that fit the inclusion criteria <math>I(u_i)</math>, and filter out outfits that do not contain only such items. The approach is illustrated in Figure 3.

The first problem with this approach is that it can only recommend outfits that have been favored by other users. In other words, it cannot generate completely new outfits, and therefore fails to capture the co-occurrence effect between individual items. Another problem with this approach is that it is challenged by the new user cold-start problem. Users who have not favored any outfits or checked out any items cannot receive recommendations. Lastly, privacy is becoming a huge concern in recommender systems [2, 3], and in this approach, we store all the users' ratings in one centralized matrix, causing a huge risk for the users' privacy.

====5.2 Novel Outfit-Item Matrix Approach====
By basing our recommendations on the idea that users that have similar items in their closets will also have similar taste in outfits, we propose a novel approach where we rethink the conventional approach by completely transforming the utility matrix. In Figure 4, we create a matrix <math>Z</math>, where the columns represent outfits, and the rows represent the clothing items that compose the outfit. Each outfit is associated with a weight <math>w</math>. This weight is the number of users who have favored the outfit. Using <math>Z</math> and <math>W</math>, we train a classifier using a classification model. Outfits that have been favored by users and have an associated weight above 0 will be classified as 'positive', while outfits with an associated weight of 0 will be classified as 'neutral'. When the model has been trained, we generate all the possible outfit combinations <math>O(u_i)</math> of the items that fit the inclusion criteria for the given user <math>u_i</math>. By using the classifier, we can now recommend the outfits that are classified as 'positive' to the user as <math>R(u_i)</math>.

The first advantage of this approach is that it captures the co-occurrence interaction effect between two clothing items. This is because it considers the clothing items that an outfit is composed of, instead of just looking at the outfit as a whole. Moreover, it is not challenged by the new user cold-start problem because we assume that people that own similar clothing items will have the same taste in outfits as well. Lastly, this approach has a huge advantage in terms of user privacy, because it does not need to store the user-item combinations in one centralized matrix.

In Figure 5, we give an example of a possible recommendation pipeline that can occur in our system using the novel approach. To the left is the set of all the clothing items owned by the user. Given this set and the current outside temperature at the user's location, the function <math>f_1</math> filters the items and generates possible outfits for recommendation with respect to the inclusion criteria. These outfits are then passed to <math>f_2</math>, which follows the same steps as described in Figure 4. At the end of the pipeline, we get the generated set of recommended outfits that is displayed in the mobile application.

Although not implemented in our system, this approach could be easily used by a clothing retailer to generate targeted ads by inputting clothing items from the retailer together with the user's clothing items in <math>C(u_i)</math>. Then, the clothing retailer could recommend new outfits that the users might want to buy, or individual items that would make a great outfit with clothing items already owned by the user.

Figure 3: Conventional CF approach using a utility matrix. (The utility matrix <math>U</math>, with users <math>u_1, \ldots, u_n</math> as rows and outfits <math>o_1, \ldots, o_k</math> as columns, feeds a neighborhood model; a filter function using <math>I(u_i)</math> then yields <math>R(u_i)</math>.)

Figure 4: Novel approach using an outfit-item matrix. (The matrix <math>Z</math>, with clothing items <math>c_1, \ldots, c_n</math> as rows and outfits <math>o_1, \ldots, o_k</math> as columns, and the weights <math>W = (w_1, \ldots, w_k)</math> train a classification model; the resulting classifier maps <math>O(u_i)</math> to <math>R(u_i)</math>.)
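The inclusion criteria of Section 4.1 and the outfit-item encoding of Section 5.2 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Item` fields, the function names, and the toy closet are all hypothetical, and the matrix is stored with one row per outfit, i.e. as the transpose of <math>Z</math> as drawn in Figure 4. The classifier itself is left out; any binary classifier could be trained on the resulting vectors and labels.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Item:
    """A clothing item; every field name here is a hypothetical choice."""
    item_id: str
    kind: str          # "top" or "bottom"
    in_closet: bool    # status from the latest RFID scan
    min_temp: float    # lower bound of the suitable temperature range (deg C)
    max_temp: float    # upper bound of the suitable temperature range (deg C)


def included_items(closet, outside_temp):
    """I(u_i): items that are in the closet and suit the outside temperature."""
    return [c for c in closet
            if c.in_closet and c.min_temp <= outside_temp <= c.max_temp]


def candidate_outfits(items):
    """O(u_i): every (top, bottom) pair that can be formed from the items."""
    tops = [c.item_id for c in items if c.kind == "top"]
    bottoms = [c.item_id for c in items if c.kind == "bottom"]
    return list(product(tops, bottoms))


def encode_outfit(outfit, vocabulary):
    """Binary vector with a 1 for each item of the vocabulary in the outfit."""
    return [1 if item_id in outfit else 0 for item_id in vocabulary]


def training_data(favorite_counts, vocabulary):
    """Build (Z, W, labels) from a mapping outfit -> number of users who favored it.

    Z is stored as one row per outfit (the transpose of the paper's drawing).
    """
    outfits = sorted(favorite_counts)
    Z = [encode_outfit(o, vocabulary) for o in outfits]
    W = [favorite_counts[o] for o in outfits]
    labels = ["positive" if w > 0 else "neutral" for w in W]
    return Z, W, labels


# Toy data, invented for illustration only.
closet = [
    Item("t1", "top", True, 15, 30),
    Item("t2", "top", True, -10, 10),     # too warm for a 25 deg C day
    Item("b1", "bottom", True, 10, 30),
    Item("b2", "bottom", False, 10, 30),  # checked out of the closet
]
vocabulary = ["t1", "t2", "b1", "b2"]
candidates = candidate_outfits(included_items(closet, outside_temp=25))
Z, W, labels = training_data({("t1", "b1"): 3, ("t2", "b2"): 0}, vocabulary)
```

In this sketch, `included_items` followed by `candidate_outfits` plays the role of <math>f_1</math>, and the encoded candidates would be passed to a trained classifier in the role of <math>f_2</math>.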
Figure 5: Example of a possible recommendation pipeline using the novel approach. (The user's items <math>C(u_i)</math> and the outside temperature, 25 °C in the example, are passed to <math>f_1</math>, which produces <math>O(u_i)</math>; <math>f_2</math> then produces <math>R(u_i)</math>.)

===6 RECOMMENDATION MODEL===
In this section, we present the recommendation model for our novel approach using different classification models. The chosen classification models are widely known and perform well in many domains [4, 5]. The classification models also include a baseline classifier. Moreover, we introduce some neighborhood models that are applied with the conventional approach.

====6.1 Classification Models====
'''Naïve Bayes.''' Assuming the attributes of a sample are conditionally independent given the sample's class label, Naïve Bayes assigns a test sample the class label <math>Y</math> by maximizing the numerator in this equation [18]:

<math>P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)}, \qquad (1)</math>

where <math>X</math> is a set of <math>d</math> attributes.

'''Adaptive Boosting (AdaBoost).''' In recent years, classification techniques known as ensemble methods have gained a lot of attention. One of the most popular ones is AdaBoost. It aggregates over a set of weak learners <math>h_t(x)</math> that tend to perform slightly better than a random classifier. The final classifier <math>H(x)</math> is then obtained by combining the weak learners in a weighted majority voting scheme using this equation [7]:

<math>H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right), \qquad (2)</math>

where <math>\alpha_t</math> is the assigned weight for each weak learner.

To pick the weak learners, each training sample is associated with a weight indicating its importance. AdaBoost will then pick its weak learners in a forward stage-wise manner by focusing on predicting the high-weight samples correctly.

'''Gradient Boosting.''' Another popular ensemble method that relies on a set of weak learners is called Gradient Boosting. It follows the same fundamental idea as AdaBoost, but instead of focusing on the sample weights when picking its weak learners, it focuses on gradients [8].

'''Uniform.''' As a baseline, we use a classifier that generates class predictions uniformly at random.

====6.2 Neighborhood Models====
To predict the ratings of the user-outfit combinations in the matrix <math>U</math>, given in Figure 3, we apply the user-based neighborhood model [1]. This model predicts user ratings by finding users that have rated similar outfits. To find similar users, we can apply different similarity measures. In our model, we apply Jaccard (JAC) and cosine similarity (COS) as defined by Equation 3:

<math>\mathrm{Sim}_{JAC}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{Sim}_{COS}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \qquad (3)</math>

After user similarities have been calculated, we can predict the ratings <math>\hat{r}_{ui}</math> of unrated outfits using this formula:

<math>\hat{r}_{ui} = \frac{\sum_{v} \mathrm{Sim}(u, v) \, r_{vi}}{\sum_{v} |\mathrm{Sim}(u, v)|} \qquad (4)</math>

====6.3 Ranking Model====
To rank the outfits that are predicted for the user in <math>R(u_i)</math> using the novel approach, we assign each prediction of an outfit <math>o_j</math> a ranking score equal to the classifier's probability of the class label being 'positive', <math>P(w_j > 0 \mid o_j)</math>. It should be noted that this is not a personalized ranking model, but as seen from our results, it performed well for each individual user.

The conventional approach does not use classification models, so the probability of the predicted class label is not available. Instead, the outfits are ranked according to the predicted rating calculated using the similarity measures.

===7 EXPERIMENTS===
In this section, we describe the setting in which our experiment was performed. We give a detailed description of the dataset that was used and present the results of the different models that were evaluated. The main goals of the experiments are to demonstrate the effectiveness of the system and to compare and select the best classification model for our system.

====7.1 Dataset====
The dataset is scraped from Polyvore.com (http://www.polyvore.com/). Polyvore is a social media site where users can create clothing outfits by matching individual clothing items. Other users can then 'like' these outfits by clicking a 'like button'.

From the available outfits at Polyvore, we first gathered the most liked outfits from the last 3 months. We filtered these outfits so that they only contained a top and a bottom. Then, we collected other outfits that these items also were a part of, and filtered them. Lastly, we gathered all the user likes for each of the outfits we had gathered. Table 3 describes the size of the dataset.

{| class="wikitable"
|+ Table 3: Data statistics on Polyvore dataset.
! # Outfits !! # Clothes !! # Users !! # Likes
|-
| 6,186 || 158 || 7,093 || 19,287
|-
| Positive: 260, Neutral: 5,917 || Tops: 81, Bottoms: 87 || ||
|}

From the gathered dataset, we have 260 outfits that are classified as 'positive' and 5,917 that are classified as 'neutral'. This means that the dataset has an imbalance of approximately 23 to 1. In total, there are 158 individual clothing items in the dataset. This means that the feature vectors used in the classification models will be relatively sparse binary vectors of 158 dimensions.

====7.2 Evaluation Methods====
To evaluate our novel approach, we iterated through the following procedure for all users with at least 20 outfit likes: We hide each of the user's ground-truth favorite outfits from the system by decreasing the outfits' corresponding weights in <math>W</math> by 1. Then, we train the classification model using <math>Z</math> and <math>W</math>. Moreover, with the assumption that a user only owns items that are part of the outfits the user likes, we generate outfit combinations, assuming all of the items in <math>C(u_i)</math> fit the inclusion criteria. We then compare the predicted class labels of the generated outfit combinations to the true favorite outfits of the user. We also ran the procedure a second time, but now randomly removing 50% of the users' tops and bottoms in <math>C(u_i)</math>. This was done to simulate outfit recommendations from a half empty closet. In Table 4, we summarize some statistics for the test sets that were generated by running these methods. As seen in this table, there are on average quite many outfits being classified for each user, <math>O(u_i)</math>, compared to the true number of the user's favorite outfits, <math>O(u_i)_{TP}</math>.

{| class="wikitable"
|+ Table 4: The average properties for the users in the test sets.
! Closet size !! <math>\mathrm{avg}(\lvert O(u_i) \rvert)</math> !! <math>\mathrm{avg}(\lvert O(u_i)_{TP} \rvert)</math>
|-
| Full || 682.5 || 31.4
|-
| Half empty || 164.0 || 17.6
|}

To reduce the dimensionality of the samples and to detect items that are interrelated, the multivariate analysis technique called principal component analysis was applied to the samples before training the models [14]. The reduction is done by transforming to a new set of uncorrelated features ordered so that the first ones retain most of the original variation.

For evaluating the conventional approach using the different neighborhood models, we first randomly removed 30% of the user likes from the utility matrix. Then, we predicted all outfit likes for each user, and filtered them with respect to <math>I(u_i)</math> using the same assumption as above. The recommended outfits were then compared to the true outfit likes.

====7.3 Evaluation Metrics====
If we look at the task of recommending outfits as retrieving all relevant items (outfits) from a collection of outfits separated into two classes, relevant and not relevant, we can apply the popular accuracy metrics from information retrieval systems. In our case, we say that the relevant outfits are the ones classified as 'positive', and the not relevant are the outfits classified as 'neutral'. Then, we can use a popular metric known as Recall. It measures the ratio of relevant items retrieved to the number of all relevant items available [11]:

<math>\mathrm{Recall} = \frac{|\text{relevant items retrieved}|}{|\text{all relevant items}|} \qquad (5)</math>

In this paper, we also report Recall@N, which is the Recall in a ranked list considering just the first N elements. We compute Recall and Recall@N by averaging over the result for each user <math>u_i</math>.

A way to graphically display the tradeoff between the true positive rate and the false positive rate is known as a receiver operating characteristic (ROC) curve. The true positive rate is the same as Recall, and the false positive rate is the ratio of non-relevant items retrieved to the number of all non-relevant items available. The ROC curve is well suited to comparing the performance of classifiers, where the best classifiers tend to be located in the upper left corner of the diagram. The classifiers that perform best on average will have a large area under the ROC curve (AUC) [11].

To evaluate the ranking via utility, we sum the utility of an outfit <math>j</math> to a user <math>u</math> over a ranked recommended list of size <math>L</math>. By summing over this value for each user, we obtain the R-score as follows [1]:

<math>\text{R-score} = \sum_{u=1}^{m} \; \sum_{j \in I_u,\, v_j \le L} \frac{\max\{r_{uj}, 0\}}{2^{(v_j - 1)/\alpha}}, \qquad (6)</math>

where <math>v_j</math> is the rank of outfit <math>j</math> and <math>r_{uj}</math> is the ground-truth rating of outfit <math>j</math>. <math>\alpha</math> is the half-life, set to 5 in our experiments. The higher the R-score is, the more the true favorite outfits for each user tend to appear at the top of the ranked list.

====7.4 Results and Discussion====
In this section, we present our results and discuss some insight we obtained while running the experiments. By the end of this section we will have answered the following questions:
* Q1. How do the different classification models compare using our novel approach?
* Q2. How does closet size affect the recommendation results?
* Q3. To what extent can the conventional approach be used to recommend new outfits to the users?

The evaluation method for the novel approach was performed using the classification models in Section 6.1. For Naïve Bayes the best configuration was setting a prior probability of 0.99 for the 'neutral' class label and 0.01 for the 'positive' class. This was mostly due to the 23 to 1 imbalance in the dataset. AdaBoost gave the best result using decision trees as weak learners and with a learning rate of 1.0. Gradient Boosting performed best with similar configurations.

{| class="wikitable"
|+ Table 5: Results from evaluation of novel approach.
! rowspan="2" | Model !! colspan="3" | Overall !! colspan="5" | Top-L
|-
! AUC !! Accuracy !! Recall !! R-score !! Recall@5 !! Recall@10 !! Recall@15 !! Recall@20
|-
| Naïve Bayes || .704 || .870 || .756 || 566.2 || .091 || .183 || .287 || .382
|-
| Gradient Boosting || .864 || .878 || .997 || 851.9 || .111 || .228 || .337 || .448
|-
| AdaBoost || .885 || .723 || .978 || 872.1 || .113 || .223 || .334 || .442
|-
| Uniform || .500 || .500 || .493 || 88.0 || .024 || .052 || .077 || .095
|}

In Table 5, we report AUC, Accuracy and Recall for the predicted class labels for all of the outfits that were tested when simulating a full closet. In the right-hand side of the table, we also report the R-score and Recall@N in a ranked list of <math>L</math> outfits. Because each user has a different number of clothes in their closet, every user is recommended a ranked list of a different length <math>L</math>. The best performing model in each category is highlighted by underlining its result. As seen in the table, Gradient Boosting and AdaBoost are the dominating models in all categories. On average and overall, Gradient Boosting performs best, while in a top-L ranked list, AdaBoost performs slightly better. For N > 5, Gradient Boosting was at most .006 points better than AdaBoost in Recall@N. In terms of the R-score, AdaBoost is superior to Gradient Boosting. Because of this, we conclude that AdaBoost is the model yielding the highest utility to the users.

In Figure 6, we plot a ROC curve for the different models used to generate a single ranked list of user-outfit pairs. This type of ROC curve is sometimes referred to as a global ROC curve [21]. As indicated by the gray dotted line, AdaBoost is the best model at a false positive rate of 20%, predicting 86% of the users' favorite outfits. As the false positive rate increases, Gradient Boosting becomes slightly superior to AdaBoost. On average, Gradient Boosting and AdaBoost dominate the two other models with an AUC of .864 and .885, respectively. Naïve Bayes yields a satisfactory AUC of .704, while from the Uniform model we got an expected AUC of .500. The high values of AUC and R-score are a strong indication that the non-personalized ranking model performs quite well and even better than expected.

Figure 6: Global ROC curves for recommendations from a full closet. (True Positive Rate against False Positive Rate for Naïve Bayes, Gradient Boosting, AdaBoost, and Uniform.)

Figure 7: Distributions of outfit recommendations using AdaBoost. (Number of recommendations for favorite outfits and novel outfits.)

Figure 7 shows the distributions of outfit recommendations in a top-20 list recommended to the users with at least 20 outfit likes. In total, 196 unique outfits were recommended to the users, where 33 of them were novel outfits, never favored by any users in the past. This shows that a wide range of outfits end up in the users' recommended top lists.

The experiment on a half empty closet resulted in no change in terms of overall Recall, and at most a .005 decrease in AUC, and for this reason, we do not report any results beyond this. Besides the fact that fewer clothing items will result in fewer outfit recommendations, we conclude that closet size has little effect on the recommendations.

In Table 6, results from the evaluation of the conventional approach are given. The table shows Recall@N in a ranked list of <math>M</math> outfits. Because <math>M</math> is much lower than <math>L</math>, we only report up to N = 5 (as opposed to N = 20 in the evaluation of the novel approach). Note that the results are not comparable to the results in Table 5, as they are derived using an approach that is fundamentally different. The best performing model is highlighted with underlined results. As the numbers indicate, the approach generates new outfit recommendations to the users with a satisfactory accuracy. However, these outfit recommendations are, as argued in Section 5, only outfits that have been composed and favored by other users in the past. Therefore, we conclude that this approach is insufficient when it comes to recommending novel and personalized daily outfits.

{| class="wikitable"
|+ Table 6: Evaluation of the conventional approach.
! Model !! Recall@1 !! Recall@5
|-
| Cosine || .077 || .366
|-
| Jaccard || .050 || .250
|}

===8 CONCLUSION AND FUTURE WORK===
We have introduced a novel approach for recommending daily fashion outfits from a smart closet. Our novel approach mitigates a wide range of challenges faced by a conventional approach that tries to recommend daily fashion outfits. Evaluation of our novel approach demonstrates the method's effectiveness, and its ability to provide users with accurate and novel outfit recommendations.

The results from the evaluation helped us select which model to deploy in the system. R-score, AUC, and Recall@N are the most useful measures regarding each individual user. Since AdaBoost achieved the highest R-score and AUC, it was chosen as the main classifier and implemented with the novel approach in the recommender service deployed in the cloud. It should be noted that Gradient Boosting achieved slightly better results in Recall@N,

The authors would also like to thank everyone at Accenture who has provided valuable feedback on this research.

===REFERENCES===
* [1] Charu C. Aggarwal. 2016. Recommender Systems: The Textbook (1st ed.). Springer Publishing Company, Incorporated.
* [2] Arnaud Berlioz, Arik Friedman, Mohamed Ali Kaafar, Roksana Boreli, and Shlomo Berkovsky. 2015. Applying Differential Privacy to Matrix Factorization. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys '15). ACM, New York, NY, USA, 107–114.
* [3] Smriti Bhagat, Udi Weinsberg, Stratis Ioannidis, and Nina Taft. 2014. Recommending with an Agenda: Active Learning of Private Attributes Using Matrix Factorization. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys '14). ACM, New York, NY, USA, 65–72.
* [4] Rich Caruana and Alexandru Niculescu-Mizil. 2006. An Empirical Comparison of Supervised Learning Algorithms. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06). ACM, New York, NY, USA, 161–168.
* [5] Thomas G. Dietterich. 2000. Ensemble Methods in Machine Learning. In Proceedings of the First International Workshop on Multiple Classifier Systems (MCS '00). Springer-Verlag, London, UK, 1–15.
* [6] Bojana Dumeljic, Martha Larson, and Alessandro Bozzon. 2014. Moody Closet: Exploring Intriguing New Views on Wardrobe Recommendation. In Proceedings of the First International Workshop on Gamification for Information Retrieval (GamifIR '14). ACM, New York, NY, USA, 61–62.
* [7] Yoav Freund and Robert E. Schapire. 1997. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119–139.
* [8] Jerome H. Friedman. 2000. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (2000), 1189–1232.
* [9] K. N. Goh, Y. Y. Chen, and E. S. Lin. 2011. Developing a smart wardrobe system. In 2011 IEEE Consumer Communications and Networking Conference (CCNC). 303–307.
* [10] Derek L. Hansen and Jennifer Golbeck. 2009. Mixing It Up: Recommending Collections of Items. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09). ACM, New York, NY, USA, 1217–1226.
* [11] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Trans. Inf. Syst. 22, 1 (Jan. 2004), 5–53.
* [12] Kurt Jacobson, Vidhya Murali, Edward Newett, Brian Whitman, and Romain Yon. 2016. Music Personalization at Spotify. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 373–373.
* [13] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. 2015. Beyond "Hitting the Hits": Generating Coherent Music Playlist Continuations with the Right Tracks.
In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys but we regard this difference as insignificant and conclude that ’15). ACM, New York, NY, USA, 187–194. AdaBoost is indeed the best fit for our system. [14] Ian Jolliffe. 2002. Principal component analysis. Wiley Online Library. [15] Ingunn Grimstad Klepp and Kirsi Laitala. 2016. Clothing consumption in Norway. A non-comparable evaluation of the conventional approach was Technical report 2. Oslo and Akershus University College, Oslo. In Norwegian. performed to see to what extent it could recommend daily outfits. [16] Anders Kolstad, Özlem Özgöbek, Jon Atle Gulla, and Simon Litlehamar. 2017. The accuracy results are acceptable, but due to the approach’s Connected Closet - A Semantically Enriched Mobile Recommender System for Smart Closets. In Proceedings of the 13th International Conference on Web many challenges, it cannot be considered as an efficient method for Information Systems and Technologies (WEBIST 2017). 298–305. recommending daily outfits. [17] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Although we have demonstrated the system’s performance using Massive Datasets (2nd ed.). Cambridge University Press, New York, NY, USA. [18] David D. Lewis. 1998. Naive (Bayes) at Forty: The Independence Assumption in a real-world dataset, a full scale evaluation using data gathered from Information Retrieval. In Proceedings of the 10th European Conference on Machine physical clothes enabled with RFID tags is planned for future work. Learning (ECML ’98). Springer-Verlag, London, UK, UK, 4–15. [19] Chantima Limaksornkul, Duangkamol Na Nakorn, Onidta Rakmanee, and Wanta- The current state of the system should be considered as an early nee Viriyasitavat. 2014. Smart Closet: Statistical-based apparel recommendation prototype and is premature for such a full scale evaluation. Because system. In Student Project Conference (ICT-ISPC), 2014 Third ICT International. 
of this, these plans are preliminary and we consider other research IEEE, 155–158. [20] John C Pruit. 2015. Getting Dressed. Popular Culture as Everyday Life (2015). topics to be more important at the current stage. These topics in- [21] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. clude content-based outfit recommendation and recommendation 2002. Methods and Metrics for Cold-start Recommendations. In Proceedings of of garments to be recycled or donated. With these research top- the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’02). ACM, New York, NY, USA, 253–260. ics, we intend to incorporate additional contextual factors such as [22] Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative season, user’s occasion, and user’s body type. Filtering Techniques. Adv. in Artif. Intell. 2009, Article 4 (Jan. 2009). [23] Andreu Vall. 2015. Listener-Inspired Automated Music Playlist Generation. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys ’15). ACKNOWLEDGMENTS ACM, New York, NY, USA, 387–390. [24] Lin Yu-Chu, Yuusuke Kawakita, Etsuko Suzuki, and Haruhisa Ichikawa. 2012. This work is an extension to a prototype of the proposed system Personalized Clothing-Recommendation System Based on a Modified Bayesian initially developed during an internship at Accenture. The authors Network. In Proceedings of the 2012 IEEE/IPSJ 12th International Symposium on would like to thank everyone involved in the internship for their Applications and the Internet (SAINT ’12). IEEE Computer Society, Washington, DC, USA, 414–417. contributions prior this work. 29