Fashion and Apparel Classification using Convolutional Neural Networks

Alexander Schindler (Austrian Institute of Technology, Digital Safety and Security, Vienna, Austria, alexander.schindler@ait.ac.at)
Thomas Lidy (Vienna University of Technology, Institute of Software Technology, Vienna, Austria, lidy@ifs.tuwien.ac.at)
Stephan Karner and Matthias Hecker (MonStyle, Vienna, Austria, matthias.hecker@monstyle.io)

Abstract—We present an empirical study of applying deep Convolutional Neural Networks (CNN) to the task of fashion and apparel image classification to improve meta-data enrichment of e-commerce applications. Five different CNN architectures were analyzed using clean and pre-trained models. The models were evaluated on three different tasks: person detection, product classification and gender classification, on a small-scale and a large-scale dataset.

I. INTRODUCTION

The recent progress in the image retrieval domain provides new possibilities for a vertical integration of research results into industrial and commercial applications. Based on the remarkable success of Deep Neural Networks (DNN) applied to image processing tasks, this study focuses on the task of fashion image classification. Photographs of clothes and apparel have to be classified into a set of pre-annotated categories such as skirt, jeans or sport-shoes. Online e-commerce companies such as Asos-EU (http://www.asos.de/), Farfetch (https://www.farfetch.com) or Zalando (https://www.zalando.de/) provide access to the data of their products in stock, including item meta-data and images. Especially the provided meta-data varies in quality, granularity and taxonomy. Although most of the companies provide categorical descriptions of their products, the applied terminology varies, as does the depth of the categorical hierarchy. Fashion image classification is thus used to consolidate the meta-data by enriching it with new, generalized categorical labels.

This is a traditional image processing task with the domain-specific challenges of widely varying styles, textures, shapes and colors. A major advantage is the image quality: the images are professionally produced in high quality and high resolution. There are generally two categories of photographs. The first arranges products in front of a white background. The second portrays a person, or parts of a person, wearing the products. While the first category reduces the semantic noise of the images, the second one introduces it, because a person wearing multiple items such as jeans, t-shirt, shoes and belt is only assigned to a single label. Clothing and apparel retrieval has been addressed to find clothes similar to a photograph [1] or a given style [2]. The main challenge these studies faced was the definition and extraction of relevant features to describe the semantic content of the images with respect to the high variability and deformability of clothing items. Recent approaches harness the potential of Deep Neural Networks (DNN) to learn the image representation. In [3] a siamese network of pre-trained Convolutional Neural Networks (CNN) is used to train a distance function which can be used to assess similarities between fashion images.

In this study we present an empirical evaluation of various DNN architectures concerning their classification accuracy in different classification tasks. These tasks are evaluated on two different datasets at two different scales. First, a wide evaluation is performed on a smaller-scale dataset; the best performing models are then applied to large-scale datasets. The remainder of this paper is organized as follows. In Section II we review related work. In Section III the datasets used for the evaluation are presented. Section IV provides an overview of the evaluated neural network architectures. Section V describes the evaluation setup and summarizes as well as discusses the results. Finally, conclusions and an outlook to future work are given in Section VI.

II. RELATED WORK

Recently, content-based image retrieval (CBIR) has experienced remarkable progress in the field of image recognition by adopting methods from the area of deep learning using convolutional neural networks (CNNs). A full review of deep learning and convolutional neural networks is provided by [4]. Neural networks and CNNs are not new technologies, but despite early successes such as LeNet [5], it is only recently that they have shown competitive results for tasks such as the ILSVRC2012 image classification challenge [6]. With this remarkable reduction of a previously stalling error rate, there has been an explosion of interest in CNNs. Many new architectures and approaches were presented, such as GoogLeNet [7], Deep Residual Networks (ResNets) [8] or the Inception architecture [7]. Neural networks have also been applied to metric learning [9], with applications in image similarity estimation and visual search. Recently, two datasets have been published: the MVC dataset [10] for view-invariant clothing retrieval (161,638 images) and the DeepFashion dataset [11] with 800,000 annotated real-life images.

III. DATA

The data provided was retrieved from online e-commerce companies such as Asos-EU, Farfetch or Zalando.

Person: The persons dataset consists of 7,833 images and the corresponding ground-truth assignments. 5,669 images are labeled as Person and 2,164 images are labeled as Product.

Products: The product dataset consists of 234,884 images and their corresponding ground-truth assignments. These images belong to 39,474 products, where each product is described by 5.95 images on average. Ground-truth labels are provided for the attributes category, gender and age. All labels, including age, are provided on a categorical scale. The provided ground-truth assignments consist of 43 classes for the category attribute. These categories are based on a hierarchical taxonomy; the hierarchy for the provided dataset is depicted in Figure 1. Its largest class SPORTS SHOES contains 66,439 images (10,807 products) and its smallest class JUMPSUITS 6 images (1 product).

Fig. 1. Fashion categories hierarchy.

To facilitate more rapid experimentation, the provided dataset was sub-sampled to approximately 10% of its initial size. Further, due to the class imbalance of the provided category labels, an artificial threshold was applied to the class sizes of the assignments. All classes with fewer than 100 images were skipped; the remaining classes were sub-sampled to a 10% subset. The sub-sampling adhered to further restrictions. First, stratification was used to ensure that the frequency distribution of class labels in the subsample corresponds to that of the original set. Second, sub-sampling was performed on product level. This ensured the consistency of product images and that there are no products with only one image. Finally, sub-sampling of a class was stopped when a minimum of 100 images was reached. This resulted in a subset of 23,305 instances, ranging from 5,659 images for SPORTS SHOES (922 products) to 103 images for STRAIGHT LEG TROUSERS (19 products).
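For illustration, the following minimal Python sketch shows one way the described product-level, stratified sub-sampling could be implemented. The data layout (a dict mapping product ids to a category and its image list), the helper name and the exact stopping rule are our own assumptions; the paper does not publish its sub-sampling code.

```python
import random
from collections import defaultdict

def subsample_products(products, target_ratio=0.10, min_images=100, seed=42):
    """Product-level, stratified sub-sampling (hypothetical re-implementation).

    `products` maps product id -> (category, list of image ids).
    Classes with fewer than `min_images` images are skipped entirely;
    the remaining classes are reduced to roughly `target_ratio` of their
    images, never below `min_images`, keeping whole products at a time.
    """
    rng = random.Random(seed)

    # Group products by class so each class is sampled separately;
    # per-class targets keep the label distribution roughly stratified.
    by_class = defaultdict(list)
    for pid, (category, images) in products.items():
        by_class[category].append((pid, images))

    subset = {}
    for category, members in by_class.items():
        total = sum(len(images) for _, images in members)
        if total < min_images:
            continue                      # skip under-represented classes
        target = max(int(total * target_ratio), min_images)
        rng.shuffle(members)
        kept = 0
        for pid, images in members:       # keep whole products only
            if kept >= target:            # stop once the target is reached
                break
            subset[pid] = (category, images)
            kept += len(images)
    return subset
```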
IV. DEEP NEURAL NETWORK MODELS

In this study we compared five different DNN architectures which varied in depth and number of trainable parameters, including three winning contributions to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [12] and two compact custom CNNs with fewer trainable parameters. The following architectures were evaluated:

VGG16 and VGG19: very deep convolutional neural networks (VGGnet) [13] with 16/19 layers and 47/60 million trainable parameters, reaching an ILSVRC top-5 error rate of 6.8%.

InceptionV3: a high-performance network at a relatively modest computational cost [7], with 25 million trainable parameters, reaching an ILSVRC top-5 error rate of 5.6%.

Custom CNN and VGG-like: compact convolutional neural networks with only 10 million trainable parameters.

The models were implemented in Python 2.7 using the Keras (https://github.com/fchollet/keras) deep learning library on top of the Theano (https://github.com/Theano/Theano) backend.

V. EVALUATION

The Convolutional Neural Networks were evaluated towards their classification accuracies in the tasks of differentiating persons from products as well as classifying product images according to their product category and gender. We performed three-fold cross-evaluation and calculated the accuracies on a per-image and a per-product scale. To calculate the per-product accuracy, the cumulative maximum of all predicted product images was taken into account.
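The paper does not give a formula for this per-product aggregation, so the following sketch shows one plausible reading: the class scores of all images belonging to a product are combined by an element-wise (cumulative) maximum before taking the argmax. Function and variable names are our own.

```python
import numpy as np

def per_product_accuracy(probs, product_ids, product_labels):
    """Per-product accuracy via a cumulative maximum over image predictions.

    probs:          array of shape (n_images, n_classes), class scores per image
    product_ids:    product id for every image (length n_images)
    product_labels: dict mapping product id -> true class index
    """
    agg = {}
    for scores, pid in zip(probs, product_ids):
        # element-wise running maximum over all images of the same product
        agg[pid] = scores if pid not in agg else np.maximum(agg[pid], scores)

    correct = sum(int(np.argmax(scores) == product_labels[pid])
                  for pid, scores in agg.items())
    return correct / float(len(agg))
```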
A. Detecting Persons

Person detection was introduced based on the observation that products are presented in two general types. First, there are images of products placed in front of a white background or on a table. The other type of images shows worn products. Because the persons in these images are wearing more than one product, such as trousers, shirts, shoes and belts, it is hard for a classifier to learn the right boundaries. Thus, the intention was to train a person detector and to either filter person images, or to use this additional information as input for further models.

We applied a custom VGG-like CNN with three layers of batch-normalized stacked convolution layers with 32, 64 and 64 3x3 filters, and a 256-unit fully connected layer with 0.5 dropout. We realized this task as a binary classification problem by using a sigmoid activation function for the output layer. Predictions greater than or equal to 0.5 were classified as persons. This approach already provided an accuracy of 91.07% on the person dataset. Figure 2 shows example images of the person detection model. Images in the bottom row were predicted with values below 0.1 and are thus categorized as products, whereas images in the top row are considered to be persons.

Fig. 2. Example predictions of the person detector (rows ordered by prediction value, from above 0.9 down to below 0.1). Prediction was realized as binary classification: values above 0.5 are classified as persons and values below as products. Images in the first row thus represent images predicted as persons with high confidence.
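A minimal Keras sketch of such a person detector is shown below. The layer widths, dropout rate and sigmoid output follow the description above; the input resolution, pooling layout and optimizer are not reported in the paper and are assumptions.

```python
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, MaxPooling2D,
                          Flatten, Dense, Dropout)

def build_person_detector(input_shape=(128, 128, 3)):
    """VGG-like binary person/product classifier, as sketched in Section V-A.

    Three batch-normalized convolution blocks (32, 64, 64 filters of 3x3),
    a 256-unit fully connected layer with 0.5 dropout and a sigmoid output.
    Input size, pooling and optimizer are assumptions, not from the paper.
    """
    model = Sequential()
    for i, filters in enumerate([32, 64, 64]):
        if i == 0:
            model.add(Conv2D(filters, (3, 3), activation='relu',
                             padding='same', input_shape=input_shape))
        else:
            model.add(Conv2D(filters, (3, 3), activation='relu',
                             padding='same'))
        model.add(BatchNormalization())
        model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))   # prediction >= 0.5 -> person
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```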
B. Product Classification

The product classification experiments were conducted using the different CNN architectures presented in Section IV on two different scales. First, a broad evaluation was performed on the small-scale subset of 23,305 images. Then, the best performing models were evaluated on the large-scale dataset of 234,408 images. All models, except where explicitly mentioned, were trained using image data augmentation, including horizontal flipping of the image, shifting it by 25% in height and width, as well as a 25% zoom range.
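In Keras, which the authors used, these augmentation settings map naturally onto the ImageDataGenerator class. The following sketch shows how such a pipeline could look; the directory layout, target size, batch size and rescaling are our assumptions, not from the paper.

```python
from keras.preprocessing.image import ImageDataGenerator

# Augmentation settings matching Section V-B: horizontal flips,
# 25% shifts in height and width, and a 25% zoom range.
datagen = ImageDataGenerator(
    horizontal_flip=True,
    width_shift_range=0.25,
    height_shift_range=0.25,
    zoom_range=0.25,
    rescale=1. / 255)          # input scaling is our assumption

# Hypothetical directory layout with one sub-folder per category.
train_gen = datagen.flow_from_directory(
    'data/train', target_size=(224, 224), batch_size=32,
    class_mode='categorical')
```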
1) Train from scratch or fine-tune: This part of the evaluation deals with the question of whether to train a model from scratch or to fine-tune a pre-trained model. The availability of a large collection of high-quality images and a relatively small number of classes suggests that models can be effectively fitted to the specific domain. The results presented in Table I show, however, that pre-trained models outperform clean models that have been trained from scratch using only the images of the fashion image collection.

Additionally, we evaluated two different ways of applying pre-trained models: a) resetting and training only the top fully connected layers while keeping all other parameters fixed, and b) continued fitting of all parameters on the new data, which is also referred to as fine-tuning. In either case, the 1000-unit output layer of the pre-trained models had to be replaced with a 30-unit layer representing the 30 product categories. The results of the evaluation show that fine-tuning outperforms the fitting of clean fully connected layers by 5.9% (VGG16) to 7.9% (InceptionV3). The smaller custom models did provide an advantage concerning the processing time of fitting and applying the model, but their accuracy results differ by 16.1% from the top performing fine-tuned model.

TABLE I: Classification results for the product category classification task. Results summarize the per-image accuracy of the best fold, the per-product (cumulative maximum) accuracy of the best fold, and the mean per-product accuracy over all folds.

Description                                                        | Per-image (best fold) | Per-product (best fold) | Per-product (fold mean)
-------------------------------------------------------------------|-----------------------|-------------------------|------------------------
InceptionV3, pretrained, fine-tuned                                | 0.706                 | 0.794                   | 0.791
InceptionV3, pretrained, fine-tuned                                | 0.658                 | 0.729                   | 0.716
VGG16, pretrained, fine-tuned                                      | 0.646                 | 0.711                   | 0.691
InceptionV3, pretrained, fine-tuned, person filter model as layer  | 0.569                 | 0.685                   | 0.658
VGG19, pretrained, fine-tuned                                      | 0.579                 | 0.673                   | 0.634
InceptionV3, pretrained, fine-tuned, no augmentation               | 0.564                 | 0.673                   | 0.647
VGG19, pretrained, train only top layers                           | 0.578                 | 0.669                   | 0.343
VGG16, pretrained, train only top layers                           | 0.603                 | 0.652                   | 0.368
InceptionV3, pretrained, train only top layers                     | 0.585                 | 0.650                   | 0.643
InceptionV3, pretrained, fine-tuned, person-filtered metadata      | 0.640                 | 0.636                   | 0.614
InceptionV3, clean                                                 | 0.492                 | 0.594                   | 0.580
Custom CNN, augmentation                                           | 0.506                 | 0.568                   | 0.538
Custom CNN                                                         | 0.463                 | 0.556                   | 0.523
Custom VGG-like                                                    | 0.438                 | 0.549                   | 0.519
VGG16, clean                                                       | 0.439                 | 0.455                   | 0.443
VGG19, clean                                                       | 0.437                 | 0.447                   | 0.430
-------------------------------------------------------------------|-----------------------|-------------------------|------------------------
VGG19, pretrained, train only top layers                           | 0.819                 | 0.887                   | 0.880
InceptionV3, pretrained, fine-tuned                                | 0.798                 | 0.863                   | 0.836
VGG19, pretrained, fine-tuned                                      | 0.762                 | 0.846                   | 0.830
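As an illustration of variant a) versus fine-tuning b), the sketch below replaces the 1000-unit ImageNet classifier of a pre-trained InceptionV3 with a 30-unit softmax head, as described above. The Keras application API shown here, the pooling choice and the optimizer are assumptions about the setup; the paper reports no training hyper-parameters.

```python
from keras.applications import InceptionV3
from keras.layers import Dense
from keras.models import Model

# Load ImageNet weights without the original 1000-unit classifier, with
# global average pooling so the output is a flat feature vector.
base = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

# New 30-unit softmax head for the 30 product categories.
predictions = Dense(30, activation='softmax')(base.output)
model = Model(inputs=base.input, outputs=predictions)

FINE_TUNE = True   # variant b); set to False for variant a)
if not FINE_TUNE:
    # Variant a): train only the new top layer, keep all pre-trained
    # parameters fixed.
    for layer in base.layers:
        layer.trainable = False
# Variant b), fine-tuning: leave all layers trainable and continue
# fitting every parameter on the fashion data.

model.compile(optimizer='adam',            # optimizer choice is an assumption
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```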
Figure 3 shows the prediction accuracy per class for the best model (a fine-tuned InceptionV3) on the 234,408-image dataset. The most reliably predicted classes are SPORT SHOES, STRAIGHT LEG TROUSERS and BELTS; the least reliable classes are STRAIGHT JEANS, JEANS and SKINNY JEANS. These results indicate the problem of different granularity within the provided ground-truth assignments. Root and leaf nodes are used interchangeably, which results from the aggregation of different e-commerce catalogs using different taxonomies. Although confusing a child class with a parent class is semantically not wrong, the trained models do not take this hierarchy into account and predict each label individually. This effect can be seen in the confusion matrix in Figure 4, where specialized classes such as JEANS and SKINNY JEANS, or SKINNY and SKINNY JEANS, are confused frequently.

Fig. 3. Prediction accuracies on a per-image level for the best performing model - a fine-tuned InceptionV3 on the 234K dataset.

Fig. 4. Confusion matrix on a per-image level for the best performing model - a fine-tuned InceptionV3 on the 234K dataset. The vertical axis represents the annotated category, the horizontal the prediction.

C. Gender Prediction

The aim of the gender prediction task was to predict the intended gender of a product in the classes MALE, FEMALE and UNISEX. The results are comparable to the product classification task in the sense that pre-trained and fine-tuned models provide the highest accuracies, with a best performing value of 88%.

VI. CONCLUSIONS AND FUTURE WORK

In this study we presented an empirical evaluation of different Convolutional Neural Network (CNN) architectures concerning their performance in different tasks in the domain of fashion image classification. The experiments indicated that despite the large amount and high quality of the provided fashion images, pre-trained and fine-tuned models outperform those which were trained on the given collections alone. Future work will concentrate on analyzing models at a scale of two million images.

REFERENCES

[1] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan, "Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3330-3337.
[2] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan, "Style finder: Fine-grained clothing style detection and retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 8-13.
[3] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie, "Learning visual clothing style with heterogeneous dyadic co-occurrences," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4642-4650.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
[5] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, vol. 2, 1989, pp. 598-605.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[9] P. Jain, B. Kulis, J. V. Davis, and I. S. Dhillon, "Metric and kernel learning using a linear transformation," Journal of Machine Learning Research, vol. 13, pp. 519-547, 2012.
[10] K.-H. Liu, T.-Y. Chen, and C.-S. Chen, "MVC: A dataset for view-invariant clothing retrieval and attribute prediction," in Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, ser. ICMR '16. New York, NY, USA: ACM, 2016, pp. 313-316. [Online]. Available: http://doi.acm.org/10.1145/2911996.2912058
[11] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), 2015.
[13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.