Fashion and Apparel Classification using Convolutional Neural Networks

Alexander Schindler
Austrian Institute of Technology, Digital Safety and Security, Vienna, Austria
alexander.schindler@ait.ac.at

Thomas Lidy
Vienna University of Technology, Institute of Software Technology, Vienna, Austria
lidy@ifs.tuwien.ac.at

Stephan Karner, Matthias Hecker
MonStyle, Vienna, Austria
matthias.hecker@monstyle.io



Abstract—We present an empirical study of applying deep Convolutional Neural Networks (CNN) to the task of fashion and apparel image classification to improve meta-data enrichment of e-commerce applications. Five different CNN architectures were analyzed using clean and pre-trained models. The models were evaluated on three different tasks, person detection, product classification and gender classification, on both small-scale and large-scale datasets.

I. INTRODUCTION

The recent progress in the image retrieval domain provides new possibilities for a vertical integration of research results into industrial or commercial applications. Based on the remarkable success of Deep Neural Networks (DNN) applied to image processing tasks, this study focuses on the task of fashion image classification. Photographs of clothes and apparel have to be classified into a set of pre-annotated categories such as skirt, jeans or sport-shoes. Online e-commerce companies such as Asos-EU 1, Farfetch 2 or Zalando 3 provide access to the data of their products in stock, including item meta-data and images. The provided meta-data especially varies in quality, granularity and taxonomy. Although most of the companies provide categorical descriptions of their products, the applied terminology varies, as does the depth of the categorical hierarchy. Fashion image classification is thus used to consolidate the meta-data by enriching it with new, generalized categorical labels.

This is a traditional image processing task with the domain-specific challenges of highly varying styles, textures, shapes and colors. A major advantage is the image quality: the photographs are professionally produced at high quality and high resolution. There are generally two categories of photographs. The first arranges products in front of a white background. The second portrays a person, or parts of a person, wearing the products. While the first category reduces the semantic noise of the images, the second one introduces it, because a person wearing multiple items such as jeans, t-shirt, shoes and belt is only assigned a single label. Clothing and apparel retrieval has been addressed to find clothes similar to a photograph [1] or a given style [2]. The main challenge these studies faced was the definition and extraction of relevant features to describe the semantic content of the images with respect to the high variability and deformability of clothing items. Recent approaches harness the potential of Deep Neural Networks (DNN) to learn the image representation. In [3] a siamese network of pre-trained Convolutional Neural Networks (CNN) is used to train a distance function which can be used to assess similarities between fashion images.

In this study we present an empirical evaluation of various DNN architectures concerning their classification accuracy in different classification tasks. These tasks are evaluated on two different datasets at two different scales. First, a wide evaluation is performed on a smaller-scale dataset and the best performing models are then applied to the large-scale datasets. The remainder of this paper is organized as follows. In Section II we review related work. In Section III the datasets used for the evaluation are presented. Section IV provides an overview of the evaluated neural network architectures. Section V describes the evaluation setup and summarizes as well as discusses the results. Finally, conclusions and an outlook on future work are given in Section VI.

1 http://www.asos.de/
2 https://www.farfetch.com
3 https://www.zalando.de/

II. RELATED WORK

Recently, content-based image retrieval (CBIR) has experienced remarkable progress in the field of image recognition by adopting methods from the area of deep learning using convolutional neural networks (CNNs). A full review of deep learning and convolutional neural networks is provided by [4]. Neural networks and CNNs are not new technologies, but despite early successes such as LeNet [5], it is only recently that they have shown competitive results for tasks such as the ILSVRC 2012 image classification challenge [6]. With this remarkable reduction of a previously stalling error rate there has been an explosion of interest in CNNs. Many new architectures and approaches were presented, such as GoogLeNet [7], Deep Residual Networks (ResNets) [8] or the Inception architecture [7]. Neural networks have also been applied to metric learning [9] with applications in image similarity estimation and visual search. Recently two datasets have been published: the MVC dataset [10] for view-invariant clothing retrieval (161,638 images) and the DeepFashion dataset [11] with 800,000 annotated real-life images.





Fig. 1. Fashion categories hierarchy.

III. DATA

The data provided was retrieved from online e-commerce companies such as Asos-EU, Farfetch or Zalando.

Person: The persons dataset consists of 7,833 images and the corresponding ground-truth assignments. 5,669 images are labeled as Person and 2,164 images are labeled as Product.

Products: The product dataset consists of 234,884 images and their corresponding ground-truth assignments. These images belong to 39,474 products, where each product is described by 5.95 images on average. Ground-truth labels are provided for the attributes category, gender and age. All labels, including age, are provided on a categorical scale. The provided ground-truth assignments consist of 43 classes for the category attribute. These categories are based on a hierarchical taxonomy; the hierarchy for the provided dataset is depicted in Figure 1. Its largest class SPORTS SHOES contains 66,439 images (10,807 products) and its smallest class JUMPSUITS 6 images (1 product). To facilitate more rapid experimentation, the provided dataset was sub-sampled to approximately 10% of its initial size. Further, due to the class imbalance of the provided category labels, an artificial threshold was applied to the class sizes of the assignments: all classes with fewer than 100 images were skipped. The remaining classes were sub-sampled to a 10% subset. The sub-sampling adhered to further restrictions. First, stratification was used to ensure that the frequency distribution of class labels in the subsample corresponds to that of the original set. Second, sub-sampling was performed at product level. This ensured the consistency of product images and that there are no products with only one image. Finally, sub-sampling of a class was stopped when a minimum of 100 images was reached. This resulted in a subset of 23,305 instances, ranging from 5,659 images for SPORTS SHOES (922 products) to 103 images for STRAIGHT LEG TROUSERS (19 products).
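The following sketch illustrates one possible implementation of this product-level, stratified sub-sampling. It is a minimal reconstruction from the description above, not the authors' original code; the DataFrame layout and the column names 'product_id' and 'category' are our assumptions.

```python
import pandas as pd

def subsample(images, fraction=0.10, min_images=100, seed=42):
    """Product-level, stratified sub-sampling as described above.

    `images` is a DataFrame with one row per image and (assumed)
    columns 'product_id' and 'category'.
    """
    kept = []
    for _, group in images.groupby('category'):
        if len(group) < min_images:
            continue  # classes with fewer than 100 images are skipped
        # Target size: 10% of the class, but never below 100 images.
        target = max(int(len(group) * fraction), min_images)
        # Shuffle whole products so every kept product keeps all
        # of its images (no products with only one image remain).
        products = group['product_id'].drop_duplicates().sample(
            frac=1.0, random_state=seed)
        selected, count = [], 0
        for pid in products:
            if count >= target:
                break
            selected.append(pid)
            count += (group['product_id'] == pid).sum()
        kept.append(group[group['product_id'].isin(selected)])
    return pd.concat(kept, ignore_index=True)
```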
IV. DEEP NEURAL NETWORK MODELS

In this study we compared five different DNN architectures which varied in depth and number of trainable parameters, including three winning contributions to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [12] and two compact custom CNNs with fewer trainable parameters. The following architectures were evaluated:

Vgg16 and Vgg19: very deep convolutional neural networks (VGGnet) [13] with 16/19 layers and 47/60 million trainable parameters, reaching an ILSVRC top-5 error rate of 6.8%.

InceptionV3: high-performance network at a relatively modest computational cost [7] with 25 million trainable parameters, reaching an ILSVRC top-5 error rate of 5.6%.

Custom CNN and Vgg-like: compact convolutional neural networks with only 10 million trainable parameters.

The models were implemented in Python 2.7 using the keras4 deep learning library on top of the Theano5 backend.

4 https://github.com/fchollet/keras
5 https://github.com/Theano/Theano

V. EVALUATION

The Convolutional Neural Networks were evaluated with respect to their classification accuracies in the tasks of differentiating persons from products as well as classifying product images according to their product category and gender. We performed three-fold cross-evaluation and calculated the accuracies on a per-image and a per-product scale. To calculate the per-product accuracy, the cumulative maximum of all predicted product images was taken into account.
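Read literally, this aggregation takes, for each product, the element-wise maximum of the class probabilities predicted for its images and classifies the product by the best aggregated score. A minimal NumPy sketch under that reading (our interpretation, not the authors' code):

```python
import numpy as np

def product_prediction(image_probs):
    """image_probs: array of shape (n_images, n_classes) holding the
    softmax outputs for all images of a single product."""
    # Take the maximum confidence per class over all product images ...
    cum_max = np.max(image_probs, axis=0)
    # ... and predict the class with the highest aggregated confidence.
    return int(np.argmax(cum_max))

# Example: three images of one product, four candidate classes.
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.2, 0.3, 0.4, 0.1],
                  [0.7, 0.1, 0.1, 0.1]])
print(product_prediction(probs))  # -> 0
```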





Fig. 2. Example predictions of the person detector. Prediction was realized as binary classification: values above 0.5 are classified as persons and values below as products. Images in the first line thus represent images predicted as persons with high confidence.

A. Detecting Persons

Person detection was introduced based on the observation that products are presented in two general types. First, there are images of products placed in front of a white background or on a table. The other type of images shows worn products. Because the persons in these images wear more than one product, such as trousers, shirts, shoes and belts, it is hard for a classifier to learn the right boundaries. Thus, the intention was to train a person detector and to either filter person images, or to use this additional information as input for further models.

We applied a custom VGG-like CNN with three layers of batch-normalized stacked convolution layers with 32, 64 and 64 3x3 filters and a 256-unit fully connected layer with 0.5 dropout. We realized this task as a binary classification problem by using a sigmoid activation function for the output layer. Predictions greater than or equal to 0.5 were classified as persons. This approach already provided an accuracy of 91.07% on the person dataset. Figure 2 shows example images of the person detection model. Images in the bottom row were predicted with values below 0.1 and are thus categorized as products, whereas images in the top row are considered to be persons.
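A Keras sketch of such a detector follows. It is a reconstruction from the description above, written against the Keras 2 API rather than the paper's Python 2.7/Theano setup; the input resolution, pooling layers and optimizer are assumptions not stated in the text.

```python
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Flatten, Dense, Dropout)

model = Sequential()
# Three batch-normalized convolution blocks with 32, 64 and 64
# filters of size 3x3, as described in the text.
for i, filters in enumerate([32, 64, 64]):
    if i == 0:
        model.add(Conv2D(filters, (3, 3), padding='same',
                         input_shape=(224, 224, 3)))  # assumed input size
    else:
        model.add(Conv2D(filters, (3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))  # assumed pooling

model.add(Flatten())
model.add(Dense(256, activation='relu'))  # 256-unit fully connected layer
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # >= 0.5 means "person"

model.compile(optimizer='adam',  # assumed optimizer
              loss='binary_crossentropy', metrics=['accuracy'])
```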

B. Product Classification

The product classification experiments were conducted using the different CNN architectures presented in Section IV on two different scales. First, a broad evaluation was performed on the small-scale subset of 23,305 images. Then, the best performing models were evaluated on the large-scale dataset of 234,408 images. All models, except those where explicitly mentioned, were trained using image data augmentation, including horizontal flipping of the image, shifting it by 25% in height and width, as well as a 25% zoom range.
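In Keras, this augmentation policy can be expressed with the ImageDataGenerator class. The following configuration mirrors the settings listed above; all other options are left at their defaults (a sketch, not the authors' exact code):

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,     # mirror images left/right
    width_shift_range=0.25,   # shift by up to 25% of the image width
    height_shift_range=0.25,  # shift by up to 25% of the image height
    zoom_range=0.25)          # zoom in/out by up to 25%

# Typical use: train on batches generated on the fly, e.g.
# model.fit_generator(datagen.flow(X_train, y_train, batch_size=32), ...)
```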




1) Train from scratch or fine-tune: This part of the evaluation deals with the question of whether to train a model from scratch or to fine-tune a pre-trained model. The availability of a large collection of high-quality images and a relatively small number of classes suggests that models can be effectively fitted to the specific domain.

The results presented in Table I show that pre-trained models outperform clean models that have been trained from scratch using only the images of the fashion image collection. Additionally, we evaluated two different ways of applying pre-trained models: a) resetting and training only the top fully connected layers while keeping all other parameters fixed, and b) continued fitting of all parameters on the new data, which is also referred to as fine-tuning. In either case the 1000-unit output layer of the pre-trained models had to be replaced with a 30-unit layer representing the 30 product categories. The results of the evaluation show that fine-tuning outperforms the fitting of clean fully connected layers by 5.9% (VGG16) to 7.9% (InceptionV3). The smaller custom models did provide an advantage concerning the processing time of fitting and applying the model, but their accuracy results differ by 16.1% from the top performing fine-tuned model.
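A compact Keras sketch of the two variants, a) and b), is given below. The keras.applications module ships InceptionV3 with ImageNet weights; the added pooling and dense layers and the optimizer are our assumptions, not the paper's exact configuration.

```python
from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

# Load the pre-trained network without its 1000-unit ImageNet output.
base = InceptionV3(weights='imagenet', include_top=False)

# Replace the top with a new classifier for the 30 product categories.
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)      # assumed intermediate layer
out = Dense(30, activation='softmax')(x)  # 30 product categories
model = Model(inputs=base.input, outputs=out)

FINE_TUNE = True  # variant (b); set to False for variant (a)
for layer in base.layers:
    # (a) freeze all pre-trained parameters and train only the new top,
    # or (b) continue fitting all parameters ("fine-tuning").
    layer.trainable = FINE_TUNE

model.compile(optimizer='sgd',  # assumed optimizer
              loss='categorical_crossentropy', metrics=['accuracy'])
```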
Figure 3 shows the prediction accuracy per class for the best model (a fine-tuned InceptionV3) on the 234,408 images dataset. The most reliably predicted classes are SPORT SHOES, STRAIGHT LEG TROUSERS and BELTS; the least reliable classes are STRAIGHT JEANS, JEANS and SKINNY JEANS. These results indicate the problem of different granularities within the provided ground-truth assignments. Root- and leaf-nodes are used interchangeably, which results from the aggregation of different e-commerce catalogs using different taxonomies. Although confusing a child- with a parent-class is semantically not wrong, the trained models do not take this hierarchy into account and predict each label individually. This effect can be seen in the confusion matrix in Figure 4, where specialized classes such as JEANS and SKINNY JEANS, or SKINNY and SKINNY JEANS, are confused frequently.

Fig. 3. Prediction accuracies on a per-image level for the best performing model, a fine-tuned InceptionV3 on the 234K dataset.

Fig. 4. Confusion matrix on a per-image level for the best performing model, a fine-tuned InceptionV3 on the 234K dataset. The vertical axis represents the annotated category, the horizontal axis the prediction.

TABLE I
CLASSIFICATION RESULTS FOR THE product category CLASSIFICATION TASK. RESULTS SUMMARIZE THE PER-IMAGE ACCURACY OF THE BEST FOLD, THE PER-PRODUCT ACCURACY OF THE BEST FOLD, AND THE MEAN PER-PRODUCT ACCURACY OF ALL FOLDS.

Description                                                        | Best fold | Best fold cum. max | Mean cum. max
InceptionV3, pretrained, fine-tuned                                | 0.706     | 0.794              | 0.791
InceptionV3, pretrained, fine-tuned                                | 0.658     | 0.729              | 0.716
VGG16, pretrained, fine-tuned                                      | 0.646     | 0.711              | 0.691
InceptionV3, pretrained, fine-tuned, person filter model as layer  | 0.569     | 0.685              | 0.658
VGG19, pretrained, fine-tuned                                      | 0.579     | 0.673              | 0.634
InceptionV3, pretrained, fine-tuned, no augmentation               | 0.564     | 0.673              | 0.647
VGG19, pretrained, train only top-layers                           | 0.578     | 0.669              | 0.343
VGG16, pretrained, train only top-layers                           | 0.603     | 0.652              | 0.368
InceptionV3, pretrained, train only top-layers                     | 0.585     | 0.650              | 0.643
InceptionV3, pretrained, fine-tuned, person filtered metadata      | 0.640     | 0.636              | 0.614
InceptionV3, clean                                                 | 0.492     | 0.594              | 0.580
Custom CNN, augmentation                                           | 0.506     | 0.568              | 0.538
Custom CNN                                                         | 0.463     | 0.556              | 0.523
Custom VGG-like                                                    | 0.438     | 0.549              | 0.519
VGG16, clean                                                       | 0.439     | 0.455              | 0.443
VGG19, clean                                                       | 0.437     | 0.447              | 0.430
VGG19, pretrained, train only top-layers                           | 0.819     | 0.887              | 0.880
InceptionV3, pretrained, fine-tuned                                | 0.798     | 0.863              | 0.836
VGG19, pretrained, fine-tuned                                      | 0.762     | 0.846              | 0.830




C. Gender Prediction

The aim of the gender prediction task was to predict the intended gender of a product, with the classes MALE, FEMALE and UNISEX. The results are comparable to those of the product classification task in the sense that pre-trained and fine-tuned models provide the highest accuracies, with a best performing value of 88%.

VI. CONCLUSIONS AND FUTURE WORK

In this study we presented an empirical evaluation of different Convolutional Neural Network (CNN) architectures concerning their performance in different tasks in the domain of fashion image classification. The experiments indicated that despite the large amount and high quality of the provided fashion images, pre-trained and fine-tuned models outperform those which were trained on the given collections alone. Future work will concentrate on analyzing models at a scale of two million images.

REFERENCES

[1] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan, "Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3330-3337.
[2] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan, "Style finder: Fine-grained clothing style detection and retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 8-13.
[3] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie, "Learning visual clothing style with heterogeneous dyadic co-occurrences," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4642-4650.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
[5] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, vol. 2, 1989, pp. 598-605.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[9] P. Jain, B. Kulis, J. V. Davis, and I. S. Dhillon, "Metric and kernel learning using a linear transformation," Journal of Machine Learning Research, vol. 13, no. Mar, pp. 519-547, 2012.
[10] K.-H. Liu, T.-Y. Chen, and C.-S. Chen, "MVC: A dataset for view-invariant clothing retrieval and attribute prediction," in Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, ser. ICMR '16. New York, NY, USA: ACM, 2016, pp. 313-316. [Online]. Available: http://doi.acm.org/10.1145/2911996.2912058
[11] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "Deepfashion: Powering robust clothes recognition and retrieval with rich annotations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), 2015.
[13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.



