=Paper=
{{Paper
|id=Vol-2844/fashion2
|storemode=property
|title=Towards Fashion Image Annotation: A Clothing Category Recognition Procedure
|pdfUrl=https://ceur-ws.org/Vol-2844/fashion2.pdf
|volume=Vol-2844
|authors=Tryfon-Rigas Tzikas,Alexandros-Charalampos Kyprianidis,Maria Kotouza,Sotirios-Filippos Tsarouchis,Antonios Chrysopoulos,Pericles Mitkas
|dblpUrl=https://dblp.org/rec/conf/setn/TzikasKKTCM20
}}
==Towards Fashion Image Annotation: A Clothing Category Recognition Procedure==
Tryfon-Rigas Tzikas (tzikasta@ece.auth.gr), Alexandros-Charalampos Kyprianidis (alexandros.kyprianidis@issel.ee.auth.gr), Maria Kotouza (maria.kotouza@issel.ee.auth.gr), Sotirios-Filippos Tsarouchis (sotiris.tsarouchis@issel.ee.auth.gr), Antonios Chrysopoulos (achryso@issel.ee.auth.gr), Pericles Mitkas (mitkas@auth.gr)

All authors: Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece

AI4FASHION2020, September 02–04, 2020, Athens, Greece. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
In the contemporary clothing industry, design, development and procurement teams are constantly asked to present more products with fewer resources in a shorter time. Thus, clothing companies that aim to remain competitive in today’s market have to deploy new Artificial Intelligence techniques aiming at the automation of their traditional procedures. In this direction, the presented approach utilizes a deep learning model to accurately classify fashion images. The predictions are intended to be used in a personalized recommendation system that acts as an assistant for fashion designers. Two well-established architectures are studied, VGG and ResNet, as well as a variation of ResNet. The realized experiments include: (a) architecture comparison, (b) hyperparameter tuning and classification, and (c) transfer learning. Two fashion datasets are used for model training and classification: DeepFashion (for training the model from scratch) and iMaterialist (used to evaluate the transferability of the produced model). The results show that the first set of experiments achieved 80.5% accuracy, whereas the pre-trained model used on the second dataset led to a 60% decrease in training time, while attaining satisfying results.

CCS Concepts: • Computing methodologies → Object recognition; Supervised learning by classification; Neural networks; • Applied computing → Consumer products.

Keywords: object classification, fashion clothing images, fine-tuning, convolutional neural networks

===1 Introduction===
Fashion clothing is one of the oldest industries, occupying one of the highest market shares. In this age of fast fashion, trends change in a highly frequent manner, making it an appropriate field for applying optimization techniques to efficiently extract valuable information from the huge amount of generated data. To this end, contemporary clothing brands tend to introduce Artificial Intelligence (AI) techniques, aiming to improve the processes of the supply chain, while keeping up to date with the newest fashion trends. Fashion houses such as Hugo Boss<sup>1</sup> and Tommy Hilfiger<sup>2</sup> have already developed AI-driven tools to improve the design process, whereas Prada<sup>3</sup> uses AI to deliver high-quality content faster.

The development of such tools was not feasible before the evolution of Deep Learning and Computer Vision: image recognition, detection, segmentation and generation, as well as 3D reconstruction, are some of the techniques that are being used in the development of fashion-related solutions. The emergence of an abundance of related projects is justified by the rapid growth in these scientific fields.

In this paper, Deep Learning algorithms for clothing category classification are evaluated. Two datasets are used as inputs, DeepFashion and iMaterialist, while data augmentation techniques are applied on them. The first one is used to train the model from scratch, while the second one is used to evaluate the transferability of the produced model. The models that were used during the experiments are VGG16, ResNet50 and a variation of ResNet50.

The proposed solution is part of the Data Annotation module introduced in our previous work [11], where an AI-enabled system aimed at improving the clothing design process was proposed. Specifically, the aforementioned system is responsible for retrieving, organizing and combining data from many different sources, while taking into account the designers’ preferences, in order to suggest clothing products of interest and help fashion designers with the decision-making process.

The rest of the paper is organized as follows. Section 2 lists related works. Section 3 introduces the methodology. Section 4 presents the experimental setup, datasets and results. Section 5 contains the conclusion and future work.

<sup>1</sup> https://www.hugoboss.com/fashionstories/digitalisation-is-and-remains-a-big-trend-which-has-already-been-embraced-by-hugo-boss/fs-story-1e6xd6hk2kr8e.html

<sup>2</sup> https://www.ibm.com/blogs/think/2018/01/tommyhilfiger-ai/

<sup>3</sup> https://www.pradagroup.com/en/news-media/news-section/prada-group-expands-collaboration-with-adobe.html

===2 Related Work===
Several research works have been realized in the field of AI-enabled Fashion applications. Many works have tried to divide the AI applications in the fashion industry into four categories [7]: (a) apparel design, (b) manufacturing, (c) retailing, (d) supply chain management. In [13] a comprehensive review of AI systems in apparel supply chains is presented, while in [5] an empirical review of existing apparel recommendation systems is conducted.

Fashion image analysis has emerged as a challenging task. The majority of the approaches that have been used over time can be described as follows: (a) traditional feature learning methods based on manually created features which are then processed by machine learning algorithms [15], (b) Deep Learning algorithms based on deep neural networks and especially convolutional neural networks. In most cases, the models that have been developed achieve high results concerning image classification and recognition [12] [3] [9].

In the area of fashion image classification, Hidayati et al. [9] proposed a classification technique that recognizes clothing genres based on visually differentiable style elements. Additionally, Cychnerski et al. [2] presented a set of experiments in order to evaluate ResNet and SqueezeNet.

Many datasets have been introduced as test-beds to apply various AI techniques in the field of fashion. DeepFashion [12] is composed of 800,000 images which are richly annotated with attributes, clothing landmarks and correspondence of images taken under different scenarios. DeepFashion2 [3] is an improved version of DeepFashion, with enriched annotations; style, scale, viewpoint, occlusion, bounding box and dense landmarks were added.

===3 Methodology===
The clothing category classification, as well as the fine-tuning of an existing model to another dataset, are challenging tasks. In Figure 1, the proposed approach is described, divided into three steps. As a first step, three different deep learning architectures are tested: (a) VGG16, (b) ResNet50 and (c) a variation of ResNet50 (ResNet50v2), by using the DeepFashion dataset, after applying image pre-processing techniques. In the next step, the architecture with the highest accuracy is selected and a grid search is performed over the image augmentation parameters and the model’s training hyperparameters. In the last step, the fine-tuned model is used on the iMaterialist dataset, to evaluate the transferability of the produced model.

Figure 1. Overview of the proposed methodology

====3.1 Image Pre-processing====
The efficiency of the model is heavily dependent on the input dataset that is used during the training process. Taking this into consideration, the images need to be cropped, using the bounding boxes provided with the dataset, to exclude unrelated objects as well as background noise, in order to restrain the model from capturing irrelevant information. Moreover, in a multi-class classification problem, each image corresponds to one label, thus it was necessary to avoid having multiple clothes in a single image, as this can mislead the training process and affect its performance in a negative manner.

In order to achieve higher performance and reduce overfitting, Data Augmentation techniques are applied on the available training set, in the following order: 1) rotation, 2) shearing, 3) horizontal flip and 4) zoom in or out; each one of them is tuned experimentally. Starting with the first technique, a range of low values was tested and the optimal values were kept in the end.

====3.2 Clothes Recognition with ResNet====
There are many state-of-the-art solutions in the literature related to image recognition using Deep Learning techniques. Architectures like VGG [14] and ResNet [10] have proved to be ideal for recognizing clothing categories from fashion images [1] [2]. More specifically, VGG16 and ResNet50 are commonly used in this field.

In this work, experimentation with VGG16 and ResNet50 was realized. Additionally, a variation of ResNet50 was investigated, which is characterized by an architecture with the following modification in the skip connection: the batch normalization and the ReLU function take place before the convolutional layer [2]. This variation of ResNet50 was chosen as the one with the best performance amongst other variation attempts on the input dataset.

====3.3 Hyperparameter Tuning====
Hyperparameter tuning is a crucial task towards achieving the optimal performance in Deep Learning modelling. In this process, a set of optimizers was investigated in order to find the appropriate one for the problem at hand. More specifically, the optimizers examined are Adam, Adadelta, Adamax, Adagrad and SGD.

Weight initialization of a Deep Learning network strongly affects the performance of the model, since problems like vanishing and exploding gradients are tackled by using the correct initializer. The following initializers were used in the experiments: (a) Random Normal, (b) He Normal [8], (c) Glorot Normal [4], (d) Zeros, (e) He Uniform [8], and (f) Glorot Uniform [4].

In addition, regularization restricts the exponential growth of the model’s weights and prevents the model from overfitting. The techniques employed in the proposed approach are a combination of regularizers and weight decay. Both these parameters are investigated with regard to the learning rate, as they are correlated with it. The regularizers examined are as follows: (a) L1, (b) L2, (c) L1 & L2, while the weight decay values are: (a) 0.98, (b) 0.95, (c) 0.75.

====3.4 Transfer Learning====
After the completion of the first set of experiments, focused on the multi-class classification problem of clothing categories, we proceed with the examination of the second set, which deals with the evaluation of the performance of an already trained model on another dataset, making use of transfer learning techniques. The evaluation of the model on a second dataset can be broken down into two cases: (a) evaluating the pre-trained model without further training, and (b) using the pre-trained model as a starting point to re-train either the whole model, or only specific layers. The whole idea is based on the similarity between the two fashion datasets and on the fact that they share common low-level features, which are also captured by the weights of the bottom layers of the model. The main hypothesis is that this should improve the model’s performance, as it can achieve comparable results in significantly less time.

===4 Experiments===
This section contains the experimental process on the problem of multi-class clothing category classification and the evaluation of the produced models’ performance. The section is composed of three sets of experiments, as follows: (1) architecture comparison, (2) hyperparameter tuning and classification, and (3) transfer learning.

====4.1 Datasets====
Two datasets were used for the training and evaluation of the models, DeepFashion and iMaterialist. The DeepFashion dataset [12] consists of 800,000 images characterized by many features and labels. The iMaterialist dataset [6] consists of 1,000,000 images and contains 8 groups of 228 fine-grained attributes. The imbalanced distribution of the classes in each dataset was balanced by randomly choosing 5,000 images for every clothing category, using 50,000 images in total. They were split into training, validation and test sets with ratios of 0.7, 0.15 and 0.15, respectively.

====4.2 Experimental Setup====
Input images were scaled down to 224x224 RGB images and classified into 10 classes: coat and jacket, dress, top, shorts, trousers, skirt, leggings and jeggings, outfit, special occasion and suits. The models were trained on an Nvidia Tesla K40c GPU with 32GB memory RAM, utilizing an Intel Xeon E5-2630 processor. The batch size that was used during training is 32 and the initial learning rate was set according to the Keras default values for each optimizer (0.01 for SGD and 0.001 for the rest of them).

====4.3 Results====
=====4.3.1 Architecture Comparison=====
The architectures tested for the classification of the provided clothing categories are the following: VGG16, ResNet50 and a variation of ResNet50 (ResNet50v2) [2]. They were all tested using the same values for each hyperparameter, based on the configuration in Table 1. Moreover, Table 1 makes clear that ResNet50v2 outperforms the rest of the models, achieving 74% accuracy; thus it is selected to be used for the rest of the experiments.

The performance of the models was measured with the following evaluation metrics: accuracy, precision, recall and F1 score.

{| class="wikitable"
|+ Table 1. Model initialization parameters and architecture comparison (initialization parameters)
! Parameters !! Values
|-
| Optimizer || Adam
|-
| Initializer || Glorot Uniform
|-
| Learning Rate || 0.01
|-
| Weight Decay || 0.9
|-
| Regularizer || L1
|-
| Image Augmentation || None
|}

{| class="wikitable"
|+ Table 1. Model initialization parameters and architecture comparison (architecture comparison)
! Model !! Accuracy
|-
| VGG16 || 67%
|-
| ResNet50 || 70%
|-
| ResNet50v2 || 74%
|}

=====4.3.2 Classification Results=====
Towards the improvement of the produced model’s performance, many experiments were conducted in order to find the best configuration of the available hyperparameters. During this process, a grid search was performed for the image augmentation parameters, as well as for the model’s training hyperparameters, in order to boost the accuracy of the model. The order in which the experiments were performed is as follows: (a) Image augmentation, (b) Initializer, (c) Optimizer, (d) Learning rate and Regularizer, (e) Learning rate and weight decay. In the following experiments the default parameters are used for the initial configuration, as mentioned in Table 1. The order in which each parameter’s experiments are conducted is important, as with the completion of each one, the optimal value of the corresponding parameter is extracted and is used in the configuration of the following experiments.

The results of the image augmentation experiments are presented in Table 2. The optimal values for each technique are the following: (a) Rotation: 10, (b) Shear: 0.2, (c) Zoom: 0.05 and (d) Horizontal flip: True. The optimal values led the produced model not only to achieve better performance, but to avoid overfitting as well. It is clear that the model performs better when the image augmentation process causes moderate changes in the datasets.

{| class="wikitable"
|+ Table 2. Image augmentation experiments
! Technique !! Value !! Accuracy
|-
| Rotation || 0 || 71.0%
|-
| Rotation || 10 || 73.0%
|-
| Rotation || 30 || 67.0%
|-
| Rotation || 90 || 52.0%
|-
| Shear || 0 || 73.0%
|-
| Shear || 0.05 || 76.0%
|-
| Shear || 0.1 || 71.0%
|-
| Shear || 0.2 || 77.0%
|-
| Zoom || 0 || 77.0%
|-
| Zoom || 0.05 || 77.2%
|-
| Zoom || 0.1 || 76.0%
|-
| Zoom || 0.2 || 74.0%
|-
| Horizontal Flip || True || 77.5%
|-
| Horizontal Flip || False || 77.2%
|}

In Table 3, the results of the various initializers and optimizers are presented. In the first case Glorot Normal achieved the best results, while Zeros provided the worst, as expected. As far as the optimizers are concerned, they all achieved similar results, except for SGD. The reason behind this is that SGD demands additional fine-tuning to determine the appropriate hyperparameters, in contrast with the rest of the optimizers, which are adaptive gradient methods. Among the optimizers, Adadelta achieved the highest accuracy.

{| class="wikitable"
|+ Table 3. Initializer and optimizer experiments
! Parameter !! Value !! Accuracy
|-
| Initializer || Random Normal || 78.3%
|-
| Initializer || He Normal || 77.8%
|-
| Initializer || Glorot Normal || 78.8%
|-
| Initializer || Zeros || 10.0%
|-
| Initializer || He Uniform || 77.6%
|-
| Initializer || Glorot Uniform || 77.5%
|-
| Optimizer || Adam || 78.0%
|-
| Optimizer || Adagrad || 77.8%
|-
| Optimizer || Adadelta || 80.0%
|-
| Optimizer || SGD || 71.0%
|-
| Optimizer || Adamax || 79.0%
|}

The results of the experiments conducted in order to determine the weight decay and regularizer are presented in Table 4. Their optimal values are strongly dependent on the learning rate parameter. For this reason each parameter is tested with respect to different values of the learning rate. The best values of the three parameters, taken in pairs, are as follows: (a) learning rate: 1, regularizer: L2, (b) learning rate: 1, weight decay: 0.75.

{| class="wikitable"
|+ Table 4. Learning rate, weight decay and regularizer experiments (accuracy)
! rowspan="2" | Learning rate !! colspan="3" | Regularizer !! colspan="3" | Weight Decay
|-
! L1 !! L2 !! L1 & L2 !! 0.98 !! 0.95 !! 0.75
|-
| 0.01 || 67% || 63% || 68% || 67% || 63% || 60%
|-
| 0.1 || 65% || 78% || 78% || 76% || 78% || 76%
|-
| 1 || 73% || 80% || 75% || 80% || 80% || 81%
|}

{| class="wikitable"
|+ Table 5. Model optimization parameters
! Group !! Parameter !! Value
|-
| Image Augmentation || Rotation || 10
|-
| Image Augmentation || Shear || 0.2
|-
| Image Augmentation || Zoom || 0.05
|-
| Image Augmentation || Horizontal Flip || True
|-
| Parameters || Optimizer || Adadelta
|-
| Parameters || Initializer || Glorot Normal
|-
| Parameters || Learning Rate || 1
|-
| Parameters || Weight Decay || 0.75
|-
| Parameters || Regularizer || L2
|}

The final model, trained using the optimal parameters presented in Table 5, achieved 80.5% accuracy. Figure 2 is the confusion matrix of the model for each class. The diagonal of the matrix presents the true positive value per class. The classes Skirt, Trousers, Dress and Shorts are classified better than the rest, while many samples of Outfit and Suits are misclassified as Coat and Dress respectively, since there is a strong resemblance between the images of these classes.

Figure 2. ResNet50v2 evaluated in DeepFashion

=====4.3.3 Transfer Learning Results=====
In this section, the performance of the Deep Learning model produced from the first set of experiments is evaluated on the iMaterialist dataset, which was not used previously. The datasets have many visual features in common, as they are both used for classifying fashion clothing images into categories. Therefore, it is assumed that the pre-trained model can be used as a baseline, upon which we can apply a set of slight weight adjustments through fine-tuning to improve its performance, while using a low value for the training learning rate. In order to have comparable results, the same hyperparameters were maintained, and the evaluation results of the pre-trained model on iMaterialist were used as the benchmark in the fine-tuning experiments.

Table 6 contains the comparison of the fine-tuning experiments against the results achieved by the pre-trained model, which is the benchmark and has not undergone any further training. The difference between the experiments lies in which layers of the model are trained each time.

{| class="wikitable"
|+ Table 6. Transfer Learning Experiments
! Experiments !! Precision !! Recall !! F1 Score !! Accuracy
|-
| Benchmark (No training) || 40.7% || 37.8% || 37.3% || 38.0%
|-
| Last layer || 46.3% || 42.4% || 42.1% || 42.0%
|-
| Whole model || 65.2% || 64.6% || 64.7% || 62.5%
|-
| Without pre-trained weights || 65.1% || 64.9% || 64.8% || 65.0%
|}

For the first step of the Transfer Learning process, the pre-trained model was applied on the input dataset as is, without changing any of the pre-defined hyperparameters. The results were very poor, since the model achieved a mere 38% accuracy, indicating that the two datasets contain different content and cannot be processed by the produced model without additional training.

In the second step of the experimental process, all the layers of the model were frozen, except for the last one, in order to keep the learned features intact and modify only the weights of the classifier, which constitutes the last layer of the model. The results show a slight improvement over the benchmark on each evaluation metric.

To further improve the model’s performance on the new dataset, the whole model was unfrozen, which led to significantly better results. The model achieved 62.5% accuracy, about 20 percentage points better than the previous best performance, revealing that even though the datasets share common features, as they both contain fashion clothing images, their inputs also differ. To highlight this last point, the confusion matrix of the last experiment is presented in Figure 3. The classes Shorts, Trousers and Coat are classified with greater confidence, while Leggings are misclassified as Trousers, and Dress as Skirts and vice versa. The Dress and Skirt confusion may derive from either annotation faults or the fact that these two classes share many visual characteristics, as a long skirt can easily be misjudged as a dress.

Figure 3. Retrained ResNet50 on iMaterialist based on the pretrained weights

Lastly, the model was trained from scratch, without using any weights originating from the pre-trained model. The model achieved 65% accuracy, surpassing the previous results. The result is justified, as the estimated hyperparameters are more suitable for whole model training, while fine-tuning needs a lower learning rate to slightly adjust the weights. Comparing the performance of the model trained from scratch and the model trained using the pre-trained weights, the second one achieved 2.5 percentage points lower accuracy. However, this is compensated by the time the model needed to complete its training, since it was 60% faster than the first one (8 hours and 20 hours respectively), saving a significant amount of computation time.

===5 Conclusion and Future Work===
In this work, a classification model capable of recognizing 10 different categories of clothing images was presented. The process followed for analyzing the Deep Learning architectures of VGG, ResNet and a variation of ResNet was described in detail, as well as the techniques performed to find the optimal model and boost its performance.

DeepFashion was used for model training, while iMaterialist was used for evaluating the transferability of the produced model. The work was mainly focused on hyperparameter tuning, which is a necessary but time-consuming process for achieving the highest accuracy. The produced model achieved 80.5% accuracy on DeepFashion, while the fine-tuning of the pre-trained model on iMaterialist led to a 62.5% accuracy with a 60% reduction in training time, compared to the corresponding model trained from scratch.

Future work involves the improvement of the input datasets by manually refining their misplaced labels, which can be identified using already trained models, and even their enhancement with more samples, in order for the produced model to provide more robust results. Moreover, a wider set of experiments can be conducted in order to improve the performance of the model, such as further investigation on selecting a proper model architecture, detailed tuning of the hyperparameters in the pre-trained model’s fine-tuning process and testing other training techniques in the fine-tuning process.

===Acknowledgments===
This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE (project code: T1EDK-03464)

===References===
[1] Kuan-Ting Chen and Jiebo Luo. 2016. When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing Features. CoRR abs/1611.03915 (2016). arXiv:1611.03915 http://arxiv.org/abs/1611.03915

[2] J. Cychnerski, A. Brzeski, A. Boguszewski, M. Marmolowski, and M. Trojanowicz. 2017. Clothes detection and classification using convolutional neural networks. In 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA). 1–8.

[3] Yuying Ge, Ruimao Zhang, Lingyun Wu, Xiaogang Wang, Xiaoou Tang, and Ping Luo. 2019. DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images. CoRR abs/1901.07973 (2019). arXiv:1901.07973 http://arxiv.org/abs/1901.07973

[4] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 9), Yee Whye Teh and Mike Titterington (Eds.). PMLR, Chia Laguna Resort, Sardinia, Italy, 249–256. http://proceedings.mlr.press/v9/glorot10a.html

[5] Congying Guan, Sheng-feng Qin, Wessie Ling, and Guofu Ding. 2016. Apparel recommendation system evolution: an empirical review. International Journal of Clothing Science and Technology 28 (11 2016), 854–879. https://doi.org/10.1108/IJCST-09-2015-0100

[6] Sheng Guo, Weilin Huang, Xiao Zhang, Prasanna Srikhanta, Yin Cui, Yuan Li, Matthew R. Scott, Hartwig Adam, and Serge J. Belongie. 2019. The iMaterialist Fashion Attribute Dataset. CoRR abs/1906.05750 (2019). arXiv:1906.05750 http://arxiv.org/abs/1906.05750

[7] Z.X. Guo, W. Wong, SYS Leung, and Min Li. 2011. Applications of artificial intelligence in the apparel industry: A review. Textile Research Journal 81 (11 2011), 1871–1892. https://doi.org/10.1177/0040517511411968

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision. 1026–1034.

[9] Shintami Chusnul Hidayati, Chuang-Wen You, Wen-Huang Cheng, and Kai-Lung Hua. 2018. Learning and Recognition of Clothing Genres From Full-Body Images. IEEE Transactions on Cybernetics 48 (2018), 1647–1659.

[10] Riaz Ullah Khan, Xiaosong Zhang, Rajesh Kumar, and Emelia Opoku Aboagye. 2018. Evaluating the Performance of ResNet Model Based on Image Recognition. In Proceedings of the 2018 International Conference on Computing and Artificial Intelligence (Chengdu, China) (ICCAI 2018). Association for Computing Machinery, New York, NY, USA, 86–90. https://doi.org/10.1145/3194452.3194461

[11] Maria Th Kotouza, Sotirios-Filippos Tsarouchis, Alexandros-Charalampos Kyprianidis, Antonios C Chrysopoulos, and Pericles A Mitkas. 2020. Towards Fashion Recommendation: An AI System for Clothing Data Retrieval and Analysis. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, 433–444.

[12] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] E.W.T. Ngai, S. Peng, Paul Alexander, and Karen Moon. 2014. Decision support and intelligent systems in the textile and apparel supply chain: An academic review of research articles. Expert Systems with Applications: An International Journal 41 (01 2014), 81–91. https://doi.org/10.1016/j.eswa.2013.07.013

[14] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV]

[15] S. Vittayakorn, K. Yamaguchi, A. C. Berg, and T. L. Berg. 2015. Runway to Realway: Visual Analysis of Fashion. In 2015 IEEE Winter Conference on Applications of Computer Vision. 951–958.
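
As an illustration of the dataset preparation described in Section 4.1 (5,000 randomly chosen images per clothing category, then a 0.7 / 0.15 / 0.15 split), the following is a minimal Python sketch. The function name `balanced_split`, the `(image_id, label)` input format and the fixed random seed are assumptions made for this example; the paper does not state how the sampling was implemented or whether the split was stratified, so a plain shuffled split is used here.

```python
import random
from collections import defaultdict


def balanced_split(samples, per_class=5000, ratios=(0.7, 0.15, 0.15), seed=0):
    """Draw `per_class` random items per label, then split them three ways.

    `samples` is an iterable of (image_id, label) pairs. Returns three lists
    of (image_id, label) pairs: training, validation and test.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for image_id, label in samples:
        by_label[label].append(image_id)

    chosen = []
    for label in sorted(by_label):
        ids = by_label[label]
        if len(ids) < per_class:
            raise ValueError(f"class {label!r} has only {len(ids)} images")
        # Random undersampling: keep exactly `per_class` images of this class.
        chosen.extend((image_id, label) for image_id in rng.sample(ids, per_class))

    # Shuffle across classes before cutting (assumption: non-stratified split).
    rng.shuffle(chosen)
    n_train = round(ratios[0] * len(chosen))
    n_val = round(ratios[1] * len(chosen))
    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test
```

With 10 classes and the default arguments this selects 50,000 images and yields 35,000 training, 7,500 validation and 7,500 test items.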
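
The tuning order described in Section 4.3.2, where each parameter is tuned in turn and its best value is carried into the following experiments, can be sketched roughly as below. This is not the authors' code: `sequential_search` and `lookup_score` are hypothetical names, and `lookup_score` is only a stand-in that reads the accuracies reported in Tables 2 and 3 instead of training a network. It also treats each parameter's accuracy as independent of the others, which is a simplification. The learning rate, regularizer and weight decay stages are omitted.

```python
# Accuracies (%) copied from Tables 2 and 3; used here only as a lookup.
REPORTED = {
    "rotation": {0: 71.0, 10: 73.0, 30: 67.0, 90: 52.0},
    "shear": {0: 73.0, 0.05: 76.0, 0.1: 71.0, 0.2: 77.0},
    "horizontal_flip": {True: 77.5, False: 77.2},
    "zoom": {0: 77.0, 0.05: 77.2, 0.1: 76.0, 0.2: 74.0},
    "initializer": {
        "random_normal": 78.3, "he_normal": 77.8, "glorot_normal": 78.8,
        "zeros": 10.0, "he_uniform": 77.6, "glorot_uniform": 77.5,
    },
    "optimizer": {
        "adam": 78.0, "adagrad": 77.8, "adadelta": 80.0,
        "sgd": 71.0, "adamax": 79.0,
    },
}


def lookup_score(config):
    """Hypothetical stand-in for 'train the model and return its accuracy'."""
    return sum(REPORTED[name][value] for name, value in config.items()) / len(config)


def sequential_search(order, defaults, evaluate):
    """Tune one parameter at a time, keeping each winner for later stages."""
    config = dict(defaults)
    for name, candidates in order:
        config[name] = max(candidates, key=lambda value: evaluate({**config, name: value}))
    return config


# Starting point follows Table 1: no augmentation, Glorot Uniform, Adam.
defaults = {
    "rotation": 0, "shear": 0, "horizontal_flip": False, "zoom": 0,
    "initializer": "glorot_uniform", "optimizer": "adam",
}
order = [(name, list(values)) for name, values in REPORTED.items()]
best = sequential_search(order, defaults, lookup_score)
print(best)
```

A search of this kind evaluates one candidate list per stage rather than the full Cartesian product of all values, which is why the order of the stages matters.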
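
The evaluation metrics named in Section 4.3.1 (accuracy, precision, recall and F1 score) can all be derived from a confusion matrix such as those in Figures 2 and 3. The sketch below assumes macro averaging, that is, an unweighted mean of the per-class values; the paper does not state which averaging was used, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(confusion):
    """Accuracy and macro-averaged precision, recall and F1.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))

    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }
```

Because the classes were balanced to the same number of images, macro-averaged recall coincides with accuracy whenever every class also has the same number of test samples.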