Creating catalogues of clothes images using neural
networks
Anna V. Korobko1, Aleksei A. Korobko2 and Aleksei V. Markovin3
1 Reshetnev Siberian State University of Science and Technology, Krasnoyarsk, Russian Federation
2 Institute of Computational Modeling, SB RAS, Krasnoyarsk, Russian Federation
3 "Osnova" Ltd., Krasnoyarsk, Russian Federation


                                         Abstract
                                         A lot of businessmen, companies and brands have created their accounts on Instagram using this social
                                         network as a platform for the promotion and sales of their goods, work and services. Despite all the
                                         possibilities which Instagram holds, the promotion of goods under the conditions of market saturation
                                         is a complicated task. It is becoming urgent to search for new technological solutions which would
                                         provide collecting images of goods from Instagram business accounts and aggregating this information
                                         in one integrated on-line marketplace, taking into account the requirements formed by the sales of
                                         clothes in e-commerce. The present study is a practical test of an approach to the automatic
                                         cataloguing of goods from their images with the help of neural networks, as part of collecting
                                         and aggregating information about goods from Instagram business accounts in an integrated on-line
                                         marketplace. The experience of applying neural network models is studied for the fashion industry in
                                         general and, in particular, for the cataloguing of clothes images. The applied methods and approaches of
                                         building convolutional neural networks are described and substantiated. The architecture of two neural
                                         network models for determining the colour and category of clothes from their images is described in
                                         detail. The accuracy of the model as well as the losses during learning and testing is estimated. The
                                         accuracy of the models is compared with the accuracy of a random classification. Testing the basic
                                         configurations allows one to determine the directions for future research, to formulate forthcoming
                                         scientific and technical problems and to form reference values of the classification accuracy for estimating
                                         the efficiency of more complex models.

                                         Keywords
                                         processing, cataloguing, neural networks, fashion, Instagram


1. Introduction
With the development of information technologies, the Internet is becoming an important part of
the life of many people. It offers unlimited access to knowledge, the possibility of earning money
remotely, communication with friends all around the world, and a marketplace. Instagram
is a social network based on a relatively new way of communication, i.e. uploading images
and short videos. Instagram, which started with one million users in 2010, now has
more than 500 million active users who view the content uploaded to the network every day.

SibDATA 2021: The 2nd Siberian Scientific Workshop on Data Analysis Technologies with Applications 2021, June 25,
2021, Krasnoyarsk, Russia
Email: gglhroom@gmail.com (A. V. Korobko)
ORCID: 0000-0001-5337-3247 (A. V. Korobko)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Many businessmen, companies and brands have created their accounts on Instagram using this
network as a platform for the promotion and sales of their goods, work and services. This is
an attractive marketplace due to numerous active users, unique possibilities for advertising,
integration with Facebook and convenient business analytics tools. However, despite all
the possibilities Instagram offers, promoting goods in a saturated market
is a difficult task. Most accounts that represent small and medium-sized businesses
remain unnoticed and have difficulty attracting new users. There is an urgent need for new technological
solutions that collect images of goods from Instagram business
accounts and aggregate this information in an integrated marketplace, taking
into account the requirements that have formed around the sale of clothes in e-commerce.
   The market of e-commerce has been developing at a high rate since 2013. At the end of 2019
in Russia there were about 4 700 clothes and footwear shops on the Internet with the level of
sales being no lower than one delivery order per day; more than 100 shops are among the top
1000 leaders of the Russian e-commerce market. Customers already have a great experience of
buying in e-commerce, and the requirements to presenting goods on the site of an internet shop
have been formed. One can hardly imagine a marketplace or an internet clothes shop without a
catalogue of goods. The most necessary attributes of goods which form the structure of the
whole product line are “Colour” and “Category”.
   The present study is a practical test of an approach to the automatic cataloguing
of goods from their images with the help of neural networks, as part of collecting and
aggregating information about goods from Instagram business accounts in an integrated on-line
marketplace. The experience of applying neural network models is studied for the fashion industry in
general and, in particular, for cataloguing clothes images. The applied methods and approaches
of building convolutional neural networks are described and substantiated. The architecture
of two neural network models for determining the colour and category of clothes from their
images is described in detail. The accuracy of the model as well as the losses during learning
and testing is estimated. The accuracy of the models is compared with the accuracy of a random
classification. Testing the basic configurations allows one to indicate the directions for future
research, to formulate forthcoming scientific and technical problems and to form reference
values of the classification accuracy for estimating the efficiency of more complex models.


2. Review of the existing solutions
At present, the most effective technique for classifying images is the artificial neural network. A
neural network is a simple mathematical model which is not programmed explicitly but learns from examples [1, 2, 3].
The model analyses a great number of examples related to the problem being solved and finds
in them statistical regularities, which are used to form rules for automatically solving the stated
problem. This approach is radically different from earlier artificial intelligence algorithms,
which required human knowledge to be formulated in advance. A neural network
automates both the process of problem-solving and the process of acquiring the knowledge
which is necessary for it.
   The accumulated experience of using neural networks for solving various problems of
decision-making support has produced a great number of parameters for tuning the model, each with
its own range of possible values. These parameters include the architecture of the neural network, the number and
type of layers, the number of neurons in each layer, the set of input and target values, the activation
functions, the optimization algorithms and functions, the learning style and the number of iterations. With
this range of possible “tunings” of the neural network, constructing a model for
a specific problem becomes a time-consuming procedure.
   English-language publications on the problems of the fashion industry make use
of clothes image analysis technologies, including neural networks. The study [4]
addresses the problem of predicting the colour of clothes which will be fashionable in the next season.
In [Wang et al., 2018] a hybrid intelligent model for medium-term forecasts of the
volume of sales in retail fashion trade is built. In [5] an approach is proposed for
determining the clothes style which will become a new fashion trend, based on the analysis
of images in social networks.
   The problem of cataloguing clothes considered in this study is most often formulated
in the literature as the classification of images by clothes type. The study [6] solves one of the
tasks of computer vision, namely detecting clothes items in images. The items considered
include hats, glasses, bags, trousers, shoes, etc. The suggested approach
applies a sequence of modern methods: proposing candidate regions that may contain the object and
using a deep convolutional neural network. Since the location of items correlates
strongly with the location of human joints, the authors take into account information
about the pose of the person in the image in order to increase the efficiency of the algorithm. A
qualitative and quantitative analysis of several convolutional network models based on the
70 000 clothes images of the Fashion-MNIST dataset was made in [7]. An approach that uses the
semantic segmentation of clothes images as a preliminary stage of recognizing the clothes
category is described in [8]. The suggested model architecture includes a basic network which
extracts the features and a feature pyramid network which
groups the feature values.
   The review of the scientific literature and existing solutions shows that neural networks are
intensively used for predicting fashion trends and for the automatic cataloguing of clothes images.
The proposed solutions differ depending on the research goal and on the list of clothes items and
accessories being considered. Within the present study, it is of interest to test basic
neural network models with convolutional layers on our own dataset obtained from business
accounts of the social network Instagram.


3. Methods and approaches
The general approach to solving the stated problem is based on the methodology of system
analysis, the object-oriented approach and database theory. The neural network
models are tested in Python, a modern high-level general-purpose programming language
oriented towards increasing the productivity of the developer and improving
the readability of the code. Due to its human-readable code and the great number of
ready-made libraries, Python is highly suitable for computational experiments, scientific
research and fast prototyping.
   The program environment for implementing the algorithms that identify the
colour of items from a fly page, as well as their distribution and cataloguing, is
Jupyter, a web platform for the interactive development of executable pages (notebooks),
which can simultaneously contain information modules structured using the markup language
Markdown, related fragments of program code (not only in Python) and the results of
executing this code. The flexibility of this development environment in combination with
the capabilities of Python allows one to solve research problems efficiently. The present
research uses the following libraries: numpy, which supports large multidimensional arrays and
matrices and provides a set of high-level mathematical functions; pandas, a library of fast functions for
analysing and manipulating data based on a relational representation of multi-index
datasets; shutil, a library of high-level operations on files and collections of files;
matplotlib, a comprehensive library for creating static, animated and interactive visualizations;
and keras, a framework for the deep learning of neural networks.
   Using the information collection algorithm and an assembly module that gathers images
from sources (user profiles), a database of 1253 images of
goods from more than 150 accounts was accumulated. The moderator divided the uploaded images
into 23 categories and into 19 groups according to the colour of the item offered. The
problem of cataloguing goods is reduced to the mathematical problem of single label multi-class
classification. This means that each of the considered objects (images) is assigned to one
and only one of several classes. In the case under consideration, the classes are the colour
groups and the categories of goods.
   The main stages of building the model for identifying the colour
of goods from the fly page are “Uploading the marked images”, “Preparing the data for building
the model”, “Building the model” and “Training the model”. The uploading stage involves
unpacking the files with the images, loading the dataset with the image descriptions, including
the colour, and testing the data integrity. The stage of preparing the data for building the model
includes forming the class list for training the model, forming a folder structure
for sorting the images, and distributing the images into folders. Along with the distribution of
the images of the same colour into folders, training, validation and test sets are formed
according to the requirements of the library for building and training classification models
(Keras). The construction of the model is the crucial stage of its creation. It is at this stage
that the model tuning parameters which influence its efficiency are determined. The stage
of training the model includes the compilation of the model, the initialization of a “generator” of
training and testing images, and the training itself. The compilation of the model implies calling
the function of the same name for the constructed model and setting its parameters:
the loss function, the optimizer and the metrics.
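   The data preparation stage described above can be illustrated with a minimal sketch that copies labelled images into one folder per subset and class. The function name split_into_folders, the subset names and the 60/20/20 split are assumptions made for illustration; the paper does not state the proportions actually used.

```python
import random
import shutil
from pathlib import Path


def split_into_folders(labelled_files, out_dir, fractions=(0.6, 0.2, 0.2), seed=0):
    """Copy (path, label) pairs into out_dir/<subset>/<label>/ folders.

    The split fractions are an illustrative assumption, not the paper's values.
    """
    subsets = ("train", "validation", "test")
    by_label = {}
    for path, label in labelled_files:
        by_label.setdefault(label, []).append(Path(path))

    rng = random.Random(seed)
    counts = {name: 0 for name in subsets}
    for label, paths in sorted(by_label.items()):
        paths = sorted(paths)
        rng.shuffle(paths)  # shuffle so that each subset is a random sample
        n_train = int(len(paths) * fractions[0])
        n_val = int(len(paths) * fractions[1])
        chunks = {
            "train": paths[:n_train],
            "validation": paths[n_train:n_train + n_val],
            "test": paths[n_train + n_val:],
        }
        for subset, chunk in chunks.items():
            target = Path(out_dir) / subset / label
            target.mkdir(parents=True, exist_ok=True)
            for src in chunk:
                shutil.copy(src, target / src.name)
            counts[subset] += len(chunk)
    return counts
```

   Splitting each class separately keeps the class proportions roughly the same in the training, validation and test folders.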
   The loss function is also called the objective function, since it determines how the
neural network being trained estimates how close the obtained result is to the expected one. The
loss function receives the prediction given by the network and the true value (which the network
should have returned) and calculates an estimate of the distance between them, reflecting how well
the network handled this particular example. The efficiency of neural networks lies in
applying the objective function to tune the values of the neuron weights in order to reduce
the loss for each image involved in training the model. The tuning itself is performed by
an optimizer, which relies on the error backpropagation algorithm, the
central algorithm of deep learning. When training the model, an image generator is used, which is
an object responsible for the order in which images are selected from the training and test sets
and passed to the model. When training a neural network, it is common practice to downscale
images in order to reduce the computational load; the target_size parameter
determines the size to which the generator should resize each image. Neural network training is
performed in batches (subsets of the training set), with the batch_size parameter defining the
batch size. The classification type is specified by the class_mode parameter.
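   To make the notion of a loss function concrete, the following sketch computes the categorical cross entropy for a single image. The probability vectors are made-up values chosen for illustration, and the helper name categorical_cross_entropy is hypothetical; Keras computes the same quantity internally, averaged over a batch.

```python
import math


def categorical_cross_entropy(predicted, target_index, eps=1e-12):
    """Loss for one image: minus the log of the probability of the true class."""
    # eps guards against log(0) when the network assigns zero probability.
    return -math.log(max(predicted[target_index], eps))


# Made-up class probabilities over three classes; the true class is index 0.
confident = [0.9, 0.05, 0.05]
unsure = [0.4, 0.3, 0.3]

loss_confident = categorical_cross_entropy(confident, 0)
loss_unsure = categorical_cross_entropy(unsure, 0)
```

   The loss is small when the network puts most of the probability on the true class and grows as that probability falls, which is the "distance" the optimizer tries to reduce.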
   The accumulated database of images and their markup allows one to proceed to the development
of algorithms for identifying the colour of goods and cataloguing them, so that newly uploaded
items are divided into groups and categories automatically (without the participation of the
moderator). The construction of convolutional networks and the assessment of their accuracy can
be considered the first step in solving the problem of image recognition. Even a very
modest result can be interpreted as positive and confirm the possibility of using the proposed
approach, subject to further research for estimating the effectiveness of various combinations
of the properties of the neural network. Testing of basic configurations will allow determining
directions for further research and formulating forthcoming scientific and technical tasks, and
obtaining reference values of the classification accuracy to assess the effectiveness of more
complex models.


4. Building, training and testing the model for identifying the
   colour from the image
The list of colours is formed by analysing the number of images of each colour and
discarding the colours with too few images. To build and train the model for identifying
the colour of a product, a threshold was set on the number of images of
the same colour, namely 80 images. The threshold yielded a list of 7 popular
colours: [’Beige’, ’White’, ’Blue’, ’Brown’, ’Pink’, ’Gray’, ’Black’]. The formation of a file structure
for sorting images implies the creation of a set of folders whose names correspond to the
target classes of the model, i.e. to the colour of the item in the image.
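   The threshold-based selection of classes can be sketched as follows. The helper select_classes and the label counts are hypothetical; only the threshold of 80 images comes from the text.

```python
from collections import Counter


def select_classes(labels, min_images=80):
    """Return the classes that have at least `min_images` labelled images."""
    counts = Counter(labels)
    return sorted(name for name, n in counts.items() if n >= min_images)


# Hypothetical label column, not the authors' data.
labels = ["Black"] * 120 + ["White"] * 95 + ["Green"] * 12
kept = select_classes(labels)
```

   Here 'Green' falls below the threshold and is discarded, so only the well-populated colours remain as target classes.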
   To identify the colour of a product from the image, Model 1 was built. It has a sequential
structure and consists of 11 layers. The layers are described in Table 1.
   The model is based on convolutional and compression (pooling) layers, which
identify individual image features and record these features in the resulting (output)
tensor. The number of detected features coincides with the number of neurons (filters) in the
layer. The convolutional layers almost preserve the size of the input image (a 3x3 kernel
without padding trims one pixel from each border) and significantly expand the feature space.
Compression layers reduce the size of the image by a factor equal to the subset size, preserving
the set of features identified in the previous
layer. At the output of layer 8, we get a 7x7 image, where each pixel has 128 features. The ninth layer
"unfolds" the three-dimensional tensor into a one-dimensional one, forming a vector of 6272
values at the output. The last two layers reduce the feature space in 2 stages to
the required 7 classes corresponding to the colours of the items selected for training.
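   The tensor sizes quoted for Model 1 can be checked with simple arithmetic. The sketch below assumes unpadded 3x3 convolutions with stride 1 and non-overlapping 2x2 pooling, which is what the shapes in Table 1 imply; the helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output side of an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Output side of non-overlapping max pooling (integer division)."""
    return size // window


side = 150
shapes = []
for filters in (32, 64, 128, 128):
    side = conv_out(side)
    shapes.append((side, side, filters))  # after the convolutional layer
    side = pool_out(side)
    shapes.append((side, side, filters))  # after the compression layer

flattened = side * side * 128  # length of the vector produced by layer 9
```

   The last pooled shape is (7, 7, 128), and flattening it gives the 6272 values mentioned above.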
   For all but the last layer, the ’relu’ activation function was selected. Neurons with this
activation function are called ReLU (rectified linear unit). The function is given by the
formula f(x) = max(0, x) and implements a simple threshold transition at zero. On the last layer,
the ’softmax’ activation function is selected, corresponding to the task of single label multi-class
classification. In total, the model under consideration includes 3 456 199 trainable parameters.
To compile Model 1, we set the loss function to categorical cross entropy, the optimizer
to RMSProp (root mean square propagation), a gradient descent algorithm with an adaptive learning rate,
and the metric to Accuracy, which is the proportion of correct answers of the algorithm. For Model 1,
the image generators of the training set and the verification set should produce 150x150 images in
batches of 20 images with categorical class labels. The model is trained
on the previously created generator, with 17 steps per epoch, 30 epochs and
6 validation (verification) steps. After 30 epochs of training Model 1, its
accuracy at the training stage was 1, and the value of the loss function was close to 0.

Table 1
The properties of the layers of Model 1
  Number of   Type of the layer                           Number of   Activation   Tensor form
  the layer                                               neurons     function     at the input
  1           Conv2D, convolutional, kernel (3, 3)        32          ’relu’       (150, 150, 3)
  2           MaxPooling2D, compression, subset (2, 2)    32          -            (148, 148, 32)
  3           Conv2D, convolutional, kernel (3, 3)        64          ’relu’       (74, 74, 32)
  4           MaxPooling2D, compression, subset (2, 2)    64          -            (72, 72, 64)
  5           Conv2D, convolutional, kernel (3, 3)        128         ’relu’       (36, 36, 64)
  6           MaxPooling2D, compression, subset (2, 2)    128         -            (34, 34, 128)
  7           Conv2D, convolutional, kernel (3, 3)        128         ’relu’       (17, 17, 128)
  8           MaxPooling2D, compression, subset (2, 2)    128         -            (15, 15, 128)
  9           Flatten                                     6272        -            (7, 7, 128)
  10          Dense                                       512         ’relu’       (6272)
  11          Dense                                       7           ’softmax’    (512)
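   A minimal Keras sketch of the layer stack of Model 1 is given below. It is a reconstruction from the layer description, not the authors' code: padding and strides are left at the Keras defaults, an assumption that reproduces the tensor shapes of Table 1 and the reported total of 3 456 199 parameters.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Layer stack reconstructed from the description of Model 1 (assumed defaults:
# unpadded convolutions, stride 1, non-overlapping 2x2 pooling).
model = keras.Sequential([
    keras.Input(shape=(150, 150, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(7, activation="softmax"),  # one output per colour class
])

# Loss, optimizer and metric as named in the text.
model.compile(
    loss="categorical_crossentropy",
    optimizer="rmsprop",
    metrics=["accuracy"],
)
n_params = model.count_params()
```

   With image generators available (for example, hypothetical objects train_generator and validation_generator), training with the settings from the text would correspond to a call such as model.fit(train_generator, steps_per_epoch=17, epochs=30, validation_data=validation_generator, validation_steps=6).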


Figure 1: The change in the accuracy during the training (.) and the test (-) of Model 1
Table 2
The configuration of Model 2
  Number of   Type of the layer                           Number of   Activation   Tensor form
  the layer                                               neurons     function     at the input
  1           Conv2D, convolutional, kernel (3, 3)        32          ’relu’       (150, 150, 3)
  2           MaxPooling2D, compression, subset (2, 2)    32          -            (148, 148, 32)
  3           Conv2D, convolutional, kernel (3, 3)        64          ’relu’       (74, 74, 32)
  4           MaxPooling2D, compression, subset (2, 2)    64          -            (72, 72, 64)
  5           Conv2D, convolutional, kernel (3, 3)        128         ’relu’       (36, 36, 64)
  6           MaxPooling2D, compression, subset (2, 2)    128         -            (34, 34, 128)
  7           Flatten                                     36992       -            (17, 17, 128)
  8           Dropout, 0.5                                -           -            -
  9           Dense                                       512         ’relu’       (36992)
  10          Dense                                       6           ’softmax’    (512)


   The graph of the change in accuracy at the training stage and at the test stage (Figure 1)
indicates that after the 8th epoch an overfitting effect appeared; by the same
epoch an acceptable level of accuracy had already been achieved, namely 0.4286, with the
value of the loss at the training stage being 0.8321. For the algorithm identifying the
colour of the item from the image, the accuracy of the random classification was 0.1339.


5. Building, training and testing of the model for determining
   the clothes category of the item from the image
The problem of developing an algorithm for the distribution and cataloguing of goods, like
the problem of identifying the colour, can be formulated as a problem of single label multiclass
classification.
   After unpacking, uploading and checking the data integrity, we analysed how many image
files each category contained. Six categories of clothing containing more than 100 photographs
were identified: [’Outerwear’, ’Suits, outfit’, ’Dresses’, ’Sweatshirts,
Sweaters, Jumpers’, ’Bags’, ’Decorations’]. To train the neural network, it is important that the
classes be balanced, i.e. that each category contain approximately the same number of
images, and that there be enough data to fully "configure" the network. In our case, rather few images
were accumulated, which allowed us only to investigate the fundamental possibility of applying the
proposed approach, but not to obtain a ready-made solution with a sufficient level of accuracy.
   For the distribution and cataloguing of goods, Model 2 was built. Model 2 has a sequential
structure and consists of 10 layers. The layers are described in Table 2.
   A Dropout layer is added to Model 2; it zeroes out a share of the extracted features (50% in
our case) during training, which helps to avoid overfitting the neural network. In total, the
considered model includes 19,036,742 trainable parameters.
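   The parameter total of Model 2 can be reproduced by hand. The sketch below assumes 3x3 kernels with one bias per filter and fully connected layers with one bias per unit; compression, flattening and Dropout layers contribute no trainable parameters. The helper names are hypothetical.

```python
def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter for a square convolution kernel."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units


flat = 17 * 17 * 128  # length of the flattened feature vector
total = (
    conv_params(3, 32)
    + conv_params(32, 64)
    + conv_params(64, 128)
    + dense_params(flat, 512)
    + dense_params(512, 6)
)
```

   Almost all of the parameters sit in the first fully connected layer, which maps the 36992 flattened features to 512 neurons.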
   For Model 2, an infinite training set image generator is created. The generator can generate
an almost infinite number of images based on the original set by changing the parameters:
pixel value scaling (rescale), image rotation (rotation_range), width shift (width_shift_range), height shift
(height_shift_range), shear in the counterclockwise direction (shear_range), zoom (zoom_range) and
horizontal mirroring (horizontal_flip). The parameter values set the range in which the generator
chooses random values for a particular distortion and applies them to real images. Thus, the
training set is expanded practically without limit.

Figure 2: The change in the accuracy during the training (.) and the test (-) of Model 2

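   The idea of an endless augmenting generator can be shown on a toy example in plain Python. This is only an illustration of the principle with two distortions (a random shift and a random mirror) applied to a list-of-rows image; it is not the Keras implementation, and all function names are hypothetical.

```python
import itertools
import random


def flip_horizontal(image):
    """Mirror a list-of-rows image left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift every row to the right, padding the left edge with `fill`."""
    return [([fill] * pixels + row)[:len(row)] for row in image]


def endless_augmented(images, max_shift=1, seed=0):
    """Yield randomly distorted copies of `images` without ever stopping."""
    rng = random.Random(seed)
    for image in itertools.cycle(images):
        out = shift_right(image, rng.randint(0, max_shift))
        if rng.random() < 0.5:  # mirror about half of the images
            out = flip_horizontal(out)
        yield out


tiny = [[1, 2, 3], [4, 5, 6]]
# Take five distorted copies from a generator that never runs out.
batch = list(itertools.islice(endless_augmented([tiny]), 5))
```

   Because the distortion parameters are drawn anew for each image, a single original can yield many different training examples.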
   For Model 2, the following parameter values are defined: rotation_range = 40, width_shift_range
= .2, height_shift_range = .2, shear_range = .2, zoom_range = .2, horizontal_flip = True. Model 2
is trained on the previously created generator with 100 steps per epoch,
30 epochs and 6 validation (verification) steps. After 30
epochs of training Model 2, its accuracy at the training stage was equal to 0.7115, and
the value of the loss function was 0.8184. The graphs of the accuracy at the training
stage and at the verification stage (Figure 2) diverge at the 5th epoch of the neural network
training. At the 4th epoch of training, the accuracy at the verification stage was 0.2583,
with the value of the loss at the training stage being 1.6360. The maximum local
accuracy at the verification stage was equal to 0.3750 at epoch 17, but the loss at the training
stage at this epoch reached 3.0128.
   Even small values of the model accuracy parameter at the verification stage can be interpreted
as success if they exceed the accuracy parameters of the so-called "random model". Using the
random number generator, one of the target classes is randomly selected for each image, and then
the accuracy of this random prediction is estimated by comparison with the actual classes. For
the model of cataloguing images by clothing categories, the accuracy of the random classification
was 0.1642.
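   The "random model" baseline can be sketched in a few lines. The label list below is a hypothetical balanced set, not the authors' test data, and the helper name random_baseline_accuracy is an assumption.

```python
import random


def random_baseline_accuracy(true_labels, classes, seed=0):
    """Accuracy of guessing a class uniformly at random for every image."""
    rng = random.Random(seed)
    hits = sum(rng.choice(classes) == label for label in true_labels)
    return hits / len(true_labels)


classes = ["Beige", "White", "Blue", "Brown", "Pink", "Gray", "Black"]
# Hypothetical balanced label list, not the authors' test set.
true_labels = classes * 1000
accuracy = random_baseline_accuracy(true_labels, classes)
```

   For uniform guessing among k classes the expected accuracy is 1/k, i.e. about 0.143 for 7 colours and about 0.167 for 6 categories, which is in line with the values of 0.1339 and 0.1642 reported for the two tasks.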


6. Conclusion
The accumulated database of images and their markup will allow moving on to the development
of intelligent algorithms for identifying the colour of goods from fly pages and for their distribution
and cataloguing, so that newly uploaded goods are divided into groups and categories automatically
(without the participation of the moderator). To solve the stated problems of single label
multiclass classification, deep learning neural networks with convolutional layers, a training set
generator that distorts the original images, and additional training of a previously trained
image classification neural network were created and tested. The proposed models show a
classification accuracy higher than that of the "random classifier", which can be interpreted
as evidence for the viability of the selected classification tool, provided that the research
continues. The results obtained represent a good theoretical and technological groundwork
for continuing research in the chosen direction. The developed models of
intelligent classification can be improved by solving such problems as expanding the actual
database of images, searching for and testing an ensemble of models that provide preliminary
marking or segmentation of images, and building hybrid models for analysing both images and
metadata and product description text. The development of an information and analytical
system for collecting and presenting information about goods from various sources posted
on social networks implies expanding the set of product filters and developing additional
services to support the search for and purchase of goods.


References
[1] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings
    of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[2] J. J. Hopfield, Neural networks and physical systems with emergent collective computational
    abilities, Proceedings of the national academy of sciences 79 (1982) 2554–2558.
[3] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization
    in the brain, Psychological review 65 (1958) 386.
[4] J. Lin, P. Sun, J.-R. Chen, L. Wang, H. Kuo, W. Kuo, Applying gray model to predicting trend
    of textile fashion colors, The Journal of The Textile Institute 101 (2010) 360–368.
[5] A. Alamsyah, M. A. A. Saputra, R. A. Masrury, Object detection using convolutional neural
    network to identify popular fashion product, in: Journal of Physics: Conference Series,
    volume 1192, IOP Publishing, 2019, p. 012040.
[6] K. Hara, V. Jagadeesh, R. Piramuthu, Fashion apparel detection: the role of deep convolu-
    tional neural network and pose-dependent priors, in: 2016 IEEE Winter Conference on
    Applications of Computer Vision (WACV), IEEE, 2016, pp. 1–9.
[7] K. Meshkini, J. Platos, H. Ghassemain, An analysis of convolutional neural network for
    fashion images classification (fashion-mnist), in: International Conference on Intelligent
    Information Technologies for Industry, Springer, 2019, pp. 85–95.
[8] J. Martinsson, O. Mogren, Semantic segmentation of fashion images using feature pyramid
    networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision
    Workshops, 2019, pp. 0–0.