Organizational and Legal Aspects of Managing the
      Process of Recognition of Objects in the Image

          Nataliya Boyko[0000-0002-6962-9363], Lesia Mochurad[0000-0002-4957-1512],

         Iryna Andrusiak[0000-0001-6887-0510], Yurii Drevnytskyi[0000-0001-6481-380X]

              Lviv Polytechnic National University, Lviv79013, Ukraine
        nataliya.i.boyko@lpnu.ua, lesia.i.mochurad@lpnu.ua,
             airyna2016@gmail.com, yuriytom81@gmail.com


       Abstract. The issue of object recognition using ANN models is considered. The
       object of training is the YOLO approach for recognizing objects in an image.
       The subject of training is Keras, TensorFlow, and the ability to create and ex-
       plore ANN models using them. The purpose of the work is to write a program
       that recognizes certain objects in the images and learn how it works according
       to the YOLO approach. The analysis of the YOLO approach of object recogni-
       tion on images is carried out. An example of its use for recognizing objects on a
       student card: a barcode, a seal and a signature is given. The high-level Keras
       API and the TensorFlow library were used to build the ANN architecture to
       build and work with the computation graph. An analysis of LeNet-5, AlexNet,
       GoogleNet architectures was performed, while building ANN's own architec-
       ture and analysis of the YOLO approach for object recognition, a program for
       object recognition in an image was written using Python, TensorFlow, Keras,
       and TensorBoard for visualization of training and architecture artificial neural
       network. YOLO's approach to image recognition is explored. I have better stud-
       ied TensorFlow, Keras for constructing and exploring ANN and TensorBoard
       models to visualize the training process of ANN, the graph of calculations.
       Gained practical skills in writing ANN, and their practical application. Has
       deepened his knowledge in the field of machine learning. The hardest part of
       the network was learning to recognize object sizes.

       Keywords: object recognition, ANN, YOLO, TensorFlow, Keras.


1 Introduction

There are many problems that have different arithmetic saturation. Some of them are
easier to tell to computers: performing arithmetic operations, while others can be
solved mentally by language recognition, image analysis, object classification and
more. The advantage of solving tasks using computers is that they can perform known
arithmetic and trigonometric operations sequentially and without fail. In addition to
the usual processing of algebraic tasks, object recognition tasks are added. Known
object recognition algorithms for an image usually have two parts: localization - de-
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons Li-
cense Attribution 4.0 International (CC BY 4.0). CybHyg-2019: International Workshop on
Cyber Hygiene, Kyiv, Ukraine, November 30, 2019.
termining the location of the object in the image, and classification - determining
what it is for the object, ie to which class it belongs. This paper describes how to use a
CNN convolutional neural network and the YOLO algorithm to identify individual
objects in a static image. To analyze the task at hand, it is first scanned for computer
vision.


2 Setting the Task

The task is to explore the YOLO approach for recognizing objects in images and writ-
ing implementation using API TensorFlow and Keras in Python. The main task is to
train the model from scratch, using the YOLO approach to identify individual objects,
for example: a barcode, a seal and a signature on a student card. Achieve at least 80%
average accuracy on test data. To solve it, you need to solve the following tasks:
     explore the YOLO approach;
     learn the basics of Keras, TensorFlow;
     write your own implementation of the YOLO approach from scratch and the
         ANN model architecture;
     create a dataset of different student card photos;
     mark the dataset: barcode, seal and signature on the images;
     write an algorithm for automatic generation of augmented data in YOLO for-
         mat;
     train the model to an accuracy of at least 80% on the test data.


3 Methods of Solving

Conditionally, the recognition algorithm can be divided into 2 components:
1) Localization – determine the coordinates and dimensions of the object:
    Input is an image, first there are important features of the image, then there is a
function of dependence between the image features and the coordinates of the center,
the height and width of the object.
2) Classification – the definite on to which class an object be longs:
Input is a localized object, first the features of the object are found, then there is a
function of dependency between the features of the object and the class to which it
belongs.
    We chose the YOLO (you only look once) architecture because it combines 2
steps of recognition – localization and classification. Due to the fact that all recogni-
tion is performed by one net work, it can be optimized specifically for recognition
efficiency: “A single neural network predicts bounding boxes and class probabilities
directly from full image simonies valuation. Since the whole detection pipe line is a
single network, it can be optimized end-to-end directly on detection performance” [8-
13].
    This approach optimizes the speed of the algorithm, because the attributes of ob-
ject classes are determined by one network with the attributes of its location and size:
    “Our unified architecture is extremely fast. Our base YOLO model processes im-
ages in real-time at 45 frames per second. A smaller version of the network, Fast
YOLO, processes an astounding 155 frames per second while still achieving double
the mAP of other real-time detectors” [1-4].
    “Finally, YOLO learns very general representations of objects. It outer - forms all
other detection methods, including DPM and R-CNN, by a wide margin when gener-
alizing from natural images to artwork on both the Picasso Dataset and the People-Art
Dataset” [6].
    We changed some parts of the algorithm to accomplish this task: since there is on-
ly one copy of each student card per class, there is no need to use non max suppres-
sion to determine one final prediction among others. It is enough to take 1 bbox for
each class with the maximum level of confidence for that class.
    Our task is easier than recognizing any objects - the objects are in static positions
relative to each other, only the position of the student card changes, 1 image of the
student card has only 1 copy of each class. So we decided to use a model with far
fewer parameters than the YOLO model. In addition, it is my task to train the network
from scratch, and training such a large architecture as the YOLO model is a very
complex process that requires a lot of data, time and computing resources [10-14].


Fig. 1. ANN architecture

First two digits - kernel size, third digit –amount of filters, s - strides (horizontal, ver-
tical).
    Typically, three modes are used for ANN research:
     Training - to increase the accuracy of object recognition.
     Predictions - for object recognition.
     Calculation - to calculate certain functions that evaluate the operation of the
         algorithm.
This is the input to the algorithm for further processing. Input data vary depending on
the mode of operation. Training: The training input has two components (Features,
Labels).
    Features are the images that you want to recognize. Used to calculate network
output - location, size, and feature class predictions.
    Filed in [N x W x H x channels], where: N is the number of images; W, H is the
length and height of the image in pixels; channels - number of image channels (3 for
RGB images) [18-21].
    Labels are the true coordinates of the object center, their size, and the class index-
es to which they belong. They are used by the error, accuracy, and other functions that
are intended to evaluate the operation of the algorithm by determining the difference
between the correct data and those predicted by the algorithm.
    According to the YOLO approach, the input image is divided by a grid into a grid
of [S x S] cells, each cell having the same size, and is responsible for recognizing a
specific area of the input image:


Fig. 2. Divide the input image into cells

Each cell consists of [B] bounding boxes, to locate the object and [C] the probabilities
of the object belonging to a particular class Pr (Classi | Object) for its classification (C
is the number of classes). Only one object in a class can recognize a cell - the proba-
bility of which among the C probabilities of all classes is the highest. Bouldering box
is a vector of format numbers (x, y, w, h, confidence) where: (x, y) - the coordinates
of the bbox center are calculated as the offset from the coordinates of the lower left
corner of a particular cell of the image, so they take values between 0 and 1; (w, h)
are the length and width of the bbox, normalized to the size of the input image, so
they take values between 0 and 1.


Fig. 3. Bbox one of the cells of the image (bbox border is red, cell grid is yellow)
Confidence - the level of confidence that the bbox really recognized the true object.
    The confidence value reflects how confidently you can say that a given bbox has
an object, and how true the output of that bbox is. The goal of dream faience is to
show how true the results of the prediction are, with no ready answers like in training
mode. If there are no objects in this bbox, confidence = 0, if any, confidence =
IoU(intersection over union) between this bbox (predicted) and true bbox. confidence
= probability (obj) * iou (pred, label) - formal definition of confidence [4-6].
    Labels are fed to the algorithm's input in format [N x Sw x Sh x (B*5+C)] , where:
N - number of images; Sw - the number of cells in the width of the grid; Sh - the
number of cells in the height of the grid; B - the number of bboxes per cell; C - num-
ber of object classes.
    To identify the barcode, stamp and signature on the student card I used: Sw = Sh
=10, B = 1, C = 3. Foresight: features. Calculations: features, labels.
    Example of normalization of coordinates and size of bbox (Fig. 4):


Fig. 4. Provided by the bbox object in the image (bbox borders are marked in red)

Coordinates of two predicted bboxes for 1 object (barcode) in an image size
128x128px (img_w, img_h = 128px, S=4):
    [x, y, w, h]p1 = [47, 38, 66, 30] (px)
    [x, y, w, h]p2 = [51, 39, 60, 29] (px)
Each cell size [cell_w x cell_h]: you should adjust the size of the image so that
(img_w, img_h) is divided exactly by the number of cells (S) so that the cells cover
the entire image.
    cell_w = img_w//S – weight of cell
    cell_h = img_h/S – high of cell
    cell_w = 128/4 = 32(px)
    cell_h = 128/4 = 32(px)
Coordinate normalization:
    Xnorm = (x – ((x// cell_w) * cell_w)) / cell_w
    Xp1_norm = (47 – ((47 // 32) * 32)) / 32 = (47-32) / 32 = 15 / 32 = 0.46875
    Xp2_norm = (51 – ((51 // 32) * 32)) / 32 = (51-32) / 32 = 19 / 32 = 0.59375
    Ynorm = (y – (y// S) * cell_h) / cell_h
    Yp1_norm = (38 – ((38 // 32) * 32)) / 32 = (38-32) / 32 = 6 / 32 = 0.1875
    Yp2_norm = (39 – ((39 // 32) * 32)) / 32 = (39-32) / 32 = 7 / 32 = 0.21875
Size normalization:
    wnorm = w / img_w
    wp1_norm = 66 / 128 = 0.515625
    wp2_norm = 60 / 128 = 0.46875
    hnorm = h / img_h
    hp1_norm = 30/128 = 0.234375
    hp2_norm = 29/128 = 0.2265625
Normalized bboxes:
    [x, y, w, h]p1_norm = [0.46875, 0.1875, 0.515625, 0.234375]
    [x, y, w, h]p2_norm = [0.59375, 0.21875, 0.46875, 0.2265625]
This is the data that results from processing the input algorithm. The output also dif-
fers depending on the mode of operation. Training: no initial data
    Prediction: Output is the end product of the algorithm provided for each cell of an
input image of an object's location and its class. The output dimension has the same
format as the Labels input:

                             
                             NS
                                   w
                                       S
                                            h
                                                 (B * 5  C)                          (1)

Due to the fact that only one copy of each class is guaranteed to recognize objects on
one student card, there is no need to use non max suppression. For each asset class, a
bbox with a maximum confidence level greater than a certain confidence level of all
bboxes in that class is selected. A class is defined as the index of the maximum value
among the class predictions for a given grid cell, all the bboxes in the cell belong to
that class (Fig. 5).
    Calculation: Output is the value of certain metrics designed to evaluate the accura-
cy, object recognition error algorithm (IoU, max IoU, mean IoU, probability). Proba-
bility is a metric that combines the prediction of an object's coordinates with its class's
predictions. This can be done by the following formula:
                                                           ruth                 ruth
Propability= Pr (Classi | Object)* Pr (Object)* IoU tpred = Pr (Classi)* IoU tpred     (2)

Pr (Classi | Object) - the probability of a particular class for an object in bbox, provid-
ed that it has that object.
Pr (Object) - the probability that a particular bbox has an object.
IoU ruth
    tpred - the ratio of intersection to combining true bbox and predicted.
                  ruth
Pr (Object)* IoU tpred - calculated as the confidence of a particular bbox.
    This gives class-dependent confidence points for each bbox. Shows the full proba-
bility that an object of a particular class is in a particular bbox.
     IoU is a feature that reflects the intersection of true and predicted bbox to merge.
It is used to estimate the accuracy of the object algorithm localization. I use this fea-
ture because I need to evaluate the accuracy of object localization by an algorithm to
understand how well it works.


Fig. 5. Choosing the best predictions From left to right: all predictions, bbox with confidence>
     0.5, max bbox with confidence> 0.5


Fig. 6. Visualization of IoU function

IoU = I/U, where: I - is theater section of true bbox and predicted network; U - is the
combining true bbox and predicted network. “At training time we only want one
bounding box predictor to be responsible for each object. We assign one predictor to
be “responsible” for predicting an object based on which prediction has the highest
current IOU with the ground truth. This leads to specialization between the bounding
box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or
classes of object, improving overall recall”.
    Example of calculating IoU:
Fig. 7. Provided and true bbox facility Green rectangle - true, red - provided.

IOU calculations for one of the predicted bbox and true bbox:
[x, y, w, h]p1 = [47, 38, 66, 30](px) - The coordinates of a true rectangle
[x, y, w, h]t = [48, 38, 67, 34](px) - The coordinates of the predicted rectangle
x1 = x-w/2; y1 = y-h/2 - bottom left bbox point
x2 = x+w/2; y2 = y+h/2 – upper right bbox point
x1p1 = 47-66//2 = 14
y1p1 = 38-30//2 = 23
x2p1 = 47+66//2 = 80
y2p1 = 38+30//2 = 53
x1t = 48-67//2 = 15
y1t = 38-34//2 = 21
x2t = 48+67//2 = 81
y2t = 38+34//2 = 55
Intersection coordinates:
(x1i , y1i); (x2i , y2i) - lower left and upper right point of intersection rectangle
x1i = max(x1p1 , x1t)
y1i = max(y1p1 , y1t)
x2i = min(x2p1 , x2t)
y1i = min(y2p1 , y2t)
x1i , y1i = 15, 23
x2i , y2i = 80, 53
intersection area:
intersection = max(0, (x2i - x1i ))*max(0, (y2i - y1i))
union area:
union = Wp1 * hp1 + Wt* ht– intersection
IoU = intersection/union
intersection = max(0, (80- 15))*max(0, (53- 23)) = 65*30 = 1950
union = 66*30 + 67*34 - 1950 = 2308
IoU = 1950/2308 ~ 0.845
    Mean IoU is a value for understanding how well the network localizes objects for
the image package. It is calculated as the ratio of the sum to the number of function
values IoU among all provided bbox for all images in 1 mini-package. Only the IoU
values from the predicted bbox that are responsible for recognizing the object in the
cell and the cell responsible for recognizing the object for that image are taken into
account. This value is important because during training the algorithm is unstable and
the value of the IoU function may vary greatly for an individual object. Therefore, to
get a better idea of the accuracy of object recognition, IoU values are averaged over
the entire mini-package.
                                N    S   B
           Mean _ IoU  ( k ijobj * IoU truth
                                           pred ) / obj _ num                           (3)
                                 e   i   j

   N – the number of images in the mini-package;
   S – the length of the grid in the cells;
   B – number of bboxes;
   truth – true bbox;
   pred – prediction bbox;
    k ijobj - from item 1 of section "YOLO error function";
    obj_num - the number of true objects for all images in the mini-package.
MaxIoU – this value gives the numerical characteristic of localization accuracy for
the best recognition cases. It can be compared to mean IoU to understand how the
best IoU values for the best cases of localization differ from the average IoU values. It
is calculate das the maximum value of the IoU function among all bbox predictions
for mini-package images.

                         Max _ IoU  MaxN ( IoU truth
                                                pred )                                  (4)

    N – the number of images in the mini-package;
    truth – true bbox;
    pred– prediction bbox.
The YOLO error is a measure of how far network predictions differ from true values.
The error function enables the network to learn. That is, it directly influences the
learning process, thanks to which the optimization algorithm trains the network,
changing its parameters in the direction of reducing the error. Without it, the network
cannot learn to recognize objects, because without knowing what and how wrong the
network will not be able to fix it.
    Modified standard deviation between network prediction and true values is used to
train the network. It is easy to optimize, but it is not well suited to the main goal of the
network - maximize average accuracy because it equally evaluates localization and
classification errors, which can be very different. Also, in each image, many cells are
not responsible for object recognition. This causes the confidence equation for the
bbox of this cell to 0, often over saturating the cell gradients responsible for recogni-
tion. It can lead to instability of the network model, and cause early training discrep-
ancies, the model will find a poor local minimum.
    To prevent this, the error for bbox coordinates is increased, and for cells with no
objects the confidence error is reduced, 2 parameters are added for this: Lcoord = 5 and
Lnoobj = 0.5.
    The standard deviation also equates to errors in large and small bboxes. The error
should reflect that small deviations in large bboxes are less significant than in small
bboxes. For this, the square root is taken from the length and height of the bbox.
    “We optimize for sum-squared error in the output of our model. We use sum-
squared error because it is easy to optimize, however it does not perfectly align with
our goal of maximizing average precision. It weights localization error equally with
classification error which may not be ideal.
    Also, in every image many grid cells do not contain any object. This pushes the
“confidence” scores of those cells towards zero, often overpowering the gradient from
cells that do contain objects. This can lead to model instability, causing training to
diverge early on.
    To remedy this, we increase the loss from bounding box coordinate predictions
and decrease the loss from confidence predictions for boxes that don’t contain objects.
We use two parameters, λcoord and λnoobj to accomplish this. We setv λcoord =5 and λnoobj
= 0.5.
    Sum-squared error also equally weights errors in large boxes and small boxes. Our
error metric should reflect that small deviations in large boxes matter less than in
small boxes. To partially address this we predict the square root of the bounding box
width and height instead of the width and height directly”.
    In YOLO, the error function consists of 4 parts, since the network output consists
of several parts:
 Coordinates of the bbox center.
 Height and width bbox.
 Confidence for bbox.
 The likelihood that an object in a given cell belongs to a particular class if its cen-
   ter hits the cell.
Consider the individual parts of the error and an example of how to calculate it for my
task: Parameters for example (Fig. 8).
Fig. 8. Image for example of calculation of YOLO error green is indicated by true bbox, others
    include different classes

Image height and width = 160, 120 px respectively, S = 10, 3 classes of objects, con-
sider b = 2 to account for a case with multiple bboxes in one cell. Also, for simplicity,
let's consider only those cells that should have an object. (values are rounded for visu-
al clarity).
     True bboxes:
[ 0.375, 0.53125, 0.09167, 0.26875, 0.1, 1, 0, 0] - bbox1 bar code
[ 0.375, 0.53125, 0.09167, 0.26875, 0.935, 1, 0, 0] - bbox2 bar code
[ 0.0833, 0.25, 0.15, 0.1125, 0.95, 0, 1, 0] - bbox1 seal
[ 0.0833, 0.25, 0.15, 0.1125, 0.234, 0, 1, 0] - bbox2 seal
[ 0.5, 0.9375, 0.167, 0.075, 0, 0, 0, 1] - bbox1 signature
[ 0.5, 0.9375, 0.167, 0.075, 0.873, 0, 0, 1] - bbox2 signature
     Prediction bboxes:
[ 0.363, 0.39, 0.409, 0.0414, 0.064, 1.01, 0.00491, 0.023] - bbox1 bar code
[ 0.36, 0.552, 0.097, 0.267, 0.757, 1.01, 0.00491, 0.023] - bbox2 bar code
[ 0.11, 0.258, 0.15, 0.11, 0.75, 0.0094, 0.994, 0] - bbox1 seal
[ 0.18, 0.196, 0.089, 0.4, 0.248, 0.0094, 0.994, 0] - bbox2 seal
[ 0.156, 0.136, 0.0657, -0.046, 0.0102, -0.0183, -0.0095, 1.033] - bbox1signature
[ 0.513, 0.96, 0.176, 0.0814, 0.69, -0.0183, -0.0095, 1.033] - bbox2 signature
     The value of - 0.046 issued by the network as the height of bbox1 for the signature
is converted to 0 because the height cannot be negative.
     IoU: [0.1, 0.935, 0.95, 0.234, 0.0, 0.873] is the IoU value of true and predicted
bbox for each object (used as true confidence).
1) Error for coordinates
     This is the value that characterizes the difference between the coordinates of the
bbox center ( xˆ i , yˆ i ) predicted by the network, and true ( xi , y i ) . Its purpose is to
locate the object on the image by a network, since it requires the coordinates of its
center. It is calculated only for the bbox that are responsible for recognition (if their
iou is maximal among other bboxes in this grid cell):“It also only penalizes bounding
box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has
the highest IOU of any predictor in that grid cell)”.
                              S B obj
                                                      2
                    l coord *   k ij * ( x i  xˆ i )  ( y i  yˆ i )
                              i j
                                                                         2
                                                                                       (5)


    xi , y i - true coordinates of center bbox;
    xˆ i , yˆ i - predicted coordinates of center bbox;
    kijobj - coefficient equal to 1 or 0: l - when j - bbox is responsible for its recogni-
tion, ie its value for IoU is maximal among other j - bboxes in this i - cell. In addition,
the cell should be responsible for recognition (if it falls into the center of the object).
    0 - otherwise.
    lcoord = 5 - the magnitude of the increase in error for the coordinates and size
   loss_coords =
   5*(
   0 *...
   + 0*((0.375-0.363)2 + (0.53125-0.39)2)
   + 1*((0.375 - 0.36)2+(0.53125 - 0.552)2)
   +0*…
   + 1*((0.0833-0.11)2 + (0.25-0.258)2)
   + 0*((0.0833 - 0.18)2+(0.25 - 0.196)2)
   + 0*…
   + 0*((0.5-0.156)2 + (0.9375-0.136)2)
   + 1*((0.5 - 0.513)2+(0.9375 - 0.96)2)
   + 0*...
   ) ~ 0.01054
   2) Error for width and height
   This is the value that characterizes the difference between the size of a bbox, its
                                               ˆ i , hˆi ) , and the true ones ( wi , hi ) .
width and the length provided by the network ( w
Its purpose is to locate the object on the image by the network, since it requires its
length and width. Size and coordinate errors are only calculated for the bbox that are
responsible for the recognition.
                          S B obj
                                        
                l coord *   k ij * ( w i 
                          i j
                                                    wˆ i )
                                                             2
                                                                 ( h 
                                                                     i
                                                                          hˆ )
                                                                            i
                                                                               2
                                                                                       (6)


    ( wi , hi ) - true width and height of bbox;
    ( wˆ , hˆ ) - predicted width and height of bbox;
       i   i
    l coord - from first error;
    k ijobj - from first error.
     loss_size =
     5*(
     + 0*...
     + 0*((0.091671/2 - 0.4091/2)2 + (0.268751/2 - 0.04141/2)2)
     + 1*((0.091671/2-0.0971/2)2 + (0.268751/2- 0.2671/2)2)
     + 0*...
     + 1*((0.151/2 - 0.151/2)2 + (0.11251/2 - 0.111/2)2)
     + 0*((0.151/2-0.0891/2)2 + (0.11251/2- 0.41/2)2)
     + 0*...
     + 0*((0.1671/2 - 0.06571/2)2 + (0.0751/2 - 01/2)2)
     + 1*((0.1671/2- 0.1761/2)2 + (0.0751/2- 0.08141/2)2)
     + 0*...
     ) ~ 0.00171
The error should reflect smaller deviations in large bboxes less than in small ones, so
it takes root from height and width: “Our error metric should reflect that small devia-
tions in large boxes matter less than in small boxes”. This is well illustrated in the
example below:
     Height of the first and second, width of the third object reduced by 0.1 for true
bbox and those responsible for recognition.
     loss_size_smaller =
     5*(
     0*...
     + 0*((0.091671/2 - 0.4091/2)2 + (0.168751/2 - 0.04141/2)2)
     + 1*((0.091671/2-0.0971/2)2 + (0.168751/2- 0.1671/2)2)
     + 0*...
     + 1*((0.151/2 - 0.151/2)2 + (0.01251/2 - 0.011/2)2)
     + 0*((0.151/2-0.0891/2)2 + (0.01251/2- 0.41/2)2)
     + 0*...
     + 0*((0.0671/2 - 0.06571/2)2 + (0.0751/2 - 01/2)2)
     + 1*((0.0671/2- 0.0761/2)2 + (0.0751/2- 0.08141/2)2)
     + 0*...
     ) ~ 0.00317
3) Confidence level error:
     “Each grid cell predicts B bounding boxes and confidence scores for those boxes.
These confidence scores reflect how confident the model is that the box contains an
object and also how accurate it thinks the box is that it predicts”.
     “If no object exists in that cell, the confidence scores should be zero. Otherwise
we want the confidence score to equal the intersection over union (IoU) between the
predicted box and the ground truth”.
     Therefore, the error reflects the difference between the real confidence value (iou)
and the predicted network (c). The purpose of her finding is to understand how true
are the results of the prediction of not having ready answers as in training.
                S 2 B obj            2          S 2 B noobj              2
                  k ij * (iou  c )  l        k        * (iou  c )               (7)
                 i j           i   i      noobj* i j ij           i   i

    c - predicted confidence;
     i
    iou - IoU max(bbox) i – cells, because true confidence equals iou;
        i
      obj
    k ij - from first error;
      noobj                 obj
    k ij    - opposite to k ij ;

    l           = 0.5 - trust coefficient for bbox not responsible for object recognition.
        noobj
Their confidence for him should be 0. I think the coefficients for confidence errors are
smaller than the coefficients for sizes and coordinates, since this error depend son the
IOU value that changes dynamically in the learning process. So that the network does
not learn when iou values are small to predict only small values of confidence, and
when iou grows up sharply not to predict only high values of confidence the learning
factories reduced.
    loss_confidence = (
    0*...
    + 0*(0.1 - 0.064)2
    + 1*(0.935-0.757)2
    +0*...
    + 1*(0.95 - 0.75)2
    + 0*(0.234-0.248)2
    +0*...
    + 0*(0.0 - (-0.046))2
    + 1*(0.935-0.873)2
    + 0*...
    )+
    0.5*(
    1*...
    + 1*(0.1 - 0.064)2
    + 0*(0.935-0.757)2
    + 1*...
    + 0*(0.95 - 0.75)2
    + 1*(0.234-0.248)2
    + 1*...
    + 1*(0.0 - (-0.046))2
    + 0*(0.935-0.873)2
    + 1*...
    ) ~ 0.077 + 0.011 = 0.088
    4) Classification error:
    Object network classification error Required to improve the network's ability to
classify objects.

                            S 2 obj classes
                             k ij *  ( p ( c )  pˆ ( c )) 2                          (8)
                             i              i        i
                                       c

    p (c ) - is there an object class in this cell (1 or 0);
     i
    pˆ ( c ) - class prediction for a given cell;
      i
      obj
    kij - coefficient equal to 1 or 0:
    1 - when the center of the object enters the i - cell.
    0 - otherwise.
“Note that the loss function only penalizes classification error if an object is present in
that grid cell”
    loss_class= (
    0*...
    + (1 - 1.01)^2 + (0 - 0.00491)^2 + (0 - 0.023)^2
    + 0*...
    + (0 - 0.0094)^2 + (1 - 0.994)^2 + (0 - 0)^2
    + 0*...
    + (0 -(-0.0183))^2 + (0 - (-0.0095))^2 + (1 - 1.033)^2
    + 0*...
    ) ~ 0.00229
This is the value that characterizes the overall deviation of the present network per-
formance from the desired results. It has the effect of further optimizing the network.
The next step in the network operation is to calculate the dependence of the error
function on the network parameters, then the parameters change in the direction of
reducing the error.
    The final error is the sum of four errors: 1) coordinate errors, 2) width and height,
3) confidence level, 4) classification. For the task - localization of bar code on the
student card, classification is not required, so the classification error is not included in
the general error:
    Loss = 0.01054 + 0.00171 + 0.088 + 0.00229 ~ 0.10254
    Therefore, the final recognition error consists of the error of coordinate recogni-
tion, size recognition, prediction of how true this recognition is, and errors of object
class definition. The purpose of the training is to increase the overall accuracy of ob-
ject recognition by reducing the overall recognition error.


4 Experiments

To mark the data, we applied bbox boundary markers as rectangles of different colors
for each class, and saved the coordinates, sizes, classes of these rectangles as true
output for the Labels algorithm by pre-converting them to YOLO (not changes). I
used [16] to mark the data:


Fig. 9. Example of image markup

We trained a network with the following parameters: The total size of the training
data sets 80 images. During training, I used data augmentation – certain changes to
the input data in order to ensure that the algorithm works for noisy data, as well as to
better study the patterns in the data. For example, in order for an algorithm to be able
to recognize objects on a student card, if it sin the rotate deposition, pseudo random
rotations in the training data are used. For better recognition of images made at an
angle - slopes, images that are not in the center of the image - offsets, for images with
different levels of brightness - fluctuations in brightness and hsv - transformation.
After augmentation training set size - 720 images After each epoch, the initial 80
images are augmented up to 720 by pseudorandom translating, rotating, scaling,
shearing, hsv image transformations. For this I used api [16]. Batch size - 80 images,
the training era consists of 9 mini-packages (9 step / epoch).


5 Results

We managed to reach mean IoU of ~ 85% on test data over 400 training periods. At
the end of each training period, the values and metrics described above were calculat-
ed and stored to represent the state of the network and the training process as a whole.
For graphing I used [14]. Results for 400 training periods:
    The x-axis is the index of the era, the y-axis is the value of the function
Fig. 10. General error function

The graph shows that at the initial stage of training, the error dropped rapidly, gradu-
ally its rate of decline decreased. In my opinion, this is due to the fact that certain
features of the objects are easier to learn, others harder. It is also a result of the gradi-
ent damping phenomenon - in the first stages, the initial layers are rapidly training and
the distant layers are slower because they have smaller gradients. Then the distant
layers are slowly refinished, and the initial layers are adjusted, partly because the
error is constantly fluctuating, but as a result it is reduced.


Fig. 11. Coordinate recognition error function

The graph shows that the network quickly learned to recognize the coordinates of the
bbox center. I think this is because all objects look like simple geometric shapes, and
they are the same color. However, when resizing go rotating the image during aug-
mentation, the center of the object does not change.
Fig. 12. Size recognition error function

The graph shows that the size recognition error decreases a there slowly and is unsta-
ble compared too there parts of the YOLO error. I think this is due to the fact that as a
result of augmentation, the shape and size of the objects in the input data is constantly
changing, so the network needs more time to study these patterns.


Fig. 13. Trust recognition error function

The chart shows that the confidence level error from the outset was relative to other
parts of the YOLO error. I believe that this is due to several factors - it depends only
on one value (confidence), and the size error (w, h), coordinates (x, y) - two, and its
part is multiplied by a factor of 0.5. Also, since all recognized objects are similar to
simple geometric shapes in a network, it is easier to understand if an object is in a
particular cell.
Fig. 14. Class recognition error function

The graph shows that the confidence level error decreases rapidly at the beginning of
training. Further, the rate of decrease is slowed. I think this is due to the fact that the
position of different classes of objects does not change relative to each other (left
signature, then seal, then barcode) so it is easy to distinguish them by this feature.
And slowing down the error reduction rate is because when other parts of the YOLO
error are much larger, the optimization algorithm changes the network parameters
more strongly to reduce them.


Fig. 15. Max value of IoU

The graph shows that the maximum IoU value is growing faster and more than the
average IoU value. I think that because some objects area sire to localize than others,
an object whose harpies very similar to a simple geometric shape (rectangle) is easier
to localize than an object with a complex shape – because any complex shape can be
represented as a combination of simple forms. It takes a time for the network to learn
to do this.
Fig. 16. Average value of IoU

The graph shows that at the beginning of training, the mean value of IoU increased
rapidly, and gradually the rate of increase of iou decreased. In my opinion, this is due
to the fact that the localization error initially decreased significantly, and the average
IoU increased accordingly. Then the rate of decrease of the error gradually decreased,
respectively, and the rate of increase of the average IoU decreased. It can be conclud-
ed that the localization error and the mean IoU are inversely proportional.


6 Conclusion

We achieved the task (Achieve at least 80% average accuracy on the test data) using
the algorithm describe dab vein 4h 10min using the Intel (R) Core (TM) i5-8300H
CPU @ 2.30GHz 2.30 GHz to train the network. With the YOLO approach, you can
train you net work to recognize objects using a relatively small amount of computing
power within a reasonable time. The speed of training depends strongly on the objects
of recognition and image parameters: the complexity of the shape of the object, the
possibility of the presence of several objects of the same class in the image, the num-
ber of classes of objects, or easily distinguish them, camera angle, change of illumina-
tion, distance to the object , placing the object in the image. After all, training a net-
work for object recognition under different conditions requires a larger data set to
train and test, to train the network on a larger set takes longer. Also, a more complex
ANN model may be required to analyze more complex data. Therefore, such details
should be determined in the first phase of building the object recognition system. In
order to evaluate the network effectively, it is necessary to select the appropriate met-
rics (in this case IoU, yolo_loss). Selecting an error function by minimizing which
will maximize recognition accuracy. Combining localization and classification steps
enables the recognition of 1-step ANN calculations and simplifies the design of algo-
rithm input. At the same time, the recognition process is optimized, since the same
features of the object studied by the same network are used for localization and classi-
fication.
    Research materials: labeled data set, trained model, references to useful resources
- can be used for future research, and research in general - an example for future gen-
erations, which they can refine, refine. The trained model can be used to recognize
barcodes, prints, signatures on an image, and then transfer them to other data pro-
cessing algorithms. For example, a barcode can be transmitted to a barcode reader to
obtain student information, and a stamp and signature can be transmitted to document
validation algorithms.
    Thanks to the development of machine learning, we have received real-time object
recognition tools with fairly high precision. This facilitates process automation, as
work related to the analysis of visual images by humans can be partially translated to
a computer. Further research is promising, as it can increase the amount of computer-
translated work and thus free people from monotonous work. To do this, the accuracy
and speed of the algorithms must be at least as human as possible.


References
 1. Kang, H.-Y., Lim, B.-J., Li, K.-J.: P2P Spatial query processing by Delaunay triangulation,
     Lecture notes in computer science, vol. 3428, pp. 136–150, Springer/Heidelberg (2005)
 2. Boehm, C., Kailing, K., Kriegel, H., Kroeger, P.: Density connected clustering with local
    subspace preferences, IEEE Computer Society, Proc. of the 4th IEEE Intern. conf. on data
    mining, pp. 27–34, Los Alamitos (2004)
 3. Wang, Y., Wu, X.: Heterogeneous spatial data mining based on grid, Lecture notes in
    computer science, vol. 4683, pp. 503–510, B.: Springer/Heidelberg (2007)
 4. Kryvenchuk Y., Boyko N., Helzynskyy I., Helzhynska T., Danel R.: Synthesis control sys-
    tem physiological state of a soldier on the battlefield. CEUR. Vol. 2488. Lviv, Ukraine, p.
    297–306. (2019)
 5. Harel, D., Koren, Y.: Clustering spatial data using random walks, Proc. of the 7th ACM
    SIGKDD Intern. conf. on knowledge discovery and data mining, pp. 281–286, San Fran-
    cisco, California (2000)
 6. Gahegan, M.: On the application of inductive machine learning tools to geographical anal-
    ysis, Geographical Analysis, vol. 32, pp. 113–139 (2000)
 7. Boyko, N., Shakhovska, Kh., Mochurad, L., Campos, J.: Information System of Catering
    Selection by Using Clustering Analysis, Proceedings of the 1st International Workshop on
    Digital Content & Smart Multimedia (DCSMart 2019), рр. 94-106, Lviv, Ukraine (2019)
 8. Boyko, N., Komarnytska, H.,Kryvenchuk ,Yu., Malynovskyy, Yu.: Clustering Algorithms
    for Economic and Psychological Analysis of Human Behavior, Proceedings of the Interna-
    tional Workshop on Conflict Management in Global Information Networks (CMiGIN
    2019), рр. 614-626, Lviv, Ukraine (2019)
 9. Kryvenchuk Y., Vovk O., Chushak-Holoborodko A., Khavalko V., Danel R.: Research of
    servers and protocols as means of accumulation, processing and operational transmission
    of measured information. Advances in Intelligent Systems and Computing. Vol.1080.
    p.920-934. (2020)
10. Zhang, С., Murayama, Y.: Testing local spatial autocorrelation using, Intern. J. of Geogr.
    Inform. Science, vol. 14, pp. 681–692 (2000)
11. Estivill-Castro, V., Lee, I.: Amoeba: Hierarchical clustering based on spatial proximity us-
    ing Delaunay diagram, 9th Intern. Symp. on spatial data handling, pp. 26–41, Beijing,
    China (2000)
12. Turton, I., Openshaw, S., Brunsdon, C.: Testing spacetime and more complex hyperspace
    geographical analysis tools, Innovations in GIS 7, pp. 87–100, London: Taylor & Francis
    (2000)
13. Fedushko S., Syerov Yu., Skybinskyi O., Shakhovska N., Kunch Z. (2020) Efficiency of
    Using Utility for Username Verification in Online Community Management. Proceedings
    of the International Workshop on Conflict Management in Global Information Networks
    (CMiGIN 2019), Lviv, Ukraine, November 29, 2019. CEUR-WS.org, Vol-2588. pp. 265-
    275. http://ceur-ws.org/Vol-2588/paper22.pdf
14. Tung, A.K, Hou, J., Han, J.: Spatial clustering in the presence of obstacles, The 17th In-
    tern. conf. on data engineering (ICDE’01), pp. 359–367, Heidelberg (2001)
15. Veres, O., Shakhovska N.: Elements of the formal model big date, The 11th Intern. conf.
    Perspective Technologies and Methods in MEMS Design (MEMSTEH), pp. 81-83, Poly-
    ana (2015)
16. Agrawal, R., Gehrke, J., Gunopulos ,D., Raghavan, P.: Automatic sub-space clustering of
    high dimensional data, Data mining knowledge discovery, vol. 11(1), pp. 5–33 (2005)
17. Shakhovska, N., Boyko, N., Zasoba, Y., Benova, E.: Big data processing technologies in
    distributed information systems. Procedia Computer Science, 10th International
    conference on emerging ubiquitous systems and pervasive networks (EUSPN-2019), 9th
    International conference on current and future trends of information and communication
    technologies in healthcare (ICTH-2019), Vol. 160, pp. 561–566, Lviv, Ukraine (2019)
18. Guimei, L., Jinyan, L., Sim, K., Limsoon, W.: Distance based subspace clustering with
    flexible dimension partitioning, Proc. of the IEEE 23rd Intern. conf. on digital object iden-
    tifier, vol. 15. Iss. 20, pp. 1250–1254 (2007)
19. Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces,
    ACM SIGMOD Intern. conf. on management of data, pp. 70–81 (2000)
20. Guo, D., Peuquet, D.J., Gahegan, M.: ICEAGE: Interactive clustering and exploration of
    large and high-dimensional geodata, Geoinfor-matica, vol. 3, N. 7, pp. 229–253 (2003)
21. Boyko, N., Basystiuk, O.: Comparison Of Machine Learning Libraries Performance Used
    For Machine Translation Based On Recurrent Neural Networks, 2018 IEEE Ukraine Stu-
    dent, Young Professional and Women in Engineering Congress (UKRSYW), pp.78-82,
    Kyiv, Ukraine (2018)
22. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A Monte Carlo algorithm for
    fast projective clustering, ACM SIGMOD Intern. conf. on management of data, pp. 418–
    427, Madison, Wisconsin, USA (2002)
23. Ankerst, M., Ester, M., Kriegel, H.-P.: Towards an effective cooperation of the user and
    the computer for classification, Proc. of the 6th ACM SIGKDD Intern. conf. on knowledge
    discovery and data mining, pp. 179–188, Boston, Massachusetts, USA (2000)
24. Boyko N., Pylypiv O., Peleshchak Y., Kryvenchuk Y., Campos J.: Automated document
    analysis for quick personal health record creation. 2nd International Workshop on Infor-
    matics and Data-Driven Medicine. IDDM 2019. Lviv. p. 208-221. (2019)
25. Kryvenchuk Y., Mykalov P., Novytskyi Y., Zakharchuk M., Malynovskyy Y., Řepka M.:
    Analysis of the architecture of distributed systems for the reduction of loading high-load
    networks. Advances in Intelligent Systems and Computing. Vol.1080. p.759-550. (2020)