Data Distillation for Traffic Sign Detection?

               Alexey Popov1 , Vlad Shakhuro1 , and Anton Konushin1,2,3
                    1
                 Lomonosov Moscow State University, Moscow, Russia
                        2
                  NRU Higher School of Economics, Moscow, Russia
                      3
                        Samsung AI Center, Moscow, Russia
    {alexey.popov,vlad.shakhuro,anton.konushin}@graphics.cs.msu.ru


        Abstract. This work is devoted to the traffic sign detection on images using deep
        learning methods. We focus on the problem of detector transfer to new datasets
        with different road signs. We present an algorithm for distilling a set of unlabelled
        data to select the most informative images to be labeled. This method allows to
        significantly reduce the amount of data labeling with a small decline of detector
        performance.

        Keywords: Data distillation, Traffic sign detection, Detector adaptation.


1     Introduction

Modern object detection methods are based on deep learning methods. Deep neural
network require large labelled datasets to be trained. Labelling large enough datasets is
usually difficult and expensive.
     Consider traffic sign detection task. Traffic sign recognition system which is used,
for instance, in self-driving cars, should be universal and work in several countries.
Traffic sign is a standardized object, but it looks a little different in different countries.
Moreover, there exist unique signs in some countries. Traffic sign detection system
should be somehow trained on different datasets on several datasets and be applicable in
different countries. Another problem, which we explore in this work, is training sample
size. Nowadays, there is no common opinion in research community on how much data
is sufficient to train a traffic sign detector.
     In the first part of our work we explore traffic sign detector finetuning. We use two
large traffic sign datasets RTSD (Russian signs) and TTK100 (China signs) to find out
how detector performance changes with and without pretraining, how training sample
size affects performance.
     In the second part of our work we explore several methods which allow to reduce
training sample size. We explore random sampling and propose two new methods for
data filtering. First method requires labelled training sample and uses detector loss to
filter out uninformative frames. It chooses the most informative frames and works better
than random sampling.
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons Li-
    cense Attribution 4.0 International (CC BY 4.0).
?
    Publication is supported by RFBR grant 19-07-00844.
2 A. Popov et al.

     Second proposed method regresses a number which describes how informative frame
is. This method chooses most informative frames without any labelling.


2   Related work

Modern object detection methods use deep neural networks. They can be divided into
several types.
     First type, two stage detectors are represented by Faster R-CNN method [8] and
its’ successors. These methods consist of two parts: object location hypotheses gener-
ator and object-background classifier. Such algorithms work quite slowly, but usually
achieve best detection quality.
     Second type is single stage detectors. YOLO [11], SSD [1] and other methods aim to
simultaneously generate object location hypotheses and classify them. These methods
achieve higher inference speed at price of lower quality. Such detectors are widely used
in real-time applications.
     In addition to basic object detection methods there exist methods like RetinaNet [2]
that improve baseline detection methods using pyramid of features and loss function
designed specially for object detection.
     All detectors require a large amount of data for training. For example there are
several data sets available for the task of detecting road signs [3, 4]. There are rare and
frequent classes of road signs in data sets.
     Conventional machine learning approaches are not suitable for recognizing rare
road signs. There are special methods are offered for processing rare road signs, for
example, based on generating synthetic data [9, 10]. Early approaches to generating
synthetic data used heuristic methods based on computer graphics [12]. These methods
contained some parameter selection algorithms for creating synthetic samples. Methods
for evaluating the quality of target data samples were also an important part of them.
Currently, the main direction of methods for generating synthetic data collections. It is
to use the idea of the generative adversarial network (GAN) [13], which allows us to
create high-quality synthetic data sets for the task of road signs detecting [14].
     Data augmentation algorithms are also available to increase the number of rare data.
For example [15]. In this article propose augmentation strategy: patches are cut and
pasted among training images where the ground truth labels are also mixed proportion-
ally to the area of the patches. By making efficient use of training pixels and retaining
the regularization effect of regional dropout.
     But on the other hand, with a large amount of data, detection algorithms take a
very long time to learn. Therefore, there are algorithms that reduce the amount of data
for training. However, the quality of the algorithm remains the same. This class of
algorithms includes data distillation algorithms.
     Data distillation is a relatively new method in computer vision. In detection task
there is method described in [5]. This article propose method that ensembles predic-
tions from multiple transformations of unlabeled data, received using single model, to
automatically generate new training annotations and, in particular, is applied to the de-
tection task.
                                                Data Distillation for Traffic Sign Detection 3

    There are method based on the idea of reducing training data by choosing the most
informative examples. For example, in [6] proposed method, using the adapted loss
function of the target algorithm for obtaining complexity value of frame. Frames are
sorted by complexity and the most complex ones are selected for training target algo-
rithm. It is possible to significantly increase the quality of the detection algorithm in the
problem of detecting road signs.
    Also there are method using two different neural network for data distillation. First
neural network is teacher and second is student. For example in [7] this approach using
for data distillation approach to learning optical flow estimation from unlabeled data.
The approach distills reliable predictions from a teacher network, and uses these pre-
dictions as annotations to guide a student network to learn optical flow. This approach
is data-driven, and learns optical flow for occluded pixels. This enables to train model
with a much simpler loss function, and achieve a much higher accuracy.


3     Training data filtering

An important task is to select the most informative images for finetuning. Opposite,
finetuning of the detector is also possible on a full collection of data, but it is time-
consuming. Therefore, we offer our own methods for selecting the most informative
frames. In our work, we consider several such methods and show that in some cases it is
possible to increase the quality of the algorithm. The practical purpose of the methods is
to select data from an unmarked collection for further markup. The selected data should
be the most informative for the target algorithm and the algorithm must achieve high
quality performance on these test data.


3.1   Random sampling

Random sampling is the simplest method for data filtering. Let X be a set of data and
Xi is image from dataset. Let y be a vector witch consist of 0 or 1 with size like first
X dimension, |y| = n = |X|,  Pnq - a selection parameter that defines the part of data to
select. Next, we require that i=0 yi ∼ n∗q and than Y = {Xi |yi = 1}. Thus we have
chosen a random part of the images defined by the q parameter and we will continue to
call this technique RS (Random sample).


3.2   Data filtering using loss ranking

Based on the assumption that the loss function of the binary detector determines the
complexity of the frame. The loss function determines the complexity of the entire
frame, regardless of the number of characters on this frame. The loss function deter-
mines the difficulty of detecting signs on a frame without reference to classes of spe-
cific signs on this frame. We can construct an algorithm for selecting the most complex
frames. In this paper, we propose an algorithm that can be used to estimate the com-
plexity of frames that already have markup.
4 A. Popov et al.

 1. Loss function L(f (Yi ), gt(Yi )) of the detector algorithm f (Yi ) that was trained on
    data from the basic domain and Y set of data from the target domain images, where
    L(f (Yi ), gt(Yi )) ∈ R, Yi ∈ Y , gt(Yi ) is ground truth detections on the Yi frame.
 2. Compute y : yi = L(f (Yi ), gt(Yi )) and sort y and Y in descending order respec-
    tively.
 3. Create Ye = {Yi |i ≤ n ∗ q}.
 4. Ye - this is the final dataset that will be used for traning or finetuning.

    The use of the algorithm described above has a good effect on further training of
the detection algorithm. Because, complex examples are often the most common, which
allows you to train the algorithm faster without losing its final quality. It is important
to note that the presented algorithm has a significant drawback - requirement for data
markup.


3.3    Filtering unlabelled data

Based on the previous point, we can get the complexity of marked frames based on
the values of the loss function of the trained algorithm. It is important to note that the
main burden on data collection is just imposed by their mark up. Therefore, we need
to reduce the number of images that need to be marked up. To reduce the number of
markup data, we need to select the most informative frames without any additional
information about these frames. It is proposed to train a small neural network that can
predict the complexity of frames without using pre-markup.
    Let’s develop the idea of determining the complexity of images. To do this, we
will use a small neural network. In the following algorithms, we used the general ar-
chitecture Fig. 1 but different loss functions. This neural network model is designed
specifically for predicting the complexity of frames. This is a simple convolutional neu-
ral network that receives a 300x300 image as input. It has a pair of fully connected
layers and returns a single real number at the output.
    Based on the results of the previous part, we got markup for all our data that has the
original markup. We formalize this approach. We have X    e = {f (x)|x ∈ X} and than
we use it as training sample. We will use them as a training sample, for an algorithm
that we will train to predict the complexity of the image for our detector. In total, we
have the following algorithm:

 1. Using a trained detection algorithm, we run the marked-up data set of the base
    domain and get the complexity of each image for our algorithm.
 2. Based on the obtained data, we train a small neural network that predicts the com-
    plexity of the image for our initial algorithm.
 3. Predicting complexity for images from the target data set and forming them in
    descending order of complexity.
 4. Select the appropriate part of the most complex images and mark them up.

      Let us get the function gloss (x) ∈ R, x ∈ Y where gloss (x) is our algorithm.
                                                  Data Distillation for Traffic Sign Detection 5


     Fig. 1. The architecture of neural network for predicting the complexity of the image.


MSE loss For sample version of our algorithm we use MSE loss. This algorithm will
be trained to predict the exact value of the image complexity. Here is an example of
calculating MSE for a pair of images:

                                         (y 2 − ye2 )
                                    M SE =                                    (1)
                                              2
    Where y, ye are pairs of numbers from the markup and predicted by the network
respectively.

DupletLoss As part of this work, we developed the DupletLoss loss function, which
allows us to train the algorithm to predict complexity (the information content of the
frame for the neural network detector). This loss function based on comparing the com-
plexity of a pair of images from a data set.
    Here is an example of calculating the DupletLoss (DL) function for a pair of images:
                                    (
                                         e T ) (d ∗ d)
                                      ϕ(d,           e ≤0
                             DL =        e T ) (d ∗ d)                               (2)
                                      Ψ (d,          e >0
                                  (
                                    100 + d4                de ≤ T
                                            e
                       ϕ(d, T ) =
                          e
                                                  15
                                                                                     (3)
                                    100 + 4(d − 16 ∗ T ) de > T
                                               e
                                  (
                                              de
                          e T ) = −100 − 4
                       Ψ (d,
                                                            de ≤ T
                                                                                     (4)
                                                  15
                                    100 − 4(de + 16 ∗ T ) de > T
     Where T - parameter of the loss function, d = y[1] − y[0] - the difference between
the complexity of the image 1 and 0, which are obtained from markup, and de = ye[1] −
ye[0] - the difference between the complexity of the image is 1 and 0, which are predicted
by the algorithm.
6 A. Popov et al.

                                        DupletLoss

                                         100 DL


                                           50


                                                                      d
                    −4   −3    −2     −1             1    2      3         4


                                        −50


                                       −100


                    Fig. 2. Graph of the DupletLoss function with T = 1.


    This function allows you to order images by their complexity, using the gradation
of complexity in the markup. This function assumes a penalty for incorrectly ordered
images, but it does not impose any restrictions on the difference module, and does not
force the network to optimize values for selecting accurate accuracy values.


4   Experimental evaluation

All experiments to train or finetuning the RetinaNet detector were performed on two
GeForce GTX 1080 Ti. The choice was made in favor of RetinaNet based on the fol-
lowing advantages:

 1. Thanks to the Focal Loss function, the detector demonstrates high quality perfor-
    mance on rare or small objects.
 2. High speed of operation.
 3. Its accuracy and completeness is not inferior to other modern algorithms that are
    relevant at the time of writing this work.

    For the experiments we used RTSD dataset [4] which contains 59028 images, and
TTK100 dataset [3] for finetuning containing 9182 images. In TTK100 there are several
major classes of characters Fig. 3.
    The metric calculated in all experiments is the AUC. The quantitative interpretation
of ROC is given by the AUC indicator-the area bounded by the ROC curve and the axis
of the proportion of false positive classifications, where the AUC - area under ROC
curve.
                                                Data Distillation for Traffic Sign Detection 7


                       Fig. 3. The main classes of the TTK100 dataset.


4.1   Baseline
In the initial training, experiments were performed to train the detector on mixed data
sets in equal proportions, which allowed us to evaluate the detector’s ability to finetun-
ing.
    In further experiments, we will determine that not only is finetuning as good as
learning on a mixed dataset, but in some cases it is significantly better (Table 1).


                             Table 1. Baseline of our algorithm.

            BaseTrain/Finetuning Group name Base only TTK100 Base+TTK100
                                  Average    0.7110 0.8904         -
                                   Yellow    0.7059 0.8886         -
                RTSD train
                                    Red      0.7145 0.8910         -
                                    Blue     0.7043 0.8891         -
                                  Average       -      0.8908   0.8908
                                   Yellow       -      0.8902   0.8873
               Without train
                                    Red         -      0.8928   0.8918
                                    Blue        -      0.8914   0.8889


    Thus, based on the results obtained, it is possible to evaluate the further effectiveness
of the proposed algorithms.

4.2   Training detector by random data selection algorithm
In order to obtain an experimental estimate (Table 2), which will be compared with the
trained algorithms for selecting the most complex examples for the detector, experi-
ments were initially conducted with a random selection of data portions from the main
data set.
    These results allow us to evaluate the quality of random selection methods in com-
parison with the selection algorithms of the most complex algorithms.
8 A. Popov et al.

              Table 2. Result of detector on different part of DS (DS is TTK100).

      BaseTrain/Finetuning Group name 12 of DS 14 of DS 18 of DS 16
                                                                 1         1
                                                                    of DS 32 of DS
                            Average 0.8909 0.8990 0.8827 0.8787 0.8735
                             Yellow    0.8890 0.7976 0.7998 0.8703 0.7933
          RTSD train
                              Red      0.8917 0.8856 0.8851 0.8802 0.8773
                              Blue     0.8894 0.8828 0.8811 0.8782 0.8698
                            Average 0.8863 0.8806 0.8721 0.8466 0.7792
                             Yellow    0.8826 0.8986 0.8936 0.8699 0.8908
         Without train
                              Red      0.8872 0.8831 0.8751 0.8533 0.7816
                              Blue     0.8845 0.8787 0.8678 0.8426 0.7727
                             Images     3003     1500      754      374      184
                             Yellow      353      148      92       50        18
        Count of images
                              Red       5399     2763     1433      628      318
                              Blue      1111      534      247      133       71


Fig. 4. On the left, the most complex frame from TTK100, on the right, the simplest frame, based
on the loss function ranking algorithm.


4.3   Training detector by data selection with loss ranking algorithm

Results of the algorithm based on data filtered out using an algorithm that uses the
original markup (Table 3). This algorithm shows good results compared to random
selection, as it improves the balance of the data set.
    It increases the number of sample images from rarer classes at the expense of more
frequent ones. The results of the algorithm significantly improved on yellow signs, but
slightly deteriorated on red signs. This is a consequence of the fact that the dataset has
become more balanced. The number of yellow characters in the dataset has increased,
but the number of red characters has decreased, which is logical, since initially there
are few yellow characters in the data set, but a lot of red ones.
    It is experimentally proved that this method improves the balance of samples by
adding examples from the rarest classes to the data set. This allows you to improve the
quality of the final algorithm for rare character classes.
    It is important to note that the deterioration of results in small subsamples is due
to the fact that ranking data by the loss function determines the most complex images,
                                                  Data Distillation for Traffic Sign Detection 9

Table 3. Result of detector on different part of DS (DS is TTK100) which are obtained using the
algorithm from 3.2.

      BaseTrain/Finetuning Group name 12 of DS 14 of DS 18 of DS 16
                                                                 1         1
                                                                    of DS 32 of DS
                            Average 0.8873 0.8832 0.8813 0.8077 0.5305
                             Yellow    0.8854 0.8797 0.8784 0.7216 0.3957
         Without train
                              Red      0.8879 0.8839 0.8821 0.8218 0.5814
                              Blue     0.8869 0.8820 0.8810 0.7386 0.4538
                             Images     3000     1499      749      374      186
                             Yellow      480      261      112      48        25
        Count of images
                              Red       5964     2721     1237      563      288
                              Blue      1232      612      288      133       54


which are the most noisy. Due to this, the quality of the algorithm on small parts of the
dataset significantly decreases.


Fig. 5. On the left, the most complex frame from TTK100, on the right, the simplest frame, based
on the MSE ranking algorithm.


4.4   Training detector by data selection without marking algorithm
According to the results of experiments (Table 4), you can see that the quality of the
model in some cases improves, but at the same time, with a very small number of
images, this method sometimes produces outliers, which we discussed in the previous
part. We’ll talk about what you can do when you need to select very few images later. If
you pay attention to how the sample balance has changed, it is obvious that this method
has a positive effect on this parameter in the data set.
    For cases where there are some outliers in small portions of data, you can try to ran-
domly select portions of data from the sample in which such outliers are not observed,
for example, we conducted experiments with a random selection of part of the data from
a larger one, on which the algorithm showed some improvements (Table 5).
    It is important to note that a very important advantage of our method is the improved
balance of training data compared to random selection.
10 A. Popov et al.

Table 4. Result of detector on different part of DS (DS is TTK100) which are obtained using the
algorithm from 3.3.

      BaseTrain/Finetuning Group name 12 of DS 14 of DS 18 of DS 16
                                                                 1         1
                                                                    of DS 32 of DS
                            Average 0.8858 0.8814 0.8716 0.7899 0.6938
                             Yellow    0.8797 0.8721 0.8967 0.7887 0.7630
         Without train
                              Red      0.8867 0.8831 0.8748 0.7917 0.7567
                              Blue     0.8857 0.8781 0.8661 0.7861 0.6842
                             Images     3001     1501      751      376      188
                             Yellow      423      249      146      93        55
        Count of images
                              Red       5779     2688     1343      603      328
                              Blue      1238      574      265      105       38

Table 5. Result of detector on different part of DS (DS is TTK100) which are obtained using a
combination of random selection and algorithm 3.2 or 3.3 (based on 18 DS called BDS), which is
compared with DupletLoss.

                                    1                                        1
                                    2
                                           of 12     of 14     of 14     of 16        of
    BaseTrain/Finetuning Group name BDS       BDS       BDS       BDS       DS      with
                                    with 3.2 with 3.3 with 3.2 with 3.3 DupletLoss
                          Average    0.8394 0.7923 0.7309 0.7779               0.7833
                           Yellow    0.7657 0.7791 0.6926 0.6919               0.7852
       Without train
                            Red      0.8456 0.7945 0.7406 0.7826               0.7845
                            Blue     0.8269 0.7895 0.7077 0.7760               0.7800
                           Images      374       374       187       187         376
                           Yellow       65        64        39       36          55
      Count of images
                            Red        656       668       310       370         614
                            Blue       155       138        80       55          108


5    Conclusion

In this paper, we consider the problem of training a traffic sign detector and the related
problem of the detector’s demand for the amount of training data. Considering algo-
rithms for detecting road signs, we offer a number of algorithms for selecting the most
informative frames with different approaches. An approach with pre-markup of data
and an approach without pre-markup of data. Thanks to the algorithms proposed in
this paper, it is possible to reduce the amount of data required for training the detector.
However, the quality of the final algorithm does not deteriorate.


References
 1. Liu W. et al. Ssd: Single shot multibox detector //European conference on computer vision.
    – Springer, Cham, 2016. – C. 21-37.
 2. Lin T. Y. et al. Focal loss for dense object detection //Proceedings of the IEEE international
    conference on computer vision. – 2017. – C. 2980-2988.
 3. Zhu Z. et al. Traffic-sign detection and classification in the wild //Proceedings of the IEEE
    conference on computer vision and pattern recognition. – 2016. – C. 2110-2118.
                                                   Data Distillation for Traffic Sign Detection 11

 4. Shakhuro V. I., Konouchine A. S. Russian traffic sign images dataset //Computer optics. –
    2016. – T. 40. – V. 2. – C. 294-300.
 5. Radosavovic I. et al. Data distillation: Towards omni-supervised learning //Proceedings of
    the IEEE Conference on Computer Vision and Pattern Recognition. – 2018. – C. 4119-4128.
 6. Sofiyuk K.: Neural network model for object detection in images //Lomonosov-2018. - 2018.
 7. Liu P. et al. Ddflow: Learning optical flow with unlabeled data distillation //Proceedings of
    the AAAI Conference on Artificial Intelligence. – 2019. – T. 33. – C. 8770-8777.
 8. Ren S. et al. Faster r-cnn: Towards real-time object detection with region proposal networks
    //Advances in neural information processing systems. – 2015. – C. 91-99.
 9. Chigorin A., Konushin A. A system for large-scale automatic traffic sign recognition and
    mapping // ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sci-
    ences. — 2013. — no. II-3/W3. — P. 13–17.
10. Faizov BV, Shakhuro VI, Sanzharov VV, Konushin AS. Classification of rare traffic signs.
    Computer Optics 2020; 44(2): 237-244. DOI: 10.18287/2412-6179-CO-601.
11. Redmon J. et al. You only look once: Unified, real-time object detection //Proceedings of the
    IEEE conference on computer vision and pattern recognition. – 2016. – C. 779-788.
12. Moiseev B. et al. Evaluation of traffic sign recognition methods trained on synthetically gen-
    erated data //International Conference on Advanced Concepts for Intelligent Vision Systems.
    – Springer, Cham, 2013. – C. 576-583.
13. Goodfellow I. et al. Generative adversarial nets //Advances in neural information processing
    systems. – 2014. – C. 2672-2680.
14. Shakhuro V., Faizov B., Konushin A. Rare Traffic Sign Recognition using Synthetic Training
    Data //Proceedings of the 3rd International Conference on Video and Image Processing. –
    2019. – C. 23-26.
15. Yun S. et al. Cutmix: Regularization strategy to train strong classifiers with localizable fea-
    tures //Proceedings of the IEEE International Conference on Computer Vision. – 2019. – C.
    6023-6032.