                    Video Scene Location Recognition with Neural Networks

                                    Lukáš Korel1 , Petr Pulc1 , Jiří Tumpach2 , and Martin Holeňa3
                       1  Faculty of Information Technology, Czech Technical University, Prague, Czech Republic
                           2 Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
                 3 Institute of Computer Science, Academy of Sciences of the Czech Republic, Prague, Czech Republic


Abstract: This paper provides an insight into the possibility of scene recognition from a video sequence with a small set of repeated shooting locations (such as in television series) using artificial neural networks. The basic idea of the presented approach is to select a set of frames from each scene, transform them by a pre-trained single-image pre-processing convolutional network, and classify the scene location with subsequent layers of the neural network. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. We have investigated different neural network layers to combine individual frames, particularly AveragePooling, MaxPooling, Product, Flatten, LSTM, and Bidirectional LSTM layers. We have observed that only some of the approaches are suitable for the task at hand.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1    Introduction

People watching videos are able to recognize where the current scene is located. When watching a film or series, they are able to recognize that a new scene takes place in a location they have already seen. Finally, people are able to understand the hierarchy of scenes. All of this supports human comprehensibility of videos.
   The role of location identification in scene recognition by humans motivated our research into scene location classification by artificial neural networks (ANNs). A more ambitious goal would be to make a system able to remember unknown video locations and, using this data, to identify a video scene located in the same place and mark it with the same label. This paper reports work in progress in that direction. It describes the employed methodology and presents first experimental results obtained with six kinds of neural networks.
   The rest of the paper is organized as follows. The next section reviews existing approaches to this problem. Section 3 is divided into two parts: the first is about data preparation before its usage in ANNs, and the second is about the design of the ANNs in our experiments. Finally, Section 4, the last section before the conclusion, shows the results of our experiments with these ANNs.


2    ANN-Based Scene Classification

The problem of scene classification has been studied for many years. There are many approaches based on neural networks, where an ANN trained on a huge amount of images learns to recognize the type of a given scene (for example, a kitchen, a bedroom, etc.). Several datasets are available for this setting. One example is [11], but it does not specify locations, so this and similar datasets are not usable for our task.
   However, our classification problem is different. We want to train an ANN able to recognize a particular location (for example "Springfield-EverGreenTerrace-742-floor2-bathroom"), which can be recorded by a camera from many angles (typically, some objects can be occluded by other objects from some angles).
   One approach using an ANN to solve this task is described in [1], where convolutional networks were used. The difference to our approach lies, on the one hand, in the extraction and usage of video frames and, on the other hand, in the types of ANN layers.
   Another approach is described in [4]. The authors propose a high-level image representation, called Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors. Leveraging the Object Bank representation, good performance on high-level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM.


3    Methodology

3.1   Data Preparation

Video data consists of large video files. Therefore, the first task of video data preparation consists in loading only the data that is currently needed.
   We have evaluated the distribution of the data used for ANN training. We have found that some scenes have a low occurrence, whereas others occur up to 30 times more frequently. Hence, the second task of video data preparation is to increase the uniformity of the distribution, to prevent biasing the ANN towards the most frequent classes. This is achieved by undersampling the frequent classes in the training data.
   The input consists of video files and a text file. The video files are divided into independent episodes. The text file contains manually created metainformation about every scene.
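For illustration, loading only the frames that are currently needed can be sketched in Python with OpenCV as follows; the helper function below is hypothetical and not part of our implementation, and the 224 × 224 resize anticipates the network input described in Section 3.2.

    import cv2

    def read_frame(video_path, frame_index):
        """Read a single frame (BGR) from a video file, decoding only what is needed."""
        cap = cv2.VideoCapture(video_path)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)  # seek to the requested frame
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise IOError("could not read frame %d from %s" % (frame_index, video_path))
        return cv2.resize(frame, (224, 224))           # resize to the VGG19 input resolution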
Figure 1: Input preparation for a neural network (frames sampled from each scene are resized and concatenated into the input data X of shape # frames × width × height × #colors)

Figure 2: First, untrainable part of our neural network, where the Input Layer represents a frame with resolution 224 × 224 in BGR colors (#frames × [224 × 224 × 3]) and the output is a vector of length 4096 per frame (#frames × [4096]), which is the output of the VGG19 network without its last two layers


Every row contains metainformation about one scene. A scene is understood as a sequence of frames that is not interrupted by any frame with a different scene location label. Every row contains a relative path to the source video file, the frame number where the scene begins, and the count of its frames. Figure 1 outlines how frames are extracted and prepared for the ANNs. For ANN training, we select from each target scene a constant count of 20 frames (denoted # frames in Figure 1). To get the most informative representation of the considered scene, the frames are sampled from the whole length of the scene. This, in particular, prevents selecting frames only within a short time interval. Each scene has its own frame distance computed from its frame count:

                              SL = SF / F

where SF is the count of scene frames, F is the considered constant count of selected frames, and SL is the distance between two selected frames in the scene. After frame extraction, every frame is reshaped to an input 3D matrix for the ANN. Finally, the reshaped frames are merged into one input matrix for the neural network.
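For illustration, the sampling rule above could be implemented as follows; the column names of the metadata file and its whitespace-separated format are assumptions made only for this sketch, whereas the constant F = 20 and the spacing SL = SF / F come from the text.

    import pandas as pd

    F = 20  # constant count of selected frames per scene (# frames in Figure 1)

    def frame_indices(first_frame, scene_frames, f=F):
        # SL = SF / F: distance between two selected frames in the scene
        sl = scene_frames / f
        return [first_frame + int(i * sl) for i in range(f)]

    # hypothetical layout: one row per scene with video path, first frame,
    # frame count and scene location label
    scenes = pd.read_csv("scenes.txt", sep=r"\s+",
                         names=["video_path", "first_frame", "frame_count", "location"])
    for row in scenes.itertuples():
        indices = frame_indices(row.first_frame, row.frame_count)
        # ... the frames at these indices are then read, resized and stacked

Each scene thus contributes exactly F frames regardless of its length.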
                                                                                                          We have used the VGG19 model (VGG network with
3.2   Used Neural Networks and Their Design

Our first idea was to create a complex neural network composed of different layers. However, it had too many parameters to train in view of the amount of data that we had. Therefore, we decided to use transfer learning from a pretrained network.
   Because our data are images, we considered only ANNs pretrained on image datasets, in particular ResNet50 [9], ResNet101 [9] and VGGnet [2]. Finally, we decided to use VGGnet due to its small size.
   Hence, the ANNs which we trained on our data are composed of two parts. The first part, depicted in Figure 2, is based on the VGGnet. At the input, we have 20 frames (resolution 224 × 224, BGR colors) from one scene. They are processed by a pretrained VGG19 neural network without its two top layers, which were removed for transfer learning. Its output is a vector of size 4096. For the 20 input frames we thus obtain 20 vectors of size 4096, which are merged into a 2D matrix of size 20 × 4096.
   For the second part, forming the upper layers of the final network, we have considered six possibilities: a product layer, a flatten layer, an average-pooling layer, a max-pooling layer, an LSTM layer and a bidirectional LSTM layer. All of them, as well as the VGGnet, are described below. Each of the listed layers is preceded by a Dense layer. The Dense layer returns a 20 × 12 matrix, where 12 is the number of classes. Every model then works differently with this output.

VGGnet The VGGnets [2] were originally developed for object recognition and detection. They have deep convolutional architectures with small convolution kernels (3 × 3), stride (1 × 1), and pooling windows (2 × 2). There are different network structures, ranging from 11 to 19 layers. The model capability increases when the network is deeper, but this imposes a heavier computational cost.
   We have used the VGG19 model (a VGG network with 19 layers) from the Keras library. This model [3] won the 1st and 2nd place in the 2014 ImageNet Large Scale Visual Recognition Challenge in the categories object localization and image classification, respectively. It achieves 92.7% top-5 test accuracy on the ImageNet dataset, which contains 14 million images belonging to 1000 classes. The architecture of the VGG19 model is depicted in Figure 3.
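For illustration, the untrainable first part of the network (Figure 2) could be set up in Keras as sketched below. We assume here that the 4096-dimensional features are taken from the first fully connected layer of VGG19 (named fc1 in Keras); the text and the caption of Figure 3 describe the removed top layers slightly differently, so this particular choice is an assumption of the sketch.

    import numpy as np
    from tensorflow.keras.applications import VGG19
    from tensorflow.keras.applications.vgg19 import preprocess_input
    from tensorflow.keras.models import Model

    base = VGG19(weights="imagenet")                         # full VGG19, input 224 x 224 x 3
    extractor = Model(inputs=base.input,
                      outputs=base.get_layer("fc1").output)  # 4096-dimensional output
    extractor.trainable = False                              # transfer learning: frozen weights

    def scene_features(frames):
        """frames: array of shape (20, 224, 224, 3); returns a (20, 4096) matrix."""
        x = preprocess_input(np.asarray(frames, dtype="float32"))
        return extractor.predict(x, verbose=0)

The resulting 20 × 4096 matrix is the input of the trainable second part described below.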
Figure 3: Architecture of the used VGG19 model [10]; in our network it is used without the FC1, FC2 and Softmax layers


3.2.1   Product array

In this approach, we apply a product array layer to all output vectors from the dense layer. A product array layer computes the product of all values in a chosen dimension of an n-dimensional array and returns an (n-1)-dimensional array.

Figure 4: Trainable part of the neural network based on a product layer (Input Layer [#frames × 4096] → Dense [#frames × #locationClasses] → Product [#locationClasses])

   A model with a product layer is outlined in Figure 4. The output from the product layer is one number for each class, i.e. scene location, so the result is a vector with 12 numbers. It returns a probability distribution over the set of scene locations.

3.2.2   Flatten

In this approach, we apply a flatten layer to all output vectors from the dense layer. A flatten layer creates one long vector from a matrix, so that all its rows are placed in sequence.

Figure 5: Trainable part of the neural network based on a flatten layer (Input Layer [#frames × 4096] → Dense [#frames × #locationClasses] → Flatten [#frames * #locationClasses] → Dense [#locationClasses])

   A model with a flatten layer is outlined in Figure 5. After the input and a dense layer, a flatten layer follows, which returns a long vector with 12 * 20 numbers in this case. It is followed by a second dense layer. Its output again has a dimension equal to the number of classes, and it returns a probability distribution over the set of scene locations.
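For illustration, the product and flatten heads (Figures 4 and 5) could be sketched in Keras as follows. The realization of the product array layer as a Lambda reduction and the placement of the softmax activations are assumptions of this sketch; the dimensions 20, 4096 and 12 come from the text.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    N_FRAMES, N_FEATURES, N_CLASSES = 20, 4096, 12

    def product_head():
        inp = layers.Input(shape=(N_FRAMES, N_FEATURES))
        x = layers.Dense(N_CLASSES, activation="softmax")(inp)       # (20, 12)
        out = layers.Lambda(lambda t: tf.reduce_prod(t, axis=1))(x)  # product over frames -> (12,)
        return models.Model(inp, out)

    def flatten_head():
        inp = layers.Input(shape=(N_FRAMES, N_FEATURES))
        x = layers.Dense(N_CLASSES)(inp)                              # (20, 12)
        x = layers.Flatten()(x)                                       # (240,)
        out = layers.Dense(N_CLASSES, activation="softmax")(x)        # distribution over locations
        return models.Model(inp, out)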
3.2.3   Average Pooling

In this approach, we apply average pooling to all output vectors from the dense layer part of the network (Figure 6). An average-pooling layer computes the average of the values assigned to subsets of its preceding layer that are such that:

   • they partition the preceding layer, i.e., that layer equals their union and they are mutually disjoint;

   • they are identically sized.

Taking into account these two conditions, the size p1 × . . . × pD of the preceding layer and the size r1 × . . . × rD of the sets forming its partition determine the size of the average-pooling layer.

Figure 6: Trainable part of the neural network based on an average-pooling layer (Input Layer [#frames × 4096] → Dense [#frames × #locationClasses] → AveragePooling1D [#locationClasses])

   In this case, the size of the average-pooling layer's forming sets is 20 × 1. Using this size in the average-pooling layer, we again get one number for each class, which yields a probability distribution over the set of scene locations.
   Apart from average pooling, we have also tried max pooling. However, it led to substantially worse results. Its classification of the scene location was typically based on people or items in the foreground, not on the scene as a whole.
   Although using the average-pooling layer is simple, it gives acceptable results. The number of trainable parameters of the network is then low, which makes it suitable for our comparatively small dataset.

3.2.4   Long Short Term Memory

An LSTM layer is used for classification of sequences of feature vectors, or equivalently, multidimensional time series with discrete time. Alternatively, that layer can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes, in our case of scene locations. LSTM layers are intended for recurrent signal propagation, and differently from other commonly encountered layers, they consist not of simple neurons, but of units with their own inner structure. Several variants of such a structure have been proposed (e.g., [5, 8]), but all of them include at least the following four components:

   • Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of the unit inputs and of the values coming through recurrent connections.

   • Input gate controls the extent to which values from the previous unit within the layer or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of the input and recurrent connections, though its bias and synaptic weights are specific and in general different from the bias and synaptic weights of the memory cell.

   • Forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of input and recurrent connections.

   • Output gate controls the extent to which the memory cell state influences the unit output. Also this gate has a sigmoidal activation function, which is applied to a specific biased linear combination of input and recurrent connections, and is subsequently composed either directly with the cell state or with its sigmoidal transformation, using a different sigmoid than is used by the gates.

   Hence, using LSTM layers is a more sophisticated approach compared to simple average pooling. An LSTM layer can keep a hidden state through time with information about previous frames.
   Figure 7 shows that the input to an LSTM layer is a 2D matrix. Its rows are ordered by the time of the frames from the input scene. Every input frame in the network is represented by one vector. The output from the LSTM layer is a vector of the same size as in the previous approaches, which returns a probability distribution over the set of scene locations.

3.2.5   Bidirectional Long Short Term Memory

An LSTM, due to its hidden state, preserves information from inputs that have already passed through it. A unidirectional LSTM only preserves information from the past, because the only inputs it has seen are from the past. A bidirectional LSTM runs the inputs in two ways, one from the past to the future and one from the future to the past. To this end, it combines two hidden states, one for each direction.
Figure 7: Trainable part of the neural network based on an LSTM layer (Input Layer [#frames × 4096] → Dense [#frames × #locationClasses] → LSTM [#locationClasses])

Figure 8: Trainable part of the neural network based on a bidirectional LSTM layer (Input Layer [#frames × 4096] → Dense [#frames × #locationClasses] → Bidirectional LSTM [#locationClasses])

   Figure 8 shows that the input to a bidirectional LSTM layer is the same as the input to an LSTM layer. Every input frame in the network is represented by one vector. The output from the bidirectional LSTM layer is a vector of the same size as in the previous approaches, which returns a probability distribution over the set of scene locations.
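For illustration, the remaining three heads (Figures 6, 7 and 8) could be sketched in Keras as follows; the choice of the merge mode of the bidirectional layer and the interpretation of the final LSTM state as class scores are assumptions of this sketch.

    from tensorflow.keras import layers, models

    N_FRAMES, N_FEATURES, N_CLASSES = 20, 4096, 12

    def average_pooling_head():
        inp = layers.Input(shape=(N_FRAMES, N_FEATURES))
        x = layers.Dense(N_CLASSES)(inp)                        # (20, 12)
        x = layers.AveragePooling1D(pool_size=N_FRAMES)(x)      # average over the 20 frames
        return models.Model(inp, layers.Flatten()(x))           # -> (12,)

    def lstm_head():
        inp = layers.Input(shape=(N_FRAMES, N_FEATURES))
        x = layers.Dense(N_CLASSES)(inp)
        out = layers.LSTM(N_CLASSES)(x)                         # final hidden state -> (12,)
        return models.Model(inp, out)

    def bidirectional_lstm_head():
        inp = layers.Input(shape=(N_FRAMES, N_FEATURES))
        x = layers.Dense(N_CLASSES)(inp)
        out = layers.Bidirectional(layers.LSTM(N_CLASSES),
                                   merge_mode="sum")(x)         # combine both directions -> (12,)
        return models.Model(inp, out)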
4     Experiments

4.1   Experimental Setup

The ANNs for scene location classification were implemented in Python using the TensorFlow and Keras libraries. Neural network training was accelerated on an NVIDIA GPU. The versions of the employed hardware and software are listed in Table 1. For image preparation, OpenCV and NumPy were used. The routine for preparing frames is a generator. It has lower memory requirements, because data are loaded just in time, when they are needed, and the memory is released after the data have been used by the ANN. All non-image information about the inputs (video location, scene information, etc.) is processed in text format by Pandas.

Table 1: Versions of the employed hardware and software

        CPU cores                  2
        GPU compute capability     3.5 and higher
        OS                         Linux 5.4.0
        CUDA                       11.3
        Python                     3.8.6
        TensorFlow                 2.3.1
        Keras                      2.4.0
        OpenCV                     4.5.2

   We have 17 independent datasets prepared by ourselves from proprietary videos of The Big Bang Theory series; thus, the datasets cannot be made public. Each dataset originates from one episode of the series. Each experiment was trained with one dataset, so the results are independent as well, and we can compare the behavior of the models across different datasets.
   Our algorithm for selecting data in the training routine is based on oversampling. It randomly selects a target class and then, from the whole training dataset, randomly selects a source scene of that class with replacement. This algorithm is applied because of the unbalanced proportion of the different target classes. Thanks to this method, all targets are distributed equally and the network does not overfit to a highly represented class.
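For illustration, the oversampling generator described above could be sketched as follows; the data structures (a dictionary mapping each location class to its list of scenes and a scene-loading callable) are assumptions of this sketch.

    import random
    import numpy as np

    def training_generator(scenes_by_class, load_scene_matrix, batch_size=8):
        # scenes_by_class: {class_index: [scene_record, ...]}
        # load_scene_matrix: returns the (20, 4096) feature matrix of one scene
        classes = sorted(scenes_by_class)
        while True:
            xs, ys = [], []
            for _ in range(batch_size):
                c = random.choice(classes)                 # uniform over target classes
                scene = random.choice(scenes_by_class[c])  # scene drawn with replacement
                xs.append(load_scene_matrix(scene))        # frames are loaded only now
                ys.append(c)
            yield np.stack(xs), np.asarray(ys)

A generator of this kind can be passed directly to the Keras fit routine, so only the scenes of the current batch need to be held in memory.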
4.2   Results

The differences between the models considered in the second, trained part of the network were tested for significance by the Friedman test. The basic null hypothesis that the mean classification accuracy of all 6 models coincides was strongly rejected, with the achieved significance p = 2.8 × 10⁻¹³. For the post-hoc analysis, we employed the Wilcoxon signed rank test with a two-sided alternative for all 15 pairs of the considered models, because of the inconsistency of the more commonly used mean-ranks post-hoc test, which was recently pointed out by Benavoli et al. [6]. As a correction for multiple hypotheses testing, we used the Holm method [7]. The results are included in the comparison between the models in Table 2.
   Summary statistics of the predictive accuracy of classification over all 17 episode datasets are in Table 3. Every experiment was performed on every dataset at least 7 times. The tables are complemented with results for individual episodes, depicted in box plots (Figures 9–14).
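For illustration, the statistical evaluation described above could be sketched as follows, using SciPy and statsmodels on a hypothetical 17 × 6 table of accuracies (one row per episode dataset, one column per model); the choice of these libraries is ours.

    from itertools import combinations
    import numpy as np
    from scipy.stats import friedmanchisquare, wilcoxon
    from statsmodels.stats.multitest import multipletests

    acc = np.loadtxt("accuracies.txt")   # hypothetical 17 x 6 accuracy table

    # Friedman test of the null hypothesis that all 6 models perform equally
    stat, p_friedman = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])

    # pairwise two-sided Wilcoxon signed rank tests for all 15 pairs of models
    pairs = list(combinations(range(acc.shape[1]), 2))
    p_raw = [wilcoxon(acc[:, i], acc[:, j], alternative="two-sided").pvalue
             for i, j in pairs]

    # Holm correction for multiple hypotheses testing
    reject, p_holm, _, _ = multipletests(p_raw, alpha=0.05, method="holm")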
Table 2: Comparison of accuracy results on all 17 episode datasets. The values in the table are the counts of datasets on which the model in the row has a higher accuracy than the model in the column. If the difference is not significant according to the Wilcoxon test, the count is in italics; if it is significant, the higher count is in bold.

                             Product    Flatten    Average      Max   LSTM      BidirectionalLSTM       SummaryScore
     Product                      X          16          6       16       5                     1                 44
     Flatten                       1          X          0       10       0                     0                 11
     Average                     11          17          X       17       3                     1                 49
     Max                           1          6          0        X       0                     0                  7
     LSTM                        12          17         14       17       X                     3                 63
     BidirectionalLSTM           16          17         15       17      14                     X                 79



The model with a max-pooling layer had the worst results (Figure 12) of all experiments. Its overall mean accuracy was around 10 %. This is only slightly higher than random choice, which is 1/12. The model was not able to achieve an accuracy better than 20 %. Its results were stable and their standard deviation was very low.
   Slightly better results (Figure 10) were obtained by the model with a flatten layer. It was sometimes able to achieve a high accuracy, but its standard deviation was very high. On the other hand, its results for some other episodes were not better than those of the max-pooling model.
   A better solution is the product model, whose predictive accuracy (Figure 9) was higher than 80 % for several episodes. On the other hand, other episodes had only slightly better results than with the flatten model. It also had the highest standard deviation among all considered models.
   The most stable results (Figure 11), with good accuracy, were obtained by the model based on the average-pooling layer. Its mean accuracy was 32 % and for no episode was the accuracy substantially different.
   The model with a unidirectional LSTM layer had the second highest mean accuracy of the considered models (Figure 13), over 40 %. Its internal memory brings an advantage in comparison with the previous approaches, though also a comparatively high standard deviation.
   The highest mean accuracy was achieved by the model with a bidirectional LSTM layer (Figure 14). It had a similar standard deviation as the one with a unidirectional LSTM, but a mean accuracy of nearly 50 %.


5   Conclusion and Future Research

In this paper, an insight was provided into the possibility of using artificial neural networks for scene location recognition from a video sequence with a small set of repeated shooting locations (such as in television series). Our idea was to select more than one frame from each scene and classify the scene using that sequence of frames. We used a pretrained VGG19 network without its two last layers. Its outputs were used as input to the trainable part of our neural network architecture. We have designed six neural network models with different layer types. We have investigated different neural network layers to combine video frames, in particular average-pooling, max-pooling, product, flatten, LSTM, and bidirectional LSTM layers. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. The model with a max-pooling layer was not successful; its accuracy was the lowest of all models. The models with a flatten or product layer were very unstable; their standard deviation was very large. The most stable among all models was the one with an average-pooling layer. The models with a unidirectional LSTM and a bidirectional LSTM had a similar standard deviation of the accuracy. The model with a bidirectional LSTM had the highest accuracy among all considered models. In our opinion, this is because its internal memory cells preserve information in both directions. Those results show that models with internal memory are able to classify with a higher accuracy than models without internal memory.
   Our method may have limitations due to the chosen pretrained ANN and the low dimension of some parts of the neural layers. In future research, it is desirable to achieve higher accuracy in scene location recognition. This task may require modifying model parameters or using other architectures. It may also need other pretrained models or a combination of several pretrained models. It is also desirable that, if the ANN detects an unknown scene, it remembers it and next time recognizes a scene from the same location properly.


Acknowledgments

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.
   Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
   Computational resources were provided by the ELIXIR-CZ project (LM2018131), part of the international ELIXIR infrastructure.
    Table 3: Aggregated predictive accuracy over all 17 datasets [%]

        model                 mean     std   25%     50%     75%
        Product                43.7   38.4    4.6    32.4    85.2
        Flatten                23.6   30.8    1.0     5.1    39.6
        Average                32.2    8.1   26.5    31.5    37.1
        Max                     9.3    2.9    8.1     9.3    10.9
        LSTM                   40.7   25.2   19.7    39.9    59.4
        BidirectionalLSTM      47.8   25.1   29.6    50.5    67.7




Figure 9: Box plot with results obtained using the product model

Figure 10: Box plot with results obtained using the flatten model

Figure 11: Box plot with results obtained using the average-pooling model

Figure 12: Box plot with results obtained using the max-pooling model

Figure 13: Box plot with results obtained using the LSTM model

Figure 14: Box plot with results obtained using the bidirectional LSTM model
References
[1] Zhong, W., Kjellström, H.: Movie scene recognition with
    Convolutional Neural Networks.
    https://www.diva-portal.org/smash/get/diva2:859486/FULLTEXT01.pdf
    KTH Royal Institute of Technology (2015), pp. 5–36
[2] Simonyan, K., Zisserman, A.: Very Deep Convolutional
    Networks for Large-Scale Image Recognition.
    https://arxiv.org/pdf/1409.1556v6.pdf Visual Ge-
    ometry Group, Department of Engineering Science, Univer-
    sity of Oxford (2015)
[3] Russakovsky O., Deng J., Hao S., Krause J., Satheesh S.,
    Ma Sean, Huang Z., Karpathy A., Khosla A., Bernstein M.,
    Berg A. C., Fei-Fei L.: ImageNet Large Scale Visual Recog-
    nition Challenge. International Journal of Computer Vision
    115 (2015), pp. 211–252
[4] Li-jia L., Hao S., Fei-fei L., Xing E.: A High-Level Image
    Representation for Scene Classification & Semantic Feature
    Sparsification.
    https://cs.stanford.edu/groups/vision/pdf/
    LiSuXingFeiFeiNIPS2010.pdf NIPS (2010)
[5] Gers F. A., Schmidhuber J., Cummins F.: Learning to forget: Continual
    prediction with LSTM. In: Proceedings of ICANN, ENNS (1999), pp. 850–855
[6] Benavoli A., Corani G., Mangili F.: Should We Really Use
    Post-Hoc Tests Based on Mean-Ranks? Journal of Machine
    Learning Research 17 (2016), pp. 1–10
[7] García S., Herrera F.: An Extension on "Statistical Comparisons of
    Classifiers over Multiple Data Sets" for all Pairwise Comparisons.
    Journal of Machine Learning Research 9 (2008), pp. 2677–2694
[8] Graves, A.: Supervised Sequence Labelling with Recurrent
    Neural Networks. Springer (2012)
[9] Kaiming H., Xiangyu Z., Shaoqing R., Jian S.: Deep Resid-
    ual Learning for Image Recognition. 2016 IEEE Conference
    on Computer Vision and Pattern Recognition (2016), pp.
    770–778
[10] Sudha V., Ganeshbabu T. R.: A Convolutional Neural
    Network Classifier VGG-19 Architecture for Lesion Detec-
    tion and Grading in Diabetic Retinopathy Based on Deep
    Learning.
    http://www.techscience.com/cmc/v66n1/40483
    Computers, Materials & Continua (2021), pp. 827–842
[11] Zhou B., Lapedriza A., Khosla A., Oliva A., Torralba A.:
    Places: A 10 Million Image Database for Scene Recogni-
    tion. IEEE Transactions on Pattern Analysis and Machine
    Intelligence (2018), pp. 1452–1464