<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Video Scene Location Recognition with Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukáš Korel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Pulc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiří Tumpach</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Mathematics and Physics, Charles University</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Academy of Sciences of the Czech Republic</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper provides an insight into the possibility of scene recognition from a video sequence with a small set of repeated shooting locations (such as in television series) using artificial neural networks. The basic idea of the presented approach is to select a set of frames from each scene, transform them by a pre-trained single-image pre-processing convolutional network, and classify the scene location with subsequent layers of the neural network. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. We have investigated different neural network layers to combine individual frames, particularly AveragePooling, MaxPooling, Product, Flatten, LSTM, and Bidirectional LSTM layers. We have observed that only some of the approaches are suitable for the task at hand.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>People watching videos are able to recognize where the current scene is located. When watching a film or series, they are able to recognize that a new scene takes place in a location they have already seen. Finally, people are able to understand the hierarchy of scenes. All this supports human comprehensibility of videos.</p>
      <p>The role of location identification in scene recognition by humans motivated our research into scene location classification by artificial neural networks (ANNs). A more ambitious goal would be to make a system able to remember unknown video locations and to use this data to identify video scenes located there and mark them with the same label. This paper reports work in progress in that direction. It describes the employed methodology and presents first experimental results obtained with six kinds of neural networks.</p>
      <p>The rest of the paper is organized as follows. The next section reviews existing approaches to this problem. Section 3 is divided into two parts: the first concerns data preparation before its use in ANNs, the second the design of the ANNs in our experiments. Finally, Section 4, the last section before the conclusion, presents the results of our experiments with these ANNs.</p>
      <p>Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2 ANN-Based Scene Classification</title>
      <p>
        The problem of scene classification has been studied for many years. There are many approaches based on neural networks, in which an ANN learns from a huge number of images to recognize the type of a given scene (for example, a kitchen, a bedroom, etc.). Several datasets are available for this case. One example is [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], but it does not specify
locations, so this and similar datasets are not usable for our
task.
      </p>
      <p>However, our classification problem is different. We want to train an ANN able to recognize a particular location (for example, “Springfield-EverGreenTerrace-742floor2-bathroom”), which can be recorded by a camera from many angles (typically, some objects can be occluded by other objects from some angles).</p>
      <p>
        One approach using an ANN to solve this task is described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where convolutional networks were used. The difference from our approach lies, on the one hand, in the extraction and usage of video images and, on the other hand, in the types of ANN layers.
      </p>
      <p>
        Another approach is described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The authors propose a high-level image representation, called Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors. Leveraging the Object Bank representation, good performance on high-level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <sec id="sec-3-1">
        <title>3.1 Data Preparation</title>
        <p>Video data consists of large video files. Therefore, the first
task of video data preparation consists in loading the data
that is currently needed.</p>
        <p>We have evaluated the distribution of the data used for ANN training. We have found that some scenes have a low occurrence, whereas others occur up to 30 times more frequently. Hence, the second task of video data preparation is to increase the uniformity of their distribution, to prevent biasing the ANN towards the most frequent classes. This is achieved by undersampling the frequent classes in the training data.</p>
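One simple way to implement such undersampling can be sketched as follows (an illustrative sketch; the paper does not specify the exact scheme, and `scenes_by_class` is a hypothetical mapping from a location label to its scenes):

```python
import numpy as np

def undersample(scenes_by_class: dict, seed: int = 0) -> dict:
    """Cap every class at the size of the rarest class."""
    rng = np.random.default_rng(seed)
    cap = min(len(scenes) for scenes in scenes_by_class.values())
    # Sample without replacement so no scene is duplicated.
    return {label: list(rng.choice(scenes, size=cap, replace=False))
            for label, scenes in scenes_by_class.items()}

balanced = undersample({"kitchen": ["k1", "k2", "k3", "k4"], "hallway": ["h1", "h2"]})
```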
        <p>The input consists of video files and a text file. The video files are divided into independent episodes. The text file contains manually created metainformation about every scene, one scene per row. A scene is understood as a sequence of frames that is not interrupted by another frame with a different scene location label. Every row contains a relative path to the source video file, the frame number where the scene begins, and the count of its frames. Figure 1 outlines how frames are extracted and prepared for the ANNs. For ANN training, we select from each target scene a constant count of 20 frames (denoted # frames in Figure 1). To get the most informative representation of the considered scene, frames for sampling are taken from the whole length of the scene. This, in particular, prevents selecting frames only within a short time interval. Each scene has its own frame distance computed from its frame count: SL = SF / F, where SF is the count of scene frames, F is the considered constant count of selected frames, and SL is the distance between two selected frames in the scene. After frame extraction, every frame is reshaped to a 3D input matrix for the ANN. Finally, the reshaped frames are merged into one input matrix for the neural network.</p>
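The sampling rule above can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def select_frame_indices(sf: int, f: int = 20) -> np.ndarray:
    """Pick f frame indices spread evenly over a scene with sf frames.

    SL = SF / F is the distance between two selected frames.
    """
    sl = sf / f  # frame distance SL
    # Evenly spaced indices covering the whole scene length.
    return (np.arange(f) * sl).astype(int)

indices = select_frame_indices(200)  # SL = 10 -> frames 0, 10, 20, ..., 190
```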
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Used Neural Networks and Their Design</title>
        <p>Our first idea was to create a complex neural network composed of different layers. However, it had too many parameters to train in view of the amount of data that we had. Therefore, we have decided to use transfer learning from a pretrained network.</p>
        <p>
          Because our data are actually images, we considered only ANNs pretrained on image datasets, in particular ResNet50 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], ResNet101 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and VGGnet [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Finally, we have decided to use VGGnet due to its small size.
        </p>
        <p>Hence, the ANNs which we trained on our data are composed of two parts. The first part, depicted in Figure 2, is based on the VGGnet. At the input, we have 20 frames (resolution 224 × 224, BGR colors) from one scene. These are processed by a pretrained VGG19 neural network without its two top layers, which were removed for transfer learning. Its output is a vector of size 4096, so for the 20 input frames we obtain 20 vectors of size 4096. These vectors are merged into a 2D matrix of size 20 × 4096.</p>
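A possible Keras sketch of this frozen first part (an illustration, not the authors' exact code; whether the retained 4096-dimensional output comes from VGG19's fc1 or fc2 layer is our assumption, and `weights=None` below merely avoids downloading the ImageNet weights the paper uses):

```python
import numpy as np
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model

# Full VGG19; the paper uses pretrained ImageNet weights (weights="imagenet"),
# weights=None here only avoids the download in this sketch.
base = VGG19(weights=None, input_shape=(224, 224, 3))
# Drop the two top layers (fc2 and predictions, an assumption) and
# keep the 4096-dimensional fc1 output.
extractor = Model(inputs=base.input, outputs=base.get_layer("fc1").output)
extractor.trainable = False  # transfer learning: this part is not trained

frames = np.zeros((20, 224, 224, 3), dtype=np.float32)  # 20 frames of one scene
features = extractor.predict(frames, verbose=0)         # 2D matrix, 20 x 4096
```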
        <p>For the second part, forming the upper layers of the final network, we have considered six possibilities: a product layer, a flatten layer, an average-pooling layer, a max-pooling layer, an LSTM layer and a bidirectional LSTM layer. All of them, as well as the VGGnet, are described below. Each of the listed layers is preceded by a Dense layer. The Dense layer returns a 20 × 12 matrix, where the number 12 is equal to the number of classes. Every model works differently with this output.</p>
        <p>
          VGGnet The VGGNets [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] were originally developed for object recognition and detection. They have deep convolutional architectures with small sizes of the convolutional kernel (3 × 3), stride (1 × 1), and pooling window (2 × 2). There are different network structures, ranging from 11 to 19 layers. The model capability increases with network depth, but at the cost of heavier computation.
        </p>
        <p>
          We have used the VGG19 model (VGG network with
19 layers) from the Keras library in our case. This model
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] won 1st place in the object localization category and 2nd place in the image classification category of the 2014 ImageNet Large Scale Visual Recognition Challenge. It achieves 92.7% top-5 test accuracy in image classification on the ImageNet dataset, which contains 14 million images belonging to 1000 classes. The architecture of the VGG19 model is depicted in Figure 3.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2.1 Product array</title>
        <p>In this approach, we apply a product array layer to all output vectors from the dense layer. A product array layer computes the product of all values along a chosen dimension of an n-dimensional array and returns an (n−1)-dimensional array.</p>
        <p>[Figure 4 schema: Input Layer [#frames × 4096] → Dense [#frames × #locationClasses] → Product [#locationClasses]]</p>
        <p>A model with a product layer is outlined in Figure 4. The output from a product layer is one number for each class, i.e. scene location, so our result is a vector with 12 numbers. It returns a probability distribution over the set of scene locations.</p>
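The product combination can be illustrated with NumPy (shapes follow the text: 20 frames, 12 classes; the normalization at the end is our illustration of how a distribution over locations is obtained):

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-frame class scores from the dense layer: one row per frame.
frame_scores = rng.uniform(0.1, 1.0, size=(20, 12))

# Product over the frame axis: one number per location class.
combined = np.prod(frame_scores, axis=0)        # vector with 12 numbers

# Normalize into a probability distribution over the scene locations.
distribution = combined / combined.sum()
```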
      </sec>
      <sec id="sec-3-4">
        <title>3.2.2 Flatten</title>
        <p>In this approach, we apply a flatten layer to all output vectors from the dense layer. A flatten layer creates one long vector from a matrix by placing all its rows in sequence.</p>
        <p>A model with a flatten layer is outlined in Figure 5. After the input and a dense layer, a flatten layer follows, which in this case returns a long vector with 12 × 20 numbers. It is followed by a second dense layer, whose output again has a dimension equal to the number of classes and returns a probability distribution over the set of scene locations.</p>
        <p>Average pooling In this approach, we apply average pooling to all output vectors from the dense-layer part of the network (Figure 6). An average-pooling layer computes the average of the values assigned to subsets of its preceding layer that are such that:
• they partition the preceding layer, i.e., that layer equals their union and they are mutually disjoint;
• they are identically sized.</p>
        <p>Taking into account these two conditions, the size p1 × … × pD of the preceding layer and the size r1 × … × rD of the sets forming its partition determine the size of the average-pooling layer.</p>
        <p>In this case, the size of the sets forming the average-pooling layer's partition is 20 × 1. Using this size in the average-pooling layer, we again get one number for each class, which yields a probability distribution over the set of scene locations.</p>
        <p>Apart from average pooling, we have also tried max pooling. However, it led to substantially worse results. Its classification of the scene location was typically based on people or items in the foreground, not on the scene as a whole.</p>
        <p>Although the average-pooling layer is simple, it gives acceptable results. The number of trainable parameters of the network is then low, which makes it suitable for our comparatively small dataset.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.2.4 Long Short Term Memory</title>
        <p>
          An LSTM layer is used for the classification of sequences of feature vectors, or equivalently, of multidimensional time series with discrete time. Alternatively, that layer can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes, in our case of scene locations. LSTM layers are intended for recurrent signal propagation, and differently from other commonly encountered layers, they consist not of simple neurons, but of units with their own inner structure. Several variants of such a structure have been proposed (e.g., [
          <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
          ]), but all of them include at least the following four components:
• Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of the unit inputs and of the values coming through recurrent connections.
• The input gate controls the extent to which values from the previous unit within the layer or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of the input and recurrent connections, though its bias and synaptic weights are specific and in general different from the bias and synaptic weights of the memory cell.
• The forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of the input and recurrent connections.
• The output gate controls the extent to which the memory cell state influences the unit output. This gate also has a sigmoidal activation function, which is applied to a specific biased linear combination of the input and recurrent connections, and is subsequently composed either directly with the cell state or with its sigmoidal transformation, using a different sigmoid than is used by the gates.
        </p>
        <p>Hence, using LSTM layers is a more sophisticated approach compared to simple average pooling. An LSTM layer can keep a hidden state through time with information about previous frames.</p>
        <p>Figure 7 shows that the input to an LSTM layer is a 2D matrix. Its rows are ordered by the time of the frames from the input scene. Every input frame is represented in the network by one vector. The output from the LSTM layer is a vector of the same size as in the previous approaches, which returns a probability distribution over the set of scene locations.</p>
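Under our reading of Figure 7, the trainable second part with an LSTM can be sketched in Keras as follows (layer sizes follow the text: 20 frames, 4096 features, 12 location classes; other hyperparameters are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_features, n_classes = 20, 4096, 12

inputs = tf.keras.Input(shape=(n_frames, n_features))   # VGG19 features per frame
# Dense applied to every frame separately: output matrix n_frames x n_classes.
per_frame = layers.TimeDistributed(layers.Dense(n_classes))(inputs)
# The LSTM consumes the frame sequence in time order and returns one vector.
hidden = layers.LSTM(n_classes)(per_frame)
# Probability distribution over the 12 scene locations.
outputs = layers.Softmax()(hidden)
model = tf.keras.Model(inputs, outputs)
```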
      </sec>
      <sec id="sec-3-6">
        <title>3.2.5 Bidirectional Long Short Term Memory</title>
        <p>An LSTM, due to its hidden state, preserves information from inputs that have already passed through it. A unidirectional LSTM only preserves information from the past, because the only inputs it has seen are from the past. A bidirectional LSTM runs over the inputs in two ways, one from the past to the future and one from the future to the past. To this end, it combines two hidden states, one for each direction.</p>
        <p>[Figure schema: Dense [#frames × #locationClasses] → LSTM [#locationClasses]]</p>
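A Keras sketch of a bidirectional variant (sizes from the text: 20 frames, 4096 features, 12 classes; merging the two directions by concatenation and the final dense layer are our assumptions, since the paper does not detail them):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_frames, n_features, n_classes = 20, 4096, 12

inputs = tf.keras.Input(shape=(n_frames, n_features))
per_frame = layers.TimeDistributed(layers.Dense(n_classes))(inputs)
# One LSTM per direction; their final hidden states are concatenated.
both_ways = layers.Bidirectional(layers.LSTM(n_classes),
                                 merge_mode="concat")(per_frame)
# Map the merged states to a distribution over the scene locations.
outputs = layers.Dense(n_classes, activation="softmax")(both_ways)
model = tf.keras.Model(inputs, outputs)
```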
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <sec id="sec-4-1">
        <title>4.1 Experimental Setup</title>
        <p>The ANNs for scene location classification were implemented in Python using the TensorFlow and Keras libraries. Neural network training was accelerated using an NVIDIA GPU. The versions of the employed hardware and software are listed in Table 1. For image preparation, OpenCV and NumPy were used. The routine for preparing frames is a generator. It has lower capacity requirements, because data are loaded just in time, when they are needed, and memory is released after the data have been used by the ANN. All non-image information about the inputs (video location, scene information, etc.) is processed in text format by Pandas.</p>
        <p>We have 17 independent datasets prepared by ourselves from proprietary videos of The Big Bang Theory series; thus, the datasets cannot be made public. Each dataset originates from one episode of the series. Each experiment was trained with one dataset, so the results are independent as well, and we can compare the behavior of the models across different datasets.</p>
        <p>
          Our algorithm for selecting data in the training routine is based on oversampling. It randomly selects a target class and then randomly selects, with replacement, a source scene of that class from the whole training dataset. This algorithm is applied due to the unbalanced proportion of different target classes. Thanks to this method, all targets are distributed equally and the network does not overfit a highly represented class. The differences between the models considered in the second, trained part of the network were tested for significance by the Friedman test. The basic null hypothesis that the mean classification accuracy coincides for all 6 models was strongly rejected, with the achieved significance p = 2.8 × 10<sup>−13</sup>. For the post-hoc analysis, we employed the Wilcoxon signed rank test with a two-sided alternative for all 15 pairs of the considered models, because of the inconsistency of the more commonly used mean-ranks post-hoc test, which Benavoli et al. recently pointed out [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For the correction for multiple hypothesis testing, we used the Holm method [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The results of the comparisons between the models are included in Table 2.
        </p>
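The testing procedure can be sketched with SciPy as follows (the accuracies below are synthetic stand-ins; the real per-episode values are summarized in Table 3):

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(2)
models = ["max", "flatten", "product", "avg", "lstm", "bilstm"]
# Synthetic per-episode accuracies: 6 models x 17 episodes.
acc = {m: rng.uniform(0.1, 0.5, size=17) + i * 0.05 for i, m in enumerate(models)}

# Friedman test of the null hypothesis that all models perform equally.
_, p_friedman = friedmanchisquare(*acc.values())

# All 15 pairwise two-sided Wilcoxon signed-rank tests.
pairs = list(combinations(models, 2))
pvals = np.array([wilcoxon(acc[a], acc[b]).pvalue for a, b in pairs])

# Holm step-down correction: sort p-values, scale by the number of
# remaining hypotheses, enforce monotonicity, cap at 1.
order = np.argsort(pvals)
m = len(pvals)
adjusted = np.minimum(np.maximum.accumulate(pvals[order] * (m - np.arange(m))), 1.0)
```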
        <p>Summary statistics of the predictive accuracy of classification on all 17 episode datasets are in Table 3. Every experiment was performed on every dataset at least 7 times. The table is complemented with results for individual episodes, depicted in box plots.</p>
        <p>The model with a max-pooling layer had the worst results (Figure 12) of all experiments. Its overall mean accuracy was around 10 %, which is only slightly higher than random choice, i.e. 1/12. The model was not able to achieve better accuracy than 20 %. Its results were stable and its standard deviation was very low.</p>
        <p>Slightly better results (Figure 10) were obtained by the model with a flatten layer; it was sometimes able to achieve a high accuracy, but its standard deviation was very high. On the other hand, its results for some other episodes were not better than those of the max-pooling model.</p>
        <p>A better solution is the product model, whose predictive accuracy (Figure 9) was higher than 80 % for several episodes. On the other hand, for other episodes it had only slightly better results than the flatten model, and it had the highest standard deviation among all considered models.</p>
        <p>The most stable results (Figure 11) with good accuracy were obtained by the model based on an average-pooling layer. Its mean accuracy was 32 %, and for no episode was the accuracy substantially different.</p>
        <p>The model with a unidirectional LSTM layer had the second highest mean accuracy among the considered models (Figure 13), over 40 %. Its internal memory gives it an advantage over the previous approaches, though also a comparatively high standard deviation.</p>
        <p>The model with a bidirectional LSTM layer had the highest mean accuracy (Figure 14). It had a standard deviation similar to the one with a unidirectional LSTM, but a mean accuracy of nearly 50 %.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and Future Research</title>
      <p>This paper provided an insight into the possibility of using artificial neural networks for scene location recognition from a video sequence with a small set of repeated shooting locations (such as in television series). Our idea was to select more than one frame from each scene and classify the scene using that sequence of frames. We used a pretrained VGG19 network without its two last layers. Its outputs were used as the input to the trainable part of our neural network architecture. We have designed six neural network models with different layer types. We have investigated different neural network layers to combine video frames, in particular average-pooling, max-pooling, product, flatten, LSTM, and bidirectional LSTM layers. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. The model with a max-pooling layer was not successful; its accuracy was the lowest of all models. The models with a flatten or product layer were very unstable; their standard deviation was very large. The most stable among all models was the one with an average-pooling layer. The models with unidirectional and bidirectional LSTM had a similar standard deviation of the accuracy. The model with a bidirectional LSTM had the highest accuracy among all considered models. In our opinion, this is because its internal memory cells preserve information in both directions. These results show that models with internal memory are able to classify with a higher accuracy than models without internal memory.</p>
      <p>Our method may have limitations due to the chosen pretrained ANN and the low dimension of some parts of the neural layers. In future research, it is desirable to achieve higher accuracy in scene location recognition. This task may need modifying the model parameters or using other architectures. It may also need other pretrained models or a combination of several pretrained models. It is also desirable that, if the ANN detects an unknown scene, it remembers it and next time properly recognizes a scene from the same location.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.</p>
      <p>Computational resources were supplied by the project
"e-Infrastruktura CZ" (e-INFRA LM2018140) provided
within the program Projects of Large Research,
Development and Innovations Infrastructures.</p>
      <p>Computational resources were provided by the
ELIXIRCZ project (LM2018131), part of the international
ELIXIR infrastructure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kjellström</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Movie scene recognition with Convolutional Neural Networks</article-title>
          . https://www.diva-portal.org/smash/get/diva2:859486/FULLTEXT01.pdf KTH Royal Institute of Technology (
          <year>2015</year>
          )
          <fpage>5</fpage>
          -
          <lpage>36</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          . https://arxiv.org/pdf/1409.1556v6.pdf Visual Geometry Group, Department of Engineering Science, University of Oxford (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Russakovsky</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>115</volume>
          (
          <year>2015</year>
          ), pp.
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A High-Level Image Representation for Scene Classification &amp; Semantic Feature Sparsification</article-title>
          . https://cs.stanford.edu/groups/vision/pdf/LiSuXingFeiFeiNIPS2010.pdf
          <source>NIPS</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Gers</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cummins</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Learning to forget: Continual prediction with LSTM</article-title>
          ,
          <source>in Proceedings of ICANN. ENNS</source>
          (
          <year>1999</year>
          ), pp.
          <fpage>850</fpage>
          --
          <lpage>855</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Benavoli</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corani</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mangili</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Should We Really Use Post-Hoc Tests Based on Mean-Ranks?</article-title>
          <source>Journal of Machine Learning Research</source>
          <volume>17</volume>
          (
          <year>2016</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>García</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons</article-title>
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          ), pp.
          <fpage>2677</fpage>
          -
          <lpage>2694</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Supervised Sequence Labelling with Recurrent Neural Networks</article-title>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition</source>
          (
          <year>2016</year>
          ), pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sudha</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganeshbabu</surname>
            <given-names>T</given-names>
          </string-name>
          . R.:
          <article-title>A Convolutional Neural Network Classifier VGG-19 Architecture for Lesion Detection and Grading in Diabetic Retinopathy Based on Deep Learning</article-title>
          . http://www.techscience.com/cmc/v66n1/40483
          <source>Computers, Materials &amp; Continua</source>
          (
          <year>2021</year>
          ), pp.
          <fpage>827</fpage>
          -
          <lpage>842</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Zhou</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapedriza</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Places: A 10 Million Image Database for Scene Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>1452</fpage>
          -
          <lpage>1464</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>