<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization of computational complexity of an artificial neural network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolay Vershkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktor Kuchukov</string-name>
          <email>or@list.ru</email>
          <email>zzvkuchukov@ncfu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalia Kuchukova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolay Kucherov</string-name>
          <email>ynkucherov@ncfu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egor Shiriaev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>North-Caucasus Center for Mathematical Research, North-Caucasus Federal University</institution>
          ,
          <addr-line>1, Pushkin Street, 355017, Stavropol</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>The article deals with the modelling of Artificial Neural Networks as an information transmission system in order to optimize their computational complexity. The analysis of existing theoretical approaches to optimizing the structure and training of neural networks is carried out. In the process of constructing the model, the well-known problem of isolating a deterministic signal against a background of noise is considered and adapted to the problem of assigning an input implementation to a certain cluster. A layer of neurons is considered as an information transformer with a kernel for solving a certain class of problems: orthogonal transformation, matched filtering, and nonlinear transformation for recognizing the input implementation with a given accuracy. Based on the analysis of the proposed model, it is concluded that it is possible to reduce the number of neurons in the layers of the neural network and to reduce the number of features used to train the classifier.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The necessity to represent functions of n variables as superpositions of functions of a smaller number of variables arose in connection with the development of neural network theory and practice. The appearance of neural networks is associated with an article by McCulloch et al. [<xref ref-type="bibr" rid="ref1">1</xref>], which describes a mathematical model of a neuron and a neural network. It has been proven that both Boolean functions and finite state machines can be represented by neural networks. Later, a serious mathematical analysis of perceptrons revealed limitations on the area of their applicability; these restrictions were subsequently weakened by replacing the threshold activation functions of neurons with sigmoid ones. The theoretical basis of Artificial Neural Networks (ANNs) was the Kolmogorov-Arnold theorem, proved as a result of a scientific discussion [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>], which showed the possibility of representing a continuous real function of n variables $f(x_1, x_2, \ldots, x_n)$ as a superposition of functions of a smaller number of variables. A significant theoretical development of ANNs was the proof of the Hecht-Nielsen theorem [<xref ref-type="bibr" rid="ref4">4</xref>], which showed, in a non-constructive form, the possibility of approximating a function of several variables with a given accuracy by an ANN with one hidden layer. Interest in deep networks stems from the limitations of the perceptron. The use of multilayer networks was initially limited by the complexity of their training; due to the ideas of Hinton's team, training multilayer ANNs became possible [<xref ref-type="bibr" rid="ref5">5</xref>]. Multilayer networks enabled solving problems of classification, extrapolation, feature extraction, etc. under conditions of high uncertainty, i.e. obtaining satisfactory results with a sufficiently small training sample. Thus, the modern theory of ANNs is based on a vector (geometric) approach [<xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>].
      </p>
      <p>
        An interesting approach to the problem of constructing image recognition systems of optimal architecture was proposed by Rao et al. [<xref ref-type="bibr" rid="ref8">8</xref>]. They proposed to present the pattern recognition system functionally in the form of two blocks: feature selection and a trained classifier. The selection of features is carried out using orthogonal transformations of the input signal, and, in order to increase the efficiency of the system, a method for decreasing the dimension of the feature vector used to train the classifier is proposed. The ANN input receives a sequence of values $\{X_i\} = (x_{i1}, x_{i2}, \ldots, x_{in})$, which can be represented as discrete samples of some continuous function $x(t)$. The ANN output is a sequence $\{Y_i\} = (y_{1i}, y_{2i}, \ldots, y_{mi})$, which can also represent discrete samples of a function $y(t)$. Here $i$ is the number of the sample, and $m$ and $n$ are the dimensions of the output and input samples of the sequences, respectively. This approach makes it possible to study the output sequence of the ANN not only as a geometric interpretation of the input samples, but also to consider the information interaction of the layers of complex (deep) ANNs using the mathematical apparatus of information transmission theory.
      </p>
      <p>
        We developed the ideas of Ahmed and Rao, widely applying the methods of decoding and separating signals against a background of noise used in information transmission theory [<xref ref-type="bibr" rid="ref9">9</xref>]. However, the use of the discrete Fourier transform (DFT) did not yield a significant gain in reducing the computational complexity of the ANN [<xref ref-type="bibr" rid="ref9">9</xref>]. Despite the significant improvement in ANN algorithms and the reduction of training time by tens and hundreds of times, a number of problems remain in this area that require theoretical comprehension. First of all, these include a significant computational load and, as a result, a significant training time.
      </p>
      <p>The solution of the above problem using existing theoretical approaches is difficult; the proposed ANN model in the form of an information transmission system will allow studying the interaction of layers and significantly reduce the complexity of training.</p>
      <p>
        Within the framework of this article, the following restrictions apply. We did not seek to consider all the diversity of ANN architectures, but limited ourselves to feed-forward networks. This article did not set the task of a complete study of the ANN or of obtaining practically significant results; rather, it considers the problem of constructing a mathematical model of the ANN based on the parallel digital processing of input information, presented as a random process containing a deterministic signal that must be attributed to a certain class. We suggest that such an approach will significantly reduce the cost of training ANNs and thereby increase the efficiency of their application.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Feature selection as an orthogonal transformation of the input vector</title>
      <p>
        Studying the problem of feature selection, we relied on the provisions formulated by Rao et al. [<xref ref-type="bibr" rid="ref8">8</xref>]. The ANN input vector $\{X_i\} = (x_{i1}, x_{i2}, \ldots, x_{in})$ is considered as a discretization of the continuous signal $x(t)$, subject to the provisions of the Kotelnikov theorem (better known abroad as the Nyquist-Shannon theorem). The output signal $y(t)$ can also be represented by discrete values $\{Y_i\} = (y_{1i}, y_{2i}, \ldots, y_{mi})$. Investigating the ANN, we proceed from the classical scheme of an information transmission system using wideband signals [<xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>]. The transmission of broadband (complex) signals is characterized by the shape of a time-frequency matrix [<xref ref-type="bibr" rid="ref11">11</xref>]. In [<xref ref-type="bibr" rid="ref6">6</xref>], three types of time-frequency matrices are distinguished: parallel, serial, and serial-parallel. A signal with a serial-parallel matrix is fed to the input of the ANN.
      </p>
      <p>
        Let the function $x(t)$ be a complex signal consisting of $i$ variants, each of which is encoded with a sequence of $n$ symbols. At the output, we get the function $y(t)$, consisting of $i$ variants, each of length $m$. Considering the principles of complex signal analysis, it is more convenient to represent them in a generalized spectral form, i.e. each variant of the signal can be represented in the form [<xref ref-type="bibr" rid="ref10 ref11 ref5">5, 10, 11</xref>]:
      </p>
      <p>
        $$\begin{cases} x_r(t) = \sum_{k=k_{r1}}^{k_{r2}} a_{kr}\,\varphi_k(t) \\ y_l(t) = \sum_{k=k_{l1}}^{k_{l2}} a_{kl}\,\varphi_k(t) \end{cases}, \quad t \in [0, T] \qquad (1)$$
      </p>
      <p>
        where $T = n\,\Delta t_x = m\,\Delta t_y$, and the expansion coefficients are
      </p>
      <p>
        $$\begin{cases} a_{kr} = \left[ \int_0^T \varphi_k^2(t)\,dt \right]^{-1} \int_0^T x_r(t)\,\varphi_k(t)\,dt \\ a_{kl} = \left[ \int_0^T \varphi_k^2(t)\,dt \right]^{-1} \int_0^T y_l(t)\,\varphi_k(t)\,dt \end{cases} \qquad (2)$$
      </p>
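      <p>
        As a numerical illustration (ours, not from the original article), the integrals in (2) can be approximated by sums over the sampling grid. The sketch below assumes a cosine basis on $[0, T]$; the signal and all names are placeholders.
      </p>
      <preformat>
import numpy as np

# Sketch of expression (2): expansion coefficients of a sampled signal
# over an orthogonal basis, with the integrals replaced by sums.
T, n = 1.0, 256
t = np.linspace(0.0, T, n, endpoint=False)
dt = T / n

def phi(k, t):
    # Cosine basis, orthogonal on [0, T] (our assumption).
    return np.cos(2 * np.pi * k * t / T)

x = 0.7 * phi(3, t) + 0.2 * phi(5, t)   # test realization x_r(t)

def coeff(x, k):
    # a_k = [ integral of phi_k^2 ]^{-1} * integral of x * phi_k
    return (np.sum(x * phi(k, t)) * dt) / (np.sum(phi(k, t) ** 2) * dt)

print([round(coeff(x, k), 3) for k in range(8)])
# approximately 0.7 at k = 3 and 0.2 at k = 5, near zero elsewhere
      </preformat>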
      <p>
        The coordinate functions $\varphi_k(t)$ satisfy the orthogonality condition [<xref ref-type="bibr" rid="ref11">11</xref>]. Representation (1) makes it possible to understand the procedure for processing complex signals not only in the time domain, but also in the time-frequency domain, as happens with the classical ANN design methods. Using orthogonal transformations, the features (the spectrum) of the input signal are selected and, after processing, can be used to train the classifier. Thus, the ANN in the form of an information transmission system (ITS) performs the transformation $y(t_i) = F(x(t_i))$, where $x(t_i) = \sum_{i=1}^{n} x_i(i\,\Delta t_x)$ and $y(t_i) = \sum_{i=1}^{n} y_i(i\,\Delta t_y)$, and the functional $F$ is the subject of this article.
      </p>
      <p>
        The classical approach (the McCulloch-Pitts model [<xref ref-type="bibr" rid="ref1">1</xref>]) considers a mathematical model of a neuron in the form $y_{k,l} = f\left(\sum_{i=1}^{n} w_i^{k,l} x_i^{k,l}\right)$, where $k, l$ are the number of the layer and the number of the neuron in the layer, $y_{k,l}$ is the output of the neuron, $x_i^{k,l}$ are the inputs of the neuron, $w_i^{k,l}$ are the weights (synapses) of the input signals, and $f$ is the output function of the neuron, which may or may not be linear. The transformation of a signal in a neuron can be considered both in the traditional sense, as an algebraic sum of products of input signals and weights (an adaptive adder), and in the sense of expressions (1), (2). Indeed, if the set of weights of the k-th neuron of the i-th layer $\{w_{ki}\}_{k=0,1,2,\ldots}$ is a discrete representation of the i-th orthogonal function $\varphi_i(t)$, then the output will be
      </p>
      <p>
        $$a_i = \sum_{k=0}^{n} x(k\,\Delta t)\,\varphi_i(k\,\Delta t) \qquad (3)$$
      </p>
      <p>
        Applying to expression (3) a normalizing coefficient of the form $k = \left[ \int_0^T \varphi_i^2(t)\,dt \right]^{-1}$, denoting
      </p>
      <p>
        $$w_i(k\,\Delta t) = \frac{\varphi_i(k\,\Delta t)}{\int_0^T \varphi_i^2(t)\,dt} \qquad (4)$$
      </p>
      <p>
        and summing over the period, we arrive at expression (2).
      </p>
      <p>
        Let us consider the application of the orthogonal transform based on the widely used Fourier transform [<xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>]. In some cases, however, instead of trigonometric functions, other kernels are more appropriate, such as the Laguerre, Legendre, Hermite, Walsh, Chebyshev, or Hadamard functions. Since the input and output signals are presented in discrete form, we use the variant of the Fourier transform for discrete signals, the discrete Fourier transform (DFT).
      </p>
      <p>Thus, choosing the weights $w_i(k\,\Delta t)$ in accordance with (4), at the output of the i-th neuron we obtain the value of one of the spectral components $a_i$. A layer of neurons with weights $w_i(k\,\Delta t)$ selected for the spectral components with serial numbers $i = 0, 1, 2, 3, \ldots$ gives, as a result of the transformation, the spectrum of the input signal with a given accuracy.</p>
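      <p>
        To make the correspondence between the McCulloch-Pitts adder and expressions (3), (4) concrete, the following sketch (our illustration, with an assumed cosine basis and placeholder data) loads the weights of a single linear neuron with the normalized samples $w_i(k\,\Delta t)$ and shows that its output is the spectral component $a_i$.
      </p>
      <preformat>
import numpy as np

# A single linear neuron as a spectral analyzer, per (3)-(4): weights are
# samples of the i-th orthogonal function, normalized by the integral of
# its square (integrals approximated by sums times dt).
T, n = 1.0, 256
t = np.linspace(0.0, T, n, endpoint=False)
dt = T / n

def phi(i, t):
    return np.cos(2 * np.pi * i * t / T)        # assumed basis

i = 4
w = phi(i, t) / (np.sum(phi(i, t) ** 2) * dt)   # weights per expression (4)

x = 1.5 * phi(4, t) + 0.3 * phi(7, t)           # input realization
a_i = np.sum(w * x) * dt                        # adder with linear output f
print(round(a_i, 3))                            # -> 1.5
      </preformat>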
      <p>
        Let us consider in more detail the orthogonal transformation of the DFT based on the McCulloch-Pitts model [<xref ref-type="bibr" rid="ref1">1</xref>]. The DFT is based on the well-known expression for the continuous Fourier transform $X(\omega) = \int_{-\infty}^{\infty} x(t)\,e^{-j\omega t}\,dt$. Passing from the continuous form to the discrete one, we replace the integration with a summation and, introducing a restriction on the width of the signal spectrum, obtain the expression $X(m) = \sum_{i=0}^{n-1} x(i)\,e^{-j 2\pi i m/n}$. A similar expression can be implemented using complex numbers but, more conveniently, it can be reduced to the form
      </p>
      <p>
        $$X(m) = \sum_{i=0}^{n-1} x(i)\left( \cos\frac{2\pi i m}{n} - j\,\sin\frac{2\pi i m}{n} \right) = \sum_{i=0}^{n-1} x(i)\cos\frac{2\pi i m}{n} - j \sum_{i=0}^{n-1} x(i)\sin\frac{2\pi i m}{n} \qquad (5)$$
      </p>
      <p>using Euler's identity $e^{-j\theta} = \cos(\theta) - j\,\sin(\theta)$.</p>
      <p>
        Here $X(m)$ is the m-th harmonic of the DFT, $m$ is the index in the frequency domain, and $x(i)$ is a sequence of input samples of size $n$. In this case, the number of neurons in the layer must be at least $2n$ to produce the real (cosine) and imaginary (sine) parts of the complex number, which is in agreement with the Hecht-Nielsen theorem [<xref ref-type="bibr" rid="ref4">4</xref>], which requires at least $2n + 1$ neurons in the first hidden layer. If a different orthogonal transformation is used as the kernel function, then the substitution of weights is performed based on the transformation used.
      </p>
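      <p>
        The construction above can be checked directly. In the sketch below (our own; NumPy stands in for a linear ANN layer), the first $n$ rows of the weight matrix hold the cosine weights of (5) and the next $n$ rows the sine weights, and the assembled output matches the library DFT.
      </p>
      <preformat>
import numpy as np

# Sketch: a linear layer of 2n neurons implementing the DFT of (5).
# Rows 0..n-1 produce the cosine sums, rows n..2n-1 the sine sums.
n = 64
i = np.arange(n)
m = np.arange(n).reshape(-1, 1)
W_cos = np.cos(2 * np.pi * i * m / n)    # n "cosine" neurons
W_sin = np.sin(2 * np.pi * i * m / n)    # n "sine" neurons
W = np.vstack([W_cos, W_sin])            # layer weight matrix, 2n x n

x = np.random.default_rng(0).standard_normal(n)
out = W @ x                              # one forward pass, linear f
X = out[:n] - 1j * out[n:]               # assemble X(m) per (5)

assert np.allclose(X, np.fft.fft(x))     # agrees with the library DFT
      </preformat>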
      <p>
        In addition to the DFT, the discrete cosine transform (DCT) is widely used in digital signal processing practice [<xref ref-type="bibr" rid="ref8">8</xref>]; it is used to compress images in the MPEG and JPEG formats. The DCT is closely related to the Fourier transform and is a homomorphism of its vector space. Since the DCT operates with real numbers, it does not require $2n$ neurons in a layer: $n$ is enough, which halves the number of neurons in the first hidden layer compared to the DFT. To solve the problem of reducing the computational load on the ANN, we use the DCT.
      </p>
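      <p>
        For comparison, a hedged sketch of a DCT first layer (we assume the orthonormal DCT-II variant, as implemented in SciPy): $n$ real-valued neurons suffice where the DFT needed $2n$.
      </p>
      <preformat>
import numpy as np
from scipy.fft import dct

# Sketch: the first hidden layer as an n x n DCT-II matrix; n real
# neurons replace the 2n required for the complex DFT.
n = 64
k = np.arange(n).reshape(-1, 1)
i = np.arange(n)
W = np.cos(np.pi * k * (2 * i + 1) / (2 * n))   # row k = weights of neuron k
W[0] *= np.sqrt(1.0 / n)                        # orthonormal scaling
W[1:] *= np.sqrt(2.0 / n)

x = np.random.default_rng(1).standard_normal(n)
features = W @ x                                # spectrum used as features

assert np.allclose(features, dct(x, norm='ortho'))
      </preformat>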
    </sec>
    <sec id="sec-3">
      <title>3. Implementation of the classifier based on the McCulloch-Pitts model</title>
      <p>
        For successful classification of the i-th implementation of the input value, it is necessary to determine the degree of similarity of the implementation to the values that define the classification levels. The classical approach to training the classifier is based on adaptive algorithms that minimize the value of a working function [<xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>]. As the working function for supervised training of an ANN, the mean square of the neural layer error is widely used. Newton's method and the method of steepest descent, as well as their variations, are used as algorithms [<xref ref-type="bibr" rid="ref13">13</xref>].
      </p>
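      <p>
        For reference, the classical scheme described above looks roughly as follows in PyTorch (a generic sketch, not the authors' code; the data tensors are stand-ins): a mean-square working function minimized by gradient descent.
      </p>
      <preformat>
import torch
import torch.nn as nn

# Generic sketch of the classical approach: minimize the mean square of
# the layer error by gradient descent (placeholder data, one layer).
n_features, n_classes = 784, 10
X = torch.randn(256, n_features)                 # stand-in training batch
Y = torch.eye(n_classes)[torch.randint(0, n_classes, (256,))]  # one-hot

layer = nn.Linear(n_features, n_classes, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                           # mean-square working function

for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(layer(X), Y)
    loss.backward()
    opt.step()
      </preformat>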
      <p>
        Since the proposed model makes wide use of digital signal processing methods, the implementation of the classifier is based on the criteria of optimal filtering. A similar approach was considered in [<xref ref-type="bibr" rid="ref9">9</xref>], where the correlation function of the ANN output signal and the training sequence was proposed as the working function. Let each realization of the input value be a deterministic signal, namely the average value of all realizations from the training sample belonging to a certain class, mixed with additive noise, i.e. $X_i = Z_j + N$. Here $X_i$ is the input implementation belonging to the j-th class, $Z_j = \frac{1}{p}\sum_{i=0}^{p-1} X_i$, where $p$ is the number of input implementations belonging to the j-th class, and $N$ is a random process with a Gaussian distribution. In information transmission theory, the ratio of the maximum value of the input implementation to the standard deviation of the noise at the output is used as the characteristic that determines the assignment of the implementation to a certain class [<xref ref-type="bibr" rid="ref10">10</xref>]:
      </p>
      <p>
        $$q = \frac{\max(X_i)}{\sigma_{out}} \qquad (6)$$
      </p>
      <p>
        The optimal filter tends to maximize the value of (6), i.e. $q \to \max$. After passing through the first hidden layer, the input implementation is presented in the form of a spectrum, $X_i^{out}(\omega) = X_i(\omega)\,H(\omega)$, where $H(\omega)$ is the frequency response of the output layer. Omitting the intermediate calculations [<xref ref-type="bibr" rid="ref10">10</xref>], we get:
      </p>
      <p>
        $$q_{opt} = \sqrt{ \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{|X(\omega)|^2}{N(\omega)}\,d\omega } \qquad (7)$$
      </p>
      <p>
        Based on the Bunyakovsky-Schwarz inequality and expression (7), we obtain the complex frequency response of the optimal circuit:
      </p>
      <p>
        $$H(\omega) = k\,\frac{X^{*}(\omega)\,e^{-j\omega t}}{N(\omega)} \qquad (8)$$
      </p>
      <p>
        Based on expression (8), the amplitude-frequency characteristic of the optimal filter will be the amplitude-frequency characteristic of the input implementation up to the coefficient $k$:
      </p>
      <p>
        $$H(\omega) = k\,|X(\omega)| \qquad (9)$$
      </p>
      <p>
        Expression (9) allows us to define the ANN output layer as an $m \times n$ matrix, i.e. the number of matrix rows is determined by the number of classes, the number of columns by the dimension of the input implementation, and the rows of the matrix contain the averaged values of the input implementations belonging to each class. In accordance with the McCulloch-Pitts model, the value at the output of the j-th neuron of the output layer is defined as
      </p>
      <p>
        $$Y_j = \sum_{i=1}^{n} X_i\,H_{ij} \qquad (10)$$
      </p>
      <p>
        In expression (10), $H_{ij}$ is the j-th row of the output layer's weight matrix, which is the mathematical expectation of the implementations $X_i$ of the training sample belonging to the j-th class. Physically, expression (10) is the correlation function of the input implementation with the representation of the class defined in the form of the output layer weights.
      </p>
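      <p>
        A minimal sketch of (10) under our reading: the rows of the output layer's weight matrix are the per-class means of the training implementations, and classification takes the maximal correlation. Data, shapes, and helper names are placeholders.
      </p>
      <preformat>
import numpy as np

# Sketch of expression (10): H has one row per class, the mathematical
# expectation of the training implementations of that class; the output
# Y_j = sum_i X_i H_ij is a correlation with the class template.
def build_output_layer(train_X, train_y, n_classes):
    return np.stack([train_X[train_y == j].mean(axis=0)
                     for j in range(n_classes)])

def classify(H, x):
    return int(np.argmax(H @ x))                 # correlator receiver

# Placeholder data: 3 classes of noisy 8-dimensional templates.
rng = np.random.default_rng(2)
templates = rng.standard_normal((3, 8))
train_y = np.repeat(np.arange(3), 50)
train_X = templates[train_y] + 0.1 * rng.standard_normal((150, 8))

H = build_output_layer(train_X, train_y, 3)
print(classify(H, templates[1]))                 # expected: 1
      </preformat>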
      <p>
        Based on the above reasoning and the obtained expression (10), an ANN with one hidden layer was built using the PyTorch library [<xref ref-type="bibr" rid="ref14">14</xref>]. The experiment was carried out using the MNIST database [<xref ref-type="bibr" rid="ref11">11</xref>]. The ANN layer weights were filled in as follows: the first hidden layer implements the DCT, its dimension equal to the dimension n of the input implementation, and the output layer performs the function of an optimal receiver based on the correlator. Without the use of nonlinear layer functions, the ANN gives a recognition accuracy of 72% on test data. In order to assess the correctness of the choice of the mathematical-expectation criterion for the output layer, we train the classifier, i.e. the ANN output layer, using the gradient method; to do this, we artificially prohibit changing the weights of all layers except the last one. The result of training the classifier is shown in figure 1. Due to the training, the recognition reliability increased to 90%, which indicates that the weights in expression (10), represented by the mathematical expectation of class realizations, are not an ideal criterion for the optimal receiver. The use of a nonlinear layer, the ReLU function, increases the recognition accuracy on test data from the MNIST database up to 95%.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Minimization of the feature vector in training the classifier</title>
      <p>
        The next step, in accordance with [<xref ref-type="bibr" rid="ref8">8</xref>], is to minimize the feature vector (in our case, the spectrum of the input signal) in such a way that its dimension is reduced with insignificant losses in its information content. To do this, Ahmed and Rao propose to conduct an analysis of variance of the feature vector in order to identify features with minimum variance, i.e. uninformative ones, and exclude them from the analysis. To assess the degree of influence of the harmonics with low dispersion, we use an ANN with one hidden layer and the ReLU function. The result of the successive removal of uninformative harmonics is shown in figure 2.
      </p>
      <p>
        It is clearly seen from the presented graph that the removal of more than 300 of the 784 harmonics has no effect at all on the recognition accuracy of the test data, and removing a further 200 decreases the recognition quality by only 1%. That is, keeping approximately 250 of the 784 features, we can recognize the test data with 94% reliability. Thus, the first hidden layer of the ANN can be reduced by approximately 70% with a decrease in reliability of no more than 1%; the weight matrix of the output layer is likewise reduced by 70%. Therefore, the gain from the application of the proposed model in comparison with the standard one [<xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>] is approximately 84%.
      </p>
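      <p>
        A sketch of the variance analysis described above (our illustration; MNIST loading and exact thresholds are omitted, and train_images is a placeholder): rank the DCT harmonics by their variance over the training set and keep only the most variable ones.
      </p>
      <preformat>
import numpy as np
from scipy.fft import dct

# Sketch of the feature-vector minimization: compute DCT spectra of the
# training set, rank harmonics by variance, keep the top k (~250 of 784
# in the experiment above). train_images stands in for flattened MNIST.
rng = np.random.default_rng(3)
train_images = rng.random((1000, 784))           # placeholder for MNIST

spectra = dct(train_images, axis=1, norm='ortho')
variances = spectra.var(axis=0)

k = 250
keep = np.argsort(variances)[-k:]                # informative harmonics
reduced = spectra[:, keep]                       # classifier feature vectors
print(reduced.shape)                             # (1000, 250)
      </preformat>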
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>Despite the signi cant gain in ANN performance, the proposed model has a number of
disadvantages. These include the low reliability of test data recognition: 95% versus 98% or
more for ANNs implemented on the basis of the traditional model. This can be explained by
the fact that the simplest ANNs with one hidden layer have been simulated so far. In addition,
the use of the averaged value as a measure of the similarity with the implementation requires
further study, since further training is needed, and therefore computational costs. We hope that
further research in the proposed direction will reduce the recognition error and expand the range
of application of the proposed model.</p>
      <p>Acknowledgements This work was supported by a grant from the Russian Science
Foundation Grant No. 19-71-10033.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>McCulloch W S and Pitts</surname>
            <given-names>W 1943</given-names>
          </string-name>
          <source>The bulletin of mathematical biophysics 5</source>
          <volume>115</volume>
          {
          <fpage>133</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kolmogorov</surname>
            <given-names>A N</given-names>
          </string-name>
          <year>1957</year>
          <article-title>On the representation of continuous functions of several variables in the form of superpositions of continuous functions of one variable and addition Doklady Akademii nauk vol 114 (Rossijskaya akademiya nauk</article-title>
          ) pp
          <volume>953</volume>
          {
          <fpage>956</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Arnol'd V I 1958</surname>
          </string-name>
          <article-title>Mat</article-title>
          . Prosveshchenie 3
          <volume>41</volume>
          {
          <fpage>61</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hecht-Nielsen</surname>
            <given-names>R</given-names>
          </string-name>
          1988
          <source>IEEE spectrum 25</source>
          <volume>36</volume>
          {
          <fpage>41</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hinton</surname>
            <given-names>G E</given-names>
          </string-name>
          <year>2007</year>
          Trends in
          <source>cognitive sciences 11</source>
          <volume>428</volume>
          {
          <fpage>434</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Alexandrovich</surname>
            <given-names>S A</given-names>
          </string-name>
          <year>1983</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Klod</surname>
            <given-names>S 1963</given-names>
          </string-name>
          <article-title>Works on information theory and cybernetics</article-title>
          (Foreign Literature Publishing House)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Rao</surname>
            <given-names>K</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ahmed N 1976</surname>
          </string-name>
          <article-title>Orthogonal transforms for digital signal processing ICASSP'76</article-title>
          . IEEE International Conference on Acoustics,
          <source>Speech, and Signal Processing</source>
          vol
          <volume>1</volume>
          (IEEE) pp
          <volume>136</volume>
          {
          <fpage>140</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Vershkov</surname>
            <given-names>N A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuchukov</surname>
            <given-names>V A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuchukova N N and Babenko</surname>
            <given-names>M 2020</given-names>
          </string-name>
          <article-title>The wave model of arti cial neural network 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus) (IEEE</article-title>
          ) pp
          <volume>542</volume>
          {
          <fpage>547</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Vasilievich</surname>
            <given-names>S A</given-names>
          </string-name>
          <year>1967</year>
          <article-title>Information theory and its application to automatic control problems (Publishing House "Science" Head edition of physical</article-title>
          and mathematical literature)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Deng</surname>
            <given-names>L</given-names>
          </string-name>
          <source>2012 IEEE Signal Processing Magazine 29</source>
          <volume>141</volume>
          {
          <fpage>142</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Hebb D O 1949 A Wiley</surname>
          </string-name>
          <article-title>Book in Clinical Psychology 62 78</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Widrow</surname>
            <given-names>B</given-names>
          </string-name>
          <source>1959 Part 4</source>
          <volume>74</volume>
          {
          <fpage>85</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Paszke</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massa</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerer</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradbury</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanan</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Killeen</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimelshein</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antiga</surname>
            <given-names>L</given-names>
          </string-name>
          et al.
          <year>2019</year>
          arXiv preprint arXiv:
          <year>1912</year>
          .01703
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>