<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>IDDM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>LeNet-ViT Deep Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0003-3841-7373</contrib-id>
          <string-name>Eugene Fedorov</string-name>
          <email>y.fedorov@chdtu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-6614-4133</contrib-id>
          <string-name>Tetyana Utkina</string-name>
          <email>t.utkina@chdtu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0002-0210-9582</contrib-id>
          <string-name>M. Leshchenko</string-name>
          <email>mari.leshchenko@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nechyporenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostiantyn Rudakov</string-name>
          <email>k.rudakov@chdtu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cherkasy State Technological University</institution>,
          <addr-line>Shevchenko blvd., 460, Cherkasy, 18006</addr-line>,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>5</volume>
      <fpage>18</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>The method for intelligent diagnosis of COVID-19 based on the LeNet-ViT deep neural network is proposed. The created LeNet-ViT model has the following advantages: the input image is not square, which expands the scope of application; the input image is pre-compressed, and the new size depends on the original image size and is determined empirically, which increases the model training speed and the model identification accuracy; the number of pairs “convolutional layer - downsampling layer” depends on the image size and is determined automatically, which increases the model classification accuracy; the number of layer planes is determined automatically, which speeds up the definition of the model structure; the patch size depends on the image size and is determined empirically, which increases the model identification accuracy; the number of encoder blocks is determined empirically, which increases the model learning speed; the use of a convolutional neural network allows features to be extracted efficiently, and the use of a visual transformer allows these features to be analyzed effectively. The proposed method for intelligent diagnosis of COVID-19 can be used in various intelligent computer systems for medical diagnostics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>intelligent diagnostics, COVID-19, deep neural network, convolutional neural network, visual transformer</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The COVID-19 epidemic is rapidly spreading around the world and has already harmed the health
and well-being of people in different countries. By 2022, hundreds of millions of people had been
infected and millions had died. Effective diagnosis of COVID-19 is essential to curb the growth of
the disease and to respond promptly to those who become ill.</p>
      <p>Currently, the COVID-19 diagnosis uses the following methods:</p>
      <p>1. Laboratory testing [1-4].</p>
      <p>2. Computed tomography [5-12] is used to obtain a CCT image. The advantages are the absence of
pain, automation, the absence of the requirement of the strictest observance of work rules in the
laboratory, and high accuracy. The disadvantage is the high cost.</p>
      <p>3. Radiography [13-17] is used to obtain a CXR image. The advantages are the absence of pain,
high speed, the absence of the requirement of the strictest observance of work rules in the laboratory,
and low cost. The disadvantage is insufficient accuracy.</p>
      <p>An ensemble of classifiers can be used to recognize CCT and CXR images (for example, decision
tree, artificial neural network, naive Bayes, k-nearest neighbors, etc.) [18].</p>
      <p>Currently, classifying deep neural networks have become widespread for intelligent image
identification [19].</p>
      <p>The first class of such networks comprises 2D convolutional networks.</p>
      <p>The LeNet-5 neural network [20] has the simplest architecture and uses two pairs of convolutional
and downsampling layers, as well as two fully connected layers. The convolutional layer reduces the
sensitivity to shifts of image elements. The downsampling layer reduces the dimensionality of the image.
A combination of LeNet-5 (for feature extraction) and long short-term memory (LSTM) (for
classification) is currently popular [21].</p>
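      <p>As an illustration, a minimal Keras sketch of such a LeNet-5-style network is given below; the layer sizes and the two-class output are assumptions for illustration, not the exact configuration used in this research.</p>
      <preformat>
# A minimal LeNet-5-style sketch: two "convolutional layer - downsampling
# layer" pairs followed by two fully connected layers (sizes are assumed).
import tensorflow as tf

def lenet5_sketch(input_shape=(32, 32, 1), num_classes=2):
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(6, 5, activation="relu"),    # convolutional layer
        tf.keras.layers.AveragePooling2D(2),                # downsampling layer
        tf.keras.layers.Conv2D(16, 5, activation="relu"),   # convolutional layer
        tf.keras.layers.AveragePooling2D(2),                # downsampling layer
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),      # fully connected layer
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
      </preformat>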
      <p>Dark Net neural networks [22], AlexNet neural networks [23], and VGG (Visual Geometry Group)
neural networks [24, 25] are modifications of LeNet. In these neural networks, several convolutional
layers can follow one another.</p>
      <p>ResNet neural networks [24, 25, 26] use a Residual block that contains two consecutive
convolutional layers. The planes’ outputs of the layer preceding this block are added to the planes’
outputs of the second convolutional layer of this block. A combination of ResNet (for feature extraction)
and a support vector machine (SVM) (for classification) is currently popular [27].</p>
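      <p>A minimal functional-style sketch of such a Residual block is given below; the filter count and kernel size are assumptions for illustration.</p>
      <preformat>
# Sketch of a Residual block: the outputs of the layer preceding the block
# are added to the outputs of the block's second convolutional layer.
# Assumes the input x already has `filters` channels.
import tensorflow as tf

def residual_block(x, filters=64):
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)     # second convolution
    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([x, y]))  # add block input
      </preformat>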
      <p>The DenseNet neural network (Dense Convolutional Network) [25, 28] uses a dense block, which
contains a set of Residual blocks. The planes’ outputs of the second convolutional layer of the current
Residual block of this dense block are concatenated with the planes’ outputs of the second convolutional
layers of all previous Residual blocks of this dense block and with the planes’ outputs of the layer
preceding this dense block. In addition, convolutional layers located between dense blocks reduce the
number of planes (usually by half).</p>
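      <p>A minimal sketch of such a dense block is given below; the number of units and the growth rate are assumptions for illustration.</p>
      <preformat>
# Sketch of a dense block: each unit's convolution sees the concatenation
# of the block input and the outputs of all previous units.
import tensorflow as tf

def dense_block(x, num_units=3, growth=12):
    features = [x]
    for _ in range(num_units):
        inp = features[0] if len(features) == 1 else tf.keras.layers.Concatenate()(features)
        y = tf.keras.layers.Conv2D(growth, 3, padding="same", activation="relu")(inp)
        features.append(y)
    return tf.keras.layers.Concatenate()(features)
      </preformat>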
      <p>The GoogLeNet neural network (Inception V1) [29] uses an Inception block that contains parallel
convolutional layers with different sizes of connection regions and one downsampling layer. The
planes’ outputs of these parallel layers are concatenated. Convolutional layers with a single-cell
connection area are connected in series with these parallel layers to reduce the number of operations
(in the case of the convolutional layers, such a convolutional layer is placed before them; in the case
of the downsampling layer, it is placed after it). A combination of ResNet (for feature extraction) and
a support vector machine (SVM) (for classification) [27], used for diagnostics on CXR images, is
currently popular and provided a diagnostic probability close to 100%.</p>
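      <p>A minimal sketch of such an Inception block is given below; the branch filter counts are assumptions for illustration.</p>
      <preformat>
# Sketch of an Inception block: parallel convolutions with different
# connection-region sizes plus a downsampling branch; single-connection-area
# (1x1) convolutions reduce the number of operations; outputs are concatenated.
import tensorflow as tf

def inception_block(x, f1=64, f3=96, f5=16, fp=32):
    b1 = tf.keras.layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = tf.keras.layers.Conv2D(f3, 1, padding="same", activation="relu")(x)   # 1x1 before 3x3
    b3 = tf.keras.layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = tf.keras.layers.Conv2D(f5, 1, padding="same", activation="relu")(x)   # 1x1 before 5x5
    b5 = tf.keras.layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = tf.keras.layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = tf.keras.layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)  # 1x1 after pooling
    return tf.keras.layers.Concatenate()([b1, b3, b5, bp])
      </preformat>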
      <p>The Inception V3 neural network [25, 26, 30] is a modification of GoogLeNet, and its Inception and
Reduction blocks are a modification of the GoogLeNet neural network’s Inception block.</p>
      <p>The Inception-ResNet-v2 neural network [25, 26, 31] is a modification of GoogLeNet and ResNet,
its Inception block is a modification of the Residual and Inception block, and the Reduction block is a
modification of the Inception block.</p>
      <p>The Xception neural network [25, 32] uses a Depthwise separable convolution block that performs
first a pointwise convolution and then a depthwise convolution. For both convolutions, the ReLU
activation function is usually used.</p>
      <p>The MobileNet neural network [33] uses a Depthwise separable convolution block that performs
depthwise convolution first and then pointwise convolution. For both convolutions, a linear activation
function is usually used.</p>
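      <p>A minimal sketch contrasting the two orderings described above is given below; the filter counts and activations are assumptions for illustration.</p>
      <preformat>
# Sketch of the two Depthwise separable convolution orderings:
# Xception-style (pointwise first) and MobileNet-style (depthwise first).
import tensorflow as tf

def xception_style_block(x, filters=64):
    y = tf.keras.layers.Conv2D(filters, 1, padding="same", activation="relu")(x)     # pointwise
    return tf.keras.layers.DepthwiseConv2D(3, padding="same", activation="relu")(y)  # depthwise

def mobilenet_style_block(x, filters=64):
    y = tf.keras.layers.DepthwiseConv2D(3, padding="same")(x)     # depthwise, linear activation
    return tf.keras.layers.Conv2D(filters, 1, padding="same")(y)  # pointwise, linear activation
      </preformat>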
      <p>The MobileNet2 neural network [25, 34] uses an Inverse Residual block that performs pointwise
convolution first, then depthwise convolution, and then pointwise convolution again. For these
convolutions, the SiLU activation function is usually used.</p>
      <p>The MobileNet3 neural network [35] uses a Squeeze-and-Excitation block in some Inverse Residual
blocks.</p>
      <p>The second class of such networks comprises transformers.</p>
      <p>ViT (Visual Transformer) [36] contains an encoder as the main component, consisting of a sequence
of blocks. Each block contains the first normalization layer, Multi-Head Attention (which weights the
image patches), the second normalization layer, and a two-layer perceptron.</p>
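      <p>A minimal sketch of one such encoder block is given below; the embedding dimension, head count, and perceptron width are assumptions for illustration.</p>
      <preformat>
# Sketch of one ViT encoder block: first normalization layer, Multi-Head
# Attention, second normalization layer, and a two-layer perceptron,
# with residual additions around both sub-blocks.
import tensorflow as tf

def vit_encoder_block(x, dim=64, num_heads=4, mlp_dim=128):
    y = tf.keras.layers.LayerNormalization()(x)
    y = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim // num_heads)(y, y)
    x = tf.keras.layers.Add()([x, y])
    y = tf.keras.layers.LayerNormalization()(x)
    y = tf.keras.layers.Dense(mlp_dim, activation="gelu")(y)  # two-layer perceptron
    y = tf.keras.layers.Dense(dim)(y)
    return tf.keras.layers.Add()([x, y])
      </preformat>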
      <p>DeiT (Data-efficient image Transformers) [37] contains an encoder as the main component,
consisting of a sequence of blocks. Each block contains the first layer of normalization, Multi-Head
Attention, the second layer of normalization, a two-layer perceptron, as in the case of ViT. DeiT, unlike
ViT, additionally uses a distillation token in addition to patches.</p>
      <p>DeepViT (Deep Visual Transformer) [38] contains an encoder as the main component, consisting
of a sequence of blocks. Each block contains the first normalization layer, Re-Attention (a Multi-Head
Attention modification), the second normalization layer, and a two-layer perceptron.</p>
      <p>CaiT (Class-Attention in Image Transformers) [39] contains an encoder as the main component,
consisting of a sequence of blocks. Each block contains the first normalization layer, Multi-Head
Attention or Class-Attention (a modification of Multi-Head Attention that takes into account not only
the patch, but also the class), the second normalization layer, a two-layer perceptron.</p>
      <p>CrossViT (Cross-Attention Multi-Scale Vision Transformer) [40] contains a multi-scale encoder
consisting of a sequence of blocks as the main component. Each block contains two encoders (each
similar to a ViT encoder) for large and small patches, and a Cross-Attention (a modification of
Multi-Head Attention) that allows patches of different sizes to exchange information.</p>
      <p>Hybrids of convolutional neural networks and transformers are currently in active use.</p>
      <p>The Compact Convolutional Transformer (CCT) [41] contains an encoder as its main component,
consisting of a sequence of blocks. Each block contains the first normalization layer, Multi-Head
Attention, the second normalization layer, and a two-layer perceptron, as in ViT. Patch extraction is
performed with a sequence of pairs of convolutional and downsampling layers instead of splitting the
image into patches, and a sequence pooling layer follows the encoder.</p>
      <p>The Pooling-based Vision Transformer (PiT) [42] contains an encoder as the main component,
consisting of a sequence of blocks. Each block contains the first layer of normalization, Multi-Head
Attention, the second layer of normalization, a two-layer perceptron, as in the case of ViT. PiT, unlike
ViT, uses depthwise convolution (each plane of the current layer is connected only to the corresponding
plane of the next layer) and a downsampling layer before and after the encoder.</p>
      <p>LeViT [43] contains an encoder consisting of a sequence of blocks as the main component. Each
block contains LeViT Attention (three convolutions with normalization are used) without downsampling
and a two-layer perceptron. Between these blocks there are LeViT Attention blocks (using three
convolutions with normalization) with downsampling. Patch extraction is performed with a sequence
of pairs of convolutional and downsampling layers instead of splitting the image into patches. LeViT,
unlike ViT, additionally uses a distillation token in addition to patches.</p>
      <p>The Convolutional vision Transformer (CvT) [44] contains an encoder as the main component,
consisting of a sequence of blocks. Each block contains the first normalization layer, Multi-Head
Attention, the second normalization layer, and a two-layer perceptron, as in ViT. CvT, unlike ViT,
performs patch extraction with a sequence of pairs of convolutional and downsampling layers instead
of splitting the image into patches.</p>
      <p>MobileViT [45] contains Inverse Residual blocks from the MobileNet2 neural network and
encoders as the main components. Each encoder consists of a sequence of blocks. Each block contains
the first normalization layer, Multi-Head Attention, the second normalization layer, and a two-layer
perceptron, as in ViT. There are two convolutional layers before and after each encoder.</p>
      <p>Deep neural networks have one or more of the following disadvantages:
• insufficiently high classification accuracy;
• complexity of identifying the neural network structure (the number and size of layers, the number
of transformer blocks, the patch size, etc.);
• insufficiently high speed of parameter identification.</p>
      <p>Parallel algorithms are used to increase the learning rate of deep neural networks [46, 47]. The
problem of creating an effective deep neural network is therefore relevant.</p>
      <p>The goal of the research is to increase the efficiency of intelligent COVID-19 diagnostics through
the use of deep neural networks.</p>
      <p>To achieve this goal, the following tasks were set and solved:
1. To create a COVID-19 smart diagnostic model based on a convolutional neural network and a
visual transformer.
2. To select the quality criteria for the COVID-19 smart diagnostic method.
3. To determine the structure of the COVID-19 smart diagnostic method.
4. To conduct numerical research of the proposed method for intelligent diagnosis of COVID-19.</p>
    </sec>
    <sec id="sec-3">
      <title>2. LeNet-ViT deep neural network</title>
      <p>The LeNet-ViT deep neural network for CXR image classification was proposed by the research’s
authors, and it is a non-recurrent network (fig. 1). LeNet-ViT includes a sequence of alternating
convolutional and downsampling layers and a reshape layer, an encoder (consists of blocks; the k-th
block structure is shown in fig. 2), a flattening block, a multilayer perceptron (MLP).</p>
      <p>Unlike in traditional LeNet, an encoder was added after the convolutional and downsampling layers,
which improves the efficiency of classification.</p>
      <p>Unlike in traditional CvT, the input image is not square; the input image is pre-compressed, and the
new size depends on the original image size and is determined empirically; the number of pairs
“convolutional layer - downsampling layer” depends on the size of the image and is determined
empirically; the number of planes is determined automatically, as the quotient of the cells’ number in
the input layer divided by two raised to a power equal to twice the number of “convolutional layer -
downsampling layer” pairs, which preserves the total number of cells in the layer after downsampling;
downsampling halves the height and width of the layer planes; the patch size depends on the image size
and is determined empirically; the number of encoder blocks is determined empirically; the class
number is not added to the patches.</p>
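      <p>A small sketch of this automatic structure determination is given below; the input size and the number of pairs are assumptions for illustration.</p>
      <preformat>
# Sketch of the automatic structure determination: after each "convolutional
# layer - downsampling layer" pair the plane height and width are halved, and
# the number of planes 2**(2*l) keeps the total cell count constant.

def lenet_vit_structure(height=32, width=32, num_pairs=2):
    layers = []
    h, w = height, width
    for l in range(1, num_pairs + 1):
        planes = 2 ** (2 * l)        # number of planes in pair l
        h, w = h // 2, w // 2        # downsampling halves height and width
        layers.append((l, planes, h, w))
    return layers

# For a 32x32 input and two pairs: [(1, 4, 16, 16), (2, 16, 8, 8)];
# 16 planes * 8 * 8 cells = 1024 = 32 * 32 input cells.
print(lenet_vit_structure())
      </preformat>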
    </sec>
    <sec id="sec-4">
      <title>3. The ANN functioning</title>
      <p>1. Patch Shaping via convolution and downsampling.</p>
      <p>Let $\delta$ be a position in a connection area, $\delta = (\delta_x, \delta_y)$; $K^{I}$ – the number of cell planes in the input layer $I$ (3 for RGB images); $K^{s_l}$ – the number of cell planes in the downsampling layer $S_l$; $K^{c_l}$ – the number of cell planes in the convolutional layer $C_l$; $A^{l}$ – the connection area of a plane of the layer $S_l$; $L$ – the number of convolutional (or downsampling) layers.</p>
      <p>1.1. $l = 1$.</p>
      <p>1.2. The computation of the output signal of a convolutional cell.</p>
      <p>$K^{c_l} = 2^{2l}$, $N1^{c_l} = \begin{cases} N1^{I}, &amp; l = 1, \\ N1^{s_{l-1}}, &amp; l &gt; 1, \end{cases}$ $N2^{c_l} = \begin{cases} N2^{I}, &amp; l = 1, \\ N2^{s_{l-1}}, &amp; l &gt; 1, \end{cases}$</p>
      <p>$h^{c_l}(m, i) = \begin{cases} b^{c_1}(i) + \sum_{k=1}^{K^{I}} \sum_{\delta \in A^{I}} w^{c_1}(\delta, k, i)\, x(m + \delta, k), &amp; l = 1, \\ b^{c_l}(i) + \sum_{k=1}^{K^{s_{l-1}}} \sum_{\delta \in A^{l-1}} w^{c_l}(\delta, k, i)\, u^{s_{l-1}}(m + \delta, k), &amp; l &gt; 1, \end{cases}$</p>
      <p>where $w^{c_1}(\delta, k, i)$ – the connection weight from the $\delta$-th position in the connection area of the $k$-th plane of cells of the input layer $I$ to the $i$-th plane of cells of the convolutional layer $C_1$; $w^{c_l}(\delta, k, i)$ – the connection weight from the $\delta$-th position in the connection area of the downsampling layer $S_{l-1}$ to the $i$-th plane of cells of the convolutional layer $C_l$; $u^{c_l}(m, i)$ – the output of the cell in the $m$-th position of the $i$-th plane of cells of the convolutional layer $C_l$.</p>
      <p>1.3. The computation of the output signal of a downsampling cell (scale down by half):</p>
      <p>$u^{s_l}(m, k) = \max_{\delta \in \{0,1\}^2} u^{c_l}(2m + \delta, k)$, $m \in \{1, \ldots, N1^{s_l}\} \times \{1, \ldots, N2^{s_l}\}$, $k \in \overline{1, K^{s_l}}$, $K^{s_l} = 2^{2l}$,</p>
      <p>where $u^{s_l}(m, k)$ – the output of the cell in the $m$-th position of the $k$-th plane of cells of the downsampling layer $S_l$.</p>
      <p>1.4. If $l &lt; L$, then $l = l + 1$ and go to 1.2.</p>
      <p>1.5. The formation of the matrix $[ y_n^{(0)}(i) ]$, $i \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$, corresponding to a sequence of flat patches (each patch matrix is turned into a vector) of length $P$: $y_n^{(0)} = ( u^{s_L}(1, n), \ldots, u^{s_L}(N^{(1)}, n) )$, $n \in \overline{1, P}$, $N^{(1)} = N1^{s_L} N2^{s_L}$, $P = K^{s_L}$.</p>
      <p>2. The transformation of patch vectors in a fully connected layer.</p>
      <p>$y_n^{(1)}(j) = \mathrm{ReLU} \Big( b^{(1)}(j) + \sum_{i=1}^{N^{(1)}} w^{(1)}(i, j)\, y_n^{(0)}(i) \Big)$, $j \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$,</p>
      <p>where $w^{(1)}(i, j)$ – the connection weight from the $i$-th element of the image patch to the $j$-th neuron of the fully connected layer, $y_n^{(1)}(j)$ – the output of the $j$-th neuron of the fully connected layer for the $n$-th patch.</p>
      <p>3. The positional encoding (the patch position is added):</p>
      <p>$y_n^{(2)}(i) = y_n^{(1)}(i) + pos(n, i)$, $pos(n, i) = \begin{cases} \sin \big( n / 10000^{2i/N^{(1)}} \big), &amp; i \bmod 2 = 0, \\ \cos \big( n / 10000^{2i/N^{(1)}} \big), &amp; i \bmod 2 \neq 0, \end{cases}$</p>
      <p>$i \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$.</p>
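      <p>A small NumPy sketch of this positional encoding is given below.</p>
      <preformat>
# Sketch of the sinusoidal positional encoding: sine for even element
# indices, cosine for odd ones, added to the patch vectors.
import numpy as np

def positional_encoding(P, N1):
    pos = np.zeros((P, N1))
    for n in range(1, P + 1):            # patch number
        for i in range(1, N1 + 1):       # element index in the patch vector
            angle = n / (10000.0 ** (2.0 * i / N1))
            pos[n - 1, i - 1] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pos

pos = positional_encoding(P=16, N1=64)   # then y^(2) = y^(1) + pos
      </preformat>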
      <p>4. The encoder performs the following for each block:</p>
      <p>4.1. Normalization</p>
      <p>$y_n^{(3)}(i) = \frac{g}{\sigma} \big( y_n^{(2)}(i) - \mu \big)$, $i \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$, $\mu = \frac{1}{N^{(1)}} \sum_{i=1}^{N^{(1)}} y_n^{(2)}(i)$, $\sigma = \sqrt{ \frac{1}{N^{(1)}} \sum_{i=1}^{N^{(1)}} \big( y_n^{(2)}(i) - \mu \big)^2 }$,</p>
      <p>where $g$ – the amplification (gain) parameter, $y_n^{(3)}(i)$ – the output of the $i$-th normalization layer neuron for the $n$-th patch.</p>
      <p>4.2. Multi-head attention</p>
      <p>$N^{(H)}$ attention heads are used.</p>
      <p>4.2.1. The computation of the queries for each attention head.</p>
      <p>$q_{ln}(j) = \sum_{i=1}^{N^{(1)}} w_l^{(Q)}(i, j)\, y_n^{(3)}(i)$, $l \in \overline{1, N^{(H)}}$, $j \in \overline{1, N^{(K)}}$, $n \in \overline{1, P}$,</p>
      <p>where $w_l^{(Q)}(i, j)$ – the connection weight from the $i$-th image patch element to the $j$-th query for the $l$-th attention head, $q_{ln}(j)$ – the $j$-th query for the $l$-th attention head for the $n$-th patch.</p>
      <p>4.2.2. The computation of the keys for each attention head.</p>
      <p>$k_{ln}(j) = \sum_{i=1}^{N^{(1)}} w_l^{(K)}(i, j)\, y_n^{(3)}(i)$, $l \in \overline{1, N^{(H)}}$, $j \in \overline{1, N^{(K)}}$, $n \in \overline{1, P}$,</p>
      <p>where $w_l^{(K)}(i, j)$ – the connection weight from the $i$-th image patch element to the $j$-th key for the $l$-th attention head, $k_{ln}(j)$ – the $j$-th key for the $l$-th attention head for the $n$-th patch.</p>
      <p>4.2.3. The computation of the key values for each attention head.</p>
      <p>$v_{ln}(j) = \sum_{i=1}^{N^{(1)}} w_l^{(V)}(i, j)\, y_n^{(3)}(i)$, $l \in \overline{1, N^{(H)}}$, $j \in \overline{1, N^{(V)}}$, $n \in \overline{1, P}$,</p>
      <p>where $w_l^{(V)}(i, j)$ – the connection weight from the $i$-th image patch element to the $j$-th key’s value for the $l$-th attention head, $v_{ln}(j)$ – the $j$-th key value for the $l$-th attention head for the $n$-th patch.</p>
      <p>Usually, $N^{(V)} = N^{(K)}$.</p>
      <p>4.2.4. The computation of the attention weights (scores) for each attention head.</p>
      <p>Scaled multiplicative (“dot”) attention is used: the queries are matched against the keys.</p>
      <p>$e_{lmn} = \frac{1}{\sqrt{N^{(K)}}} \sum_{i=1}^{N^{(K)}} q_{ln}(i)\, k_{lm}(i)$, $l \in \overline{1, N^{(H)}}$, $m \in \overline{1, P}$, $n \in \overline{1, P}$,</p>
      <p>$a_{lmn} = \mathrm{softmax}(e_{lmn}) = \frac{\exp(e_{lmn})}{\sum_{z=1}^{P} \exp(e_{lmz})}$, $n \in \overline{1, P}$,</p>
      <p>where $a_{lmn}$ – the attention weight between the $m$-th and $n$-th patches for the $l$-th attention head.</p>
      <p>4.2.5. The computation of the weighted key values for each attention head.</p>
      <p>The attention weight (score) is multiplied by the corresponding key value:</p>
      <p>$h_{ln}(j) = \sum_{m=1}^{P} a_{lmn}\, v_{lm}(j)$, $l \in \overline{1, N^{(H)}}$, $j \in \overline{1, N^{(V)}}$, $n \in \overline{1, P}$,</p>
      <p>where $h_{ln}(j)$ – the weighted $j$-th key value for the $l$-th attention head for the $n$-th patch.</p>
      <p>4.2.6. The formation of the weighted key values matrix over attention heads and patches by concatenation, followed by a linear projection:</p>
      <p>$h_n = \big( h_{1n}(1), \ldots, h_{N^{(H)} n}(N^{(V)}) \big)$, $y_n^{(4)}(j) = \sum_{i=1}^{N^{(H)} N^{(V)}} w^{(4)}(i, j)\, h_n(i)$, $j \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$,</p>
      <p>where $w^{(4)}(i, j)$ – the connection weight from the $i$-th weighted key value to the $j$-th neuron of the multihead attention layer, $y_n^{(4)}(j)$ – the output of the $j$-th neuron of the multihead attention layer for the $n$-th patch.</p>
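      <p>A small NumPy sketch of steps 4.2.1-4.2.6 is given below; the dimensions and random projection matrices are assumptions for illustration.</p>
      <preformat>
# Sketch of multi-head attention (steps 4.2.1-4.2.6): queries, keys and
# values per head, scaled dot-product scores, softmax weights, weighted
# values, concatenation over heads, and the output projection.
import numpy as np

def multi_head_attention(y3, n_heads=4, d_k=16, rng=np.random.default_rng(0)):
    # y3: (P, N1) matrix of normalized patch vectors
    P, N1 = y3.shape
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(N1, d_k)) for _ in range(3))
        q, k, v = y3 @ Wq, y3 @ Wk, y3 @ Wv            # 4.2.1-4.2.3
        e = (q @ k.T) / np.sqrt(d_k)                   # 4.2.4: scaled scores
        e = e - e.max(axis=1, keepdims=True)           # numeric stability
        a = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # softmax weights
        heads.append(a @ v)                            # 4.2.5: weighted values
    h = np.concatenate(heads, axis=1)                  # 4.2.6: concatenation
    W4 = rng.normal(size=(n_heads * d_k, N1))          # output projection
    return h @ W4

y4 = multi_head_attention(np.ones((16, 64)))           # (P, N1) output
      </preformat>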
      <p>4.3. Summation and normalization</p>
      <p>$y_n^{(5)}(i) = \frac{g}{\sigma} \big( y_n^{(2)}(i) + y_n^{(4)}(i) - \mu \big)$, $i \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$,</p>
      <p>where $\mu$ and $\sigma$ are computed as in 4.1, over $y_n^{(2)}(i) + y_n^{(4)}(i)$.</p>
      <p>4.4. The transformation in a two-layer perceptron:</p>
      <p>$y_n^{(k)}(j) = \mathrm{ReLU} \Big( b^{(k)}(j) + \sum_{z} w^{(k)}(z, j)\, y_n^{(k-1)}(z) \Big)$, $k \in \{6, 7\}$, $n \in \overline{1, P}$,</p>
      <p>where $w^{(k)}(z, j)$ – the connection weight from the $z$-th neuron of the $(k-1)$-th layer to the $j$-th neuron of the $k$-th layer, $y_n^{(k)}(j)$ – the output of the fully connected layer’s $j$-th neuron for the $n$-th patch.</p>
      <p>4.5. Summation and normalization</p>
      <p>$y_n^{(8)}(i) = \mathrm{Norm} \big( y_n^{(4)}(i) + y_n^{(7)}(i) \big) = \frac{g}{\sigma} \big( y_n^{(4)}(i) + y_n^{(7)}(i) - \mu \big)$, $i \in \overline{1, N^{(1)}}$, $n \in \overline{1, P}$.</p>
      <p>The flattening block forms the vector $y^{(9)} = \big( y_1^{(8)}(1), \ldots, y_P^{(8)}(N^{(1)}) \big)$ of length $N^{(1)} P$.</p>
      <p>The computation of the output signal for the hidden and output layers of a two-layer perceptron:</p>
      <p>$y^{(10)}(j) = \mathrm{ReLU} \Big( b^{(10)}(j) + \sum_{z=1}^{N^{(1)} P} w^{(10)}(z, j)\, y^{(9)}(z) \Big)$, $j \in \overline{1, N^{(10)}}$,</p>
      <p>where $w^{(k)}(z, j)$ – the connection weight from the $z$-th neuron of the $(k-1)$-th layer to the $j$-th neuron of the $k$-th layer, $y^{(k)}(j)$ – the output of the $j$-th neuron of the $k$-th fully connected layer.</p>
    </sec>
    <sec id="sec-4a">
      <title>4. The selection of quality criteria for the COVID-19 smart diagnostic method</title>
      <p>The following criteria were chosen in this research to evaluate the learning of the proposed deep
neural network models:</p>
      <p>• the accuracy criterion</p>
      <p>$\mathrm{Accuracy} = \frac{1}{I} \sum_{i=1}^{I} [ d_i = y_i ] \to \max_{W}$, $y_{ij} = \begin{cases} 1, &amp; j = \arg\max_z y_{iz}, \\ 0, &amp; j \neq \arg\max_z y_{iz}; \end{cases}$</p>
      <p>• the categorical cross-entropy criterion</p>
      <p>$CCE = - \frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{K} d_{ij} \ln y_{ij} \to \min_{W}$,</p>
      <p>where $y_i$ – the $i$-th output vector produced by the model, $y_{ij} \in [0, 1]$; $d_i$ – the $i$-th target vector, $d_{ij} \in \{0, 1\}$; $I$ – the training set power; $K$ – the number of classes (neurons in the output layer); $W$ – the weight vector.</p>
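      <p>A small NumPy sketch of both criteria is given below; the toy vectors are assumptions for illustration.</p>
      <preformat>
# Sketch of the two chosen criteria: accuracy over argmax-binarized model
# outputs and categorical cross-entropy (CCE) against one-hot targets.
import numpy as np

def accuracy(d, y):
    # d, y: (I, K) target and model output matrices
    return np.mean(np.argmax(y, axis=1) == np.argmax(d, axis=1))

def cce(d, y, eps=1e-12):
    return -np.mean(np.sum(d * np.log(y + eps), axis=1))

d = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[0.9, 0.1], [0.2, 0.8]])
print(accuracy(d, y), cce(d, y))
      </preformat>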
    </sec>
    <sec id="sec-4b">
      <title>5. The determination of the structure of the COVID-19 smart diagnostic method</title>
      <p>The block diagram of the COVID-19 intelligent diagnostic method for the proposed LeNet-ViT deep
neural network is shown in fig. 3.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Numerical research</title>
      <p>The numerical research was carried out on the basis of the data sets [48, 49]. The size of the COVIDx
CXR-3 dataset was 30386 records. No pre-processing of the dataset was performed. The training data
set included 13992 COVID-19 negative and 15994 COVID-19 positive images. The test data set
included 200 COVID-19 negative and 200 COVID-19 positive images.</p>
      <p>The training was performed using a GPU, since the proposed deep neural networks contain no
recurrent connections. The TensorFlow package was used to implement the proposed deep neural
networks, and Google Colaboratory was chosen as the software environment.</p>
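      <p>A minimal sketch of how such a model can be compiled and trained in TensorFlow with the criteria chosen above is given below; build_lenet_vit, x_train, y_train, and the hyperparameters are assumed placeholders, not the exact code of this research.</p>
      <preformat>
# Sketch: compiling and training a Keras model with the categorical
# cross-entropy loss and the accuracy metric. build_lenet_vit(), x_train
# and y_train are assumed placeholders for illustration.
import tensorflow as tf

model = build_lenet_vit()  # assumed constructor returning a tf.keras.Model
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # CCE criterion
    metrics=["accuracy"],             # accuracy criterion
)
model.fit(x_train, y_train, epochs=11, validation_split=0.2)
      </preformat>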
      <p>The LeNet-ViT model’s structure with two encoder blocks is shown in Table 1, where K is the
number of classes.</p>
      <p>The dependence of the loss (categorical cross-entropy) on the image size for the LeNet-ViT model is
shown in fig. 4. The dependence of the loss on the patch size for the LeNet-ViT model is shown in fig. 5.
The dependence of the loss on the number of iterations for the LeNet-ViT model is shown in fig. 6. The
dependence of accuracy on the number of iterations for the LeNet-ViT model is shown in fig. 7. The
dependence of the loss on the number of pairs “convolutional layer - downsampling layer” for the
LeNet-ViT model is shown in fig. 8.</p>
      <p>[Figures 4, 5 and 8: plots of categorical cross-entropy loss versus image size (from (16,16) to
(1024,1024)), patch size, and the number of “convolutional layer - downsampling layer” pairs,
respectively. Figures 6 and 7: plots of loss and accuracy versus the number of iterations.]</p>
      <p>The following facts were established as a result of the numerical study:
• the best image size after compression for LeNet-ViT in terms of loss (categorical cross-entropy)
is 32x32 (fig. 4);
• the best patch size after compression for LeNet-ViT in terms of loss (categorical cross-entropy)
is 8x8 (fig. 5);
• the minimum number of iterations for LeNet-ViT in terms of loss (categorical cross-entropy)
(fig. 6) and accuracy (fig. 7) is 11;
• the best number of “convolutional layer - downsampling layer” pairs for LeNet-ViT in terms of
loss (categorical cross-entropy) is 2 (fig. 8).</p>
      <p>The k-fold cross-validation method with 5 folds was used to prevent overfitting.</p>
    </sec>
    <sec id="sec-5a">
      <title>7. Conclusions</title>
      <p>1. The relevant artificial intelligence methods were researched to solve the problem of improving
the efficiency of intelligent diagnosis of COVID-19. According to the results of this research, the use
of deep neural networks is the most effective approach.</p>
      <p>2. The created LeNet-ViT model, unlike traditional CvT, has the following advantages: the input
image is not square, which expands the scope of application; the input image is pre-compressed, and
the new size depends on the original image size and is determined empirically, which increases the
model training speed and the model identification accuracy; the number of pairs “convolutional layer -
downsampling layer” depends on the image size and is determined empirically, which increases the
model identification accuracy; the number of planes is determined automatically, as the quotient of the
cells’ number in the input layer divided by two raised to a power equal to twice the number of
“convolutional layer - downsampling layer” pairs, which preserves the total number of cells in the layer
after downsampling; downsampling halves the height and width of the layer planes, which automates
the definition of the model’s layer structure; the patch size depends on the image size and is determined
empirically, which increases the model identification accuracy; the number of encoder blocks is
determined empirically, which increases the model learning speed.</p>
      <p>3. The use of the proposed method of intelligent COVID-19 diagnostics in various intelligent
medical diagnostic systems is a prospect for further research.</p>
    </sec>
    <sec id="sec-6">
      <title>8. References</title>
      <p>[14] Baratella, Elisa, et al. “Severity of Lung Involvement on Chest x-Rays in SARS-Coronavirus-2
Infected Patients as a Possible Tool to Predict Clinical Progression: An Observational
Retrospective Analysis of the Relationship between Radiological, Clinical, and Laboratory Data.”
Jornal Brasileiro De Pneumologia, vol. 46, no. 5, 2020, doi:10.36416/1806-3756/e20200226.
[15] Amis, E. Stephen, et al. “American College of Radiology White Paper on Radiation Dose in
Medicine.” Journal of the American College of Radiology, vol. 4, no. 5, 2007, pp. 272–284.,
doi:10.1016/j.jacr.2007.03.002.
[16] Tahir, Anas M., et al. “A Systematic Approach to the Design and Characterization of a Smart
Insole for Detecting Vertical Ground Reaction Force (Vgrf) in Gait Analysis.” Sensors, vol. 20,
no. 4, 2020, p. 957., doi:10.3390/s20040957.
[17] Kallianos, K., et al. “How Far Have We Come? Artificial Intelligence for Chest Radiograph
Interpretation.” Clinical Radiology, vol. 74, no. 5, 2019, pp. 338–345.,
doi:10.1016/j.crad.2018.12.015.
[18] Chandra, Tej Bahadur, et al. “Coronavirus Disease (COVID-19) Detection in Chest X-Ray Images
Using Majority Voting Based Classifier Ensemble.” Expert Systems with Applications, vol. 165,
2021, p. 113909., doi:10.1016/j.eswa.2020.113909.
[19] Fedorov, Eugene, et al. “The Method of Intelligent Image Processing Based on a Three-Channel
Purely Convolutional Neural Network.” CEUR Workshop Proceedings, vol. 2255, 2018, pp. 336–
351., ceur-ws.org/Vol-2255/paper30.pdf. Accessed 25 Sept. 2022.
[20] Wan, Lanjun, et al. “Rolling-Element Bearing Fault Diagnosis Using Improved Lenet-5 Network.”
Sensors, vol. 20, no. 6, 2020, p. 1693., doi:10.3390/s20061693.
[21] Islam, Md. Zabirul, et al. “A Combined Deep CNN-LSTM Network for the Detection of Novel
Coronavirus (COVID-19) Using X-Ray Images.” Informatics in Medicine Unlocked, vol. 20,
2020, p. 100412., doi:10.1016/j.imu.2020.100412.
[22] Ozturk, Tulin, et al. “Automated Detection of COVID-19 Cases Using Deep Neural Networks with
X-Ray Images.” Computers in Biology and Medicine, vol. 121, 2020, p. 103792.,
doi:10.1016/j.compbiomed.2020.103792.
[23] Krizhevsky, Alex, et al. “ImageNet Classification with Deep Convolutional Neural Networks.”
Communications of the ACM, vol. 60, no. 6, 2017, pp. 84–90., doi:10.1145/3065386.
[24] He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2016, doi:10.1109/cvpr.2016.90.
[25] Hemdan, Ezz El-Din, et al. “COVIDX-Net: A Framework of Deep Learning Classifiers to
Diagnose COVID-19 in X-Ray Images.” ArXiv, 2020, doi:10.48550/ARXIV.2003.11055.
[26] Narin, Ali, et al. “Automatic Detection of Coronavirus Disease (COVID-19) Using X-Ray Images
and Deep Convolutional Neural Networks.” Pattern Analysis and Applications, vol. 24, no. 3,
2021, pp. 1207–1220., doi:10.1007/s10044-021-00984-y.
[27] “Detection of COVID-19 Chest X-Ray Using Support Vector Machine and Convolutional Neural
Network.” Communications in Mathematical Biology and Neuroscience, 2020,
doi:10.28919/cmbn/4765.
[28] Que, Yue, and Hyo Jong Lee. “Densely Connected Convolutional Networks for Multi-Exposure
Fusion.” 2018 International Conference on Computational Science and Computational Intelligence
(CSCI), 2018, doi:10.1109/csci46756.2018.00084.
[29] Szegedy, Christian, et al. “Going Deeper with Convolutions.” 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2015, doi:10.1109/cvpr.2015.7298594.
[30] Szegedy, Christian, et al. “Rethinking the Inception Architecture for Computer Vision.” 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016,
doi:10.1109/cvpr.2016.308.
[31] Szegedy, Christian, et al. “Inception-V4, Inception-Resnet and the Impact of Residual Connections
on Learning.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017,
doi:10.1609/aaai.v31i1.11231.
[32] Chollet, Francois. “Xception: Deep Learning with Depthwise Separable Convolutions.” 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017,
doi:10.1109/cvpr.2017.195.
[33] Howard, Andrew G., et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications.” ArXiv, 2017, doi:10.48550/arXiv.1704.04861.
[34] Sandler, Mark, et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2018, doi:10.1109/cvpr.2018.00474.
[35] Geng, Lei, et al. “Fertility Detection of Hatching Eggs Based on a Convolutional Neural Network.”
Applied Sciences, vol. 9, no. 7, 2019, p. 1408., doi:10.3390/app9071408.
[36] Dosovitskiy, Alexey, et al. “An Image Is Worth 16x16 Words: Transformers for Image
Recognition at Scale.” ArXiv, 2021, doi:10.48550/arXiv.2010.11929.
[37] Touvron, Hugo, et al. “Training Data-Efficient Image Transformers &amp; Distillation through
Attention.” ArXiv, 2021, doi:10.48550/arXiv.2012.12877.
[38] Zhou, Daquan, et al. “DeepViT: Towards Deeper Vision Transformer.” ArXiv, 2021,
doi:10.48550/arXiv.2103.11886.
[39] Touvron, Hugo, et al. “Going Deeper with Image Transformers.” 2021 IEEE/CVF International
Conference on Computer Vision (ICCV), 2021, doi:10.1109/iccv48922.2021.00010.
[40] Chen, Chun-Fu Richard, et al. “Crossvit: Cross-Attention Multi-Scale Vision Transformer for
Image Classification.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV),
2021, doi:10.1109/iccv48922.2021.00041.
[41] Hassani, Ali, et al. “Escaping the Big Data Paradigm with Compact Transformers.” ArXiv, 7 June
2022, arxiv.org/abs/2104.05704.
[42] Heo, Byeongho, et al. “Rethinking Spatial Dimensions of Vision Transformers.” 2021 IEEE/CVF
International Conference on Computer Vision (ICCV), 2021, doi:10.1109/iccv48922.2021.01172.
[43] Graham, Ben, et al. “Levit: A Vision Transformer in ConvNet’s Clothing for Faster Inference.”
2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021,
doi:10.1109/iccv48922.2021.01204.
[44] Wu, Haiping, et al. “CVT: Introducing Convolutions to Vision Transformers.” 2021 IEEE/CVF
International Conference on Computer Vision (ICCV), 2021, doi:10.1109/iccv48922.2021.00009.
[45] Mehta, Sachin, and Mohammad Rastegari. “Mobilevit: Light-Weight, General-Purpose, and
Mobile-Friendly Vision Transformer.” ArXiv, 4 Mar. 2022, arxiv.org/abs/2110.02178.
[46] Shvachych, G.G., et al. “Parallel Computational Algorithms in Thermal Processes in Metallurgy
and Mining.” Naukovyi Visnyk Natsionalnoho Hirnychoho Universytetu, no. 4, 2018, pp. 129–
137., doi:10.29202/nvngu/2018-4/19.
[47] Shlomchak, G, et al. “Automated Control of Temperature Regimes of Alloyed Steel Products
Based on Multiprocessors Computing Systems.” Metalurgija, vol. 58, no. 3-4, 2019, pp. 299–302.,
hrcak.srce.hr/218406. Accessed 25 Sept. 2022.
[48] Zhao, Andy. “Chest x-Ray Images for the Detection of COVID-19.” Kaggle, 2022,
www.kaggle.com/datasets/andyczhao/covidx-cxr2.
[49] Gunraj, Hayden. “COVIDx CT: A Large-Scale Chest CT Dataset for COVID-19 Detection.”
Kaggle, 2022, www.kaggle.com/datasets/c395fb339f210700ba392d81bf200f766418238c2734e5237b5dd0b6fc724fcb.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>1.5. The matrix [ yn(0i) ] , i 1, N (1) , n 1, P formation, corresponding to a sequence of flat patches (the patch matrix is turned into a vector) of P length. y(n1) = (usL (1</article-title>
          ,
          <string-name>
            <surname>n</surname>
            <given-names>)</given-names>
          </string-name>
          ,...,
          <source>usL (N (1)</source>
          , n)) , n 1,
          <string-name>
            <surname>P ,</surname>
          </string-name>
          <article-title>N (1) = N1sL N 2sL , P = KsL . The transformation of patch vectors in a fully connected layer</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Sheridan</surname>
          </string-name>
          , Cormac. “
          <article-title>Coronavirus and the Race to Distribute Reliable Diagnostics</article-title>
          .”
          <source>Nature Biotechnology</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>4</issue>
          ,
          <issue>2020</issue>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>384</lpage>
          ., doi:10.1038/d41587-020-00002-2.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Lippi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Giuseppe</surname>
          </string-name>
          , et al. “
          <article-title>Potential Preanalytical and Analytical Vulnerabilities in the Laboratory Diagnosis of Coronavirus Disease 2019 (Covid-19).” Clinical Chemistry and Laboratory Medicine (CCLM)</article-title>
          , vol.
          <volume>58</volume>
          , no.
          <issue>7</issue>
          ,
          <issue>2020</issue>
          , pp.
          <fpage>1070</fpage>
          -
          <lpage>1076</lpage>
          ., doi:10.1515/cclm-2020-0285.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Lippi</surname>
          </string-name>
          , Giuseppe, and Mario Plebani.
          <article-title>“A Six-Sigma Approach for Comparing Diagnostic Errors in Healthcare-Where Does Laboratory Medicine Stand?” Annals of Translational Medicine</article-title>
          , vol.
          <volume>6</volume>
          , no.
          <issue>10</issue>
          ,
          <year>2018</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>180</lpage>
          ., doi:10.21037/atm.
          <year>2018</year>
          .
          <volume>04</volume>
          .02.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>Beatriz</given-names>
          </string-name>
          <string-name>
            <surname>Araujo</surname>
          </string-name>
          , et al. “
          <article-title>SARS-COV-2 and the COVID-19 Disease: A Mini Review on Diagnostic Methods</article-title>
          .”
          <string-name>
            <surname>Revista Do Instituto De Medicina Tropical De São Paulo</surname>
          </string-name>
          , vol.
          <volume>62</volume>
          ,
          <year>2020</year>
          , doi:10.1590/s1678-
          <fpage>9946202062044</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kasotakis</surname>
          </string-name>
          , George. “
          <article-title>Faculty Opinions Recommendation of Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (Covid-19</article-title>
          ) in
          <source>China: A Report of 1014 Cases.” Faculty Opinions - Post-Publication Peer Review of the Biomedical Literature</source>
          ,
          <year>2020</year>
          , doi:10.3410/f.
          <volume>737441336</volume>
          .793572936.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Wolach</surname>
          </string-name>
          , Ofir, and Richard M. Stone. “
          <string-name>
            <surname>Mixed-Phenotype Acute</surname>
          </string-name>
          Leukemia.” Current Opinion in Hematology, vol.
          <volume>24</volume>
          , no.
          <issue>2</issue>
          ,
          <issue>2017</issue>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>145</lpage>
          ., doi:10.1097/moh.0000000000000322.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gozes</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ophir</surname>
          </string-name>
          , et al. “
          <article-title>Rapid AI Development Cycle for the Coronavirus (COVID-19) Pandemic: Initial Results for Automated Detection &amp;amp; Patient Monitoring Using Deep Learning CT Image Analysis</article-title>
          .
          <source>” ArXiv</source>
          ,
          <year>2020</year>
          , doi:10.48550/arXiv.
          <year>2003</year>
          .
          <volume>05037</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Shuai</surname>
          </string-name>
          , et al. “
          <article-title>A Deep Learning Algorithm Using CT Images to Screen for Corona Virus Disease (COVID-19</article-title>
          ).” 2020, doi:10.1101/
          <year>2020</year>
          .02.14.20023028.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Yen</surname>
          </string-name>
          , et al. “
          <source>Imaging Profile of the COVID-19 Infection: Radiologic Findings and Literature Review.” Radiology: Cardiothoracic Imaging</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>1</issue>
          ,
          <year>2020</year>
          , doi:10.1148/ryct.2020200034.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Shuo</surname>
          </string-name>
          , et al. “
          <article-title>A Fully Automatic Deep Learning System for COVID-19 Diagnostic and Prognostic Analysis</article-title>
          .
          <source>” European Respiratory Journal</source>
          , vol.
          <volume>56</volume>
          , no.
          <issue>2</issue>
          ,
          <issue>2020</issue>
          , p.
          <fpage>2000775</fpage>
          ., doi:10.1183/13993003.
          <fpage>00775</fpage>
          -
          <lpage>2020</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Salehi</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sana</surname>
          </string-name>
          , et al. “
          <source>Coronavirus Disease</source>
          <year>2019</year>
          (
          <article-title>COVID-19): A Systematic Review of Imaging Findings in 919 Patients</article-title>
          .”
          <source>American Journal of Roentgenology</source>
          , vol.
          <volume>215</volume>
          , no.
          <issue>1</issue>
          ,
          <issue>2020</issue>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>93</lpage>
          ., doi:10.2214/ajr.20.23034.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jingwen</surname>
          </string-name>
          , et al. “
          <source>Radiology Indispensable for Tracking COVID-19.” Diagnostic and Interventional Imaging</source>
          , vol.
          <volume>102</volume>
          , no.
          <issue>2</issue>
          ,
          <issue>2021</issue>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>75</lpage>
          ., doi:10.1016/j.diii.
          <year>2020</year>
          .
          <volume>11</volume>
          .008.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>Geoffrey D.</given-names>
          </string-name>
          , et al. “
          <article-title>The Role of Chest Imaging in Patient Management during the COVID19 Pandemic: A Multinational Consensus Statement from the Fleischner Society</article-title>
          .” Radiology, vol.
          <volume>296</volume>
          , no.
          <issue>1</issue>
          ,
          <issue>2020</issue>
          , pp.
          <fpage>172</fpage>
          -
          <lpage>180</lpage>
          ., doi:10.1148/radiol.2020201365.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>