Classification of X-Ray Images of the Chest Using Convolutional
Neural Networks
Lesia Mochurad a,*, Andrii Dereviannyi a, Uliana Antoniv b

a
  Artificial intelligence Department, Lviv Polytechnic National University, Lviv, 79013, Ukraine;
  lesia.i.mochurad@lpnu.ua, andriidereviannyi@gmail.com
b
  Department of Specialized Computer Systems, Lviv Polytechnic National University, Lviv, 79013, Ukraine;
  Uliana.s.antoniv@lpnu.ua
* Correspondence lesia.i.mochurad@lpnu.ua; Tel.: +380-32-258-2404


                Abstract
                X-ray imaging is a proven way to detect a wide range of conditions, from fractures to heart
                failure. However, because this examination method depends on the doctor's visual analysis, it
                can lead to misdiagnosis: for example, an early stage of pneumonia may go unrecognized, and
                treatment will be ineffective. This study proposes using a convolutional neural network to
                classify chest X-rays to solve this problem. To do this, we analyzed the literature on
                classification with neural networks in different areas of computer vision, in particular
                convolutional neural networks for medical use. The image classification model was trained on
                a database of 112 thousand images from 30 thousand unique patients. The model achieved an
                accuracy of 0.93 and a recall of 0.99. We also analyzed the literature on the acceleration,
                parallelization, and synchronization of convolutional neural networks, took the shortcomings
                of existing approaches into account, and proposed a new optimization approach. The
                classification results of the parallel approach on a GPU were compared with those of the
                sequential approach on a CPU. With the proposed algorithm, the model trains 6.13 times
                faster on the GPU than on the CPU.

                Keywords
                Computer vision, image classification model, parallelization, acceleration, GPU, CPU.


1. Introduction
    When it comes to identifying images, it is straightforward for us humans to recognize and
distinguish the different features of the depicted objects. Because our brains are constantly and
subconsciously trained on the same kinds of data, we can easily distinguish between various entities. In
contrast, a computer sees the world differently: as an array of numerical values that
encode the critical aspects of the image or video it tries to recognize. The principle by which the
system interprets a picture is radically different from how people do it. To see, a computer needs
image recognition algorithms that analyze and interpret what is happening in an image or a sequence
of images. A good example is the identification of pedestrians and vehicles, which is possible thanks
to the preliminary categorization and sorting of millions of images provided by users [1, 2].
    Medicine stands out among the fields that require a reliable image identification system while also
generating large amounts of data on which such a system can be trained [3]. However, the
biggest challenge with collected medical data is analyzing and processing it effectively for further
use [4]. There are many methods of organizing the data obtained. We consider one of them,
classification, because it is widely used in the medical field, for example, to detect the
symptoms of a disease.


IDDM-2021: 4th International Conference on Informatics & Data-Driven Medicine, November 19–21, 2021 Valencia, Spain;
EMAIL: lesia.i.mochurad@lpnu.ua (L. Mochurad); andriidereviannyi@gmail.com (A. Dereviannyi); Uliana.s.antoniv@lpnu.ua (U.
Antoniv).
ORCID: 0000-0002-4957-1512 (L. Mochurad); 0000-0003-1456-1303 (A. Dereviannyi); 0000-0002-6792-043X (U. Antoniv).
           ©️ 2021 Copyright for this paper by its authors.
           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)
    The classification task is for a computer to analyze an image and assign it to an appropriate class,
usually by attaching a label to the image. For us, classifying images is elementary, but it is
also a great example of Moravec's paradox: what is simple for humans is difficult for artificial
intelligence [5].
    Early image classification relied on raw pixels: the computer split the image into
individual pixels. The problem with this approach is that two photos of the same object may look
different. Different backgrounds, angles, poses, and many other factors made it difficult for computers
to recognize and categorize images. Deep learning is designed to solve this problem [6]. It
relies on computer systems known as neural networks.
    Unfortunately, classification is quite resource-intensive, and with weak
computing power or a sequential algorithm, obtaining the result can take a long time. In such cases,
optimization or the use of more powerful resources is usually considered.
    The aim of this study is to propose a parallel algorithm and analyze the advantages it offers in
solving the problem of classifying X-ray images to detect pneumonia using the CPU and GPU.

2. Literature review
    Image classification often relies on convolutional neural networks
(CNNs). They proved their power when AlexNet won the ImageNet competition by a wide margin
over other, more traditional approaches. Since then, the CNN has become one of the most promising
machine learning algorithms and is widely used to solve problems involving large-scale datasets.
However, training deep convolutional neural networks on large datasets is a computationally
intensive task that takes a long time. That is why the algorithm is
parallelized, which reduces the load on a single core by dividing the work between several [7-10]. As a
result, training no longer takes days or even weeks.
    There are two approaches: model parallelism, in which the model is divided between several computing
nodes and trained on the same data, and data parallelism, in which the data is distributed across several
nodes and the same model is used for training. Hybrid approaches that combine both kinds of parallelism
have also been proposed; examples are [11] and [12]. In hybrid approaches, a small number
of nodes are grouped to train the model using model parallelism, and the dataset is divided between the
groups to be processed simultaneously using data parallelism. They use a master-managed model in which
the main task of the server is to update parameters centrally. The disadvantage of this approach is that
all groups access the same server, which coordinates their interaction; this creates a delay that
reduces performance.
    One of the most popular training algorithms for CNNs is stochastic gradient descent (SGD), as
demonstrated in [13]. The algorithm works iteratively: the model parameters are updated
until they become optimal. Because the model parameters in any
two consecutive iterations depend on each other, parallel SGD can suffer from the high cost of
interprocess communication [14]. Recently, researchers have made many efforts to improve the scaling of
the parallel SGD algorithm [15-16]. Traditional synchronous SGD guarantees optimal parameter updates
at the cost of low parallel efficiency, which is caused by frequent synchronization. Asynchronous
SGD has been designed to address this performance bottleneck. This approach is quite popular,
as can be seen from [17] and [18]. However, asynchronous SGD also has disadvantages: it requires
more iterations to reach the same accuracy, so with many processes the training time increases.
    In [19], the classification accuracy of different models on chest images is
investigated, and the authors achieved a high accuracy of 96.81%. However, the
execution time of such classification on large-scale input data was not studied, and, accordingly,
no parallelization was performed.
    Another article related to medicine and the parallelization of CNNs is [20]. However, the
method used in that work is a hybrid that combines a CNN and an RNN.
    Thus, our analysis of publications found no work that demonstrates the
parallelization of a CNN on the CPU and GPU to classify pneumonia.
    Some authors argue that parallelism on large-scale data is harmful and degrades
results; others say it is beneficial [21-23]. The real competitiveness of parallelization can be
demonstrated today, as large amounts of data are freely available, allowing more extensive experiments.
Hence, it is not surprising that the literature remains ambiguous: all of the above methods have pros
and cons.

3. Materials and Methods
   Deep convolutional neural networks are used in many tasks, such as image classification and audio
processing [24-27]. Although CNNs rose to prominence in 2012, they are gaining real popularity right
now. The main factors in their success are the availability of large datasets and the high
performance of modern computer systems.
   To achieve good results, this kind of neural network requires large amounts of data. For example, one
of the earliest CNNs, LeNet-5 [28], was trained to recognize handwritten digits using the MNIST dataset,
which contains about 60 000 images in the training set and 10 000 images in the test set. The CIFAR-10
dataset consists of 60 000 32x32 color images, including 50 000 training images and 10 000 test
images. For our research, we took the NIH Chest X-rays database, which includes 112 thousand
high-resolution images of 30 thousand different patients [29].

      3.1. Database description

    The database considered in this paper contains images of chest X-ray examinations, one
of the most common medical examinations. One of the main obstacles to creating large datasets of
X-ray images is the lack of resources to label many images. Before this dataset was released, Openi [30]
was the largest publicly available source of chest X-rays, with 4 143 images available.
    The NIH chest X-ray dataset consists of 112 120 X-ray images with disease labels from 30 805 unique
patients [29].
    Thus, the neural network's task is to divide all images into two classes, pneumonia and healthy,
based on the chest X-ray images (see Figure 1).


Figure 1: X-rays of the chest, left - without pneumonia, right – with pneumonia

    Since this paper will compare the proposed parallel approach with the standard use of CNN, we
first consider the operation of the convolutional neural network algorithm.

      3.2. Convolutional Neural Network (CNN)

   In fully connected (dense) neural networks, neurons are divided into groups that form successive
layers, and each unit is connected to every neuron in the neighboring layers. An example of such a
network is shown below in Figure 2:
Figure 2: An example of a fully connected neural network

    This approach works well for classification problems based on a limited set of well-defined
features. The situation becomes more complicated when the data to be classified are images. We could
feed the brightness of each pixel as a separate input to our dense network, but for that
to work, the network would have to contain tens or even hundreds of millions of neurons. One way to
shrink the network is to reduce the scale of the photo itself, but then we lose information.
Therefore, convolutional neural networks are used to solve this problem.
    To analyze the operation of convolutional networks, we first need to describe the data
structure they work with. For convenience, an image is stored as a 2D matrix, each element of
which describes a particular image pixel. For the RGB model, the image consists of three such matrices,
each of which corresponds to a specific channel: red, green, or blue. If the image is
grayscale, one matrix is used. Each value is in the range from 0 to 255.
    The main element of a CNN is a small matrix called a filter or kernel. We slide it over
the image matrix and transform the image based on the filter values. The following formula calculates
the values of the feature map. The input image is denoted by f and the kernel by h. The row and column
indices of the result matrix are denoted by m and n, respectively.
                           𝐺[𝑚, 𝑛] = (𝑓 ∗ ℎ)[𝑚, 𝑛] = ∑𝑗 ∑𝑘 ℎ[𝑗, 𝑘]𝑓[𝑚 − 𝑗, 𝑛 − 𝑘].
   After placing the filter over the selected pixels, we multiply each pixel value by the
corresponding value from the kernel, sum all the products, and write the result in the appropriate place
in the output feature map. For example, when the filter lies over a region of pixels with values of ten
and zero, each of those pixels is multiplied by the corresponding filter value, and the products are
added to give one entry of the result matrix. We then move the kernel and repeat the previous steps. In
the end, we get a matrix with new data. Because the image shrinks with each convolution, the number of
convolutions that can be applied is limited. In addition, as the filter moves, pixels on the border of
the image contribute to far fewer output values than pixels in the center. In this way, part of the
information present in the picture is lost.
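
   The sliding-window procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: the function names, the 6x6 test image of tens and zeros, and the 3x3 vertical-edge kernel are assumptions chosen for the example. Note that the loop multiplies the kernel with the image as it is laid over the pixels; applying the same loop with the kernel rotated by 180 degrees corresponds to the convolution written in the formula above (up to an index offset).

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the feature map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    feature_map = [[0] * out_w for _ in range(out_h)]
    for m in range(out_h):
        for n in range(out_w):
            total = 0
            # Multiply every covered pixel by the matching kernel value and sum.
            for j in range(k_h):
                for k in range(k_w):
                    total += kernel[j][k] * image[m + j][n + k]
            feature_map[m][n] = total
    return feature_map


def flip_kernel(kernel):
    """Rotate the kernel by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]


# Hypothetical 6x6 grayscale image: left half bright (10), right half dark (0).
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# Hypothetical 3x3 kernel that responds to vertical edges.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = correlate2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

   The 6x6 input shrinks to a 4x4 feature map, which illustrates why the number of successive unpadded convolutions is limited.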
   To solve this problem, a frame (padding), usually filled with zeros, is added around the image
matrix. Depending on whether padding is used, we deal with two types of convolution: valid and same.
Valid: the original image is used without padding. Same: a frame is added around the original image so
that the convolution output is a matrix of the same size as the original image. In the second case, the
width of the frame must be equal to the following value:
$$p = \frac{f - 1}{2},$$
where 𝑝 is the padding and 𝑓 is the size of the filter (usually this value is odd).
   One of the essential hyperparameters of the convolutional layer is the stride, that is, the length
of the step with which the filter moves. For the CNN architecture, this parameter is vital. If we want
the receptive fields to overlap less, or want to reduce the spatial dimensions of the feature map, we
need to increase the stride. The following formula is used to calculate the size of the output matrix:
$$n_{out} = \left\lfloor \frac{n_{in} + 2p - f}{s} + 1 \right\rfloor.$$
   Convolution over volume is an essential concept because it allows us to work with color images
and apply multiple filters within one layer. Keep in mind that the number of channels in
the filter and in the image must match. To apply several filters to the same image, we convolve the
image with each filter separately and combine the results at the end. We can find the size of the
resulting tensor using the following formula:
$$[n, n, n_C] * [f, f, n_C] = \left[ \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor, \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor, n_f \right],$$

where 𝑛 is the size of the image, 𝑓 is the size of the filter, 𝑛𝐶 is the number of channels in the image,
p is the padding used, s is the stride used, and 𝑛𝑓 is the number of filters.
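
    The size formulas above are easy to check numerically. The sketch below is illustrative only: the helper names are hypothetical, and the layer with 32 filters is an assumed example rather than a layer of the model used in this paper. The 500x500 input size is the image size mentioned in Section 3.3.

```python
def same_padding(f):
    """Padding p = (f - 1) / 2 that keeps the output size equal to the input size."""
    return (f - 1) // 2


def conv_output_size(n_in, f, p=0, s=1):
    """Output side length: floor((n_in + 2p - f) / s + 1)."""
    return (n_in + 2 * p - f) // s + 1


def conv_output_shape(n, f, n_f, p=0, s=1):
    """Shape of the tensor obtained by applying n_f filters of size f x f."""
    side = conv_output_size(n, f, p, s)
    return (side, side, n_f)


# Valid convolution of a 6x6 image with a 3x3 filter shrinks it to 4x4.
print(conv_output_size(6, 3))
# Same convolution of a 500x500 image with a 3x3 filter keeps the size.
print(conv_output_size(500, 3, p=same_padding(3)))
# A stride of 2 roughly halves the spatial dimensions.
print(conv_output_size(7, 3, s=2))
# Hypothetical layer: 32 filters of size 3x3 on a 500x500 image, same padding.
print(conv_output_shape(500, 3, 32, p=same_padding(3)))
```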
    Forward propagation consists of two steps. The first is the calculation of the intermediate value 𝑍,
which we obtain by convolving the input data from the previous layer with the tensor 𝑊 and then
adding the bias 𝑏. The second is the application of a nonlinear activation function, denoted by g, to
this intermediate value. The following equations describe these
steps:
$$Z^{[i]} = W^{[i]} \cdot A^{[i-1]} + b^{[i]}, \qquad A^{[i]} = g^{[i]}(Z^{[i]}).$$
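
    As a minimal sketch of these two steps, the code below computes Z and A for a single-channel input and a single filter. The choice of ReLU as the activation g, the function names, and the small numeric example are assumptions made for illustration; the text does not fix a particular activation function.

```python
def relu(z):
    """ReLU activation, used here as an assumed example of g."""
    return max(0.0, z)


def conv_forward(a_prev, w, b, activation=relu):
    """Return (Z, A) for one filter: Z = W applied to A_prev plus b, A = g(Z)."""
    f = len(w)
    out_h = len(a_prev) - f + 1
    out_w = len(a_prev[0]) - f + 1
    z = [
        [
            b + sum(w[j][k] * a_prev[m + j][n + k]
                    for j in range(f) for k in range(f))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]
    a = [[activation(value) for value in row] for row in z]
    return z, a


# Hypothetical activations of the previous layer, filter weights, and bias.
a_prev = [[1.0, 2.0, 0.0],
          [0.0, 1.0, 3.0],
          [2.0, 1.0, 0.0]]
w = [[1.0, -1.0],
     [0.0, 1.0]]
b = -1.0

z, a = conv_forward(a_prev, w, b)
print(z)
print(a)
```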
    Now let's move on to the key properties of convolutional layers. First, not all neurons in two
consecutive layers are connected to each other. Second, some neurons share the same weights. Together,
these properties mean that convolution reduces the number of parameters to be learned. It is also worth
noting that one value from the filter affects every element of the feature map, which is crucial in
backpropagation.
    Consider how the network operates during backpropagation. This algorithm aims to calculate the
derivatives and then use them to update the values of the parameters in a process called gradient
descent. We want to assess the impact of parameter changes on the resulting feature map and the final
result.
    The task of backpropagation is to calculate the partial derivatives of the cost function:
$\frac{\partial L}{\partial W^{[i]}}$ and $\frac{\partial L}{\partial b^{[i]}}$, which are the derivatives
with respect to the parameters of the current layer, as well as the value
$\frac{\partial L}{\partial A^{[i-1]}}$, which will be passed to the previous layer. At the input we have
$\frac{\partial L}{\partial A^{[i]}}$. The first step is to obtain the intermediate value
$\frac{\partial L}{\partial Z^{[i]}}$ by applying the derivative of the activation function to the input
tensor:
$$\frac{\partial L}{\partial Z^{[i]}} = \frac{\partial L}{\partial A^{[i]}} * g'(Z^{[i]}).$$
According to the chain rule, the result of this operation will be used later.
   Then a matrix operation called full convolution is applied. For this operation, we use a kernel that
is rotated 180 degrees. As a result, we have:
$$\frac{\partial L}{\partial A^{[i]}} = \sum_{m=0}^{n_h} \sum_{n=0}^{n_w} W \cdot \frac{\partial L}{\partial Z^{[i]}}[m, n],$$
where 𝑊 is the filter, and $\frac{\partial L}{\partial Z^{[i]}}[m, n]$ is a scalar that belongs to the
partial derivative obtained from the previous layer.
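
   The first backpropagation step, multiplying the incoming gradient element-wise by the derivative of the activation function, can be sketched as follows. ReLU is again an assumed example of g, and the function names and numeric values are hypothetical; the full-convolution step with the rotated kernel is not shown here.

```python
def relu_derivative(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def activation_backward(d_a, z):
    """Compute dL/dZ = dL/dA * g'(Z) element by element."""
    return [
        [d_a[m][n] * relu_derivative(z[m][n]) for n in range(len(z[0]))]
        for m in range(len(z))
    ]


# Hypothetical intermediate values Z and incoming gradient dL/dA.
z = [[-1.0, 4.0],
     [-1.0, -3.0]]
d_a = [[0.5, -2.0],
       [1.0, 0.3]]

d_z = activation_backward(d_a, z)
print(d_z)
```

   Only the position where Z was positive lets the gradient through; everywhere else the ReLU derivative is zero.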
   In addition to convolutional layers, CNNs very often use so-called pooling layers. They are
mainly used to reduce the size of the tensor and speed up calculations. These layers are simple:
we divide the image into regions and then perform a specific operation on each of
these parts. For example, in a max pooling layer, we select the maximum value from each
region and place it in the corresponding place in the output. As in the case of the convolutional layer,
we have two hyperparameters available: filter size and stride. Last but not least, when pooling a
multi-channel image, the pooling should be done separately for each channel.
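
   A minimal sketch of max pooling under these definitions is shown below. The function names, the 2x2 window with stride 2, and the 4x4 example channel are assumptions made for illustration.

```python
def max_pool2d(channel, size=2, stride=2):
    """Max pooling over one channel with a square window."""
    out_h = (len(channel) - size) // stride + 1
    out_w = (len(channel[0]) - size) // stride + 1
    return [
        [
            max(channel[m * stride + j][n * stride + k]
                for j in range(size) for k in range(size))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


def max_pool_image(channels, size=2, stride=2):
    """Pool each channel of a multi-channel image separately."""
    return [max_pool2d(channel, size, stride) for channel in channels]


# Hypothetical 4x4 single-channel input.
channel = [[1, 3, 2, 4],
           [5, 6, 1, 2],
           [7, 2, 9, 1],
           [3, 4, 5, 8]]

pooled = max_pool2d(channel)
print(pooled)
```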
   As we can see, CNN training can be rather slow because of the number of calculations required in
each iteration. Therefore, to speed it up, it is advisable to use parallel computing.

      3.3. Parallelization on the GPU

    In parallel computing, the problem is divided into smaller independent subtasks that run
simultaneously. The results obtained are then combined or synchronized to produce the result of the
original computation. The number of subtasks into which the computation can be divided depends on the
number of cores available in the hardware. Unlike CPUs, which process operations
sequentially, GPUs distribute computationally intensive tasks among thousands of processors,
which allows calculations to be performed much faster [31].
    Keras and TensorFlow are used to parallelize the algorithm. These technologies
use the parallel processing capabilities of the graphics processor. Computational
tasks are performed either sequentially on the CPU, through the C++ software interface, or in parallel
on the GPU, through CUDA.
    As mentioned above, a classical convolutional neural network uses a filter matrix of dimension n*n,
which is multiplied by a matrix of image pixels. This process can be seen in Figure 3.
In the classical implementation, these calculations are performed sequentially, yet they are independent
of each other: no calculation depends on the result of any other. This shows that the convolution
operation can be accelerated using a parallel programming approach and GPUs. This operation takes the
most time in the algorithm and is the most computationally demanding because the image size is 500x500.


Figure 3: The image matrix multiplies the matrix of the kernel or filter

    We can therefore hypothesize that it is better to perform these calculations in parallel on a
graphics processor. To test this hypothesis, we conduct experiments and compare the results and the
training time of the convolutional neural network run sequentially on the CPU and in parallel on the GPU.
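
    The independence of the individual convolution results can be illustrated without a GPU. In the sketch below, each row of the feature map is computed as a separate task in a thread pool and the result is compared with the sequential version. This is only an illustration of the decomposition, under assumed names and a hypothetical 8x8 test image: Python threads do not give the speed-up of a GPU, and the experiments in this paper rely on Keras and TensorFlow rather than on this code.

```python
from concurrent.futures import ThreadPoolExecutor


def feature_map_row(image, kernel, m):
    """Compute row m of the feature map; it depends on no other output row."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_w = len(image[0]) - k_w + 1
    return [
        sum(kernel[j][k] * image[m + j][n + k]
            for j in range(k_h) for k in range(k_w))
        for n in range(out_w)
    ]


def convolve_sequential(image, kernel):
    """Compute the rows one after another."""
    out_h = len(image) - len(kernel) + 1
    return [feature_map_row(image, kernel, m) for m in range(out_h)]


def convolve_parallel(image, kernel, workers=4):
    """Submit every output row as an independent task to a pool of workers."""
    out_h = len(image) - len(kernel) + 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the rows in submission order, so no extra sorting is needed.
        return list(pool.map(lambda m: feature_map_row(image, kernel, m),
                             range(out_h)))


# Hypothetical 8x8 image and 3x3 kernel.
image = [[(r * 7 + c * 3) % 11 for c in range(8)] for r in range(8)]
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]

sequential_result = convolve_sequential(image, kernel)
parallel_result = convolve_parallel(image, kernel)
print(parallel_result == sequential_result)
```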

4. Results
   Intel (R) Core (TM) i7-10750H CPU with the following characteristics is used for numerical
experiments:
                Base speed:     2.59 GHz;
                Sockets:        1;
                Cores: 6;
                Logical processors:    12;
                Virtualization: Enabled;
   and GPU NVIDIA GeForce GTX 1650 Ti:
                Dedicated GPU memory           4.0 GB;
                Shared GPU memory 7.9 GB;
                GPU Memory 11.9 GB.
   In both cases, the same CNN model is used, which is presented in Figure 4.
Figure 4: Schematic representation of the CNN model used

   After training, the following results were obtained:

Table 1
The result of training CNN model using GPU
      Epoch   Time, sec/epoch   Time, ms/step   Loss function   Accuracy   Validation loss function   Validation accuracy
        1          91           344        0.5762         0.7623       0.2828        0.9308
        2          91           346        0.3177         0.8887       0.2724        0.9192
        3          93           353        0.2737         0.9093       0.2852        0.9279
        4          93           353        0.2462         0.9206       0.2363        0.9375
        5          91           345        0.2341         0.9229       0.2397        0.9288
        6          90           343        0.2227         0.9289       0.2056        0.9558
        7          91           345        0.1896         0.9359       0.1843        0.9519
        8          91           345        0.1816         0.9446       0.1696        0.9529
        9          90           343        0.1897         0.9405       0.2008        0.9452
        10         92           349         0.178         0.9419       0.2312        0.9423

Table 2
The result of training CNN model using CPU

     Epoch   Time, sec/epoch   Time, sec/step   Loss function   Accuracy   Validation loss function   Validation accuracy
       1          578            2          0.4927         0.7721        0.2566        0.9058
       2          590            2          0.2711         0.8881        0.183         0.9337
       3          597            2          0.2048         0.9152        0.1583        0.9423
       4          590            2          0.2066         0.9197        0.1347        0.9538
       5          579            2          0.1636         0.9292        0.1498        0.9452
       6          553            2          0.1861         0.9252        0.1527        0.9423
       7          535            2         0.1751        0.932          0.1405         0.9481
       8          526            2          0.14         0.9466         0.1479          0.95
       9          524            2         0.1366        0.9481         0.1292          0.95
       10         524            2         0.1511        0.9464         0.1512         0.9538

    Table 1 and Table 2 show the main indicators by which we compare the training of the model
on the CPU and GPU, namely:
   1. Epoch – the number of passes of the algorithm through the entire dataset
   2. Time, sec/epoch – time in seconds spent on training one epoch
   3. Time, ms/step (Table 1) or sec/step (Table 2) – time spent on one training step
   4. Loss function – neural network prediction error
   5. Accuracy – a metric for evaluating the classification quality of the model
   6. Validation loss function – the loss function computed on the validation dataset
   7. Validation accuracy – accuracy computed on the validation dataset
    When training a model, it is important that it converges, i.e., finds the best parameters, which is
why we consider two indicators that can be used to monitor changes in the model. The first is the loss
function. Since the task is to minimize it, ideally to reduce it to zero, it is important that it
decreases as the number of epochs grows.

Table 3
Change of the loss function over training epochs

                   Epoch                 GPU                         CPU
                                Train           Val          Train           Val
                      1         0.5762         0.2825       0.4927         0.2566
                      2         0.3177         0.2724       0.2711         0.183
                      3         0.2737         0.2852       0.2048         0.1583
                      4         0.2462         0.2363       0.2066         0.1347
                      5         0.2341         0.2397       0.1636         0.1498
                      6         0.2227         0.2056       0.1861         0.1527
                      7         0.1896         0.1843       0.1751         0.1405
                      8         0.1816         0.1696         0.14         0.1479
                      9         0.1897         0.2008       0.1366         0.1292
                     10         0.178          0.2312       0.1511         0.1512
[Line chart: loss function (vertical axis, 0 to 0.8) versus epochs 1 to 10 (horizontal axis); series: GPU Train, GPU Val, CPU Train, CPU Val]

Figure 5: Change of loss function with change of training epochs

    As Table 3 and Figure 5 show, the losses on the training and validation samples decrease at
roughly the same rate in both the GPU and CPU versions. This means that according to the first
indicator, the loss function, it makes no difference which version is used.
    The second indicator that can be used to monitor the change in the model is its accuracy. The task
is to maximize it, ideally to bring it to 1, i.e., 100% accuracy.

Table 4
Change of accuracy over training epochs

               Epoch                                       GPU                                              CPU
                                             Train                   Val                        Train                Val
                 1                          0.7623                 0.9308                      0.7721              0.9058
                 2                          0.8887                 0.9192                      0.8881              0.9337
                 3                          0.9093                 0.9279                      0.9152              0.9423
                 4                          0.9206                 0.9375                      0.9197              0.9538
                 5                          0.9229                 0.9288                      0.9292              0.9452
                 6                          0.9289                 0.9558                      0.9252              0.9423
                 7                          0.9359                 0.9519                      0.932               0.9481
                 8                          0.9446                 0.9529                      0.9466               0.95
                 9                          0.9405                 0.9452                      0.9481               0.95
                 10                         0.9419                 0.9423                      0.9464              0.9538


[Line chart: accuracy (vertical axis, 0 to 1.2) versus epochs 1 to 10 (horizontal axis); series: GPU Train, GPU Val, CPU Train, CPU Val]

Figure 6: Accuracy change chart with changing training epochs
   As Table 4 and Figure 6 show, the accuracy on the training and validation samples increases
at roughly the same rate in both the GPU and CPU versions. This means that for the second indicator,
accuracy, it makes no difference which version is used.
   Given the same rate of convergence of the model on both the CPU and GPU, the indicator that
becomes important is the time spent on training one epoch, which is equal to 263 steps.

Table 5
Time spent training one epoch/step

                        Epoch                  GPU                              CPU
                                      sec/epoch    sec/step            sec/epoch    sec/step
                             1            91         0.34                 578          2
                             2            91         0.35                 590          2
                             3            93         0.35                 597          2
                             4            93         0.35                 590          2
                             5            91         0.35                 579          2
                             6            90         0.34                 553          2
                             7            91         0.35                 535          2
                             8            91         0.35                 526          2
                             9            90         0.34                 524          2
                             10           92         0.35                 524          2


[Chart: time in seconds per step (vertical axis, 0 to 2.5) versus epochs 1 to 10 (horizontal axis); series: GPU, CPU]

Figure 7: Change of average time for 1 step with a change of training epochs

[Figure 8 plot: x-axis Epochs (1-10); y-axis Time, s (0-700); series GPU and CPU]

Figure 8: Change of time for 1 training epoch

   According to Table 5 and Figures 7-8, the average time spent training one step is 0.347 sec/step
on the GPU and 2 sec/step on the CPU, and the average time spent training one epoch is 91.3 sec/epoch
on the GPU and 559.6 sec/epoch on the CPU. This means that the model trains 6.13 times faster on the
GPU than on the CPU.
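
The averages and the speedup quoted above can be reproduced from the per-epoch values in Table 5.
The following is a minimal sketch, not the authors' code; the variable names are illustrative, and
the per-step times are derived here by dividing the epoch time by the 263 steps stated in the text.

```python
# Per-epoch training times in seconds, copied from Table 5.
gpu_epoch_s = [91, 91, 93, 93, 91, 90, 91, 91, 90, 92]
cpu_epoch_s = [578, 590, 597, 590, 579, 553, 535, 526, 524, 524]
STEPS_PER_EPOCH = 263  # number of steps in one epoch, as stated in the text


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


gpu_mean = mean(gpu_epoch_s)
cpu_mean = mean(cpu_epoch_s)

# Speedup is the ratio of average CPU epoch time to average GPU epoch time.
speedup = cpu_mean / gpu_mean

# Per-step time derived from the epoch time (an assumption of this sketch).
gpu_step = gpu_mean / STEPS_PER_EPOCH
cpu_step = cpu_mean / STEPS_PER_EPOCH

print(f"GPU: {gpu_mean:.1f} s/epoch, {gpu_step:.3f} s/step")
print(f"CPU: {cpu_mean:.1f} s/epoch, {cpu_step:.3f} s/step")
print(f"Speedup: {speedup:.2f}x")
```

The derived CPU step time is about 2.13 s, which Table 5 reports rounded to 2, so the ratio of the
epoch times (6.13) is the more precise estimate of the speedup.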

  One of the tasks is to train the model to classify chest X-rays in order to detect pneumonia. The
model was trained for 10 epochs on the GPU because, as shown above, it gives the same result 6.13
times faster.

Table 6
Confusion matrix

                                         Predicted Normal               Predicted Pneumonia
                  Actual Normal                 191                              43
                Actual Pneumonia                 3                              387

    As shown in Table 6, the model correctly classified 191 patients without pneumonia (True
Negative - TN) and 387 patients with pneumonia (True Positive - TP). It incorrectly classified 43
patients without pneumonia as having pneumonia (False Positive - FP) and 3 patients with pneumonia
as healthy (False Negative - FN).
         Thus, the model showed the following metrics:
   1. Accuracy = (TP + TN) / (TP + FP + FN + TN) = 0.93
   2. Precision = TP / (TP + FP) = 0.90
   3. Recall = TP / (TP + FN) = 0.99
   4. F1 score = 2 * precision * recall / (precision + recall) = 0.94
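
The four metrics can be checked directly against the counts in Table 6. The sketch below is
illustrative only: the function name `classification_metrics` is hypothetical, and pneumonia is
assumed to be the positive class, as in the text.

```python
# Counts copied from Table 6 (pneumonia is treated as the positive class).
TN, FP = 191, 43   # row "Actual Normal"
FN, TP = 3, 387    # row "Actual Pneumonia"


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics(TP, FP, FN, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, this gives 0.93, 0.90, 0.99 and 0.94, matching the values listed above.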
    Of the 390 patients with pneumonia, only 3 were classified as having none. Of the 234 patients
without pneumonia, 43 were classified as having pneumonia. In the medical field, the number of
missed pneumonia cases (errors of the first type) must be kept to a minimum, even if this means that
more healthy patients are flagged as sick (errors of the second type), as it is better to
misclassify a healthy patient than to miss a sick one.

5. Discussion of results
    The results presented in the previous section showed the advantage of parallel computing on the
GPU over using a sequential algorithm on the CPU. Training took an average of 91.3 sec/epoch on the
GPU and 559.6 sec/epoch on the CPU, i.e., each epoch ran 6.13 times faster on the GPU than the same
calculations on the CPU. At the same time, the losses on the training and validation samples
decrease at the same rate, and the accuracy on both samples increases at the same rate regardless of
which processor was used, i.e., the model was trained equally well on both the CPU and the GPU.
    It can also be concluded that the time required to move data from the CPU to the GPU and to
recombine or synchronize the data obtained from parallel computations into the final result is not
a significant threat to the speed of the program when processing large amounts of data. This
overhead is fully compensated during the computations themselves, because they run so much faster
on the GPU than on the CPU.
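
One way to state this argument is the simple cost model below. It is only a sketch under stated
assumptions: the symbols $T_{\text{transfer}}$, $T_{\text{compute}}$ and $T_{\text{sync}}$ are
introduced here for illustration, and the three phases are assumed not to overlap in time.

```latex
T_{\text{GPU}} = T_{\text{transfer}} + T_{\text{compute}} + T_{\text{sync}},
\qquad
S = \frac{T_{\text{CPU}}}{T_{\text{GPU}}}
  = \frac{559.6\ \text{s/epoch}}{91.3\ \text{s/epoch}} \approx 6.13 .
```

If the epoch times in Table 5 are wall-clock times, they already include the transfer and
synchronization terms, so $S \approx 6.13$ is the net speedup after the overhead has been paid.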
    Figure 9 shows the classification results for several cases taken from the dataset. As for the
model itself, its high accuracy, precision, recall, and F1 score (the harmonic mean of precision
and recall) are worth noting once again. As already mentioned, the use of classification in
medicine requires the highest possible indicators. We can assume that our model meets this need,
showing an error of 18% for cases without pneumonia and only 0.77% for cases in which lung damage
was present on the X-ray.
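
The two per-class error rates quoted here follow from the rows of Table 6. A minimal sketch,
assuming pneumonia is the positive class; the variable names are illustrative.

```python
# Counts copied from Table 6.
TN, FP = 191, 43   # patients without pneumonia
FN, TP = 3, 387    # patients with pneumonia

# Share of healthy patients wrongly flagged as having pneumonia.
false_positive_rate = FP / (FP + TN)

# Share of pneumonia patients wrongly classified as healthy.
false_negative_rate = FN / (FN + TP)

print(f"Error without pneumonia: {false_positive_rate:.1%}")
print(f"Error with pneumonia:    {false_negative_rate:.2%}")
```

This yields about 18.4% and 0.77%, in line with the figures given in the text.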


Figure 9: Classification results

   However, there is room for improvement. In particular, the next step may be to reduce the error
for cases without pneumonia, which could be done by enlarging the dataset or by building a more
complex model.

6. Conclusion
    By training a convolutional neural network to classify chest X-ray images on the CPU alone and
on the CPU in combination with the GPU, it was shown that training on the GPU is faster. In this
particular case, training on the NVIDIA GeForce GTX 1650 Ti GPU was 6.13 times faster than on the
Intel(R) Core(TM) i7-10750H CPU alone. Although setting up a GPU requires extra time, the reduction
in the training time of a convolutional neural network becomes very significant in real learning
scenarios, where training a single neural network can take days. In such a case, training on the
CPU can last a week, whereas using the GPU would reduce this time to about one day.
    A convolutional neural network was constructed to classify chest X-rays to detect pneumonia, and
an accuracy of 0.93 was obtained, with a recall of 0.99. Only three patients with pneumonia were
misclassified, which is essential in the medical field because it is better to misclassify a healthy
patient and pay more attention to them than to miss a patient who may develop complications as a
result.