=Paper=
{{Paper
|id=Vol-2874/short15
|storemode=property
|title=Classifying Raman Spectroscopy Data Using Machine Learning Algorithms for Diagnosing Infection With Sars-Cov-2
|pdfUrl=https://ceur-ws.org/Vol-2874/short15.pdf
|volume=Vol-2874
|authors=Robert Istvan Oniga
}}
==Classifying Raman Spectroscopy Data Using Machine Learning Algorithms for Diagnosing Infection With Sars-Cov-2==
<pdf width="1500px">https://ceur-ws.org/Vol-2874/short15.pdf</pdf>
<pre>
    Classifying Raman Spectroscopy Data
    Using Machine Learning Algorithms for
     Diagnosing Infection With Sars-Cov-2

                              Robert Istvan Oniga

                        Katholieke Universiteit Leuven, Belgium
                                oniga.robi@gmail.com

       Proceedings of the 1st Conference on Information Technology and Data Science
                           Debrecen, Hungary, November 6–8, 2020
                               published at http://ceur-ws.org


                                        Abstract
          The rapid development of the corona crisis requires new methods and
      approaches that could help flatten the curve. For this reason, an
      alternative detection method is investigated that could diagnose the
      disease faster and more reliably and thus help prevent its spread. Raman
      spectroscopy on blood serum is a potential candidate, and it is the
      method examined here. The data obtained from the spectrometer was
      analyzed through linear discriminant analysis, and a predictive model
      with an accuracy of 93.5% was obtained.
      Keywords: SARS-COV-2, LDA, Raman spectroscopy, data processing


1. Introduction
In December 2019, a novel virus emerged, and within a brief period it reached
all corners of the world. This virus, commonly called the coronavirus, has
affected people everywhere. Due to the continuous rise in the number of infected
and deceased persons, a new approach must be taken to deal with this situation.
Since the measures taken to isolate infected individuals have had no significant
effect with the current testing methods, the development of a new detection
method that is more precise and rapid could perhaps prevent the further spread
of the virus.[5, 9] A possible detection method that could satisfy these
requirements is Raman spectroscopy on blood serum combined with machine learning
techniques.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).


2. Methods
2.1. Raman Spectroscopy
Raman spectroscopy is a method that relies on a non-ionizing laser to probe the
vibrational states of a molecule. When the laser light interacts with the
sample, the photons can be scattered in two ways. Most are scattered
elastically, without any change in energy, which carries no relevant
information. A small fraction is scattered inelastically: part of the photon
energy is exchanged with the molecule, so the scattered photon returns to the
medium with a different energy. That difference in energy, the Raman shift, is
studied to obtain information that may be used for bio-medical applications,
among others.[10] Fluorescence, in which the molecule absorbs a photon whose
energy matches the gap between its ground state and an excited state and then
relaxes by emitting photons in the visible and near-infrared spectral ranges, is
a separate process. In this paper, the differences in this energy exchange
between healthy and infected individuals are studied for a possible diagnosis.
These differences are subtle; however, precise statistical algorithms can detect
these trends, leading to a clear diagnosis.
    In many bio-medical applications of Raman spectroscopy, the sample being
analyzed is blood serum. Many features that prove relevant for diagnosis can be
observed in it. More precisely, blood serum contains different organic
components, such as proteins that can indicate the presence of a virus, and the
serum can also be analyzed in order to diagnose different types of carcinoma.
However, before blood serum can be analyzed, a series of preparatory steps must
be taken to obtain a clear and reliable sample.[11] These steps include:

   • Blood has to be drawn from the subject in question.
   • The blood has to sit in the test tube for 15-30 minutes in order to clot.
   • The blood has to be centrifuged in order to get rid of the clot.
   • The serum has to be immediately transferred to another tube to preserve the
     purity.

   With this in mind, the use of Raman spectroscopy on blood serum is proposed
as a method for detecting infection with the SARS-COV-2 virus. Further knowledge
regarding the proteins that mark the presence of the virus in the human body was
acquired in order to estimate where those components appear in the spectroscopic
data, for better detection by means of machine learning. The most important
protein that characterizes the virus is the spike glycoprotein, also called the
S-protein. Because of this particular protein, the virus is able to enter the
body by attaching its so-called spikes to a receptor called
angiotensin-converting enzyme 2 (ACE2).[7] Proteins in general are made of amino
acid chains, which all contain amine and carboxyl functional groups. Because the
main focus of this paper is the processing of Raman spectroscopic data, the
vibrational modes of proteins have to be understood. The most characteristic
Raman bands of proteins are associated with the CONH group: bands that stretch
up to 3100 cm−1 (amide A) and the amide I to VII bands at, respectively,
1600–1690 cm−1, 1480–1580 cm−1, 1230–1300 cm−1, 625–770 cm−1, 640–800 cm−1,
540–600 cm−1 and 200 cm−1.[8] This information is later compared with the
results of the machine learning algorithm in order to identify more precisely
the components of the spectroscopic data that yield the best differentiation
rate.

2.2. Linear Discriminant Analysis (LDA)
For this application, a linear classification technique called LDA was used. The
method is straightforward: it analyzes statistical properties of the data that
are calculated for each class, namely the mean and the covariance matrix over
the multiple variables. The algorithm assumes that the data in each class is
Gaussian and that all classes share the same covariance, so that the values of
each attribute vary around the class mean by the same amount overall. Under
these assumptions, the algorithm estimates the mean and the variance from the
training data for each class. With these parameters, the next step can be made:
classification.[2] To estimate the probability that an unknown sample belongs to
a certain class, the model uses Bayes' theorem, which is given by the following
formula:

    $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},$$

where $A$ and $B$ are events, $P(A \mid B)$ is the probability of $A$ given that
$B$ is true, $P(B \mid A)$ is the probability of $B$ given that $A$ is true and,
finally, $P(A)$ and $P(B)$ are the marginal probabilities of $A$ and $B$, each
considered without reference to the other. [4]
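
As a minimal sketch of how Bayes' theorem turns class statistics into a
prediction, the following Python example fits a one-dimensional, two-class LDA
model with a pooled variance. It is an illustration only and not the MATLAB code
used in this paper: the function names fit_lda_1d and posterior and the toy
intensities are hypothetical, and a real spectrum has many variables instead of
one.

```python
import math

def fit_lda_1d(values, labels):
    """Estimate per-class means, priors and one pooled (shared) variance."""
    classes = sorted(set(labels))
    n = len(values)
    means, priors = {}, {}
    pooled_ss = 0.0
    for c in classes:
        xs = [v for v, l in zip(values, labels) if l == c]
        means[c] = sum(xs) / len(xs)
        priors[c] = len(xs) / n
        pooled_ss += sum((x - means[c]) ** 2 for x in xs)
    variance = pooled_ss / (n - len(classes))
    return means, priors, variance

def posterior(x, means, priors, variance):
    """Bayes' theorem: P(class | x) = P(x | class) P(class) / P(x)."""
    joint = {}
    for c in means:
        likelihood = math.exp(-(x - means[c]) ** 2 / (2 * variance))
        likelihood /= math.sqrt(2 * math.pi * variance)
        joint[c] = likelihood * priors[c]
    evidence = sum(joint.values())  # P(x), summed over all classes
    return {c: joint[c] / evidence for c in joint}

# Hypothetical toy intensities, not values from the paper's data-set.
values = [1.0, 1.2, 0.8, 1.1, 2.0, 2.2, 1.9, 2.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
means, priors, variance = fit_lda_1d(values, labels)
post = posterior(1.3, means, priors, variance)
predicted = max(post, key=post.get)  # class with the highest posterior
```

The sample is assigned to the class with the highest posterior probability;
because the variance is shared between the classes, the resulting decision
boundary is linear.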

2.3. Leave-One-Out Cross-Validation
The classification technique discussed above can yield accurate results on its
own, without additional validation techniques such as leave-one-out
cross-validation. However, since this is a delicate biomedical application, the
maximum achievable accuracy is of interest. For this reason, the model is
trained and tested sequentially: each data point from the data-set serves as the
test sample exactly once, while the remaining data points are used for training.
In this way, every data point is evaluated by a model that has not seen it, and
the estimate of how precisely samples are assigned to one of the classes becomes
more reliable.
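
The leave-one-out loop can be sketched in a few lines of Python. This is a
hedged illustration, not the procedure implemented in the paper: the helper
names are hypothetical, the toy values are invented, and a simple nearest
class-mean rule stands in for the LDA classifier to keep the example short.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Hold out each sample once, train on the rest, count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def fit_class_means(xs, ys):
    """Stand-in model: the mean value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_nearest_mean(class_means, x):
    """Assign x to the class whose mean is closest."""
    return min(class_means, key=lambda y: abs(x - class_means[y]))

# Hypothetical toy intensities, not values from the paper's data-set.
samples = [1.0, 1.2, 0.8, 1.1, 2.0, 2.2, 1.9, 2.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
loo_accuracy = leave_one_out_accuracy(
    samples, labels, fit_class_means, predict_nearest_mean
)
```

Any classifier that exposes a fit step and a predict step can be passed in, so
the same loop would apply to an LDA model.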


3. Particular Application of Methods
The data processing was done in MATLAB by means of the machine learning and
statistics toolboxes. The data-set consisted of Raman spectra of blood serum
from both healthy and infected subjects. The first step was to whiten the data
by removing the DC component, that is, the mean intensity. The data was then
visually inspected for any outstanding features that might impede the
classification process. The whitened data can be observed in Figure 1.
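
Removing the DC component can be sketched as follows, assuming it means
subtracting the mean intensity of each spectrum from that spectrum. This is an
illustrative Python example with a hypothetical function name and invented
values, not the MATLAB code used here.

```python
def remove_dc(spectrum):
    """Subtract the mean intensity so the spectrum is centered on zero."""
    mean = sum(spectrum) / len(spectrum)
    return [value - mean for value in spectrum]

# Hypothetical intensities in arbitrary units.
spectrum = [2.0, 3.0, 5.0, 2.0]
centered = remove_dc(spectrum)
```

After this step the centered spectrum has zero mean, so a constant offset
between measurements no longer influences the classifier.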

        [Plot: Healthy vs SARS-COV2. x-axis: Raman shift [cm-1], 400 to 2000;
         y-axis: Intensity [A.U.], -1 to 6; series: Healthy, SARS-COV2.]

                                     Figure 1. Normalized spectroscopic data.

    The displayed data has subtle differences that cannot be picked up by the
naked eye well enough to make a precise classification. For this reason, once
the algorithm has detected the features that are relevant for differentiating
the two classes, the corresponding Raman shift bands are displayed for a better
visual analysis.
    After the normalization process, the data was randomized along with the
labels, so that the healthy and infected individuals were mixed together in
preparation for the training of the classifier. The samples were divided in the
following manner: 70% of the data was used as the training set and 30% as the
test set. As mentioned before, the training was done both through LDA alone and
through LDA with the leave-one-out cross-validation technique, in order to
compare the results and to take into consideration the possibility of
over-fitting of the model.[1]
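
The shuffling and the 70/30 division can be sketched in Python as follows. The
function name, the fixed seed and the placeholder samples are assumptions made
for illustration; the paper does not state how the random split was generated.

```python
import random

def shuffle_and_split(samples, labels, train_fraction=0.7, seed=0):
    """Shuffle samples together with their labels, then split them."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # seed fixed only for reproducibility
    n_train = int(len(pairs) * train_fraction)
    return pairs[:n_train], pairs[n_train:]

# Placeholder sample identifiers; labels follow the class sizes in Section 4.
samples = list(range(309))
labels = [0] * 150 + [1] * 159
train, test = shuffle_and_split(samples, labels)
```

With 309 samples this split gives 216 training samples and 93 test samples,
which matches the total of 93 entries in the confusion matrix of Figure 2.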


4. Results
The data-set was composed of a total of 309 individuals, of which 150 were
healthy and 159 were infected with the coronavirus.[12] Initially, the training
process was done using only the linear discriminant analysis method, which
yielded an accuracy of 89%: 89% of the subjects were correctly identified as
being either healthy or infected, while 11% were wrongly attributed to a class.
To test for a possible improvement of the model, leave-one-out cross-validation
was used. In this variant, rather than proceeding sample by sample to obtain the
prediction, Raman shift ranges of approximately 50 cm−1 in length were used. The
results of this alternative were better, with an accuracy of 93.5%. Since the
algorithm performs many operations on a small amount of data, the possibility of
over-fitting was taken into consideration before applying the method; however,
no over-fitting was observed with this data-set. To get a better understanding
of the results, the confusion matrix is shown in Figure 2.
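
One way to form Raman shift ranges of about 50 cm−1 is sketched below in
Python. The paper does not describe how the ranges were constructed, so
contiguous, non-overlapping windows over the 400 to 2000 cm−1 axis of Figure 1
are an assumption, and both function names are hypothetical.

```python
def shift_windows(start, stop, width=50):
    """Split [start, stop) into contiguous windows of the given width."""
    windows = []
    low = start
    while low < stop:
        high = min(low + width, stop)
        windows.append((low, high))
        low = high
    return windows

def indices_in_window(shifts, window):
    """Indices of the spectrum points whose Raman shift lies in the window."""
    low, high = window
    return [i for i, shift in enumerate(shifts) if low <= shift < high]

windows = shift_windows(400, 2000)
first_window_points = indices_in_window([400, 425, 450, 475], windows[0])
```

Each window then selects a group of neighbouring intensities that can be
evaluated together as one candidate feature.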

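                               Confusion Matrix
         (rows: output class; columns: target class; cells: count and
          share of all 93 test samples; margins: correct / incorrect)

                          Target 0        Target 1        Correct / incorrect
        Output 0          45 (48.4%)       6 (6.5%)       88.2% / 11.8%
        Output 1           0 (0.0%)       42 (45.2%)      100%  / 0.0%
        Correct /
        incorrect         100% / 0.0%     87.5% / 12.5%   93.5% / 6.5%

                                       Figure 2. Confusion matrix.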


    From this table it can be seen that, of the total of 48 healthy individuals
used for the testing process, 47 were correctly identified as healthy (47 true
negatives) and one was erroneously identified as infected (false positive). The
infected individuals were detected with lower accuracy: of the total of 45
infected, only 40 were correctly identified as infected (true positives) and 5
were assumed to be healthy (false negatives). Other parameters were also taken
into consideration for assessing the performance of the classifier, namely the
accuracy, sensitivity and specificity of the model. These parameters are listed
in Table 1.

                       Table 1. Performance of the classifier.

                                 Accuracy      93.55%
                                Sensitivity    83.33%
                                Specificity    90.38%
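
For reference, the three measures can be computed from confusion-matrix counts
as sketched below in Python. The counts are read from Figure 2 under the
assumption that class 1 denotes infected (positive) and class 0 healthy
(negative); the function name is hypothetical. The sketch shows the standard
definitions only and is not meant to reproduce the entries of Table 1, which are
quoted as published.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # share of all correct predictions
        "sensitivity": tp / (tp + fn),     # share of positives detected
        "specificity": tn / (tn + fp),     # share of negatives detected
    }

# Counts read from Figure 2, assuming class 1 = infected (positive).
metrics = classifier_metrics(tp=42, fn=6, tn=45, fp=0)
```

With these counts the accuracy is 87/93, which rounds to the 93.55% listed in
Table 1.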


    As discussed in 2.1, certain proteins yield Raman shift ranges that are
useful for the classification process. The predominant features in terms of
Raman shift in this particular data-set were in the following ranges:
[400–591] cm−1, [647–673] cm−1, [721–798] cm−1, [820–896] cm−1 and
[1003–1241] cm−1. Of all the ranges discussed in 2.1, the one that overlaps
significantly with the predominant features in this data-set is the one in the
600–800 cm−1 range. The conclusion that can be drawn from this is that the spike
glycoprotein, which is responsible for the infection with the SARS-COV-2 virus,
has strong vibrational activity in this particular range, which can be used to
detect infected individuals. With this additional information, better approaches
can be devised to increase the accuracy of the model. The predominant features
used for the classification process are thus reduced, generally speaking, to a
single feature: the range of Raman shifts between 600 and 800 cm−1, which can be
seen in Figure 4.
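
The overlap between the predominant ranges and the protein bands listed in 2.1
can be checked with a short Python sketch. The function name is hypothetical,
and the single 200 cm−1 band is left out because it is given as one value rather
than a range.

```python
def overlap(a, b):
    """Length of the intersection of two ranges, or 0 if they are disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

# Predominant ranges found in this data-set, in cm-1.
selected = [(400, 591), (647, 673), (721, 798), (820, 896), (1003, 1241)]
# Protein band ranges from Section 2.1, in cm-1.
protein_bands = [(1600, 1690), (1480, 1580), (1230, 1300),
                 (625, 770), (640, 800), (540, 600)]

total_overlap = {
    band: sum(overlap(band, s) for s in selected) for band in protein_bands
}
best_band = max(total_overlap, key=total_overlap.get)
```

Under this simple measure the 640–800 cm−1 band shares the largest stretch with
the predominant ranges (103 cm−1), followed by the 625–770 cm−1 band (75 cm−1).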
    From a visual inspection it can be seen that the data from the two classes
show little difference in terms of intensity. However, there are certain
locations in the specified ranges of Raman shifts at which the differences in
intensity go up to 0.5 [A.U.]. These values are high from the perspective of
Raman spectroscopy, accounting for a difference of 12.5% in terms of intensity.
When analyzing these differences, the most important characteristic considered
at all times was the quantity of virus present in the sample, which could have
been very low for some of the subjects. This quantity is highly dependent on one
parameter: the number of days after infection (Figure 3). For this reason, and
because no information regarding this parameter was available, errors in
classification might have occurred.
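
The 12.5% figure can be read as a relative difference. As a worked check,
assuming the reference is an intensity of 4 [A.U.] (an assumption: the text does
not state the reference, but 4 is the top tick of the intensity axis in
Figure 4):

```latex
\frac{\Delta I}{I_{\mathrm{ref}}} = \frac{0.5}{4} = 0.125 = 12.5\%
```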


                Figure 3. Level of virus and antibodies in the system.


        [Plot: Healthy vs SARS-COV2. x-axis: Raman shift [cm-1], 600 to 800;
         y-axis: Intensity [A.U.], -1 to 4; series: Healthy, SARS-COV2.]

                  Figure 4. Range of interest from normalized data.


5. Conclusions
The proposed technique for the detection of SARS-COV-2 infection proved to be
useful and reliable. The potential of Raman spectroscopy in the detection of the
virus is considerable, especially due to its ability to capture subtle
differences that are missed by the naked eye and that are key elements for the
classification process. The LDA algorithm combined with leave-one-out
cross-validation created a statistical model that is well suited for the task at
hand. Rapid tests are time-efficient; however, their detection accuracy is only
∼80%. These tests can be done by the patients themselves and require no medical
expertise.[6] The most widely used method is PCR, which requires a laboratory
and trained personnel to obtain the test result. Even though it has a high
accuracy, obtaining a result can take up to 48 hours.[3] Both methods have their
own shortcomings. The Raman spectroscopy based detection method combines the
beneficial parts of both: it does not require a laboratory, as there are plenty
of portable Raman spectrometers; it is highly time-efficient, as the result can
be passed to the patient in less than one hour; it needs no extra training, as
the test can be performed by a nurse; and it has a high accuracy (∼93.5%). All
these advantages add up to a novel method that can impede the rapid spread of
the virus, thus contributing to a faster end of the pandemic.


References
 [1] S. rekha Hanumanthu: Role of Intelligent Computing in COVID-19 Prognosis: A
     State-of-the-Art Review. Chaos, Solitons and Fractals (2020),
     doi: https://doi.org/10.1016/j.chaos.2020.109947.
 [2] A. A. Hussain, O. Bouachir, F. Al-Turjman, M. Aloqaily: AI Techniques for
     COVID-19 (2020),
     doi: https://doi.org/10.1109/ACCESS.2020.3007939.
 [3] T. Ishige, T. Murata, S. Taniguchi, T. Miyabe, A. Kitamura, K. Kawasaki, K.
     Nishimura, M. Igari, H. Matsushita: Highly sensitive detection of SARS-CoV-2 RNA
     by multiplex rRT-PCR for molecular diagnosis of COVID-19 by clinical laboratories.
     Clinica Chimica Acta (2020),
     doi: https://doi.org/10.1016/j.cca.2020.04.023.
 [4] A. J. Izenman: Linear Discriminant Analysis. In: Modern Multivariate Statistical
     Techniques, Springer Texts in Statistics. Springer, New York (2013),
     doi: https://doi.org/10.1007/978-0-387-78189-1_8.
 [5] S. Ludwig, A. Zarbock: Coronaviruses and SARS-CoV-2, Anesthesia & Analgesia (2020),
     doi: https://doi.org/10.1213/ane.0000000000004845.
 [6] G. Mak, K. Cheng, S. Lau, K. Wong, C. Lau, T. Lam, C. Chan, N. Tsang: Evaluation
     of rapid antigen test for detection of SARS-CoV-2 virus, Journal of Clinical Virology (2020),
     doi: https://doi.org/10.1016/j.jcv.2020.104500.
 [7] B. G. Pinto, A. E. Oliveira, Y. Singh, L. Jimenez, A. N. A. Gonçalves, R. L. Ogava,
     R. Creighton, J. P. S. Peron, I. Nakaya: ACE2 Expression is Increased in the Lungs of
     Patients with Comorbidities Associated with Severe COVID-19, The Journal of Infectious
     Diseases (2020),
     doi: https://doi.org/10.1101/2020.03.21.20040261.
 [8] A. Rygula, K. Majzner, K. M. Marzec, A. Kaczor, M. Pilarczyka, M. Baranska:
     Raman spectroscopy of proteins: a review, Wiley Online Library (2013),
     doi: https://doi.org/10.1002/jrs.4335.
 [9] B. A. Taha, Y. Al Mashhadany, M. H. Hafiz Mokhtar, M. S. Dzulkefly Bin Zan,
     N. Arsad: An Analysis Review of Detection Coronavirus Disease 2019 (COVID-19) Based
     on Biosensor Application, Sensors (2020),
     doi: https://doi.org/10.3390/s20236764.
[10] Q. Tu, C. Chang: Diagnostic applications of Raman spectroscopy, Nanomedicine:
     Nanotechnology, Biology and Medicine 8.5 (2012),
     doi: https://doi.org/10.1016/j.nano.2011.09.013.
[11] M. K. Tuck, D. W. Chan, D. Chia, A. K. Godwin, W. E. Grizzle, K. E. Krueger, W.
     Rom, M. Sanda, L. Sorbara, S. Stass, W. Wang, D. E. Brenner: Standard Operating
     Procedures for Serum and Plasma Collection, Journal of Proteome Research (2020),
     doi: https://doi.org/10.1021/pr800545q.
[12] G. Yin, L. Li, S. Lu, Y. Yin, Y. Su, Y. Zeng, et al.: Data and code on serum Raman
     spectroscopy as an efficient primary screening of coronavirus disease in 2019 (COVID-19),
     Nanomedicine: Nanotechnology, Biology and Medicine (2020),
     doi: https://doi.org/10.6084/m9.figshare.12159924.v1.



</pre>