Solving the problem of antibody grouping based on cross- inhibition index using hierarchical clustering methods Oleksandr Zelinskyia, Vitaliy Horlatcha, Yuri Lebedinb and Yaryna Paslavskaa a Ivan Franko National University of Lviv, 1 Universytetska St., Lviv, 79000, Ukraine b Xema OY, Myllymäenkatu 21, Lappeenranta, 53550, Finland Abstract Due to increasing number of viral diseases (including Covid-19) rapid research with the purpose of their detection, prevention, and treatment is crucial. This article considers a problem of finding two optimal antibodies to any virus that is important for detection of disease and development of tests but not for creation of vaccine. It is worth noting that the target protein (nucleoprotein), described in this article, is the only generally established target for SARS- CoV-2 diagnostics, using antigen rapid tests or any other antigen detection tools. Possible ways of solving the aforementioned problem were described using hierarchical clustering algorithm with different linkage methods. Affirmative results of dividing antibodies into groups were achieved. Keywords 1 Hierarchical clustering, SARS-CoV-2, antibodies, viruses 1. Introduction The Covid-19 epidemic has shown that it is still quite difficult for humanity to control and fight acute respiratory viral infections. According to WHO, almost 613 million people worldwide have been infected with COVID-19 and more than 6.5 million people have died due to the disease [3]. However, this is not the first and probably not the last such pandemic. Therefore, it is crucial to conduct research as quickly as possible, so that the diseases could be easily detected and treated. The next step is the development of vaccines, as well as tests that show the number of antibodies to a particular virus. It is clear that rapid detection of the disease helps to isolate spreading of the virus and treat a patient more effectively, and vaccination improves immunity to a particular virus and reduces the likelihood of negative or even fatal consequences. Nowadays, computers are a very powerful tool that allows solving not only mathematical problems, but also biological, chemical, and medical ones. Different types of models and algorithms including machine learning algorithms are used for this purpose. Moreover, the usage of computers helps scientists to reduce the number of experiments and routine work in laboratories around the world. The purpose of this work is to consider the problem of finding two optimal antibodies to any virus (for example, the SARS-CoV-2) and propose possible ways to solve it using machine learning algorithms, more precisely agglomerative clustering algorithms. 2. Formulation of the problem There is a molecule of the SARS-CoV-2 and a set of antibodies, which consists of 43 elements. The task is to attach only two antibodies to the given virus molecule. In this article, the target molecule is IDDM-2022: 5th International Conference on Informatics & Data-Driven Medicine, November 18-20, 2022, Lyon, France; EMAIL: sashko.zel2000@gmail.com (OZ); vitaliy.horlatch@lnu.edu.ua (VH); lebedin@xema.fi (YuL); p.yaryna@gmail.com (YaP). ORCID: 0000-0003-1247-7511 (OZ); 0000-0001-5401-1731 (VH); 0000-0003-4250-4322 (YuL); 0000-0003-4834-9597 (YaP). ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) the molecule of protein (nucleoprotein). This protein is the only generally established target for COVID- 19 diagnostics (not the vaccine) used by practically all antigen rapid tests [10] and other antigen detection tools globally. Only two antibodies are required to form a "sandwich", which is a standard way of the determination of any protein substance. The antibodies can be either different or the same (to distinguish them, one of them is marked with "*"). For simplicity, we will assume that the experiment happens in 2D, not 3D. Antibodies are two circles of approximately the same size with a small "beak" for interaction with the virus. Antibodies attach to the virus molecule, which is represented as a smaller circle. A schematic representation of this process can be seen in Figure 1. Figure 1: A schematic model of the attachment of antibodies to a viral molecule In the case of the considered problem, the molar weight of the target protein molecule is 45 kDa and the molar weight of the antibodies is 180 kDa. Since there is a need to attach two antibodies to a virus molecule, the main task is to find two antibodies (they can be identical) that are located at an optimal distance from each other. This means that they cannot overlap or locate too close to each other otherwise they start to compete and one of them cannot be attached. It was discovered that this problem mostly can be solved by dividing the list of antibodies into groups according to how much they interfere with each other, or in other words, whether they can attach to the virus in the same region. If two antibodies belong to different groups, there is a very high probability that they will bind in different areas and interact better than if they were from the same group. However, in some cases, antibodies still will not be able to attach to the virus molecule. The data considered in this paper are obtained from the experiment performed by Xema OY, Lappeenranta, Finland which consisted of several parts: 1. Conjugation of HRP to monoclonal antibodies One of the most popular methods of conjugation HRP (Faizyme, SAR) to antibodies was used. Periodate oxidized HRP formed a covalent linkage with mAbs after the reduction of the Schiff base by sodium borohydride [9]. 2. Direct binding of mAbs to N-Ag variants N-Ag preparations were diluted to 0,1 ug/ml by carbonate buffer pH 9,5. One hundred microliters of the solution were placed into the wells of high adsorption capacity polystyrene microplate (KHB, China) and incubated overnight at +4 °C. After removing the microwell content by vacuum, the microwells were washed once by ELISA [8] washing solution - 0,1% Tween 20 (Serva, Germany) in 0,9% sodium chloride (Merck, Germany) and filled with ELISA blocking solution (0,1M phosphate buffer containing 0.9% NaCl and 0,5% hydrolyzed casein) for 2 hours at ambient temperature, and then dried at ambient temperature for 48 hours. The mAbs were diluted by ELISA buffer (0,1M phosphate buffer containing 0,9% NaCl and 0,1% hydrolyzed casein) at a uniform concentration of 1 ug/ml. One hundred ul of mAb solution was incubated in the wells for 30 minutes at 37 °C. The wells were washed thrice with ELISA washing solution, and HRP-conjugated sheep anti-mouse Ig-HRP conjugate (Cat# AS302-HRP, Xema) in working dilution was added to the wells for another 30 minutes at 37 °C. After 5 washing with ELISA washing solution, the TMB chromogenic substrate (Cat#R055, Xema) was added into the wells for 15 minutes, the reaction was stopped by the addition of 5% sulfuric acid and optical density at 450 nm (OD450) was measured on HiPo microplate reader (Biosan, Latvia) 3. Cross-inhibition of mAbs by direct binding to solid phase N-Ag. Full-length N-Ag was coated onto the surface of polystyrene wells at 0,5 ug/ml (see the previous paragraph). In the preliminary test, each HRP-conjugated mAb was serially diluted (10x) in the microwells from 1:100 to 1:1 million and incubated for 30 minutes at 37 °C. Then the reaction was finalized by washing, TMB substrate, and stop solution as described in the previous paragraph. The dilution factor of each conjugate giving the OD450 within the range 1,0-1,5 was used as working dilution for the main cross-inhibition experiment as follows. Fifty microliters of the working dilution of each HRP-conjugated mAb were added into the antigen- coated microwells concurrently with the equal volume of ELISA buffer (reference wells) or all mAbs diluted to 10 ug/ml in the same buffer. After 30 minutes of incubation at 37 °C, the reaction was finalized as described above. All the combinations were run in duplicates. The data for each combination of HRP labeled and unlabeled mAbs are shown as the inhibition percentage: (average OD450 of actual combination – average OD450 of reference wells)/average OD450 of reference wells. Data are presented in the form of a table with 43 rows that represent antibodies and 32 columns that represent marked antibodies, where each cell is the cross-inhibition index of the marked antibody and unmarked. In the row labeled as "blank", the maximum values of the cross-inhibition index for the corresponding marked antibody are given. The value in each cell ranges from zero to the value in the "blank" cell of the corresponding column. An example of data is shown in Figure 2. Figure 2: Part of the dataset 3. Solutions for the problem Eventually, the problem, described in this article, is the dataset elements grouping problem, which is considered to be a problem of clustering. That is why it was decided to apply one of the most popular types of clustering – the hierarchical algorithms, namely its agglomerative subspecies. There were chosen several linkage methods [1]: • Ward linkage – the increase in variance for the cluster being merged • Complete linkage – the maximum distance between elements of each cluster • Average linkage – the mean distance between elements of each cluster • Single linkage – the minimum distance between elements of each cluster In addition, it was decided to use the simplest Euclidean distance (1) as a metric (1) 𝑑(𝑎, 𝑏) = √∑(𝑎𝑖 − 𝑏𝑖 )2 , 𝑖 Before applying any algorithm, equation (2) was applied to each cell except the “blank” row. −(𝑐𝑒𝑙𝑙𝑖,𝑗 − 𝑏𝑙𝑎𝑛𝑘𝑗 ) (2) 𝑐𝑒𝑙𝑙𝑖,𝑗 = , 𝑏𝑙𝑎𝑛𝑘𝑗 The new values represent the percentage ratio between the value in the cell and the maximum value for the corresponding column. The new values are in the range of 0 to 1. To develop an application for solving the described problem, the Python programming language was used. In particular, the “pandas” library was used to work with data and the “scikit-learn” library was used for clustering [4, 7]. The threshold was selected using the Elbow method based on the vector of distances between clusters [5, 6]. 4. Results The expected results are shown in Table 1. Based on it we aim to obtain 11 groups (or 7 large groups) with different numbers of antibodies in each of them. From the experiment, it is known that the best interaction will be between antibodies from the group 3B (X155, X41, X213, X32) and 4A (NP3706) or 4B (X211). Table 1 Expected result 1A 1B 1B/2 2 2B/3 3A 3B 4A 4B 4C 5 NP1501 X190 NP1512 NP1502 NP1528 X202 X32 NP3706 X211 X215 X220 NP1514 NP1526 NP1521 NP1503 X218 X41 X275 NP1516 X200 NP1508 NP1518 X155 NP1517 X201 NP1510 NP1527 X212 NP1507 NP1520 X213 NP1522 X217 NP1525 X223 X221 X224 X271 X233 NP3701 NP1524 NP3708 NP3715 As a result, 4 different outputs for each linkage method were received. The dendrogram in Figure 3 shows results for the usage of Ward linkage with a distance threshold equal to 1.5. Figure 3: Dendrogram for agglomerative clustering with Ward linkage As a result of this algorithm, 7 clusters were identified. Table 2 shows the result of clustering where each column contains a list of antibodies that belong to the corresponding cluster. From the result, it is obvious that cluster number 1 matches group 3B, cluster 5 completely matches group 2B/3, and cluster 7 matches group 1A (they are marked in green). Also, cluster 3 combines groups 1B, and 2, and cluster 4 corresponds to group 3A without the element NP1527, which is in cluster 6, which also contains groups 4A and 4B (they are marked in yellow and orange). Table 2 Result for agglomerative clustering with Ward linkage 1 2 3 4 5 6 7 X41 NP1521 X200 X202 NP1528 NP1527 NP1517 X32 X275 X190 NP1518 X211 NP1514 X233 X215 X221 X218 NP3706 NP1516 X224 NP1512 X201 NP1507 X223 X220 X271 NP1501 X217 NP3708 X213 NP3701 X212 NP1526 X155 NP1525 NP3715 NP1522 NP1524 NP1520 NP1502 NP1503 NP1510 NP1508 The dendrogram in Figure 4 shows the results of the usage of complete linkage with a distance threshold equal to 1.2. Figure 4: Dendrogram for agglomerative clustering with complete linkage As a result of this algorithm, 9 clusters were identified. Table 3 shows the result of clustering. As shown, cluster number 4 matches group 1A and cluster 8 completely matches group 2B/3 (they are marked in green). Also, cluster 2 combines groups 1B and 2, cluster 5 corresponds to group 3A without the element NP1527, which is in cluster 3, which also contains groups 4A and 4B, in addition, cluster 6 and cluster 9 contain the elements from group 3B (they are marked in yellow and orange). Table 3 Result for agglomerative clustering with complete linkage 1 2 3 4 5 6 7 8 9 X275 X221 X211 NP1507 X202 X213 NP1521 NP1528 X217 NP1512 X201 NP1527 NP1501 NP1518 X32 X215 X155 X200 NP3706 NP1516 X218 X41 X220 NP3715 X190 NP1517 X233 X271 NP1514 X224 NP3701 NP1524 NP1526 X212 NP1525 X223 NP1522 NP3708 NP1502 NP1503 NP1520 NP1508 NP1510 The dendrogram in Figure 5 shows the results of the usage of average linkage with a distance threshold equal to 0.97. Figure 5: Dendrogram for agglomerative clustering with average linkage As a result of this algorithm, 11 clusters were identified. Table 4 shows the result of clustering. As shown, cluster number 2 matches group 3B, cluster 5 completely matches group 1A, and cluster 8 matches group 2B/3 (they are marked in green). Also, cluster 6 combines groups 1B, and 2, and clusters 9, 10, and 11 correspond to group 3A without the element NP1527, which is in cluster 6, which also contains groups 4A and 4B (they are marked in yellow and orange). Table 4 Result for agglomerative clustering with average linkage 1 2 3 4 5 6 7 8 9, 10, 11 X275 NP3715 X211 X220 NP1501 NP1502 NP1512 NP1528 X202 NP1518 X218 NP1521 NP1524 NP3706 NP1517 NP1503 X215 X32 NP1527 NP1507 X221 X41 NP1514 NP1508 X155 NP1516 NP1510 X212 NP3701 X213 X200 X217 X190 X223 NP1520 X224 NP1522 X233 NP1525 X271 NP1526 X201 NP3708 The dendrogram in Figure 6 shows the results of the usage of a single linkage with a distance threshold equal to 0.75. Figure 6: Dendrogram for agglomerative clustering with single linkage As a result of this algorithm, 13 clusters were identified. Table 5 shows the result of clustering. As shown, cluster number 2 matches group 3B, cluster 3 completely matches group 1A and cluster 6 matches group 1A, and cluster 13 matches group 4B (they are marked in green). Also, cluster 4 combines groups 1B, and 2, and cluster 6 and 11 contains items from group 5 (they are marked in yellow and orange). Table 5 Result for agglomerative clustering with single linkage 1 2 3 4 5 6 7 8 9 10 11 12 13 NP1527 X213 NP1517 X201 NP1521 X220 X218 NP1518 NP1528 X202 X275 NP1512 X211 NP3706 X32 NP1514 X200 X215 X41 NP1507 X221 X155 NP1516 X190 NP1524 NP1501 NP3708 X212 NP3701 NP3715 NP1502 X217 NP1526 X223 NP1525 X224 NP1522 X233 NP1503 NP1508 NP1520 X271 NP1510 5. Conclusion As a metric of accuracy, the total amount of elements in the clusters, which fully correspond to the expected result, was taken. Based on this metric, it is obvious that the algorithm, which used a single linkage method, gives the best result. However, the algorithms that used ward linkage and average linkage methods are not much worse. Surprisingly, the algorithm, which used the complete linkage method is the worst. Even though the amount of data may seem to be small (40x30 matrix), the developed application does the amount of work in a short time (1-2 minutes), that would take a person several days to complete. Moreover, as the sample data size increases, the amount of time it takes for the computer to execute the algorithm will remain small compared to the time it would take a person to perform the same task. In conclusion, hierarchical clustering methods have shown themselves to be quite suitable for a given problem. However, they do not take into account the order in which it forms the clusters yet (the order of the clusters is not the same as the order of the groups in the expected result), but it is also a key aspect of this problem. 6. References [1] F. Nielsen, Introduction to HPC with MPI for Data Science, Chapter 8: Hierarchical Clustering, Springer, Switzerland, 2016, 195-211. https://www.doi.org/10.1007/978-3-319-21903-5_8 [2] O. Zelinskyi, V. Horlatch, Yu. Lebedin, Development of antibody clusterization system based on coefficient of cross-inhibition, International Student Scientific Conference of Applied Mathematics and Computer Science (ISSCAMCS – 2022), May 5-6, 2022, Lviv, Ukraine, 8-12, URL: https://ami.lnu.edu.ua/wp-content/uploads/2022/05/ISSCAMCS-2022.pdf [3] WHO Coronavirus (COVID-19) Dashboard, 28 September 2022, URL: https://covid19.who.int/ [4] Scikit-learn documentation, AgglomerativeClustering, 2022, URL: https://scikit- learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html?highlight=ag#s klearn.cluster.AgglomerativeClustering [5] Kneed documentation, Parameter Example, 2020, URL: https://kneed.readthedocs.io/ en/stable/parameters.html [6] I. D. Baruah, Cheat sheet for implementing 7 methods for selecting the optimal number of clusters in Python, 2020, URL: https://towardsdatascience.com/cheat-sheet-to-implementing-7-methods- for-selecting-optimal-number-of-clusters-in-python-898241e1d6ad [7] B. Alam, Implementation of Hierarchical Clustering using Python, 2022, URL: https://hands- on.cloud/implementation-of-hierarchical-clustering-using-python/ [8] JM. Anaya, Y. Shoenfeld, A. Rojas-Villarraga, Autoimmunity: From Bench to Bedside, 2018, Bogota, Colombia), URL: https://www.ncbi.nlm.nih.gov/books/NBK459443/ [9] PK. Nakane, A. Kawaoi, Peroxidase-labeled antibody. A new method of conjugation, Journal of Histochemistry & Cytochemistry, 1974; 22(12), 1084-1091. https://doi.org/10.1177/22.12.1084 [10] R.Yu. Hrytsko, H.I. Bila, R.O. Bilyy, Test for coronavirus – what does it really means for the patient?, , Infectious Diseases, I.Horbachevsky Ternopil National Medical University, 2020, 65- 72, URL: https://ojs.tdmu.edu.ua/index.php/inf-patol/article/download/11287/10737/41013 [11] G. Lippi, A.-M. Simundic, M.Plebani, Potential preanalytical and analytical vulnerabilities in the laboratory diagnosis of coronavirus disease 2019 (COVID-19), Clinical Chemistry and Laboratory Medicine (CCLM), 2020, 58 (7), 1070-1076. https://doi.org/10.1515/cclm-2020-0285