Annotating Protein Structures for Understanding SARS-CoV-2 Interactome

Barbara Puccio (barbara.puccio@unicz.it), Department of Surgical and Medical Sciences, Magna Graecia University, Catanzaro, Italy

Ugo Lomoio (ugo.lomoio@studenti.unicz.it), Department of Surgical and Medical Sciences, Magna Graecia University, Catanzaro, Italy

Luisa Di, Department of Engineering, Unit of Chemical-Physics Fundamentals in Chemical Engineering, Università Campus Bio-Medico di Roma, via Álvaro del Portillo 21, 00128 Rome, Italy

Pietro Hiram Guzzi (hguzzi@unicz.it) and Pierangelo Veltri (veltri@unicz.it), Department of Surgical and Medical Sciences, Magna Graecia University, Catanzaro, Italy

ISSN 1613-0073

Keywords: Protein Structure, Protein Contact Network, Data Annotations

Protein Contact Networks (PCNs) are an emerging paradigm for modelling protein structure. A common approach to interpreting such data is network-based analysis. It has been shown that clustering analysis may reveal allostery in PCNs. Moreover, network embedding has shown good performance in discovering hidden communities and structures in networks. SARS-CoV-2 proteins, and in particular the S protein, have a modular structure that needs to be annotated to understand the complex mechanisms of infection. Such annotations, in particular the highlighting of regions participating in the binding of human ACE2 and TMPRSS, may help the design of tailored strategies for preventing and blocking infection. In this work, we compare some graph embedding approaches with some classical clustering approaches for annotating protein structures. Results show that embedding may reveal interesting structures that constitute the starting point for further analysis.

Introduction

Proteins are polymers made of twenty different amino acids assembled into a linear chain. The linear sequence of amino acids determines the spatial conformation of the protein [1]. Each amino acid is characterised by a central carbon atom (the 𝛼-carbon), a carboxyl group, an amino group and a side chain, which differs for each amino acid (this chain can be hydrophobic, polar or charged). Amino acids are linked to each other by covalent bonds, called peptide bonds [2,3].

The amino acid sequence is known as the primary structure. The secondary structure describes the local folding of the peptide chain (i.e. of protein subsequences), resulting from the interaction between each amino acid and its neighbouring amino acids. The main types of secondary structure are the 𝛼-helix and the 𝛽-sheet. The tertiary structure is the combination of secondary structures into a complex three-dimensional molecular shape. A protein in its 3D conformation is called 'native', and this conformation is closely connected with its biological function. Finally, the quaternary structure is only present in proteins with multiple subunits (peptide chains), which can be identical or different.

In this scenario, Protein Contact Networks (PCNs) have emerged as a relevant paradigm for the analysis of protein molecular structures [4,5]. A PCN is a graph built from the protein structure. A node in a PCN represents a C𝛼 atom of the backbone, while an edge connects two nodes whose atoms lie at a spatial distance between 4 and 8 Å.

PCN descriptors are useful to model and analyse protein functions [4]. PCNs allow the identification of modules in protein molecules through network spectral clustering [6,7,8,9,10], with relevant applications in different biological contexts [11,12].

For instance, the analysis of PCNs allows the detection of phenomena such as allosteric regulation. Allostery is the ability of proteins to transmit a signal from one site to another in response to environmental stimuli; it is related to the transmission of information across the protein from a sensor site (or allosteric site) to the effector site (or binding site) [13,14]. Allostery may also be studied using X-ray or NMR structures corresponding to different activation states, or molecular dynamics simulations of allosteric agent binding. Such methods are usually time and resource consuming; therefore, there is a need for computational methods to detect allostery and then to annotate allosteric regions.

Starting from PCNs, it is possible to detect modules in the protein structure using clustering algorithms. In a graph, a cluster is a group of nodes characterised by strong intra-cluster connections (in terms of number of contacts) and weaker inter-cluster connections. Clustering detects communities (clusters) in a graph, which is directly comparable to the detection of modules in a protein structure. Two methods have been devised to partition PCNs into clusters: a geometrical method, based on the k-means algorithm and spectral clustering, and the clustering of the embedding of the network.

PCNs simplify protein analysis and allow the detection of modules, which are essential in allosteric regulation. On the other hand, graphs consist of a high number of nodes and links (particularly for proteins), so it is challenging to apply mathematical and statistical operations to them directly. In this situation, embeddings appear as a reasonable solution. Building on the potential of graph embeddings, we propose a PCN analysis that applies clustering approaches to embeddings, in order to discover allostery in PCNs and annotate protein structures.

We use PCN-Miner, a software tool implemented in the Python programming language that imports proteins in the Protein Data Bank format and generates the corresponding protein contact network. It also offers a set of algorithms for PCN analysis [15,16,17,18]. In this work we focus on the application of clustering algorithms to embeddings, with the aim of evaluating network embedding approaches in PCN analysis. Our analysis is based on the SARS-CoV-2 spike glycoprotein, in its closed form, and some of its Variants of Concern [11,15,19,20].

Protein Contact Networks

A protein structure can be represented as a complex three-dimensional object, formally defined by the coordinates in 3D space of its atoms. Despite the large availability of protein molecular structure data, there are still many open problems regarding the relationship between protein structures and their functions. For this reason, it is necessary to define simple descriptors that can describe protein structures with few numerical variables. Structure and function are based on the complex network of inter-residue interactions, where residues are identified by the amino acid sequence [21]. Residue interactions are therefore used to define protein interaction networks, which are in turn used to study protein functions. The simplest way to define such a network is to represent the protein structure by means of the 𝛼-carbon (C𝛼) locations. The spatial position of the C𝛼 atoms is still reminiscent of the protein backbone, which allows the most important characteristics of the three-dimensional structure to be highlighted. Starting from the C𝛼 spatial distribution, a distance matrix $d$ is evaluated, where each $d_{i,j}$ represents the Euclidean distance in 3D space between the $i$-th and $j$-th residues, defined as

$$d_{i,j} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \qquad (1)$$

where $(x_i, y_i, z_i)$ and $(x_j, y_j, z_j)$ are the Cartesian coordinates of residues $i$ and $j$, respectively.
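As a minimal sketch of Equation (1), the snippet below reads C𝛼 coordinates from PDB-format text and computes the pairwise Euclidean distance matrix with NumPy. The helper names (`parse_ca_coords`, `distance_matrix`, `atom_line`) and the three-residue toy record are illustrative assumptions and are not taken from PCN-Miner; only the fixed-column layout of ATOM records follows the PDB format.

```python
import numpy as np


def parse_ca_coords(pdb_text):
    """Return an (n, 3) array with the C-alpha coordinates found in ATOM records."""
    coords = []
    for line in pdb_text.splitlines():
        # PDB fixed columns (1-based): atom name 13-16, x 31-38, y 39-46, z 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append([float(line[30:38]), float(line[38:46]), float(line[46:54])])
    return np.array(coords, dtype=float)


def distance_matrix(coords):
    """Pairwise Euclidean distances d[i, j] between the rows of coords (Equation 1)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def atom_line(serial, name, resseq, x, y, z):
    """Format one hypothetical ATOM record (alanine, chain A) in PDB column layout."""
    return f"ATOM  {serial:5d} {name:<4s} ALA A{resseq:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"


# Toy structure (assumed values): one nitrogen atom, which is skipped, and three C-alpha atoms.
toy_pdb = "\n".join([
    atom_line(1, " N  ", 1, 1.0, 1.0, 1.0),
    atom_line(2, " CA ", 1, 0.0, 0.0, 0.0),
    atom_line(3, " CA ", 2, 3.0, 4.0, 0.0),
    atom_line(4, " CA ", 3, 3.0, 4.0, 12.0),
])

coords = parse_ca_coords(toy_pdb)
d = distance_matrix(coords)
print(d)
```

The broadcasting form builds all coordinate differences at once, which is convenient for a few thousand residues; for much larger structures a chunked or tree-based neighbour search may be preferable.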

The matrix $d$ is used to define the Protein Contact Network, a graph-based model of the protein structure. A graph is the most natural structure to represent proteins: nodes (or vertices) are the protein residues, and links (or edges) between the $i$-th and the $j$-th nodes represent residue contacts. In the graph representation there is a link between two residues $i$ and $j$ if their distance $d_{i,j}$ is between 4 and 8 Å. The lower bound excludes all covalent bonds, which are not sensitive to environmental changes (and hence to protein functionality), while the upper bound discards weaker non-covalent interactions (which are not significant for protein functionality).

At this point, it is possible to build the adjacency matrix $A$, whose generic element is defined as:

$$A_{i,j} = \begin{cases} 1 & \text{if } 4 \le d_{i,j} \le 8 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
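The thresholding in Equation (2) can be sketched in a few lines of NumPy. The function name `contact_adjacency` and the toy distance matrix are assumptions made for illustration; the 4 Å and 8 Å bounds are the ones stated in the text.

```python
import numpy as np


def contact_adjacency(d, lower=4.0, upper=8.0):
    """Binary adjacency matrix with A[i, j] = 1 when lower <= d[i, j] <= upper."""
    d = np.asarray(d, dtype=float)
    return ((d >= lower) & (d <= upper)).astype(int)


# Toy distance matrix in angstroms (assumed values) for a four-residue chain.
toy_d = np.array([
    [0.0, 3.8, 6.0, 10.0],
    [3.8, 0.0, 3.8, 7.0],
    [6.0, 3.8, 0.0, 3.8],
    [10.0, 7.0, 3.8, 0.0],
])

A = contact_adjacency(toy_d)
print(A)
```

In this toy case the 3.8 Å entries between sequence neighbours fall below the lower bound and produce no edge, while the 6.0 Å and 7.0 Å pairs become contacts. The diagonal stays at zero because a distance of 0 is also below the lower bound.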

The adjacency matrix of a graph is unique up to the ordering of its nodes. For proteins, the order of nodes (residues) corresponds to the residue sequence (primary structure), so the corresponding network is unique: this establishes a one-to-one correspondence between a protein and its network.

Protein Contact Network Analysis

Graph embeddings are the transformation of property graphs to a vector or a set of vectors. An embedding should capture the graph topology, vertex-to-vertex relationships, and other relevant information about graphs, subgraphs, and vertices. There are a few reasons why graph embeddings are needed:

1. Graphs consist of edges and nodes. Those network relationships can only use a specific subset of mathematics, statistics, and machine learning, while vector spaces have a richer toolset of approaches.
2. Embeddings are more practical than the adjacency matrix since they pack node properties in a vector with a smaller dimension.
3. Vector operations are simpler and faster than comparable operations on graphs.

The development of novel methods for encoding the structural information of a graph for subsequent analysis is a recent research area. These methods are usually referred to as graph representation learning or graph embedding [22,23]. The goal of these approaches is to learn a mapping from graph substructures (i.e. nodes or subgraphs) to points of a low-dimensional vector space $\mathbb{R}^d$ with $d < n$, where $n$ is the dimension of the adjacency matrix [22]. We focus here on node-based embeddings: these methods map nodes to points of the embedding space so that geometric relationships among the embedded objects reflect the structure of the original graph. Since embeddings are points of a Euclidean space, they may be used in other machine learning tasks (e.g. node classification) or in other graph analysis algorithms.

Currently, there exist many algorithms and many classification attempts, which are categorised and described in previous surveys [24,25,26,27,22,28].

The input of representation learning algorithms is an undirected and unweighted graph $G = (V, E)$ with its associated adjacency matrix $A$ and a real-valued matrix of node attributes $X \in \mathbb{R}^{m \times |V|}$. The goal of each algorithm is to map each node into a vector $z \in \mathbb{R}^d$, where $d < |V|$.

Shallow embedding methods encode each node $v_i \in V$ into a single vector through a simple encoding function defined as:

$$\mathrm{ENC}(v_i) = M v_i \qquad (3)$$

where $M$ is the matrix containing all the embedding vectors and $v_i$ is an indicator (one-hot) vector used for selecting the column of $M$ associated with node $v_i$. Each column of $M$ encodes a node of the original graph, and the number of rows $d$ is lower than the number of nodes $n$. These embeddings were initially inspired by matrix factorization approaches. The methods differ in their loss functions and similarity measures.
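To make Equation (3) concrete, the following sketch treats the encoder as a column lookup: multiplying $M$ by a one-hot indicator vector returns the column of $M$ for that node. The matrix values are random placeholders and the names `one_hot` and `enc` are hypothetical; a real method would learn $M$ by minimising its loss function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 5, 2

# Placeholder embedding matrix: one column per node, dim rows (dim < n_nodes).
M = rng.normal(size=(dim, n_nodes))


def one_hot(i, n):
    """Indicator vector with a single 1 at position i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v


def enc(i):
    """ENC(v_i) = M v_i, i.e. the i-th column of M."""
    return M @ one_hot(i, n_nodes)


print(enc(3))
print(M[:, 3])  # same vector, obtained by direct indexing
```

In practice the matrix product is never formed explicitly; indexing the column is equivalent and cheaper, which is why shallow embeddings are often described as lookup tables.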

Workflow of the Experiment

In this work we focused on the comparison between a direct analysis of PCNs using network clustering and a network embedding approach followed by clustering on the embeddings. We used PCN-Miner, implemented in Python 3.8. It uses the SciPy and NumPy libraries for managing matrices; management of PDB files is provided by the ProDy package; network embedding is realised by wrapping the GEM library, and clustering algorithms by CDlib; visualisation of protein structures is done by wrapping the community edition of PyMOL. The analysis starts from protein data. The reference database for protein structures is the Protein Data Bank (PDB, https://www.rcsb.org/), which also defines the PDB format, a standard for recording atomic coordinate files. The first step consists of importing a protein structure in PDB format, from which the PCN is obtained (alternatively, a previously computed PCN can be imported directly). We then access the analysis functionalities: on one hand we work with network clustering, on the other with network embeddings followed by clustering on the embeddings. Finally, we compare the results by means of centrality measures and the participation coefficient, computed for each residue.
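The text does not spell out how the participation coefficient is computed. As a hedged illustration, the sketch below assumes the commonly used definition $P_i = 1 - \sum_c (k_{i,c} / k_i)^2$, where $k_i$ is the degree of residue $i$ and $k_{i,c}$ is the number of its contacts that fall in cluster $c$. The function name and the toy graph (two triangles joined by one edge) are assumptions, not PCN-Miner code.

```python
import numpy as np


def participation_coefficient(adj, labels):
    """P_i = 1 - sum_c (k_ic / k_i)^2 for every node; isolated nodes get 0."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degree = adj.sum(axis=1)
    safe_degree = np.where(degree > 0, degree, 1.0)  # avoid division by zero
    squared_fractions = np.zeros(len(labels))
    for c in np.unique(labels):
        k_ic = adj[:, labels == c].sum(axis=1)  # contacts of each node inside cluster c
        squared_fractions += (k_ic / safe_degree) ** 2
    p = 1.0 - squared_fractions
    p[degree == 0] = 0.0
    return p


# Toy graph: nodes 0-2 and 3-5 form two triangles, connected by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj = np.zeros((6, 6))
for i, j in edges:
    toy_adj[i, j] = toy_adj[j, i] = 1.0
toy_labels = [0, 0, 0, 1, 1, 1]

P = participation_coefficient(toy_adj, toy_labels)
print(P)
```

Under this definition, residues whose contacts all stay inside their own cluster score 0, while the two bridging residues (nodes 2 and 3) score 4/9 because one of their three contacts crosses the cluster boundary.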

We focused on four structures of the Spike protein, in its closed form: the wild type and three variants of concern (alpha, delta, omicron).

Figure 2 shows the structure of the SARS-CoV-2 spike glycoprotein (closed state, PDB code 6VXX). We first built the protein contact network using PCN-Miner. Then we embedded the resulting graph using the HOPE algorithm; each node was embedded into a 64-dimensional vector. Finally, we applied the spectral clustering algorithm. We found some interesting communities that could be annotated as allosteric regions after verification. Figure 3 shows the same structure (PDB code 6VXX) as the result of the clustering analysis, with soft clustering using a normalised Laplacian. The figure highlights the communities found.
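The embedding step can be illustrated with a small dense sketch in the spirit of HOPE, which factorises a high-order proximity matrix. The version below assumes the Katz proximity $S = (I - \beta A)^{-1} \beta A$, a plain SVD, and an embedding formed by concatenating source and target halves; these choices, the function name, and the toy graph are assumptions for illustration and do not reproduce the GEM implementation wrapped by PCN-Miner.

```python
import numpy as np


def hope_embedding(adj, dim, beta=None):
    """Dense HOPE-style embedding of a symmetric adjacency matrix.

    Returns (embedding, S): the first dim // 2 columns of the embedding are the
    source vectors, the remaining columns the target vectors.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    if beta is None:
        # The Katz series converges when beta < 1 / spectral radius of adj
        # (this assumes the graph has at least one edge).
        radius = np.max(np.abs(np.linalg.eigvalsh(adj)))
        beta = 0.5 / radius
    # Katz proximity: S = (I - beta * A)^-1 (beta * A)
    S = np.linalg.solve(np.eye(n) - beta * adj, beta * adj)
    U, s, Vt = np.linalg.svd(S)
    k = dim // 2
    source = U[:, :k] * np.sqrt(s[:k])
    target = Vt[:k, :].T * np.sqrt(s[:k])
    return np.hstack([source, target]), S


# Toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj = np.zeros((6, 6))
for i, j in edges:
    toy_adj[i, j] = toy_adj[j, i] = 1.0

emb, S = hope_embedding(toy_adj, dim=4)
print(emb.shape)

# With all singular vectors kept, source @ target.T recovers S.
full, S_full = hope_embedding(toy_adj, dim=12)
reconstruction = full[:, :6] @ full[:, 6:].T
```

Under this convention a 64-dimensional embedding would consist of 32 source and 32 target coordinates per residue. The rows of the embedding can then be passed to any vector-space clustering algorithm.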

Figure 4 shows the structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). The structure is the result of the embeddings+clustering analysis, with HOPE as the embedding algorithm.

Figure 5 shows the structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian.
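For readers unfamiliar with clustering on a normalised Laplacian, the sketch below shows the simplest case: a hard split into two communities from the sign of the second eigenvector of $L = I - D^{-1/2} A D^{-1/2}$. This is a deliberate simplification and an assumption on our part; the soft clustering used for the figures assigns graded memberships and may use more eigenvectors. The function name and the toy graph are hypothetical.

```python
import numpy as np


def spectral_bipartition(adj):
    """Split a connected graph in two using the normalised Laplacian.

    Assumes a symmetric adjacency matrix with no isolated nodes.
    """
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    if fiedler[0] > 0:  # fix the sign so that node 0 always gets label 0
        fiedler = -fiedler
    return (fiedler > 0).astype(int)


# Toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj = np.zeros((6, 6))
for i, j in edges:
    toy_adj[i, j] = toy_adj[j, i] = 1.0

labels = spectral_bipartition(toy_adj)
print(labels)
```

On the toy graph the split separates the two triangles. For more than two communities, one would typically keep several low eigenvectors and cluster their rows.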

Conclusion

The adjacency matrix describes connections between nodes in the graph. It is a $|V| \times |V|$ matrix, where $|V|$ is the number of nodes in the graph. Each column and each row of the matrix represents a node. Non-zero values in the matrix indicate that two nodes are connected. Using an adjacency matrix as a feature space for large graphs is almost impossible: a graph with 1M nodes has an adjacency matrix of 1M x 1M entries.
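A back-of-the-envelope calculation makes the 1M-node example tangible. The figures below assume dense storage with 8 bytes per entry (float64) and, for comparison, a 64-dimensional embedding per node as used for the spike protein; both storage assumptions are ours.

```python
n_nodes = 1_000_000
bytes_per_value = 8  # float64, an assumed storage type

# Dense adjacency matrix: one value per ordered pair of nodes.
dense_adjacency_bytes = n_nodes * n_nodes * bytes_per_value

# Embedding matrix: 64 values per node.
embedding_dim = 64
embedding_bytes = n_nodes * embedding_dim * bytes_per_value

print(dense_adjacency_bytes / 1e12, "TB for the dense adjacency matrix")
print(embedding_bytes / 1e6, "MB for the embedding matrix")
```

Sparse storage reduces the cost of the adjacency matrix considerably, but the embedding still offers a fixed, small number of features per node.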

Figure 1: Workflow.

Figure 2: Structure of the SARS-CoV-2 spike glycoprotein (closed state, PDB code 6VXX). This structure is the result of the embeddings+clustering analysis, using HOPE for node embedding.

Figure 3: Structure of the SARS-CoV-2 spike glycoprotein (closed state, PDB code 6VXX). This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian.

Figure 4: Structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). This structure is the result of the embeddings+clustering analysis, with HOPE as the embedding algorithm.

Figure 5: Structure of the closed state of the pre-fusion SARS-CoV-2 Delta variant spike protein (PDB code 7SBK). This structure is the result of the clustering analysis, with soft clustering using a normalised Laplacian.

References

[1] P. Kukic, C. Mirabello, G. Tradigo, I. Walsh, P. Veltri, G. Pollastri, Toward an accurate prediction of inter-residue distances in proteins using 2D recursive neural networks, BMC Bioinformatics 15 (2014).
[2] K. Grillone, C. Riillo, F. Scionti, R. Rocca, G. Tradigo, P. H. Guzzi, S. Alcaro, M. T. Di Martino, P. Tagliaferri, P. Tassone, Non-coding RNAs in cancer: platforms and strategies for investigating the genomic "dark matter", Journal of Experimental & Clinical Cancer Research 39 (2020).
[3] J. C. Galicia, P. H. Guzzi, F. M. Giorgi, A. A. Khan, Predicting the response of the dental pulp to SARS-CoV2 infection: a transcriptome-wide effect cross-analysis, Genes & Immunity 21 (2020).
[4] T. Khan, I. Ghosh, Modularity in protein structures: study on all-alpha proteins, Journal of Biomolecular Structure and Dynamics 33 (2015).
[5] E. Vocaturo, P. Veltri, On the use of networks in biomedicine, Procedia Computer Science 110 (2017).
[6] S. Tasdighian, L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, P. Palumbo, G. Mei, A. Di Venere, A. Giuliani, Modules identification in protein structures: the topological and geometrical solutions, Journal of Chemical Information and Modeling 54 (2014).
[7] S. Gu, M. Jiang, P. H. Guzzi, T. Milenković, Modeling multi-scale data via a network of networks, Bioinformatics 38 (2022).
[8] L. Di Paola, H. Hadi-Alijanvand, X. Song, G. Hu, A. Giuliani, The discovery of a putative allosteric site in the SARS-CoV-2 spike protein using an integrated structural/dynamic approach, Journal of Proteome Research 19 (2020).
[9] P. Vizza, A. Curcio, G. Tradigo, C. Indolfi, P. Veltri, A framework for the atrial fibrillation prediction in electrophysiological studies, Computer Methods and Programs in Biomedicine 120 (2015).
[10] P. Vizza, G. Tradigo, D. Mirarchi, R. B. Bossio, N. Lombardo, G. Arabia, A. Quattrone, P. Veltri, Methodologies of speech analysis for neurodegenerative diseases evaluation, International Journal of Medical Informatics 122 (2019).
[11] F. Ortuso, D. Mercatelli, P. H. Guzzi, F. M. Giorgi, Structural genetics of circulating variants affecting the SARS-CoV-2 spike/human ACE2 complex, Journal of Biomolecular Structure and Dynamics (2021).
[12] M. E. Gallo Cantafio, K. Grillone, D. Caracciolo, F. Scionti, M. Arbitrio, V. Barbieri, L. Pensabene, P. H. Guzzi, M. T. Di Martino, From single level analysis to multi-omics integrative approaches: a powerful strategy towards the precision oncology, High-Throughput 7 (2018) 33.
[13] L. D. Paola, G. Mei, A. D. Venere, A. Giuliani, Disclosing allostery through protein contact networks, Springer, 2021.
[14] Y. Ren, A. Sarkar, P. Veltri, A. Ay, A. Dobra, T. Kahveci, Pattern discovery in multilayer networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 19 (2022).
[15] P. H. Guzzi, L. Di Paola, A. Giuliani, P. Veltri, Design and development of PCN-Miner: A tool for the analysis of protein contact networks, arXiv preprint arXiv:2201.05434 (2022).
[16] G. Canino, P. H. Guzzi, G. Tradigo, A. Zhang, P. Veltri, On the analysis of diseases and their related geographical data, IEEE Journal of Biomedical and Health Informatics 21 (2015).
[17] P. H. Guzzi, G. Tradigo, P. Veltri, Using dual-network-analyser for communities detecting in dual networks, BMC Bioinformatics 22 (2021).
[18] J. Kumar Das, G. Tradigo, P. Veltri, P. Guzzi, S. Roy, Data science in unveiling COVID-19 pathogenesis and diagnosis: Evolutionary origin to drug repurposing, Briefings in Bioinformatics 22 (2021).
[19] D. Mercatelli, E. Pedace, P. Veltri, F. M. Giorgi, P. H. Guzzi, Exploiting the molecular basis of age and gender differences in outcomes of SARS-CoV-2 infections, Computational and Structural Biotechnology Journal 19 (2021).
[20] E. Zumpano, A. Fuduli, E. Vocaturo, M. Avolio, Viral pneumonia images classification by multiple instance learning: preliminary results, in: 25th International Database Engineering & Applications Symposium, IDEAS 2021, Montreal, QC, Canada, July 14-16, 2021, ACM, 2021. doi:10.1145/3472163.3472170.
[21] L. Di Paola, M. De Ruvo, P. Paci, D. Santoni, A. Giuliani, Protein contact networks: an emerging paradigm in chemistry, Chemical Reviews 113 (2013).
[22] W. L. Hamilton, R. Ying, J. Leskovec, Representation learning on graphs: Methods and applications, arXiv preprint arXiv:1709.05584 (2017).
[23] E. Vocaturo, E. Zumpano, L. Caroprese, Convolutional neural network techniques on X-ray images for COVID-19 classification, in: Y. Huang, L. A. Kurgan, F. Luo, X. Hu, Y. Chen, E. R. Dougherty, A. Kloczkowski, Y. Li (Eds.), IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2021, Houston, TX, USA, December 9-12, 2021, IEEE, 2021. doi:10.1109/BIBM52615.2021.9669784.
[24] P. Cui, X. Wang, J. Pei, W. Zhu, A survey on network embedding, IEEE Transactions on Knowledge and Data Engineering 31 (2018).
[25] C. Su, J. Tong, Y. Zhu, P. Cui, F. Wang, Network embedding in biomedical data science, Briefings in Bioinformatics 21 (2020).
[26] W. Nelson, M. Zitnik, B. Wang, J. Leskovec, A. Goldenberg, R. Sharan, To embed or not: network embedding as a paradigm in computational biology, Frontiers in Genetics 10 (2019).
[27] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: A survey, Knowledge-Based Systems 151 (2018).
[28] M. Cannataro, P. H. Guzzi, A. Sarica, Data mining and life sciences applications on the grid, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (2013).