=Paper=
{{Paper
|id=Vol-3831/paper14
|storemode=property
|title=Identifying Candidates for Protein-Protein Interaction: A Focus on NKp46's Ligands
|pdfUrl=https://ceur-ws.org/Vol-3831/paper14.pdf
|volume=Vol-3831
|authors=Alessia Borghini,Federico Di Valerio,Alessio Ragno,Roberto Capobianco
|dblpUrl=https://dblp.org/rec/conf/explimed/BorghiniVRC24
}}
==Identifying Candidates for Protein-Protein Interaction: A Focus on NKp46's Ligands==
<pdf width="1500px">https://ceur-ws.org/Vol-3831/paper14.pdf</pdf>
<pre>
                                Identifying Candidates for Protein-Protein
                                Interaction: A Focus on NKp46’s Ligands
                                Alessia Borghini (1,∗), Federico Di Valerio (1,∗), Alessio Ragno (1) and Roberto Capobianco (2)
                                (1) Sapienza University, Rome
                                (2) Sony AI


                                              Abstract
                                              Recent advances in protein-protein interaction (PPI) research have harnessed the power of artificial
                                              intelligence (AI) to enhance our understanding of protein behaviour. These approaches have become
                                              indispensable tools in the field of biology and medicine, enabling scientists to uncover hidden connections
                                              and predict novel interactions. The experimental processes used to analyze and validate interactions
                                              between proteins are usually expensive and time-consuming. With this work, we can reduce these
                                              costs by strategically filtering and computationally validating the proteins that might take
                                              part in the interaction at hand. To help broaden the repertoire of known interacting
                                              proteins, we present a method for the systematic screening of proteins that exhibit a high affinity for
                                              interaction with a chosen protein. Specifically, building upon already known protein interactions,
                                              we exploit the self-explainability of the deep learning model DSCRIPT to search and find promising
                                              protein candidates for a determined PPI. We analyze and rank the candidates using various strategies,
                                              and then employ AlphaFold2 to validate the resulting interactions. Consequently, we compare our AI-
                                              driven methodology with traditional bioinformatics approaches commonly used to find potential protein
                                              candidates. Throughout the overall process, explanatory data is obtained, among which is an informative
                                              contact map that elucidates the potential interaction between a protein of the known interaction and
                                              the predicted proteins. As a case study, we apply our method to deepen our understanding of NKp46’s
                                              ligand repertoire, which is not yet fully uncovered.

                                              Keywords
                                              Protein-protein interaction, Explainable AI, Natural killer cell


                                1. Introduction
                                Predominant computational models employ data-driven algorithms that assess pairs of proteins,
                                evaluating their likelihood of interaction based on their primary features, thereby categorizing
                                the pairs as either interacting or non-interacting. Unraveling protein-protein interactions
                                (PPIs) represents a fundamental challenge in bioinformatics. Despite the extensive research
                                efforts dedicated to identifying PPIs, a significant discrepancy persists between the number of
                                experimentally verified interactions and the vast array of PPIs in biological systems.
                                   The fraction of PPI networks that have been experimentally mapped is minimal, primarily
                                due to the prohibitive costs and extensive time investments required by traditional experimental
                                methodologies. Consequently, the deployment of high-throughput computational strategies
                                becomes indispensable for the systematic discovery of protein interactions. When employed
alongside experimental techniques, these computational techniques substantially elevate
the fidelity and precision of PPI predictions. PPI models forecast potential binding between
pairs of proteins, starting from their amino acid sequences. Among state-of-the-art
methods, some studies [1, 2, 3, 4] propose using AlphaFold to fold the dimer of the pair
and predict contact points, providing accurate results. However, while this approach is very accurate,
each such prediction generally takes a moderate amount of time, which makes large-scale searches
for novel interacting proteins challenging. Recognizing these needs, our study concentrates on the
strategic identification of potential interacting protein candidates, informed by experimentally
pre-established interaction data.

                                EXPLIMED - First Workshop on Explainable Artificial Intelligence for the medical domain - 19-20 October 2024, Santiago
                                de Compostela, Spain
                                Email: ragno@diag.uniroma1.it (A. Ragno)
                                © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   Rather than searching indiscriminately for protein pairs with potential for interaction, our
methodology exploits the explanations of deep neural networks to pinpoint proteins with a high
propensity for interaction with another chosen protein. This original and simple approach
begins with the choice of a protein pair known to interact experimentally, using one
protein as a template to model the interaction with its counterpart. The template protein undergoes a
computational process and subsequently serves as a basis to identify other proteins exhibiting
analogous interaction potential.
   To reduce the overall process time, we propose to use an interpretable, sequence-based,
structure-aware, genome-scale deep learning model for protein-protein interaction, DSCRIPT [5, 6], to
seek potential candidates that could take part in the interaction with a pre-chosen protein. In
this way, we optimize the starting phase by choosing only the candidates with the highest
potential for interaction. DSCRIPT¹ is an attention-based PPI model that can be explained
through visualization of the attention scores used to estimate the interaction between two proteins
(more on this in Section 2). Indeed, DSCRIPT generates contact maps of the predicted interaction
that are coherent with the ground truth [5]. This inherent interpretability of DSCRIPT allows us
to better strategize how to filter the potential protein candidates for interaction with a chosen
protein. The high-potential proteins found through the inner representations computed by
DSCRIPT (obtained with different strategies) were then validated by using both DSCRIPT itself
and the SpeedPPI² model [7], an optimized protein structure prediction
method based on the AlphaFold2 (AF2) [8] model. This step was crucial in confirming the
biological relevance of the interactions identified by DSCRIPT.
   Our case study is the recently discovered interaction between NKp46, a receptor of natural killer
(NK) cells, and calreticulin (CRT) [9]. We focus on finding proteins similar to CRT that could
potentially interact with NKp46. As part of the innate immune system, NK cells play a crucial
role in detecting and eliminating infected, transformed, and stressed cells [10, 11, 12]. NKp46 is
encoded by the NCR1 gene and is the most evolutionarily ancient NK cell receptor, expressed by most
NK cells and some innate lymphoid cells. Among the natural cytotoxicity receptors (NCRs) that
activate NK cells, NKp46 stands out for its ability to recognize and target a broad spectrum of
tumour cells [13, 14, 9]. This remarkable ability is related to the receptor’s interactions with its
ligands, yet the specific identity of these ligands remains unclear [15]. Because its activation is
associated with the destruction of cancer cells, NKp46 is a potential target for cancer
immunotherapy. However, the scarcity of identified ligands for NKp46 hinders its therapeutic
application.

1 https://github.com/samsledje/D-SCRIPT?tab=readme-ov-file
2 https://github.com/patrickbryant1/SpeedPPI

   In this work, we specifically aim to identify potential proteins that might interact with NKp46
utilizing PPI models’ inherent interpretability and inner representations.
   Overall, we provide the following contributions:

    • Development of a systematic screening method for proteins that exhibit a high affinity for
      interaction with a chosen protein, leveraging the self-explainability of the deep learning
      model DSCRIPT;
    • Utilization of AlphaFold2 to validate the interactions predicted by DSCRIPT, comparing
      the AI-driven methodology with traditional bioinformatics approaches;
    • Finding potential ligands for NKp46 that may play an important role in the human immune
      system.

   The remainder of this work is organized as follows: Section 2 reviews the literature on
methods related to PPI; in Section 3 we describe in detail DSCRIPT and SpeedPPI, on which
we base our work, and present our proposed approach; Section 4 provides the experimental
procedures to validate our approach using NKp46 as a case study; Section 5 reports the results;
finally, in Section 6 we wrap up our results, presenting the limitations and future directions of
our work.


2. Related Work
Protein-protein interaction (PPI) prediction has been addressed by several works that leverage prior
knowledge from known interacting protein sequences and employ machine-learning (ML) techniques [16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. Some approaches focus solely on amino acid sequences,
employing strategies such as counting amino acid triplets [19, 23, 26], defining signatures as
sets of subsequences [20], assessing auto-correlation values of physicochemical scales [24, 28],
or analyzing normalized counts of amino acid residues or pairs [25]. These sequence-based
methods have demonstrated prediction accuracies ranging from 70% to 84% on human datasets
and around 70% on yeast datasets. Additionally, several methods incorporate information about
protein domains [29, 30], which has proven to be informative for PPI prediction [27]. However,
domain-based methods are limited in applicability to proteins lacking domain assignments.
   Identifying homologous proteins is a common strategy for inferring the functions of newly
discovered proteins, as homologs typically share similar functions and three-dimensional
structures. This deductive approach has also been employed in predicting PPIs, assuming
that homologous proteins exhibit similar interaction patterns and functions [31]. Traditionally,
a pair of interacting proteins in one species and their corresponding orthologs in another
species, known to interact with each other, are termed interaction-orthologs (interologs) [32, 33].
However, the concept can be expanded to include interaction-homologs, as the distinction
between orthologs and paralogs is not always clear-cut [31, 34].
   A concept similar to ours was first proposed in 2001 by Chen and Zhi [35], who called it
inverse docking. It refers to computationally docking a specific
small molecule of interest to a library of receptor structures. The technique may be used to
identify new potential biological targets of known compounds [36, 37, 38], or to identify targets
for compounds among a family of related receptors [39]. Another application is to predict a
compound’s pharmacological profile [40] or to generate a virtual selectivity profile characterizing
the inhibitors’ promiscuity [41]. Given the multi-faceted nature of a pharmacologically active
compound’s biological effects, inverse docking is especially helpful, because it may generate
new hypotheses for the action mechanism. In the case of inverse docking and similar techniques,
to our knowledge, only small molecules are considered, whereas we include ligands/proteins
without constraints on dimensions, sequence length or structure.
   Deep learning models can learn patterns in data that are often unclear to
humans and difficult to spot using conventional methods. This explains why
deep learning models are increasingly common in scientific research. For our purpose, the
characteristics of deep learning models provide a clear advantage over standard bioinformatic
tools and approaches. These models can identify proteins that are semantically relevant and
that might be overlooked by other analyses and methods when studying interactions with a
specific chosen protein. We therefore expect that using deep learning models to handle protein
information and representations gives us a higher probability of finding the most
meaningful proteins for our purpose.
   One notable deep learning technique is DSCRIPT, a sequence-based, structure-aware model
for predicting protein-protein interactions. It bypasses the need for structural or experimental
interaction data, capturing complex patterns within protein sequences indicative of interaction
potential. It employs convolutional and recurrent neural networks to infer potential contact
points between proteins, simulating the physical interaction space and generating an inter-
protein contact map. DSCRIPT’s key conceptual advance lies in implementing an interpretable,
structure-based model despite having only sequence-based inputs. Leveraging recent advance-
ments in protein language modeling [42], it constructs informative protein representations
implicitly endowed with structural information. The model’s generalizability and interpretabil-
ity stem from its ability to learn informative geometric representations of proteins, transforming
protein embeddings into a 2D contact map. The authors hypothesized that proteins with
similar embeddings are likely to interact similarly, enabling the discovery of new interacting
pairs. Evaluation against other protein sequence representations and BLAST [43] searches con-
firmed DSCRIPT’s efficacy in identifying interactions. Additionally, the model’s interpretability
aids in predicting inter-protein docking contacts, producing contact maps consistent with the
ground-truth contacts.
   Given DSCRIPT’s ability to generate meaningful protein representations and its interpretabil-
ity through contact maps, it emerged as a logical choice for screening potential interaction
protein candidates. Its inherent interpretability made it a preferred option for our approach.
   To validate the computed proteins and interactions, we use the SpeedPPI [7] model, which
represents a significant advancement in the field of computational biology, particularly in the
rapid construction of protein-protein interaction networks. The model operates on the principle
of evaluating pairwise interactions through AlphaFold2 following the FoldDock pipeline [44],
which is specifically tailored for PPIs. We were inspired by the case study in [45].


3. Methods
In this section, we present the mathematical foundation for our approach. We divide it into two
parts: the former presents the first step of our approach, which consists of searching candidates
and identifying proper filters; the latter focuses on validating the results.

3.1. Identification of potential candidates
The overall process starts from a known interaction between a target protein, T, and a ligand-
protein, L. The aim is to find proteins that might interact with protein T similarly to L. We
retrieve the proteins’ data from the STRING [46] and UniProt [47] databases. After obtaining a set
of proteins of interest, we pass them through the DSCRIPT model to get their representations.

Embeddings These representations are also referred to as “embeddings”. An embedding is
a numerical representation of objects, words, or entities in a continuous vector space. This
representation usually encapsulates semantic and syntactic information in a dense, fixed-length
vector format, facilitating computational operations.

DSCRIPT To retrieve the embeddings, we utilize the pre-trained protein language model developed
by Bepler and Berger [42], on which DSCRIPT relies to encode its input sequences. This model
consists of a Bi-LSTM [48] trained on three distinct types of data: the proteins’ SCOP (Structural
Classification of Proteins) classification, which provides a general structural framework; the
self-contact map detailing the protein’s 3-D structure; and the sequence alignment among similar
proteins. These embeddings effectively encapsulate the protein sequences’ local context and
overarching structural attributes. Specifically, the encoding for each amino acid is
d_0-dimensional, capturing not only the properties of
the individual amino acid but also the broader structural context of the entire protein sequence.
   For a given protein L with sequence S_L of length n, DSCRIPT generates an embedding
E_L ∈ ℝ^{n×d_0}, where d_0 is the dimensionality of the embedding space (d_0 = 6165 in our case).
DSCRIPT’s predictive capability hinges on its ability to generate embeddings for two proteins
and compute an interaction probability p̂ ∈ [0, 1]. The model’s architecture employs a projection
module, a multi-layer perceptron with rectified linear unit (ReLU) activations, to reduce the
embeddings to a lower dimension d. The projection module is followed by a residue-contact module,
which takes the d-dimensional embeddings and models the interaction between the residues
of each protein. This process results in a contact prediction matrix Ĉ ∈ [0, 1]^{n×m}, where m is
the length of the second protein. The interaction prediction module then derives a single
interaction probability p̂ from the matrix through a global pooling operation followed by a
logistic activation function. The global pooling operation captures the intuition that a
pair of interacting proteins will be characterized by a relatively small number of high-probability
interacting residues or regions. The activation function takes the raw probability predictions
and makes them steeper, depressing values below 0.5 towards 0 and inflating values above 0.5
towards 1, controlling the rate at which this occurs. Finally, p̂ and Ĉ are returned as the model
prediction.

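
The sharpening effect of the logistic activation can be illustrated with a minimal sketch. This is not DSCRIPT's implementation: the function name `sharpen` and the steepness value `k = 20` are hypothetical choices made only to show how values below 0.5 are pushed towards 0 and values above 0.5 towards 1.

```python
import math


def sharpen(p: float, k: float = 20.0, midpoint: float = 0.5) -> float:
    """Logistic activation centred at `midpoint`.

    `k` is a hypothetical steepness parameter: the larger it is, the
    more strongly values are pushed towards 0 or 1.
    """
    return 1.0 / (1.0 + math.exp(-k * (p - midpoint)))


# Raw pooled probabilities below, at, and above the midpoint.
for raw in (0.3, 0.5, 0.7):
    print(f"{raw:.1f} -> {sharpen(raw):.3f}")
```

A value exactly at the midpoint is left unchanged, while the rate at which the other values move is governed by `k`.
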
Cosine Similarity In our approach, we use DSCRIPT first to obtain representations of all the
proteins in the set, and then we compute a score that tells us how similar these representations
are to the representation of the protein of interest. In particular, we leverage the cosine similarity:
    \mathrm{cosine\_similarity}(E_A, E_B) = \frac{\sum_{i=1}^{d} E_{A,i} \, E_{B,i}}{\lVert E_A \rVert \, \lVert E_B \rVert} \in [-1, 1]    (1)

where 𝐸𝐴 and 𝐸𝐵 are two embedding vectors and 𝑑 is their dimension.
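
As a minimal sketch, Equation (1) can be computed with the Python standard library alone. The function name and the choice to raise an error on zero vectors are assumptions made for illustration; they are not taken from the DSCRIPT code base.

```python
import math


def cosine_similarity(e_a, e_b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(e_a) != len(e_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(e_a, e_b))
    norm_a = math.sqrt(sum(a * a for a in e_a))
    norm_b = math.sqrt(sum(b * b for b in e_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: the score is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: parallel, orthogonal, and opposite directions.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```
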

3.2. Validation of candidates
To validate the candidates, we use SpeedPPI, an optimized pipeline based on AlphaFold2 [8] that
predicts PPI networks 40x faster and with lower disk-space requirements [7].
   Given an organism, the proteome is extracted from UniProt or another database. Each single
sequence is used to create a multiple sequence alignment (MSA) with HHblits
[49], searching the Uniclust30 [50] database.

Multiple Sequence Alignments A multiple sequence alignment (MSA) organizes protein
sequences into a rectangular array, aiming to align residues within columns that are homologous,
superposable, or serve a common functional role, although these criteria may yield different
alignments as sequence, structure, and function diverge over evolutionary time [51]. MSAs are
indispensable in biology and bioinformatics for comparing sequences, revealing evolutionary
relationships, and identifying conserved regions. They entail aligning nucleotide or amino
acid sequences to detect similarities and differences, employing diverse algorithms, which aid
in predicting functional domains and phylogenetic relationships but can face computational
challenges with highly diverse sequences.

SpeedPPI and AlphaFold2 In the procedure, the MSAs are paired based on species informa-
tion, and all single-chain information is maintained by block diagonalization. The MSA pairing
and block diagonalization are done within the network, avoiding writing this information to disk,
which reduces disk space requirements and the total prediction time. The structure prediction
is made with AlphaFold2, and the features are prefetched. AlphaFold2 is a system designed to
predict the 3D structures of proteins. It works by taking the sequence of amino acids that make
up a protein and then using deep learning algorithms to predict how these amino acids fold and
interact to form a 3D structure. Its architecture involves a deep neural network comprising two
main components: a novel attention mechanism and a fully differentiable module for structure
prediction. The attention mechanism allows the model to focus on the most relevant parts of
the protein sequence when making predictions, while the structure prediction module uses a
series of mathematical functions to estimate the distances between pairs of amino acids and the
angles between connected amino acids, which are crucial for determining the overall shape of
the protein. Additionally, AlphaFold2 uses a technique called “attention-based message passing”
to incorporate information from the entire protein sequence into its predictions, allowing it to
capture long-range interactions that were previously difficult to model accurately.

pDockQ score The predictions are evaluated within the prediction runs using the pDockQ score
(introduced in [44, 45]). This score is obtained by fitting a sigmoidal curve to the DockQ [52]
scores:
    \mathrm{pDockQ} = \frac{L}{1 + e^{-k(x - x_0)}} + b
where x is the average interface pLDDT multiplied by the logarithm of the number of interface
contacts and L, x_0, k, b are obtained from the fitting process. From the pDockQ score and the 3D
structure predicted by AF2, we store the contact points of the interaction to check that
the interaction is similar to that of the original ligand. A pDockQ ≥ 0.23 was considered
to indicate a well-modeled interaction [45]; hence, we considered a protein reaching this threshold
to be a more probable ligand of NKp46. To further expand the analysis, we generate
a contact map and histograms of the distribution of contact points along NKp46
(see Section 5).
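
The pDockQ formula can be sketched in a few lines. The constants below are placeholder values assumed for illustration (the actual ones come from the fit described in [44]), and the use of a base-10 logarithm and of a score of 0 when no interface contact exists are likewise assumptions, since the text only states "the logarithm".

```python
import math

# ASSUMED placeholder constants of the fitted sigmoid; check them
# against the fit reported in [44] before any real use.
FIT_L = 0.724
FIT_X0 = 152.611
FIT_K = 0.052
FIT_B = 0.018

WELL_MODELED_THRESHOLD = 0.23  # threshold quoted in the text [45]


def pdockq(avg_interface_plddt, n_interface_contacts,
           L=FIT_L, x0=FIT_X0, k=FIT_K, b=FIT_B):
    """Sigmoid of x = average interface pLDDT * log(number of contacts)."""
    if n_interface_contacts < 1:
        return 0.0  # assumed convention: no interface, no score
    # Assumption: base-10 logarithm.
    x = avg_interface_plddt * math.log10(n_interface_contacts)
    return L / (1.0 + math.exp(-k * (x - x0))) + b


def is_well_modeled(score, threshold=WELL_MODELED_THRESHOLD):
    """True when the score reaches the threshold used in this work."""
    return score >= threshold


score = pdockq(avg_interface_plddt=90.0, n_interface_contacts=100)
print(round(score, 3), is_well_modeled(score))
```

With these placeholder constants, a confident interface with many contacts lands well above the 0.23 threshold, while a low-confidence interface with few contacts stays below it.
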


4. Experiments
We start our experiments by validating the capability of PPI models to reproduce known
interactions found in the literature involving NKp46. In particular, we observe that DSCRIPT
recognizes the interactions of NKp46 with CFP and TYROBP. We focus our study mainly on the
recently demonstrated CRT-NKp46 interaction [9], which caught our interest given its relevance in
the medical field.
   We generate high-dimensional representations of NKp46 and CRT through DSCRIPT. We
compute the interaction score and contact map of the interaction (Figure 1).
   From the analysis of the contact map, we notice that the most active amino acids of CRT in
the interaction are in the range 199-227, while for NKp46 they are in the range 80-258. These
ranges fall within the bounds of CRT’s P-domain and NKp46’s extracellular region, respectively,
which are the regions involved in the interaction. This is consistent with what the authors of
DSCRIPT claimed: DSCRIPT is capable of producing contact maps that are similar to the ground
truth of the interaction [5].
   We choose CRT as the template protein and use the UniProt database [47] to find the
initial set of proteins we are interested in. We first restrict the database to human proteins.
Then, since in our case study the known interaction involves NKp46 and CRT, with
CRT located on the cell surface/membrane after translocating there because of ER stress,
we refine our search to include only proteins likely to interact with NKp46 in this location. As
a result, we narrow down our database to 7188 proteins.
   To find and filter the proteins potentially similar to CRT we try two approaches: (i) similarity
to the entirety of the P-domain (amino acids 198-308) and (ii) similarity to the smallest
interacting range (triplet) of amino acids (199-201). There are two main reasons why we chose
these two approaches: (i) the P-domain is actively involved in the interaction [9], and finding
proteins that have part of the amino acid sequence semantically similar to the entirety of
the P-domain could lead to proteins with similar general characteristics, hence potential
candidates for interaction; (ii) using the most “interacting” triplet of amino acids could be more
precise, because the P-domain does not interact with NKp46 in its entirety and most probably
only a limited percentage of the sequence is involved in the interaction; hence, using the
most interacting triplet could lead to a less noisy set of protein candidates compared to the
other approach.

Figure 1: Contact map for the amino acids of NKp46 (on y axis) and CRT (on x axis).

   The high-dimensional embeddings of the proteins are at the amino acid level, so we compute
the similarities at this same level. In both cases, (i) and (ii), we use DSCRIPT to generate
the proteins’ representations. We compare each protein in the resulting database with CRT, one
amino acid embedding at a time. The similarity score at the amino acid level is
given by the cosine similarity (see Section 3.1). In (i) we compare each amino acid representation
of a protein to every amino acid of the P-domain in CRT, then we compute the mean and median.
In (ii) we compare every triplet in the protein at hand with the most interacting triplet in CRT,
then we compute the mean.
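
The two scoring strategies can be sketched as follows, with embeddings represented as lists of per-residue vectors. The function names are hypothetical, and for strategy (ii) the sketch assumes that a triplet is scored by averaging the three position-wise cosine similarities; the text does not specify how a triplet comparison is reduced to a single number.

```python
import math
from statistics import mean, median


def cosine_similarity(u, v):
    """Cosine similarity between two residue embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def domain_similarity(protein_emb, domain_emb):
    """Strategy (i): every residue of the candidate against every
    residue of the template domain; returns (mean, median)."""
    sims = [cosine_similarity(r, d) for r in protein_emb for d in domain_emb]
    return mean(sims), median(sims)


def triplet_similarity(protein_emb, template_triplet):
    """Strategy (ii): slide a window of three residues along the
    candidate and compare it with the template triplet.

    Assumption: a window is scored by the mean of its three
    position-wise cosine similarities.
    """
    if len(template_triplet) != 3 or len(protein_emb) < 3:
        raise ValueError("need a 3-residue template and >= 3 residues")
    window_scores = []
    for start in range(len(protein_emb) - 2):
        window = protein_emb[start:start + 3]
        window_scores.append(mean(
            cosine_similarity(r, t) for r, t in zip(window, template_triplet)
        ))
    return mean(window_scores)


# Toy 2-dimensional embeddings standing in for real residue embeddings.
candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
p_domain = [[1.0, 0.0]]
print(domain_similarity(candidate, p_domain))
print(triplet_similarity(candidate, candidate))
```
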
   In this way, a similarity score is assigned to each protein in both cases. We then pass these
proteins to DSCRIPT in order to predict their interaction with NKp46, producing interaction scores
and contact maps. At this point we have the following information: for (i), the mean and median of
the similarity to the P-domain, the DSCRIPT interaction score and the contact map; for (ii), the
mean of the similarity between the triplets of the protein and the most interacting triplet of CRT,
the interaction score and the contact map.
By combining this data we rank the proteins and obtain a limited list of candidates.
   We take the best proteins overall (DSCRIPT interaction score ≥ 0.5 and highest median/mean
similarity values) for the final validation through SpeedPPI. We compute the pDockQ score of
the interaction and the 3D structure, so that we can also get the predicted contact points.
AlphaFold has some stochastic components that may vary the outcomes across executions, so we run
the model 25 times with different seeds to obtain more robust final results.
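
A minimal sketch of this selection step, assuming each candidate is stored as a dictionary with hypothetical keys `name`, `interaction_score` and `similarity`. The use of the median to summarize the seeded runs is also an assumption made for illustration.

```python
from statistics import median


def select_candidates(records, min_interaction=0.5, top_k=10):
    """Keep candidates whose DSCRIPT interaction score reaches the
    threshold, then rank them by similarity (highest first)."""
    kept = [r for r in records if r["interaction_score"] >= min_interaction]
    kept.sort(key=lambda r: r["similarity"], reverse=True)
    return kept[:top_k]


def summarize_runs(pdockq_by_seed):
    """Summarize the pDockQ scores of repeated seeded runs with their
    median (assumed summary statistic)."""
    return median(pdockq_by_seed.values())


# Hypothetical toy records, not results from the study.
records = [
    {"name": "A", "interaction_score": 0.9, "similarity": 0.2},
    {"name": "B", "interaction_score": 0.4, "similarity": 0.9},
    {"name": "C", "interaction_score": 0.6, "similarity": 0.5},
]
print([r["name"] for r in select_candidates(records)])
print(summarize_runs({0: 0.20, 1: 0.30, 2: 0.25}))
```
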
   We also run some experiments with commonly used bioinformatics tools such as BLAST [43]
to find proteins similar to CRT. We then apply our approach to these proteins (and to CRT)
to compare the results.
   In the rest of the study we refer to the proteins we found as Protein + index (P + index);
Table 3 gives the correspondence to the actual proteins and their genes.


5. Results
In this section, we analyze and compare the bioinformatics methodology (Section 5.1) with the
two main directions taken for our experiments. As introduced in Section 4, we consider the
P-domain area (Section 5.2) and the three consecutive most interacting amino acids (Section 5.3)
of CRT in order to find proteins similar to CRT that also have relevant features for interaction
with NKp46.
   To analyze the obtained candidates, we create boxplots of the pDockQ scores to better visualize
the distribution of the results over the different runs of SpeedPPI, which gives us more robust
outcomes. Then, knowing that NKp46 is able to interact only through a specific portion of itself,
the extracellular domain, we further investigate where the contacts actually occur. For this
reason, we report the distribution of the contact points along the NKp46 amino acid sequence.
We highlight the three main portions of NKp46 with different colors: yellow for the signal
peptide, green for the extracellular domain and red for the transmembrane and intracellular
part. Additionally, we report in tables the average and median number of contact points per
area.
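
Assigning contact points to NKp46 regions can be sketched as below. The region boundaries are hypothetical placeholders that should be replaced with the annotated boundaries of the NKp46 sequence, and the function name is illustrative.

```python
# HYPOTHETICAL region boundaries (1-based, inclusive) used only for
# illustration; replace them with the annotated NKp46 boundaries.
NKP46_REGIONS = {
    "signal": (1, 21),
    "extracellular": (22, 258),
    "transmembrane_intracellular": (259, 304),
}


def contacts_per_region(contact_positions, regions=NKP46_REGIONS):
    """Count how many predicted contact residues fall in each region."""
    counts = {name: 0 for name in regions}
    for pos in contact_positions:
        for name, (start, end) in regions.items():
            if start <= pos <= end:
                counts[name] += 1
                break  # regions are assumed not to overlap
    return counts


# Toy contact positions on NKp46, not taken from the study.
print(contacts_per_region([5, 100, 200, 270]))
```

Repeating this count for each of the 25 runs and taking the average and median per region would yield a per-area summary of the kind reported in the tables.
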

5.1. Bioinformatics methods
More commonly used tools that compute sequence and structure similarity between proteins,
like BLAST [43], Prosite [53], MEME suite [54] or InterPro [55], always return isoforms of CRT,
calnexin or calmegin when used to find proteins similar to CRT. The latter two proteins
belong to the same protein family as CRT and have a sequence similar to CRT’s P-domain.
Both of them reside in the endoplasmic reticulum (ER), share structural features and play
essential roles in protein folding and quality control within the ER.
   From Figure 2 and Table 1 we can see that, while the pDockQ scores are high, for CRT most of the
contacts are in the signal peptide (yellow) and extracellular (green) parts of NKp46, whereas
for calmegin and calnexin the contacts tend to be mostly in the cytoplasmic part (red). It is
reasonable that most of their contacts happen in the cytoplasmic part, since they reside
in the ER. CRT is also found in the ER, but when ER stress is induced it translocates to the
membrane. We do not know whether the same could happen to calmegin or calnexin, but the models
seem to have captured this characteristic for CRT: the pDockQ score is high and the contacts
are mostly on the extracellular part of NKp46, which is not the case for the other two proteins.
This could mean that the similarity between CRT and calnexin and calmegin (which were found with
bioinformatic tools) is more about sequence similarity than a similarity that points
to a possibility of interaction.
   Clearly, these bioinformatic techniques are precise and indeed return “similar” proteins in
a sequence and structural sense, but this kind of similarity is not enough. We think that
deep learning, thanks to its ability to learn hidden patterns, could lead us to a notion of
similarity related to the interacting ability that proteins may have.

Figure 2: Analysis of calreticulin (CRT), calmegin (CLGN) and calnexin (CANX) when we predict
their interaction with NKp46 using the SpeedPPI model. Panel (a) shows a boxplot of the pDockQ
scores, while panel (b) shows the contact points distribution along the NKp46 areas. Both are
computed over 25 runs of the model.

                     Statistic  Region    Calreticulin   Calmegin   Calnexin
                     Average    Signal        0.63         0.82       0.92
                     Average    Extra         0.51         0.52       0.43
                     Average    Intra         0.30         1.55       1.00
                     Median     Signal        0.56         0.76       0.96
                     Median     Extra         0.32         0.40       0.32
                     Median     Intra         0.28         1.56       0.64
Table 1
Predicted average and median contact frequencies by the SpeedPPI model across various regions of
the NKp46 protein (i.e. signal peptide (Signal), Extracellular (Extra) and Transmembrane/Intracellular
(Intra)) during interactions with calreticulin and with the candidates found via the bioinformatics
approach, namely calmegin and calnexin.


5.2. P-domain Embedding Similarity
In experiment (i) we use the P-domain region to compute cosine similarities and extract
the top-10 candidates, which we call P1...10. Table 3 reports the UniProt IDs and genes
that correspond to P1...10, and Figure 3 shows the pDockQ scores obtained when predicting
their interaction with NKp46.
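
The ranking step described above can be illustrated with a minimal sketch. It assumes that each
protein region has already been summarized as one fixed-length vector (for example by averaging
per-residue embeddings); that pooling step, the function names and the toy vectors are
illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def top_k_candidates(query, candidates, k=10):
    """Rank candidate vectors by cosine similarity to the query and keep the top k."""
    scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for region embeddings (hypothetical values).
query = np.array([1.0, 0.0, 1.0])
candidates = {"A": [1, 0, 1], "B": [0, 1, 0], "C": [1, 1, 1]}
ranking = top_k_candidates(query, candidates, k=2)
```

With `k=10` and the query set to the embedding of CRT's P-domain, the returned list would play
the role of P1...10.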
   Only P1 has a pDockQ score ≥ 0.23. Figure 5 shows the distribution of contacts with NKp46
for all the proteins; P1 has most of its contacts in the extracellular part, with some high
peaks. Combining the information from both figures, we can consider P1 to be well modeled,
whereas the other proteins show contrasting data.

Figure 3: Boxplot of pDockQ values of the interaction between NKp46 and proteins having high
similarity to the P-domain of CRT.

Figure 4: Boxplot of pDockQ values of the interaction between NKp46 and proteins containing
amino-acid triplets highly similar to the most interacting triplet of CRT.

   Table 2a shows the average and median contact frequencies across the three regions of the
NKp46 protein (signal peptide, extracellular, intracellular). From these data, we observe
that the proteins obtained in the (i) P-domain similarity experiment have an overall low
likelihood of interaction. The data show slight inconsistencies between the computed pDockQ
score and the distribution of contact frequencies for each protein. For instance, P2 has a
low pDockQ score and yet very high contact frequencies in the extracellular part; the same
could be said of other proteins such as P3.
   Figure 6 shows the mean and median number of contacts per amino acid over all the
proteins.

5.3. Triplet Embedding Similarity
In experiment (ii), using the triplet similarity, we obtain proteins P11-P24 (see Table 3 for
their corresponding names and genes); Figure 4 shows their computed pDockQ scores.
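
One way to picture a triplet-based search is a sliding window over per-residue embeddings. The
sketch below assumes that a triplet is represented by concatenating the embeddings of three
consecutive residues and that a candidate is scored by its best-matching window; both choices,
the function name and the toy data are assumptions made for illustration, not the procedure
used in the experiments.

```python
import numpy as np


def best_triplet_match(query_triplet, residue_embeddings):
    """Return (best cosine score, start index) over all windows of 3 consecutive residues."""
    q = np.asarray(query_triplet, dtype=float).ravel()
    emb = np.asarray(residue_embeddings, dtype=float)
    if emb.shape[0] < 3:
        raise ValueError("need at least three residues")
    best_score, best_start = float("-inf"), -1
    for start in range(emb.shape[0] - 2):
        window = emb[start:start + 3].ravel()  # concatenate the three residue vectors
        denom = np.linalg.norm(q) * np.linalg.norm(window)
        score = float(q @ window / denom) if denom > 0 else 0.0
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Toy 2-dimensional embeddings (hypothetical values).
query_triplet = [[1, 0], [0, 1], [1, 1]]
protein = [[0, 1], [1, 0], [0, 1], [1, 1], [1, 0]]
score, start = best_triplet_match(query_triplet, protein)
```

Ranking all candidate proteins by this best-window score and keeping the top 14 would give a
list analogous to P11-P24.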
(a) Considering the P-domain area of CRT for computing similarities
Statistic  Region    P1     P2     P3     P4     P5     P6     P7     P8     P9     P10
Average    Signal   0.09   0.00   0.11   0.00   0.17   0.10   0.26   0.05   0.44   0.07
Average    Extra    0.26   0.33   0.31   0.04   0.30   0.26   0.15   0.10   0.27   0.15
Average    Intra    0.27   0.14   0.01   0.06   0.76   0.17   0.28   0.10   0.09   0.03
Median     Signal   0.00   0.00   0.00   0.00   0.12   0.08   0.24   0.00   0.20   0.04
Median     Extra    0.04   0.16   0.20   0.00   0.16   0.20   0.04   0.00   0.04   0.00
Median     Intra    0.24   0.04   0.00   0.00   0.48   0.08   0.20   0.20   0.00   0.00

(b) Considering a triplet of CRT’s amino acids for computing similarities
Statistic  Region   P11    P12    P13    P14    P15    P16    P17    P18    P19    P20    P21    P22    P23    P24
Average    Signal   0.89   0.45   0.39   0.39   0.16   0.19   0.22   0.06   1.40   0.04   0.00   0.50   0.39   0.60
Average    Extra    0.98   0.44   0.30   0.73   0.36   0.24   0.29   0.30   0.83   0.25   0.31   0.71   0.31   0.72
Average    Intra    0.32   0.26   0.31   0.45   0.13   0.63   0.25   0.05   1.20   0.40   0.25   0.60   0.05   0.26
Median     Signal   0.88   0.44   0.32   0.36   0.16   0.16   0.20   0.00   1.28   0.0    0.00   0.40   0.40   0.40
Median     Extra    0.84   0.32   0.20   0.52   0.24   0.12   0.28   0.20   0.08   0.08   0.08   0.52   0.16   0.68
Median     Intra    0.32   0.16   0.28   0.42   0.06   0.54   0.18   0.00   0.94   0.32   0.08   0.26   0.04   0.22
Table 2
Predicted average and median contact frequencies by the SpeedPPI model across the regions of
the NKp46 protein (i.e. signal peptide (Signal), extracellular (Extra) and
transmembrane/intracellular (Intra)) during interactions with the top-k proteins identified
using our method. The results in (a) were obtained by using the P-domain (k=10) and those in
(b) by considering a triplet of amino acids (k=14). The mapping between P1...24 and the
protein names is shown in Table 3.

   In Figure 4, P11, P19 and P21 have a pDockQ score ≥ 0.23, hence we can consider them to be
well modeled with high probability, unlike the others (note that P12 and P14 are close to a
pDockQ score of 0.23, so they could be reasonably well modeled).
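
The 0.23 cut-off used above can be applied programmatically to the scores collected over
repeated runs. The sketch below summarizes each protein by the median of its runs, which is an
assumption (the text reads the values off boxplots and does not state the statistic); the
function name and the toy scores are hypothetical and are not results from the experiments.

```python
from statistics import median

PDOCKQ_THRESHOLD = 0.23  # cut-off used in the text


def summarize_pdockq(runs_by_protein, threshold=PDOCKQ_THRESHOLD):
    """Map each protein to its median pDockQ and whether it reaches the threshold."""
    summary = {}
    for name, scores in runs_by_protein.items():
        med = median(scores)
        summary[name] = {"median": med, "well_modeled": med >= threshold}
    return summary


# Made-up scores for two hypothetical candidates (three runs each).
toy_runs = {
    "candidate_a": [0.30, 0.25, 0.41],
    "candidate_b": [0.05, 0.22, 0.10],
}
summary = summarize_pdockq(toy_runs)
```

A borderline case such as a median just below 0.23 would be flagged as not well modeled by this
rule, so inspecting the full distribution, as done in the boxplots, remains useful.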
   Figure 7 shows the distribution of contact points over 25 runs. As in the (i) P-domain
experiment, in the (ii) triplets experiment the coloured areas represent the sequence ranges
1-22 (yellow), 23-258 (green) and 259-304 (red); the green area is of most interest because
it represents the extracellular part of NKp46. In experiment (ii) the quality of the proteins
obtained is clearly better: we observe three proteins with high pDockQ scores and consistent
distributions of contact frequencies, with high values in the extracellular part and low
values in the intracellular part. Table 2b reports the average and median contact
frequencies across the three regions of NKp46 (signal peptide, extracellular,
intracellular).
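
Per-region summaries of this kind can be computed from a per-residue contact profile using the
sequence ranges given above. The sketch assumes one value per NKp46 residue (304 positions,
1-indexed ranges) and simply takes the mean and median within each range; how the reported
table values were aggregated over runs is not specified here, so this is an illustration with
toy data and hypothetical names.

```python
import numpy as np

# Sequence ranges from the text (1-indexed, inclusive).
REGIONS = {"Signal": (1, 22), "Extra": (23, 258), "Intra": (259, 304)}


def region_contact_stats(contacts_per_residue):
    """Average and median contact value within each NKp46 region."""
    counts = np.asarray(contacts_per_residue, dtype=float)
    if counts.shape != (304,):
        raise ValueError("expected one value per residue (304 residues)")
    stats = {}
    for name, (start, end) in REGIONS.items():
        segment = counts[start - 1:end]  # convert to 0-indexed, end-exclusive slice
        stats[name] = {
            "average": float(segment.mean()),
            "median": float(np.median(segment)),
        }
    return stats


# Toy profile: flat values per region plus one isolated peak at residue 259.
toy = np.zeros(304)
toy[0:22] = 1.0
toy[22:258] = 2.0
toy[258] = 4.0
stats = region_contact_stats(toy)
```

The toy profile also shows why both statistics are reported: a single peak raises the average
of a region while leaving its median at zero.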
   From these data, we observe that the proteins obtained in the (ii) triplet similarity
experiment are more consistent. This difference between the two experiments is related to
the way we found the proteins. It seems that searching for highly similar proteins by
focusing on specific small regions (triplets) with the highest interaction score, instead of
large domains of interest, can lead to more consistent results.
   Figure 8 shows the mean and median number of contacts per amino acid over all the
proteins.
   Figure 9 shows the mean and median number of contacts per amino acid over the high
potential proteins P11, P19 and P21.

6. Conclusions
In this work, we propose an approach to identify candidate proteins for interaction starting
from a known interaction between a protein and its ligand, using that ligand as a template for
the search. The method combines the interpretability of the DSCRIPT model with
the predictive power of AlphaFold.
    We first checked the performance of the models on known interactions, and chose the
recently discovered NKp46-CRT interaction as our case study. We preliminarily filtered the
proteins in UniProt based on specific constraints: proteins found in the human body and
specifically only on the membrane surface. We computed high-dimensional amino acid
representations of the proteins obtained from UniProt through the deep learning DSCRIPT
model. We used DSCRIPT’s inherent interpretability to make targeted choices on how to find
proteins similar to CRT based on the NKp46-CRT interaction. We compared the results obtained
with standard bioinformatics tools and with our approach. We then described two experiments:
(i) filtering proteins based on a large domain of interest (CRT’s P-domain in our case study);
(ii) filtering proteins based on a small, most interacting region (a triplet of amino acids)
of the chosen protein (CRT in our case study). We used SpeedPPI (based on AlphaFold2) and the
pDockQ score (from FoldDock) as a validation scheme.
    Our findings suggest that this methodological approach can be useful in the search for and
analysis of PPIs when added to the more traditional bioinformatics tools typically employed for
assessing protein similarity. With our approach, PPI analyses can be more targeted and can help
in finding new potential interacting proteins. We also introduce AlphaFold2 as a validation
tool. The case study on NKp46’s ligands demonstrates the method’s ability to identify
meaningful candidate protein interactions, which could have implications for cancer
immunotherapy.
    In the future, we would like to further investigate this case study with experimental
validation from biological experts, and to apply the approach to other proteins of interest.


Acknowledgments
We would like to express our deepest gratitude to Dr. Piera Fiore, Dr. Nicola Tumino and Dr.
Paola Vacca from the Department of Immunology (IRCCS Ospedale Pediatrico Bambino Gesù,
Rome, Italy) for their invaluable contributions and collaborative efforts in this study. Their
expertise, guidance, and dedication were instrumental in advancing our research.


References
 [1] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
     R. Bates, A. Žídek, A. Potapenko, et al., Highly accurate protein structure prediction with
     alphafold, Nature 596 (2021) 583–589.
 [2] R. Evans, M. O’Neill, A. Pritzel, N. Antropova, A. Senior, T. Green, A. Žídek, R. Bates,
     S. Blackwell, J. Yim, et al., Protein complex prediction with alphafold-multimer, biorxiv
     (2021) 2021–10.
 [3] P. Bryant, G. Pozzati, W. Zhu, A. Shenoy, P. Kundrotas, A. Elofsson, Predicting the
     structure of large protein complexes using alphafold and monte carlo tree search, Nature
     communications 13 (2022) 6028.
 [4] C. Y. Lee, D. Hubrich, J. K. Varga, C. Schäfer, M. Welzel, E. Schumbera, M. Djokic, J. M.
     Strom, J. Schönfeld, J. L. Geist, et al., Systematic discovery of protein interaction interfaces
     using alphafold and experimental validation, Molecular Systems Biology 20 (2024) 75–97.
 [5] S. Sledzieski, R. Singh, L. Cowen, B. Berger, Sequence-based prediction of protein-protein
     interactions: a structure-aware interpretable deep learning model, bioRxiv (2021).
     doi:10.1101/2021.01.22.427866.
 [6] S. Sledzieski, R. Singh, L. Cowen, B. Berger, D-script translates genome to phenome with
     sequence-based, structure-aware, genome-scale predictions of protein-protein interactions,
     Cell Systems 12 (2021) 969–982.
 [7] P. Bryant, F. Noé, Rapid protein-protein interaction network creation from multiple
     sequence alignments with deep learning, bioRxiv (2023). doi:10.1101/2023.04.15.536993.
 [8] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
     R. Bates, A. Žídek, A. Potapenko, et al., Highly accurate protein structure prediction with
     alphafold, Nature 596 (2021) 583–589.
 [9] S. Sen Santara, D.-J. Lee, A. Crespo, J. J. Hu, C. Walker, X. Ma, Y. Zhang, S. Chowdhury, K. F.
     Meza-Sosa, M. Lewandrowski, H. Zhang, M. Rowe, A. McClelland, H. Wu, C. Junqueira,
     J. Lieberman, The NK cell receptor NKp46 recognizes ecto-calreticulin on ER-stressed cells,
     Nature 616 (2023) 348–356. URL: https://www.nature.com/articles/s41586-023-05912-0.
     doi:10.1038/s41586-023-05912-0.
[10] T. Pazina, A. Shemesh, M. Brusilovsky, A. Porgador, K. S. Campbell, Regulation of the
     Functions of Natural Cytotoxicity Receptors by Interactions with Diverse Ligands and
     Alterations in Splice Variant Expression, Frontiers in Immunology 8 (2017).
     URL: https://www.frontiersin.org/articles/10.3389/fimmu.2017.00369.
[11] S.-Y. Wu, T. Fu, Y.-Z. Jiang, Z.-M. Shao, Natural killer cells in cancer biology and therapy,
     Molecular Cancer 19 (2020) 120. URL: https://doi.org/10.1186/s12943-020-01238-x.
     doi:10.1186/s12943-020-01238-x.
[12] N. K. Wolf, D. U. Kissiov, D. H. Raulet, Roles of natural killer cells in immunity
     to cancer, and applications to immunotherapy, Nature Reviews Immunology 23
     (2023) 90–105. URL: https://www.nature.com/articles/s41577-022-00732-1.
     doi:10.1038/s41577-022-00732-1.
[13] O. Mandelboim, A. Porgador, NKp46, The International Journal of Biochemistry &
     Cell Biology 33 (2001) 1147–1150.
     URL: https://www.sciencedirect.com/science/article/pii/S1357272501000784.
     doi:10.1016/S1357-2725(01)00078-4.
[14] N. Bloushtain, U. Qimron, A. Bar-Ilan, O. Hershkovitz, R. Gazit, E. Fima, M. Korc,
     I. Vlodavsky, N. V. Bovin, A. Porgador, Membrane-associated heparan sulfate proteoglycans are
     involved in the recognition of cellular targets by NKp30 and NKp46, Journal of Immunology
     (Baltimore, Md.: 1950) 173 (2004) 2392–2401. doi:10.4049/jimmunol.173.4.2392.
[15] T. I. Arnon, H. Achdout, N. Lieberman, R. Gazit, T. Gonen-Gross, G. Katz, A. Bar-Ilan,
     N. Bloushtain, M. Lev, A. Joseph, et al., The mechanisms controlling the recognition of
     tumor-and virus-infected cells by nkp46, Blood 103 (2004) 664–672.
[16] J. R. Bock, D. A. Gough, Predicting protein–protein interactions from primary structure,
     Bioinformatics 17 (2001) 455–460.
[17] E. Sprinzak, H. Margalit, Correlated sequence-signatures as markers of protein-protein
     interaction, Journal of molecular biology 311 (2001) 681–692.
[18] S. M. Gomez, W. S. Noble, A. Rzhetsky, Learning to predict protein–protein interactions
     from protein sequences, Bioinformatics 19 (2003) 1875–1881.
[19] A. Ben-Hur, W. S. Noble, Kernel methods for predicting protein–protein interactions,
     Bioinformatics 21 (2005) i38–i46.
[20] S. Martin, D. Roe, J.-L. Faulon, Predicting protein–protein interactions using signature
     products, Bioinformatics 21 (2005) 218–226.
[21] L. Nanni, A. Lumini, An ensemble of k-local hyperplanes for predicting protein–protein
     interactions, Bioinformatics 22 (2006) 1207–1210.
[22] S. Pitre, F. Dehne, A. Chan, J. Cheetham, A. Duong, A. Emili, M. Gebbia, J. Greenblatt,
     M. Jessulat, N. Krogan, et al., Pipe: a protein-protein interaction prediction engine based
     on the re-occurring short polypeptide sequences between known interacting protein pairs,
     BMC bioinformatics 7 (2006) 1–15.
[23] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, H. Jiang, Predicting protein–protein
     interactions based only on sequences information, Proceedings of the National Academy
     of Sciences 104 (2007) 4337–4341.
[24] Y. Guo, L. Yu, Z. Wen, M. Li, Using support vector machine combined with auto covariance
     to predict protein–protein interactions from protein sequences, Nucleic acids research 36
     (2008) 3025–3030.
[25] S. Roy, D. Martinez, H. Platero, T. Lane, M. Werner-Washburne, Exploiting amino acid
     composition for predicting protein-protein interactions, PloS one 4 (2009) e7813.
[26] C.-Y. Yu, L.-C. Chou, D. T.-H. Chang, Predicting protein-protein interactions in unbalanced
     data using the primary structure of proteins, BMC bioinformatics 11 (2010) 1–10.
[27] J. Yu, M. Guo, C. J. Needham, Y. Huang, L. Cai, D. R. Westhead, Simple sequence-based
     kernels do not predict protein–protein interactions, Bioinformatics 26 (2010) 2610–2614.
[28] Y. Guo, M. Li, X. Pu, G. Li, X. Guang, W. Xiong, J. Li, Pred_ppi: a server for predicting
     protein-protein interactions based on sequence data with probability assignment, BMC
     research notes 3 (2010) 1–7.
[29] M. Deng, S. Mehta, F. Sun, T. Chen, Inferring domain-domain interactions from protein-
     protein interactions, in: Proceedings of the sixth annual international conference on
     Computational biology, 2002, pp. 117–126.
[30] M. Hayashida, M. Kamada, J. Song, T. Akutsu, Conditional random field approach to
     prediction of protein-protein interactions using domain information, BMC systems biology
     5 (2011) 1–9.
[31] C.-C. Chen, C.-Y. Lin, Y.-S. Lo, J.-M. Yang, Ppisearch: a web server for searching homologous
     protein–protein interactions across multiple species, Nucleic acids research 37 (2009)
     W369–W375.
[32] L. R. Matthews, P. Vaglio, J. Reboul, H. Ge, B. P. Davis, J. Garrels, S. Vincent, M. Vidal,
     Identification of potential interaction networks using sequence-based searches for conserved
     protein-protein interactions or “interologs”, Genome research 11 (2001) 2120–2126.
[33] H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J.-D. J. Han, N. Bertin, S. Chung, M. Vidal,
     M. Gerstein, Annotation transfer between genomes: protein–protein interologs and
     protein–dna regulogs, Genome research 14 (2004) 1107–1118.
[34] J. Garcia-Garcia, S. Schleker, J. Klein-Seetharaman, B. Oliva, Bips: Biana interolog
     prediction server. a tool for protein–protein interaction inference, Nucleic acids research 40
     (2012) W147–W151.
[35] Y. Chen, D. Zhi, Ligand–protein inverse docking and its potential use in the computer
     search of protein targets of a small molecule, Proteins: Structure, Function, and
     Bioinformatics 43 (2001) 217–226.
[36] Q.-T. Do, I. Renimel, P. Andre, C. Lugnier, C. D. Muller, P. Bernard, Reverse pharmacognosy:
     application of selnergy, a new tool for lead discovery. the example of 𝜀-viniferin, Current
     drug discovery technologies 2 (2005) 161–167.
[37] P. Muller, G. Lena, E. Boilard, S. Bezzine, G. Lambeau, G. Guichard, D. Rognan, In
     silico-guided target identification of a scaffold-focused library: 1,3,5-triazepan-2,6-diones
     as novel phospholipase a2 inhibitors, Journal of medicinal chemistry 49 (2006) 6768–6778.
[38] S. Zahler, S. Tietze, F. Totzke, M. Kubbutat, L. Meijer, A. M. Vollmar, J. Apostolakis, Inverse
     in silico screening for identification of kinase inhibitor targets, Chemistry & biology 14
     (2007) 1207–1214.
[39] M. Schapira, R. Abagyan, M. Totrov, Nuclear hormone receptor targeted virtual screening,
     Journal of medicinal chemistry 46 (2003) 3045–3059.
[40] J. M. Rollinger, Accessing target information by virtual parallel screening—the impact on
     natural product research, Phytochemistry Letters 2 (2009) 53–58.
[41] C. Bissantz, A. Logean, D. Rognan, High-throughput modeling of human g-protein
     coupled receptors: amino acid sequence alignment, three-dimensional model building,
     and receptor library screening, Journal of chemical information and computer sciences 44
     (2004) 1162–1176.
[42] T. Bepler, B. Berger, Learning protein sequence embeddings using information
     from structure, CoRR abs/1902.08661 (2019). URL: http://arxiv.org/abs/1902.08661.
     arXiv:1902.08661.
[43] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment
     search tool, Journal of Molecular Biology 215 (1990) 403–410.
     URL: https://www.sciencedirect.com/science/article/pii/S0022283605803602.
     doi:10.1016/S0022-2836(05)80360-2.
[44] P. Bryant, G. Pozzati, A. Elofsson, Improved prediction of protein-protein interactions
     using alphafold2 and extended multiple-sequence alignments, bioRxiv (2021).
     doi:10.1101/2021.09.15.460468.
[45] D. F. Burke, P. Bryant, I. Barrio-Hernandez, D. Memon, G. Pozzati, A. Shenoy, W. Zhu,
     A. S. Dunham, P. Albanese, A. Keller, R. A. Scheltema, J. E. Bruce, A. Leitner, P. Kundrotas,
     P. Beltrao, A. Elofsson, Towards a structurally resolved human protein interaction network,
     bioRxiv (2021). doi:10.1101/2021.11.08.467664.
[46] D. Szklarczyk, R. Kirsch, M. Koutrouli, K. Nastou, F. Mehryary, R. Hachilif, A. L. Gable,
     T. Fang, N. T. Doncheva, S. Pyysalo, et al., The string database in 2023: protein–protein
     association networks and functional enrichment analyses for any sequenced genome of
     interest, Nucleic acids research 51 (2023) D638–D646.
[47] T. U. Consortium, UniProt: the Universal Protein Knowledgebase in 2023, Nucleic
     Acids Research 51 (2022) D523–D531. URL: https://doi.org/10.1093/nar/gkac1052.
     doi:10.1093/nar/gkac1052.
[48] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997)
     1735–1780.
[49] M. Remmert, A. Biegert, A. Hauser, J. Söding, Hhblits: lightning-fast iterative protein
     sequence searching by hmm-hmm alignment, Nature methods 9 (2012) 173–175.
[50] M. Mirdita, L. von den Driesch, C. Galiez, M. J. Martin, J. Söding, M. Steinegger, Uniclust
     databases of clustered and deeply annotated protein sequences and alignments, Nucleic
     Acids Research 45 (2016) D170–D176. URL: https://doi.org/10.1093/nar/gkw1081.
     doi:10.1093/nar/gkw1081.
[51] R. C. Edgar, S. Batzoglou, Multiple sequence alignment, Current Opinion in Structural
     Biology 16 (2006) 368–373.
     URL: https://www.sciencedirect.com/science/article/pii/S0959440X06000704.
     doi:10.1016/j.sbi.2006.04.004.
[52] S. Basu, B. Wallner, Dockq: a quality measure for protein-protein docking models, PloS
     one 11 (2016) e0161879.
[53] C. J. A. Sigrist, E. de Castro, L. Cerutti, B. A. Cuche, N. Hulo, A. Bridge,
     L. Bougueleret, I. Xenarios, New and continuing developments at PROSITE, Nucleic
     Acids Research 41 (2012) D344–D347. URL: https://doi.org/10.1093/nar/gks1067.
     doi:10.1093/nar/gks1067.
[54] T. L. Bailey, J. Johnson, C. E. Grant, W. S. Noble, The MEME Suite,
     Nucleic Acids Research 43 (2015) W39–W49. URL: https://doi.org/10.1093/nar/gkv416.
     doi:10.1093/nar/gkv416.
[55] S. Hunter, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork,
     U. Das, L. Daugherty, L. Duquenne, R. D. Finn, J. Gough, D. Haft, N. Hulo, D. Kahn,
     E. Kelly, A. Laugraud, I. Letunic, D. Lonsdale, R. Lopez, M. Madera, J. Maslen,
     C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, N. Mulder, D. Natale, C. Orengo,
     A. F. Quinn, J. D. Selengut, C. J. A. Sigrist, M. Thimma, P. D. Thomas, F. Valentin,
     D. Wilson, C. H. Wu, C. Yeats, InterPro: the integrative protein signature
     database, Nucleic Acids Research 37 (2008) D211–D215. URL: https://doi.org/10.1093/nar/gkn785.
     doi:10.1093/nar/gkn785.


A. Experiments details
Table 3 maps the proteins P1...24 to their protein IDs in the UniProt database
and to the corresponding gene names.
                          Name in paper      Protein ID        Gene
                          P1                 P15328            FOLR1
                          P2                 P14207            FOLR2
                          P3                 H0Y6X0            TAS1R1
                          P4                 B7Z487            -
                          P5                 O00451            GFRA2
                          P6                 P09603            CSF1
                          P7                 P56159            GFRA1
                          P8                 Q6ULR6            -
                          P9                 O60609            GFRA3
                          P10                Q9P0L0            VAPA
                          P11                A0A7I2V2B1        NRXN3
                          P12                A0A0D9SEM5        NRXN1
                          P13                A4FVB9            NRXN1
                          P14                A0A0A0MTQ4        RGMA
                          P15                A0A1B0GTB0        ATP6AP2
                          P16                A0A0H3VB22        FASLG
                          P17                A0A0U1RR22        PACSIN2
                          P18                A0A8I5KWD3        AP2M1
                          P19                A0A0U1RRJ0        NRXN3
                          P20                A0A977WMN2        TNF
                          P21                A6ND01            IZUMO1R
                          P22                O00481            BTN3A1
                          P23                P58400            NRXN1
                          P24                Q9HDB5            NRXN3
Table 3
Mapping between P1...24 and the corresponding protein ID in UniProt and their gene name if available.


B. Results details
In this section, we report plots that show the distribution of contact-point frequencies
along the NKp46 protein when it interacts with other proteins. Figure 5 shows the contact
points of the top-10 candidates obtained through our method when considering the similarity of
potential ligands to the P-domain of CRT. Figure 6 plots the average and median frequencies
of the contact points for the same experiment. We discuss these results in Section 5.2.
   We report the same plots for the experiment that uses the most interacting amino acid
triplet of CRT to compute the similarity scores with proteins. Figure 7 shows the
distributions for each of the top-14 candidates, while Figure 8 shows the average and
median over these candidates. We discuss these results in Section 5.3.
   Lastly, Figure 9 shows the frequency distributions of the contact points for
the candidates for which the SpeedPPI model predicts the interaction. The upper
image reports the average of the contact points along the NKp46 amino acids, divided by area,
and the bottom one the median. We also review these results in Section 5.3.
Figure 5: Distribution of the number of contact points for each protein in the (i) P-domain experiment.


Figure 6: The mean and median of the number of contacts per amino acid over all the proteins in the
(i) P-domain experiment.
Figure 7: Distribution of the number of contact points for each protein in the (ii) triplets experiment.
Figure 8: The mean and median of the number of contacts per amino acid over all the proteins in the
(ii) triplets experiment.


Figure 9: The mean and median of the number of contacts per amino acid over the high potential
proteins P11, P19, P21 in the (ii) triplets experiment.

</pre>