=Paper=
{{Paper
|id=None
|storemode=property
|title=Frankenstein Junior: a relational learning approach toward protein engineering
|pdfUrl=https://ceur-ws.org/Vol-645/Paper2.pdf
|volume=Vol-645
}}
==Frankenstein Junior: a relational learning approach toward protein engineering==
Elisa Cilia (1,2), Andrea Passerini* (1)
(1) Department of Computer Science and Information Engineering, University of Trento, via Sommarive 14, I-38123 (Povo) Trento, Italy
(2) FEM-IASMA Research & Innovation Center, via Edmund Mach 1, I-38010 S. Michele all'Adige (TN), Italy
Email: Elisa Cilia - cilia@disi.unitn.it, elisa.cilia@iasma.it; Andrea Passerini* - passerini@disi.unitn.it
* Corresponding author
Abstract
Background: Mining relevant features from protein mutation data is fundamental for understanding the characteristics of a protein functional site. The mined features could also be used for engineering novel proteins with useful functions.
Results: We propose a simple relational learning approach for protein engineering. First, we learn a set of relational rules from mutation data, then we use them for generating the candidate mutations that are most likely to confer resistance to a certain inhibitor or improved activity on a specific substrate. We tested our approach on a dataset of HIV drug resistance mutations, comparing it to a baseline random generator. Statistically significant improvements are obtained on both categories of nucleoside and non-nucleoside HIV reverse transcriptase inhibitors.
Conclusions: Our promising preliminary results suggest that the proposed approach for learning mutations has potential in guiding mutant engineering, as well as in predicting virus evolution so as to devise appropriate countermeasures. The approach generalizes quite easily to learning mutants characterized by more complex rules correlating multiple mutations.
Background
The mining of relevant features from protein mutation data aims first at understanding the properties of functional sites, for instance which residues are more likely to have a functional role. The same mined information can be used to engineer mutants of a protein with improved activity on a certain substrate or resistance to a certain inhibitor.

Rational design is an engineering technique that modifies existing proteins by site-directed mutagenesis. It assumes knowledge of, or intuition about, the effects of specific mutations on the protein function. The process typically involves extensive trial-and-error experiments, and is also used with the aim of improving the understanding of the mechanisms of a protein's behavior and taking the necessary countermeasures.

In this work we take the first steps towards the use of a relational learning approach to protein engineering by focusing on learning relevant single mutations (mutations that change the amino acid type in the protein sequence) that can affect the behavior of a protein. Predicting the effect of single mutations helps reduce the dimension of the search space for rational design.

In the proposed approach an Inductive Logic Programming (ILP) [1] learner is trained to extract general rules describing mutations relevant to a certain behavior (e.g. drug resistance of HIV), and the learned rules are then used to infer novel, potentially relevant mutations.

We focused on learning relevant mutations of the HIV reverse transcriptase (RT). The HIV RT is a DNA polymerase enzyme that transcribes RNA into DNA, allowing it to be integrated into the genome of the host cell and replicated along with it; it is therefore crucial for virus propagation. Viruses typically have a very high mutation rate, and a mutation can confer on the mutant resistance to one or more drugs, for instance by modifying the inhibitor target site on the protein. In the case of HIV drug resistance, which we address in the present work, building artificial mutants that can resist the inhibitor could help to predict the virus evolution early and thus design more effective drugs.

Many machine learning methods have been applied in the past to mutation data for predicting the effect of single point mutations on protein stability changes [2], on protein function [3] [4], or on drug susceptibility [5]. To our knowledge this is the first proposal for learning relational features of mutations affecting a protein behavior and using them for generating novel relevant mutations. Furthermore, even if we focus on single point mutations in our experimental evaluation, our approach can be quite straightforwardly extended to multiple point mutations, and we are actively working in this direction. Conversely, the aforementioned predictive approaches would immediately blow up because of the explosion in the number of candidate configurations to evaluate.

Results and Discussion
Dataset
We applied our approach to the same dataset of mutations already used in [6] for extracting relational rules among mutations. The dataset is derived from the Los Alamos National Laboratories (LANL) HIV resistance database (http://www.hiv.lanl.gov/content/sequence/RESDB/) and reports mutations of the HIV reverse transcriptase (RT). Richter et al. [6] formulated the learning problem as a mining task and applied a relational association rule miner to derive rules relating different mutations and their resistance properties.

In previous work [7] we used the same dataset for learning a relational model of mutant resistance with the hierarchical kFOIL algorithm, a statistical relational learner [8]. Here we use the dataset for inferring rules that can characterize a single mutation as resistant to a certain class of RT inhibitors. The drug classes include: a) Nucleoside RT Inhibitors (NRTI); b) Non-Nucleoside RT Inhibitors (NNRTI); c) Non-Competitive RT Inhibitors (NCRTI); d) Pyrophosphate Analogue RT Inhibitors (PARTI). The four classes of inhibitors differ in the targeted sites and rely on quite different mechanisms. NNRTI and NCRTI inhibit the reverse transcriptase by binding to the enzyme active site, therefore directly interfering with the enzyme function. NRTI is instead incorporated into the newly synthesized viral DNA, preventing its elongation. Finally, PARTI targets the pyrophosphate binding site and is employed, as part of a salvage therapy, on patients in which the HIV infection shows resistance to the other classes of antiretroviral drugs. The final dataset is composed of 164 mutations labeled as resistant over a set of 581 observed mutations (extracted from 2339 mutants). Among the 164 mutations, 95 are labeled as resistant to NRTI, 56 to NNRTI, 5 to NCRTI and 8 to PARTI.

Learning in first order logic
Our aim is to learn a first-order logic hypothesis for the target concept, i.e. a mutation conferring resistance to a certain drug, and use it to infer novel mutations consistent with such a hypothesis. We rely on definite clauses, which are the basis of the Prolog programming language. A definite clause is an expression of the form h ← b1 AND ... AND bn, where h and the bi are atoms. Atoms are expressions of the form p(t1, ..., tn) where p/n is a predicate symbol of arity n and the ti are terms, either constants (denoted by lower case) or variables (denoted by upper case) in our experiments. The atom h is also called the head of the clause, typically the target predicate, and b1 AND ... AND bn its body. Intuitively, a clause states that the head h holds whenever the body b1 AND ... AND bn holds.
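As a concrete illustration of these semantics, the following minimal Python sketch (a stand-in for the Prolog engine actually used in this work, with invented facts) encodes ground facts and checks a clause of the form h ← b1 AND b2, here res_against(A,ncrti) ← mutation(A,C) AND close_to_site(C):

```python
# Minimal sketch of definite-clause semantics (h <- b1 AND ... AND bn).
# Facts and mutation identifiers are illustrative, not the real dataset.

# Ground facts: mutation(A, C) and close_to_site(C).
mutation = {("m1", "k103n"), ("m2", "t215y")}
close_to_site = {"k103n"}

def res_against_ncrti(a):
    """Head res_against(A, ncrti) holds if some C satisfies the body:
    mutation(A, C) AND close_to_site(C)."""
    return any(m == a and c in close_to_site for (m, c) in mutation)

print(res_against_ncrti("m1"))  # True: mutant m1 carries a mutation near a site
print(res_against_ncrti("m2"))  # False: its mutation is not close to a site
```

A real implementation would instead pose the head as a Prolog query and let unification enumerate all satisfying variable assignments.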
For instance, a simple hypothesis like res_against(A,ncrti) ← mutation(A,C) AND close_to_site(C) would indicate that a mutation C in the proximity of a binding site confers on mutant A resistance against an NCRTI. Learning in this setting consists of searching for a set of definite clauses H = {c1, ..., cm} covering all or most positive examples, and none or few negative ones if available. First-order clauses can thus be interpreted as relational features that characterize the target concept. The main advantage of these logic-based approaches with respect to other machine learning techniques is the expressivity and interpretability of the learned models. Models can be readily interpreted by human experts and provide direct explanations for the predictions.

Background knowledge
We built a relational knowledge base for the domain at hand. Table 1 summarizes the predicates we included as background knowledge. We represented the amino acids of the wild type with their positions in the primary sequence (aa/2) and the specific mutations characterizing them (mut_prop/4). Target predicates were encoded as resistance of the mutation to a certain drug (res_against/2).

Additional background knowledge was included in order to highlight characteristics of residues and relationships between mutations:

color/2 indicates the type of the natural amino acids according to the coloring proposed in [9]. For example, the magenta class includes basic amino acids such as lysine and arginine, while the blue class includes acidic amino acids such as aspartic and glutamic acid.

same_type/2 indicates whether two residues belong to the same type, i.e. a change from one residue to the other conserves the type of the amino acid.

same_type_mut/2 indicates that a residue substitution at a certain position does not modify the amino acid type with respect to the wild type. For example, mutation d123e conserves the amino acid type while mutation d123a does not (i.e. different_type_mut/2 holds for it).

Other background knowledge facts and rules were added in order to express structural relations along the primary sequence and the catalytic propensity of the involved residues:

close_to_site/1 indicates whether a specific position is less than 5 positions away from a residue belonging to a binding or active site. In our specific case, the background theory incorporates knowledge about a metal binding site and a heterodimerization site.

location/2 indicates in which fragment of the primary sequence the amino acid is located. Locations are numbered from 0 by dividing the sequence into a certain number of fragments.

catalytic_propensity/2 indicates whether an amino acid has a high, medium or low catalytic propensity according to [10].

mutated_residue_cp/3 indicates how, in a mutated position, the catalytic propensity has changed (e.g. from low to high).

Algorithm overview
The proposed approach is sketched in Figure 1. The first step is the learning phase, in which an ILP learner is fed with a logical representation of the data D (the mutations experimentally observed to confer resistance to a certain inhibitor) and of the domain knowledge B we want to incorporate, and it returns a first-order logical hypothesis H for the concept of a mutation conferring resistance to a certain drug.

The hypothesis is derived using the Aleph (A Learning Engine for Proposing Hypotheses) ILP system (http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html). Aleph also allows learning from positive examples only. This is the most suitable approach in our case, as the positive examples are the mutations experimentally proved to confer resistance to a drug, but no safe claim can be made about the other mutations if there is not sufficient evidence, due for example to the lack of an exhaustive set of laboratory experiments.

Aleph incrementally builds a hypothesis covering all positive examples, guided by a Bayesian evaluation function, described in [11], scoring candidate solutions according to an estimate of the Bayes posterior probability that allows trading off hypothesis size and generality.
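For illustration, the structural background predicates described earlier (close_to_site/1, location/2, mutated_residue_cp/3) could be computed as in the following Python sketch. The site positions are those of the RT metal binding site (D110, D185, D186) and heterodimerization site (W401, W414) used in this work; the fragment size is an assumption for illustration only:

```python
# Sketch of the structural background predicates (not the Prolog theory itself).
SITE_POSITIONS = {110, 185, 186, 401, 414}  # metal binding + heterodimerization sites
FRAGMENT_SIZE = 25                          # assumed fragment length, for illustration

def close_to_site(pos):
    """close_to_site/1: position is less than 5 positions from a site residue."""
    return any(abs(pos - s) < 5 for s in SITE_POSITIONS)

def location(pos):
    """location/2: index of the primary-sequence fragment (numbered from 0)."""
    return pos // FRAGMENT_SIZE

def mutated_residue_cp(cp_old, cp_new):
    """mutated_residue_cp/3 (per position): how catalytic propensity changed."""
    return (cp_old, cp_new)

print(close_to_site(188))  # True: within 5 positions of D185/D186
print(close_to_site(103))  # False
print(location(103))       # fragment 4 under the assumed fragment size
```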
In Figure 2 we show an example of a learned hypothesis covering a set of examples. The learned hypothesis models the ability of a mutation to confer resistance to NCRTIs and is composed of three first-order clauses, each one covering different sets of mutations of the wild type, as highlighted in colors: blue for the first clause, yellow for the second and red for the third one. Some mutations are covered by more than one clause, as shown by the color overlaps.

Our background knowledge incorporates information about the RT metal binding site, which is composed of the aspartic acids D110, D185 and D186, and about the heterodimerization site composed of W401 and W414 (in bold in Figure 2).

The second step is the generative phase, in which the learned hypothesis is employed to find novel mutations that can confer drug resistance to an RT mutant. A set of candidate mutations can be generated by using the Prolog inference engine starting from the rules in the learned model. The rules are actually constraints on the characteristics that a mutation of the wild type should have in order to confer resistance to a certain inhibitor.

Algorithm 1 details the mutation generation procedure. We assume here, for simplicity, that we have a model H for a single drug class. The procedure works by querying the Prolog inference engine for all possible variable assignments that satisfy the hypothesis clauses. In order to generate novel mutations, the mut_prop/4 predicate is modified here to return all legal mutations instead of only those appearing in the training instances. The set of mutations identified by the variable assignments and not present in the training set is ranked according to a scoring function SM before being returned by the algorithm. In this work we defined SM as the number of clauses in H that a candidate mutation m satisfies.

Referring to the model in Figure 2, some generated candidate mutations with the highest score are: 101P, 102A, 103F, 103I, 104M, 181I, 181V, 183M, 188L, 188F, where for example the notation 101P indicates a change of the wild type amino acid, located in position 101, into a proline (P). This list also includes known surveillance mutations [12].

Performance evaluation
We divided the dataset of mutations into a training and a test set (70/30) in a stratified way, that is, by preserving, both in the training and in the test set, the proportion of examples belonging to each of the four drug classes. The resulting training set is composed of a total of 116 mutations, while the test set is composed of 48 mutations.

We trained the ILP learner on the training set and evaluated on the test set the set of mutations generated using the learned model. The evaluation procedure takes the generated mutations and computes their enrichment in test mutations by counting how many generated mutations are actually observed in at least one test example. We compare the recall of the approach with the recall of an algorithm that chooses at random a set (of the same cardinality) of possible mutations among all legal ones.

Recall (or sensitivity, or TP rate) is t+/(t+ + f−), where t+, the true positives, is the number of test set mutations that are actually generated, and t+ + f−, true positives plus false negatives, corresponds to the total number of test set mutations.

Results averaged over 30 random splits of the dataset are reported in Table 2. On each split we performed 30 runs of our algorithm and of the random generation algorithm on each of the different learning tasks (NNRTI, NRTI, NCRTI and PARTI). For each task we also report, in columns 3 and 4, the mean number of generated mutations over the 30 splits and the number of test set mutations for reference.

We evaluated the statistical significance of the performance differences between the two algorithms by paired Wilcoxon tests on the averaged recall reported on each split. We employed a confidence level α of 0.05.

The improvement of the algorithm with respect to the random generation of mutations is statistically significant on the NRTI and NNRTI tasks, which are the tasks on which we can learn the hypothesis from the largest set of training examples.

Figure 3 shows the trend of the mean recall over all splits when cutting the number of generated mutations (from 1 generated mutation to more than 8000). The advantage of our approach remains quite stable when reducing the set of candidates, producing almost nine times more test mutations than random in the 100 highest scoring ones for NNRTI (Figure 4 shows the recall curves of the plot in Figure 3(a) for the highest ranked 150 mutations).
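The evaluation just described can be sketched as follows: recall is the fraction of test-set mutations recovered among the generated candidates (the mutation identifiers below are illustrative, not taken from the dataset):

```python
# Sketch of the evaluation: recall of generated mutations against the test set.
# Mutation identifiers are illustrative, not from the real dataset.

def recall(generated, test):
    """t+ / (t+ + f-): fraction of test-set mutations that appear
    among the generated candidates."""
    hits = sum(1 for m in test if m in generated)  # t+
    return hits / len(test)                        # t+ + f- = |test|

generated = {"101p", "103i", "181v", "188l", "60a"}
test = {"103i", "181v", "90i", "188l"}
print(recall(generated, test))  # 0.75: 3 of the 4 test mutations are generated
```

The random baseline is evaluated with the same function, on a set of candidates of the same cardinality drawn uniformly from all legal mutations.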
We finally learned a model on the whole set of training mutations in order to generate a single set of mutations for further inspection. Below we report five examples of novel mutations with the highest rank for each of the tasks:

NNRTI: 90I 98I 103I 106P 179I
NRTI: 60A 153M 212L 229F 239I
NCRTI: 183V 183L 188V 188F 188I
PARTI: 84R 86E 88Y 88V 89N

In [13], the authors found a set of novel mutations conferring resistance to efavirenz and nevirapine, which are NNRTIs. Our mutation generation algorithm partially confirmed their findings. Mutation 90I was ranked high (5/5), mutation 101H was generated with a rank of 3/5, mutations 196R and 138Q with a rank of 1/5, while mutation 28K was not generated at all by our system as a candidate for conferring resistance to NNRTI.

Example of learned hypothesis
An example of a learned hypothesis is reported in Figure 5. For instance, according to the model, among the features a mutation should have for conferring resistance to an NNRTI there is the change of a basic (magenta) residue of the wild type, e.g. lysine or arginine, into a residue with completely different physico-chemical characteristics (rule 16). Another example for the resistance to NNRTI is that a non-conserved mutation is present in positions between 98 and 106 of the wild type sequence (rule 8).

Conclusions
In this work we proposed a simple relational learning approach toward protein engineering. Starting from HIV reverse transcriptase mutation data, we built a relational knowledge base and mined relevant relational features for modeling mutant resistance using an ILP learner. Based on the learned relational rules, we generated a set of candidate mutations satisfying them.

Albeit preliminary, our results suggest that the proposed approach for learning mutations has potential in guiding mutant engineering, as well as in predicting virus evolution so as to devise appropriate countermeasures. A more detailed background knowledge, possibly including 3D information whenever available, is necessary in order to further focus the set of generated mutations, possibly together with post-processing stages involving mutant evaluation by statistical machine learning approaches [2]. In the near future we also plan to generalize the proposed approach to jointly generate sets of related mutations, shifting the focus from single point mutations to entire mutants.

Authors contributions
EC built the background knowledge, implemented the algorithm and conducted the experimental evaluation. AP suggested the overall idea and coordinated the study. Both authors contributed to writing the article. Both authors read and approved the final manuscript.

References
1. Muggleton S, De Raedt L: Inductive Logic Programming: Theory and Methods. University Computing 1994, :629–682.
2. Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9(Suppl 2):S6.
3. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, Westhead DR: Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics 2006, 7:405.
4. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research 2007, 35(11):3823–3835.
5. Rhee SY, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, Shafer RW: Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(46):17355–17360.
6. Richter L, Augustin R, Kramer S: Finding Relational Associations in HIV Resistance Mutation Data. In Proceedings of Inductive Logic Programming (ILP), Volume 9, 2009.
7. Cilia E, Landwehr N, Passerini A: Mining Drug Resistance Relational Features with Hierarchical Multitask kFOIL. In Proceedings of BioLogical@AI*IA2009, 2009.
8. Landwehr N, Passerini A, De Raedt L, Frasconi P: Fast learning of relational kernels. Machine Learning 2010, 78(3):305–342.
9. Taylor WR: The classification of amino acid conservation. Journal of Theoretical Biology 1986, 119(2):205–218.
10. Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. Journal of Molecular Biology 2002, 324:105–121.
11. Muggleton S: Learning from Positive Data. In Inductive Logic Programming Workshop 1996:358–376.
12. Bennett DE, Camacho RJ, Otelea D, Kuritzkes DR, Fleury H, Kiuchi M, Heneine W, Kantor R, Jordan MR, Schapiro JM, Vandamme AM, Sandstrom P, Boucher CAB, van de Vijver D, Rhee SY, Liu TF, Pillay D, Shafer RW: Drug resistance mutations for surveillance of transmitted HIV-1 drug-resistance: 2009 update. PLoS ONE 2009, 4(3):e4724.
13. Deforche K, Camacho RJ, Grossman Z, Soares MA, Van Laethem K, Katzenstein DA, Harrigan PR, Kantor R, Shafer R, Vandamme AM: Bayesian network analyses of resistance pathways against efavirenz and nevirapine. AIDS 2008, 22(16):2107–2115.
Algorithms
Algorithm 1 - Mutation generation algorithm.
Algorithm for the discovery of novel relevant mutations.

Algorithm 1 Mutation generation algorithm.
1: input: dataset of training mutations D, background knowledge B, learned model H
2: output: rank of the most relevant mutations R
3: procedure GenerateMutations(D, B, H)
4:   Initialize DM ← ∅
5:   A ← find all assignments a that satisfy at least one clause ci ∈ H
6:   for a ∈ A do
7:     m ← mutation corresponding to the assignment a
8:     score ← SM(m)              ▷ number of clauses ci satisfied by a
9:     if m ∉ D then              ▷ discard mutations observed in the training set
10:      DM ← DM ∪ {(m, score)}
11:    end if
12:  end for
13:  R ← RankMutations(DM, B, H)  ▷ rank relevant mutations
14:  return R
15: end procedure
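Algorithm 1 can be sketched in Python. Here the hypothesis clauses are represented as boolean functions over candidate mutations (stand-ins for the learned Prolog rules, whose bodies below are invented for illustration), and the scoring function SM counts satisfied clauses:

```python
# Sketch of Algorithm 1: generate, score and rank candidate mutations.
# The clause bodies are illustrative stand-ins for the learned rules.

def generate_mutations(all_legal, training, clauses):
    """Return candidates satisfying at least one clause, ranked by the
    number of satisfied clauses S_M, excluding training mutations."""
    scored = []
    for m in all_legal:
        s_m = sum(1 for c in clauses if c(m))  # scoring function S_M
        if s_m > 0 and m not in training:      # discard training mutations
            scored.append((m, s_m))
    scored.sort(key=lambda x: -x[1])           # rank by score
    return scored

# Illustrative clauses over (position, new_amino_acid) candidates.
clauses = [
    lambda m: 98 <= m[0] <= 106,   # e.g. a location-based rule
    lambda m: m[1] in {"i", "p"},  # e.g. an amino-acid-based rule
]
all_legal = [(101, "p"), (103, "i"), (215, "y"), (90, "i")]
training = {(215, "y")}
print(generate_mutations(all_legal, training, clauses))
# [((101, 'p'), 2), ((103, 'i'), 2), ((90, 'i'), 1)]
```

In the actual implementation the enumeration of satisfying assignments is delegated to the Prolog inference engine, with mut_prop/4 relaxed to range over all legal mutations.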
Figures
Figure 1 - Mutation engineering algorithm.
Schema of the mutation engineering algorithm.
[Diagram: the ILP learner takes the dataset of training mutations D and the background knowledge B and produces a hypothesis H; the generator of relevant mutations then uses H to output the rank R of novel relevant mutations.]

Figure 1: Schema of the mutation engineering algorithm.
Figure 2 - Model for the resistance to NCRTI
An example of learned hypothesis for the NCRTI task with highlighted examples covered by the hypothesis
clauses.
[Wild-type RT sequence with the mutation positions covered by each clause highlighted in color; the binding-site residues (aspartic acids of the metal binding site and tryptophans of the heterodimerization site) are shown in bold.]

mut_prop(A,B,C,D) AND location(11.0,C)
mut_prop(A,B,C,D) AND mutated_residue_cp(C,high,small)
mut_prop(A,B,C,D) AND color(green,B) AND close_to_site(C)
Figure 3 - Mean recall by varying the number of generated mutations
Mean recall by varying the number of generated mutations.
[Four panels of recall curves on the test set ("algo" vs. "rand") as the number of generated mutations grows from 0 to 8000: (a) NNRTI, (b) NRTI, (c) NCRTI, (d) PARTI.]

Figure 3: Mean recall by varying the number of generated mutations.
Figure 4 - Recall curves of the NNRTI task for the highest ranked mutations.
Detail of the mean recall curves of the NNRTI task for a number of generated mutations below 150.
[Recall curves on the test set ("algo" vs. "rand") for 0 to 150 generated mutations.]

Figure 4: Detail of the mean recall curves of the NNRTI task for a number of generated mutations below 150.
Figure 5 - Example of learned model
1.  res_against(A,nrti) ← mut_prop(A,B,C,D) AND color(red,D)
2.  res_against(A,nrti) ← mut_prop(A,B,C,D) AND color(red,D) AND color(red,B)
3.  res_against(A,nrti) ← mut_prop(A,B,C,D) AND location(7.0,C) AND mutated_residue_cp(C,medium,medium)
4.  res_against(A,nrti) ← mut_prop(A,B,C,D) AND location(7.0,C)
5.  res_against(A,nrti) ← same_type_mut(A,B)
6.  res_against(A,nrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,medium,high)
7.  res_against(A,nrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,medium,small)
8.  res_against(A,nnrti) ← different_type_mut(A,B) AND location(11.0,B)
9.  res_against(A,nnrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,small,small)
10. res_against(A,nnrti) ← same_type_mut(A,B)
11. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND aminoacid(B,v)
12. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND location(15.0,C)
13. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND aminoacid(D,i)
14. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND location(11.0,C)
15. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND color(red,D)
16. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND color(magenta,B) AND different_type_mut(A,C)
17. res_against(A,nnrti) ← mut_prop(A,B,C,D) AND location(21.0,C)
18. res_against(A,ncrti) ← mut_prop(A,B,C,D) AND location(11.0,C)
19. res_against(A,ncrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,high,small)
20. res_against(A,ncrti) ← mut_prop(A,B,C,D) AND color(green,B) AND close_to_site(C)
21. res_against(A,parti) ← mut_prop(A,B,C,D) AND location(9.0,C)
Tables
Table 1 - Background knowledge predicates
Summary of the background knowledge facts and rules.
Background Knowledge Predicates
aa(Pos,AA): indicates a residue in the wild type sequence
mut_prop(MutationID,AA,Pos,AA1): indicates a mutation: mutation identifier, position and amino acids involved, before and after the substitution
res_against(MutationID,Drug): indicates whether a mutation is resistant to a certain drug
color(Color,AA): indicates the type of a natural amino acid
same_type(R1,R2): indicates whether two residues are of the same type
same_type_mut(MutationID,Pos): indicates a mutation to a residue of the same type
different_type_mut(MutationID,Pos): indicates a mutation changing the type of the residue
close_to_site(Pos): indicates whether a specific position is close to a binding or active site, if any
location(L,Pos): indicates in which fragment of the primary sequence the amino acid is located
catalytic_propensity(AA,CP): indicates whether an amino acid has a high, medium or low catalytic propensity
mutated_residue_cp(Pos,CPold,CPnew): indicates how, in a mutated position, the catalytic propensity has changed (e.g. from low to high)
Table 2 - Results
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random. The average recall has been computed for each of the learning tasks over the 30 splits by averaging recall over 30 repeated runs of the two algorithms. Results of a paired Wilcoxon test (α = 0.05) on the statistical significance of the performance differences are also reported. A black bullet indicates a statistically significant improvement of our algorithm over the random generator.

Mean recall % on 30 splits
Task    Algorithm   Random Generator   Mean n. generated mutations   n. test mutations
NNRTI   86 •        58                 5201                          17
NRTI    55 •        46                 5548                          28
NCRTI   16          8                  1042                          1
PARTI   49          39                 3425                          2