=Paper= {{Paper |id=None |storemode=property |title=Frankenstein Junior: a relational learning approach toward protein engineering |pdfUrl=https://ceur-ws.org/Vol-645/Paper2.pdf |volume=Vol-645 }}
Frankenstein Junior: a relational learning approach toward
protein engineering
Elisa Cilia1,2 , Andrea Passerini∗1

1 Department of Computer Science and Information Engineering, University of Trento, via Sommarive 14, I-38123 (Povo) Trento,

Italy
2 FEM-IASMA Research & Innovation Center, via Edmund Mach 1, I-38010 S. Michele all’Adige (TN), Italy



Email: Elisa Cilia - cilia@disi.unitn.it, elisa.cilia@iasma.it; Andrea Passerini∗ - passerini@disi.unitn.it;

∗ Corresponding author




Abstract
Background: Mining relevant features from protein mutation data is fundamental for understanding the characteristics of a protein functional site. The mined features can also be used for engineering novel proteins with useful functions.
Results: We propose a simple relational learning approach to protein engineering. First, we learn a set of relational rules from mutation data; then we use them to generate a set of candidate mutations that are most likely to confer resistance to a certain inhibitor or improved activity on a specific substrate. We tested our approach on a dataset of HIV drug resistance mutations, comparing it to a baseline random generator. Statistically significant improvements are obtained on both the nucleoside and non-nucleoside categories of HIV reverse transcriptase inhibitors.
Conclusions: Our promising preliminary results suggest that the proposed approach to learning mutations has potential for guiding mutant engineering, as well as for predicting virus evolution in order to devise appropriate countermeasures. The approach generalizes quite easily to learning mutants characterized by more complex rules correlating multiple mutations.




Background
Mining relevant features from protein mutation data has as its first aim the understanding of the properties of functional sites, for instance which residues are more likely to have a functional role. The same mined information can be used to engineer mutants of a protein with improved activity on a certain substrate or resistance to a certain inhibitor.

Rational design is an engineering technique that modifies existing proteins by site-directed mutagenesis. It assumes knowledge of, or intuition about, the effects of specific mutations on the protein function. The process typically involves extensive trial-and-error experiments and is also used to improve the understanding of the mechanisms underlying a protein's behavior and to take the necessary countermeasures.

In this work we took the first steps towards the use of a relational learning approach to protein engineering by focusing on learning relevant single mutations (mutations that change the amino acid type in the protein sequence) that can affect the behavior of
a protein. Predicting the effect of single mutations helps reduce the dimension of the search space for rational design.

In the proposed approach an Inductive Logic Programming (ILP) [1] learner is trained to extract general rules describing mutations relevant to a certain behavior (e.g. drug resistance of HIV), and the learned rules are then used to infer novel potentially relevant mutations.

We focused on learning relevant mutations of the HIV reverse transcriptase (RT). The HIV RT is a DNA polymerase enzyme that transcribes RNA into DNA, allowing it to be integrated into the genome of the host cell and replicated along with it; it is therefore crucial for virus propagation. Viruses typically have a very high mutation rate, and a mutation can confer on the mutant resistance to one or more drugs, for instance by modifying the inhibitor target site on the protein. In the case of HIV drug resistance, which we address in the present work, building artificial mutants that can resist the inhibitor could help predict virus evolution early and thus design more effective drugs.

Many machine learning methods have been applied in the past to mutation data for predicting the effect of single point mutations on protein stability changes [2], on protein function [3] [4], or on drug susceptibility [5]. To our knowledge, this is the first proposal of an approach for learning relational features of mutations affecting a protein behavior and using them to generate novel relevant mutations. Furthermore, even if we focus on single point mutations in our experimental evaluation, our approach can be quite straightforwardly extended to multiple point mutations, and we are actively working in this direction. Conversely, the above-mentioned predictive approaches would immediately blow up because of the explosion in the number of candidate configurations to evaluate.

Results and Discussion
Dataset
We applied our approach to the same dataset of mutations already used in [6] for extracting relational rules among mutations. The dataset is derived from the Los Alamos National Laboratories (LANL) HIV resistance database1 and reports mutations of the HIV reverse transcriptase (RT). Richter et al. [6] formulated the learning problem as a mining task and applied a relational association rule miner to derive rules relating different mutations and their resistance properties.

In previous work [7] we used the same dataset for learning a relational model of mutant resistance with the hierarchical kFOIL algorithm, a statistical relational learner [8]. Here we use the dataset for inferring rules that can characterize a single mutation as resistant to a certain class of RT inhibitors. The drug classes include: a) Nucleoside RT Inhibitors (NRTI); b) Non-Nucleoside RT Inhibitors (NNRTI); c) Non-Competitive RT Inhibitors (NCRTI); d) Pyrophosphate Analogue RT Inhibitors (PARTI). The four classes of inhibitors differ in the targeted sites and rely on quite different mechanisms. NNRTIs and NCRTIs inhibit the reverse transcriptase by binding to the enzyme active site, therefore directly interfering with the enzyme function. NRTIs are instead incorporated into the newly synthesized viral DNA, preventing its elongation. Finally, PARTIs target the pyrophosphate binding site and are employed, as part of a salvage therapy, on patients in which the HIV infection shows resistance to the other classes of antiretroviral drugs. The final dataset is composed of 164 mutations labeled as resistant over a set of 581 observed mutations (extracted from 2339 mutants). Among the 164 mutations, 95 are labeled as resistant to NRTI, 56 to NNRTI, 5 to NCRTI and 8 to PARTI.

1 http://www.hiv.lanl.gov/content/sequence/RESDB/

Learning in first order logic
Our aim is to learn a first-order logic hypothesis for the target concept, i.e. a mutation conferring resistance to a certain drug, and use it to infer novel mutations consistent with such a hypothesis. We rely on definite clauses, which are the basis of the Prolog programming language. A definite clause is an expression of the form h ← b1 AND ... AND bn, where h and the bi are atoms. Atoms are expressions of the form p(t1, ..., tn) where p/n is a predicate symbol of arity n and the ti are terms, either constants (denoted by lower case) or variables (denoted by upper case) in our experiments. The atom h is also called the head of the clause, typically the target predicate, and b1 AND ... AND bn its body. Intuitively, a clause states that the head h will hold whenever the body b1 AND
... AND bn holds. For instance, a simple hypothesis like res_against(A,ncrti) ← mutation(A,C) AND close_to_site(C) would indicate that a mutation C in the proximity of a binding site confers on mutant A resistance against an NCRTI. Learning in this setting consists of searching for a set of definite clauses H = {c1, ..., cm} covering all or most positive examples, and none or few negative ones if available. First-order clauses can thus be interpreted as relational features that characterize the target concept. The main advantage of these logic-based approaches with respect to other machine learning techniques is the expressivity and interpretability of the learned models. Models can be readily interpreted by human experts and provide direct explanations for the predictions.

Background knowledge
We built a relational knowledge base for the domain at hand. Table 1 summarizes the predicates we included as background knowledge. We represented the amino acids of the wild type with their positions in the primary sequence (aa/2) and the specific mutations characterizing them (mut_prop/4). Target predicates were encoded as resistance of the mutation to a certain drug (res_against/2).

Additional background knowledge was included in order to highlight characteristics of residues and relationships between mutations:

color/2 indicates the type of the natural amino acids according to the coloring proposed in [9]. For example, the magenta class includes basic amino acids such as lysine and arginine, while the blue class includes acidic amino acids such as aspartic and glutamic acid.

same_type/2 indicates whether two residues belong to the same type, i.e. a change from one residue to the other conserves the type of the amino acid.

same_type_mut/2 indicates that a residue substitution at a certain position does not modify the amino acid type with respect to the wild type. For example, mutation d123e conserves the amino acid type while mutation d123a does not (i.e. different_type_mut/2 holds for it).

Other background knowledge facts and rules were added in order to express structural relations along the primary sequence and the catalytic propensity of the involved residues:

close_to_site/1 indicates whether a specific position is fewer than 5 positions away from a residue belonging to a binding or active site. In our specific case, the background theory incorporates knowledge about a metal binding site and a heterodimerization site.

location/2 indicates in which fragment of the primary sequence the amino acid is located. Locations are numbered from 0 by dividing the sequence into a certain number of fragments.

catalytic_propensity/2 indicates whether an amino acid has a high, medium or low catalytic propensity according to [10].

mutated_residue_cp/3 indicates how the catalytic propensity has changed at a mutated position (e.g. from low to high).

Algorithm overview
The proposed approach is sketched in Figure 1.

The first step is the learning phase, in which an ILP learner is fed with a logical representation of the data D (the mutations experimentally observed to confer resistance to a certain inhibitor) and of the domain knowledge B we want to incorporate, and it returns a first-order logical hypothesis H for the concept of a mutation conferring resistance to a certain drug.

The hypothesis is derived using the Aleph (A Learning Engine for Proposing Hypotheses) ILP system2. Aleph also allows learning from positive examples only. This is the most suitable approach in our case, as the positive examples are the mutations experimentally proved to confer resistance to a drug, but no safe claim can be made about the other mutations if there is insufficient evidence, due for example to the lack of an exhaustive set of laboratory experiments.

2 http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html

Aleph incrementally builds a hypothesis covering all positive examples, guided by a Bayesian evaluation function, described in [11], scoring candidate
solutions according to an estimate of the Bayesian posterior probability, which allows trading off hypothesis size and generality.

In Figure 2 we show an example of a learned hypothesis covering a set of examples. The learned hypothesis models the ability of a mutation to confer resistance to NCRTIs and is composed of three first-order clauses, each one covering different sets of mutations of the wild type, as highlighted in colors: blue for the first clause, yellow for the second and red for the third one. Some mutations are covered by more than one clause, as shown by the color overlaps.

Our background knowledge incorporates information about the RT metal binding site, which is composed of the aspartic acids D110, D185 and D186, and about the heterodimerization site, composed of W401 and W414 (in bold in Figure 2).

The second step is the generative phase, in which the learned hypothesis is employed to find novel mutations that can confer drug resistance to an RT mutant. A set of candidate mutations can be generated by using the Prolog inference engine starting from the rules in the learned model. The rules are actually constraints on the characteristics that a mutation of the wild type should have in order to confer resistance to a certain inhibitor.

Algorithm 1 details the mutation generation procedure. We assume here, for simplicity, to have a model H for a single drug class. The procedure works by querying the Prolog inference engine for all possible variable assignments that satisfy the hypothesis clauses. In order to generate novel mutations, the mut_prop/4 predicate is modified here to return all legal mutations instead of only those appearing in the training instances. The set of mutations identified by the variable assignments and not present in the training set is ranked according to a scoring function SM before being returned by the algorithm. In this work we defined SM as the number of clauses in H that a candidate mutation m satisfies.

Referring to the model in Figure 2, some generated candidate mutations with highest score are: 101P, 102A, 103F, 103I, 104M, 181I, 181V, 183M, 188L, 188F, where for example the notation 101P indicates a change of the wild type amino acid located in position 101 into a proline (P). This list also includes known surveillance mutations [12].

Performance evaluation
We divided the dataset of mutations into a training and a test set (70/30) in a stratified way, that is, by preserving, both in the training and in the test set, the proportion of examples belonging to each of the four drug classes. The resulting training set is composed of a total of 116 mutations, while the test set is composed of 48 mutations.

We trained the ILP learner on the training set and evaluated on the test set the set of mutations generated using the learned model. The evaluation procedure takes the generated mutations and computes their enrichment in test mutations by counting how many generated mutations are actually observed in at least one test example. We compare the recall of the approach with the recall of an algorithm that chooses at random a set (of the same cardinality) of possible mutations among all legal ones.

Recall (or sensitivity, or TP rate) is t+ / (t+ + f−), where t+, the true positives, is the number of generated mutations observed in the test set, and t+ + f−, true positives plus false negatives, corresponds to the number of generated mutations.

Results averaged over 30 random splits of the dataset are reported in Table 2. On each split we performed 30 runs of our algorithm and of the random generation algorithm on each of the different learning tasks (NNRTI, NRTI, NCRTI and PARTI). For each task we also report, in columns 3 and 4, the mean number of generated mutations over the 30 splits and the number of test set mutations for reference.

We evaluated the statistical significance of the performance differences between the two algorithms by paired Wilcoxon tests on the averaged recall on each split, with a confidence level α of 0.05.

The improvement of the algorithm with respect to the random generation of mutations is statistically significant on the NRTI and NNRTI tasks, which are the tasks on which the hypothesis can be learned from the largest set of training examples.

Figure 3 shows the trend of the mean recall over all splits when cutting the number of generated mutations (from 1 generated mutation to more than 8000). The advantage of our approach remains quite stable when reducing the set of candidates, producing almost nine times more test mutations than random among the 100 highest scoring ones for NNRTI (Figure 3(b) shows the recall curves of the plot in Figure 3(a) restricted to the highest ranked 150 mutations).
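The enrichment-based recall and the random baseline described in the evaluation can be sketched as follows; the "101P"-style mutation strings and the toy data are purely illustrative, not taken from the actual dataset.

```python
import random

def recall(generated, test_mutations):
    """Recall = t+ / (t+ + f-): the fraction of generated mutations
    that are observed in at least one test example."""
    hits = sum(1 for m in generated if m in test_mutations)
    return hits / len(generated) if generated else 0.0

def random_baseline(all_legal, k, rng):
    """Baseline: draw a same-cardinality set uniformly at random
    among all legal mutations."""
    return rng.sample(sorted(all_legal), k)

# Toy example (illustrative data only).
test_set = {"101P", "103I", "188L"}
generated = ["101P", "103I", "104M", "181V"]
print(recall(generated, test_set))  # 2 of the 4 generated mutations hit
```

The baseline is then scored with the same `recall` function on a set of the same cardinality, mirroring the paired comparison of the paper.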
We finally learned a model on the whole set of training mutations in order to generate a single set of mutations for further inspection. Below we report five examples of novel mutations with the highest rank for each of the tasks:

NNRTI 90I 98I 103I 106P 179I

NRTI 60A 153M 212L 229F 239I

NCRTI 183V 183L 188V 188F 188I

PARTI 84R 86E 88Y 88V 89N

In [13], the authors found a set of novel mutations conferring resistance to efavirenz and nevirapine, which are NNRTIs. Our mutation generation algorithm partially confirmed their findings. Mutation 90I was ranked high (5/5), mutation 101H was generated with a rank of 3/5, and mutations 196R and 138Q with rank 1/5, while mutation 28K was not generated at all by our system as a candidate for conferring resistance to NNRTIs.

Example of learned hypothesis
An example of a learned hypothesis is reported in Figure 5. For instance, according to the model, among the features a mutation should have for conferring resistance to an NNRTI there is the change of a basic (magenta) residue of the wild type, e.g. lysine or arginine, into a residue with completely different physico-chemical characteristics (rule 16).

Another example for the resistance to NNRTIs is that a non-conserved mutation is present in positions between 98 and 106 of the wild type sequence (rule 8).

Conclusions
In this work we proposed a simple relational learning approach toward protein engineering. Starting from HIV reverse transcriptase mutation data, we built a relational knowledge base and mined relevant relational features for modeling mutant resistance by using an ILP learner. Based on the learned relational rules, we generated a set of candidate mutations satisfying them.

Albeit preliminary, our results suggest that the proposed approach to learning mutations has potential for guiding mutant engineering, as well as for predicting virus evolution in order to devise appropriate countermeasures. More detailed background knowledge, possibly including 3D information whenever available, is necessary in order to further focus the set of generated mutations, possibly together with post-processing stages involving mutant evaluation by statistical machine learning approaches [2]. In the near future we also plan to generalize the proposed approach to jointly generate sets of related mutations, shifting the focus from single point mutations to entire mutants.

Authors' contributions
EC built the background knowledge, implemented the algorithm and conducted the experimental evaluation. AP suggested the overall idea and coordinated the study. Both authors contributed to writing the article. Both authors read and approved the final manuscript.

References
1. Muggleton S, De Raedt L: Inductive Logic Programming: Theory and Methods. Journal of Logic Programming 1994, 19-20:629–679.
2. Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9 Suppl 2:S6.
3. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, Westhead DR: Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics 2006, 7:405.
4. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research 2007, 35(11):3823–35.
5. Rhee SY, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, Shafer RW: Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(46):17355–60.
6. Richter L, Augustin R, Kramer S: Finding Relational Associations in HIV Resistance Mutation Data. In Proceedings of Inductive Logic Programming (ILP), Volume 9, 2009.
7. Cilia E, Landwehr N, Passerini A: Mining Drug Resistance Relational Features with Hierarchical Multitask kFOIL. In Proceedings of BioLogical@AI*IA2009, 2009.
8. Landwehr N, Passerini A, De Raedt L, Frasconi P: Fast learning of relational kernels. Machine Learning 2010, 78(3):305–342.
9. Taylor WR: The classification of amino acid conservation. Journal of Theoretical Biology 1986, 119(2):205–218.
10. Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. Journal of Molecular Biology 2002, 324:105–121.
11. Muggleton S: Learning from Positive Data. In Inductive Logic Programming Workshop 1996:358–376.
12. Bennett DE, Camacho RJ, Otelea D, Kuritzkes DR, Fleury H, Kiuchi M, Heneine W, Kantor R, Jordan MR, Schapiro JM, Vandamme AM, Sandstrom P, Boucher CAB, van de Vijver D, Rhee SY, Liu TF, Pillay D, Shafer RW: Drug resistance mutations for surveillance of transmitted HIV-1 drug-resistance: 2009 update. PLoS ONE 2009, 4(3):e4724.
13. Deforche K, Camacho RJ, Grossman Z, Soares MA, Van Laethem K, Katzenstein DA, Harrigan PR, Kantor R, Shafer R, Vandamme AM: Bayesian network analyses of resistance pathways against efavirenz and nevirapine. AIDS (London, England) 2008, 22(16):2107–15.



Algorithms
Algorithm 1 - Mutation generation algorithm.
Algorithm for the discovery of novel relevant mutations.

Algorithm 1 Mutation generation algorithm.
 1: input: dataset of training mutations D, background knowledge B, learned model H
 2: output: rank of the most relevant mutations R
 3: procedure GenerateMutations(D, B, H)
 4:    Initialize DM ← ∅
 5:    A ← find all assignments a that satisfy at least one clause ci ∈ H
 6:    for a ∈ A do
 7:        m ← mutation corresponding to the assignment a
 8:        score ← SM(m)                               ▷ number of clauses ci satisfied by a
 9:        if m ∉ D then                               ▷ discard mutations observed in the training set
10:            DM ← DM ∪ {(m, score)}
11:        end if
12:    end for
13:    R ← RankMutations(DM, B, H)                     ▷ rank relevant mutations
14:    return R
15: end procedure
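Algorithm 1 can be sketched as follows in Python. The learned first-order clauses, which the paper queries through the Prolog inference engine, are modelled here as plain predicates over a candidate mutation; the clause bodies, the fragment size and the position range are assumptions for illustration, not the actual learned model.

```python
# Illustrative sketch of Algorithm 1; not the paper's implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_mutations(train_mutations, clauses, positions):
    """Enumerate all legal single mutations, score each by the number
    of hypothesis clauses it satisfies (the paper's S_M), discard the
    ones observed in training, and rank by decreasing score."""
    scored = []
    for pos in positions:
        for aa in AMINO_ACIDS:
            m = (pos, aa)                    # (101, 'P') ~ notation 101P
            score = sum(1 for clause in clauses if clause(m))
            if score > 0 and m not in train_mutations:
                scored.append((m, score))
    scored.sort(key=lambda x: -x[1])         # highest S_M first
    return scored

# Two toy clauses loosely mimicking the NCRTI model of Figure 2
# (assumed forms): proximity to a metal-binding-site residue, and
# membership in an assumed 10-residue fragment number 11.
close_to_site = lambda m: any(abs(m[0] - s) < 5 for s in (110, 185, 186))
in_fragment_11 = lambda m: m[0] // 10 == 11
ranked = generate_mutations({(110, "D")}, [close_to_site, in_fragment_11],
                            range(98, 120))
print(ranked[0])  # top candidate satisfies both clauses
```

Ties are kept in enumeration order; as in the paper, the score of a candidate is exactly the count of hypothesis clauses it satisfies.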




Figures
Figure 1 - Mutation engineering algorithm.
Schema of the mutation engineering algorithm.
[Flowchart: the dataset of training mutations D and the background knowledge B are fed to the ILP learner, which outputs the hypothesis H; the generator of relevant mutations then produces the rank of novel relevant mutations R.]
                            Figure 1: Schema of the mutation engineering algorithm.



Figure 2 - Model for the resistance to NCRTI
An example of learned hypothesis for the NCRTI task with highlighted examples covered by the hypothesis
clauses.

Wild type RT fragment (site residues in bold in the original figure):
>wt ...AGLKKKKSVTVLDVG...YQYMDDLYVG...WETWWTEY...WIPEWEFVN...
with marked positions 98, 112, 181, 190, 398, 405, 410 and 418.

The three clause bodies of the learned hypothesis:
- mut_prop(A,B,C,D) AND location(11.0,C)
- mut_prop(A,B,C,D) AND mutated_residue_cp(C,high,small)
- mut_prop(A,B,C,D) AND color(green,B) AND close_to_site(C)
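The clause bodies above combine background predicates defined in the Background knowledge section; a minimal procedural sketch of two of them follows. The site positions come from the text (metal binding site D110, D185, D186; heterodimerization site W401, W414), while the fragment size used by location/2 is an assumption, since the paper only says the sequence is divided into a certain number of fragments.

```python
# Procedural reading of two background predicates (sketch).
SITES = (110, 185, 186, 401, 414)
FRAGMENT_SIZE = 40  # assumed fragment size

def close_to_site(pos):
    """close_to_site/1: fewer than 5 positions from a site residue."""
    return any(abs(pos - s) < 5 for s in SITES)

def location(pos):
    """location/2: index of the primary-sequence fragment, numbered from 0."""
    return pos // FRAGMENT_SIZE

print(close_to_site(188), location(188))  # position 188 is near D185
```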


Figure 3 - Mean recall by varying the number of generated mutations
Mean recall by varying the number of generated mutations.

[Four panels of recall curves comparing the proposed algorithm ("algo") with the random generator ("rand"): recall on the test set (y-axis, 0-1) versus number of generated mutations (x-axis, 0-8000), for (a) NNRTI, (b) NRTI, (c) NCRTI and (d) PARTI.]

                                               Figure 2: Mean recall by varying the number of generated mutations.




Figure 4 - Recall curves of the NNRTI task for the highest-ranked mutations
Detail of the mean recall curves of the NNRTI task for a number of generated mutations below 150.

[Recall curves comparing "algo" and "rand": recall on the test set (y-axis, 0-0.2) versus number of generated mutations (x-axis, 0-150).]




Figure 3: Detail of the mean recall curves of the NNRTI task for a number of generated mutations below
150.
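The recall plotted in the curves above is the fraction of known resistant test mutations recovered among the first k generated candidates. A minimal sketch of this recall@k computation (mutation names and the ranking are illustrative toy data, not the paper's):

```python
def recall_at_k(generated, test_mutations, k):
    """Fraction of known test mutations found among the first k generated candidates."""
    top_k = set(generated[:k])
    hits = sum(1 for m in test_mutations if m in top_k)
    return hits / len(test_mutations)

# toy data: 4 known resistant mutations and a ranked list of generated candidates
test = ["K103N", "Y181C", "G190A", "L100I"]
ranked = ["K103N", "V106A", "Y181C", "P225H", "L100I", "G190A"]

# one point of the recall curve per prefix length k
curve = [recall_at_k(ranked, test, k) for k in range(1, len(ranked) + 1)]
# curve == [0.25, 0.25, 0.5, 0.5, 0.75, 1.0]
```

Sweeping k from 1 to the number of generated mutations yields exactly the kind of monotone curve shown in the figures.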



Figure 5 - Example of a learned model

     1 - res_against(A,nrti) ← mut_prop(A,B,C,D) AND color(red,D)
     2 - res_against(A,nrti) ← mut_prop(A,B,C,D) AND color(red,D) AND color(red,B)
     3 - res_against(A,nrti) ← mut_prop(A,B,C,D) AND location(7.0,C) AND mutated_residue_cp(C,medium,medium)
     4 - res_against(A,nrti) ← mut_prop(A,B,C,D) AND location(7.0,C)
     5 - res_against(A,nrti) ← same_type_mut(A,B)
     6 - res_against(A,nrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,medium,high)
     7 - res_against(A,nrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,medium,small)
     8 - res_against(A,nnrti) ← different_type_mut(A,B) AND location(11.0,B)
     9 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,small,small)
    10 - res_against(A,nnrti) ← same_type_mut(A,B)
    11 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND aminoacid(B,v)
    12 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND location(15.0,C)
    13 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND aminoacid(D,i)
    14 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND location(11.0,C)
    15 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND color(red,D)
    16 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND color(magenta,B) AND different_type_mut(A,C)
    17 - res_against(A,nnrti) ← mut_prop(A,B,C,D) AND location(21.0,C)
    18 - res_against(A,ncrti) ← mut_prop(A,B,C,D) AND location(11.0,C)
    19 - res_against(A,ncrti) ← mut_prop(A,B,C,D) AND mutated_residue_cp(C,high,small)
    20 - res_against(A,ncrti) ← mut_prop(A,B,C,D) AND color(green,B) AND close_to_site(C)
    21 - res_against(A,parti) ← mut_prop(A,B,C,D) AND location(9.0,C)

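Clauses like those above can be read operationally: a candidate mutation is predicted resistant if it satisfies the body of at least one learned clause. A minimal Python sketch of this coverage test for a single clause (clause 14 for NNRTI); the fragment map and mutation identifiers are illustrative toy data, not the paper's background knowledge:

```python
# Hypothetical re-encoding of clause 14 from the model above:
#   res_against(A,nnrti) <- mut_prop(A,B,C,D) AND location(11.0,C)
# A mutation is a tuple (id, wild-type AA, position, mutated AA).

LOCATION = {103: 11.0, 181: 15.0, 190: 15.0}  # position -> sequence fragment (toy data)

def location(fragment, pos):
    """True if the position falls in the given primary-sequence fragment."""
    return LOCATION.get(pos) == fragment

def covered_by_clause_14(mutation):
    """Check the clause body against one candidate mutation."""
    _mid, _b, c, _d = mutation
    return location(11.0, c)

candidates = [("m1", "k", 103, "n"), ("m2", "y", 181, "c")]
nnrti_resistant = [m for m in candidates if covered_by_clause_14(m)]
# nnrti_resistant == [("m1", "k", 103, "n")]
```

Generating candidates then amounts to enumerating mutations and keeping those covered by any clause of the model, ranked by how many clauses cover them.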



Tables
Table 1 - Background knowledge predicates
Summary of the background knowledge facts and rules.

      Background Knowledge Predicates
      aa(Pos,AA)                                indicates a residue in the wild-type sequence
      mut_prop(MutationID,AA,Pos,AA1)           indicates a mutation: mutation identifier, position, and
                                                the amino acids involved, before and after the substitution
      res_against(MutationID,Drug)              indicates whether a mutation confers resistance to a certain drug
      color(Color,AA)                           indicates the type of a natural amino acid
      same_type(R1,R2)                          indicates whether two residues are of the same type
      same_type_mut(MutationID,Pos)             indicates a mutation to a residue of the same type
      different_type_mut(MutationID,Pos)        indicates a mutation changing the type of the residue
      close_to_site(Pos)                        indicates whether a specific position is close to a binding
                                                or active site, if any
      location(L,Pos)                           indicates in which fragment of the primary sequence the
                                                amino acid is located
      catalytic_propensity(AA,CP)               indicates whether an amino acid has a high, medium or
                                                low catalytic propensity
      mutated_residue_cp(Pos,CPold,CPnew)       indicates how the catalytic propensity has changed at a
                                                mutated position (e.g. from low to high)


Table 2 - Results
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations
at random. For each learning task, the mean recall was computed over the 30 splits by averaging the recall
of 30 repeated runs of the two algorithms. Results of a paired Wilcoxon test (α = 0.05) on the statistical
significance of the performance differences are also reported. A black bullet indicates a statistically
significant improvement of our algorithm over the random generator.

                   Mean recall % on 30 splits
               Algorithm     Random Generator       Mean n. generated mutations         n. test mutations
     NNRTI        86 •                58                       5201                             17
     NRTI         55 •                46                       5548                             28
     NCRTI         16                  8                       1042                              1
     PARTI         49                 39                       3425                              2
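The paired Wilcoxon signed-rank test used above compares the per-split recall of the two generators without assuming normality. A self-contained sketch using the normal approximation (the recall values are toy numbers, not the paper's per-split results):

```python
import math

def wilcoxon_paired(x, y, alpha=0.05):
    """Paired Wilcoxon signed-rank test (normal approximation, assumes no zero differences)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    # rank |d|, assigning average ranks to tied groups
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    w = min(w_plus, w_minus)                       # test statistic
    mu = n * (n + 1) / 4                           # mean of W under H0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = math.erfc(abs(w - mu) / sigma / math.sqrt(2))  # two-sided p-value
    return w, p, p < alpha

# toy per-split recall of the two generators (10 splits shown for brevity)
algo = [0.88, 0.85, 0.90, 0.82, 0.87, 0.84, 0.89, 0.86, 0.83, 0.88]
rand = [0.60, 0.55, 0.62, 0.58, 0.57, 0.61, 0.59, 0.56, 0.63, 0.58]
w, p, significant = wilcoxon_paired(algo, rand)
# here the algorithm wins on every split, so w == 0 and the difference is significant
```

With the algorithm ahead on every split the statistic collapses to zero and the null hypothesis of equal performance is rejected, which is what the bullets in the table mark.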



