Predicting virus mutations through relational learning

Elisa Cilia 1, Stefano Teso 2, Sergio Ammendola 3, Tom Lenaerts 1,4, Andrea Passerini* 2

1 MLG, Département d'Informatique, Université Libre de Bruxelles, Boulevard du Triomphe - CP 212, 1050 Brussels, Belgium
2 Department of Computer Science and Information Engineering, University of Trento, via Sommarive 14, I-38123 (Povo) Trento, Italy
3 Ambiotec sas, R&D, Via Appia Nord 47, 00142 Cisterna di Latina (LT), Italy
4 AI-lab, Vakgroep Computerwetenschappen, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium

Email: Elisa Cilia - ecilia@ulb.ac.be; Stefano Teso - teso@disi.unitn.it; Sergio Ammendola - ricercaesviluppo@ambiotec.it; Tom Lenaerts - tlenaert@ulb.ac.be; Andrea Passerini* - passerini@disi.unitn.it

* Corresponding author

Abstract

Background: Viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants.

Results: We propose a simple relational learning approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance and uses them to generate a set of potentially resistant mutants.

Conclusions: Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.

Background
HIV is a pandemic cause of lethal pathologies in more than 33 million people. Its horizontal transmission through mucosae is difficult to control and treat because the virus has a high virulence and infects several types of immune surveillance cells, such as those characterized by the CD4 receptor (CD4+ cells). The major problem in treating the infection is drug selectivity, since the virus penetrates the cell, where it releases its genetic material and replicates itself by exploiting the cellular machinery. A drug target is the replicating apparatus of the cell: HIV antiviral molecules are directed against several cell types, such as macrophages or T lymphocytes, to interfere with viral replication. HIV releases a single-strand RNA particle, a reverse transcriptase and an integrase into the cell cytoplasm. The RNA molecule is quickly retro-transcribed into a double-stranded DNA molecule, which is integrated into the host genome. The integration events induce a cellular response, which begins with the transcription of the Tat gene by RNA polymerase II. Tat is a well-known protein responsible for HIV activation, since it recruits cytoplasmic host proteins involved in the expression of viral genes. Remarkably, HIV can establish a life-long latent infection by suppressing its own transcription, thus making ineffective a large part of the antiviral drugs aimed at controlling viral replication. Moreover, replicating viruses adopt several drug-resistance strategies: for instance, HIV accumulates amino acid mutations that reduce the efficacy of pharmaceutical compounds.

The present work is aimed at gaining knowledge about mutations that may occur in the viral reverse transcriptase (RT) [9]. RT is an important target for developing antiretroviral medicines, and different types of active molecules have been found: the Nucleoside Reverse Transcriptase Inhibitors (NRTI) and the Non-Nucleoside RT Inhibitors (NNRTI). Although RT inhibitors are active, the HIV virus quickly and frequently changes the RT-encoding sequence, thus acquiring drug resistance. Since antiviral therapy is based on cocktails of molecules including new RT inhibitors, a computational approach for predicting possible mutation sites and their drug sensitivity is an important tool in drug discovery for antiretroviral therapy.

Computational methods can assist here by exploring the space of potential virus mutants, providing potential avenues for anticipatory drugs [1]. To achieve such a goal, one first needs to understand what kind of mutants may lead to resistance. A general engineering technique for building artificial mutants is referred to as rational design [2]. The technique consists in modifying existing proteins by site-directed mutagenesis. It relies on deep domain knowledge in order to identify candidate mutations that may affect protein structure or function. The process typically involves extensive trial-and-error experiments and is also aimed at improving the understanding of the mechanisms underlying a protein's behavior.

In this work we report on our initial attempt to develop an artificial system mimicking the rational design process. An Inductive Logic Programming (ILP) learner [3] is trained to extract general rules describing mutations relevant to a certain behavior, i.e. resistance towards certain inhibitors. The learned rules are then used to infer novel mutations which may induce similar resistance.

We consider here two learning settings. In the first one we rely on a training set made of single amino acid mutations known to confer resistance to a certain class of inhibitors (we will refer to this as mutation-based learning; preliminary promising results in this setting were described in [4]). In the second we learn the rules directly from mutants (comprising 1 to n amino acid mutations) that have been experimentally tested for their resistance to the same classes of inhibitors (we will refer to this as mutant-based learning). This second setting is actually the most common situation, in which one is presented with a number of mutants together with some evidence of their susceptibility to certain treatments. The goal is then to extract rules and potential mutations explaining why certain mutants confer resistance and which other ones produce the same effect.

Many machine learning methods have been applied in the past to mutation data, for predicting the effect of single amino acid mutations on protein stability changes [5], the effect of mutations on protein function [6, 7], or drug susceptibility [8]. To the best of our knowledge this is the first attempt to learn relational features of mutations affecting a protein's behavior and to use them for generating novel relevant mutations. Furthermore, even if we focus on single amino acid mutations in our experimental evaluation, our approach can be straightforwardly extended to multiple point mutations, and we are actively working in this direction. Note that other approaches first generate all potential mutations and then decide which of them lead to resistance, whereas we produce the relevant ones directly.

We report an experimental evaluation focused on HIV RT. RT is a well-studied protein: a large number of mutants have been shown to resist one or more drugs, and databases exist that collect those data from different sources and make them available for further analyses [10]. We tested the ability of our approach to generate drug-resistant amino acid mutations for NRTI and NNRTI. Our results show statistically significant improvements for both drug classes over the baseline results obtained through a random generator. Mutant-based results confirm our previous findings [4], extending them to the more natural setting in which only information on mutant behavior is observed. Additional background knowledge allows us to produce more compact and simple hypotheses, which have the advantage of generating a reduced number of candidate mutations.

The approach can in general be applied in mutation studies aimed at understanding protein function. By searching for residues most likely to have a functional role in an active site, it can for instance be used in the engineering of enzyme mutants with an improved activity for a certain substrate.

Results

Datasets

We applied our approach to predict HIV RT mutations conferring resistance to two classes of inhibitors: Nucleoside RT Inhibitors (NRTI) and Non-Nucleoside RT Inhibitors (NNRTI). The two classes of inhibitors differ in the targeted sites and rely on quite different mechanisms [11, 12]. NNRTI inhibit the reverse transcriptase by binding to the enzyme active site, therefore directly interfering with the enzyme function. NRTI are instead incorporated into the newly synthesized viral DNA, preventing its elongation.
We collected datasets for both mutation-based and mutant-based learning. The former (Dataset 1) is a dataset of amino acid mutations derived from the Los Alamos National Laboratories (LANL) HIV resistance database (http://www.hiv.lanl.gov/content/sequence/RESDB/) by Richter et al. [13], who used it to mine relational rules among mutations. It consists of 95 amino acid mutations labeled as resistant to NRTI and 56 labeled as resistant to NNRTI, over a set of 581 observed mutations. For the mutant-based setting, we collected (Dataset 2) HIV RT mutation data from the Stanford University HIV Drug Resistance Database. This database provides a dataset of selected mutants of HIV RT with the results of susceptibility studies for various drugs, and was previously employed [8] for predicting drug resistance of novel (given) mutants (downloadable at http://hivdb.stanford.edu/cgi-bin/GenoPhenoDS.cgi). It is composed of 838 different mutants annotated with susceptibility levels (low, medium and high) to drugs belonging to the NRTI (639 mutants) and NNRTI (747 mutants) drug classes. We considered a setting aimed at identifying amino acid mutations conferring high susceptibility (as opposed to medium or low), and considered a mutant as highly susceptible to a drug class if it was annotated as being highly susceptible to at least one drug from that class.

Learning in first order logic

Our aim is to learn a first-order logic hypothesis for a target concept, i.e. mutation conferring resistance to a certain drug, and to use it to infer novel mutations consistent with such a hypothesis. We rely on definite clauses, which are the basis of the Prolog programming language. A definite clause is an expression of the form h ← b1 AND ... AND bn, where h and the bi are atomic literals. Atomic literals are expressions of the form p(t1, ..., tn), where p/n is a predicate symbol of arity n and the ti are terms, either constants (denoted by lower case) or variables (denoted by upper case) in our experiments. The atomic literal h is also called the head of the clause, typically the target predicate, and b1 AND ... AND bn its body. Intuitively, a clause states that the head h holds whenever the body b1 AND ... AND bn holds. For instance, a simple hypothesis like res_against(A,nnrti) ← mutation(A,C) AND close_to_site(C) would indicate that a mutation C in the proximity of a binding site confers on mutant A resistance against an NNRTI. Learning in this setting consists of searching for a set of definite clauses H = {c1, ..., cm} covering all or most positive examples, and none or few negative ones if available. First-order clauses can thus be interpreted as relational features that characterize the target concept. The main advantage of these logic-based approaches with respect to other machine learning techniques is the expressivity and interpretability of the learned models: models can be readily interpreted by human experts and provide direct explanations for the predictions.

Background knowledge

We built a relational knowledge base for the problem domain. Table 1 summarizes the predicates we included as background knowledge. We represented the amino acids of the wild type with their positions in the primary sequence (aa/2) and the specific mutations characterizing the mutants (mut/4). Target predicates encode the resistance of a mutation or mutant to a certain drug class (res_against/2). Note that this encoding considers mutations at the amino acid rather than the nucleotide level, i.e. a single amino acid mutation can involve up to three nucleotide changes. While focusing on single nucleotide changes would drastically expand the space of possible mutations, including the cost (in terms of nucleotide changes) of a certain amino acid mutation could help refine our search procedure.

Additional background knowledge was included in order to highlight characteristics of residues and mutations:

typeaa/2 indicates the type of the natural amino acids according to the Venn diagram grouping based on the amino acid properties proposed in [14]. For example, a serine is a tiny and polar amino acid.

color/2 indicates the type of the natural amino acids according to the coloring proposed in [15]. For example, the magenta class includes basic amino acids such as lysine and arginine, while the blue class includes acidic amino acids such as aspartic and glutamic acid. Unlike in the previous case, these groups of amino acids do not overlap.

same_type_aa/3 indicates whether two residues belong to the same type T, i.e. whether a change from one residue to the other conserves the type of the amino acid.

same_color_type/2 indicates whether two residues belong to the same coloring group, i.e. whether a change from one residue to the other conserves the coloring group of the amino acid.

same_type_mut_t/3 indicates that a residue substitution at a certain position does not modify the amino acid type T with respect to the wild type. For example, mutation i123v conserves the aliphatic amino acid type while mutation i123d does not (i.e. different_type_mut_t/3 holds for it).

same_color_type_mut/2 indicates that a residue substitution at a certain position does not modify the amino acid coloring group with respect to the wild type. For example, mutation d123e conserves the blue amino acid group while mutation d123a does not (i.e. different_color_type_mut/2 holds for it).

location/2 indicates in which fragment of the primary sequence the amino acid is located. Locations are numbered from 0 by dividing the sequence into fragments of 10 amino acids in length.

Other background knowledge facts and rules were added in order to express structural relations along the primary sequence and the catalytic propensity of the involved residues:

close_to_site/1 indicates whether a specific position is less than 5 positions away from a residue belonging to a binding or active site. In our specific case, the background theory incorporates knowledge about a metal binding site and a heterodimerization site.

catalytic_propensity/2 indicates whether an amino acid has a high, medium or low catalytic propensity according to [16].

mutated_residue_cp/5 indicates how the catalytic propensity has changed in a mutated position (e.g. from low to high).

Algorithm overview

The proposed approach is sketched in Figure 1.

Step 1: Learning phase

The first step is the learning phase. An ILP learner is fed with a logical representation of the data D and of the domain knowledge B to be incorporated, and it returns a first-order logical hypothesis H for the concept of a mutation conferring resistance to a certain class of inhibitors. In this context there are two suitable ways to learn the target concept, depending on the type of input data and their labeling:

a) the one-class classification setting, learning a model from positive instances only. This is the approach we employ for Dataset 1: positive examples are mutations for which experimental proof of resistance to a drug is available, but no safe claim can be made about non-annotated mutations;

b) the binary classification setting, learning to discriminate between positive and negative instances. This setting is appropriate for Dataset 2: positive examples are, in our experiments, mutants labelled as highly susceptible to the drug class, while negative examples are those with medium or low susceptibility.

The hypothesis is derived using the Aleph (A Learning Engine for Proposing Hypotheses) ILP system (http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html), which allows learning in both settings. In the one-class classification case, Aleph incrementally builds a hypothesis covering all positive examples, guided by the Bayesian evaluation function described in [17], which scores candidate solutions according to an estimate of the Bayes posterior probability, allowing it to trade off hypothesis size and generality. In the binary classification case, Aleph builds a hypothesis covering all positive examples and none or few negative ones, guided by an m-estimate evaluation function [18]. Aleph adds clauses to the hypothesis based on their coverage of training examples. Given a learned model, the first clauses are therefore those covering most training examples, and thus usually the most representative of the underlying concept.

In Figure 2 we show a simple example of a learned hypothesis covering a set of training mutants from Dataset 2. The learned hypothesis models the ability of a mutation to confer on the mutant resistance to NNRTI and is composed of four first-order clauses, each one covering a different set of mutations of the wild type, as highlighted in colors: yellow for the first clause, blue for the second, red for the third, and green for the fourth one. Some mutations are covered by more than one clause, as shown by the color overlaps. Bold letters indicate residues involved in the RT metal binding site (D110, D185 and D186) or the heterodimerization site (W401 and W414).

Step 2: Generative phase

The second step of our approach is the generative phase, in which the learned hypothesis is employed to find novel mutations that can confer drug resistance to an RT mutant. A set of candidate mutations can be generated by using the Prolog inference engine, starting from the rules in the learned model. The rules are in fact constraints on the characteristics that a mutation of the wild type should have in order to confer resistance to a certain inhibitor, according to the learned hypothesis.

Algorithm 1 details the mutation generation procedure. We assume, for simplicity, to have a model H for a single drug class. The procedure works by querying the Prolog inference engine for all possible variable assignments that satisfy the hypothesis clauses, each assignment representing a mutation by its position and the amino acid replacing the wild-type residue. The set of mutations generated by the model is ranked according to a scoring function SM before being returned by the algorithm.
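The generate-and-score loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative re-implementation, not the authors' Prolog/Aleph pipeline: the clauses below are hypothetical stand-ins, written as Python predicates over (position, replacement) pairs, loosely mimicking clause bodies such as mut(A,B,C,D) AND position(C,190) AND typeaa(polar,D).

```python
# Illustrative sketch of Algorithm 1: enumerate candidate single amino acid
# mutations, keep those satisfying at least one learned clause, score each by
# the number of satisfied clauses, and return them ranked by score.
# The clauses and the "polar" group below are toy assumptions, not the
# learned model or the background knowledge of the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

clauses = [
    lambda pos, aa: pos == 190,                     # any substitution at position 190
    lambda pos, aa: pos == 190 and aa in "STYCNQ",  # substitution at 190 into a "polar" residue
    lambda pos, aa: pos == 106,                     # any substitution at position 106
]

def generate_mutations(clauses, positions):
    scored = {}
    for pos in positions:
        for aa in AMINO_ACIDS:
            score = sum(1 for c in clauses if c(pos, aa))
            if score > 0:  # keep only mutations satisfying at least one clause
                scored[f"{pos}{aa}"] = score
    # rank by decreasing score before returning, as in Algorithm 1
    return sorted(scored.items(), key=lambda kv: -kv[1])

ranked = generate_mutations(clauses, positions=range(1, 561))
# substitutions at position 190 into the toy "polar" set rank first, with score 2
```

In the actual system the enumeration is performed by the Prolog inference engine, which returns all variable assignments satisfying the clause bodies; the final ranking step corresponds to RankMutations in Algorithm 1.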
We defined SM as the number of clauses in H that a candidate mutation m satisfies. Referring to the model in Figure 2, for instance, the generated candidate mutations satisfying the first two clauses are all the mutations changing the glycine in position 190 into a polar amino acid: 190Y, 190W, 190T, 190S, 190R, 190Q, 190N, 190K, 190H, 190E, 190D and 190C, where for example the notation 190Y indicates a change of the wild-type amino acid, located in position 190, into a tyrosine (Y). For instance, 190S and 190E in this list are already part of the known NNRTI surveillance mutations (see [19]).

Learning from mutations

We first learn general rules characterizing known resistance mutations (from Dataset 1), to be used for predicting novel candidate ones. We divided the dataset of mutations into a training and a test set (70/30) in a stratified way, that is, by preserving in both sets the proportion of examples belonging to each of the two drug classes. The resulting training set is composed of a total of 106 mutations, while the test set is composed of 45 mutations. We trained the ILP learner on the training set and evaluated on the test set the set of mutations generated using the learned model. The evaluation procedure takes the set of generated mutations and computes its enrichment in test mutations. We compare the recall of the approach, i.e. the fraction of test mutations generated by the model, with the recall of a baseline algorithm that chooses at random a set (of the same cardinality) of possible mutations among all legal ones.

Results averaged over 30 random splits of the dataset are reported in Table 2. On each split we performed 30 runs of our algorithm (Aleph has a random component generating the seed for the hypothesis search) and of the random generation algorithm, for each of the different learning tasks (NNRTI, NRTI). In this setting, the average sizes of the learned hypotheses for NNRTI and NRTI are 8 and 7 rules respectively. For each task we also report, in columns 3 and 4 of Table 2, the mean number of generated mutations over the 30 splits and the number of test set mutations for reference. In both cases, improvements are statistically significant according to a paired Wilcoxon test (α=0.01). Figure 3 gives further details about these results by showing how the mean recall on the test set varies when computed on the generated mutations satisfying only one clause, two clauses, three clauses and so on (x axis). The mean recall trend is shown in orange for the proposed generative approach and in green for a random generator of mutations, for both classes of inhibitors.

We finally learned a model on the whole dataset in order to generate a single set of mutations for further inspection. We report five examples of novel mutations with the highest score for each of the tasks: 90I, 98I, 103I, 106P, 179I for NNRTI, and 60A, 153M, 212L, 229F, 239I for NRTI. In [20], the authors found a set of novel mutations conferring resistance to efavirenz and nevirapine, which are NNRTI. Our mutation generation algorithm partially confirmed their findings. Mutation 90I was scored high (getting the maximum score of 5 satisfied clauses), mutation 101H was generated with a score of 3 out of 5, and mutations 196R and 138Q with a score of 1 out of 5, while mutation 28K was not generated at all by our system as a candidate for conferring resistance to NNRTI.

An example of a learned hypothesis is reported in Figure 4. For instance, according to the model, among the features a mutation should have for conferring resistance to an NNRTI there is the change of a basic (magenta) residue of the wild type, e.g. lysine or arginine, into a residue with completely different physico-chemical characteristics (rule 16). Another example for the resistance to NNRTI is that a non-conserved mutation is present in positions between 98 and 106 of the wild-type sequence (rule 8).

Learning from mutants

The next set of experiments focuses on learning mutations from mutant data (Dataset 2). Learned models are still limited to single amino acid mutations, and so are the novel mutants generated by the system. We randomly assigned the mutants in Dataset 2 to 30 training and test set splits, avoiding having mutants containing the same resistance mutation (according to the labelling used in Dataset 1) in both training and test sets. For each of the 30 splits, we evaluated the recall of the generated mutations on the known resistance mutations (from Dataset 1), after first removing all the mutations that were also present in the training set. The comparison is again made against a baseline algorithm generating random legal mutations.

Results averaged over the 30 random splits are reported in Table 3. In this setting, the average sizes of the learned hypotheses for NNRTI and NRTI are 5 and 3 rules respectively. The small size of the models substantially reduces the space of possible mutations to be generated. Our approach shows an average recall of 17% and 7% on the test mutations, which is statistically significantly higher than the recall obtained by randomly generating mutations (paired Wilcoxon test, α=0.01). Figure 5 gives further details about these results by showing the mean recall trend as the number of satisfied clauses of the generated mutations varies.

Figure 6 shows the hypothesis learned by the system when trained on the whole set of mutants in Dataset 2. It is easy to see that the hypothesis for the resistance to NNRTI identifies many (10 out of 19) of the known resistance surveillance mutations reported in [19]: 103N, 106A, 181C, 181I, 181V, 188C, 188H, 190A, 190E and 190S. In particular, mutation 181C has the highest predicted score, as it satisfies four clauses of the learned model. Interestingly, other not previously reported mutations are suggested with a high score (3 out of 4 satisfied clauses) by the generative algorithm, for example 181N, 181D, 318C and 232C. These may be novel mutations conferring resistance to NNRTI. 29% of the mutations labeled as conferring resistance to NNRTI in Dataset 1 are also identified; apart from the surveillance mutations, those include 98G, 227C, 190C, 190Q, 190T and 190V. Key positions for the resistance to NNRTI, like 181 and 190, can be easily identified from the model in Figure 6, thanks to rules 15 and 23, which anchor the mutation to the specific position without restricting the change to a specific group of amino acids.
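The enrichment evaluation used in both settings, i.e. the recall of the generated set on held-out resistance mutations compared against a random generator proposing the same number of candidates, can be sketched as follows. All mutation lists here are made-up toy data, not the actual datasets.

```python
# Sketch of the evaluation protocol: recall of a generated mutation set on a
# held-out test set, versus a random baseline of the same cardinality.
import random

def recall(generated, test_set):
    """Fraction of test mutations recovered by the generator."""
    return len(set(generated) & set(test_set)) / len(test_set)

# all legal single amino acid substitutions over a 560-residue sequence
all_legal = [f"{pos}{aa}" for pos in range(1, 561) for aa in "ACDEFGHIKLMNPQRSTVWY"]

test_mutations = ["190S", "190E", "103N", "181C", "67N"]           # toy held-out set
model_generated = ["190S", "190E", "190T", "103N", "98G", "227C"]  # toy model output

rng = random.Random(0)
baseline = rng.sample(all_legal, len(model_generated))  # same cardinality, random

print(recall(model_generated, test_mutations))  # 0.6: three of five recovered
# the baseline recall is expected to be near zero over such a large legal space
```

In the paper this comparison is repeated over 30 random splits and assessed with a paired Wilcoxon test; the sketch above shows a single split only.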
A quite uncommon NNRTI-selected mutation, 238N, appears among the generated ones. Indeed, as reported in the summaries of the Stanford HIV Resistance Database, this mutation, in combination with 103N, causes high-level resistance to nevirapine and efavirenz.

Also in the case of NRTI, the generative algorithm suggests a few of the known surveillance mutations reported in [19]: 67E, 67G, 67N, 116Y, 184V and 184I (6 out of 34). 18% of the mutations labeled as conferring resistance to NRTI in Dataset 1 are also identified; apart from the known surveillance mutations, those include 44D, 62V, 67A, 67S, 69R and 184T. Key positions for the resistance to NRTI as predicted by our model in Figure 6 are 33, 67, 184, 194 and 218. Of these, the effects of mutations at positions 67 and especially 184 have been well studied, while positions 33, 194 and 218 seem to be novel. We also found that the predicted mutation 219H, which appears neither among the known surveillance mutations nor among the resistance mutations in Dataset 1, is indeed reported in the Stanford HIV Database as a quite uncommon NRTI-selected mutation. Note that this uncommon mutation requires two nucleotide changes to emerge.

Discussion and Future Work

The results shown in the previous section are a promising starting point for generalizing our approach to more complex settings. We showed that the approach scales from a few hundred mutations as learning examples to almost a thousand complete mutants. Moreover, the learned hypotheses significantly constrain the space of all possible single amino acid mutations to be considered, paving the way to the expansion of the method to multi-site mutant generation.
This represents a clear advan- ther focus the set of generated mutations, and pos- tage over alternative existing machine learning ap- sibly post-processing stages involving mutant evalu- proaches, which would require the preliminary gen- ation by statistical machine learning approaches [5]. eration of all possible mutants for their evaluation. In the next future we also plan to generalize the pro- Restricting to RT mutants with two mutated amino posed approach to jointly generate sets of related acids, this would imply testing more than a hun- mutations shifting the focus from the generation of dred million candidate mutants. At the same time single amino acid mutations to mutants with multi- this simple ILP approach cannot attain the same ple mutations. accuracy levels of a pure statistical approach. We are currently investigating more sophisticated sta- tistical relational learning approaches for improving the accuracy of the learned models and for assign- Acknowledgements EC and TL acknowledge the support of the F.W.O. (Bel- ing weights to the clauses to be used for selecting gium) and the F.R.S.-F.N.R.S. (Belgium). ST and AP the most relevant generated mutations. These can acknowledge the support of the Italian Ministry of Uni- be used as a pre-filtering stage, producing candi- versity and Research under grant PRIN 2009LNP494 date mutants to be further analysed by complex sta- (Statistical Relational Learning: Algorithms and Appli- tistical approaches, and additional tools evaluating, cations). for instance, their stability. An additional direction to refine our predictions consists of jointly learning models of resistance to di↵erent drugs (e.g. NNRTI References and NRTI), possibly further refining the joint mod- 1. Cao ZW, Han LY, Zheng CJ, Ji ZL, Chen X, Lin HH, els on a per-class basis. On a predictive (rather than Chen YZ: Computer prediction of drug resistance generative) task, this was shown [21] to provide im- mutations in proteins REVIEWS. 
provements over learning distinct per-drug models. The approach is not restricted to learning drug-resistance mutations in viruses. More generally, it can be applied to learn mutants having certain properties of interest, e.g. improved or more specific activity of an enzyme with respect to a substrate, in a

References

1. Drug Discovery Today: BIOSILICO 2005, 10(7).
2. Rubingh DN: Protein engineering from a bioindustrial point of view. Current Opinion in Biotechnology 1997, 8(4):417–422.
3. Muggleton S, De Raedt L: Inductive Logic Programming: Theory and Methods. University Computing 1994:629–682.
4. Cilia E, Passerini A: Frankenstein Junior: a relational learning approach toward protein engineering. In Proceedings of the Annotation, Interpretation and Management of Mutations (AIMM) Workshop@ECCB2010 2010.
5. Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9(Suppl 2):S6.
6. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, Westhead DR: Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics 2006, 7:405.
7. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research 2007, 35(11):3823–3835.
8. Rhee SY, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, Shafer RW: Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(46):17355–17360.
9. Götte M, Li X, Wainberg M: HIV-1 reverse transcription: a brief overview focused on structure-function relationships among molecules involved in initiation of the reaction. Archives of Biochemistry and Biophysics 1999, 365(2):199–210.
10. Shafer R: Rationale and Uses of a Public HIV Drug-Resistance Database. Journal of Infectious Diseases 2006, 194(Suppl 1):S51–S58.
11. De Clercq E: HIV inhibitors targeted at the reverse transcriptase. AIDS Research and Human Retroviruses 1992, 8(2):119–134.
12. Spence R, Kati W, Anderson K, Johnson K: Mechanism of inhibition of HIV-1 reverse transcriptase by nonnucleoside inhibitors. Science 1995, 267(5200):988–993.
13. Richter L, Augustin R, Kramer S: Finding Relational Associations in HIV Resistance Mutation Data. In Proceedings of Inductive Logic Programming (ILP), Volume 9, 2009.
14. Betts M, Russell R: Amino-Acid Properties and Consequences of Substitutions. Bioinformatics for Geneticists 2003:311–342.
15. Taylor WR: The classification of amino acid conservation. Journal of Theoretical Biology 1986, 119(2):205–218.
16. Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. Journal of Molecular Biology 2002, 324:105–121.
17. Muggleton S: Learning from Positive Data. In Inductive Logic Programming Workshop 1996:358–376.
18. Džeroski S, Bratko I: Handling noise in Inductive Logic Programming. In ILP92, Report ICOT TM-1182. Edited by Muggleton S, 1992.
19. Bennett DE, Camacho RJ, Otelea D, Kuritzkes DR, Fleury H, Kiuchi M, Heneine W, Kantor R, Jordan MR, Schapiro JM, Vandamme AM, Sandstrom P, Boucher CAB, van de Vijver D, Rhee SY, Liu TF, Pillay D, Shafer RW: Drug resistance mutations for surveillance of transmitted HIV-1 drug-resistance: 2009 update. PLoS ONE 2009, 4(3):e4724.
20. Deforche K, Camacho RJ, Grossman Z, Soares MA, Van Laethem K, Katzenstein DA, Harrigan PR, Kantor R, Shafer R, Vandamme AM: Bayesian network analyses of resistance pathways against efavirenz and nevirapine. AIDS (London, England) 2008, 22(16):2107–2115.
21. Cilia E, Landwehr N, Passerini A: Relational Feature Mining with Hierarchical Multitask kFOIL. Fundamenta Informaticae 2011, 113(2):151–177.
22. Theys K, Deforche K, Beheydt G, Moreau Y, Van Laethem K, Lemey P, Camacho RJ, Rhee SY, Shafer RW, Van Wijngaerden E, Vandamme AM: Estimating the individualized HIV-1 genetic barrier to resistance using a nelfinavir fitness landscape. BMC Bioinformatics 2010, 11:409.

Algorithms

Algorithm 1 - Mutation generation algorithm.
Algorithm for the discovery of novel relevant mutations.

 1: input: background knowledge B, learned model H
 2: output: ranking R of the most relevant mutations
 3: procedure GenerateMutations(B, H)
 4:   DM ← ∅
 5:   A ← all assignments a that satisfy at least one clause ci ∈ H
 6:   for a ∈ A do
 7:     m ← mutation corresponding to the assignment a
 8:     score ← SM(m)        ▷ number of clauses ci satisfied by a
 9:     DM ← DM ∪ {(m, score)}
10:   end for
11:   R ← RankMutations(DM, B, H)        ▷ rank relevant mutations
12:   return R
13: end procedure

Figures

Figure 1 - Mutation engineering algorithm.
Schema of the mutation engineering algorithm: a dataset of mutations/mutants and the background knowledge are used to learn a hypothesis, from which a rank of novel relevant mutations is generated.

Figure 2 - Model for the resistance to NNRTI learned from Dataset 2
An example of learned hypothesis for the NNRTI task with highlighted examples covered by the hypothesis clauses. The figure shows fragments of the wild-type sequence with the covered mutation positions (98, 112, 181, 190, 398, 405, 410, 418) marked, together with the bodies of the learned clauses:

mut(A,B,C,D) AND position(C,190)
mut(A,B,C,D) AND position(C,190) AND typeaa(polar,D)
mut(A,y,C,D) AND typeaa(aliphatic,D)
mut(A,B,C,a) AND position(C,106)

Figure 3 - Mean recall trend by number of satisfied clauses (Dataset 1)
Mean recall of the generated mutations on the resistance test set mutations from Dataset 1 by varying the number of satisfied clauses. The mean recall values in orange refer to the proposed generative algorithm; the mean recall values in green refer to a random generator of mutations.
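Algorithm 1 enumerates the variable assignments covered by the learned clauses, turns each into a candidate mutation, scores it by the number of satisfied clauses, and ranks the candidates. A minimal Python sketch of this procedure, where the candidate space (pairs of position and amino acid) and the clause encoding (Boolean predicates) are illustrative assumptions, not the paper's actual implementation, and ranking is simplified to sorting by score:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_mutations(positions, hypothesis):
    """Score every candidate single-point mutation by the number of
    learned clauses it satisfies; return covered candidates ranked
    by decreasing score (a simplified stand-in for RankMutations)."""
    scored = {}
    for pos in positions:
        for aa in AMINO_ACIDS:
            m = (pos, aa)  # candidate mutation: (position, new residue)
            score = sum(1 for clause in hypothesis if clause(m))
            if score > 0:  # keep only candidates covered by >= 1 clause
                scored[m] = score
    return sorted(scored, key=lambda m: scored[m], reverse=True)

# Two toy clauses in the spirit of the learned rules, e.g.
# res_against(A,nnrti) <- mut(A,B,C,D) AND position(C,190) [AND typeaa(polar,D)]
clauses = [
    lambda m: m[0] == 190,                     # any mutation at position 190
    lambda m: m[0] == 190 and m[1] in "STNQ",  # ...to a polar residue
]
ranking = generate_mutations([103, 190], clauses)
```

Mutations satisfying both clauses (polar substitutions at position 190) come first, followed by the remaining substitutions at that position; position 103 produces no covered candidates under these two clauses.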
(Plot: mean recall (%) on the y-axis versus number of satisfied clauses per generated mutation on the x-axis; series: NNRTI, NNRTI (rand), NRTI, NRTI (rand).)

Figure 4 - Example of model learned from Dataset 1
Model learned for NRTI and NNRTI on the dataset of mutations Dataset 1. Note that in this case A is a variable that identifies a single amino acid mutation.

1. res_against(A,nrti) ← mut(A,B,C,D) AND color(red,D)
2. res_against(A,nrti) ← mut(A,B,C,D) AND color(red,D) AND color(red,B)
3. res_against(A,nrti) ← mut(A,B,C,D) AND location(7.0,C) AND mutated_residue_cp(C,medium,medium)
4. res_against(A,nrti) ← mut(A,B,C,D) AND location(7.0,C)
5. res_against(A,nrti) ← same_color_type_mut(A,B)
6. res_against(A,nrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,medium,high)
7. res_against(A,nrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,medium,small)
8. res_against(A,nnrti) ← different_color_type_mut(A,B) AND location(11.0,B)
9. res_against(A,nnrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,small,small)
10. res_against(A,nnrti) ← same_color_type_mut(A,B)
11. res_against(A,nnrti) ← mut(A,v,C,D)
12. res_against(A,nnrti) ← mut(A,B,C,D) AND location(15.0,C)
13. res_against(A,nnrti) ← mut(A,B,C,i)
14. res_against(A,nnrti) ← mut(A,B,C,D) AND location(11.0,C)
15. res_against(A,nnrti) ← mut(A,B,C,D) AND color(red,D)
16. res_against(A,nnrti) ← mut(A,B,C,D) AND color(magenta,B) AND different_color_type_mut(A,C)
17. res_against(A,nnrti) ← mut(A,B,C,D) AND location(21.0,C)

Figure 5 - Mean recall trend by number of satisfied clauses (Dataset 2)
Mean recall of the generated mutations on the resistance test set mutations from Dataset 2 by varying the number of satisfied clauses. The mean recall values in orange refer to the proposed generative algorithm; the mean recall values in green refer to a random generator of mutations.
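Figures 3 and 5 report the recall of the generated mutations as the number of satisfied clauses varies. A small sketch of how such a curve can be computed, under the assumption (not stated explicitly in the captions) that at threshold k only candidates satisfying at least k clauses are retained; the mutation labels in the example are illustrative:

```python
def recall_by_satisfied_clauses(generated, test_set):
    """Recall curve: for each threshold k, the fraction of test-set
    mutations found among generated candidates whose clause count
    is >= k. `generated` maps each candidate mutation to the number
    of learned clauses it satisfies."""
    max_k = max(generated.values(), default=0)
    curve = {}
    for k in range(1, max_k + 1):
        kept = {m for m, n in generated.items() if n >= k}
        curve[k] = len(kept & test_set) / len(test_set)
    return curve

# Toy example: three generated candidates, three test mutations.
curve = recall_by_satisfied_clauses(
    {"M184V": 3, "K103N": 2, "L100I": 1},
    {"M184V", "K103N", "Y181C"},
)
```

Raising the threshold shrinks the candidate set, so recall can only stay equal or drop, which matches the downward trends in the figures.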
(Plot: mean recall (%) on the y-axis versus number of satisfied clauses per generated mutation on the x-axis; series: NNRTI, NNRTI (rand), NRTI, NRTI (rand).)

Figure 6 - Model learned from the whole Dataset 2
Model learned for NRTI and NNRTI on the whole dataset of mutants Dataset 2. Note that in this case A is a variable that identifies a mutant.

1. res_against(A,nrti) ← mut(A,B,C,D) AND position(C,67)
2. res_against(A,nrti) ← mut(A,B,C,g) AND color(red,B)
3. res_against(A,nrti) ← mut(a,B,C,D) AND typeaa(nonpolar,D)
4. res_against(A,nrti) ← mut(A,B,C,D) AND position(C,33)
5. res_against(A,nrti) ← mut(A,B,C,D) AND position(C,218)
6. res_against(A,nrti) ← mut(A,B,C,h) AND catalytic_propensity(B,high)
7. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,194)
8. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,184)
9. res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(D,high) AND typeaa(neutral,D) AND typeaa(nonpolar,B)
10. res_against(A,nrti) ← mut(A,B,C,d) AND position(C,44)
11. res_against(A,nrti) ← mut(A,B,C,r) AND typeaa(tiny,B) AND typeaa(polar,B)
12. res_against(A,nrti) ← mut(A,d,C,D) AND catalytic_propensity(D,medium)
13. res_against(A,nrti) ← mut(A,d,C,D) AND catalytic_propensity(D,medium) AND typeaa(polar,D)
14. res_against(A,nrti) ← mut(A,B,C,k) AND position(C,203)
15. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,190)
16. res_against(A,nnrti) ← mut(A,B,C,a) AND position(C,106)
17. res_against(A,nnrti) ← mut(A,B,C,D) AND same_typeaa(D,B,tiny) AND same_typeaa(D,B,nonpolar)
18. res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(D,high) AND typeaa(aromatic,B) AND same_typeaa(D,B,neutral)
19. res_against(A,nnrti) ← mut(A,B,C,n) AND position(C,103)
20. res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(B,high) AND typeaa(neutral,B) AND same_typeaa(D,B,polar)
21. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,177) AND catalytic_propensity(D,medium) AND same_type_mut_t(A,C,polar)
22. res_against(A,nnrti) ← mut(A,B,C,n) AND position(C,238)
23. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,181)
24. res_against(A,nnrti) ← mut(A,y,C,D) AND typeaa(small,D)

Tables

Table 1 - Background knowledge predicates
Summary of the background knowledge facts and rules. MutID is a mutation or a mutant identifier depending on the type of the learning problem.

aa(Pos,AA): indicates a residue in the wild type sequence
mut(MutID,AA,Pos,AA1): indicates a mutation: mutation or mutant identifier, position and amino acids involved, before and after the substitution
res_against(MutID,Drug): indicates whether a mutation or mutant is resistant to a certain drug
color(Color,AA): indicates the coloring group of a natural amino acid
typeaa(T,AA): indicates the type (e.g. aliphatic, charged, aromatic, polar) of a natural amino acid
same_color_type(R1,R2): indicates whether two residues belong to the same coloring group
same_typeaa(R1,R2,T): indicates whether two residues are of the same type T
same_color_type_mut(MutID,Pos): indicates a mutation to a residue of the same coloring group
different_color_type_mut(MutID,Pos): indicates a mutation changing the coloring group of the residue
same_type_mut_t(MutID,Pos,T): indicates a mutation to a residue of the same type T
different_type_mut_t(MutID,Pos): indicates a mutation changing the type of the residue
close_to_site(Pos): indicates whether a specific position is close to a binding or active site, if any
location(L,Pos): indicates in which fragment of the primary sequence the amino acid is located
catalytic_propensity(AA,CP): indicates whether an amino acid has a high, medium or low catalytic propensity
mutated_residue_cp(Rw,Pos,Rm,CPold,CPnew): indicates how, in a mutated position, the catalytic propensity has changed (e.g. from low to high)

Table 2 - Results (Dataset 1)
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random.
The average recall has been computed for each of the learning tasks over the 30 splits by averaging recall over 30 repeated runs of the two algorithms. Results of a paired Wilcoxon test (α = 0.01) on the statistical significance of the performance differences are also reported: a black bullet (•) indicates a statistically significant improvement of our algorithm over the random generator. The mean number of generated mutations over the 30 splits and the number of test mutations are reported on the side.

         Mean recall % on 30 splits     Mean n. generated    N. test
         Algorithm   Random Generator   mutations            mutations
NNRTI    86 •        58                 5201                 17
NRTI     55 •        46                 5548                 28

Table 3 - Results (Dataset 2)
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random. The average recall has been computed for each of the learning tasks over the 30 splits. Results of a paired Wilcoxon test (α = 0.01) on the statistical significance of the performance differences are also reported: a black bullet (•) indicates a statistically significant improvement of our algorithm over the random generator. The mean number of generated mutations and the mean number of test mutations over the 30 splits are reported on the side.

         Mean recall % on 30 splits     Mean n. generated    Mean n. test
         Algorithm   Random Generator   mutations            mutations
NNRTI    17 •        1                  236                  26
NRTI      7 •        3                  420                  40
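The relational predicates of Table 1 can be grounded directly over amino-acid property tables. A small sketch of one possible encoding of two of them; the colouring groups and amino-acid types shown are partial illustrative examples, not the full taxonomy used in the paper:

```python
# Partial, illustrative property tables (not the paper's actual groupings).
COLOR = {"D": "red", "E": "red", "K": "blue", "R": "blue"}
TYPEAA = {
    "D": {"polar", "charged"},
    "A": {"nonpolar", "aliphatic"},
    "V": {"nonpolar", "aliphatic"},
    "S": {"polar", "tiny"},
}

def same_color_type(r1, r2):
    """same_color_type(R1,R2): both residues in the same coloring group."""
    return COLOR.get(r1) is not None and COLOR.get(r1) == COLOR.get(r2)

def same_typeaa(r1, r2, t):
    """same_typeaa(R1,R2,T): both residues are of the same type T."""
    return t in TYPEAA.get(r1, set()) and t in TYPEAA.get(r2, set())
```

For instance, aspartate and glutamate share the "red" (acidic) coloring group, while alanine and valine share the aliphatic type.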
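Tables 2 and 3 compare per-split recall values of the two algorithms with a paired Wilcoxon test. A stdlib-only sketch of the signed-rank statistic W+ underlying that test (without the p-value machinery a full test needs; the per-split recall values in the example are made up for illustration):

```python
def wilcoxon_signed_rank(xs, ys):
    """Paired Wilcoxon signed-rank statistic W+: the sum of the ranks of
    the positive differences x - y. Zero differences are dropped and tied
    absolute differences receive average ranks, as in the standard test."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return sum(r for d, r in zip(diffs, ranks) if d > 0)

# Hypothetical per-split recalls: proposed algorithm vs. random generator.
w_plus = wilcoxon_signed_rank([0.9, 0.8, 0.7, 0.6], [0.5, 0.6, 0.7, 0.65])
```

In practice a library routine such as scipy.stats.wilcoxon would also supply the p-value compared against α = 0.01 in the tables.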