=Paper=
{{Paper
|id=None
|storemode=property
|title=Predicting virus mutations through relational learning
|pdfUrl=https://ceur-ws.org/Vol-916/Paper6.pdf
|volume=Vol-916
|dblpUrl=https://dblp.org/rec/conf/eccb/CiliaTALP12
}}
==Predicting virus mutations through relational learning==
Elisa Cilia1, Stefano Teso2, Sergio Ammendola3, Tom Lenaerts1,4, Andrea Passerini*2
1 MLG, Département d'Informatique, Université Libre de Bruxelles, Boulevard du Triomphe - CP 212, 1050 Brussels, Belgium
2 Department of Computer Science and Information Engineering, University of Trento, via Sommarive 14, I-38123 (Povo) Trento, Italy
3 Ambiotec sas, R&D, Via Appia Nord 47, 00142 Cisterna di Latina (LT), Italy
4 AI-lab, Vakgroep Computerwetenschappen, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
Email: Elisa Cilia - ecilia@ulb.ac.be; Stefano Teso - teso@disi.unitn.it; Sergio Ammendola - ricercaesviluppo@ambiotec.it; Tom Lenaerts - tlenaert@ulb.ac.be; Andrea Passerini* - passerini@disi.unitn.it
* Corresponding author
Abstract

Background: Viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants.

Results: We propose a simple relational learning approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug-resistance and uses them to generate a set of potentially resistant mutants.

Conclusions: Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.
Background

HIV is a pandemic cause of lethal pathologies in more than 33 million people. Its horizontal transmission through mucosae is difficult to control and treat, because the virus has a high virulence and it infects several types of immune surveillance cells, such as those characterized by the CD4 receptor (CD4+ cells). The major problem in treating the human virus infection is drug selectivity, since the virus penetrates the cell, where it releases its genetic material to replicate itself by using the cell machinery. A drug target is the replicating apparatus of the cell. HIV antiviral molecules will be directed against several cells, such as macrophages or T lymphocytes, to interfere with viral replication. HIV releases a single-strand RNA particle, a reverse transcriptase and an integrase into the cell cytoplasm. The RNA molecule is quickly retro-transcribed into a double-strand DNA molecule, which is integrated into the host genome. The integration events induce a cellular response, which begins with the transcription of the Tat gene by RNA polymerase II. Tat is a well-known protein responsible for HIV activation, since it recruits cytoplasmic host proteins involved in the expression of viral genes. Remarkably, HIV can establish a life-long latent infection by suppressing its transcription, thus making ineffective a large part of the antiviral drugs aimed at controlling viral replication. However, replicating viruses adopt several drug resistance strategies; for instance, HIV induces amino acid mutations reducing the efficacy of the pharmaceutical compounds.
The present work is aimed at gaining knowledge on mutations that may occur in the viral RNA transcriptase [9]. This is an important target for developing antiretroviral medicines, and different types of molecules have been found active: the Nucleoside Reverse Transcriptase Inhibitors (NRTI) and Non-NRTI (NNRTI). Although RNA RT inhibitors are active, the HIV virus quickly, more often than not, changes the RNA RT encoding sequence, thus acquiring drug resistance. Antiviral therapy is based on the use of cocktails of molecules including new RNA RT inhibitors; a computational approach to predict possible mutation sites, and their sensitivity to drugs, is therefore an important tool in drug discovery for antiretroviral therapy.

Computational methods can assist here by exploring the space of potential virus mutants, providing potential avenues for anticipatory drugs [1]. To achieve such a goal, one first needs to understand what kind of mutants may lead to resistance.

A general engineering technique for building artificial mutants is referred to as rational design [2]. The technique consists in modifying existing proteins by site-directed mutagenesis. It relies on deep domain knowledge in order to identify candidate mutations that may affect protein structure or function. The process typically involves extensive trial-and-error experiments and is also aimed at improving the understanding of the mechanisms of a protein's behavior.

In this work we report on our initial attempt to develop an artificial system mimicking the rational design process. An Inductive Logic Programming (ILP) learner [3] is trained to extract general rules describing mutations relevant to a certain behavior, i.e. resistance towards certain inhibitors. The learned rules are then used to infer novel mutations which may induce similar resistance.

We consider here two learning settings: in the first one we rely on a training set made of single amino acid mutations known to confer resistance to a certain class of inhibitors (we will refer to this as mutation-based learning; preliminary promising results in this setting were described in [4]); in the second we learn the rules directly from mutants (comprising from 1 to n amino acid mutations) that have been experimentally tested for their resistance to the same classes of inhibitors (we will refer to this as mutant-based learning). This second setting is actually the most common situation, in which one is presented with a number of mutants together with some evidence of their susceptibility to certain treatments. The goal is then to extract rules and potential mutations explaining why certain mutants confer resistance and which other ones produce the same effect.

Many machine learning methods have been applied in the past to mutation data for predicting the effect of single amino acid mutations on protein stability changes [5], on protein function [6, 7] or on drug susceptibility [8]. To the best of our knowledge this is the first attempt to learn relational features of mutations affecting a protein's behavior and to use them for generating novel relevant mutations. Furthermore, even if we focus on single amino acid mutations in our experimental evaluation, our approach can be straightforwardly extended to multiple point mutations, and we are actively working in this direction. Note that the other approaches first generate all potential mutations and then decide which of them leads to resistance, whereas we produce the relevant ones directly.

We report an experimental evaluation focused on HIV RT. RT is a well-studied protein: a large number of mutants have been shown to resist one or more drugs, and databases exist that collect those data from different sources and make them available for further analyses [10].

We tested the ability of our approach to generate drug-resistant amino acid mutations for NRTI and NNRTI. Our results show statistically significant improvements for both drug classes over the baseline results obtained through a random generator. Mutant-based results confirm our previous findings [4], extending them to the more natural setting in which only information on mutant behavior is observed. Additional background knowledge allows us to produce more compact and simple hypotheses, which have the advantage of generating a reduced number of candidate mutations.

The approach can in general be applied in mutation studies aimed at understanding protein function. By searching for the residues most likely to have a functional role in an active site, it can for instance be used in the engineering of enzyme mutants with an improved activity for a certain substrate.
Results

Datasets

We applied our approach to predict HIV RT mutations conferring resistance to two classes of inhibitors: Nucleoside RT Inhibitors (NRTI) and Non-Nucleoside RT Inhibitors (NNRTI). The two classes of inhibitors differ in the targeted sites and rely on quite different mechanisms [11, 12]. NNRTI inhibit the reverse transcriptase by binding to the enzyme active site, therefore directly interfering with the enzyme function. NRTI are instead incorporated into the newly synthesized viral DNA, preventing its elongation.

We collected datasets for both mutation-based and mutant-based learning. The former (Dataset 1) is a dataset of amino acid mutations derived from the Los Alamos National Laboratories (LANL) HIV resistance database (http://www.hiv.lanl.gov/content/sequence/RESDB/) by Richter et al. [13], who used it to mine relational rules among mutations. It consists of 95 amino acid mutations labeled as resistant to NRTI and 56 labeled as resistant to NNRTI, over a set of 581 observed mutations. For the mutant-based setting, we collected (Dataset 2) HIV RT mutation data from the Stanford University HIV Drug Resistance Database. The database provides a dataset of selected mutants of HIV RT with results of susceptibility studies to various drugs, and was previously employed [8] for predicting drug resistance of novel (given) mutants (the dataset is downloadable at http://hivdb.stanford.edu/cgi-bin/GenoPhenoDS.cgi). It is composed of 838 different mutants annotated with susceptibility levels (low, medium and high) to drugs belonging to the NRTI (639 mutants) and NNRTI (747 mutants) drug classes. We considered a setting aimed at identifying amino acid mutations conferring high susceptibility (with respect to medium or low), and considered a mutant as highly susceptible to a drug class if it was annotated as being highly susceptible to at least one drug from that class.

Learning in first order logic

Our aim is to learn a first-order logic hypothesis for a target concept, i.e. a mutation conferring resistance to a certain drug, and to use it to infer novel mutations consistent with such a hypothesis. We rely on definite clauses, which are the basis of the Prolog programming language. A definite clause is an expression of the form h ← b1 AND ... AND bn, where h and the bi are atomic literals. Atomic literals are expressions of the form p(t1, ..., tn), where p/n is a predicate symbol of arity n and the ti are terms, either constants (denoted by lower case) or variables (denoted by upper case) in our experiments. The atomic literal h is also called the head of the clause, typically the target predicate, and b1 AND ... AND bn its body. Intuitively, a clause states that the head h holds whenever the body b1 AND ... AND bn holds. For instance, a simple hypothesis like res_against(A,nnrti) ← mutation(A,C) AND close_to_site(C) would indicate that a mutation C in the proximity of a binding site confers to mutant A resistance against an NNRTI. Learning in this setting consists of searching for a set of definite clauses H = {c1, ..., cm} covering all or most positive examples, and none or few negative ones if available. First-order clauses can thus be interpreted as relational features that characterize the target concept. The main advantage of these logic-based approaches with respect to other machine learning techniques is the expressivity and interpretability of the learned models. Models can be readily interpreted by human experts and provide direct explanations for the predictions.
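To make the representation concrete, such facts and clauses can be written directly in Prolog. The following is only a minimal sketch: the mutant identifier m1 and the adaptation of the example clause to the mut/4 encoding of Table 1 are illustrative assumptions, not the actual dataset encoding.

    % mut(MutID, WildAA, Pos, NewAA): mutant m1 carries the single mutation G190S.
    mut(m1, g, 190, s).

    % Background fact (hypothetical): position 190 lies close to a relevant site.
    close_to_site(190).

    % A definite clause of the kind learned by the ILP system: mutant A is
    % resistant to NNRTI if it carries a mutation at a position close to a
    % binding or active site.
    res_against(A, nnrti) :-
        mut(A, _Wild, Pos, _New),
        close_to_site(Pos).

With these facts, the query ?- res_against(m1, nnrti). succeeds, since the body literals are satisfied.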
Background knowledge

We built a relational knowledge base for the problem domain. Table 1 summarizes the predicates we included as background knowledge. We represented the amino acids of the wild type with their positions in the primary sequence (aa/2) and the specific mutations characterizing them (mut/4). Target predicates were encoded as resistance of the mutation or mutant to a certain drug (res_against/2). Note that this encoding considers mutations at the amino acid rather than the nucleotide level, i.e. a single amino acid mutation can involve up to three nucleotide changes. While focusing on single nucleotide changes would drastically expand the space of possible mutations, including the cost (in terms of nucleotide changes) of a certain amino acid mutation could help refine our search procedure.

Additional background knowledge was included in order to highlight characteristics of residues and mutations:

typeaa/2 indicates the type of the natural amino acids according to the Venn diagram grouping, based on the amino acid properties proposed in [14]. For example, a serine is a tiny and polar amino acid.

color/2 indicates the type of the natural amino acids according to the coloring proposed in [15]. For example, the magenta class includes basic amino acids such as lysine and arginine, while the blue class includes acidic amino acids such as aspartic and glutamic acid. These groups of amino acids do not overlap, unlike in the previous case.

same_type_aa/3 indicates whether two residues belong to the same type T, i.e. a change from one residue to the other conserves the type of the amino acid.

same_color_type/2 indicates whether two residues belong to the same coloring group, i.e. a change from one residue to the other conserves the coloring group of the amino acid.

same_type_mut_t/3 indicates that a residue substitution at a certain position does not modify the amino acid type T with respect to the wild type. For example, mutation i123v conserves the aliphatic amino acid type while mutation i123d does not (i.e. different_type_mut_t/3 holds for it).

same_color_type_mut/2 indicates that a residue substitution at a certain position does not modify the amino acid coloring group with respect to the wild type. For example, mutation d123e conserves the blue amino acid group while mutation d123a does not (i.e. different_color_type_mut/2 holds for it).

Other background knowledge facts and rules were added in order to express structural relations along the primary sequence and the catalytic propensity of the involved residues:

close_to_site/1 indicates whether a specific position is less than 5 positions away from a residue belonging to a binding or active site. In our specific case, the background theory incorporates knowledge about a metal binding site and a heterodimerization site.

location/2 indicates in which fragment of the primary sequence the amino acid is located. Locations are numbered from 0 by dividing the sequence into fragments of 10 amino acids in length.

catalytic_propensity/2 indicates whether an amino acid has a high, medium or low catalytic propensity according to [16].

mutated_residue_cp/5 indicates how, in a mutated position, the catalytic propensity has changed (e.g. from low to high).
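A minimal Prolog sketch of how a few of these predicates could be encoded is given below. The residue groupings shown are a tiny illustrative subset of [14, 15], and site_residue/1 is a hypothetical helper name for the site positions mentioned above; the actual knowledge base may differ.

    % Illustrative residue groupings (subset only).
    typeaa(polar, s).                 % serine is polar ...
    typeaa(tiny,  s).                 % ... and tiny
    color(magenta, k).                % lysine: basic (magenta) class
    color(magenta, r).                % arginine: basic (magenta) class
    color(blue,    d).                % aspartic acid: acidic (blue) class

    % Two residues belong to the same coloring group.
    same_color_type(R1, R2) :-
        color(C, R1),
        color(C, R2).

    % A mutation that conserves the coloring group of the wild-type residue.
    same_color_type_mut(MutID, Pos) :-
        mut(MutID, Wild, Pos, New),
        same_color_type(Wild, New).

    % A position is close to a binding or active site if it is fewer than
    % 5 positions away from a site residue (site_residue/1 is hypothetical).
    site_residue(110).                % e.g. part of the metal binding site
    close_to_site(Pos) :-
        site_residue(SitePos),
        abs(Pos - SitePos) < 5.

    % Fragment of length 10 containing a position, numbered from 0.
    location(L, Pos) :-
        L is Pos // 10.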
Algorithm overview

The proposed approach is sketched in Figure 1.

Step 1: Learning phase

The first step is the learning phase. An ILP learner is fed with a logical representation of the data D and of the domain knowledge B to be incorporated, and it returns a first-order logical hypothesis H for the concept of a mutation conferring resistance to a certain class of inhibitors. In this context there are two suitable ways to learn the target concept, depending on the type of input data and their labeling:

a) the one-class classification setting, learning a model from positive instances only. This is the approach we employ for Dataset 1: positive examples are mutations for which experimental proof of resistance to a drug is available, but no safe claim can be made about non-annotated mutations.

b) the binary classification setting, learning to discriminate between positive and negative instances. This setting is appropriate for Dataset 2: positive examples are, in our experiments, mutants labelled as highly susceptible to the drug class; negative examples are those with medium or low susceptibility.
The hypothesis is derived using the Aleph (A Learning Engine for Proposing Hypotheses) ILP system (http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html), which allows learning in both settings. In the one-class classification case, it incrementally builds a hypothesis covering all positive examples, guided by the Bayesian evaluation function described in [17], which scores candidate solutions according to an estimate of the Bayes posterior probability and trades off hypothesis size and generality. In the binary classification case, Aleph builds a hypothesis covering all positive examples and none or few negative ones, guided by an m-estimate evaluation function [18].
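For reference, a run of this kind might be configured in Aleph along the following lines. This is only a hedged sketch: the mode declarations and parameter values are plausible assumptions, not the exact settings used in our experiments.

    % Mode declarations: what head and body literals may look like.
    :- modeh(1, res_against(+mutant, #drug)).
    :- modeb(*, mut(+mutant, #aminoacid, -position, #aminoacid)).
    :- modeb(*, close_to_site(+position)).
    :- modeb(*, typeaa(#aatype, +aminoacid)).
    :- modeb(*, color(#colour, +aminoacid)).

    % Which background predicates may appear in clauses for the target.
    :- determination(res_against/2, mut/4).
    :- determination(res_against/2, close_to_site/1).
    :- determination(res_against/2, typeaa/2).
    :- determination(res_against/2, color/2).

    % One-class setting (Dataset 1): positive-only evaluation function [17].
    :- set(evalfn, posonly).

    % Binary setting (Dataset 2): m-estimate evaluation, tolerating some noise.
    % :- set(evalfn, mestimate).
    % :- set(noise, 5).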
Aleph adds clauses to the hypothesis based on their coverage of training examples. Given a learned model, the first clauses are those covering most training examples and thus usually the most representative of the underlying concept.

In Figure 2 we show a simple example of a learned hypothesis covering a set of training mutants from Dataset 2. The learned hypothesis models the ability of a mutation to confer resistance to NNRTI on the mutant and is composed of four first-order clauses, each one covering different sets of mutations of the wild type, as highlighted in colors: yellow for the first clause, blue for the second, red for the third, and green for the fourth one. Some mutations are covered by more than one clause, as shown by the color overlaps. Bold letters indicate residues involved in the RT metal binding site (D110, D185 and D186) or the heterodimerization site (W401 and W414).

Step 2: Generative phase

The second step of our approach is the generative phase, in which the learned hypothesis is employed to find novel mutations that can confer drug resistance to an RT mutant. A set of candidate mutations can be generated by using the Prolog inference engine, starting from the rules in the learned model. The rules are actually constraints on the characteristics that a mutation of the wild type should have in order to confer resistance to a certain inhibitor, according to the learned hypothesis.

Algorithm 1 details the mutation generation procedure. We assume, for simplicity, to have a model H for a single drug class. The procedure works by querying the Prolog inference engine for all possible variable assignments that satisfy the hypothesis clauses, each assignment representing a mutation by its position and the amino acid replacing the wild-type residue. The set of mutations generated by the model is ranked according to a scoring function S_M before being returned by the algorithm. We defined S_M as the number of clauses in H that a candidate mutation m satisfies.
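A minimal Prolog sketch of this generate-and-score step is shown below. Here the learned clause bodies are assumed to be stored in a hypothetical learned_clause/3 predicate (drug class, candidate mutation as a Pos-New pair, body goal); the two stored clauses are only illustrative, and the actual system queries the hypothesis produced by Aleph directly.

    :- use_module(library(lists)).
    :- use_module(library(aggregate)).

    % Hypothetical storage of the learned hypothesis: one fact per clause,
    % whose body is a goal over the mutated position Pos and new residue New.
    learned_clause(nnrti, Pos-New, (Pos = 190, typeaa(polar, New))).
    learned_clause(nnrti, Pos-New, (close_to_site(Pos), color(red, New))).

    % A candidate mutation: a wild-type position, a replacement residue
    % different from the wild-type one, satisfying at least one clause.
    candidate(Drug, Pos-New) :-
        aa(Pos, Wild),                       % enumerate wild-type positions
        learned_clause(Drug, Pos-New, Body),
        call(Body),                          % the clause body is satisfied
        New \== Wild.                        % the residue actually changes

    % S_M: number of clauses of the hypothesis satisfied by a candidate.
    score(Drug, Pos-New, Score) :-
        aggregate_all(count,
                      ( learned_clause(Drug, Pos-New, Body), call(Body) ),
                      Score).

    % Generate all candidates for a drug class and rank them by S_M (descending).
    ranked_mutations(Drug, Ranked) :-
        setof(Cand, candidate(Drug, Cand), Cands),
        findall(Score-Cand,
                ( member(Cand, Cands), score(Drug, Cand, Score) ),
                Scored),
        sort(0, @>=, Scored, Ranked).

With background facts like those sketched earlier, ranked_mutations(nnrti, R) returns the candidate Pos-New pairs ordered by decreasing score, mirroring Algorithm 1.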
Referring to the model in Figure 2, for instance, the generated candidate mutations that satisfy the first two clauses are all the mutations changing the glycine in position 190 into a polar amino acid: 190Y, 190W, 190T, 190S, 190R, 190Q, 190N, 190K, 190H, 190E, 190D, 190C, where for example the notation 190Y indicates a change of the wild type amino acid, located in position 190, into a tyrosine (Y). For instance, 190S and 190E in this list are already part of the known NNRTI surveillance mutations (see [19]).

Learning from mutations

We first learn general rules characterizing known resistance mutations (from Dataset 1), to be used for predicting novel candidate ones.

We divided the dataset of mutations into a training and a test set (70/30) in a stratified way, that is, by preserving in both the training and the test set the proportion of examples belonging to each of the two drug classes. The resulting training set is composed of a total of 106 mutations, while the test set is composed of 45 mutations.

We trained the ILP learner on the training set, and we evaluated on the test set the set of mutations generated using the learned model. The evaluation procedure takes the set of generated mutations and computes its enrichment in test mutations. We compare the recall of the approach, i.e. the fraction of test mutations generated by the model, with the recall of a baseline algorithm that chooses at random a set (of the same cardinality) of possible mutations among all legal ones.
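A minimal sketch of this recall computation, assuming the generated and test mutations are represented as lists of Pos-New pairs (the random baseline and the Wilcoxon significance test are not shown here):

    :- use_module(library(lists)).

    % recall(+Generated, +TestSet, -Recall): fraction of test mutations
    % that also appear among the generated ones.
    recall(Generated, TestSet, Recall) :-
        intersection(TestSet, Generated, Hits),
        length(Hits, NHits),
        length(TestSet, NTest),
        NTest > 0,
        Recall is NHits / NTest.

For example, ?- recall([190-s, 103-n, 60-a], [190-s, 181-c], R). yields R = 0.5.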
Results averaged over 30 random splits of the dataset are reported in Table 2. On each split we performed 30 runs of our algorithm (Aleph has a random component generating the seed for the hypothesis search) and of the random generation algorithm, in each of the different learning tasks (NNRTI, NRTI). In this setting, the average sizes of the learned hypotheses for NNRTI and NRTI are 8 and 7 rules respectively. For each task we also report, in columns 3 and 4 of Table 2, the mean number of generated mutations over the 30 splits and the number of test set mutations for reference. In both cases, improvements are statistically significant according to a paired Wilcoxon test (α=0.01). Figure 3 gives further details about these results by showing how the mean recall on the test set varies if we compute it on the generated mutations satisfying only one clause, two clauses, three clauses and so on (x axis). The mean recall trend is shown in orange for the proposed generative approach and in green for a random generator of mutations, for both classes of inhibitors.

We finally learned a model on the whole dataset in order to generate a single set of mutations for further inspection. We report five examples of novel mutations with the highest score for each of the tasks: 90I, 98I, 103I, 106P, 179I for NNRTI and 60A, 153M, 212L, 229F, 239I for NRTI.

In [20], the authors found a set of novel mutations conferring resistance to efavirenz and nevirapine, which are NNRTI. Our mutation generation algorithm partially confirmed their findings. Mutation 90I was scored high (getting the maximum score of 5 satisfied clauses), mutation 101H was generated with a score of 3 out of 5, mutations 196R and 138Q with a score of 1 out of 5, while mutation 28K was not generated at all by our system as a candidate for conferring resistance to NNRTI.

An example of learned hypothesis is reported in Figure 4. For instance, according to the model, among the features a mutation should have for conferring resistance to an NNRTI there is the change of a basic (magenta) residue of the wild type, e.g. lysine or arginine, into a residue with completely different physico-chemical characteristics (rule 16). Another example for the resistance to NNRTI is that a non-conserved mutation is present in positions between 98 and 106 of the wild type sequence (rule 8).

Learning from mutants

The next set of experiments is focused on learning mutations from mutant data (Dataset 2). Learned models are still limited to single amino acid mutations, and so are the novel mutants generated by the system.

We randomly assigned the mutants in Dataset 2 to 30 training and test set splits, avoiding having mutants containing the same resistance mutation (according to the labelling used in Dataset 1) in both training and test sets. For each of the 30 splits, we evaluated the recall of the generated mutations on the known resistance mutations (from Dataset 1), after first removing all the mutations that were also present in the training set. Comparison is again made against a baseline algorithm generating random legal mutations.

Results averaged over the 30 random splits are reported in Table 3. In this setting, the average sizes of the learned hypotheses for NNRTI and NRTI are 5 and 3 rules respectively. The small size of the models allows us to substantially reduce the space of possible mutations to be generated. Our approach shows an average recall of 17% and 7% on the test mutations, which is statistically significantly higher than the recall obtained by randomly generating mutations (paired Wilcoxon test, α=0.01). Figure 5 gives further details about these results by showing the mean recall trend as a function of the number of satisfied clauses of the generated mutations.

Figure 6 shows the hypothesis learned by the system when trained on the whole set of mutants in Dataset 2. It is easy to see that the hypothesis for the resistance to NNRTI identifies many (10 out of 19) of the known resistance surveillance mutations reported in [19]: 103N, 106A, 181C, 181I, 181V, 188C, 188H, 190A, 190E and 190S. In particular, mutation 181C has the highest predicted score, as it satisfies four clauses of the learned model. Interestingly, other not previously reported mutations are suggested with a high score (3 out of 4 satisfied clauses) by the generative algorithm, for example 181N, 181D, 318C and 232C. These may be novel mutations conferring resistance to NNRTI.

29% of the mutations labeled as conferring resistance to NNRTI in Dataset 1 are also identified. Apart from the surveillance mutations, those also include: 98G, 227C, 190C, 190Q, 190T and 190V. Key positions for the resistance to NNRTI, like 181 and 190, can be easily identified from the model in Figure 6, thanks to rules 15 and 23 that anchor the mutation to the specific position without restricting the change to a specific group of amino acids. A quite uncommon NNRTI-selected mutation, 238N, appears among the generated ones. Indeed, as reported in the summaries of the Stanford HIV Resistance Database, this mutation, in combination with 103N, causes high-level resistance to nevirapine and efavirenz.
Also in the case of NRTI, the generative algorithm suggests a few known surveillance mutations reported in [19]: 67E, 67G, 67N, 116Y, 184V, 184I (6 out of 34). 18% of the mutations labeled as conferring resistance to NRTI in Dataset 1 are also identified. Apart from known surveillance mutations, those also include: 44D, 62V, 67A, 67S, 69R, 184T. Key positions for the resistance to NRTI as predicted by our model in Figure 6 are: 33, 67, 184, 194, 218. Of these, the effects of mutations in position 67 and especially in 184 have been well studied, while positions 33, 194 and 218 seem to be novel. We also found that the predicted mutation 219H, which appears neither among the known surveillance mutations nor among the resistance mutations in Dataset 1, is indeed reported in the Stanford HIV Database as a quite uncommon NRTI-selected mutation. Note that this uncommon mutation requires two nucleotide changes to emerge.

Discussion and Future Work

The results shown in the previous section are a promising starting point to generalize our approach to more complex settings. We showed that the approach scales from a few hundred mutations as learning examples to almost a thousand complete mutants. Moreover, the learned hypotheses significantly constrain the space of all possible single amino acid mutations to be considered, paving the way to the expansion of the method to multi-site mutant generation. This represents a clear advantage over alternative existing machine learning approaches, which would require the preliminary generation of all possible mutants for their evaluation. Restricting to RT mutants with two mutated amino acids, this would already imply testing more than a hundred million candidate mutants. At the same time, this simple ILP approach cannot attain the same accuracy levels of a purely statistical approach. We are currently investigating more sophisticated statistical relational learning approaches for improving the accuracy of the learned models and for assigning weights to the clauses, to be used for selecting the most relevant generated mutations. These can be used as a pre-filtering stage, producing candidate mutants to be further analysed by complex statistical approaches and by additional tools evaluating, for instance, their stability. An additional direction to refine our predictions consists of jointly learning models of resistance to different drugs (e.g. NNRTI and NRTI), possibly further refining the joint models on a per-class basis. On a predictive (rather than generative) task, this was shown [21] to provide improvements over learning distinct per-drug models.

The approach is not restricted to learning drug-resistance mutations in viruses. More generally, it can be applied to learn mutants having certain properties of interest, e.g. improved or more specific activity of an enzyme with respect to a substrate, in a full protein engineering fashion.

Conclusions

In this work we proposed a simple relational learning approach applicable to evolution prediction and protein engineering. The algorithm relies on a training set of mutation data annotated with drug resistance information, builds a relational model characterizing resistant mutations, and uses it to generate novel potentially resistant ones. Encouraging preliminary results on HIV RT data indicate a statistically significant enrichment in resistance-conferring mutations among those generated by the system, in both the mutation-based and the mutant-based learning settings. Albeit preliminary, our results suggest that the proposed approach for learning mutations has potential in guiding mutant engineering, as well as in predicting virus evolution in order to try and devise appropriate countermeasures. More detailed background knowledge, possibly including 3D information whenever available, is necessary in order to further focus the set of generated mutations, possibly together with post-processing stages involving mutant evaluation by statistical machine learning approaches [5]. In the near future we also plan to generalize the proposed approach to jointly generate sets of related mutations, shifting the focus from the generation of single amino acid mutations to mutants with multiple mutations.

Acknowledgements

EC and TL acknowledge the support of the F.W.O. (Belgium) and the F.R.S.-F.N.R.S. (Belgium). ST and AP acknowledge the support of the Italian Ministry of University and Research under grant PRIN 2009LNP494 (Statistical Relational Learning: Algorithms and Applications).

References

1. Cao ZW, Han LY, Zheng CJ, Ji ZL, Chen X, Lin HH, Chen YZ: Computer prediction of drug resistance mutations in proteins. Drug Discovery Today: BIOSILICO 2005, 10(7).

2. Rubingh DN: Protein engineering from a bioindustrial point of view. Current Opinion in Biotechnology 1997, 8(4):417–422.

3. Muggleton S, De Raedt L: Inductive Logic Programming: Theory and Methods. University Computing 1994:629–682.
4. Cilia E, Passerini A: Frankenstein Junior: a relational learning approach toward protein engineering. In Proceedings of the Annotation, Interpretation and Management of Mutations (AIMM) Workshop@ECCB2010, 2010.

5. Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9 Suppl 2:S6.

6. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, Westhead DR: Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics 2006, 7:405.

7. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research 2007, 35(11):3823–35.

8. Rhee SY, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, Shafer RW: Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(46):17355–60.

9. Götte M, Li X, Wainberg M: HIV-1 reverse transcription: a brief overview focused on structure-function relationships among molecules involved in initiation of the reaction. Archives of Biochemistry and Biophysics 1999, 365(2):199–210.

10. Shafer R: Rationale and Uses of a Public HIV Drug-Resistance Database. Journal of Infectious Diseases 2006, 194(Supplement 1):S51–S58.

11. De Clercq E: HIV inhibitors targeted at the reverse transcriptase. AIDS Research and Human Retroviruses 1992, 8(2):119–134.

12. Spence R, Kati W, Anderson K, Johnson K: Mechanism of inhibition of HIV-1 reverse transcriptase by nonnucleoside inhibitors. Science 1995, 267(5200):988–993.

13. Richter L, Augustin R, Kramer S: Finding Relational Associations in HIV Resistance Mutation Data. In Proceedings of Inductive Logic Programming (ILP), Volume 9, 2009.

14. Betts M, Russell R: Amino-Acid Properties and Consequences of Substitutions. In Bioinformatics for Geneticists 2003:311–342.

15. Taylor WR: The classification of amino acid conservation. Journal of Theoretical Biology 1986, 119(2):205–218.

16. Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. Journal of Molecular Biology 2002, 324:105–121.

17. Muggleton S: Learning from Positive Data. In Inductive Logic Programming Workshop 1996:358–376.

18. Džeroski S, Bratko I: Handling noise in Inductive Logic Programming. In ILP92, Report ICOT TM-1182. Edited by Muggleton S, 1992.

19. Bennett DE, Camacho RJ, Otelea D, Kuritzkes DR, Fleury H, Kiuchi M, Heneine W, Kantor R, Jordan MR, Schapiro JM, Vandamme AM, Sandstrom P, Boucher CAB, van de Vijver D, Rhee SY, Liu TF, Pillay D, Shafer RW: Drug resistance mutations for surveillance of transmitted HIV-1 drug-resistance: 2009 update. PLoS ONE 2009, 4(3):e4724.

20. Deforche K, Camacho RJ, Grossman Z, Soares MA, Van Laethem K, Katzenstein DA, Harrigan PR, Kantor R, Shafer R, Vandamme AM: Bayesian network analyses of resistance pathways against efavirenz and nevirapine. AIDS 2008, 22(16):2107–15.

21. Cilia E, Landwehr N, Passerini A: Relational Feature Mining with Hierarchical Multitask kFOIL. Fundamenta Informaticae 2011, 113(2):151–177.

22. Theys K, Deforche K, Beheydt G, Moreau Y, Van Laethem K, Lemey P, Camacho RJ, Rhee SY, Shafer RW, Van Wijngaerden E, Vandamme AM: Estimating the individualized HIV-1 genetic barrier to resistance using a nelfinavir fitness landscape. BMC Bioinformatics 2010, 11:409.
Algorithms

Algorithm 1 - Mutation generation algorithm.
Algorithm for the discovery of novel relevant mutations.

input: background knowledge B, learned model H
output: rank of the most relevant mutations R
procedure GenerateMutations(B, H)
    D_M ← ∅
    A ← all assignments a that satisfy at least one clause c_i ∈ H
    for a ∈ A do
        m ← mutation corresponding to the assignment a
        score ← S_M(m)                     ▷ number of clauses c_i satisfied by a
        D_M ← D_M ∪ {(m, score)}
    end for
    R ← RankMutations(D_M, B, H)           ▷ rank relevant mutations
    return R
end procedure
Figures
Figure 1 - Mutation engineering algorithm.
Schema of the mutation engineering algorithm.
(Schematic: a dataset of mutations or mutants and the background knowledge are fed to the ILP learner, which outputs a hypothesis; the hypothesis is then used to produce a rank of novel relevant mutations.)
Figure 2 - Model for the resistance to NNRTI learned from Dataset 2
An example of learned hypothesis for the NNRTI task, with highlighted examples covered by the hypothesis clauses.

>wt ...AGLKKKKSVTVLDVG...YQYMDDLYVG...WETWWTEY...WIPEWEFVN...
(The four wild-type fragments shown span positions 98-112, 181-190, 398-405 and 410-418; the marked D and W residues belong to the metal binding site and the heterodimerization site, respectively.)

Each clause of the hypothesis has head res_against(A,nnrti); the four clause bodies are:
mut(A,B,C,D) AND position(C,190)
mut(A,B,C,D) AND position(C,190) AND typeaa(polar,D)
mut(A,y,C,D) AND typeaa(aliphatic,D)
mut(A,B,C,a) AND position(C,106)
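As a hedged illustration of how the first two clauses enumerate the 190X candidates discussed in the Results, assuming the polar group of [14] is encoded as typeaa(polar, AA) facts and the wild type as aa/2 facts (bindings shown as comments, up to fact ordering):

    ?- aa(190, Wild), findall(New, (typeaa(polar, New), New \== Wild), Cs).
    %  Wild = g,
    %  Cs = [y, w, t, s, r, q, n, k, h, e, d, c].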
Figure 3 - Mean recall trend by number of satisfied clauses (Dataset 1)
Mean recall of the generated mutations on the resistance test set mutations from Dataset 1, varying the number of satisfied clauses. The mean recall values in orange refer to the proposed generative algorithm; the mean recall values in green refer to a random generator of mutations.
100"
90"
80"
70"
mean%recall%
60"
NNRTI"
50"
NNRTI"(rand)"
40"
NRTI"
30"
NRTI"(rand)"
20"
10"
0"
1" 2" 3" 4" 5" 6" 7" 8" 9" 10"
nuber%of%sa.sfied%clauses%per%generated%muta.on%
10
Figure 4 - Example of model learned from Dataset 1
Model learned for NRTI and NNRTI on the mutations of Dataset 1. Note that in this case A is a variable that identifies a single amino acid mutation.

1 - res_against(A,nrti) ← mut(A,B,C,D) AND color(red,D)
2 - res_against(A,nrti) ← mut(A,B,C,D) AND color(red,D) AND color(red,B)
3 - res_against(A,nrti) ← mut(A,B,C,D) AND location(7.0,C) AND mutated_residue_cp(C,medium,medium)
4 - res_against(A,nrti) ← mut(A,B,C,D) AND location(7.0,C)
5 - res_against(A,nrti) ← same_color_type_mut(A,B)
6 - res_against(A,nrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,medium,high)
7 - res_against(A,nrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,medium,small)
8 - res_against(A,nnrti) ← different_color_type_mut(A,B) AND location(11.0,B)
9 - res_against(A,nnrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,small,small)
10 - res_against(A,nnrti) ← same_color_type_mut(A,B)
11 - res_against(A,nnrti) ← mut(A,v,C,D)
12 - res_against(A,nnrti) ← mut(A,B,C,D) AND location(15.0,C)
13 - res_against(A,nnrti) ← mut(A,B,C,i)
14 - res_against(A,nnrti) ← mut(A,B,C,D) AND location(11.0,C)
15 - res_against(A,nnrti) ← mut(A,B,C,D) AND color(red,D)
16 - res_against(A,nnrti) ← mut(A,B,C,D) AND color(magenta,B) AND different_color_type_mut(A,C)
17 - res_against(A,nnrti) ← mut(A,B,C,D) AND location(21.0,C)
Figure 5 - Mean recall trend by number of satisfied clauses (Dataset 2)
Mean recall of the generated mutations on the resistance test set mutations from Dataset 2, varying the number of satisfied clauses. The mean recall values in orange refer to the proposed generative algorithm; the mean recall values in green refer to a random generator of mutations.
18"
16"
14"
12"
mean%recall%
10"
NNRTI"
8" NNRTI"(rand)"
6" NRTI"
NRTI"(rand)"
4"
2"
0"
1" 2"
number%of%sa.sfied%clauses%%
per%generated%muta.on%
11
Figure 6 - Model learned from the whole Dataset 2
Model learned for NRTI and NNRTI on the whole dataset of mutants (Dataset 2). Note that in this case A is a variable that identifies a mutant.

1 - res_against(A,nrti) ← mut(A,B,C,D) AND position(C,67)
2 - res_against(A,nrti) ← mut(A,B,C,g) AND color(red,B)
3 - res_against(A,nrti) ← mut(A,B,C,D) AND typeaa(nonpolar,D)
4 - res_against(A,nrti) ← mut(A,B,C,D) AND position(C,33)
5 - res_against(A,nrti) ← mut(A,B,C,D) AND position(C,218)
6 - res_against(A,nrti) ← mut(A,B,C,h) AND catalytic_propensity(B,high)
7 - res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,194)
8 - res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,184)
9 - res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(D,high) AND typeaa(neutral,D) AND typeaa(nonpolar,B)
10 - res_against(A,nrti) ← mut(A,B,C,d) AND position(C,44)
11 - res_against(A,nrti) ← mut(A,B,C,r) AND typeaa(tiny,B) AND typeaa(polar,B)
12 - res_against(A,nrti) ← mut(A,d,C,D) AND catalytic_propensity(D,medium)
13 - res_against(A,nrti) ← mut(A,d,C,D) AND catalytic_propensity(D,medium) AND typeaa(polar,D)
14 - res_against(A,nrti) ← mut(A,B,C,k) AND position(C,203)
15 - res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,190)
16 - res_against(A,nnrti) ← mut(A,B,C,a) AND position(C,106)
17 - res_against(A,nnrti) ← mut(A,B,C,D) AND same_typeaa(D,B,tiny) AND same_typeaa(D,B,nonpolar)
18 - res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(D,high) AND typeaa(aromatic,B) AND same_typeaa(D,B,neutral)
19 - res_against(A,nnrti) ← mut(A,B,C,n) AND position(C,103)
20 - res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(B,high) AND typeaa(neutral,B) AND same_typeaa(D,B,polar)
21 - res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,177) AND catalytic_propensity(D,medium) AND same_type_mut_t(A,C,polar)
22 - res_against(A,nnrti) ← mut(A,B,C,n) AND position(C,238)
23 - res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,181)
24 - res_against(A,nnrti) ← mut(A,y,C,D) AND typeaa(small,D)
Tables

Table 1 - Background knowledge predicates
Summary of the background knowledge facts and rules. MutID is a mutation or a mutant identifier, depending on the type of learning problem.

aa(Pos,AA): indicates a residue in the wild type sequence
mut(MutID,AA,Pos,AA1): indicates a mutation: mutation or mutant identifier, position and amino acids involved, before and after the substitution
res_against(MutID,Drug): indicates whether a mutation or mutant is resistant to a certain drug
color(Color,AA): indicates the coloring group of a natural amino acid
typeaa(T,AA): indicates the type (e.g. aliphatic, charged, aromatic, polar) of a natural amino acid
same_color_type(R1,R2): indicates whether two residues belong to the same coloring group
same_typeaa(R1,R2,T): indicates whether two residues are of the same type T
same_color_type_mut(MutID,Pos): indicates a mutation to a residue of the same coloring group
different_color_type_mut(MutID,Pos): indicates a mutation changing the coloring group of the residue
same_type_mut_t(MutID,Pos,T): indicates a mutation to a residue of the same type T
different_type_mut_t(MutID,Pos): indicates a mutation changing the type of the residue
close_to_site(Pos): indicates whether a specific position is close to a binding or active site, if any
location(L,Pos): indicates in which fragment of the primary sequence the amino acid is located
catalytic_propensity(AA,CP): indicates whether an amino acid has a high, medium or low catalytic propensity
mutated_residue_cp(Rw,Pos,Rm,CPold,CPnew): indicates how, in a mutated position, the catalytic propensity has changed (e.g. from low to high)
Table 2 - Results (Dataset 1)
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random. The average recall has been computed for each of the learning tasks over the 30 splits, by averaging recall over 30 repeated runs of the two algorithms. Results of a paired Wilcoxon test (α = 0.01) on the statistical significance of the performance differences are also reported: a black bullet indicates a statistically significant improvement of our algorithm over the random generator. The mean number of generated mutations over the 30 splits and the number of test mutations are reported alongside.

Task  | Algorithm (mean recall %, 30 splits) | Random generator (mean recall %, 30 splits) | Mean n. generated mutations | n. test mutations
NNRTI | 86 •                                 | 58                                          | 5201                        | 17
NRTI  | 55 •                                 | 46                                          | 5548                        | 28
Table 3 - Results (Dataset 2)
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random. The average recall has been computed for each of the learning tasks over the 30 splits. Results of a paired Wilcoxon test (α = 0.01) on the statistical significance of the performance differences are also reported: a black bullet indicates a statistically significant improvement of our algorithm over the random generator. The mean number of generated mutations and the mean number of test mutations over the 30 splits are reported alongside.

Task  | Algorithm (mean recall %) | Random generator (mean recall %) | Mean n. generated mutations | Mean n. test mutations
NNRTI | 17 •                      | 1                                | 236                         | 26
NRTI  | 7 •                       | 3                                | 420                         | 40