Predicting virus mutations through relational learning

Elisa Cilia 1, Stefano Teso 2, Sergio Ammendola 3, Tom Lenaerts 1,4, Andrea Passerini* 2

1 MLG, Département d'Informatique, Université Libre de Bruxelles, Boulevard du Triomphe - CP 212, 1050 Brussels, Belgium
2 Department of Computer Science and Information Engineering, University of Trento, via Sommarive 14, I-38123 (Povo) Trento, Italy
3 Ambiotec sas, R&D, Via Appia Nord 47, 00142 Cisterna di Latina (LT), Italy
4 AI-lab, Vakgroep Computerwetenschappen, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium

Email: Elisa Cilia - ecilia@ulb.ac.be; Stefano Teso - teso@disi.unitn.it; Sergio Ammendola - ricercaesviluppo@ambiotec.it; Tom Lenaerts - tlenaert@ulb.ac.be; Andrea Passerini* - passerini@disi.unitn.it

* Corresponding author

Abstract

Background: Viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants.

Results: We propose a simple relational learning approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance and uses them to generate a set of potentially resistant mutants.

Conclusions: Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.

Background
HIV is a pandemic cause of lethal pathologies in more than 33 million people. Its horizontal transmission through mucosae is difficult to control and treat because the virus has a high virulence and infects several types of immune surveillance cells, such as those characterized by the CD4 receptor (CD4+ cells). The major problem in treating the infection is drug selectivity, since the virus penetrates the cell, where it releases its genetic material and replicates itself by exploiting the cellular machinery. A drug target is the replicating apparatus of the cell: HIV antiviral molecules are directed against several cell types, such as macrophages or T lymphocytes, to interfere with viral replication. HIV releases a single-strand RNA particle, a reverse transcriptase and an integrase into the cell cytoplasm. The RNA molecule is quickly retro-transcribed into a double-stranded DNA molecule, which is integrated into the host genome. The integration events induce a cellular response, which begins with the transcription of the Tat gene by RNA polymerase II. Tat is a well-known protein responsible for HIV activation, since it recruits cytoplasmic host proteins involved in the expression of viral genes. Remarkably, HIV can establish a life-long latent infection by suppressing its own transcription, thus making ineffective a large part of the antiviral drugs aimed at controlling viral replication. Moreover, replicating viruses adopt several drug-resistance strategies: for instance, HIV accumulates amino acid mutations that reduce the efficacy of pharmaceutical compounds.

The present work is aimed at gaining knowledge about mutations that may occur in the viral reverse transcriptase (RT) [9]. RT is an important target for developing antiretroviral medicines, and different types of active molecules have been found: the Nucleoside Reverse Transcriptase Inhibitors (NRTI) and the Non-Nucleoside RT Inhibitors (NNRTI). Although RT inhibitors are active, the HIV virus quickly and frequently changes the RT-encoding sequence, thus acquiring drug resistance. Since antiviral therapy is based on cocktails of molecules including new RT inhibitors, a computational approach for predicting possible mutation sites and their drug sensitivity is an important tool in drug discovery for antiretroviral therapy.

Computational methods can assist here by exploring the space of potential virus mutants, providing potential avenues for anticipatory drugs [1]. To achieve such a goal, one first needs to understand what kind of mutants may lead to resistance. A general engineering technique for building artificial mutants is referred to as rational design [2]. The technique consists in modifying existing proteins by site-directed mutagenesis. It relies on deep domain knowledge in order to identify candidate mutations that may affect protein structure or function. The process typically involves extensive trial-and-error experiments and is also aimed at improving the understanding of the mechanisms underlying a protein's behavior.

In this work we report on our initial attempt to develop an artificial system mimicking the rational design process. An Inductive Logic Programming (ILP) learner [3] is trained to extract general rules describing mutations relevant to a certain behavior, i.e. resistance towards certain inhibitors. The learned rules are then used to infer novel mutations which may induce similar resistance.

We consider here two learning settings. In the first one we rely on a training set made of single amino acid mutations known to confer resistance to a certain class of inhibitors (we will refer to this as mutation-based learning; preliminary promising results in this setting were described in [4]). In the second we learn the rules directly from mutants (comprising 1 to n amino acid mutations) that have been experimentally tested for their resistance to the same classes of inhibitors (we will refer to this as mutant-based learning). This second setting is actually the most common situation, in which one is presented with a number of mutants together with some evidence of their susceptibility to certain treatments. The goal is then to extract rules and potential mutations explaining why certain mutants confer resistance and which other ones produce the same effect.

Many machine learning methods have been applied in the past to mutation data, for predicting the effect of single amino acid mutations on protein stability changes [5], the effect of mutations on protein function [6, 7], or drug susceptibility [8]. To the best of our knowledge this is the first attempt to learn relational features of mutations affecting a protein's behavior and to use them for generating novel relevant mutations. Furthermore, even if we focus on single amino acid mutations in our experimental evaluation, our approach can be straightforwardly extended to multiple point mutations, and we are actively working in this direction. Note that other approaches first generate all potential mutations and then decide which of them lead to resistance, whereas we produce the relevant ones directly.

We report an experimental evaluation focused on HIV RT. RT is a well-studied protein: a large number of mutants have been shown to resist one or more drugs, and databases exist that collect those data from different sources and make them available for further analyses [10]. We tested the ability of our approach to generate drug-resistant amino acid mutations for NRTI and NNRTI. Our results show statistically significant improvements for both drug classes over the baseline results obtained through a random generator. Mutant-based results confirm our previous findings [4], extending them to the more natural setting in which only information on mutant behavior is observed. Additional background knowledge allows us to produce more compact and simple hypotheses, which have the advantage of generating a reduced number of candidate mutations.

The approach can in general be applied in mutation studies aimed at understanding protein function. By searching for residues most likely to have a functional role in an active site, it can for instance be used in the engineering of enzyme mutants with an improved activity for a certain substrate.

Results

Datasets

We applied our approach to predict HIV RT mutations conferring resistance to two classes of inhibitors: Nucleoside RT Inhibitors (NRTI) and Non-Nucleoside RT Inhibitors (NNRTI). The two classes of inhibitors differ in the targeted sites and rely on quite different mechanisms [11, 12]. NNRTI inhibit the reverse transcriptase by binding to the enzyme active site, therefore directly interfering with the enzyme function. NRTI are instead incorporated into the newly synthesized viral DNA, preventing its elongation.
We collected datasets for both mutation-based and mutant-based learning. The former (Dataset 1) is a dataset of amino acid mutations derived from the Los Alamos National Laboratories (LANL) HIV resistance database (http://www.hiv.lanl.gov/content/sequence/RESDB/) by Richter et al. [13], who used it to mine relational rules among mutations. It consists of 95 amino acid mutations labeled as resistant to NRTI and 56 labeled as resistant to NNRTI, over a set of 581 observed mutations. For the mutant-based setting, we collected (Dataset 2) HIV RT mutation data from the Stanford University HIV Drug Resistance Database. This database provides a dataset of selected mutants of HIV RT with the results of susceptibility studies for various drugs, and was previously employed [8] for predicting drug resistance of novel (given) mutants (downloadable at http://hivdb.stanford.edu/cgi-bin/GenoPhenoDS.cgi). It is composed of 838 different mutants annotated with susceptibility levels (low, medium and high) to drugs belonging to the NRTI (639 mutants) and NNRTI (747 mutants) drug classes. We considered a setting aimed at identifying amino acid mutations conferring high susceptibility (as opposed to medium or low), and considered a mutant as highly susceptible to a drug class if it was annotated as being highly susceptible to at least one drug from that class.

Learning in first order logic

Our aim is to learn a first-order logic hypothesis for a target concept, i.e. mutation conferring resistance to a certain drug, and to use it to infer novel mutations consistent with such a hypothesis. We rely on definite clauses, which are the basis of the Prolog programming language. A definite clause is an expression of the form h ← b1 AND ... AND bn, where h and the bi are atomic literals. Atomic literals are expressions of the form p(t1, ..., tn), where p/n is a predicate symbol of arity n and the ti are terms, either constants (denoted by lower case) or variables (denoted by upper case) in our experiments. The atomic literal h is also called the head of the clause, typically the target predicate, and b1 AND ... AND bn its body. Intuitively, a clause states that the head h holds whenever the body b1 AND ... AND bn holds. For instance, a simple hypothesis like res_against(A,nnrti) ← mutation(A,C) AND close_to_site(C) would indicate that a mutation C in the proximity of a binding site confers on mutant A resistance against an NNRTI. Learning in this setting consists of searching for a set of definite clauses H = {c1, ..., cm} covering all or most positive examples, and none or few negative ones if available. First-order clauses can thus be interpreted as relational features that characterize the target concept. The main advantage of these logic-based approaches with respect to other machine learning techniques is the expressivity and interpretability of the learned models: models can be readily interpreted by human experts and provide direct explanations for the predictions.

Background knowledge

We built a relational knowledge base for the problem domain. Table 1 summarizes the predicates we included as background knowledge. We represented the amino acids of the wild type with their positions in the primary sequence (aa/2) and the specific mutations characterizing the mutants (mut/4). Target predicates encode the resistance of a mutation or mutant to a certain drug class (res_against/2). Note that this encoding considers mutations at the amino acid rather than the nucleotide level, i.e. a single amino acid mutation can involve up to three nucleotide changes. While focusing on single nucleotide changes would drastically expand the space of possible mutations, including the cost (in terms of nucleotide changes) of a certain amino acid mutation could help refine our search procedure.

Additional background knowledge was included in order to highlight characteristics of residues and mutations:

typeaa/2 indicates the type of the natural amino acids according to the Venn diagram grouping based on the amino acid properties proposed in [14]. For example, a serine is a tiny and polar amino acid.

color/2 indicates the type of the natural amino acids according to the coloring proposed in [15]. For example, the magenta class includes basic amino acids such as lysine and arginine, while the blue class includes acidic amino acids such as aspartic and glutamic acid. Unlike in the previous case, these groups of amino acids do not overlap.

same_type_aa/3 indicates whether two residues belong to the same type T, i.e. whether a change from one residue to the other conserves the type of the amino acid.

same_color_type/2 indicates whether two residues belong to the same coloring group, i.e. whether a change from one residue to the other conserves the coloring group of the amino acid.

same_type_mut_t/3 indicates that a residue substitution at a certain position does not modify the amino acid type T with respect to the wild type. For example, mutation i123v conserves the aliphatic amino acid type while mutation i123d does not (i.e. different_type_mut_t/3 holds for it).

same_color_type_mut/2 indicates that a residue substitution at a certain position does not modify the amino acid coloring group with respect to the wild type. For example, mutation d123e conserves the blue amino acid group while mutation d123a does not (i.e. different_color_type_mut/2 holds for it).

location/2 indicates in which fragment of the primary sequence the amino acid is located. Locations are numbered from 0 by dividing the sequence into fragments of 10 amino acids in length.

Other background knowledge facts and rules were added in order to express structural relations along the primary sequence and the catalytic propensity of the involved residues:

close_to_site/1 indicates whether a specific position is less than 5 positions away from a residue belonging to a binding or active site. In our specific case, the background theory incorporates knowledge about a metal binding site and a heterodimerization site.

catalytic_propensity/2 indicates whether an amino acid has a high, medium or low catalytic propensity according to [16].

mutated_residue_cp/5 indicates how the catalytic propensity has changed in a mutated position (e.g. from low to high).

Algorithm overview

The proposed approach is sketched in Figure 1.

Step 1: Learning phase

The first step is the learning phase. An ILP learner is fed with a logical representation of the data D and of the domain knowledge B to be incorporated, and it returns a first-order logical hypothesis H for the concept of a mutation conferring resistance to a certain class of inhibitors. In this context there are two suitable ways to learn the target concept, depending on the type of input data and their labeling:

a) the one-class classification setting, learning a model from positive instances only. This is the approach we employ for Dataset 1: positive examples are mutations for which experimental proof of resistance to a drug is available, but no safe claim can be made about non-annotated mutations;

b) the binary classification setting, learning to discriminate between positive and negative instances. This setting is appropriate for Dataset 2: positive examples are, in our experiments, mutants labelled as highly susceptible to the drug class, while negative examples are those with medium or low susceptibility.

The hypothesis is derived using the Aleph (A Learning Engine for Proposing Hypotheses) ILP system (http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html), which allows learning in both settings. In the one-class classification case, Aleph incrementally builds a hypothesis covering all positive examples, guided by the Bayesian evaluation function described in [17], which scores candidate solutions according to an estimate of the Bayes posterior probability, allowing it to trade off hypothesis size and generality. In the binary classification case, Aleph builds a hypothesis covering all positive examples and none or few negative ones, guided by an m-estimate evaluation function [18]. Aleph adds clauses to the hypothesis based on their coverage of training examples. Given a learned model, the first clauses are therefore those covering most training examples, and thus usually the most representative of the underlying concept.

In Figure 2 we show a simple example of a learned hypothesis covering a set of training mutants from Dataset 2. The learned hypothesis models the ability of a mutation to confer on the mutant resistance to NNRTI and is composed of four first-order clauses, each one covering a different set of mutations of the wild type, as highlighted in colors: yellow for the first clause, blue for the second, red for the third, and green for the fourth one. Some mutations are covered by more than one clause, as shown by the color overlaps. Bold letters indicate residues involved in the RT metal binding site (D110, D185 and D186) or the heterodimerization site (W401 and W414).

Step 2: Generative phase

The second step of our approach is the generative phase, in which the learned hypothesis is employed to find novel mutations that can confer drug resistance to an RT mutant. A set of candidate mutations can be generated by using the Prolog inference engine, starting from the rules in the learned model. The rules are in fact constraints on the characteristics that a mutation of the wild type should have in order to confer resistance to a certain inhibitor, according to the learned hypothesis.

Algorithm 1 details the mutation generation procedure. We assume, for simplicity, to have a model H for a single drug class. The procedure works by querying the Prolog inference engine for all possible variable assignments that satisfy the hypothesis clauses, each assignment representing a mutation by its position and the amino acid replacing the wild-type residue. The set of mutations generated by the model is ranked according to a scoring function SM before being returned by the algorithm.
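The generate-and-score loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative re-implementation, not the authors' Prolog/Aleph pipeline: the clauses below are hypothetical stand-ins, written as Python predicates over (position, replacement) pairs, loosely mimicking clause bodies such as mut(A,B,C,D) AND position(C,190) AND typeaa(polar,D).

```python
# Illustrative sketch of Algorithm 1: enumerate candidate single amino acid
# mutations, keep those satisfying at least one learned clause, score each by
# the number of satisfied clauses, and return them ranked by score.
# The clauses and the "polar" group below are toy assumptions, not the
# learned model or the background knowledge of the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

clauses = [
    lambda pos, aa: pos == 190,                     # any substitution at position 190
    lambda pos, aa: pos == 190 and aa in "STYCNQ",  # substitution at 190 into a "polar" residue
    lambda pos, aa: pos == 106,                     # any substitution at position 106
]

def generate_mutations(clauses, positions):
    scored = {}
    for pos in positions:
        for aa in AMINO_ACIDS:
            score = sum(1 for c in clauses if c(pos, aa))
            if score > 0:  # keep only mutations satisfying at least one clause
                scored[f"{pos}{aa}"] = score
    # rank by decreasing score before returning, as in Algorithm 1
    return sorted(scored.items(), key=lambda kv: -kv[1])

ranked = generate_mutations(clauses, positions=range(1, 561))
# substitutions at position 190 into the toy "polar" set rank first, with score 2
```

In the actual system the enumeration is performed by the Prolog inference engine, which returns all variable assignments satisfying the clause bodies; the final ranking step corresponds to RankMutations in Algorithm 1.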
We defined SM as the number of clauses in H that a candidate mutation m satisfies. Referring to the model in Figure 2, for instance, the generated candidate mutations satisfying the first two clauses are all the mutations changing the glycine in position 190 into a polar amino acid: 190Y, 190W, 190T, 190S, 190R, 190Q, 190N, 190K, 190H, 190E, 190D and 190C, where for example the notation 190Y indicates a change of the wild-type amino acid, located in position 190, into a tyrosine (Y). For instance, 190S and 190E in this list are already part of the known NNRTI surveillance mutations (see [19]).

Learning from mutations

We first learn general rules characterizing known resistance mutations (from Dataset 1), to be used for predicting novel candidate ones. We divided the dataset of mutations into a training and a test set (70/30) in a stratified way, that is, by preserving in both sets the proportion of examples belonging to each of the two drug classes. The resulting training set is composed of a total of 106 mutations, while the test set is composed of 45 mutations. We trained the ILP learner on the training set and evaluated on the test set the set of mutations generated using the learned model. The evaluation procedure takes the set of generated mutations and computes its enrichment in test mutations. We compare the recall of the approach, i.e. the fraction of test mutations generated by the model, with the recall of a baseline algorithm that chooses at random a set (of the same cardinality) of possible mutations among all legal ones.

Results averaged over 30 random splits of the dataset are reported in Table 2. On each split we performed 30 runs of our algorithm (Aleph has a random component generating the seed for the hypothesis search) and of the random generation algorithm, for each of the different learning tasks (NNRTI, NRTI). In this setting, the average sizes of the learned hypotheses for NNRTI and NRTI are 8 and 7 rules respectively. For each task we also report, in columns 3 and 4 of Table 2, the mean number of generated mutations over the 30 splits and the number of test set mutations for reference. In both cases, improvements are statistically significant according to a paired Wilcoxon test (α=0.01). Figure 3 gives further details about these results by showing how the mean recall on the test set varies when computed on the generated mutations satisfying only one clause, two clauses, three clauses and so on (x axis). The mean recall trend is shown in orange for the proposed generative approach and in green for a random generator of mutations, for both classes of inhibitors.

We finally learned a model on the whole dataset in order to generate a single set of mutations for further inspection. We report five examples of novel mutations with the highest score for each of the tasks: 90I, 98I, 103I, 106P, 179I for NNRTI, and 60A, 153M, 212L, 229F, 239I for NRTI. In [20], the authors found a set of novel mutations conferring resistance to efavirenz and nevirapine, which are NNRTI. Our mutation generation algorithm partially confirmed their findings. Mutation 90I was scored high (getting the maximum score of 5 satisfied clauses), mutation 101H was generated with a score of 3 out of 5, and mutations 196R and 138Q with a score of 1 out of 5, while mutation 28K was not generated at all by our system as a candidate for conferring resistance to NNRTI.

An example of a learned hypothesis is reported in Figure 4. For instance, according to the model, among the features a mutation should have for conferring resistance to an NNRTI there is the change of a basic (magenta) residue of the wild type, e.g. lysine or arginine, into a residue with completely different physico-chemical characteristics (rule 16). Another example for the resistance to NNRTI is that a non-conserved mutation is present in positions between 98 and 106 of the wild-type sequence (rule 8).

Learning from mutants

The next set of experiments focuses on learning mutations from mutant data (Dataset 2). Learned models are still limited to single amino acid mutations, and so are the novel mutants generated by the system. We randomly assigned the mutants in Dataset 2 to 30 training and test set splits, avoiding having mutants containing the same resistance mutation (according to the labelling used in Dataset 1) in both training and test sets. For each of the 30 splits, we evaluated the recall of the generated mutations on the known resistance mutations (from Dataset 1), after first removing all the mutations that were also present in the training set. The comparison is again made against a baseline algorithm generating random legal mutations.

Results averaged over the 30 random splits are reported in Table 3. In this setting, the average sizes of the learned hypotheses for NNRTI and NRTI are 5 and 3 rules respectively. The small size of the models substantially reduces the space of possible mutations to be generated. Our approach shows an average recall of 17% and 7% on the test mutations, which is statistically significantly higher than the recall obtained by randomly generating mutations (paired Wilcoxon test, α=0.01). Figure 5 gives further details about these results by showing the mean recall trend as the number of satisfied clauses of the generated mutations varies.

Figure 6 shows the hypothesis learned by the system when trained on the whole set of mutants in Dataset 2. It is easy to see that the hypothesis for the resistance to NNRTI identifies many (10 out of 19) of the known resistance surveillance mutations reported in [19]: 103N, 106A, 181C, 181I, 181V, 188C, 188H, 190A, 190E and 190S. In particular, mutation 181C has the highest predicted score, as it satisfies four clauses of the learned model. Interestingly, other not previously reported mutations are suggested with a high score (3 out of 4 satisfied clauses) by the generative algorithm, for example 181N, 181D, 318C and 232C. These may be novel mutations conferring resistance to NNRTI. 29% of the mutations labeled as conferring resistance to NNRTI in Dataset 1 are also identified; apart from the surveillance mutations, those include 98G, 227C, 190C, 190Q, 190T and 190V. Key positions for the resistance to NNRTI, like 181 and 190, can be easily identified from the model in Figure 6, thanks to rules 15 and 23, which anchor the mutation to the specific position without restricting the change to a specific group of amino acids.
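The enrichment evaluation used in both settings, i.e. the recall of the generated set on held-out resistance mutations compared against a random generator proposing the same number of candidates, can be sketched as follows. All mutation lists here are made-up toy data, not the actual datasets.

```python
# Sketch of the evaluation protocol: recall of a generated mutation set on a
# held-out test set, versus a random baseline of the same cardinality.
import random

def recall(generated, test_set):
    """Fraction of test mutations recovered by the generator."""
    return len(set(generated) & set(test_set)) / len(test_set)

# all legal single amino acid substitutions over a 560-residue sequence
all_legal = [f"{pos}{aa}" for pos in range(1, 561) for aa in "ACDEFGHIKLMNPQRSTVWY"]

test_mutations = ["190S", "190E", "103N", "181C", "67N"]           # toy held-out set
model_generated = ["190S", "190E", "190T", "103N", "98G", "227C"]  # toy model output

rng = random.Random(0)
baseline = rng.sample(all_legal, len(model_generated))  # same cardinality, random

print(recall(model_generated, test_mutations))  # 0.6: three of five recovered
# the baseline recall is expected to be near zero over such a large legal space
```

In the paper this comparison is repeated over 30 random splits and assessed with a paired Wilcoxon test; the sketch above shows a single split only.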
A quite uncommon NNRTI-selected mutation, 238N, appears among the generated ones. Indeed, as reported in the summaries of the Stanford HIV Resistance Database, this mutation, in combination with 103N, causes high-level resistance to nevirapine and efavirenz.

Also in the case of NRTI, the generative algorithm suggests a few of the known surveillance mutations reported in [19]: 67E, 67G, 67N, 116Y, 184V and 184I (6 out of 34). 18% of the mutations labeled as conferring resistance to NRTI in Dataset 1 are also identified; apart from the known surveillance mutations, those include 44D, 62V, 67A, 67S, 69R and 184T. Key positions for the resistance to NRTI as predicted by our model in Figure 6 are 33, 67, 184, 194 and 218. Of these, the effects of mutations at positions 67 and especially 184 have been well studied, while positions 33, 194 and 218 seem to be novel. We also found that the predicted mutation 219H, which appears neither among the known surveillance mutations nor among the resistance mutations in Dataset 1, is indeed reported in the Stanford HIV Database as a quite uncommon NRTI-selected mutation. Note that this uncommon mutation requires two nucleotide changes to emerge.

Discussion and Future Work

The results shown in the previous section are a promising starting point for generalizing our approach to more complex settings. We showed that the approach scales from a few hundred mutations as learning examples to almost a thousand complete mutants. Moreover, the learned hypotheses significantly constrain the space of all possible single amino acid mutations to be considered, paving the way to the expansion of the method to multi-site mutant generation.
This represents a clear advan- ther focus the set of generated mutations, and pos- tage over alternative existing machine learning ap- sibly post-processing stages involving mutant evalu- proaches, which would require the preliminary gen- ation by statistical machine learning approaches [5]. eration of all possible mutants for their evaluation. In the next future we also plan to generalize the pro- Restricting to RT mutants with two mutated amino posed approach to jointly generate sets of related acids, this would imply testing more than a hun- mutations shifting the focus from the generation of dred million candidate mutants. At the same time single amino acid mutations to mutants with multi- this simple ILP approach cannot attain the same ple mutations. accuracy levels of a pure statistical approach. We are currently investigating more sophisticated sta- tistical relational learning approaches for improving the accuracy of the learned models and for assign- Acknowledgements EC and TL acknowledge the support of the F.W.O. (Bel- ing weights to the clauses to be used for selecting gium) and the F.R.S.-F.N.R.S. (Belgium). ST and AP the most relevant generated mutations. These can acknowledge the support of the Italian Ministry of Uni- be used as a pre-filtering stage, producing candi- versity and Research under grant PRIN 2009LNP494 date mutants to be further analysed by complex sta- (Statistical Relational Learning: Algorithms and Appli- tistical approaches, and additional tools evaluating, cations). for instance, their stability. An additional direction to refine our predictions consists of jointly learning models of resistance to di↵erent drugs (e.g. NNRTI References and NRTI), possibly further refining the joint mod- 1. Cao ZW, Han LY, Zheng CJ, Ji ZL, Chen X, Lin HH, els on a per-class basis. On a predictive (rather than Chen YZ: Computer prediction of drug resistance generative) task, this was shown [21] to provide im- mutations in proteins REVIEWS. 
provements over learning distinct per-drug models. The approach is not restricted to learning drug-resistance mutations in viruses. More generally, it can be applied to learn mutants having certain properties of interest, e.g. improved or more specific activity of an enzyme with respect to a substrate, in a

References

1. Drug Discovery Today: BIOSILICO 2005, 10(7).
2. Rubingh DN: Protein engineering from a bioindustrial point of view. Current Opinion in Biotechnology 1997, 8(4):417–422.
3. Muggleton S, De Raedt L: Inductive Logic Programming: Theory and Methods. University Computing 1994:629–682.
4. Cilia E, Passerini A: Frankenstein Junior: a relational learning approach toward protein engineering. In Proceedings of the Annotation, Interpretation and Management of Mutations (AIMM) Workshop@ECCB2010 2010.
5. Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9(Suppl 2):S6.
6. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, Westhead DR: Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics 2006, 7:405.
7. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Research 2007, 35(11):3823–3835.
8. Rhee SY, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, Shafer RW: Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(46):17355–17360.
9. Götte M, Li X, Wainberg M: HIV-1 reverse transcription: a brief overview focused on structure-function relationships among molecules involved in initiation of the reaction. Archives of Biochemistry and Biophysics 1999, 365(2):199–210.
10. Shafer R: Rationale and Uses of a Public HIV Drug-Resistance Database. Journal of Infectious Diseases 2006, 194(Suppl 1):S51–S58.
11. De Clercq E: HIV inhibitors targeted at the reverse transcriptase. AIDS Research and Human Retroviruses 1992, 8(2):119–134.
12. Spence R, Kati W, Anderson K, Johnson K: Mechanism of inhibition of HIV-1 reverse transcriptase by nonnucleoside inhibitors. Science 1995, 267(5200):988–993.
13. Richter L, Augustin R, Kramer S: Finding Relational Associations in HIV Resistance Mutation Data. In Proceedings of Inductive Logic Programming (ILP), Volume 9, 2009.
14. Betts M, Russell R: Amino-Acid Properties and Consequences of Substitutions. Bioinformatics for Geneticists 2003:311–342.
15. Taylor WR: The classification of amino acid conservation. Journal of Theoretical Biology 1986, 119(2):205–218.
16. Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. Journal of Molecular Biology 2002, 324:105–121.
17. Muggleton S: Learning from Positive Data. In Inductive Logic Programming Workshop 1996:358–376.
18. Džeroski S, Bratko I: Handling noise in Inductive Logic Programming. In ILP92, Report ICOT TM-1182. Edited by Muggleton S, 1992.
19. Bennett DE, Camacho RJ, Otelea D, Kuritzkes DR, Fleury H, Kiuchi M, Heneine W, Kantor R, Jordan MR, Schapiro JM, Vandamme AM, Sandstrom P, Boucher CAB, van de Vijver D, Rhee SY, Liu TF, Pillay D, Shafer RW: Drug resistance mutations for surveillance of transmitted HIV-1 drug-resistance: 2009 update. PLoS ONE 2009, 4(3):e4724.
20. Deforche K, Camacho RJ, Grossman Z, Soares MA, Van Laethem K, Katzenstein DA, Harrigan PR, Kantor R, Shafer R, Vandamme AM: Bayesian network analyses of resistance pathways against efavirenz and nevirapine. AIDS (London, England) 2008, 22(16):2107–2115.
21. Cilia E, Landwehr N, Passerini A: Relational Feature Mining with Hierarchical Multitask kFOIL. Fundamenta Informaticae 2011, 113(2):151–177.
22. Theys K, Deforche K, Beheydt G, Moreau Y, Van Laethem K, Lemey P, Camacho RJ, Rhee SY, Shafer RW, Van Wijngaerden E, Vandamme AM: Estimating the individualized HIV-1 genetic barrier to resistance using a nelfinavir fitness landscape. BMC Bioinformatics 2010, 11:409.

Algorithms

Algorithm 1 - Mutation generation algorithm.
Algorithm for the discovery of novel relevant mutations.

 1: input: background knowledge B, learned model H
 2: output: ranking R of the most relevant mutations
 3: procedure GenerateMutations(B, H)
 4:   DM ← ∅
 5:   A ← all assignments a that satisfy at least one clause ci ∈ H
 6:   for a ∈ A do
 7:     m ← mutation corresponding to the assignment a
 8:     score ← SM(m)        ▷ number of clauses ci satisfied by a
 9:     DM ← DM ∪ {(m, score)}
10:   end for
11:   R ← RankMutations(DM, B, H)        ▷ rank relevant mutations
12:   return R
13: end procedure

Figures

Figure 1 - Mutation engineering algorithm.
Schema of the mutation engineering algorithm: a dataset of mutations/mutants and the background knowledge are used to learn a hypothesis, from which a rank of novel relevant mutations is generated.

Figure 2 - Model for the resistance to NNRTI learned from Dataset 2
An example of learned hypothesis for the NNRTI task with highlighted examples covered by the hypothesis clauses. The figure shows fragments of the wild-type sequence with the covered mutation positions (98, 112, 181, 190, 398, 405, 410, 418) marked, together with the bodies of the learned clauses:

mut(A,B,C,D) AND position(C,190)
mut(A,B,C,D) AND position(C,190) AND typeaa(polar,D)
mut(A,y,C,D) AND typeaa(aliphatic,D)
mut(A,B,C,a) AND position(C,106)

Figure 3 - Mean recall trend by number of satisfied clauses (Dataset 1)
Mean recall of the generated mutations on the resistance test set mutations from Dataset 1 by varying the number of satisfied clauses. The mean recall values in orange refer to the proposed generative algorithm; the mean recall values in green refer to a random generator of mutations.
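Algorithm 1 enumerates the variable assignments covered by the learned clauses, turns each into a candidate mutation, scores it by the number of satisfied clauses, and ranks the candidates. A minimal Python sketch of this procedure, where the candidate space (pairs of position and amino acid) and the clause encoding (Boolean predicates) are illustrative assumptions, not the paper's actual implementation, and ranking is simplified to sorting by score:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_mutations(positions, hypothesis):
    """Score every candidate single-point mutation by the number of
    learned clauses it satisfies; return covered candidates ranked
    by decreasing score (a simplified stand-in for RankMutations)."""
    scored = {}
    for pos in positions:
        for aa in AMINO_ACIDS:
            m = (pos, aa)  # candidate mutation: (position, new residue)
            score = sum(1 for clause in hypothesis if clause(m))
            if score > 0:  # keep only candidates covered by >= 1 clause
                scored[m] = score
    return sorted(scored, key=lambda m: scored[m], reverse=True)

# Two toy clauses in the spirit of the learned rules, e.g.
# res_against(A,nnrti) <- mut(A,B,C,D) AND position(C,190) [AND typeaa(polar,D)]
clauses = [
    lambda m: m[0] == 190,                     # any mutation at position 190
    lambda m: m[0] == 190 and m[1] in "STNQ",  # ...to a polar residue
]
ranking = generate_mutations([103, 190], clauses)
```

Mutations satisfying both clauses (polar substitutions at position 190) come first, followed by the remaining substitutions at that position; position 103 produces no covered candidates under these two clauses.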
(Plot: mean recall (%) on the y-axis versus number of satisfied clauses per generated mutation on the x-axis; series: NNRTI, NNRTI (rand), NRTI, NRTI (rand).)

Figure 4 - Example of model learned from Dataset 1
Model learned for NRTI and NNRTI on the dataset of mutations Dataset 1. Note that in this case A is a variable that identifies a single amino acid mutation.

1. res_against(A,nrti) ← mut(A,B,C,D) AND color(red,D)
2. res_against(A,nrti) ← mut(A,B,C,D) AND color(red,D) AND color(red,B)
3. res_against(A,nrti) ← mut(A,B,C,D) AND location(7.0,C) AND mutated_residue_cp(C,medium,medium)
4. res_against(A,nrti) ← mut(A,B,C,D) AND location(7.0,C)
5. res_against(A,nrti) ← same_color_type_mut(A,B)
6. res_against(A,nrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,medium,high)
7. res_against(A,nrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,medium,small)
8. res_against(A,nnrti) ← different_color_type_mut(A,B) AND location(11.0,B)
9. res_against(A,nnrti) ← mut(A,B,C,D) AND mutated_residue_cp(C,small,small)
10. res_against(A,nnrti) ← same_color_type_mut(A,B)
11. res_against(A,nnrti) ← mut(A,v,C,D)
12. res_against(A,nnrti) ← mut(A,B,C,D) AND location(15.0,C)
13. res_against(A,nnrti) ← mut(A,B,C,i)
14. res_against(A,nnrti) ← mut(A,B,C,D) AND location(11.0,C)
15. res_against(A,nnrti) ← mut(A,B,C,D) AND color(red,D)
16. res_against(A,nnrti) ← mut(A,B,C,D) AND color(magenta,B) AND different_color_type_mut(A,C)
17. res_against(A,nnrti) ← mut(A,B,C,D) AND location(21.0,C)

Figure 5 - Mean recall trend by number of satisfied clauses (Dataset 2)
Mean recall of the generated mutations on the resistance test set mutations from Dataset 2 by varying the number of satisfied clauses. The mean recall values in orange refer to the proposed generative algorithm; the mean recall values in green refer to a random generator of mutations.
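Figures 3 and 5 report the recall of the generated mutations as the number of satisfied clauses varies. A small sketch of how such a curve can be computed, under the assumption (not stated explicitly in the captions) that at threshold k only candidates satisfying at least k clauses are retained; the mutation labels in the example are illustrative:

```python
def recall_by_satisfied_clauses(generated, test_set):
    """Recall curve: for each threshold k, the fraction of test-set
    mutations found among generated candidates whose clause count
    is >= k. `generated` maps each candidate mutation to the number
    of learned clauses it satisfies."""
    max_k = max(generated.values(), default=0)
    curve = {}
    for k in range(1, max_k + 1):
        kept = {m for m, n in generated.items() if n >= k}
        curve[k] = len(kept & test_set) / len(test_set)
    return curve

# Toy example: three generated candidates, three test mutations.
curve = recall_by_satisfied_clauses(
    {"M184V": 3, "K103N": 2, "L100I": 1},
    {"M184V", "K103N", "Y181C"},
)
```

Raising the threshold shrinks the candidate set, so recall can only stay equal or drop, which matches the downward trends in the figures.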
(Plot: mean recall (%) on the y-axis versus number of satisfied clauses per generated mutation on the x-axis; series: NNRTI, NNRTI (rand), NRTI, NRTI (rand).)

Figure 6 - Model learned from the whole Dataset 2
Model learned for NRTI and NNRTI on the whole dataset of mutants Dataset 2. Note that in this case A is a variable that identifies a mutant.

1. res_against(A,nrti) ← mut(A,B,C,D) AND position(C,67)
2. res_against(A,nrti) ← mut(A,B,C,g) AND color(red,B)
3. res_against(A,nrti) ← mut(a,B,C,D) AND typeaa(nonpolar,D)
4. res_against(A,nrti) ← mut(A,B,C,D) AND position(C,33)
5. res_against(A,nrti) ← mut(A,B,C,D) AND position(C,218)
6. res_against(A,nrti) ← mut(A,B,C,h) AND catalytic_propensity(B,high)
7. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,194)
8. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,184)
9. res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(D,high) AND typeaa(neutral,D) AND typeaa(nonpolar,B)
10. res_against(A,nrti) ← mut(A,B,C,d) AND position(C,44)
11. res_against(A,nrti) ← mut(A,B,C,r) AND typeaa(tiny,B) AND typeaa(polar,B)
12. res_against(A,nrti) ← mut(A,d,C,D) AND catalytic_propensity(D,medium)
13. res_against(A,nrti) ← mut(A,d,C,D) AND catalytic_propensity(D,medium) AND typeaa(polar,D)
14. res_against(A,nrti) ← mut(A,B,C,k) AND position(C,203)
15. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,190)
16. res_against(A,nnrti) ← mut(A,B,C,a) AND position(C,106)
17. res_against(A,nnrti) ← mut(A,B,C,D) AND same_typeaa(D,B,tiny) AND same_typeaa(D,B,nonpolar)
18. res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(D,high) AND typeaa(aromatic,B) AND same_typeaa(D,B,neutral)
19. res_against(A,nnrti) ← mut(A,B,C,n) AND position(C,103)
20. res_against(A,nnrti) ← mut(A,B,C,D) AND catalytic_propensity(B,high) AND typeaa(neutral,B) AND same_typeaa(D,B,polar)
21. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,177) AND catalytic_propensity(D,medium) AND same_type_mut_t(A,C,polar)
22. res_against(A,nnrti) ← mut(A,B,C,n) AND position(C,238)
23. res_against(A,nnrti) ← mut(A,B,C,D) AND position(C,181)
24. res_against(A,nnrti) ← mut(A,y,C,D) AND typeaa(small,D)

Tables

Table 1 - Background knowledge predicates
Summary of the background knowledge facts and rules. MutID is a mutation or a mutant identifier depending on the type of the learning problem.

aa(Pos,AA): indicates a residue in the wild type sequence
mut(MutID,AA,Pos,AA1): indicates a mutation: mutation or mutant identifier, position and amino acids involved, before and after the substitution
res_against(MutID,Drug): indicates whether a mutation or mutant is resistant to a certain drug
color(Color,AA): indicates the coloring group of a natural amino acid
typeaa(T,AA): indicates the type (e.g. aliphatic, charged, aromatic, polar) of a natural amino acid
same_color_type(R1,R2): indicates whether two residues belong to the same coloring group
same_typeaa(R1,R2,T): indicates whether two residues are of the same type T
same_color_type_mut(MutID,Pos): indicates a mutation to a residue of the same coloring group
different_color_type_mut(MutID,Pos): indicates a mutation changing the coloring group of the residue
same_type_mut_t(MutID,Pos,T): indicates a mutation to a residue of the same type T
different_type_mut_t(MutID,Pos): indicates a mutation changing the type of the residue
close_to_site(Pos): indicates whether a specific position is close to a binding or active site, if any
location(L,Pos): indicates in which fragment of the primary sequence the amino acid is located
catalytic_propensity(AA,CP): indicates whether an amino acid has a high, medium or low catalytic propensity
mutated_residue_cp(Rw,Pos,Rm,CPold,CPnew): indicates how, in a mutated position, the catalytic propensity has changed (e.g. from low to high)

Table 2 - Results (Dataset 1)
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random.
The average recall has been computed for each of the learning tasks over the 30 splits by averaging recall over 30 repeated runs of the two algorithms. Results of a paired Wilcoxon test (α = 0.01) on the statistical significance of the performance differences are also reported: a black bullet (•) indicates a statistically significant improvement of our algorithm over the random generator. The mean number of generated mutations over the 30 splits and the number of test mutations are reported on the side.

         Mean recall % on 30 splits     Mean n. generated    N. test
         Algorithm   Random Generator   mutations            mutations
NNRTI    86 •        58                 5201                 17
NRTI     55 •        46                 5548                 28

Table 3 - Results (Dataset 2)
Statistical comparison of the performance of the proposed algorithm with an algorithm generating mutations at random. The average recall has been computed for each of the learning tasks over the 30 splits. Results of a paired Wilcoxon test (α = 0.01) on the statistical significance of the performance differences are also reported: a black bullet (•) indicates a statistically significant improvement of our algorithm over the random generator. The mean number of generated mutations and the mean number of test mutations over the 30 splits are reported on the side.

         Mean recall % on 30 splits     Mean n. generated    Mean n. test
         Algorithm   Random Generator   mutations            mutations
NNRTI    17 •        1                  236                  26
NRTI      7 •        3                  420                  40
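The relational predicates of Table 1 can be grounded directly over amino-acid property tables. A small sketch of one possible encoding of two of them; the colouring groups and amino-acid types shown are partial illustrative examples, not the full taxonomy used in the paper:

```python
# Partial, illustrative property tables (not the paper's actual groupings).
COLOR = {"D": "red", "E": "red", "K": "blue", "R": "blue"}
TYPEAA = {
    "D": {"polar", "charged"},
    "A": {"nonpolar", "aliphatic"},
    "V": {"nonpolar", "aliphatic"},
    "S": {"polar", "tiny"},
}

def same_color_type(r1, r2):
    """same_color_type(R1,R2): both residues in the same coloring group."""
    return COLOR.get(r1) is not None and COLOR.get(r1) == COLOR.get(r2)

def same_typeaa(r1, r2, t):
    """same_typeaa(R1,R2,T): both residues are of the same type T."""
    return t in TYPEAA.get(r1, set()) and t in TYPEAA.get(r2, set())
```

For instance, aspartate and glutamate share the "red" (acidic) coloring group, while alanine and valine share the aliphatic type.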
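Tables 2 and 3 compare per-split recall values of the two algorithms with a paired Wilcoxon test. A stdlib-only sketch of the signed-rank statistic W+ underlying that test (without the p-value machinery a full test needs; the per-split recall values in the example are made up for illustration):

```python
def wilcoxon_signed_rank(xs, ys):
    """Paired Wilcoxon signed-rank statistic W+: the sum of the ranks of
    the positive differences x - y. Zero differences are dropped and tied
    absolute differences receive average ranks, as in the standard test."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return sum(r for d, r in zip(diffs, ranks) if d > 0)

# Hypothetical per-split recalls: proposed algorithm vs. random generator.
w_plus = wilcoxon_signed_rank([0.9, 0.8, 0.7, 0.6], [0.5, 0.6, 0.7, 0.65])
```

In practice a library routine such as scipy.stats.wilcoxon would also supply the p-value compared against α = 0.01 in the tables.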