Extracting the Common Structure of Compounds to Induce Plant Immunity Activation using ILP

Atsushi Matsumoto,1,2 Katsutoshi Kanamori,1 Kazuyuki Kuchitsu,3 and Hayato Ohwada1

1. Department of Industrial Administration, Faculty of Science and Technology, Tokyo University of Science
2. 7415617@ed.tus.ac.jp
3. Department of Applied Biological Science, Faculty of Science and Technology, Tokyo University of Science

Abstract. Although recent studies have reported plant immunity activators, it remains difficult to find compounds that activate plant immunity. In this study, we use ILP to identify compounds that induce plant immunity activation. The proposed method predicts such compounds from their structural features. The predicted structure rules also cover the structures of known plant immunity activators. However, the relationship between plant immunity and the structure rules requires further investigation.

Keywords: ILP, Machine learning, Plant immunity activation, Virtual screening

1. INTRODUCTION

Virtual screening is an important approach in the drug discovery process, and machine learning in particular has recently received broad attention. This paper considers two methods, Support Vector Machine (SVM) [1] and Inductive Logic Programming (ILP). Both methods are often used in the drug discovery field [2], [3].

Meanwhile, decreased production of agricultural crops due to pathogenic bacteria and pests is a serious problem that has not yet been solved. Growers have dealt with this problem using fungicides and pesticides; however, it is difficult to make these agents act selectively on the target (e.g., pests and pathogens), so they may harm human health and damage the biota. In addition, long-term use of the same drug may cause the emergence of resistant bacteria, so the effect of the drug gradually decreases.
In recent years, plant immunity activators have attracted attention, based on the idea of increasing the immunity of the plant rather than directly killing pathogens and pests. However, only three types of plant immunity activator are currently marketed in Japan (Fig. 1). In addition, the mechanism of plant immunity activation is still largely unknown [4].

Fig. 1. Known plant-immunity activators

The development of plant immunity activators has been slow because screening candidate compounds takes a long time and is costly. The cause is that the number of candidate compounds is enormous, and each compound must be applied to cells to confirm its immunity-activating effect.

In this study, we predict compounds that induce plant immunity activation by using ILP to learn from compound structures. ILP can learn relational patterns in data and is therefore suitable for representing the structure of compounds. In addition, ILP returns the predicted structure as a rule, which is one of its strengths. A recent study that used ILP to predict compound structures exhibited high performance [5]. In that case, the target to which the compounds bind was known, whereas in the present study it is not. We also tried SVM, which has likewise exhibited high performance [2], for comparison with ILP.

2. PLANT IMMUNITY

Plant immunity is a defense system that protects plants from various enemies. A plant-immunity activator is a drug that activates plant immunity. The Kuchitsu group constructed a screening system that finds candidates using the amount of reactive oxygen species (ROS) generated as an index [6]. Experimental results indicated that a compound with a high ROS value is likely to be a plant-immunity activator.

3. DATASET

The dataset used in the present study consists of experimental data on plant immunity activators in Arabidopsis thaliana, compiled by the Kuchitsu group. It includes 10,000 compounds. The positive examples are the 271 high-ROS compounds, and the negative examples are the other 9,729 compounds. The negative examples were reduced to 813 compounds by random sampling, for two reasons. First, imbalanced data degrades learning accuracy. Second, computation takes a long time when there are many compounds. Therefore, 1,084 compounds (271 positive and 813 negative) were used in this study.

4. METHOD

This section describes our method, which consists of two approaches. Fig. 2 shows an overview.

Fig. 2. Method overview

The two approaches are described as follows.

4.1 ILP Approach

In the ILP approach, structural features and some numerical features of the compounds were used as background knowledge. We used GKS [7], an ILP system. We defined seven predicates to represent the features of the compounds; the arguments of each predicate are given in parentheses.

・atom(compound_name, atom_id, element)
  The type of each atom present in the compound
・bond(compound_name, atom_id, atom_id, bondtype)
  A bond between two atoms in the compound and its bond type
・Num_AromaticRings(compound_name, Num_AromaticRing)
  The number of aromatic rings in the compound
・Num_Rings(compound_name, Num_Ring)
  The number of rings in the compound
・ALogP98(compound_name, value)
  Lipid solubility of the compound
・LogD(compound_name, value)
  An indication of how lipid solubility changes with pH
・ring(compound_name, ring_id, atom_id, ringsize, ringtype)
  The ring structure to which each atom belongs, with its size and type. This predicate makes it possible to represent the connection between a ring structure and other structures.

By selecting several predicates as background knowledge, we can obtain the structure of the compound as a learning result (Table 1). Background knowledge is a set of atomic formulas of each predicate.
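As an illustration of what such atomic formulas might look like, the following Python sketch writes a small compound out as atom, bond, and ring facts using the argument orders listed above. The `Compound` class, the `to_facts` helper, the identifiers `m1`, `a1`, and `r1`, and the Prolog-style fact syntax are all assumptions made for this example; the input format actually used by GKS is not specified in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Compound:
    """Hypothetical container for one compound (not part of GKS)."""
    name: str
    atoms: Dict[str, str]              # atom_id -> element, e.g. "a1" -> "s"
    bonds: List[Tuple[str, str, str]]  # (atom_id, atom_id, bondtype)
    # each ring: (ring_id, member atom_ids, ringsize, ringtype)
    rings: List[Tuple[str, List[str], int, str]] = field(default_factory=list)


def to_facts(c: Compound) -> List[str]:
    """Return one ground fact per atom, per bond, and per ring membership."""
    facts = [f"atom({c.name}, {a}, {el})." for a, el in c.atoms.items()]
    facts += [f"bond({c.name}, {a}, {b}, {t})." for a, b, t in c.bonds]
    for ring_id, members, size, ring_type in c.rings:
        facts += [f"ring({c.name}, {ring_id}, {a}, {size}, {ring_type})."
                  for a in members]
    return facts


# Toy example (assumed data): a five-membered aromatic ring with one sulfur atom.
ids = ["a1", "a2", "a3", "a4", "a5"]
toy = Compound(
    name="m1",
    atoms=dict(zip(ids, ["s", "c", "c", "c", "c"])),
    bonds=[(ids[i], ids[(i + 1) % 5], "ar") for i in range(5)],
    rings=[("r1", ids, 5, "ar")],
)

for fact in to_facts(toy):
    print(fact)
```

Each atom that belongs to a ring gets its own ring fact, which is what lets a rule connect a ring to a neighbouring structure through a shared atom identifier.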
The atom and bond predicates are always necessary. ALogP98 and LogD were selected based on an importance calculation using the average Gini coefficient.

Table 1. Predicates selected for background knowledge

Setting name   Predicates
ILP1           atom, bond
ILP2           atom, bond, Num_AromaticRings
ILP3           atom, bond, Num_AromaticRings, Num_Rings
ILP4           atom, bond, ALogP98
ILP5           atom, bond, Num_AromaticRings, Num_Rings, ALogP98, LogD
ILP6           atom, bond, Num_AromaticRings, Num_Rings, LogD
ILP7           atom, bond, LogD, ring
ILP8           atom, bond, ring

The mode declarations given as input are shown in Fig. 3. A rule was selected if it covered at least 10 positive examples and at most 10 negative examples (the rule list in Appendix B includes rules with exactly 10 of each).

@dock,+molecular
@atom,+molecular,+atomid,#atomtype
@atom,+molecular,-atomid,#atomtype
@bond,+molecular,+atomid,+atomid,#bondtype
@bond,+molecular,-atomid,+atomid,#bondtype
@bond,+molecular,+atomid,-atomid,#bondtype
@bond,+molecular,-atomid,-atomid,#bondtype
@Num_Rings,+molecular,#Num_Ring
@Num_AromaticRings,+molecular,#Num_AromaticRing
@LogD,+molecular,#value
@ALogP98,+molecular,#value
@ring,+molecular,+ringid,+atomid,#ringsize,#ringtype
@ring,+molecular,-ringid,+atomid,#ringsize,#ringtype
@ring,+molecular,+ringid,-atomid,#ringsize,#ringtype
@ring,+molecular,-ringid,-atomid,#ringsize,#ringtype

Fig. 3. Mode declaration

4.2 SVM Approach

We also tried SVM for comparison with ILP, using 77 features for learning (Table 2). Details are given in Appendix A.

Table 2. Attributes used for SVM

Type of feature             Number of features
Related to structure        39
Related to ALogP            6
Related to size or weight   14
Related to energy           12
Other                       6
Total                       77

The cost and gamma parameters were determined by a grid search over the range from 0.0001 to 10,000, divided into 20 steps. An RBF kernel was used.

4.3 Evaluation

Ten-fold cross-validation was used in both approaches. True positives (tp), false negatives (fn), true negatives (tn), false positives (fp), accuracy, precision, recall, and F value were used for evaluation.
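The four counts determine the remaining measures. The sketch below uses the standard definitions of accuracy, precision, and recall, and takes the F value to be the harmonic mean of precision and recall; that choice is an assumption, since the formula is not stated in this paper, and the function name `evaluate` is hypothetical. The example counts are those of the ILP8 row of Table 3.

```python
def evaluate(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, precision, recall, and F value from the four counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F value assumed to be the harmonic mean of precision and recall (F1).
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_value": f_value,
    }


# Counts taken from the ILP8 row of Table 3.
metrics = evaluate(tp=165, fn=106, tn=542, fp=271)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.652
# precision: 0.378
# recall: 0.609
# f_value: 0.467
```

Under this assumption the computed values agree with the ILP8 figures reported in Table 3.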
This paper focuses in particular on tp and F value.

5. RESULTS

Table 3 shows the ILP results.

Table 3. ILP results

Setting name   tp    fn    tn    fp    Accuracy   Precision   Recall   F value
ILP1           92    179   699   114   0.730      0.447       0.339    0.386
ILP2           116   155   644   169   0.701      0.407       0.428    0.417
ILP3           127   144   605   208   0.675      0.379       0.469    0.419
ILP4           88    183   712   101   0.738      0.466       0.325    0.383
ILP5           131   140   572   241   0.649      0.352       0.483    0.407
ILP6           139   132   568   245   0.652      0.362       0.513    0.424
ILP7           165   106   523   290   0.635      0.363       0.609    0.455
ILP8           165   106   542   271   0.652      0.378       0.609    0.467

Table 4 shows a comparison of the best SVM result and the best ILP result.

Table 4. Comparison of the best of SVM and the best of ILP

Approach   tp    fn    tn    fp    Accuracy   Precision   Recall   F value
SVM        123   148   703   110   0.762      0.528       0.454    0.488
ILP8       165   106   542   271   0.652      0.378       0.609    0.467

Table 5 shows the best rules obtained with ILP8. A good rule covers many positive examples and few negative examples. The full list of rules obtained with ILP8 is given in Appendix B.

Table 5. Rules for compound structure

Rule number   Positive   Negative   Interpretation
Rule1         27         10         A C atom has a single bond with an aromatic ring.
Rule2         20         8          There is an aromatic ring containing an S atom, and a C atom has a double bond with something.
Rule3         22         10         Two aromatic rings are bonded to each other, and each aromatic ring has a single bond.
Rule4         15         3          An aromatic ring containing an N atom and an aromatic ring consisting of 5 atoms are bonded to each other.
Rule5         14         2          An aromatic ring containing an S atom and another aromatic ring are bonded to each other.

6. CONCLUSION

Although the SVM F value slightly exceeded that of ILP, the ILP tp value greatly exceeded that of SVM. For virtual screening, it is very important to reduce the number of positive examples that are misclassified. The results of this study indicate that structural features of compounds are useful in predicting immunity activation.

Using the ring structure as background knowledge yielded better results than not using it.
Therefore, the ring structure is considered an important factor in plant immunity activation. Comparing the rules learned by ILP with the known plant immunity activators showed that Rule 2 holds for all three compounds. Rules that describe structures different from those of the known plant immunity activators need further investigation.

In this study, we were able to predict a partial structure that is present in all known plant-immunity activators. In addition, rules whose relationship to immunity activation is not yet known were obtained. To improve prediction accuracy, it will be essential to improve the background knowledge.

References

1. V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, NY, USA, 1995.
2. Tadasuke Ito, Hayato Ohwada, and Shin Aoki, Combining two machine learning methods for predicting protein-ligand docking using structure and physiochemical properties, Proc. of the 7th International Conference on Bioinformatics and Computational Biology, pp. 19-24, March 2015.
3. A. Srinivasan, S. H. Muggleton, R. D. King, and M. J. E. Sternberg, Mutagenesis: ILP experiments in a non-determinate biological domain, Proceedings of the Fourth International Inductive Logic Programming Workshop, 1994.
4. Yoshiteru Noutoshi, Masateru Okazaki, Tatsuya Kida, Yuta Nishina, Yoshihiko Morishita, Takumi Ogawa, Hideyuki Suzuki, Daisuke Shibata, Yusuke Jikumaru, Atsushi Hamada, Yuji Kamiya, and Ken Shirasu, Novel Plant Immune-Priming Compounds Identified via High-Throughput Chemical Screening Target Salicylic Acid Glucosyltransferases in Arabidopsis, The Plant Cell, vol. 24, pp. 3795-3804, 2012.
5. Jose C. A. Santos, Houssam Nassif, David Page, Stephen H. Muggleton, and Michael J. E. Sternberg, Automated identification of protein-ligand interaction features using Inductive Logic Programming: a hexose binding case study, BMC Bioinformatics, 13:162, 2012.
6. T. Higashi, T. Kurusu, S. Hasegawa, and K. Kuchitsu, Dynamic intracellular reorganization of cytoskeletons and the vacuole in defense responses and hypersensitive cell death in plants, Journal of Plant Research, vol. 124, no. 3, pp. 315-324, 2011.
7. Hayato Ohwada, Hiroyuki Nishiyama, and Fumio Mizoguchi, Concurrent execution of optimal hypothesis search for inverse entailment, Lecture Notes in Artificial Intelligence, Springer-Verlag, No. 1866, Vol. 4, pp. 165-173, 2000.

Appendix: A

Table 6 lists the features used in the SVM approach. Feature names follow Discovery Studio.

Table 6. Feature list in SVM

Category C (related to structure):
HBA_Count, HBD_Count, NPlusO_Count, Num_AromaticBonds, Num_AromaticRings, Num_AtomClasses, Num_Atoms, Num_Bonds, Num_BridgeBonds, Num_BridgeHeadAtoms, Num_ChainAssemblies, Num_Chains, Num_ExplicitAtoms, Num_ExplicitBonds, Num_ExplicitHydrogens, Num_H_Acceptors, Num_H_Acceptors_Lipinski, Num_H_Donors, Num_H_Donors_Lipinski, Num_Hydrogens, Num_NegativeAtoms, Num_PositiveAtoms, Num_RingAssemblies, Num_RingBonds, Num_Rings, Num_Rings3, Num_Rings5, Num_Rings6, Num_Rings7, Num_Rings8, Num_RotatableBonds, Num_SpiroAtoms, Num_StereoAtoms, Num_StereoBonds, Num_TerminalRotomers, Num_TrueStereoAtoms, Num_UnknownPseudoStereoAtoms, Num_UnknownTrueStereoAtoms, Organic_Count

Category A (related to ALogP):
ALogP, ALogP_MR, ALogP98, ALogP98_Unknown, Apol, LogD

Category W (related to size or weight):
Molecular_3D_PolarSASA, Molecular_3D_SASA, Molecular_3D_SAVol, Molecular_FractionalPolarSASA, Molecular_FractionalPolarSurfaceArea, Molecular_Mass, Molecular_PolarSASA, Molecular_PolarSurfaceArea, Molecular_SASA, Molecular_SAVol, Molecular_SurfaceArea, Molecular_Volume, Molecular_Weight, VSA_TotalArea

Category E (related to energy):
Angle Energy, Bond Energy, CHARMm Energy, Dihedral Energy, Electrostatic Energy, Energy, Improper Energy, Initial Potential Energy, Minimized_Energy, Potential Energy, Strain_Energy, Van der Waals Energy

Category O (other):
AverageBondLength, FormalCharge, Initial RMS Gradient, Molecular_Solubility, RadOfGyration, RMS Gradient

Appendix: B

Fig. 4 shows the full list of rules obtained with ILP8.

Positive  Negative  Rule
20        8         dock(A) :- atom#1(A, B, s), atom#1(A, C, c), bond#1(A, D, C, 2), bond#2(A, B, E, ar), bond#2(A, E, F, ar)
10        2         dock(A) :- bond#3(A, B, C, 2), bond#1(A, D, C, 1), bond#2(A, B, E, 1), bond#2(A, E, F, 1)
10        8         dock(A) :- bond#3(A, B, C, ar), bond#1(A, D, C, 1), bond#1(A, E, B, ar), bond#1(A, F, D, 3)
17        6         dock(A) :- bond#3(A, B, C, 1), bond#1(A, D, B, ar), bond#1(A, E, D, 1), ring#2(A, F, E, 6, ar)
14        10        dock(A) :- bond#3(A, B, C, ar), atom(A, B, n), bond#1(A, D, B, 1), bond#2(A, C, E, ar)
14        2         dock(A) :- bond#3(A, B, C, ar), atom(A, B, s), bond#1(A, D, C, 1), bond#1(A, E, D, ar)
27        10        dock(A) :- atom#1(A, B, c), bond#1(A, C, B, ar), bond#1(A, D, C, ar), bond#2(A, B, E, 1), ring#2(A, F, E, 6, ar)
10        9         dock(A) :- atom#1(A, B, n), atom#1(A, C, n), bond#1(A, D, B, 2), bond#1(A, E, C, 1), ring#2(A, F, D, 6, not_ar)
11        3         dock(A) :- bond#3(A, B, C, ar), atom(A, C, n), bond#1(A, D, B, 1), bond#1(A, E, B, ar), bond#1(A, F, D, 2), bond#2(A, E, G, 1)
18        8         dock(A) :- atom#1(A, B, n), atom#1(A, C, o), bond#1(A, D, C, 1), bond#1(A, E, D, 1), ring#2(A, F, B, 5, ar)
11        8         dock(A) :- atom#1(A, B, c), bond#1(A, C, B, 2), bond#1(A, D, B, 1), bond#1(A, E, D, ar), bond#2(A, C, F, 1)
20        10        dock(A) :- bond#3(A, B, C, ar), bond#1(A, D, B, ar), bond#1(A, E, C, 1), bond#2(A, D, F, 1), ring#2(A, G, E, 6, ar)
10        9         dock(A) :- atom#1(A, B, n), atom#1(A, C, c), bond#1(A, D, C, ar), bond#1(A, E, D, 1), bond#2(A, B, F, ar), ring#2(A, G, E, 5, not_ar)
15        10        dock(A) :- atom#1(A, B, n), atom#1(A, C, c), bond#1(A, D, B, ar), bond#1(A, E, D, 1), bond#2(A, C, F, ar), ring#2(A, G, F, 6, not_ar)
20        10        dock(A) :- bond#3(A, B, C, ar), atom(A, B, n), bond#1(A, D, C, ar), bond#2(A, C, E, ar), bond#2(A, E, F, ar), bond#2(A, D, G, 1)
10        5         dock(A) :- bond#3(A, B, C, 1), bond#1(A, D, B, 1), bond#2(A, D, E, 1), ring#2(A, F, E, 6, not_ar), ring#2(A, G, C, 5, ar)
10        8         dock(A) :- atom#1(A, B, n), bond#2(A, B, C, 2), bond#2(A, C, D, 1), ring#2(A, E, D, 5, not_ar)
16        10        dock(A) :- atom#1(A, B, n), atom#1(A, C, n), bond#2(A, C, D, ar), bond#2(A, D, E, ar), bond#2(A, E, F, ar), ring#2(A, G, B, 6, not_ar)
22        10        dock(A) :- atom#1(A, B, c), bond#1(A, C, B, ar), bond#1(A, D, C, 1), bond#1(A, E, D, ar), bond#2(A, B, F, 1), bond#2(A, E, G, 1)
15        10        dock(A) :- bond#3(A, B, C, 1), atom(A, B, c), bond#1(A, D, C, 2), bond#2(A, C, E, 1), ring#2(A, F, E, 5, ar)
12        10        dock(A) :- bond#3(A, B, C, ar), atom(A, B, n), bond#1(A, D, C, ar), bond#1(A, E, C, 1), bond#1(A, F, D, 1), bond#1(A, G, F, ar)
11        10        dock(A) :- atom#1(A, B, c), bond#1(A, C, B, 2), bond#2(A, B, D, 1), bond#2(A, D, E, ar), bond#2(A, C, F, 1)
11        10        dock(A) :- atom#1(A, B, n), atom#1(A, C, h), bond#1(A, D, B, 2), bond#1(A, E, D, 1), bond#1(A, F, C, 1), ring#2(A, G, F, 6, not_ar)
11        10        dock(A) :- atom#1(A, B, c), atom#1(A, C, o), bond#1(A, D, B, ar), bond#1(A, E, D, 1), bond#1(A, F, E, ar), bond#2(A, C, G, ar)
12        10        dock(A) :- atom#1(A, B, n), bond#2(A, B, C, 2), bond#2(A, B, D, 1)
11        10        dock(A) :- atom#1(A, B, n), atom#1(A, C, o), bond#2(A, B, D, ar), bond#2(A, D, E, ar), ring#2(A, F, C, 6, not_ar)
10        7         dock(A) :- atom#1(A, B, o), atom#1(A, C, n), bond#1(A, D, C, ar), bond#2(A, B, E, ar), bond#2(A, D, F, 1)
15        3         dock(A) :- bond#3(A, B, C, ar), atom(A, B, n), bond#2(A, C, D, ar), bond#2(A, D, E, ar), bond#2(A, E, F, 1), ring#2(A, G, F, 5, ar)
10        5         dock(A) :- bond#3(A, B, C, 1), bond#2(A, C, D, 1), bond#2(A, B, E, 2), ring#2(A, F, E, 5, not_ar)
15        7         dock(A) :- bond#3(A, B, C, ar), atom(A, B, n), bond#1(A, D, C, ar), bond#1(A, E, D, ar), bond#1(A, F, B, 1), bond#2(A, E, G, 1)
10        6         dock(A) :- atom#1(A, B, n), atom#1(A, C, c), bond#1(A, D, C, 2), bond#2(A, D, E, 1), bond#2(A, B, F, 1), ring#2(A, G, F, 5, not_ar)
11        2         dock(A) :- bond#3(A, B, C, ar), bond#1(A, D, B, ar), bond#1(A, E, D, ar), bond#2(A, E, F, ar), bond#2(A, C, G, ar), ring#2(A, H, F, 5, not_ar)
21        10        dock(A) :- atom#1(A, B, s), bond#1(A, C, B, ar), bond#2(A, B, D, ar), bond#2(A, D, E, ar), ring#2(A, F, E, 5, ar)
10        3         dock(A) :- bond#3(A, B, C, ar), atom(A, C, n), bond#1(A, D, B, ar), bond#1(A, E, B, 1), bond#1(A, F, D, ar), bond#1(A, G, E, 2)
10        9         dock(A) :- atom#1(A, B, s), atom#1(A, C, n), bond#1(A, D, B, ar), bond#1(A, E, D, ar), bond#2(A, C, F, ar), bond#2(A, F, G, ar)
10        9         dock(A) :- bond#3(A, B, C, 1), atom(A, B, o), bond#1(A, D, B, 1), bond#1(A, E, C, ar), bond#1(A, F, D, 1), bond#1(A, G, E, ar)
10        7         dock(A) :- atom#1(A, B, o), atom#1(A, C, h), bond#1(A, D, B, ar), bond#1(A, E, C, 1), bond#2(A, E, F, 1), ring#2(A, G, F, 5, ar)
11        4         dock(A) :- atom#1(A, B, o), atom#1(A, C, f), bond#1(A, D, B, 1), bond#2(A, C, E, 1), bond#2(A, E, F, 1), bond#2(A, F, G, 1)
11        10        dock(A) :- bond#3(A, B, C, 1), atom(A, C, n), bond#1(A, D, B, 1), bond#1(A, E, C, 1), bond#1(A, F, D, 2), ring#2(A, G, E, 6, not_ar)
18        10        dock(A) :- bond#3(A, B, C, ar), atom(A, B, n), bond#1(A, D, C, ar), bond#1(A, E, D, ar), bond#2(A, C, F, ar), bond#2(A, F, G, ar)
11        10        dock(A) :- atom#1(A, B, n), atom#1(A, C, o), bond#1(A, D, C, ar), bond#1(A, E, B, 1), bond#1(A, F, E, ar)
11        8         dock(A) :- bond#3(A, B, C, 1), atom(A, B, s), bond#1(A, D, C, ar), bond#1(A, E, D, ar), ring#2(A, F, C, 5, ar)
10        8         dock(A) :- bond#3(A, B, C, 1), atom(A, C, h), bond#1(A, D, B, ar), bond#1(A, E, D, ar), bond#1(A, F, D, 1), ring#2(A, G, E, 5, ar)
10        9         dock(A) :- atom#1(A, B, c), atom#1(A, C, h), bond#1(A, D, C, 1), bond#2(A, B, E, 1), ring#2(A, F, D, 5, not_ar), ring#2(A, G, E, 5, ar)
10        8         dock(A) :- bond#3(A, B, C, 1), atom(A, B, c), bond#1(A, D, C, ar), ring#2(A, E, C, 5, ar), ring#2(A, F, D, 6, ar)
12        10        dock(A) :- bond#3(A, B, C, 1), atom(A, C, h), bond#1(A, D, B, 1), bond#2(A, D, E, 1), bond#2(A, E, F, ar)
10        8         dock(A) :- bond#3(A, B, C, 1), atom(A, C, n), bond#1(A, D, B, 1), bond#1(A, E, C, ar), bond#1(A, F, E, ar), bond#2(A, F, G, 1)
12        9         dock(A) :- atom#1(A, B, n), atom#1(A, C, o), bond#1(A, D, C, ar), bond#2(A, B, E, ar), bond#2(A, E, F, ar), ring#2(A, G, F, 5, ar)

Fig. 4. Rule list