Machine learning applications in bioinformatics

Jiří Kléma

Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague, Czech Republic

Bioinformatics is a field of study dealing with methods for storing, retrieving and analyzing gene- and protein-oriented biological data. High-throughput technologies like DNA sequencing or microarrays allow researchers to obtain large volumes of heterogeneous and mutually interacting data. Analysis and understanding of these data provide a natural application field for machine learning algorithms. At the same time, bioinformatics is a scientific branch of such analytical complexity, data variety and abundance that it motivates further development of specialized learning algorithms such as co-clustering or multiple sequence alignment. This paper provides a brief overview of the topics and works discussed during my talk on machine learning applications in bioinformatics. The talk starts with a preview of fundamental bioinformatics analytical tasks solved by machine learning algorithms, mentioning a few success stories. The second part summarizes the recent bioinformatics research carried out in my home research group, the Intelligent Data Analysis group of Czech Technical University.

1 Analytical bioinformatics tasks

A complete overview of analytical bioinformatics tasks solvable and being solved by machine learning (ML) algorithms is beyond the scope of this short summary. [1] is a textbook that provides an introduction to the most important problems in computational biology and a unified treatment of the ML methods for solving these problems. The book is self-contained, and a large part of it focuses on the principles of fundamental ML algorithms. A relevant concise review appeared in [2]; a recent update was presented in [3]. The reviews distinguish four principal classes of tasks. Firstly, a large group of bioinformatics problems can be posed as classification tasks. Examples include genome annotation (including gene finding and searching for DNA binding sites with proteins), gene function prediction and protein secondary structure prediction. Secondly, clustering can be used to learn functional similarity from gene expression data or to form phylogenetic trees. Thirdly, probabilistic graphical models can serve for modelling of DNA sequences in genomics or for inference of genetic networks in systems biology. Last but not least, optimization algorithms have been proposed to solve the multiple sequence alignment problem, and they also appear in simplified models of protein folding.

1.1 Success stories and interactions

The bioinformatics tool with the largest impact is undoubtedly the Basic Local Alignment Search Tool (BLAST) and its successors [4] for searching a large sequence database with a query sequence. The NCBI server that provides the service with heuristic methods for sequence database searching handles more than half a million queries a day, and the paper [4] introducing the improved PSI-BLAST has tens of thousands of citations. Another success story is an early case study on predictive classification from gene expression data [5]. The study proved the feasibility of cancer classification based solely on gene expression monitoring. Although later studies showed that this positive result cannot by any means be taken for granted, molecular classification has since been an option in disease diagnostics.

Bioinformatics directly motivates some cutting-edge ML projects such as automated hypothesis generation and learning of optimal workflows. [6] reports the development of Robot Scientist “Adam”, which autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation. One of the main objectives of the ongoing European ML and data mining project e-LICO [7, 8] is to implement an intelligent data mining assistant that takes in user specifications of the learning task and the available data, plans a methodologically correct learning process, and suggests workflows that the user can execute to achieve the prespecified objectives. Bioinformatics is the major application area.

2 IDA bioinformatics research topics

One of our main research topics is learning from gene expression data driven by background knowledge [9]. Mining patterns from gene expression data represents an alternative to clustering [10]. Clustering provides the most straightforward and traditional approach to obtaining co-expressed genes. However, a typical group of genes shares an activation pattern only under specific experimental conditions. Local methods such as pattern mining can identify exactly the sets of genes displaying a specific expression characteristic in a set of situations. The main bottleneck of this type of analysis is twofold: computational costs and an overwhelming number of candidate patterns, which can hardly be further exploited. A timely application of background knowledge available in literature databases, gene ontologies and other sources can help to focus on the most plausible patterns only. Molecular classification of biological samples based on their gene-expression profiles is a natural learning task with immediate practical uses. Nevertheless, molecular classifiers based solely on gene expression in most cases cannot be considered useful decision-making or decision-supporting tools. Similarly to the domain of pattern mining, recent efforts in the field of molecular classification aim to employ background knowledge. The idea is to extract features that correspond to functionally related gene sets instead of the individual genes or, more precisely, the probesets whose expression is available in the original expression data [11, 12].

The work described in the previous paragraph employs the available structural genomic knowledge to improve the analysis of gene expression data. We also studied several methods to create such knowledge from collections of free biomedical texts, namely research papers and their short summaries [13]. [14] proposes a novel ball-histogram approach to DNA-binding propensity prediction of proteins.

Last but not least, the IDA group cooperates with several biological institutes and labs. To exemplify, [15] shows an application of the set-level approach discussed above to the particular domain of respirable ambient air particulate matter; the principal research partner was the Department of Genetic Ecotoxicology of the Czech Academy of Sciences. [16] evaluates differences in the intragraft transcriptome after successful induction therapy using two rabbit antithymocyte globulins; the partner was the Department of Nephrology, Transplant Center, Institute for Clinical and Experimental Medicine.

References

1. P. Baldi, S. Brunak: Bioinformatics: the machine learning approach, 2nd edition, MIT Press, 2001, 452.
2. P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, V. Robles: Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1), 2005, 86–112.
3. I. Inza, B. Calvo, R. Armananzas, E. Bengoetxea, P. Larranaga, J. A. Lozano: Machine learning: an indispensable tool in bioinformatics. Methods Mol. Biol. 593, 2010, 25–48.
4. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 1997, 3389–3402.
5. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286 (5439), 1999, 531–537.
6. R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, A. Clare: The automation of science. Science 324 (5923), 2009, 85–89.
7. e-lico project: An e-laboratory for interdisciplinary collaborative research in data mining and data-intensive science, http://www.e-lico.eu/, August 2012.
8. M. Hilario, P. Nguyen, H. Do, A. Woznica, A. Kalousis: Ontology-based meta-mining of knowledge discovery workflows. In Jankowski, N., Duch, W., Grabczewski, K., Meta-Learning in Computational Intelligence, Springer, 2011, 273–316.
9. J. Klema: Learning from heterogeneous genomic data. FEE CTU, habilitation thesis, to appear.
10. J. Klema, S. Blachon, A. Soulet, B. Cremilleux, O. Gandrilon: Constraint-based knowledge discovery from SAGE data. In Silico Biology, 8, 0014, 2008.
11. M. Holec, J. Klema, F. Zelezny, J. Tolar: Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinformatics, 13 (10), 2012, S15.
12. M. Krejnik, J. Klema: Empirical evidence of the applicability of functional clustering through gene expression classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(3), 2012, 788–798.
13. M. Plantevit, T. Charnois, J. Klema, C. Rigotti, B. Cremilleux: Combining sequence and itemset mining to discover named entities in biomedical texts: A new type of pattern. International Journal of Data Mining, Modelling and Management, 1(2), 2009, 119–148.
14. A. Szaboova, O. Kuzelka, F. Zelezny, J. Tolar: Prediction of DNA-binding propensity of proteins by the ball-histogram method using automatic template search. BMC Bioinformatics, 13 (10), 2012, 3.
15. H. Libalova, K. Uhlirova, J. Klema, M. Machala, R. Sram, M. Ciganek, J. Topinka: Global gene expression changes in human embryonic lung fibroblasts induced by organic extracts from respirable air particles. Particle and Fibre Toxicology, 9(1), 2012.
16. M. Urbanova, I. Brabcova, E. Girmanova, F. Zelezny, O. Viklicky: Differential regulation of the nuclear factor-kappa-B pathway by rabbit antithymocyte globulins in kidney transplantation. Transplantation 93(6), 2012, 589–96.
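
The set-level idea from Section 2, extracting features that correspond to functionally related gene sets rather than individual genes, could be sketched roughly as follows. This is only a minimal illustration and not the procedure evaluated in [11, 12]: plain averaging is assumed here as one simple aggregation, and the function name `set_level_features`, the gene names and the pathway names are all hypothetical.

```python
from statistics import mean


def set_level_features(expression, gene_sets):
    """Aggregate gene-level expression into one feature per gene set.

    expression: dict mapping gene name -> list of values, one per sample
    gene_sets:  dict mapping gene-set name -> iterable of member gene names
    Returns a dict mapping gene-set name -> list of per-sample averages.
    Genes absent from `expression` are skipped; a set with no measured
    member is dropped (an assumption made for this sketch).
    """
    features = {}
    for set_name, members in gene_sets.items():
        measured = [expression[g] for g in members if g in expression]
        if not measured:
            continue
        # zip(*measured) walks over samples; average the member genes in each
        features[set_name] = [mean(values) for values in zip(*measured)]
    return features


# Toy data invented for illustration: 4 genes measured in 3 samples.
expression = {
    "geneA": [2.0, 4.0, 6.0],
    "geneB": [4.0, 4.0, 2.0],
    "geneC": [1.0, 1.0, 1.0],
    "geneD": [3.0, 5.0, 7.0],
}
gene_sets = {
    "pathway_1": ["geneA", "geneB"],
    "pathway_2": ["geneC", "geneD", "geneX"],  # geneX is not measured
}

features = set_level_features(expression, gene_sets)
print(features)
```

The resulting per-set vectors can replace the gene-level columns as classifier inputs. Which aggregation to use (mean, median, or a more elaborate statistic) is a design choice that this sketch leaves open.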