=Paper= {{Paper |id=None |storemode=property |title=Machine learning applications in bioinformatics |pdfUrl=https://ceur-ws.org/Vol-990/paper1.pdf |volume=Vol-990 |dblpUrl=https://dblp.org/rec/conf/itat/Klema12 }} ==Machine learning applications in bioinformatics== https://ceur-ws.org/Vol-990/paper1.pdf
                  Machine learning applications in bioinformatics

                                                        Jiří Kléma

              Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University
                                  Technická 2, 166 27, Prague, Czech Republic
                                           klema@labe.felk.cvut.cz,
                      WWW home page: http://labe.felk.cvut.cz/~klema/klema.html

Abstract. Bioinformatics is a field of study dealing with methods for storing, retrieving and analyzing gene- and protein-oriented biological data. High-throughput technologies such as DNA sequencing or microarrays allow researchers to obtain large volumes of heterogeneous and mutually interacting data. Analysis and understanding of these data provide a natural application field for machine learning algorithms. At the same time, bioinformatics is a scientific branch of such analytical complexity, data variety and abundance that it motivates further development of specialized learning algorithms such as co-clustering or multiple sequence alignment. This paper provides a brief overview of the topics and works discussed during my talk on machine learning applications in bioinformatics. The talk starts with a preview of fundamental bioinformatics analytical tasks solved by machine learning algorithms, mentioning a few success stories. The second part summarizes the recent bioinformatics research carried out in my home research group, the Intelligent Data Analysis group of the Czech Technical University.

1    Analytical bioinformatics tasks

A complete overview of analytical bioinformatics tasks that can be, and are being, solved by machine learning (ML) algorithms is beyond the scope of this short summary. [1] is a textbook that provides an introduction to the most important problems in computational biology and a unified treatment of the ML methods for solving these problems. The book is self-contained; a large part of it focuses on the principles of fundamental ML algorithms. A relevant concise review appeared in [2]; an updated version was presented in [3]. The reviews distinguish four principal classes of tasks. Firstly, a large group of bioinformatics problems can be posed as classification tasks. Examples include genome annotation (gene finding and the search for DNA binding sites of proteins), gene function prediction and protein secondary structure prediction. Secondly, clustering can be used to learn functional similarity from gene expression data or to form phylogenetic trees. Thirdly, probabilistic graphical models can serve for modelling of DNA sequences in genomics or for inference of genetic networks in systems biology. Last but not least, optimization algorithms have been proposed to solve the multiple sequence alignment problem, and they also appear in simplified models of protein folding.

1.1  Success stories and interactions

The bioinformatics tool with the largest impact is undoubtedly the Basic Local Alignment Search Tool (BLAST) and its successors [4], which search a large sequence database with a query sequence. The NCBI server that provides this service, based on heuristic methods for sequence database searching, handles more than half a million queries a day, and the paper [4] introducing the improved PSI-BLAST has tens of thousands of citations. Another success story is an early case study on predictive classification from gene expression data [5]. The study proved the feasibility of cancer classification based solely on gene expression monitoring. Although later studies showed that this positive result cannot by any means be taken for granted, molecular classification has since been an option in disease diagnostics.

Bioinformatics directly motivates some cutting-edge ML projects such as automated hypothesis generation and learning of optimal workflows. [6] reports the development of the Robot Scientist “Adam”, which autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation. One of the main objectives of the ongoing European ML and data mining project e-LICO [7, 8] is to implement an intelligent data mining assistant that takes in user specifications of the learning task and the available data, plans a methodologically correct learning process, and suggests workflows that the user can execute to achieve the prespecified objectives. Bioinformatics is its major application area.

2    IDA bioinformatics research topics

One of our main research topics is learning from gene expression data driven by background knowledge [9]. Mining patterns from gene expression data represents
an alternative to clustering [10]. Clustering provides the most straightforward and traditional approach to obtaining co-expressed genes. However, a typical group of genes shares an activation pattern only under specific experimental conditions. Local methods such as pattern mining can identify exactly the sets of genes displaying a specific expression characteristic in a set of situations. The bottleneck of this type of analysis is twofold: computational cost and an overwhelming number of candidate patterns that can hardly be exploited further. A timely application of background knowledge available in literature databases, gene ontologies and other sources can help to focus on the most plausible patterns only. Molecular classification of biological samples based on their gene-expression profiles is a natural learning task with immediate practical uses. Nevertheless, molecular classifiers based solely on gene expression in most cases cannot be considered useful decision-making or decision-supporting tools. As in pattern mining, recent efforts in the field of molecular classification aim to employ background knowledge. The idea is to extract features that correspond to functionally related gene sets instead of the individual genes (or probesets) whose expression is available in the original expression data [11, 12].

The approaches described in the previous paragraph employ the available structural genomic knowledge to improve the analysis of gene expression data. We also studied several methods to create such knowledge from collections of free biomedical texts, namely research papers and their short summaries [13]. [14] proposes a novel ball-histogram approach to DNA-binding propensity prediction of proteins.

Last but not least, the IDA group cooperates with several biological institutes and labs. For example, [15] shows an application of the set-level approach discussed above to the particular domain of respirable ambient air particulate matter; the principal research partner was the Department of Genetic Ecotoxicology of the Czech Academy of Sciences. [16] evaluates differences in the intragraft transcriptome after successful induction therapy using two rabbit antithymocyte globulins; the partner was the Department of Nephrology, Transplant Center, Institute for Clinical and Experimental Medicine.

References

 1. P. Baldi, S. Brunak: Bioinformatics: the machine learning approach, 2nd edition, MIT Press, 2001, 452.
 2. P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, V. Robles: Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1), 2005, 86–112.
 3. I. Inza, B. Calvo, R. Armananzas, E. Bengoetxea, P. Larranaga, J. A. Lozano: Machine learning: an indispensable tool in bioinformatics. Methods Mol. Biol., 593, 2010, 25–48.
 4. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 1997, 3389–3402.
 5. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. , J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286 (5439), 1999, 531–537.
 6. R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, A. Clare: The automation of science. Science, 324 (5923), 2009, 85–89.
 7. e-lico project: An e-laboratory for interdisciplinary collaborative research in data mining and data-intensive science, http://www.e-lico.eu/, August 2012.
 8. M. Hilario, P. Nguyen, H. Do, A. Woznica, A. Kalousis: Ontology-based meta-mining of knowledge discovery workflows. In Jankowski, N., Duch, W., Grabczewski, K., Meta-Learning in Computational Intelligence, Springer, 2011, 273–316.
 9. J. Klema: Learning from heterogeneous genomic data. FEE CTU, habilitation thesis, to appear.
10. J. Klema, S. Blachon, A. Soulet, B. Cremilleux, O. Gandrilon: Constraint-based knowledge discovery from SAGE data. In Silico Biology, 8, 0014, 2008.
11. M. Holec, J. Klema, F. Zelezny, J. Tolar: Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinformatics, 13 (10), 2012, S15.
12. M. Krejnik, J. Klema: Empirical evidence of the applicability of functional clustering through gene expression classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(3), 2012, 788–798.
13. M. Plantevit, T. Charnois, J. Klema, C. Rigotti, B. Cremilleux: Combining sequence and itemset mining to discover named entities in biomedical texts: A new type of pattern. International Journal of Data Mining, Modelling and Management, 1(2), 2009, 119–148.
14. A. Szaboova, O. Kuzelka, F. Zelezny, J. Tolar: Prediction of DNA-binding propensity of proteins by the ball-histogram method using automatic template search. BMC Bioinformatics, 13 (10), 2012, 3.
15. H. Libalova, K. Uhlirova, J. Klema, M. Machala, R. Sram, M. Ciganek, J. Topinka: Global gene expression changes in human embryonic lung fibroblasts induced by organic extracts from respirable air particles. Particle and Fibre Toxicology, 9(1), 2012.
16. M. Urbanova, I. Brabcova, E. Girmanova, F. Zelezny, O. Viklicky: Differential regulation of the nuclear factor-kappa-B pathway by rabbit antithymocyte globulins in kidney transplantation. Transplantation, 93(6), 2012, 589–96.
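
To illustrate the set-level idea from Section 2 (features that correspond to functionally related gene sets rather than to individual genes), the following is a minimal sketch only. It assumes that a gene set is summarized by the arithmetic mean of its members' expression values, which is just one of several possible aggregation choices and is not claimed to be the method used in [11, 12]. The function name `set_level_features`, the gene identifiers and the gene sets are hypothetical.

```python
from statistics import mean


def set_level_features(sample, gene_sets):
    """Aggregate one sample's gene-level expression into gene-set features.

    sample:    dict mapping gene identifier -> expression value
    gene_sets: dict mapping set name -> list of gene identifiers
    Genes missing from the sample are skipped; a set with no measured
    gene gets the value None.
    Assumption: the set-level value is the mean of the member genes.
    """
    features = {}
    for name, genes in gene_sets.items():
        values = [sample[g] for g in genes if g in sample]
        features[name] = mean(values) if values else None
    return features


# Hypothetical toy data, not taken from any of the cited studies.
gene_sets = {
    "set_A": ["g1", "g2", "g3"],
    "set_B": ["g3", "g4"],
    "set_C": ["g9"],  # no member gene is measured in the sample
}
sample = {"g1": 2.0, "g2": 4.0, "g3": 6.0, "g4": 10.0}

features = set_level_features(sample, gene_sets)
print(features)
```

The resulting dictionary has one entry per gene set, so a classifier trained on it works with far fewer features than the number of individual genes, and each feature can be read in terms of the function shared by the set.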