<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine learning applications in bioinformatics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Kléma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2</institution>
          ,
          <addr-line>166 27, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Bioinformatics is a field of study dealing with methods for storing, retrieving and analyzing gene- and protein-oriented biological data. High-throughput technologies like DNA sequencing or microarrays allow researchers to obtain large volumes of heterogeneous and mutually interacting data. Analysis and understanding of these data provides a natural application field for machine learning algorithms. At the same time, bioinformatics is a scientific branch of such analytical complexity, data variety and abundance that it motivates further development of specialized learning algorithms such as co-clustering or multiple sequence alignment. This paper provides a brief overview of the topics and works discussed during my talk on machine learning applications in bioinformatics. The talk starts with a preview of fundamental bioinformatics analytical tasks solved by machine learning algorithms, mentioning a few success stories. The second part summarizes the recent bioinformatics research carried out in my home research group, the Intelligent Data Analysis group of Czech Technical University.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Analytical bioinformatics tasks</title>
      <p>A complete overview of the analytical bioinformatics tasks that can be, and are being, solved by machine learning (ML) algorithms is beyond the scope of this short summary. The textbook [1] provides an introduction to the most important problems in computational biology and a unified treatment of the ML methods for solving these problems. The book is self-contained, and a large part of it focuses on the principles of fundamental ML algorithms. A relevant concise review appeared in [2], and its recent update was presented in [3]. The reviews distinguish four principal classes of tasks. Firstly, a large group of bioinformatics problems can be posed as classification tasks. Examples include genome annotation, covering gene finding and the search for protein binding sites in DNA, as well as gene function prediction and protein secondary structure prediction. Secondly, clustering can be used to learn functional similarity from gene expression data or to form phylogenetic trees. Thirdly, probabilistic graphical models can serve for modelling of DNA sequences in genomics or for inference of genetic networks in systems biology. Last but not least, optimization algorithms have been proposed to solve the multiple sequence alignment problem, and they also appear in simplified models of protein folding.</p>
      <sec id="sec-1-1">
        <title>1.1 Success stories and interactions</title>
        <p>The bioinformatics tool with the largest impact is undoubtedly the Basic Local Alignment Search Tool (BLAST) and its successors [4], which search a large sequence database against a query sequence. The NCBI server that provides this heuristic sequence database search handles more than half a million queries a day, and the paper [4] introducing the improved PSI-BLAST has tens of thousands of citations. Another success story is an early case study on predictive classification from gene expression data [5]. The study proved the feasibility of cancer classification based solely on gene expression monitoring. Although later studies showed that this positive result cannot by any means be taken for granted, molecular classification has since been an option in disease diagnostics.</p>
        <p>Bioinformatics directly motivates some cutting-edge ML projects such as automated hypothesis generation and learning of optimal workflows. The work in [6] reports the development of the Robot Scientist “Adam”, which autonomously generated functional genomics hypotheses about the yeast Saccharomyces cerevisiae and experimentally tested these hypotheses by using laboratory automation. One of the main objectives of the ongoing European ML and data mining project e-LICO [7, 8] is to implement an intelligent data mining assistant that takes in user specifications of the learning task and the available data, plans a methodologically correct learning process, and suggests workflows that the user can execute to achieve the prespecified objectives. Bioinformatics is the major application area of the project.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 IDA bioinformatics research topics</title>
      <p>One of our main research topics is learning from gene expression data driven by background knowledge [9]. Mining patterns from gene expression data represents an alternative to clustering [10]. Clustering provides the most straightforward and traditional approach to obtaining co-expressed genes. However, a typical group of genes shares an activation pattern only under specific experimental conditions. Local methods such as pattern mining can identify exactly the sets of genes that display a specific expression characteristic in a set of situations. This type of analysis has two main bottlenecks: computational costs and an overwhelming number of candidate patterns, which can hardly be exploited further. A timely application of background knowledge available in literature databases, gene ontologies and other sources can help to focus on the most plausible patterns only.</p>
      <p>Molecular classification of biological samples based on their gene expression profiles is a natural learning task with immediate practical uses. Nevertheless, molecular classifiers based solely on gene expression cannot in most cases be considered useful decision-making or decision-supporting tools. As in the domain of pattern mining, recent efforts in the field of molecular classification aim to employ background knowledge. The idea is to extract features that correspond to functionally related gene sets instead of the individual genes, or rather the probesets, whose expression is available in the original expression data [11, 12].</p>
      <p>The approaches described above employ the available structured genomic knowledge to improve the analysis of gene expression data. We also studied several methods for extracting such knowledge from collections of free biomedical texts, namely research papers and their short summaries [13]. The work in [14] proposes a novel ball-histogram approach to predicting the DNA-binding propensity of proteins.</p>
      <p>Last but not least, the IDA group cooperates with several biological institutes and labs. For example, [15] shows an application of the set-level approach discussed above to the particular domain of respirable ambient air particulate matter; the principal research partner was the Department of Genetic Ecotoxicology of the Czech Academy of Sciences. The study in [16] evaluates differences in the intragraft transcriptome after successful induction therapy using two rabbit antithymocyte globulins; the partner was the Department of Nephrology, Transplant Center, Institute for Clinical and Experimental Medicine.</p>
    </sec>
    <sec id="sec-3">
      <title>References</title>
      <p>1. P. Baldi, S. Brunak: Bioinformatics: the machine learning approach, 2nd edition, MIT Press, 2001, 452.</p>
      <p>2. P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, V. Robles: Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1), 2005, 86–112.</p>
      <p>3. I. Inza, B. Calvo, R. Armananzas, E. Bengoetxea, P. Larranaga, J. A. Lozano: Machine learning: an indispensable tool in bioinformatics. Methods Mol. Biol., 593, 2010, 25–48.</p>
      <p>4. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 1997, 3389–3402.</p>
      <p>5. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 1999, 531–537.</p>
      <p>6. R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, A. Clare: The automation of science. Science, 324(5923), 2009, 85–89.</p>
      <p>7. e-LICO project: An e-laboratory for interdisciplinary collaborative research in data mining and data-intensive science, http://www.e-lico.eu/, August 2012.</p>
      <p>8. M. Hilario, P. Nguyen, H. Do, A. Woznica, A. Kalousis: Ontology-based meta-mining of knowledge discovery workflows. In N. Jankowski, W. Duch, K. Grabczewski (eds.): Meta-Learning in Computational Intelligence, Springer, 2011, 273–316.</p>
      <p>9. J. Klema: Learning from heterogeneous genomic data. FEE CTU, habilitation thesis, to appear.</p>
      <p>10. J. Klema, S. Blachon, A. Soulet, B. Cremilleux, O. Gandrillon: Constraint-based knowledge discovery from SAGE data. In Silico Biology, 8, 0014, 2008.</p>
      <p>11. M. Holec, J. Klema, F. Zelezny, J. Tolar: Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinformatics, 13(10), 2012, S15.</p>
      <p>12. M. Krejnik, J. Klema: Empirical evidence of the applicability of functional clustering through gene expression classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(3), 2012, 788–798.</p>
      <p>13. M. Plantevit, T. Charnois, J. Klema, C. Rigotti, B. Cremilleux: Combining sequence and itemset mining to discover named entities in biomedical texts: A new type of pattern. International Journal of Data Mining, Modelling and Management, 1(2), 2009, 119–148.</p>
      <p>14. A. Szaboova, O. Kuzelka, F. Zelezny, J. Tolar: Prediction of DNA-binding propensity of proteins by the ball-histogram method using automatic template search. BMC Bioinformatics, 13(10), 2012, 3.</p>
      <p>15. H. Libalova, K. Uhlirova, J. Klema, M. Machala, R. Sram, M. Ciganek, J. Topinka: Global gene expression changes in human embryonic lung fibroblasts induced by organic extracts from respirable air particles. Particle and Fibre Toxicology, 9(1), 2012.</p>
      <p>16. M. Urbanova, I. Brabcova, E. Girmanova, F. Zelezny, O. Viklicky: Differential regulation of the nuclear factor-kappa-B pathway by rabbit antithymocyte globulins in kidney transplantation. Transplantation, 93(6), 2012, 589–96.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>