<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mining the internet for scientific discoveries: what can automatic page tagging tell us about the study of genes?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shao Chih Kuo</string-name>
          <email>shaochih.kuo@bbsrc.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Splendiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Defoin-Platel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Rawlings</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biomathematics and Bioinformatics, Rothamsted Research</institution>
          ,
          <addr-line>Harpenden AL5 2JQ</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing Science</institution>
          ,
          <addr-line>Claremont Tower</addr-line>
          ,
          <institution>Newcastle University</institution>
          ,
          <addr-line>Newcastle Upon Tyne NE1 7RU</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The internet is not only a platform for publishing documents; it is also a provider of data and services. Increasingly, scientific disciplines are exposing their tools and data to the internet; as a result, some scientific problems have become essentially internet mining problems. We show that candidate gene prioritisation, a challenging problem in biology, is one such problem. Thus, improving our ability to mine Future Internet Knowledge Bases (FIKBs) will advance biology and other sciences.</p>
      </abstract>
      <kwd-group>
        <kwd>Bioinformatics</kwd>
        <kwd>semantic graph</kwd>
        <kwd>graph mining</kwd>
        <kwd>gene prioritisation</kwd>
        <kwd>automatic page tagging</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The internet has been, and still is, primarily concerned with publishing documents.
However, it is clearly also a provider of data and services: scientific data is
increasingly accessible on the internet, and many scientific tools are made available
via the web, as web services, web applications, or otherwise exposed to the internet.</p>
      <p>
        This is particularly evident in the Life Science domain, which has embraced
the internet as a medium for publishing data and tools. To cite a few examples, for
molecular data, the journal Nucleic Acids Research has tracked 1,230 databases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
covering a diverse range of topics, and this figure is growing at a rapid rate.
Likewise, the BioCatalogue directory tracks 1,695 publicly available web services of
bioinformatics analysis tools [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. PubMed, the web’s largest bibliography, is also life
science centred, with a historical focus on biomedical topics. Therefore the internet is,
amongst other things, a distributed knowledge base for biological studies where the
network of biological entities and their relations is described “in the web”: via
interlinked websites, or more explicitly, as RDF graphs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In light of the “internetisation” of biological data and resources, we assert that
many biological problems are de facto internet mining problems, analogous to more
conventional internet mining problems. Therefore improving our ability to mine
Future Internet Knowledge Bases (FIKBs) will certainly advance biology and other
sciences. We demonstrate this by showing how the problem of gene prioritisation is
analogous to automatic page tagging.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Gene prioritisation: a biological problem</title>
      <p>
        Finding causes that influence particular traits is an important challenge in biology.
Whether it is locating disease genes affecting humans, factors decreasing food
production for cereals, or factors increasing industrial insulin production,
the goal is fundamentally the same: to find the causes of biological traits. Often the
causes under study are genetic factors, and the methods employed to examine them
invariably rely on drawing parallels against the body of studied genes; that is to say,
given some new gene of study, the assumption is always that it works in a similar way
to closely evolutionarily related genes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        This assertion underpins the choice to focus study on model organisms, usually
organisms which lend themselves to study (e.g. by virtue of having easily observed
characteristics or by being cheap to work with) and which are representative of their
respective classes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; for example, mouse is commonly used as a model organism for
human. For studying a newly discovered gene, bioinformatics can be used first to
identify studied evolutionarily related genes by various similarity measures and then
to transpose information to the unstudied gene by assigning it putative functions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Using observations and statistical techniques, associations can be drawn between
complex traits and genomic regions. However, these regions can be large [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
the cost of testing a gene may be high, which makes exhaustively testing
every gene in the region prohibitively expensive. For the biofuels crop willow, biomass is an
important trait involved in the production of biofuels. Testing time for a single gene
for its influence on this trait ranges from months to years, and genomic regions
derived from the statistical techniques may contain several hundred genes. As
randomly testing genes is unlikely to reveal trait-affecting genes, this is a clear case
for gene prioritisation techniques.
      </p>
      <p>
        When analysing genes, whilst some useful knowledge may be gleaned from
analysing their sequences directly [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], by and large, the bulk of useful knowledge
about these genes will be derived from comparing or otherwise associating the newly
sequenced genes to the corpus of well-studied genes [9], to existing pathways [10], to
publications [11], and any other available data. These associations induce a
semantically heterogeneous graph, with each gene comparison or association method
asserting a new type of relationship between genes from the newly sequenced
organism and the wider general body of knowledge, which itself would be a semantically
heterogeneous graph (see Figure 1 for an example). Once viewed as a graph,
descriptions of complex traits of interest can then be represented as a collection of
nodes in the graph representing functional annotations, such as those from the various
biological ontologies [12] or controlled vocabularies. Finding good ways to prioritise
genes for experimental testing for complex biological traits then becomes equivalent
to ranking the overall association in a heterogeneous graph, from a set of nodes of a
gene type (genes from the newly sequenced organism) to a set of nodes of the
annotation types.
      </p>
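      <p>The ranking formulation above can be illustrated with a minimal sketch. It is not the authors' method: the node names, edge types, weights, and the scoring rule (strongest path per annotation term, summed over terms) are all assumptions chosen only to make the idea concrete.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical typed edges: (source, edge_type, target, weight in (0, 1]).
# Every node name, edge type and weight here is invented for illustration.
EDGES = [
    ("candidate:g1", "sequence_similarity", "studied:A", 0.9),
    ("candidate:g2", "sequence_similarity", "studied:B", 0.6),
    ("candidate:g3", "sequence_similarity", "studied:B", 0.2),
    ("studied:A", "annotated_with", "term:growth", 0.8),
    ("studied:A", "member_of", "pathway:P1", 1.0),
    ("pathway:P1", "annotated_with", "term:biomass", 0.5),
    ("studied:B", "annotated_with", "term:biomass", 0.7),
]


def build_graph(edges):
    """Index edges by source node; edge types are kept for inspection."""
    graph = defaultdict(list)
    for source, edge_type, target, weight in edges:
        graph[source].append((target, edge_type, weight))
    return graph


def best_path_strength(graph, start, goal, max_hops=3):
    """Strength of the strongest path from start to goal.

    A path's strength is the product of its edge weights (an assumed
    scoring rule); paths longer than max_hops are ignored.
    """
    best = 0.0
    stack = [(start, 1.0, 0)]
    while stack:
        node, strength, hops = stack.pop()
        if node == goal:
            best = max(best, strength)
            continue
        if hops == max_hops:
            continue
        for target, _edge_type, weight in graph.get(node, []):
            stack.append((target, strength * weight, hops + 1))
    return best


def prioritise(graph, candidates, trait_terms, max_hops=3):
    """Rank candidates by summed association to the trait's annotation terms."""
    scores = {
        gene: sum(
            best_path_strength(graph, gene, term, max_hops)
            for term in trait_terms
        )
        for gene in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


GRAPH = build_graph(EDGES)
RANKING = prioritise(
    GRAPH,
    ["candidate:g1", "candidate:g2", "candidate:g3"],
    ["term:biomass", "term:growth"],
)
```
      </preformat>
      <p>In this toy graph, RANKING places the candidate that reaches both annotation terms first. A real system would need edge weights calibrated per relationship type, which this sketch does not attempt.</p>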
    </sec>
    <sec id="sec-3">
      <title>3 Automatic page tagging: an analogy</title>
      <p>The pattern of the problem presented earlier is not unique to the biological domain.
As an example, the same pattern can also be found in an automatic page tagging
system that works by comparing untagged pages against an existing corpus, where
pages may be associated by relationships such as: “belonging to the same domain”,
“being written in the same language”, or “belonging to the same web ring”.
Furthermore, pages can be related by the similarity of their structure, by shared
keywords, by a shared audience of the pages, and by other page comparison methods.
These relationships may be more or less informative, but will induce a semantically
heterogeneous graph.</p>
      <p>Suppose that the pages in the existing corpus have been assigned appropriate tags
by curation; these tags may be from a controlled vocabulary, an ontology, or free text (which
could still fall into a structure such as WordNet [13]). An automatic tagging method
then might be, for each untagged page, to determine the strength of association
between that page and the existing tags, and then assign tags according to the strength
of association (with some sensible threshold).</p>
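      <p>A minimal sketch of such a tagging method follows. It uses only one of the relationships mentioned above (shared keywords), measures it with Jaccard similarity, and takes a tag's strength to be the best similarity to any curated page carrying that tag. The corpus, the similarity measure, and the threshold of 0.4 are assumptions made for illustration.</p>
      <preformat>
```python
# Hypothetical curated corpus: page name -> (keywords, curated tags).
CURATED = {
    "page_a": ({"gene", "genome", "sequence"}, {"biology"}),
    "page_b": ({"graph", "mining", "ranking"}, {"data-mining"}),
    "page_c": ({"gene", "graph", "network"}, {"biology", "data-mining"}),
}


def jaccard(a, b):
    """Jaccard similarity between two keyword sets."""
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 0.0


def tag_strengths(keywords, curated):
    """Strength of association between a page's keywords and each tag.

    A tag's strength is the highest similarity between the new page and
    any curated page carrying that tag (an assumed aggregation rule).
    """
    strengths = {}
    for page_keywords, tags in curated.values():
        similarity = jaccard(keywords, page_keywords)
        for tag in tags:
            strengths[tag] = max(strengths.get(tag, 0.0), similarity)
    return strengths


def assign_tags(keywords, curated, threshold=0.4):
    """Keep tags at or above the threshold, strongest first."""
    strengths = tag_strengths(keywords, curated)
    kept = [(tag, s) for tag, s in strengths.items() if s >= threshold]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


NEW_PAGE = {"gene", "sequence", "ranking"}
ASSIGNED = assign_tags(NEW_PAGE, CURATED)
```
      </preformat>
      <p>Here the new page shares two of four keywords with page_a, so only the tag "biology" clears the threshold. Combining several relationship types would require some weighting between them, which is left open here.</p>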
      <p>A tag-based search (by single tags or by collections of tags) of the new,
automatically tagged pages would then return an ordered list of those pages most
associated with those tags. This order can utilise the semantic distances between tags, which
makes the problem analogous to the gene prioritisation problem. An example of a
graph that this view of the tagging problem may induce is shown in Figure 1.</p>
      <p>Fig. 1. Two example graphs, candidate gene prioritisation based on studied genes
and automatic page tagging based on curated pages, sharing the same graph topology.
Bold type represents the “gene version” of this graph topology, whereas regular type
represents the “page version” of this graph topology. Where edge/node types are the
same in both cases, they are italicised.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Gene prioritisation: an internet mining problem</title>
      <p>In the previous section, we have shown how a typical bioinformatics problem, gene
prioritisation, is analogous to a typical internet mining problem. Beyond this analogy,
this and other bioinformatics studies should also be considered internet mining
problems in their own right.</p>
      <p>Each of the node types shown in the gene prioritisation example in Figure 1 is
represented by one or more internet resources, as are the edge types. For each of the
node types shown in Figure 1, one source of this type of data is given in Table 1. For
each of the edge types shown in Figure 1, one source of data for this type of
relationship, or one tool for asserting this type of relationship, is shown in Table 2.
Biological entities and relationships are encoded in a variety of forms: as documents,
as structured data, and as combinations of both. For our purposes, we only wish to
illustrate that at least one form is (and in general, many forms are) available on the
internet.</p>
      <p>Thus, heterogeneous graphs that can be used for solving the candidate gene
prioritisation problem are directly available on the internet, and along with other
scientific resources, will be part of Future Internet Knowledge Bases (FIKBs).</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Nodes</th>
              <th>Source</th>
              <th>URL</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Genes</td>
              <td>Ensembl [14]</td>
              <td>http://www.ensembl.org/index.html</td>
            </tr>
            <tr>
              <td>Pathways</td>
              <td>KEGG Pathway [15]</td>
              <td>http://www.genome.jp/kegg/pathway.html</td>
            </tr>
            <tr>
              <td>Annotation terms</td>
              <td>Gene Ontology [12]</td>
              <td>http://www.geneontology.org/</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In conclusion, with the greater availability of scientific resources on the internet, tasks
in mining scientific data will increasingly become internet mining problems.
Scientific research will increasingly rely on the design and availability of dedicated
Future Internet Knowledge Bases (FIKBs), and on the development of associated
methods to analyse them.</p>
      <p>This brings with it both new challenges and new opportunities. Whilst we have
illustrated our case with a problem in the biological domain, the principles hold more
widely for other sciences. Some scientific problems have parallels amongst existing
internet mining problems, and it is reasonable to expect that advances in techniques
for mining the future internet will provide solutions to scientific problems, and vice
versa.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The authors would like to thank Julia Halder for valuable comments. The authors also
gratefully acknowledge funding from the UK Biotechnology and Biological Sciences
Research Council (DTG BB/F529038/1, SABR Grant BB/F006039/1).</p>
      <ref-list>
        <ref id="ref9">
          <mixed-citation>9. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J Mol Biol 215, 403-410 (1990)</mixed-citation>
        </ref>
        <ref id="ref10">
          <mixed-citation>10. Dale, J.M., Popescu, L., Karp, P.D.: Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 11, 15 (2010)</mixed-citation>
        </ref>
        <ref id="ref11">
          <mixed-citation>11. Krallinger, M., Valencia, A., Hirschman, L.: Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 9 Suppl 2, S8 (2008)</mixed-citation>
        </ref>
        <ref id="ref12">
          <mixed-citation>12. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29 (2000)</mixed-citation>
        </ref>
        <ref id="ref13">
          <mixed-citation>13. Sigman, M., Cecchi, G.A.: Global organization of the Wordnet lexicon. Proc Natl Acad Sci U S A 99, 1742-1747 (2002)</mixed-citation>
        </ref>
        <ref id="ref14">
          <mixed-citation>14. Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., Clamp, M.: The Ensembl genome database project. Nucleic Acids Res 30, 38-41 (2002)</mixed-citation>
        </ref>
        <ref id="ref15">
          <mixed-citation>15. Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., Hirakawa, M.: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34, D354-357 (2006)</mixed-citation>
        </ref>
        <ref id="ref16">
          <mixed-citation>16. Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., Madden, T.L.: NCBI BLAST: a better web interface. Nucleic Acids Res 36, W5-9 (2008)</mixed-citation>
        </ref>
        <ref id="ref17">
          <mixed-citation>17. Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C., Apweiler, R.: The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37, D396-403 (2009)</mixed-citation>
        </ref>
        <ref id="ref18">
          <mixed-citation>18. Goffard, N., Weiller, G.: PathExpress: a web-based tool to identify relevant pathways in gene expression data. Nucleic Acids Res 35, W176-181 (2007)</mixed-citation>
        </ref>
        <ref id="ref19">
          <mixed-citation>19. He, M., Wang, Y., Li, W.: PPI finder: a mining tool for human protein-protein interactions. PLoS One 4, e4554 (2009)</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cochrane</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galperin</surname>
            ,
            <given-names>M.Y.</given-names>
          </string-name>
          :
          <article-title>The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources</article-title>
          .
          <source>Nucleic Acids Res</source>
          <volume>38</volume>
          ,
          <fpage>D1</fpage>
          -
          <lpage>4</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bhagat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanoh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nzuobontane</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laurent</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orlowski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolstencroft</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aleksejevs</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettifer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
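          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>BioCatalogue: a universal catalogue of web services for the life sciences</article-title>
          .
          <source>Nucleic Acids Res</source>
          <volume>38</volume>
          <supplement>Suppl</supplement>
          ,
          <fpage>W689</fpage>
          -
          <lpage>694</lpage>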
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Antezana</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blonde</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Egana</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rutherford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Baets</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mironov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuiper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>BioGateway: a semantic systems biology tool for the life sciences</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>10</volume>
          <supplement>Suppl 10</supplement>
          ,
          <fpage>S11</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Eisenberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcotte</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xenarios</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeates</surname>
            ,
            <given-names>T.O.</given-names>
          </string-name>
          :
          <article-title>Protein function in the post-genomic era</article-title>
          .
          <source>Nature</source>
          <volume>405</volume>
          ,
          <fpage>823</fpage>
          -
          <lpage>826</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hedges</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          :
          <article-title>The origin and evolution of model organisms</article-title>
          .
          <source>Nat Rev Genet</source>
          <volume>3</volume>
          ,
          <fpage>838</fpage>
          -
          <lpage>849</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kaminski</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Bioinformatics. A user's perspective</article-title>
          .
          <source>Am J Respir Cell Mol Biol</source>
          <volume>23</volume>
          ,
          <fpage>705</fpage>
          -
          <lpage>711</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kleeberger</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>From quantitative trait locus to gene: a work in progress</article-title>
          .
          <source>Am J Respir Crit Care Med</source>
          <volume>171</volume>
          ,
          <fpage>804</fpage>
          -
          <lpage>805</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Skolnick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fetrow</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          :
          <article-title>From genes to protein structure and function: novel applications of computational approaches in the genomic era</article-title>
          .
          <source>Trends Biotechnol</source>
          <volume>18</volume>
          ,
          <fpage>34</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>