Mining the internet for scientific discoveries: what can automatic page tagging tell us about the study of genes?

Shao Chih Kuo

shaochih.kuo@bbsrc.ac.uk 0 1

Andrea Splendiani

Michael Defoin-Platel

Chris Rawlings

0 Department of Biomathematics and Bioinformatics, Rothamsted Research, Harpenden AL5 2JQ, United Kingdom
1 School of Computing Science, Claremont Tower, Newcastle University, Newcastle Upon Tyne NE1 7RU, United Kingdom

The internet is not only a platform for publishing documents; it is also a provider of data and services. Increasingly, scientific disciplines are exposing their tools and data to the internet; as a result, some scientific problems have become essentially internet mining problems. We show that candidate gene prioritisation, a challenging problem in biology, is essentially an internet mining problem. Thus, improving our ability to mine Future Internet Knowledge Bases (FIKBs) will advance biology and other sciences.

Bioinformatics, semantic graph, graph mining, gene prioritisation, automatic page tagging

1 Introduction

The internet has been, and still is, primarily concerned with publishing documents. However, it is clearly also a provider of data and services: scientific data is increasingly accessible on the internet, and many scientific tools are made available via the web, as web services, web applications, or otherwise exposed to the internet.

This is particularly evident in the Life Science domain, which has embraced the internet as a medium for publishing data and tools. To cite a few examples: for molecular data, the journal Nucleic Acids Research has tracked 1,230 databases [1] covering a diverse range of topics, and this figure is growing at a rapid rate. Likewise, the BioCatalogue directory tracks 1,695 publicly available web services of bioinformatics analysis tools [2]. PubMed, the web’s largest bibliography, is also life science centred, with a historical focus on biomedical topics. Therefore the internet is, amongst other things, a distributed knowledge base for biological studies, where the network of biological entities and their relations is described “in the web”: via interlinked websites or, more explicitly, as RDF graphs [3].

In light of the “internetisation” of biological data and resources, we assert that many biological problems are de facto internet mining problems, analogous to more conventional internet mining problems. Therefore improving our ability to mine Future Internet Knowledge Bases (FIKBs) will certainly advance biology and other sciences. We demonstrate this by showing how the problem of gene prioritisation is analogous to automatic page tagging.

2 Gene prioritisation: a biological problem

Finding causes that influence particular traits is an important challenge in biology. Whether it is locating disease genes affecting humans, factors decreasing food production for cereals, or factors increasing industrial insulin production, the goal is fundamentally the same: to find the causes of biological traits. Often the causes under study are genetic factors, and the methods employed to examine them invariably rely on drawing parallels against the body of studied genes; that is to say, given some new gene of study, the assumption is always that it works in a similar way to closely evolutionarily related genes [4].

This assumption underpins the choice to focus study on model organisms: organisms which lend themselves to study (i.e. by virtue of having easily observed characteristics or by being cheap to work with) and which are representative of their respective classes [5]. For example, mouse is commonly used as a model organism for human. For studying a newly discovered gene, bioinformatics can be used first to identify studied, evolutionarily related genes by various similarity measures, and then to transpose information to the unstudied gene by assigning it putative functions [6].

Using observations and statistical techniques, associations can be drawn between complex traits and genomic regions. However, these regions can be large [7], and the cost of testing each gene may be high, making it prohibitive to test every gene in a region exhaustively. For the biofuel crop willow, biomass is an important trait involved in the production of biofuels. Testing a single gene for its influence on this trait takes from months to years, and genomic regions derived from the statistical techniques may contain several hundred genes. As randomly testing genes is unlikely to reveal trait-affecting genes, this is a clear case for gene prioritisation techniques.

When analysing genes, some useful knowledge may be gleaned from analysing their sequences directly [8]. By and large, however, the bulk of useful knowledge about these genes will be derived from comparing or otherwise associating the newly sequenced genes with the corpus of well-studied genes [9], with existing pathways [10], with publications [11], and with any other available data. These associations induce a semantically heterogeneous graph: each gene comparison or association method asserts a new type of relationship between genes from the newly sequenced organism and the wider general body of knowledge, which is itself a semantically heterogeneous graph (see Figure 1 for an example). Once the data is viewed as a graph, descriptions of complex traits of interest can be represented as collections of nodes representing functional annotations, such as those from the various biological ontologies [12] or controlled vocabularies. Finding good ways to prioritise genes for experimental testing for complex biological traits then becomes equivalent to ranking the overall association, in a heterogeneous graph, from a set of nodes of a gene type (genes from the newly sequenced organism) to a set of nodes of the annotation types.
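The ranking formulation above can be illustrated with a minimal sketch. The text does not prescribe a particular association measure, so the one used here is an assumption chosen for simplicity: the score of a gene is the sum, over all simple paths of bounded length from the gene to any trait annotation node, of the product of the edge weights along the path. The edge types, their weights, and all node names are hypothetical.

```python
from collections import defaultdict

# Hypothetical confidence assigned to each relationship type (an assumption).
EDGE_TYPE_WEIGHT = {
    "sequence_similarity": 0.8,
    "in_pathway": 0.6,
    "annotated_with": 1.0,
}


def build_graph(edges):
    """Index typed, undirected edges as node -> [(neighbour, weight)]."""
    graph = defaultdict(list)
    for source, edge_type, target in edges:
        weight = EDGE_TYPE_WEIGHT[edge_type]
        graph[source].append((target, weight))
        graph[target].append((source, weight))
    return graph


def association(graph, start, targets, max_hops=4):
    """Sum the products of edge weights over simple paths (at most
    max_hops edges) that lead from start to any node in targets."""
    total = 0.0
    stack = [(start, 1.0, (start,))]
    while stack:
        node, strength, path = stack.pop()
        if node in targets:
            total += strength  # a path ends when it reaches an annotation
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbour, weight in graph.get(node, ()):
            if neighbour not in path:  # keep paths simple (no revisits)
                stack.append((neighbour, strength * weight, path + (neighbour,)))
    return total


def prioritise(graph, candidates, trait_terms):
    """Return (gene, score) pairs, most strongly associated first."""
    scored = [(gene, association(graph, gene, trait_terms)) for gene in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two unstudied candidates linked to studied genes.
EDGES = [
    ("candidate_A", "sequence_similarity", "studied_1"),
    ("studied_1", "annotated_with", "trait_term"),
    ("studied_1", "in_pathway", "pathway_X"),
    ("studied_2", "in_pathway", "pathway_X"),
    ("candidate_B", "sequence_similarity", "studied_2"),
]
GRAPH = build_graph(EDGES)
RANKING = prioritise(GRAPH, ["candidate_B", "candidate_A", "candidate_C"], {"trait_term"})

for gene, score in RANKING:
    print(gene, round(score, 3))
```

In this toy graph, candidate_A ranks first because it is directly similar to a gene annotated with the trait term, whereas candidate_B reaches the term only through a shared pathway. Other measures, such as random walks with restart, would fit the same formulation.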

3 Automatic page tagging: an analogy

The pattern of the problem presented earlier is not unique to the biological domain. As an example, the same pattern can also be found in an automatic page tagging system that works by comparing untagged pages against an existing corpus, where pages may be associated by relationships such as: “belonging to the same domain”, “being written in the same language”, or “belonging to the same web ring”. Furthermore, pages can be related by the similarity of their structure, by shared keywords, by a shared audience of the pages, and by other page comparison methods. These relationships may be more or less informative, but will induce a semantically heterogeneous graph.

Suppose that the pages in the existing corpus have been assigned appropriate tags by curation. These tags may be from a controlled vocabulary, from an ontology, or free (in which case they could still fall into a structure such as WordNet [13]). An automatic tagging method might then be, for each untagged page, to determine the strength of association between that page and the existing tags, and then to assign tags according to the strength of association (with some sensible threshold).
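A minimal sketch of such a tagging method follows. It assumes that a similarity in [0, 1] between the untagged page and each curated page has already been computed by some page comparison method, and it takes the strength of a tag to be the highest similarity among the curated pages carrying that tag. Both choices, together with the page names, tags, and the threshold of 0.5, are illustrative assumptions rather than part of the method described above.

```python
def tag_strengths(similarities, curated_tags):
    """Strength of each tag for one untagged page.

    similarities: curated page -> similarity to the untagged page.
    curated_tags: curated page -> set of tags assigned by curation.
    A tag's strength is the best similarity among pages carrying it.
    """
    strengths = {}
    for page, similarity in similarities.items():
        for tag in curated_tags.get(page, ()):
            strengths[tag] = max(strengths.get(tag, 0.0), similarity)
    return strengths


def assign_tags(similarities, curated_tags, threshold=0.5):
    """Keep only the tags whose strength reaches the threshold."""
    strengths = tag_strengths(similarities, curated_tags)
    return {tag: s for tag, s in strengths.items() if s >= threshold}


# Toy corpus of curated pages (hypothetical).
CURATED_TAGS = {
    "page_1": {"genetics", "biology"},
    "page_2": {"cooking"},
}
# Similarity of one untagged page to each curated page (hypothetical).
SIMILARITIES = {"page_1": 0.9, "page_2": 0.2}

ASSIGNED = assign_tags(SIMILARITIES, CURATED_TAGS)
print(sorted(ASSIGNED))
```

Taking the maximum keeps the example easy to follow; summing or averaging over curated pages would be equally defensible and changes only `tag_strengths`.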

Then tag-based search (by single tags or by collections of tags) of the new, automatically tagged pages would return an ordered list of those pages most associated with those tags. This order can utilise the semantic distances between tags, and makes the problem analogous to the gene prioritisation problem. An example of a graph that this view of the tagging problem may induce is shown in Figure 1.

Fig. 1. Two example graphs, candidate gene prioritization based on studied genes and automatic page tagging based on curated pages, sharing the same graph topology. Bold type represents the “gene version” of this graph topology, whereas regular type represents the “page version” of this graph topology. Where edge/node types are the same in both cases, they are italicised.
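As an illustration of how semantic distances between tags could order the results of a tag-based search, the following sketch ranks pages against a query. The tag taxonomy, the use of tree path length as the semantic distance, and the discount 1 / (1 + distance) are all assumptions made for the example; a real system might derive distances from an ontology or from WordNet instead.

```python
# Hypothetical mini taxonomy: tag -> parent tag (None marks the root).
PARENT = {
    "topic": None,
    "science": "topic",
    "biology": "science",
    "genetics": "biology",
    "physics": "science",
}


def ancestors(tag):
    """Return the chain from tag up to the root, tag first."""
    chain = []
    while tag is not None:
        chain.append(tag)
        tag = PARENT[tag]
    return chain


def tag_distance(a, b):
    """Number of taxonomy edges on the path between two tags."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("tags share no common ancestor")


def search(query_tags, page_tags):
    """Rank pages for a query.

    page_tags: page -> {tag: strength of association}.
    Each query tag contributes the best distance-discounted strength
    among the page's tags; contributions are summed over query tags.
    """
    scores = {}
    for page, tags in page_tags.items():
        scores[page] = sum(
            max((s / (1 + tag_distance(q, t)) for t, s in tags.items()), default=0.0)
            for q in query_tags
        )
    return sorted(scores, key=scores.get, reverse=True)


# Automatically tagged pages with tag strengths (hypothetical).
PAGE_TAGS = {
    "page_c": {"physics": 0.9},
    "page_b": {"biology": 0.9},
    "page_a": {"genetics": 0.9},
}

RESULTS = search(["genetics"], PAGE_TAGS)
print(RESULTS)
```

A page tagged with a nearby term still matches the query, only with a lower score, which mirrors ranking genes by their association with a set of annotation terms.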

4 Gene prioritisation: an internet mining problem

In the previous section, we showed how a typical bioinformatics problem, gene prioritisation, is analogous to a typical internet mining problem. Beyond this analogy, this and other bioinformatics studies should also be considered internet mining problems in their own right.

Each of the node types shown in the gene prioritisation example in Figure 1 is represented by one or more internet resources, as are the edge types. For each of the node types shown in Figure 1, one source of this type of data is given in Table 1. For each of the edge types shown in Figure 1, one source of data for this type of relationship, or one tool for asserting this type of relationship, is shown in Table 2. Biological entities and relationships are encoded in a variety of forms: as documents, as structured data, and as combinations of both. For our purposes, we only wish to illustrate that at least one form is (and in general, many forms are) available on the internet.

Thus, heterogeneous graphs that can be used for solving the candidate gene prioritisation problem are directly available on the internet and, along with other scientific resources, will be part of Future Internet Knowledge Bases (FIKBs).

Nodes             Source              URL
Genes             Ensembl [14]        http://www.ensembl.org/index.html
Pathways          KEGG Pathway [15]   http://www.genome.jp/kegg/pathway.html
Annotation terms  Gene Ontology [12]  http://www.geneontology.org/

In conclusion, with the greater availability of scientific resources on the internet, tasks in mining scientific data will increasingly become internet mining problems. Scientific research will increasingly rely on the design and availability of dedicated Future Internet Knowledge Bases (FIKBs), and on the development of associated methods to analyse them.

This brings with it both new challenges and new opportunities. Whilst we have illustrated our case with a problem in the biological domain, the principles hold more widely for other sciences. Some scientific problems have parallels amongst existing internet mining problems, and it is reasonable to expect that advances in techniques for mining the future internet will provide solutions to scientific problems, and vice versa.

Acknowledgements

The authors would like to thank Julia Halder for valuable comments. The authors also gratefully acknowledge funding from the UK Biotechnology and Biological Sciences Research Council (DTG BB/F529038/1, SABR Grant BB/F006039/1).

1. Cochrane, G.R., Galperin, M.Y.: The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources. Nucleic Acids Res 38, D1-4 (2010)

2. Bhagat, J., Tanoh, F., Nzuobontane, E., Laurent, T., Orlowski, J., Roos, M., Wolstencroft, K., Aleksejevs, S., Stevens, R., Pettifer, S., Lopez, R., Goble, C.A.: BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic Acids Res 38 Suppl, W689-694 (2010)

3. Antezana, E., Blonde, W., Egana, M., Rutherford, A., Stevens, R., De Baets, B., Mironov, V., Kuiper, M.: BioGateway: a semantic systems biology tool for the life sciences. BMC Bioinformatics 10 Suppl 10, S11 (2009)

4. Eisenberg, D., Marcotte, E.M., Xenarios, I., Yeates, T.O.: Protein function in the post-genomic era. Nature 405, 823-826 (2000)

5. Hedges, S.B.: The origin and evolution of model organisms. Nat Rev Genet 3, 838-849 (2002)

6. Kaminski, N.: Bioinformatics. A user's perspective. Am J Respir Cell Mol Biol 23, 705-711 (2000)

7. Kleeberger, S.R., Schwartz, D.A.: From quantitative trait locus to gene: a work in progress. Am J Respir Crit Care Med 171, 804-805 (2005)

8. Skolnick, J., Fetrow, J.S.: From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol 18, 34-39 (2000)

9. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J Mol Biol 215, 403-410 (1990)

10. Dale, J.M., Popescu, L., Karp, P.D.: Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 11, 15 (2010)

11. Krallinger, M., Valencia, A., Hirschman, L.: Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 9 Suppl 2, S8 (2008)

12. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29 (2000)

13. Sigman, M., Cecchi, G.A.: Global organization of the Wordnet lexicon. Proc Natl Acad Sci U S A 99, 1742-1747 (2002)

14. Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., Durbin, R., Eyras, E., Gilbert, J., Hammond, M., Huminiecki, L., Kasprzyk, A., Lehvaslaiho, H., Lijnzaad, P., Melsopp, C., Mongin, E., Pettett, R., Pocock, M., Potter, S., Rust, A., Schmidt, E., Searle, S., Slater, G., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Stupka, E., Ureta-Vidal, A., Vastrik, I., Clamp, M.: The Ensembl genome database project. Nucleic Acids Res 30, 38-41 (2002)

15. Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., Hirakawa, M.: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34, D354-357 (2006)

16. Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., Madden, T.L.: NCBI BLAST: a better web interface. Nucleic Acids Res 36, W5-9 (2008)

17. Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C., Apweiler, R.: The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37, D396-403 (2009)

18. Goffard, N., Weiller, G.: PathExpress: a web-based tool to identify relevant pathways in gene expression data. Nucleic Acids Res 35, W176-181 (2007)

19. He, M., Wang, Y., Li, W.: PPI finder: a mining tool for human protein-protein interactions. PLoS One 4, e4554 (2009)