=Paper= {{Paper |id=None |storemode=property |title=Mining Semantic Networks of Bioinformatics e-Resources from the Literature |pdfUrl=https://ceur-ws.org/Vol-559/Paper4.pdf |volume=Vol-559 |dblpUrl=https://dblp.org/rec/conf/swat4ls/AfzalESN09 }} ==Mining Semantic Networks of Bioinformatics e-Resources from the Literature== https://ceur-ws.org/Vol-559/Paper4.pdf
           Mining Semantic Networks of Bioinformatics
                e-Resources from the Literature

            Hammad Afzal1,2, James Eales1, Robert Stevens1, Goran Nenadic1
                   1
                  School of Computer Science, University of Manchester,
                         Oxford Road, Manchester, M13 9PL, UK
                      2
                        DERI, Unit for Natural Language Processing,
                     National University of Ireland, Galway, Ireland
               {Hammad.Afzal@postgrad., James.Eales@, Robert.Stevens@,
                             G.Nenadic@}manchester.ac.uk



       Abstract. There have been a number of recent efforts (e.g. BioCatalogue,
       BioMOBY, etc.) to systematically catalogue bioinformatics tools, services and
       datasets. These efforts mostly rely on manual curation and are unable to cope
       with the huge influx of various electronic resources, which consequently result
       in their unavailability to the community. We present a text mining approach
       that utilizes the literature to extract and semantically profile bioinformatics
       resources. Our method identifies the mentions of resources in the literature and
       assigns a set of co-occurring terminological and ontological entities
       (descriptors) to represent them. Since such representations can be extremely
       sparse, we use kernel metrics based on lexical term/descriptor similarities to
       identify semantically related resources. Resources are then either clustered or
       linked into a network, providing the users (bioinformaticians and service/tool
       crawlers) with a possibility to explore tools, services and datasets based on their
       relatedness, thus potentially improving the resource discovery process.

       Keywords: bioinformatics services, service description, text mining, networks,
       kernel similarity.




1 Introduction

The rapid increase in the amount of bioinformatics data produced in recent years has
resulted in the huge influx of bioinformatics electronic resources (e-resources), such
as online-databases [1], data-analysis tools, Web services [2] etc. Discovering such
resources became a major bottleneck in bioinformatics: in order to effectively utilize
them, e-resources need to be organised and their functionalities semantically
described. A number of community wide efforts such as BioCatalogue [3] and
BioMoby [4] have been initiated to systematically catalogue the “resourceome”. By
annotating services using keywords and ontological concepts, such catalogues
facilitate access to both bioinformaticians and Semantic Web crawlers and agents that
can orchestrate the use of such resources. However, the annotation process depends
on a typically slow manual curation process that hinders the growth of such curated
resources to keep pace with the very field they attempt to catalogue. For instance, the
number of registered services in BioCatalogue (there are 1,084 of them1) is still small
compared with the total number of Web services available online: it is estimated that
there are ~3500 life science Web services in Taverna alone [3, 5]. This fact calls for
the development of semi-automatic methods for resource annotation and their
cataloguing in order to maximise the utility of e-resources by making them widely
available to the community.
    One of the key aims of providing bioinformatics resources with semantic
descriptions is to improve resource discovery. Semantically-described resources can
not only be searched, browsed and discovered by using keyword-based queries (for
instance, via their names or task descriptions), but also on the basis of the semantic
relatedness of their functionalities. For example, BioCatalogue descriptions refer to
similar services (see Fig. 1) so that the users can identify related tools.




           Fig. 1. A snapshot of a Web service description taken from BioCatalogue2


    When manually assigned annotation tags and/or related services are not available,
we hypothesise that automated approaches could be used to improve the discovery
process. These include building networks and clusters of similar resources. For
example, a user can search for a Web service that corresponds to a particular input,
output or operation performed. If, however, the retrieved services do not fulfil the
exact requirement, the user may be interested in exploring similar services (for
example, with more generic/specific input/output, but still with a related
functionality), which can be identified by browsing a Web service network or by
exploring clusters of related services. Traditionally, similar or related services have
been identified by using lexical comparisons of their names and names of their
parameters (input/output) and operations. This process has been further improved by
concept-based comparisons using domain ontologies that have been used to annotate
the resources (as in myGrid [6] and BioCatalogue).

1 Statistics collected on 10th Oct, 2009.
2 http://www.biocatalogue.org/services/2048-wsblastpgpservice_414364
    In our previous work, we have shown that the vast amounts of scientific literature
related to bioinformatics resources can be tapped in order to automatically extract
their key semantic functional features [7, 8]. In this paper we propose a methodology
to build and explore clusters and semantic networks of bioinformatics resources,
which can help to identify related resources on the basis of their similarity as well as
by their semantic relatedness. In order to measure the semantic relatedness between
the resources, we have designed a kernel-based similarity approach that uses lexical
and semantic properties of resource mentions as extracted from the literature.


2 Methodology

The overall methodology adopted in the work presented here is based on the concepts
of bioinformatics resources, semantic resource descriptors, and kernel/similarity
functions, which are explained below.

    Bioinformatics resources represent the list of e-resources which are used by
bioinformaticians while performing in-silico experiments [9, 10]. We have focused on
the four major classes: Algorithms3, Applications, Data and Data Resources. These
have been engineered from the myGrid ontology. Table 1 shows example resource
instances belonging to these classes. In our previous work, we have described a set of
text mining tools that can be used to efficiently identify, classify and extract mentions
of these resources in the literature [7]. The method is based on key terminological
heads assigned to each of the semantic classes (e.g. alignment and method are
“linked” to Algorithms, while sequence and record point to a Data entity) and specific
lexico-syntactic patterns (enumerations, coordination, etc.).
                     Table 1. Examples of semantic classes and their instances

        Semantic class                               Example instances
                             SigCalc algorithm, CHAOS local alignment, SNP analysis,
     Algorithm               KEGG Genome-based approach, GeneMark method, K-fold cross
                             validation procedure
                             PreBIND Searcher program, Apollo2Go Web Service, FLIP
     Application             application, Apollo Genome Annotation curation tool, GenePix
                             software, Pegasys system
                             GeneBank record, Genome Microbial CoDing sequences, Drug
     Data                    Data report
                             PIR Protein Information Resource, BIND database,
     Data resource           TIGR dataset, BioMOBY Public Code repository


    Semantic resource descriptors are the key terminological phrases used in the
existing textual descriptions of bioinformatics resources, as given by various
providers such as BioCatalogue, BioMoby, EBI4, etc. These descriptors refer to
3 Note that, to aid simplicity and uniformity, we consider Algorithms as e-resources.
4 http://www.ebi.ac.uk/Tools/webservices/
concepts and specific roles (e.g. input/output parameters, etc) and have been
frequently used in the existing descriptions. For example, frequent descriptors are
gene expression, phylogenetic tree, microarray experiment, hierarchical clustering,
amino acid sequence, motif, etc. We use such descriptors to profile a given resource
and/or to link it to a domain ontology.
     We have used two sources to build a dictionary of bioinformatics resource
descriptors. The first source is the list of terms collected from the bioinformatics
ontology used in the myGrid project [11]. This list contains 443 terms describing
concepts in informatics (the key concepts of data, data structures, databases and
metadata); bioinformatics (domain-specific data sources e.g. model organism
sequencing databases, and domain-specific algorithms for searching and analysing
data e.g. a sequence alignment algorithm); molecular biology (higher level concepts
used to describe bioinformatics data types, used as inputs and outputs in services e.g.
protein sequence, nucleic acid sequence); and tasks (generic tasks a service operation
can perform e.g. retrieving, displaying, aligning). The second source includes
automatically extracted terms (recognised by the TerMine5 service) and frequent noun
phrases obtained from existing descriptions of bioinformatics Web resources available
from BioCatalogue.
     For each bioinformatics resource that can be identified in the literature, we build
its semantic profile by harvesting all descriptors that co-occur with the resource in the
same sentence in a given coprus (see Fig. 2 for an example). These profiles are then
used to establish semantic similarities between resources by comparing the
descriptors (used as features) that have been assigned to them.

                     Kyoto Encyclopaedia of Genes and Genomes
                                    (KEGG)


                                   Semantic Descriptors
                    data | database | DDBJ | EBI | enzyme | GenBank |
                 Gene Ontology | gene | genome | Kyoto Encyclopedia |
                microarray data | pathway | protein | transcription factor |
                          protein-protein interaction | UniProt




   Fig. 2. Semantic Resource Descriptors for the Kyoto Encyclopaedia of Genes and Genomes


    Since service representations using descriptors can be extremely sparse, we use
kernel metrics based on term/descriptor similarities to identify semantically related
resources. The main aim is to enhance the comparison process by incorporating
lexical and contextual properties of descriptors retrieved from the literature. This
approach is inherent to our method, as descriptors (used as features for the resources)
have been retrieved from sentences that are related to resources. Various similarity
kernels can be used for comparisons (e.g. bag-of-words kernels [12, 13], string

5 http://www.nactem.ac.uk/software/termine/
kernels [14], etc.). Here we have considered three approaches which are described
below.

•    Method 1: lexical comparison of resource names. This is a simple similarity
     function that relies on lexical profiles of resource names. The lexical profile of a
     term comprises all possible linear combinations of word-level substrings present
     in that term [15]. For example, the lexical profile of term ‘protein sequence
     alignment’ comprises the following terms protein, sequence, alignment, protein
     sequence, sequence alignment, protein sequence alignment. In this method, the
     similarity between two resources is then calculated as a similarity between lexical
     profiles of their names. Formally, let LP(s1) and LP(s2) be lexical profiles
     (represented as vectors) of names of resources s1 and s2. Then the similarity
     function is defined as:
                                                 LP( s1 ) ⋅ LP ( s 2 )
                          Sim 1 ( s1 , s 2 ) =                                         (1)
                                                 LP( s1 ) ⋅ LP ( s 2 )

•    Method 2: shared descriptors. Another option is to use the standard bag-of-
     descriptors kernel, where each resource is represented as a bag of its descriptors
     and the similarity is based on exact matches between descriptors. This kernel
     compares the resources using the inner product that measures the degree of
     descriptor sharing:

                                   Sim 2 ( s1 , s 2 ) = s1 ⋅ s 2                       (2)

    where s1 and s2 are vectors that represent the semantic descriptors assigned to the
    resources being compared. Alternatively, cosine similarity can be used if we use
    the frequency of the occurrence of the semantic descriptors (not presented here).

•    Method 3: lexical similarity of shared descriptors. A further option for a
     kernel function is to use the relatedness between descriptors to measure the
     similarity between the resources. The main motivation behind this approach is
     that resources can share related but not exactly the same descriptors. We
     therefore suggest using a kernel that takes into account descriptor smoothing by
     incorporating a similarity measure between descriptors themselves in the kernel
     function that calculates similarity between resources. Formally, let S = {s1, ..., sk}
     be the set of e-resources whose descriptions have been collected from the
     literature. Let D = {d1, ..., dm} be the set of all descriptors, where m is the total
     number of descriptors. In order to measure similarity between two resources, we
     first build a similarity matrix A (m x m), where each element aij corresponds to
     the similarity between descriptors di and dj. Then, the similarity between two
     resources s1 and s2 is calculated as:

                          Sim3 ( s1 , s2 ) = s1 ⋅ A ⋅ s2                               (3)

    In the experiments reported below, the smoothing is done by calculating the cosine
    similarity between the lexical profiles of the descriptors (analogously to (1)).
3 Experiments and Discussion

Here we demonstrate the development of networks of related resources by using each
of the three methods stated above. The networks are visualised as weighted,
undirected graphs where nodes are resources and edges represent relatedness between
them. This relatedness is estimated using the similarity functions, where the weight of
an edge represents the strength of the relationship between the two connected nodes
(see Section 3.2). We also investigate different methods of exploring and visualising
our similarity matrices; specifically we use hierarchical clustering dendrograms and
heatmap visualisations.


3.1    Data

Table 2 gives the number of bioinformatics resources that were identified in a corpus
of 2,691 full-text articles published by the journal BMC Bioinformatics. The details
of the extraction process are presented in [7].

Table 2. The statistics of Bioinformatics e-resources found in the BMC Bioinformatics corpus

          Semantic Class         Total # of instances        Average # of descriptors
           Algorithm                    5,722                           9
           Application                  2,076                           8
           Data                         2,662                          15
           Data Resource                1,992                          10

    Each of the e-resources has been assigned a set of associated descriptors (11
descriptors on average; see Table 2 for details for the specific classes). As can be
expected, single word descriptors appeared more frequently in the corpus. Table 3
lists the most frequent single word, two- and three-word descriptors.

           Table 3. The most frequent single-word, two-word and three-word descriptors

      Single word          Two-word descriptors              Three-word descriptors
      descriptors
 gene: 13,585         gene expression: 1,147            protein-protein interaction: 308
 method: 8,203        secondary structure: 887          multiple sequence alignment: 295
 protein: 6,417       protein sequence: 780             gene expression data: 262
 sequence: 5,991      protein structure: 574            amino acid sequence: 257
 analysis: 4,287      microarray experiment: 488        Smith-Waterman algorithm: 48
3.2   Exploration of Semantic Networks

Here we assess the utility of resource descriptors for semantic profiling of
bioinformatics resources. We do this by exploring our hypothesis that bioinformatics
resources can be semantically linked via resource descriptions. For this, we have
manually identified a sample of 18 resources that are commonly used in
bioinformatics (see Table 4). Each of these has occurred in more than 120 sentences
in our corpus. The sample contains resources from all four semantic classes of
resource. The results have been generated using the three methods for deriving
semantic relatedness between resources as described above.

                    Table 4. A sample of resources used for exploration

                                                        Number of         Resource
  Resource Name
                                                        sentences         Class
  Gene ontology (GO)                                        6757          Data resource
  Support vector machine (SVM)                              2456          Algorithm
  Protein data bank (PDB)                                   904           Data resource
  Hidden Markov model (HMM)                                 602           Algorithm
  Principal components analysis (PCA)                       599           Algorithm
  Position-specific scoring matrix (PSSM)                   457           Algorithm
  Self organising map (SOM)                                 305           Algorithm
  Medical subject headings (MeSH)                           261           Data resource
  Neural network                                            256           Algorithm
  Markov chain Monte Carlo (MCMC)                           252           Algorithm
  Expression profile                                        252           Data
  Basic local alignment search tool (BLAST)                 238           Application
  Phylogenetic tree                                         233           Data
  Structural classification of proteins (SCOP)              216           Data resource
  Kyoto encyclopaedia of genes and genomes (KEGG)           187           Data resource
  Clusters of orthologous groups (COG)                      163           Data resource
  ChIp-chip data                                            126           Data
  Pairwise alignment                                        123           Data

Method 1: lexical comparison of resource names. As expected, this method did not
yield useful results as very little similarity was found between resource names. This
suggests that surface-level lexical information originating from resource names is not
sufficient to develop semantic networks of resources.
Method 2: shared descriptors. We derived mutual similarity scores for the 18
resources, with a mean of 0.34 and a standard deviation of 0.09. This method
identified significant relatedness between many resources (see Fig. 3 for a heat-map).
Clearly, the addition of descriptors improved our ability to derive a measure of
semantic similarity between resources whose names are lexically disparate. Although
it is difficult to define any clear semantic relationships from these data, it is noticeable
that ChIp-chip data has specific properties that are not commonly matched by others
in the sample, (manifested as a line of light yellow on the heat-map, see Fig. 3).




  Fig. 3. Heatmap representation of the matrix of shared descriptor similarity scores between
          resources (method 2). Values vary from 1.0 (red) to 0.00 (white), see legend.
                      Heatmap generated by R function, ‘heatmap’ [16].


   To further highlight the subtle differences and similarities between the resources in
the sample, we applied a hierarchical clustering algorithm [17] to the matrix of scores
(see the resulting tree in Fig. 4).




  Fig. 4. Hierarchical clustering of e-resources using the shared descriptors similarity matrix
  (method 2). Distances were calculated as (1 – Sim2). Ward’s minimum variance clustering
  method [17] was used to cluster the data. The tree was generated using R function ‘hclust’.
   The tree in Fig. 4 highlights some interesting clusters of the examined resources.
Rather than being clustered by resource class, there are semantically important links
being identified. For example, PCA and SOM are important and widely used methods
for exploring expression data [18], and these resources form their own cluster.
Additionally, there is a link established between phylogenetic tree and MCMC;
MCMC, in combination with a Bayesian approach, is a popular method in
phylogenetic analysis for the derivation of trees of relationships between sequences
[19]. The cluster of pairwise alignment, HMM, PSSM and neural network highlights
the semantic theme of sequence analysis (HMM, PSSM and neural networks have all
been used successfully to analyse pairwise and multiple sequence alignments).
KEGG, BLAST, COG, SCOP and MeSH form their own group, which do not highlight
any obvious semantic relationships; a likely reason is that these resources have such
broad utility that the specifics of the relationships between them are lost. It is
surprising, however, that GO and PDB did not follow a similar pattern.




Fig. 5. Semantic network of bioinformatics resources (using method 2 and values shown in Fig.
3). Node size represents frequency in the corpus, edge thickness represents how similar the two
connected nodes are. Node colour is determined by the semantic class of the node (red for
Data, green for Data resource, blue for Algorithm and yellow for Application. The image was
generated using Cytoscape6, the network was laid out using the Cytoscape layout algorithm
‘Edge-Weighted Spring Embedded’, using the edge weight data in the network.

   Even though similarity data alone can identify important semantic links, we further
explored the importance of the number and strength of links between resources. In
Fig. 5 we present our similarity data as edges in a network connecting each node
(representing individual resources) with those that have some similarity to it. Each

6 Cytoscape, http://www.cytoscape.org/
edge is weighted by the similarity between the resources it connects, so that edges that
appear thick represent strong relationships and weak relationships are represented by
thin edges. We have removed all edges that have a weight below the median edge
weight for the network. Our intention with this was to remove edges that exist due to
chance alone and to better highlight the strongest relationships in the network.7 The
strongest links occur between the resources that appear most frequently in the corpus.
The strongest link is between Gene Ontology and SVM, most probably because SVM
methods have been widely used for protein annotations using GO (see, for example,
[26]). Strong links also occur between PCA and expression profile and expression
profile and SVM, indicating types of algorithms used with specific data types.

Method 3: lexical similarity of shared descriptors. The results of calculations for
linking the resources considering the lexical similarities between their descriptors are
summarised in figures 6, 7 and 8.




Fig. 6. Heatmap representation of the matrix of lexically smoothed descriptor similarity scores
between resources (method 3). Values vary from 1.0 (red) to 0.0 (white), see legend. Heatmap
                          generated by R function, ‘heatmap’ [16].


   Fig. 6 has some similarity with Fig. 3. However, there are clusters of more closely
related resources (for example expression profile, SVM and Gene Ontology). All
resources again have some similarity to all others, making it more difficult to identify
the most important relationships. This indicates that sensible thresholds need to be
identified to remove uninteresting links. The similarity scores have a mean of 0.47
and a standard deviation of 0.14, which suggests that the scores from method 3 vary
more widely than those from method 2 and thus potentially provide better
discrimination.

7 ChIp-chip data is missing from Fig. 5 because all its edges have weights below the median.
   Fig. 7 (the hierarchical tree) is similar to Fig. 4 in the sense that some of the
clusters are shared between the trees. The cluster of resources with broad implications
and uses (MeSH, BLAST, SCOP, KEGG and COG) in particular is still present.
However, some new interesting clusters have emerged: for example, the data
resources phylogenetic tree and pairwise alignment have been clustered together, both
of which are common data forms in sequence analysis.




    Fig. 7. Hierarchical clustering of e-resources using the lexically smoothed similarity matrix
    (method 3). Distances were calculated as (1 – Sim3). Ward’s minimum variance clustering
     method [17] was used to cluster the data. Tree generated using R function ‘hclust’ [16].

   The network8 given in Fig. 8 presents the strongest clustering of resources based on
their class. Although the Data nodes (represented as red) are not strongly linked to
each other, the Data Resource nodes (green) are all clustered together. There is also a
similar pattern with the Algorithm nodes (blue). The strongest edge weights again
occur between resources that appear most frequently in the corpus, suggesting that
frequency normalisation may be needed to reduce this impact. Gene Ontology, in
particular, is linked to all other resources, and that is primarily a product of its
ubiquity in the literature and therefore the tendency for many descriptors and
resources to be linked to it.



8   We have again removed the edges with a weight below the median edge weight. This has
    caused the removal of the MeSH node from the network. This could be due to MeSH being
    the only resource strongly related to literature resources.
Fig. 8. Semantic network of bioinformatics resources (using method 3 and values shown in Fig.
6). Node size represents frequency in the corpus, edge thickness / weight represents how
similar the 2 connected nodes are. Node colour is determined by the semantic class of the
node, red for Data, green for Data resource, blue for Algorithm and yellow for Application.
The image was generated using Cytoscape (see Fig. 6 for details).

   By further analysing the associated semantic profiles, we can see that significant
relatedness between resources typically originates from sharing a number of very
generic descriptors, in particular single-word ones (see Table 3). Many of these have
a generic nature (e.g. method, analysis, gene, etc.). This problem, however, can be
resolved by either filtering generic concepts from the descriptors using stop-words or
by using a tf*idf-like weights [20] assigned to descriptors (considering the frequency
of descriptors appearing in profiles of different resources). Selecting and varying the
threshold for the edge weight in our network representations can also discard
unwanted weak links.


4     Related Work
Most of the efforts in the domain of Semantic Web for the Life Sciences have been
focused on data annotation (e.g. a number of protein function databases), using both
manual and automated approaches. Recently, these efforts have been extended to
semantic description of services and tools that are used to analyse, visualise and
explore such data.
    In addition to manual annotation, Semantic Web technologies have been applied
for the description of Web services. These approaches include descriptions of Web
service functionalities as well as the meta-data information about their inputs and
outputs. Most of the suggested approaches rely on the data available in WSDL files.
For example, Lerman and colleagues [21] presented work on automatic labelling of
inputs and outputs of Web services using meta-data based classification relying on
terms extracted from the associated WSDL files. The underlying heuristic behind the
meta-data based classification is that similar data types tend to be named by similar
names and/or belong to operations or messages that are similarly named. Similarly,
Hess and Kushmerick [22] used machine learning to classify Web services using
information given in WSDL files of the services which include port types, operations
and parameters along with any documentation available about the Web service.
Information in a WSDL file is treated as “normal” text, and the problem of Web
service and its metadata classification is addressed as a text classification problem.
    Carman and Knoblock, on the other hand, reported on invoking new/unknown
services and comparing the data they produce with that of known services, and then
use the meta-data associated with the known services to add annotations to the
unknown resources [23]. Belhajjame and colleagues [24] used known annotations of
parameters belonging to components in a workflow to infer the unknown annotations
of other parameters (in other components). Here, semantic information of operation
parameters is inferred based on their connections to other (annotated) components
within existing tried-and-tested workflows. Apart from deriving new annotations, this
method can inspect the parameter compatibility in workflows and can also highlight
conflicting parameter annotations.
    There have been some efforts to improve the service discovery process. Dong et
al. [25], for example, used clustering-based approach in which parameters of service
operations are grouped into meaningful concepts, which are then used to find similar
service operations based on similar parameters. However, this method provides only a
limited solution and is unable to provide comprehensive service discovery based on
the underlying semantics provided by services. Employing Semantic Web approaches
such as ontological annotations could improve this approach [11].


5    Conclusions

In this paper we proposed and explored a literature-based methodology for building
clusters and semantic networks of functionally related e-resources in bioinformatics.
The main motivation is to facilitate the resource discovery approaches, which would
improve the availability and utility of these resources to the community. The
methodology revolves around terminological units (semantic descriptors) that are
frequently used by bioinformatics resource providers to semantically describe the
resources. The semantic descriptors have been automatically compiled and each e-
resource has been assigned a set of descriptors co-occurring with the given e-resource
in a full-text article corpus.
   In order to establish similarity between resources, their profiles are compared using
three levels of service representations: the lexical similarity between the resource
names (method 1); the similarity calculated on the basis of shared semantic
descriptors (method 2), and the same similarity smoothed by considering lexically
similar descriptors. As expected, the first method failed to capture any significant
links between resources as it relied solely on the surface level clues originating from
the names of resources. The second approach performed significantly better and was
able to identify interesting clustering patterns between the resources which did not
have any lexical resemblance. At the third level, in contrast to considering the exact
match between resource descriptors, we devised a descriptor-based kernel matrix,
which incorporated the approximate lexical similarities between the descriptors (using
their lexical profiles). The approximate similarities helped in linking the resources
that shared the descriptors which were not exactly the same, but were related. An
interesting pattern emerged whilst experimenting with this metric, whereby resources
would cluster together based on their class (i.e. the resources which belonged to the
same class (such as Algorithm, Data Resource, etc.) tended to appear closer in the
network). Method 2 revealed some interesting functional links (linking data types and
algorithms). It remains an open question as to which of these clustering patterns is
most useful for semantic resource discovery.
   The work presented here demonstrates the potential of simple kernel methods
(using lexical profiles) built to model relatedness between resource descriptors. We
anticipate that further work will be required to identify the most relevant weights for
semantic descriptors to counter-balance the impact of frequent (and less informative)
features. Other kernels (such as contextual and distributional similarities, WordNet-
based similarities, string kernels etc) need to be explored and could provide better
resolution of the complex interrelationships between resources.

Acknowledgments. This work was partially supported by the UK Biotechnology and
Biological Science Research Council (BBSRC) via the “BioCatalogue” project and by
the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) awarded
to HA. JE is funded by the e-LICO project (EU Grant agreement number 231519).


References

1.   Nucleic Acids Research, DB Issue. Volume 37 (January, 2009)
2.   Nucleic Acids Research, Web Server Issue. Volume 37 (July, 2009)
3.   Goble, C A., Belhajjame, K., Tanoh, F., Bhagat, J., Wolstencroft, K., Stevens, R.,
     Nzuobontane, E., McWilliam, H., Laurent, T. & Lopez, R.: BioCatalogue: A Curated
     Web Service Registry For The Life Science Community. In 3rd International Biocuration
     Conference, Berlin Germany (2009)
4.   Wilkinson, MD., Links, M.: BioMOBY: an open source biological web services proposal.
     Briefings in Bioinformatics. 3:331–41 (2002)
5.   Oinn, T., Greenwood, M., Addis, M., Alpdemir, N., Wroe, C., et al.: Taverna: lessons in
     creating a workflow environment for the life sciences: Research Articles. Concurr.
     Comput. : Pract. Exper. 18(10): pp. 1067-1100, (2006)
6.   Stevens, R., Robinson, A., Goble, C.: myGrid: personalised bioinformatics on the
     information grid. In Bioinformatics (ISMB Supplement), pp. 302-304 (2003)
7.   Afzal, H., Stevens, R., Nenadic, G.: Mining Semantic Descriptions of Bioinformatics
     Web Resources from the Literature. In Proceedings of the 6th European Semantic Web
      Conference on The Semantic Web: Research and Applications. Heraklion, Crete, Greece,
      Springer-Verlag, pp. 535-549. (2009)
8.    Afzal, H., Stevens, R., Nenadic, G.: Towards Semantic Annotation of Bioinformatics
      Services: Building a Controlled Vocabulary. In Proc. of the Third International
      Symposium on Semantic Mining in Biomedicine, Turku, Finland, pp. 5–12 (2008)
9.    Eales, JM., Pinney, JW., Stevens, RD., Robertson, DL.: Methodology capture:
      discriminating between the "best" and the rest of community practice. BMC
      Bioinformatics 9: 359 (2008).
10.   Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A: A
      systematic strategy for large-scale analysis of genotype phenotype correlations:
      identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res.
      35: pp. 5625–5633 (2007)
11.   Wolstencroft, K., Alper, P., Hull, D., Wroe, C., Lord, P.W., Stevens, R.D., Goble, C.A.:
      The myGrid Ontology: Bioinformatics Service Discovery. International Journal of
      Bioinformatics Research and Applications 3, pp. 326–340 (2007)
12.   Teytaud, O., Jalam, R.: Kernel-based text categorization. In International Joint
      Conference on Neural Networks (IJCNN’2001), Washington DC (2001).
13.   Pahikkala, T., Pyysalo, S., Ginter, F., Boberg, J., Jarvinen, J., et al.: Kernels incorporating
      word positional information in natural language disambiguation tasks. In Proc. of the
      Eighteenth International Florida Artificial Intelligence Research Society Conference,
      Menlo Park, California. AAAI Press (2005).
14.   Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text
      classification using string kernels. J. Mach. Learn. Res. 2 pp. 419-444 (2002).
15.   Nenadic, G., Ananiadou, S.: Mining Semantically Related Terms from Biomedical
      Literature. ACM Transactions on Asian Language Information Processing. pp. 22-
      43(2006)
16.   R Development Core Team: R, A language and environment for statistical computing. R
      Foundation for Statistical Computing, Vienna, Austria (2009)
17.   Romesburg, H.C.: Cluster analysis for researchers. Lulu Press, North Carolina (2004).
18.   Belacel N, Wang Q, Cuperlovic-Culf M: Clustering methods for microarray gene
      expression data. Omics. 2006; 10:pp. 507–531 (2006)
19.   Liang, LJ., Weiss, RE., Redelings B, Suchard MA.: Improving phylogenetic analyses by
      incorporating additional information from genetic sequence databases. Bioinformatics;
      25(19):pp. 2530-6 (2009)
20.   Salton, G, Buckley C.: Term-weighting approaches in automatic text retrieval,
      Information Processing and Management, pp. 513—523, (1998)
21.    Lerman, K., Plangrasopchok, A., Knoblock, C.: Automatically Labelling the Inputs and
      Outputs of Web Services. In Proceedings of AAAI-2006, Boston, MA, USA: 149-181
      (2006)
22.   Hess, A., Kushmerick, N. : Learning to Attach Semantic Metadata to Web Services. In
      Proc. 2nd International Semantic Web Conference (ISWC2003). Sanibel Island, Florida,
      USA, Springer Berlin / Heidelberg. 2870/2003: 258-273 (2003).
23.   Carman, M. J., Knoblock, C. A.: Learning Semantic Descriptions of Web Information
      Sources. Twentieth International Joint conference on Artificial Intelligence, Hyderabad
      India: 1474-1480, (2007).
24.   Belhajjame, K., Embury, S. M., Paton, N. W., Stevens, R.: Automatic annotation of Web
      services based on workflow definitions. ACM Trans. Web 2(2): 1-34 (2007)
25.   Dong, X., Halevy, A., Madhavan, J., Nemes, E., Zhang, J.: Similarity search for Web
      services. In Proceedings of the Thirtieth international conference on Very large data bases
      - Volume 30. Toronto, Canada, VLDB Endowment: pp. 372-383. (2004)
26.   Vinayagam A. et al.: Applying Support Vector Machines for Gene ontology based gene
      function prediction, BMC Bioinformatics 2004, 5:116