<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology-based gene set enrichment analysis using an efficient semantic similarity measure and functional clustering.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sidahmed Benabderrahmane</string-name>
          <email>sidahmed.benabderrahmane@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mekami Hayet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>King Abdul Aziz University, Faculty of Computing and Information Technology.</institution>
          <addr-line>Jeddah 21589</addr-line>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <fpage>151</fpage>
      <lpage>159</lpage>
      <abstract>
        <p>Gene set enrichment analysis makes it possible to extract specific biological functions associated with a group of genes. To this end, we propose a novel approach for mining biological data, using the Gene Ontology (GO) as the main source of gene annotation terms. First, we use our new semantic similarity measure (IntelliGO) in a clustering process that groups genes sharing similar biological functions described in GO. Second, the clustering results are evaluated using the F-score method and public gene reference sets. An overlap analysis is then presented as a method for exploiting the matching between clusters and reference sets. This method is applied to a list of genes found dysregulated in cancer samples; in this case, the reference sets are replaced by gene expression profiles. Overlap analysis between these profiles and the functional clusters obtained with IntelliGO-based clustering then characterizes subsets of genes displaying consistent functions and similar expression profiles, together with their enriched biological functions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the last decade, DNA microarrays were used for measuring the expression
levels of thousands of genes under various biological conditions [
        <xref ref-type="bibr" rid="ref11 ref8">11, 8</xref>
        ]. Gene
expression data analysis typically proceeds in two steps. First, expression profiles are
produced by grouping genes displaying similar expression levels under a given
set of conditions [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Second, a functional analysis, based on functional
annotations, is applied to genes sharing the same expression profile, in order to identify
their relevant biological functions [
        <xref ref-type="bibr" rid="ref16 ref18 ref19 ref6">6, 16, 18, 19</xref>
        ]. One important purpose
of this functional analysis is to identify and characterize genes that can serve as
diagnostic signatures or prognostic markers for different stages of a disease. One
of the most widely used ontologies in the biological domain is the Gene Ontology
(GO), which is among the most common sources of functional annotations
of genes [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">5, 1, 2</xref>
        ].
      </p>
      <p>
        This ontology of about 30,000 terms is organized as a controlled vocabulary
describing the biological process (BP), molecular function (MF), and cellular
component (CC) aspects of gene annotation, also called GO aspects [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The
GO vocabulary is structured as a rooted Directed Acyclic Graph (rDAG) in
which GO terms (concepts) are the nodes, connected by different hierarchical
relations (mostly is-a and part-of relations). The is-a relation describes the fact
that a given child term is a specialization of a parent term, while the part-of
relation denotes the fact that a child term is a component of a parent term. By
definition, each rDAG has a unique root node, relationships between nodes are
oriented, and there are no cycles, i.e. no path starts and ends at the same node.
The GO Consortium regularly updates a GO Annotation (GOA) Database [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in
which appropriate GO terms are assigned to genes or gene products from public
databases.
      </p>
      <p>
        GO is widely used in several complex biological data mining problems. Authors
in [
        <xref ref-type="bibr" rid="ref20 ref7">20, 7</xref>
        ] used GO for gene functional analysis in order to interpret DNA
microarray experiments, by exploiting the commonly accepted assumption that
genes having similar expression profiles should share similar biological functions.
In such an analysis, an enrichment study based on statistical P-value calculation
is applied to genes sharing the same expression profile [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The results usually
consist of sets of GO terms characterizing the biological functions predominantly
represented in a list of genes, thereby suggesting which function or process is
affected when the behavior of this group of genes varies. However, the main
limitation of these methods is that they treat the input list of genes
as functionally homogeneous. In
practice, genes present in the same expression profile may be involved in multiple
biological processes. Thus, the statistical tests for extracting specific GO terms
could be biased. Moreover, the already proposed methods for gene functional
enrichment analysis do not consider exclusively the three aspects of GO, that is
not important for the biologists. To overcome this problem, we propose here
a new approach for analyzing gene expression data, which refines the input list into
subgroups of functionally homogeneous genes. The enrichment analysis can
then be applied to each subgroup of genes, thus ensuring the extraction of specific
biological functions (GO terms) for those genes. The subgroups of
genes are created using a clustering method based on our recently described
semantic similarity measure called IntelliGO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which performs a functional
comparison between genes annotated with GO terms.
      </p>
      <p>This paper is organized as follows. The next section outlines the use of
IntelliGO in a functional clustering approach and presents the evaluation
results, using the F-score method and collections of reference sets. In a second stage
(Section III), we present an overlap analysis method that exploits the matching
between functional clusters and reference sets. This method is then applied to a
list of genes found dysregulated in cancer samples by replacing the reference sets
with gene expression profiles. An enrichment analysis is then applied to the
overlapping genes, and characterizes subsets of enriched biological functions of
genes displaying consistent functions and similar expression profiles. Finally, in
the last section, the relevance of the results obtained with the proposed algorithms
is discussed.</p>
      <p>The IntelliGO-based gene functional classification</p>
    </sec>
    <sec id="sec-2">
      <title>Presentation of the datasets</title>
      <p>
        Gene functional clustering aims to group genes sharing common biological
functions. We used four datasets for evaluating the functional classification of genes,
already presented in our previous study [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For each dataset we prepared a collection
of reference sets. Each reference set represents a group of genes grouped by an
expert on the basis of their shared biological functions. We selected a total of 13 KEGG
pathways from the KEGG database [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] for the Biological Process aspect of GO
for the human (280 genes in total) and yeast (185 genes in total) species. For the
Molecular Function aspect of GO we chose 10 Pfam clans from the Sanger Pfam
database [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for both species (100 genes for human and 118 for yeast).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Calculating similarity matrices and clustering</title>
      <p>To perform gene functional clustering for a given list of genes, the first step
is to compute a matrix of the semantic similarity values between all pairs of
genes in the input list. This similarity matrix is then used as input to a
clustering algorithm. In our case we used both hierarchical and fuzzy clustering.
The first method gives a global overview of the distribution of genes across
different clusters, while the second method allows a gene to belong to multiple
clusters at the same time. Indeed, a gene can be involved in multiple biological
processes simultaneously. Both algorithms are available in R/Bioconductor
packages<sup>1</sup>.</p>
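      <p>The step from a similarity matrix to clusters can be illustrated with a minimal sketch. It is written in Python rather than with the R/Bioconductor packages used in this study, it implements only a plain average-linkage hierarchical clustering (not the fuzzy variant), and the gene names and similarity values are invented for the example.</p>
      <preformat>
```python
# Illustrative sketch only: average-linkage agglomerative clustering driven by
# a gene-gene similarity matrix. Gene names and similarity values are made up.

def average_linkage(sim, names, k):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(names))]

    def cluster_similarity(a, b):
        # Average pairwise similarity between members of clusters a and b.
        total = sum(sim[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        best_pair, best_value = None, -1.0
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                value = cluster_similarity(clusters[x], clusters[y])
                if value > best_value:
                    best_pair, best_value = (x, y), value
        x, y = best_pair
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    # Return clusters as sorted lists of gene names, in a stable order.
    return sorted(sorted(names[i] for i in c) for c in clusters)


genes = ["g1", "g2", "g3", "g4"]
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
result = average_linkage(similarity, genes, 2)
print(result)
```
      </preformat>
      <p>Working directly on similarities (merging the most similar pair first) avoids converting the matrix to distances; a fuzzy method would instead return a membership degree of each gene in each cluster.</p>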
      <p>
        Pairwise similarity matrices were calculated for all genes present in the four
datasets using our recently proposed similarity measure (IntelliGO) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This
measure is represented in an innovative vector space model (VSM), and takes
into account both information content of annotation terms and their positions
in the ontology rDAG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. With the IntelliGO VSM, each gene is represented as a
vector g in a k-dimensional space where the basis vectors ei correspond to the
k annotation terms. To measure the semantic relationships between terms, we
defined a term similarity product as:
        <disp-formula><tex-math>\vec{e}_i \cdot \vec{e}_j = \frac{2 \cdot \mathrm{Depth}(\mathrm{LCA})}{\mathrm{MinSPL}(t_i, t_j) + 2 \cdot \mathrm{Depth}(\mathrm{LCA})}</tex-math></disp-formula>
      </p>
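      <p>As a rough illustration of the term similarity product, the sketch below evaluates it on a small hypothetical rooted DAG. It assumes that LCA denotes the deepest common ancestor of the two terms, that Depth is the length of the longest path from the root, and that MinSPL is the length of the shortest path between the two terms passing through that ancestor. The term names and helper functions are invented for the example and are not taken from the IntelliGO distribution.</p>
      <preformat>
```python
# Hypothetical toy rooted DAG: each term maps to the list of its parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}


def depth(term):
    """Length of the longest path from the root to `term` (assumed definition)."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + max(depth(p) for p in parents)


def up_distances(term):
    """Shortest upward distance from `term` to each ancestor (itself included)."""
    dist = {term: 0}
    frontier = [term]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in PARENTS[node]:
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def term_product(ti, tj):
    """2 * Depth(LCA) / (MinSPL(ti, tj) + 2 * Depth(LCA))."""
    if ti == tj:
        return 1.0  # also covers the root, where the ratio would be 0 / 0
    di, dj = up_distances(ti), up_distances(tj)
    common = set(di).intersection(dj)
    # Deepest common ancestor; ties are broken by the shortest connecting path.
    lca = max(common, key=lambda t: (depth(t), -(di[t] + dj[t])))
    min_spl = di[lca] + dj[lca]
    return 2 * depth(lca) / (min_spl + 2 * depth(lca))


print(term_product("C", "D"))  # siblings under A
print(term_product("E", "C"))  # child and parent
print(term_product("A", "B"))  # only the root in common
```
      </preformat>
      <p>Under these assumptions the product grows with the depth of the common ancestor and shrinks as the path between the two terms gets longer, so two terms that share only the root have a product of zero.</p>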
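      <p>Moreover, we included in the IntelliGO VSM a novel weighting scheme in which
a coefficient αi is assigned to each ei, so that the gene representation becomes:
        <disp-formula><tex-math>\vec{g} = \sum_{i} \alpha_i \, \vec{e}_i</tex-math></disp-formula></p>
      <sec id="sec-3-1">
        <title>1 www.bioconductor.org</title>
        <p>The coefficients αi combine a weight w(g, ti), which depends
on the evidence code tracking the annotation of gene g with a GO term ti, and
the Inverse Annotation Frequency IAF(ti), which is an estimate of the
information content (IC) of the term ti. Thus, the similarity between g1 and g2
is given by the following generalized cosine formula:</p>
        <p><disp-formula><tex-math>\mathrm{IntelliGO}(\vec{g}_1, \vec{g}_2) = \frac{\vec{g}_1 \cdot \vec{g}_2}{\sqrt{\vec{g}_1 \cdot \vec{g}_1}\,\sqrt{\vec{g}_2 \cdot \vec{g}_2}}</tex-math><label>(2)</label></disp-formula>
with: <inline-formula><tex-math>\vec{g}_1 \cdot \vec{g}_2 = \sum_{i,j} \alpha_{1i}\,\alpha_{2j}\,\vec{e}_i \cdot \vec{e}_j</tex-math></inline-formula>.</p>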
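        <p>The generalized cosine can be sketched numerically as follows. The coefficient vectors and the matrix of term similarity products are invented values, the function names are hypothetical, and returning 0 for a gene without annotation is a choice made for the example rather than part of the published measure.</p>
        <preformat>
```python
import math


def generalized_dot(alpha1, alpha2, term_sim):
    """Sum over i, j of alpha1[i] * alpha2[j] * (e_i . e_j)."""
    return sum(
        a1 * a2 * term_sim[i][j]
        for i, a1 in enumerate(alpha1)
        for j, a2 in enumerate(alpha2)
    )


def generalized_cosine(alpha1, alpha2, term_sim):
    """g1.g2 / (sqrt(g1.g1) * sqrt(g2.g2)) in a non-orthogonal basis."""
    norm1 = math.sqrt(generalized_dot(alpha1, alpha1, term_sim))
    norm2 = math.sqrt(generalized_dot(alpha2, alpha2, term_sim))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # example convention for a gene with no weighted annotation
    return generalized_dot(alpha1, alpha2, term_sim) / (norm1 * norm2)


# Made-up term similarity products for three annotation terms.
TERM_SIM = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
g1 = [1.0, 0.0, 0.0]  # annotated with term 0 only
g2 = [0.0, 1.0, 0.0]  # annotated with term 1 only
g3 = [0.0, 0.0, 2.0]  # annotated with term 2 only, with a larger coefficient

print(generalized_cosine(g1, g2, TERM_SIM))
print(generalized_cosine(g1, g3, TERM_SIM))
```
        </preformat>
        <p>With an ordinary cosine, g1 and g2 would have a similarity of zero because they share no annotation term; here the non-zero product between terms 0 and 1 makes them partially similar.</p>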
        <p>Note that IntelliGO is a pairwise measure involving both node-based and
edge-based similarities. The measure, the clustering algorithms and the
datasets used are available at http://intelligo.loria.fr.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation of the clustering using the F-score method</title>
      <p>
        When reference sets are available, the best method for optimizing the
number of classes produced by unsupervised classification approaches is the F-score
method [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. This method relies on pairing each reference set with the
best-matched cluster and provides a quantitative estimation of the pairing efficiency
(precision and recall). We decided to extend the F-score method in order to
further investigate the pairing between reference sets and clusters in a so-called
overlap analysis. Our approach is outlined in the following algorithms. Algorithm
1 describes unsupervised clustering optimization with reference sets and the global
F-score measure.
      </p>
      <preformat>Algorithm 1 Clustering optimization with reference sets and F-score measure.
Require: Σ = {R1, R2, ..., Rp}: a collection of reference sets; (n1, n2) such that n1 &lt; p &lt; n2; the pairwise similarity matrix of all elements of Σ.
Ensure: The optimal number of generated clusters K, Global F-score(K).
1: for each K in [n1, n2] do
2:   Generate K clusters Φ = {C1, C2, ..., CK}, using all elements of the reference sets Ri
3:   for each reference set Ri ∈ Σ do
4:     for each cluster Cj ∈ Φ do
5:       Precision(Ri, Cj) = |Ri ∩ Cj| / |Cj|
6:       Recall(Ri, Cj) = |Ri ∩ Cj| / |Ri|
7:       F-score(Ri, Cj) = 2 ∗ Precision(Ri, Cj) ∗ Recall(Ri, Cj) / (Precision(Ri, Cj) + Recall(Ri, Cj))
8:     end for
9:     F-score(Ri) = Max over all Cj ∈ Φ of F-score(Ri, Cj)
10:  end for
11:  Global F-score(K) = Σ(i=1..p) (|Ri| ∗ F-score(Ri)) / Σ(i=1..p) |Ri|
12: end for
13: Select the K for which Global F-score(K) is maximal
14: return K, Global F-score(K).</preformat>
      <p>
        We applied fuzzy C-means clustering to the gene-gene pairwise
similarity matrices calculated with IntelliGO for the four datasets. We used this
clustering algorithm since some genes can be involved in multiple biological
processes or molecular functions. The same evaluation procedure was performed on
a tool representing the state of the art for gene classification methods (DAVID:
Database for Annotation, Visualization, and Integrated Discovery classification
tool) [
        <xref ref-type="bibr" rid="ref14 ref9">9, 14</xref>
        ]. Each clustering result, together with the corresponding collection of
reference sets, served as input to Algorithm 1 for determining global F-scores.
For the IntelliGO-based fuzzy clustering of Datasets 1 and 2, we varied
the number of generated clusters, K, between 11 and 17 in steps of 1, since these
datasets are composed of 13 pathways for the human and yeast species. For Datasets
3 and 4, the values of K were taken between 8 and 14 in steps of 1, since
these two datasets are composed of 10 Pfam clans for both species. For each K,
the Global F-score(K) value was calculated. For the DAVID functional
classification of the same datasets, we varied the Kappa similarity threshold
from 0.3 to 0.7 in steps of 0.1 in order to obtain different numbers of clusters,
since DAVID does not allow the number of clusters to be specified a priori. As
in the previous case, the K clusters were matched with the input reference sets,
and the Global F-score(K) value was calculated. The results are presented in
Table 1.
      </p>
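      <p>The scoring part of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the clusterings for each candidate K are supplied ready-made instead of being produced by fuzzy C-means, clusters are treated as crisp sets, and the gene identifiers and function names are hypothetical.</p>
      <preformat>
```python
def f_score(reference, cluster):
    """Harmonic mean of precision and recall for one (reference set, cluster) pair."""
    overlap = len(reference.intersection(cluster))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def global_f_score(reference_sets, clusters):
    """Size-weighted average of the best F-score obtained by each reference set."""
    total_size = sum(len(r) for r in reference_sets)
    weighted = sum(
        len(r) * max(f_score(r, c) for c in clusters) for r in reference_sets
    )
    return weighted / total_size


def best_k(reference_sets, clusterings):
    """`clusterings` maps each candidate K to its list of clusters (sets of genes)."""
    scores = {
        k: global_f_score(reference_sets, clusters)
        for k, clusters in clusterings.items()
    }
    k_opt = max(scores, key=scores.get)
    return k_opt, scores[k_opt]


# Invented toy data: two reference sets and two candidate clusterings.
references = [{"a", "b", "c"}, {"d", "e"}]
clusterings = {
    2: [{"a", "b", "c"}, {"d", "e"}],
    3: [{"a", "b"}, {"c"}, {"d", "e"}],
}
print(best_k(references, clusterings))
print(global_f_score(references, clusterings[3]))
```
      </preformat>
      <p>Weighting each reference set by its size, as in step 11 of Algorithm 1, keeps a small reference set from dominating the global score.</p>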
      <p>Regarding the results obtained with Dataset 1 (13 human KEGG pathways)
using our similarity measure, all global F-score values are
greater than 0.5, with a maximum value of 0.62 for K = 14. This means that the
genes of the 13 human pathways considered in Dataset 1 are best grouped with
our measure into 14 functional clusters. This result may reflect the fact that one
pathway of the KEGG database encompasses two biological processes and/or
that the clustering process has grouped together genes from various pathways
sharing common BP annotations.</p>
      <p>With DAVID (Table 1), the maximum global F-score (0.67) is reached when
Kappa = 0.3, giving 10 functional clusters. At higher thresholds, the number of
genes excluded from the clustering increases, revealing one limitation of the DAVID
tool. Similar results are obtained with Datasets 2, 3 and 4 and are detailed in
Table 1.</p>
      <p>In summary, these results indicate that IntelliGO-based clustering is
a valuable alternative to the DAVID classification tool. It is noteworthy that with
the DAVID classification tool all maximum values of the global F-score are obtained for
the minimal Kappa similarity threshold (0.3), which corresponds, according to
DAVID, to the poorest quality of clustering. Moreover, the calculation of the
global F-score is somewhat biased with DAVID, as a certain number of genes are
excluded from the classification results.</p>
      <p>Overlap analysis between functional clusters and reference sets</p>
      <p>In order to refine our comparison, we examined the matching between
the reference sets and the clusters obtained with the optimal K value. We used
Algorithm 2 to extract the top-ranked cluster from each list of clusters assigned
to each reference set. This algorithm describes how clusters (C) are assigned to
reference sets (R) according to the F-score values, allowing the identification of
best-matching pairs (R ∩ C).</p>
      <p>The intersection R ∩ C is expected to display a highly homogeneous content
composed of genes known as members of a reference set and found most similar
by clustering. Alternatively, the two set-theoretic differences C\R and R\C can
be considered in order to discover missing information. In our study, we are
interested in the genes present in R ∩ C. Indeed, we apply an enrichment analysis
to the genes present in this intersection, in order to extract specific functions.</p>
      <preformat>Algorithm 2 Assignment of clusters to reference sets according to the F-score values.
Require: Σ = {R1, R2, ..., Rp}: a collection of reference sets; ΦK = {C1, C2, ..., CK}: a collection of clusters; ∀(i, j) | 1 ≤ i ≤ p, 1 ≤ j ≤ K, F-score(Ri, Cj) (see Algorithm 1).
Ensure: A ranked list of clusters, ordered by decreasing F-score, assigned to each reference set.
1: for each reference set Ri ∈ Σ do
2:   Listi ← (Cj, F-score(Ri, Cj)): a list of clusters Cj ordered by decreasing values of F-score(Ri, Cj)
3:   print Listi
4: end for</preformat>
      <p>
colorectal cancers. The idea here is to compare the IntelliGO functional
clusters of the 128 genes with the fuzzy Differential
Expression Profiles (fuzzy DEPs) obtained from the same list of genes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which are taken as reference sets. Here,
each DEP represents a group of genes having a similar expression profile. We
believe that overlap analysis may reveal hidden relationships between
gene expression and biological function. More precisely, 8 fuzzy DEPs
containing genes with GO annotation are retained from our previous study [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
The pairwise similarity matrix was generated for the 128 genes, and the number
of clusters, k, was optimized with Algorithm 1 using the 8 fuzzy DEPs as
reference sets (Σ={DEPi}, i = 1..8). The optimal number of clusters was
obtained for k = 3, with an F-score value of 0.4.
      </p>
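      <p>The ranking performed by Algorithm 2, together with the three parts of an overlap (R ∩ C, C\R and R\C), can be sketched as below. The set names, gene identifiers and function names are invented for the example, and clusters are treated as crisp sets for simplicity.</p>
      <preformat>
```python
def f_score(reference, cluster):
    """Harmonic mean of precision and recall for one (reference set, cluster) pair."""
    overlap = len(reference.intersection(cluster))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def rank_clusters(reference_sets, clusters):
    """For each reference set, list (cluster name, F-score) by decreasing F-score."""
    ranking = {}
    for ref_name, reference in reference_sets.items():
        scored = [(name, f_score(reference, c)) for name, c in clusters.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        ranking[ref_name] = scored
    return ranking


def overlap_parts(reference, cluster):
    """Split a (reference set, cluster) pair into its three set-theoretic parts."""
    return {
        "R_and_C": reference.intersection(cluster),
        "C_minus_R": cluster - reference,
        "R_minus_C": reference - cluster,
    }


# Invented toy data.
reference_sets = {"R1": {"a", "b", "c"}, "R2": {"d", "e"}}
clusters = {"C1": {"a", "b", "d"}, "C2": {"c", "e"}}

ranking = rank_clusters(reference_sets, clusters)
best_for_r1 = ranking["R1"][0][0]
parts = overlap_parts(reference_sets["R1"], clusters[best_for_r1])
print(ranking)
print(parts)
```
      </preformat>
      <p>In this toy case the genes in the intersection of R1 and its top-ranked cluster would be the ones passed on to the enrichment analysis.</p>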
      <p>
        Algorithm 2 was then used to extract the lists of genes present in C ∩ DEP, i.e.
genes that both display functional similarity (C) and belong to one of the eight fuzzy
DEPs (R). The enrichment analysis can then be applied to these signature
genes, to discover among them statistically significant GO terms displaying a low
P-value. In our case, the P-value is calculated for genes present in C ∩ DEP
versus a background list (here all human genes) displaying GO annotation in
the NCBI repository file<sup>2</sup>, using the hypergeometric test [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
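      <p>For illustration, a one-sided hypergeometric enrichment P-value for a single GO term can be computed as sketched below. The counts are invented, the function name is hypothetical, and no correction for multiple testing is applied; the sketch only shows the form of the test, not the exact procedure of the cited tool.</p>
      <preformat>
```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from a
    background of `population` genes, `annotated` of which carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Invented counts: 10 background genes, 3 annotated with the term,
# 4 genes in the intersection C ∩ DEP, 2 of them annotated with the term.
p_value = hypergeom_pvalue(population=10, annotated=3, sample=4, hits=2)
print(p_value)
```
      </preformat>
      <p>A term would be reported as enriched when this probability falls below the chosen significance threshold.</p>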
      <p>
        Preliminary results have shown that very specific biological functions with
low P-values (≤10E-04) were extracted for genes present in C ∩ DEP. For
example, genes in Cluster 1 ∩ DEP3 have "regulation of transcription,
DNA-dependent" and "NADH oxidation" as very specific functions (non-exhaustive list).
Genes of Cluster 2 ∩ DEP2 have the following functions: "cell differentiation",
"multicellular organismal development", "insulin secretion". Genes of Cluster 3
∩ DEP14 have "water transport" as a specific function. Transport
processes are very important in the physiology of the digestive system. This
function was found for the human AQP8 (Aquaporin 8) gene, which is reported
in the literature as underexpressed in tumoral tissues. This gene belongs to
DEP14, which groups genes underexpressed in cancer versus normal tissue [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This observation supports our strategy. Similar
results were obtained for the remaining DEPs and are not reported here.
      </p>
      <sec id="sec-4-1">
        <title>2 ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz</title>
        <p>
          Conclusion and perspectives
        </p>
        <p>
          In this paper, we have presented a gene set enrichment analysis based on
functional clustering with the IntelliGO semantic similarity measure. In a first step,
we proposed an algorithm for evaluating the clustering approach using reference
sets and the F-score method. Very encouraging results were obtained with
IntelliGO when compared with a well-known classification method (the DAVID tool).
Beyond clustering per se, we have presented an overlap analysis method which
leads to a pairing of clusters and reference sets and may be used for set-difference
analysis. Applied to a list of genes from a transcriptomic cancer study, our
method identifies subsets of genes displaying consistent expression and
functional profiles. Promising results have been obtained using a simple GO term
enrichment procedure. More sophisticated tools such as GSEA [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] could be used
to improve the biological interpretation of these subsets of genes.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Michael</given-names>
            <surname>Ashburner</surname>
          </string-name>
          , Catherine Ball, Judith Blake, David Botstein,
          <string-name>
            <given-names>Heather</given-names>
            <surname>Butler</surname>
          </string-name>
          , Michael Cherry, Allan Davis, Kara Dolinski, Selina Dwight, Janan Eppig, Midori Harris, David Hill, Laurie Issel-Tarver,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Kasarskis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Suzanna</given-names>
            <surname>Lewis</surname>
          </string-name>
          , John C. Matese, Joel Richardson, Martin Ringwald, Gerald Rubin, and
          <string-name>
            <given-names>Gavin</given-names>
            <surname>Sherlock</surname>
          </string-name>
          .
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          ,
          <volume>25</volume>
          :
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Barrell</surname>
          </string-name>
          , Emily Dimmer,
          <string-name>
            <given-names>Rachael P.</given-names>
            <surname>Huntley</surname>
          </string-name>
          , David Binns,
          <string-name>
            <given-names>Claire</given-names>
            <surname>O'Donovan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rolf</given-names>
            <surname>Apweiler</surname>
          </string-name>
          .
          <article-title>The GOA database in 2009-an integrated Gene Ontology Annotation resource</article-title>
          .
          <source>Nucl. Acids Res</source>
          .,
          <volume>37</volume>
          (suppl1):
          <fpage>D396</fpage>
          -
          <lpage>403</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Sidahmed</given-names>
            <surname>Benabderrahmane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marie-Dominique</given-names>
            <surname>Devignes</surname>
          </string-name>
          , Malika Smaïl-Tabbone, Amedeo Napoli, Olivier Poch,
          <string-name>
            <given-names>Ngoc-H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Raffelsberger</surname>
          </string-name>
          .
          <article-title>Analyse de données transcriptomiques: Modélisation floue de profils d'expression différentielle et analyse fonctionnelle</article-title>
          .
          <source>In INFORSID</source>
          , pages
          <fpage>413</fpage>
          -
          <lpage>428</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Sidahmed</given-names>
            <surname>Benabderrahmane</surname>
          </string-name>
          , Malika Smail-Tabbone, Olivier Poch, Amedeo Napoli, and
          <string-name>
            <given-names>Marie-Dominique</given-names>
            <surname>Devignes</surname>
          </string-name>
          .
          <article-title>IntelliGO: a new vector-based semantic similarity measure including annotation origin</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>588</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          .
          <article-title>Special issue: Biomedical ontology in action</article-title>
          .
          <source>Applied Ontology</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J-F</given-names>
            <surname>Boulicaut</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Gandrillon</surname>
          </string-name>
          .
          <article-title>Informatique pour l'analyse du transcriptome</article-title>
          .
          <source>Hermes Lavoisier</source>
          , Paris,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Markus</given-names>
            <surname>Brameier</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Wiuf</surname>
          </string-name>
          .
          <article-title>Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps</article-title>
          .
          <source>J. of Biomedical Informatics</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>160</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Alvis</given-names>
            <surname>Brazma</surname>
          </string-name>
          , and Jaak Vilo.
          <article-title>Gene expression data analysis</article-title>
          .
          <source>FEBS Letters</source>
          ,
          <volume>480</volume>
          :
          <fpage>17</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Glynn</given-names>
            <surname>Dennis</surname>
          </string-name>
          , Brad Sherman, Douglas Hosack, Jun Yang,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H</given-names>
            <surname>Lane</surname>
          </string-name>
          , and Richard Lempicki.
          <article-title>DAVID: Database for annotation, visualization, and integrated discovery</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>4</volume>
          (
          <issue>9</issue>
          ):
          <fpage>R60</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Eran</given-names>
            <surname>Eden</surname>
          </string-name>
          , Roy Navon, Israel Steinfeld, Doron Lipson, and
          <string-name>
            <given-names>Zohar</given-names>
            <surname>Yakhini</surname>
          </string-name>
          .
          <article-title>GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>10</volume>
          (
          <issue>1</issue>
          ):
          <fpage>48</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Eisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Spellman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. O.</given-names>
            <surname>Brown</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Botstein</surname>
          </string-name>
          .
          <article-title>Cluster analysis and display of genome-wide expression patterns</article-title>
          .
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          ,
          <volume>95</volume>
          (
          <issue>25</issue>
          ):
          <fpage>14863</fpage>
          -
          <lpage>14868</lpage>
          ,
          <year>December 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Robert D.</given-names>
            <surname>Finn</surname>
          </string-name>
          , Jaina Mistry, Benjamin Schuster-Bockler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, Sean R. Eddy,
          <string-name>
            <given-names>Erik L. L.</given-names>
            <surname>Sonnhammer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alex</given-names>
            <surname>Bateman</surname>
          </string-name>
          .
          <article-title>Pfam: clans, web tools and services</article-title>
          .
          <source>Nucl. Acids Res.</source>
          ,
          <volume>34</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Audrey</given-names>
            <surname>Gasch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Eisen</surname>
          </string-name>
          .
          <article-title>Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>3</volume>
          (
          <issue>11</issue>
          ):
          <fpage>research0059.1</fpage>
          -
          <lpage>research0059.22</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Da</given-names>
            <surname>Huang</surname>
          </string-name>
          , Brad Sherman, Qina Tan, Jack Collins,
          <string-name>
            <given-names>W Gregory</given-names>
            <surname>Alvord</surname>
          </string-name>
          , Jean Roayaei, Robert Stephens,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Baseler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H Clifford</given-names>
            <surname>Lane</surname>
          </string-name>
          , and Richard Lempicki.
          <article-title>The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>8</volume>
          (
          <issue>9</issue>
          ):
          <fpage>R183</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Minoru</given-names>
            <surname>Kanehisa</surname>
          </string-name>
          , Susumu Goto, Miho Furumichi, Mao Tanabe, and
          <string-name>
            <given-names>Mika</given-names>
            <surname>Hirakawa</surname>
          </string-name>
          .
          <article-title>KEGG for representation and analysis of molecular networks involving diseases and drugs</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>38</volume>
          (Database issue):
          <fpage>D355</fpage>
          -
          <lpage>D360</lpage>
          ,
          January
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Purvesh</given-names>
            <surname>Khatri</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sorin</given-names>
            <surname>Draghici</surname>
          </string-name>
          .
          <article-title>Ontological analysis of gene expression data: current tools, limitations, and open problems</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>21</volume>
          (
          <issue>18</issue>
          ):
          <fpage>3587</fpage>
          -
          <lpage>3595</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brass</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Goble</surname>
          </string-name>
          .
          <article-title>Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>19</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1275</fpage>
          -
          <lpage>1283</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
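          18.
          <string-name>
            <given-names>David</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christine</given-names>
            <surname>Brun</surname>
          </string-name>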
          , Elisabeth Remy, Pierre Mouren, Denis Thieffry, and Bernard Jacq.
          <article-title>GOToolBox: functional analysis of gene datasets based on Gene Ontology</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>5</volume>
          (
          <issue>12</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Brad</given-names>
            <surname>Sherman</surname>
          </string-name>
          , Da Huang, Qina Tan, Yongjian Guo, Stephan Bour, David Liu,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Stephens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Baseler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H Clifford</given-names>
            <surname>Lane</surname>
          </string-name>
          , and Richard Lempicki.
          <article-title>DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>426</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Nora</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Spieth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Zell</surname>
          </string-name>
          .
          <article-title>A memetic co-clustering algorithm for gene expression profiles and biological annotation</article-title>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Aravind</given-names>
            <surname>Subramanian</surname>
          </string-name>
          , Pablo Tamayo,
          <string-name>
            <given-names>Vamsi K.</given-names>
            <surname>Mootha</surname>
          </string-name>
          , Sayan Mukherjee,
          <string-name>
            <given-names>Benjamin L.</given-names>
            <surname>Ebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael A.</given-names>
            <surname>Gillette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanda</given-names>
            <surname>Paulovich</surname>
          </string-name>
          , Scott L. Pomeroy,
          <string-name>
            <given-names>Todd R.</given-names>
            <surname>Golub</surname>
          </string-name>
          , Eric S. Lander, and
          <string-name>
            <given-names>Jill P.</given-names>
            <surname>Mesirov</surname>
          </string-name>
          .
          <article-title>Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles</article-title>
          .
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          ,
          <volume>102</volume>
          (
          <issue>43</issue>
          ):
          <fpage>15545</fpage>
          -
          <lpage>15550</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          .
          <source>Information Retrieval</source>
          . Butterworth,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>