=Paper=
{{Paper
|id=Vol-1747/IP31_ICBO2016
|storemode=property
|title=Planteome Gene Annotation Enrichment Analysis 
|pdfUrl=https://ceur-ws.org/Vol-1747/IP31_ICBO2016.pdf
|volume=Vol-1747
|authors=Botong Qu,Jaden Diefenbaugh,Eugene Zhang,Justin Elser,Pankaj Jaiswal,Seth Carbon,Christopher Mungall
|dblpUrl=https://dblp.org/rec/conf/icbo/QuDZEJCM16
}}
==Planteome Gene Annotation Enrichment Analysis ==
<pdf width="1500px">https://ceur-ws.org/Vol-1747/IP31_ICBO2016.pdf</pdf>
<pre>
   Planteome Gene Annotation Enrichment Analysis
   Botong Qu, Jaden Diefenbaugh, Eugene Zhang
   School of Electrical and Computer Engineering
   Oregon State University
   Corvallis, Oregon 97331

   Justin Elser, Pankaj Jaiswal
   Department of Botany and Plant Pathology
   Oregon State University
   Corvallis, Oregon 97331

   Seth Carbon, Chris Mungall
   Lawrence Berkeley National Laboratory
   Berkeley, CA 94720


Abstract: Annotation enrichment analysis of a gene list helps biologists
identify the potential biological functions associated with it. As the plant
ontology categories are extended, discovering the significant ontology terms
associated with a gene list becomes increasingly informative. We introduce a
tool that helps biologists find these terms based on the expanding ontology
database of the Planteome project. In addition, we propose new visualization
schemes that help users construct a meaningful interpretation of the results,
guided by the ontology tree.

Keywords: Ontology, Plants, Term enrichment analysis, Visualization

I. INTRODUCTION

Gene annotations are analyzed and explored by gene curators from all over the
world. Finding and visualizing the useful information in these annotations has
been an active topic for decades. The Common Reference Ontologies and
Applications for Plant Biology project lets biologists discover enriched
biological ontology terms among all provided ontologies (Gene Ontology, Plant
Ontology, Trait Ontology, Environment Ontology, etc.). To assist this analysis
process, we provide a gene annotation enrichment analysis tool that uses
Fisher's exact test and the chi-squared test to statistically analyze all
annotation data. We then visualize the results in two ways:

1) Highlighting the enriched terms among all ontology terms in the database to
   emphasize the relative positions of the enriched terms.
2) Treating the cut-off p-value as the basis of an uncertainty factor when
   visualizing the tree structure, so that users can conveniently focus on the
   terms of interest.

II. ANALYSIS MODEL AND METHODS

Our tool provides two common analysis methods for finding the enriched terms:
Fisher's exact test and the chi-squared test [1]. Both methods require a
contingency table. In our system, we create the contingency table (Table I)
similarly to the ones used in [2] and [3]. For one specific ontology term and
n input genes, all genes in the database (N) are classified into four
categories:

- genes annotated to the term and in the input gene list (m),
- genes not annotated to the term and in the input gene list (n - m),
- genes annotated to the term and not in the input gene list (k - m),
- genes not annotated to the term and not in the input gene list
  (N - n - k + m).

This 2 by 2 table is the contingency table used to calculate the p-value for
each term. If the p-value is larger than the user-chosen cut-off value (0.01
or 0.05), the term is not considered enriched in the gene list.

                                 TABLE I
               THE CONTINGENCY TABLE FOR AN ONTOLOGY TERM

                    Input Genes    Not Input Genes    Sum (Ref)
   Annotated             m               k-m              k
   Not annotated        n-m          (N-n)-(k-m)         N-k
   Sum                   n               N-n              N

Given the number of genes annotated to the term inside the gene list (m), the
total number of genes annotated to the term in the whole database (k), the
number of input genes (n), and the total number of genes in the database (N),
Fisher's exact test is defined by equations 1 and 2, where H(m, k, n, N)
denotes the hypergeometric distribution.

   H(m, k, n, N) = \frac{\binom{k}{m} \binom{N-k}{n-m}}{\binom{N}{n}}      (1)

   p\text{-value} = \sum_{i=m}^{k} H(i, k, n, N)                           (2)

For the chi-squared test, we start from the contingency table and calculate
the expected value of the cell that counts the genes annotated to the term and
inside the input list as n \times \frac{k}{N}. We then construct an expected
contingency table by fixing the margin values k, n, and N and deriving the
other three cells from this expected value. Finally, we calculate the \chi^2
value with equation 3 and convert it to a p-value for 1 degree of freedom (a
2 by 2 table always has 1 degree of freedom).

   \chi^2 = \sum_{\text{all cells}}
            \frac{(\text{expected} - \text{observed})^2}{\text{expected}}  (3)

Each of these two methods has its own strengths and weaknesses. Fisher's exact
test can be applied when the number of input genes is small, and it provides
an exact p-value under the null hypothesis. However, when the sample is large
or the data are well balanced, Fisher's exact test becomes computationally
costly because of the factorial calculations involved. The chi-squared test,
on the other hand, can be applied to large data samples but gives only an
approximation of the significance. Both methods can be used to reject the null
hypothesis that the data are independent, i.e. that the input genes do not
enrich the ontology term.
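
The two tests defined by equations 1 to 3 can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the code of
the tool itself: the function names are hypothetical, only the standard
library is used, k and n are assumed to lie strictly between 0 and N (so that
no expected cell is zero), and the sum in equation 2 is stopped at min(k, n)
because all later terms are zero.

```python
from math import comb, erfc, sqrt


def hypergeom_pmf(i, k, n, N):
    """H(i, k, n, N) of equation 1: exactly i of the n input genes are annotated."""
    return comb(k, i) * comb(N - k, n - i) / comb(N, n)


def fisher_p_value(m, k, n, N):
    """Upper-tail sum of equation 2 (hypothetical helper name)."""
    # Terms with i > n vanish, so the sum can stop at min(k, n).
    return sum(hypergeom_pmf(i, k, n, N) for i in range(m, min(k, n) + 1))


def chi_squared_p_value(m, k, n, N):
    """Equation 3 on the 2 by 2 table, converted to a p-value for 1 degree of freedom."""
    observed = [m, k - m, n - m, (N - n) - (k - m)]
    e = n * k / N  # expected count for the "annotated and in input list" cell
    # The other expected cells follow from the fixed margins k, n and N.
    expected = [e, k - e, n - e, (N - n) - (k - e)]
    chi2 = sum((ex - ob) ** 2 / ex for ex, ob in zip(expected, observed))
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))
```

For instance, fisher_p_value(3, 5, 4, 10) adds H(3, 5, 4, 10) = 50/210 and
H(4, 5, 4, 10) = 5/210. The chi-squared variant avoids the binomial
coefficients entirely, which reflects the cost trade-off described above.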
After the user submits a gene list of interest, the server queries the
annotation data along the ontology graph, i.e. it transfers all annotations of
an ontology term to its parents so that indirect annotations are included in
the analysis. The final analysis result is shown in Fig. 1.

Fig. 2. A hair-ball style visualization of mega data
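
The transfer of annotations from a term to its parents, described above, can
be sketched as follows. This is a minimal illustration and not the server's
actual query: the function name, the two dictionaries, and their layout are
assumptions made for the example, and the ontology is assumed to be a directed
acyclic graph in which a term may have several parents.

```python
from collections import defaultdict


def propagate_annotations(direct, parents):
    """Return term -> genes annotated to the term directly or through a descendant.

    direct:  dict mapping a term id to the set of genes annotated directly to it
    parents: dict mapping a term id to the list of its parent term ids
    (both structures are hypothetical stand-ins for the annotation database)
    """
    propagated = defaultdict(set)
    for term, genes in direct.items():
        # Walk upward from the term and visit every ancestor exactly once,
        # even when it is reachable through several parents.
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            propagated[current] |= genes
            stack.extend(parents.get(current, []))
    return dict(propagated)
```

With the propagated sets, the count k for a term is the size of its gene set,
and m is the size of the intersection of that set with the input gene list.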


Fig. 1. Analysis result of a set of genes

III. ANALYSIS VISUALIZATION

Besides the detailed information about the enriched ontology terms, there are
two other kinds of information to explore. The first is the relationship among
the enriched terms and their corresponding significance levels; significance
here covers not only the p-values but also the number of input genes annotated
to a particular ontology term. The second is the relationship between the
enriched terms and the whole reference data. We apply two visualization
methods to help users perceive this information efficiently.

A. Enriched Ontology Branch Visualization

Biological ontology terms are organized in a hierarchical structure, i.e. each
ontology term inherits the properties of its parents and differs from its
siblings in some functionalities. Since each ontology term can have multiple
parents and siblings, studying the enriched ontology branch of a set of genes
helps biologists explore the potential functions associated with the genes and
can be used to find feature genes within that branch. To visualize the
enriched branch, we apply a hair-ball style visualization (similar to Fig. 2)
to all the ontology terms included in the database and highlight the ones that
are significant for our input genes.

B. Uncertainty Visualization

The hierarchical visualization of the analysis results (e.g. Gene Ontology
terms) is a common way to help users explore the biological meaning behind
gene lists [4]. Tree structure graphs (as in Fig. 3a) describe hierarchically
structured ontology terms well and are commonly used in analysis tools
([4], [5], [6]). However, these visualizations have some shortcomings. For
example, in Fig. 3(a), the terms of relatively low significance (the less red
ones) can be distracting if users only want to focus on the most significant
terms. In addition, the fixed cut-off p-values make the visualization
inflexible. It is therefore useful to treat the cut-off p-value as an
uncertainty factor and to visualize it. Users can then set a significance
range of interest, and the visualization re-arranges the focused terms toward
the center (as shown in Fig. 3b) so that biologists can study them easily. The
structural relationships between the terms and their calculated significance
levels are always preserved, so the hierarchical information remains correct.

Fig. 3. (a) Visualization from AgriGO; (b) uncertainty visualization with a
re-arranged layout

REFERENCES

[1] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment
    tools: paths toward the comprehensive functional analysis of large gene
    lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1-13, 2009.
[2] I. Rivals, L. Personnaz, L. Taing, and M.-C. Potier, "Enrichment or
    depletion of a GO category within a class of genes: which test?"
    Bioinformatics, vol. 23, no. 4, pp. 401-407, 2007.
[3] G. Mi, Y. Di, S. Emerson, J. S. Cumbie, and J. H. Chang, "Length bias
    correction in gene ontology enrichment analysis using logistic
    regression," PLoS ONE, vol. 7, no. 10, p. e46128, 2012.
[4] Z. Du, X. Zhou, Y. Ling, Z. Zhang, and Z. Su, "agriGO: a GO analysis
    toolkit for the agricultural community," Nucleic Acids Research,
    p. gkq310, 2010.
[5] E. Eden, R. Navon, I. Steinfeld, D. Lipson, and Z. Yakhini, "GOrilla: a
    tool for discovery and visualization of enriched GO terms in ranked gene
    lists," BMC Bioinformatics, vol. 10, no. 1, p. 48, 2009.
[6] S. Maere, K. Heymans, and M. Kuiper, "BiNGO: a Cytoscape plugin to assess
    overrepresentation of gene ontology categories in biological networks,"
    Bioinformatics, vol. 21, no. 16, pp. 3448-3449, 2005.

</pre>