=Paper=
{{Paper
|id=Vol-1747/IP31_ICBO2016
|storemode=property
|title=Planteome Gene Annotation Enrichment Analysis
|pdfUrl=https://ceur-ws.org/Vol-1747/IP31_ICBO2016.pdf
|volume=Vol-1747
|authors=Botong Qu,Jaden Diefenbaugh,Eugene Zhang,Justin Elser,Pankaj Jaiswal,Seth Carbon,Christopher Mungall
|dblpUrl=https://dblp.org/rec/conf/icbo/QuDZEJCM16
}}
==Planteome Gene Annotation Enrichment Analysis ==
Botong Qu, Justin Elser, Seth Carbon, Jaden Diefenbaugh, Pankaj Jaiswal, Chris Mungall, Eugene Zhang

Department of Botany and Plant Pathology, Oregon State University, Corvallis, Oregon 97331

Lawrence Berkeley National Laboratory, Berkeley, CA 94720

School of Electrical and Computer Engineering, Oregon State University, Corvallis, Oregon 97331

Abstract: Annotation enrichment analysis of a gene list helps biologists identify the potential biological functions associated with it. As the plant ontology categories are extended, the discovery of significant ontology terms associated with a gene list becomes more and more informative. We introduce a tool that helps biologists find these terms based on the expanding ontology database of the Planteome project. In addition, we propose some new visualization schemes to help users construct a meaningful interpretation of the results guided by the ontology tree.

Keywords: Ontology, Plants, Term enrichment analysis, Visualization

===I. Introduction===
Gene annotations are analyzed and explored by gene curators from all over the world. Finding and visualizing the useful information in the annotations has been an active research topic for decades. The Common Reference Ontologies and Applications for Plant Biology enables biologists to discover enriched biological ontology terms among all provided ontologies (Gene Ontology, Plant Ontology, Trait Ontology, Environment Ontology, etc.). To assist this analysis process, we provide a gene annotation enrichment analysis tool which uses Fisher's exact and chi-squared methods to statistically analyze all annotation data. We then visualize the results in two ways: 1) highlighting the enriched terms among all ontology terms in the database to emphasize the relative positions of the enriched terms; 2) treating the cut-off p-value as the basis of an uncertainty factor when visualizing the tree structure, in order to conveniently focus on the interesting terms.

===II. Analysis Model and Methods===
In our tool, we provide two common analysis methods to find the enriched terms: Fisher's exact test and the chi-squared test [1]. To apply these statistical analysis methods, a contingency table must be formulated. In our system, we create the contingency table (Table I) similar to the ones used in [2] and [3]. For one specific ontology term and n input genes, all genes in the database (N) are classified into four categories: the genes annotated to the term and in the input gene list (m), the genes not annotated to the term and in the input gene list (n − m), the genes annotated to the term and not in the input gene list (k − m), and the genes not annotated to the term and not in the input gene list (N − n − k + m). This 2 by 2 table is the contingency table used to calculate the p-value for each term. If the p-value is greater than the user-chosen cut-off value (0.01 or 0.05), the term is not enriched by the gene list.

{| class="wikitable"
|+ Table I. The contingency table for an ontology term
! !! Input genes !! Not input genes !! Sum (Ref)
|-
! Annotated
| m || k − m || k
|-
! Not annotated
| n − m || (N − n) − (k − m) || N − k
|-
! Sum
| n || N − n || N
|}

With the number of genes annotated to the term inside the gene list (m), the total number of genes annotated to the term in the whole database (k), the number of input genes (n) and the total number of genes in the database (N), Fisher's exact test is defined by equations 1 and 2. H(m, k, n, N) represents the hypergeometric distribution.

:<math>H(m, k, n, N) = \frac{\binom{k}{m}\binom{N-k}{n-m}}{\binom{N}{n}}</math> (1)

:<math>p\text{-value} = \sum_{i=m}^{k} H(i, k, n, N)</math> (2)

Based on the contingency table, we calculate the expected value of the cell that represents the number of genes annotated to the term and inside the input list as <math>n \times \frac{k}{N}</math>. We then construct an expected contingency table by fixing the margin values k, n, and N and using the calculated expected value to derive the other three cells. Next we calculate the <math>\chi^2</math> value with equation 3 and convert it to a p-value for 1 degree of freedom (a 2 by 2 table always has 1 degree of freedom).

:<math>\chi^2 = \sum_{\text{all cells}} \frac{(\text{expected} - \text{observed})^2}{\text{expected}}</math> (3)

Each of these two methods has its own strengths and weaknesses. Fisher's exact test can be applied when the number of input genes is small, and it provides an exact calculation of significance under the null hypothesis. However, when the sample is large or the data are well balanced, Fisher's exact test becomes computationally costly because of the factorial calculations involved. The chi-squared test, on the other hand, can be applied to large data samples but gives only an approximation of the significance. Both methods can be used to reject the null hypothesis that the data are independent, i.e. that the input genes do not enrich the ontology term.

After the user submits a gene list of interest, the server queries the annotation data along the ontology graph, i.e. it transfers all annotations of an ontology term to its parents so that indirect annotations are included in the analysis. The final analysis result is shown in Fig. 1.

Fig. 1. Analysis result of a set of genes

===III. Analysis Visualization===
Besides the detailed information about enriched ontology terms, there are two other kinds of information to be explored. The first is the relationship among enriched terms and their corresponding significance levels; the significance levels are not limited to the p-values but also include the number of input genes annotated to a particular ontology term. The second is the relationship between the enriched terms and the whole reference data. We apply two visualization methods to help users perceive this information efficiently.

====A. Enriched Ontology Branch Visualization====
Biological ontology terms are always organized in a hierarchical structure, i.e. each ontology term inherits the properties of its parents and differs from its siblings in some functionalities. Since each ontology term can have multiple parents and siblings, studying the enriched ontology branch of a set of genes helps biologists explore the potential functions associated with the genes and can be applied to find feature genes in it. To visualize the enriched branch, we apply a hair-ball style visualization (similar to Fig. 2) to all the ontology terms included in the database and highlight the ones that are significant to our input genes.

Fig. 2. A hair-ball style visualization of mega data

====B. Uncertainty Visualization====
The hierarchical visualization of the analysis results (e.g. Gene Ontology terms) is a common method to help users explore the biological meanings behind the gene lists [4]. Tree structure graphs (as Fig. 3a shows) describe hierarchically structured ontology terms well and are commonly used in analysis tools ([4], [5], [6]). However, these visualization results have some shortcomings. For example, in Fig. 3(a), the terms of relatively low significance (the less red ones) can be distracting if users only want to focus on the most significant terms. Also, the fixed cut-off p-values make the visualization results inflexible. Therefore, it would be useful to treat the cut-off p-value as an uncertainty factor and graph it. In this way, users can set a significance range of interest, and the visualization then re-arranges the focused terms to the center (as shown in Fig. 3b) so that biologists can study them easily. The structural relationships between the terms and the calculated significance levels are always preserved to provide correct hierarchical information.

Fig. 3. (a) Visualization from AgriGO; (b) uncertainty visualization with re-arranged layout

===References===
[1] D. W. Huang, B. T. Sherman, and R. A. Lempicki, "Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists," Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.

[2] I. Rivals, L. Personnaz, L. Taing, and M.-C. Potier, "Enrichment or depletion of a GO category within a class of genes: which test?" Bioinformatics, vol. 23, no. 4, pp. 401–407, 2007.

[3] G. Mi, Y. Di, S. Emerson, J. S. Cumbie, and J. H. Chang, "Length bias correction in Gene Ontology enrichment analysis using logistic regression," PLoS ONE, vol. 7, no. 10, p. e46128, 2012.

[4] Z. Du, X. Zhou, Y. Ling, Z. Zhang, and Z. Su, "agriGO: a GO analysis toolkit for the agricultural community," Nucleic Acids Research, p. gkq310, 2010.

[5] E. Eden, R. Navon, I. Steinfeld, D. Lipson, and Z. Yakhini, "GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists," BMC Bioinformatics, vol. 10, no. 1, p. 48, 2009.

[6] S. Maere, K. Heymans, and M. Kuiper, "BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks," Bioinformatics, vol. 21, no. 16, pp. 3448–3449, 2005.
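
The two tests described in the analysis methods section (equations 1 to 3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names are hypothetical, the counts in the usage example are made up, and only the Python standard library is used. The symbols m, k, n and N have the same meaning as in the contingency table.

```python
from math import comb, erfc, sqrt


def hypergeom_pmf(i, k, n, N):
    """H(i, k, n, N) from equation 1: probability that exactly i of the
    n input genes are among the k genes annotated to the term."""
    return comb(k, i) * comb(N - k, n - i) / comb(N, n)


def fisher_p_value(m, k, n, N):
    """One-sided p-value from equation 2: probability of seeing m or more
    annotated genes in the input list if annotation and input are independent."""
    # H(i, ...) is zero for i > n, so summing to min(k, n) equals summing to k.
    return sum(hypergeom_pmf(i, k, n, N) for i in range(m, min(k, n) + 1))


def chi_squared_p_value(m, k, n, N):
    """Chi-squared statistic from equation 3 and its p-value for 1 degree
    of freedom. Assumes 0 < k < N and 0 < n < N so no expected cell is zero."""
    observed = [m, k - m, n - m, (N - n) - (k - m)]
    e = n * k / N  # expected number of annotated input genes
    # The other three expected cells follow from the fixed margins k, n, N.
    expected = [e, k - e, n - e, (N - n) - (k - e)]
    chi2 = sum((exp - obs) ** 2 / exp for exp, obs in zip(expected, observed))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    m, k, n, N = 3, 5, 6, 20  # made-up counts, for illustration only
    print("Fisher p-value:", fisher_p_value(m, k, n, N))
    chi2, p = chi_squared_p_value(m, k, n, N)
    print("chi-squared:", chi2, "p-value:", p)
```

With these toy counts both p-values lie above 0.05, so the term would not be reported as enriched at either cut-off. Note that the chi-squared statistic is two-sided (it also responds to depletion), whereas equation 2 is a one-sided tail sum, so the two p-values are not directly interchangeable.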