GOfox: Semantics-based simplified hierarchical classification and interactive visualization to support GO enrichment analysis

Edison Ong, Yongqun He
University of Michigan, Ann Arbor, Michigan, USA

* To whom correspondence should be addressed: yongqunhe@umich.edu

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Gene Ontology (GO)-based statistical enrichment analysis is a popular approach to identify statistically enriched biological processes, molecular functions, and cellular components that are associated with a list of genes. However, such GO enrichment analysis often generates a large number of enriched GO terms that are difficult to interpret and analyze. To address this issue, we developed GOfox, a web tool that utilizes OWL-based ontology semantics and RDF triple store SPARQL queries to generate full or simplified hierarchical GO subsets to classify and display enriched GO terms. GOfox integrates and extends features from OntoFox and Ontobee, two ontology tools developed in our laboratory. GOfox also includes a newly developed algorithm for generating simplified hierarchical classification by considering the multiple inheritance of GO. Furthermore, GOfox provides an interactive visualization that supports GO subset tree exploration and term editing. GOfox is freely available at the website: http://gofox.hegroup.org/.

1 INTRODUCTION
A biological/biomedical ontology is a set of computer- and human-interpretable terms and relations that represent entities in a biological/biomedical domain and how they relate to each other. Hundreds of biological ontologies have been developed. The most widely used biological ontology is the Gene Ontology (GO), which systematically and semantically represents three major attributes associated with gene products: Biological Processes (BP), Molecular Function (MF), and Cellular Components (CC) (Ashburner et al., 2000). One major GO application is GO-based statistical enrichment analysis. The rationale of such an enrichment analysis is that, given a group of genes, the co-functioning genes should have a higher or enriched potential to be identified as a relevant group using high-throughput technologies (e.g., microarrays and RNA-Seq). Since hundreds (or even more) of enriched terms are often detected, the linear output of enriched terms can be very large and overwhelming, which dilutes the focus of the analysis on related terms.

To address the ever-increasing number of enriched GO terms resulting from high-throughput studies, we developed GOfox to support GO enrichment analysis through integrating and extending the features of OntoFox (Xiang et al., 2010) and Ontobee (Xiang et al., 2011). OntoFox is able to fetch ontology terms and axioms. OntoFox includes several semantics algorithms for extracting different levels of intermediate layer terms between user-selected terms and a top level term of the ontology (Xiang et al., 2010). Ontobee is the default OBO ontology linked data server that facilitates ontology data sharing, visualization, query, integration, and analysis (Xiang et al., 2011). Ontobee also supports ontology visualization, including the hierarchy, definitions, and annotations. By integrating and extending the features of OntoFox and Ontobee, GOfox is able to represent the enriched GO terms in an interactive hierarchical layout along with term-related information, and it allows users to manually modify the summarized enrichment result. Considering the multiple inheritance strategy used in GO development, GOfox includes a new algorithm to trim down the size of the enriched subset tree of GO. In addition, GOfox retrieves and displays related information, such as the definition, database cross references, and comments, of the selected GO term from Ontobee. This report provides the first introduction of GOfox to help researchers better visualize and analyze the results of GO gene enrichment studies.

2 GOFOX SYSTEM OVERALL DESIGN
The overall design and workflow is displayed in Fig. 1. Using the web form shown in Fig. 2, a user can input enriched GO terms, or other GO terms of interest, along with their P-values. The user can then define a P-value cutoff (or another cutoff) and how intermediates are treated. After receiving the user's request, the GOfox server extracts a subset of GO that contains the input terms and related GO terms using PHP, Java and SPARQL. Specifically, the server queries the He group's RDF triple store using SPARQL and retrieves a subset of GO. The query results are in RDF/XML format and are reformatted to the OWL format using the OWL API (http://owlapi.sourceforge.net/). Then, based on the user's preference, GOfox runs the simplification algorithm and generates results for downloading, visualization, and editing (Fig. 1). The results are temporarily stored in the He group's RDF triple store and deleted on a regular basis.

Fig. 1. GOfox program architecture and workflow design.

3 GOFOX NEW ALGORITHM (SIM) FOR PROCESSING ENRICHED GO TERMS
The new GOfox algorithm "Include Computed Simplified Intermediates" (SIM) is developed on the basis of the OntoFox "Include Computed Intermediates" (COM) algorithm. COM removes all intermediate GO terms that match the following rules: 1) the intermediate GO term is not included in the user's input; 2) the intermediate GO term has only one parent and one child GO term (Xiang et al., 2010). Although COM works well for most ontologies, it often does not generate ideal results for ontologies (e.g., GO) that have multiple inheritance. SIM is developed to resolve this issue.
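The COM rule can be illustrated with a minimal Python sketch. This is not the OntoFox or GOfox implementation: the function name com_simplify, the dictionary representation of the hierarchy, the placeholder term names, and the choice to re-attach the remaining child to the removed term's parent are all assumptions made for illustration.

```python
def com_simplify(parents, input_terms):
    """Collapse pass-through intermediates (hypothetical helper, not GOfox code).

    parents: dict mapping each term to the set of its direct 'is a' parents.
    input_terms: terms supplied by the user; these are never removed.
    """
    parents = {term: set(ps) for term, ps in parents.items()}
    keep = set(input_terms)
    changed = True
    while changed:  # repeat until no further term can be removed
        changed = False
        for term in list(parents):
            if term in keep:
                continue
            kids = [t for t, ps in parents.items() if term in ps]
            # COM rule: not in the input, exactly one parent and one child.
            if len(parents[term]) == 1 and len(kids) == 1:
                (parent,) = parents[term]
                # Assumption: the child is re-attached to the removed term's parent.
                parents[kids[0]].discard(term)
                parents[kids[0]].add(parent)
                del parents[term]
                changed = True
    return parents


# Placeholder term names, not real GO identifiers.
hierarchy = {
    "root": set(),
    "a": {"root"},
    "b": {"a"},
    "c": {"b"},          # chain: root <- a <- b <- c
    "p1": {"root"},
    "p2": {"root"},
    "x": {"p1", "p2"},   # x inherits from two parents
    "d": {"x"},
}
result = com_simplify(hierarchy, {"c", "d", "p1", "p2"})
```

In this example the single-parent intermediates a and b are collapsed, while x is retained even though it is not an input term, because it has two parents. This is the multiple-inheritance case in which COM leaves intermediates in place.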
SIM first goes through the COM steps, and the COM results are further simplified by selectively removing some intermediate terms that have multiple parents (i.e., multiple inheritance) based on the following three steps. First, SIM reformats the OWL-formatted results by removing indirect subclass relationships. For example, the subclass axiom: (regulates) some (transcription, DNA-templated) will be removed because the parent-child relationship is not a direct 'is a' relationship. Second, SIM removes intermediate GO terms that match the following rules: 1) the intermediate GO term is not included in the user's input; 2) the intermediate GO term has fewer than two child GO terms within the user's input list (Note: here we do not consider the one-parent condition as COM does). Third, SIM further trims down the list by removing the subclass relationships between the GO terms and the three GO top level terms of BP, CC, and MF. The requirements of the removal are: 1) the term is a direct subclass of BP, CC or MF; 2) there exists another direct subclass relationship between the GO term and a term other than the three GO top level terms.

While GOfox still keeps the COM algorithm for users to choose, the SIM algorithm provides an extra way of shortening the list of GO terms on display.

4 GOFOX FEATURES AND WEB INTERFACE
GOfox provides many features for generating hierarchical classification given a list of user-provided enriched GO terms. Fig. 2 provides a demo of how GOfox works. Specifically, a user can choose to type in GO terms or upload a text file as input. The user can provide a standard P-value or other P-values such as a false discovery rate adjusted P-value. A different value cutoff can also be used. The user can then select an intermediates retrieval setting, including COM, SIM, or all intermediates. GOfox will run after "Run GOfox" is clicked (Fig. 2A).

After the results are generated, GOfox provides an Ontobee-like term visualization interface (Fig. 2B). This feature is useful for biologists who are not familiar with using the Protégé OWL editor to display output files. The user can interactively explore the hierarchy of retrieved GO terms and also hide unwanted GO terms from the web page.

Fig. 2. GOfox website interface and output. (A) GOfox web-interface input form. (B) Standard GOfox SIM algorithm output.

5 AVAILABILITY AND LICENSE
GOfox is freely available at: http://gofox.hegroup.org/. The source code is released under the Apache License 2.0 on GitHub: https://github.com/ontoden/gofox.

6 SUMMARY
GOfox is a simplified hierarchical classification tool to help users interpret the results of GO enrichment analysis. GOfox addresses a critical issue, i.e., the difficulty of visualizing, selecting, and further analyzing the increasing number of enriched GO terms from popular GO enrichment analysis studies.

REFERENCES
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29.
Xiang, Z., Courtot, M., Brinkman, R.R., Ruttenberg, A., and He, Y. (2010). OntoFox: web-based support for ontology reuse. BMC Res Notes 3:175, 1-12.
Xiang, Z., Mungall, C., Ruttenberg, A., and He, Y. (2011). "Ontobee: A linked data server and browser for ontology terms", in: The 2nd International Conference on Biomedical Ontologies (ICBO): CEUR Workshop Proceedings, Pages 279-281 [http://ceur-ws.org/Vol-833/paper248.pdf].
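As a supplementary illustration of the three SIM steps described in Section 3, the following minimal Python sketch applies them to a small hand-made hierarchy. It is not the GOfox implementation. The function name sim_simplify, the (child, relation, parent) tuple format, the placeholder term names, the re-attachment of children to the removed term's parents, and the repetition of step 2 until nothing changes are all assumptions. Only the three top-level GO identifiers are real.

```python
# Real GO top-level identifiers: biological_process, cellular_component,
# molecular_function.
BP, CC, MF = "GO:0008150", "GO:0005575", "GO:0003674"
TOP_LEVEL = {BP, CC, MF}


def sim_simplify(axioms, input_terms):
    """Sketch of the SIM steps (hypothetical helper, not GOfox code).

    axioms: iterable of (child, relation, parent) tuples.
    input_terms: GO terms supplied by the user.
    Returns a dict mapping each remaining term to its set of direct parents.
    """
    inputs = set(input_terms)
    keep = inputs | TOP_LEVEL

    # Step 1: keep only direct 'is a' relationships.
    parents = {}
    for child, relation, parent in axioms:
        if relation == "is_a":
            parents.setdefault(child, set()).add(parent)
            parents.setdefault(parent, set())

    # Step 2: remove intermediates with fewer than two children in the input.
    changed = True
    while changed:  # assumption: repeat until stable
        changed = False
        for term in list(parents):
            if term in keep:
                continue
            kids = [t for t, ps in parents.items() if term in ps]
            if sum(k in inputs for k in kids) < 2:
                for k in kids:
                    # Assumption: children inherit the removed term's parents.
                    parents[k].discard(term)
                    parents[k] |= parents[term]
                del parents[term]
                changed = True

    # Step 3: drop links to BP/CC/MF when another direct parent exists.
    for ps in parents.values():
        if ps - TOP_LEVEL:
            ps -= TOP_LEVEL
    return parents


# Placeholder terms A-E and X are illustrative, not real GO identifiers.
axioms = [
    ("A", "is_a", BP), ("B", "is_a", A := "A"), ("C", "is_a", "A"),
    ("B", "is_a", BP),           # redundant link to a top-level term
    ("D", "is_a", BP), ("E", "is_a", "D"),
    ("C", "regulates", "X"),     # not a direct 'is a' relationship
]
simplified = sim_simplify(axioms, ["B", "C", "E"])
```

Here A is retained because two of its children (B and C) are input terms, D is removed because only one child (E) is an input term, and the direct link from B to the top-level BP term is dropped because B also has the non-top-level parent A.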