ICBO 2014 Proceedings

Ontobat: An Ontology-based Semantic Web Approach for Linked Data Processing and Analysis

Zuoshuang Xiang, Yu Lin, Yongqun He*
Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, and Comprehensive Cancer Center, University of Michigan Medical School, Ann Arbor, MI 48109, USA

Abstract: The Linked (Open) Data (LD/LOD) strategy extends the Web by publishing various open datasets as RDF links on the Web. To support linked data query and analysis, we developed Ontobat, a Semantic Web strategy for automatic generation of linked data RDFs using ontology formats, data uploading to an RDF triple store, SPARQL query, browsing, and statistical data analysis. This report introduces the rationale, design, and preliminary implementation of the Ontobat system (http://ontobat.hegroup.org).

Keywords: Ontobat; ontology; Semantic Web; LOD

I. INTRODUCTION

Ontologies are one of the major components of the Semantic Web and Linked Data movements. The Semantic Web enables machines to understand the meaning of information on the Web. The Linked Open Data (LOD) community aims to extend the Web by publishing various open datasets as Resource Description Framework (RDF) links on the Web. These RDF links between data items can come from different data sources and be accessed anywhere online [1]. Existing LOD data are primarily instance data. Ontologies provide classifications and relations among these instance data.

To support LOD data query and analysis, we have started to develop Ontobat (http://ontobat.hegroup.org), a web-based biodata analysis tool that utilizes ontology-based Semantic Web methods. Ontobat is developed to support LOD data generation, upload, query, browsing, and statistical analysis. In Ontobat, all RDF/OWL-based LOD data are generated based on reliable existing ontologies such as the OBO Foundry ontologies [2]. This report provides the first introduction of the Ontobat system design and development.

II. ONTOBAT SYSTEM DESIGN

Ontobat is designed to be an integrative system including several components (Fig. 1):

Ontovert supports efficient conversion of instance data from tab-delimited text or MS Excel format to an ontology format using the Web Ontology Language (OWL).

Ontoload loads instance data to an RDF triple store.

The RDF triple stores can be developed using different systems, such as the open-source Virtuoso platform as implemented in our Hegroup RDF triple store [3].

Lodquery provides RDF data query functions based on the SPARQL Protocol and RDF Query Language. A user-friendly web interface is usually required.

Lodbee supports the browsing and dereferencing of LOD data. The LOD movement requires the usage of URIs to denote things and these URIs to be referred to and looked up (i.e., "dereferenced") by people and user agents [4]. Ontobee uniquely dereferences and presents ontology term URIs with a user-friendly HTML web display while providing RDF source code for remote Semantic Web query by software applications [3]. To support LOD data dereferencing and query, Lodbee adopts the Ontobee technology for representing instance data stored in LOD RDF triple stores.

Ontostat provides statistical analysis of RDF-based LOD data, using open source software programs such as R-Sparql (http://code.google.com/p/r-sparql/), which runs SPARQL queries inside R and stores the results as an R data frame.

[Figure 1: workflow diagram. Box labels: "Ontovert: convert instance data to RDF/XML format"; "Ontoload: upload data to RDF triple store"; "RDF triple stores"; "Lodquery: LOD data SPARQL query"; "Lodbee: LOD data display and RDF source generation"; "Ontostat: LOD data statistical analysis (e.g., meta-analysis)". Arrow label: "run R Sparql".]

Fig. 1. Ontobat components and workflow design. The Ontobat will store instance RDF data formatted based on OWL ontologies. The RDF data comes from automatic data conversion and loading. The data can be visualized by Lodbee and queried by Lodquery. Statistical tools will be developed under Ontostat. Statistical results can also be uploaded to an RDF triple store.

III. CURRENT ONTOBAT DEVELOPMENT

Since the Ontobat system contains many components, we do not expect to develop all the programs simultaneously. Our development strategy is to implement one program at a time and later integrate all programs together.

Currently, a prototype Ontobat program called Ontovert (http://ontobat.hegroup.org/ontovert/) has been developed (Fig. 2). The basic idea of Ontovert is to use the first row (or header) to list ontology class term URIs, and use other rows to represent data as instances of the class terms listed in the first row. The Ontovert web page provides an example tab-delimited data set extracted from a vaccine protection meta-analysis study [5]. The first row of the tab-delimited input data lists term IDs from the Vaccine Ontology (VO) [6]. After the VO is selected and the data is provided, the Ontovert program generates an OWL output file that specifies the instance data as named individuals of the VO terms. The relations of the VO terms are specified in VO and can be retrieved using the tool OntoFox [7]. The OntoFox feature is not yet implemented in Ontovert. However, the Ontovert and OntoFox OWL output files can then be merged to show the output results seen in Fig. 2.

[Figure 2: screenshot of the Ontovert example. Labels: "As OntoFox input"; "OntoFox output"; "Merge two OWL files"; "VO_0001203".]

Fig. 2. Ontovert example. The output shows "42" days, an instance data of the VO class 'vaccination-challenge interval in days' (VO_0001203). See the text for more explanation.

A prototype Lodquery has also been established (http://ontobat.hegroup.org/lodquery). Lodquery uses the Hegroup RDF triple store [3] as the default triple store. The other programs listed in Fig. 1 (e.g., Ontoquery and Ontostat) are still under development.

To show the usage of the Semantic Web in solving scientific questions in a specific domain, we have developed an Ontobat program, OntoCOG (http://ontobat.hegroup.org/ontocog) [8]. OntoCOG demonstrates how we use the Semantic Web approach to support statistical enrichment analysis of the Clusters of Orthologous Groups of proteins (COGs) [8].

IV. DISCUSSION

Ontobat is an ontology-based Semantic Web system primarily targeting ontology-based instance data processing and analysis. The reliance on ontology for instance RDF data generation is reflected in our Ontovert example (Fig. 2). The usage of reliable ontologies for RDF/OWL data generation provides a feasible way for data integration and sharing, and it supports consistent and integrative data analysis.

The Fig. 2 example originated from a previous study that modeled an Analysis of Variance (ANOVA) statistical analysis using the framework of the Ontology for Biomedical Investigations (OBI) [9]. To make Ontovert function more efficiently, the OntoFox feature as shown in the Fig. 2 use case can be incorporated into the Ontovert program. Furthermore, the ANOVA analysis feature can be implemented in the Ontostat program in Ontobat. The Ontology of Biological and Clinical Statistics (OBCS) is a newly reported ontology that aligns with OBI and supports semantic biostatistics analysis [10]. Ontostat may use OBCS as the backend ontology for enhanced statistical analysis.

While Ontobat is still in its early development stage, we would like to demonstrate the Ontobat design strategy and discuss the program design and implementation issues with researchers at the ICBO-2014 conference.

ACKNOWLEDGMENT

This research was supported by NIH grant R01AI081062.

REFERENCES

[1] T. Berners-Lee. (2009). Design Issues: Linked Data. Available: http://www.w3.org/DesignIssues/LinkedData
[2] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, et al., "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration," Nat Biotechnol, vol. 25, pp. 1251-5, Nov 2007.
[3] Z. Xiang, C. Mungall, A. Ruttenberg, and Y. He, "Ontobee: A linked data server and browser for ontology terms," in The 2nd International Conference on Biomedical Ontologies (ICBO), Buffalo, NY, USA, 2011, pp. 279-281 [http://ceur-ws.org/Vol-833/paper48.pdf].
[4] R. Lewis. (2007, Nov 13). Dereferencing HTTP URIs. Available: http://www.w3.org/2001/tag/doc/httpRange-14/2007-05-31/HttpRange-14
[5] T. E. Todd, O. Tibi, Y. Lin, S. Sayers, D. N. Bronner, Z. Xiang, et al., "Meta-analysis of variables affecting mouse protection efficacy of whole organism Brucella vaccines and vaccine candidates," BMC Bioinformatics, vol. 14 Suppl 6, p. S3, 2013.
[6] Y. He, L. Cowell, A. D. Diehl, H. L. Mobley, B. Peters, A. Ruttenberg, et al., "VO: Vaccine Ontology," in The 1st International Conference on Biomedical Ontology (ICBO-2009), Buffalo, NY, 2009, URL: http://precedings.nature.com/documents/3552/version/1.
[7] Z. Xiang, M. Courtot, R. R. Brinkman, A. Ruttenberg, and Y. He, "OntoFox: web-based support for ontology reuse," BMC Res Notes, vol. 3, p. 175, 2010.
[8] Y. Lin, Z. Xiang, and Y. He, "Towards a Semantic Web application: Ontology-driven ortholog clustering analysis," Proceedings of the second International Conference on Biomedical Ontologies (ICBO), University at Buffalo, NY, July 26-30, 2011, pp. 33-40.
[9] Y. He, Z. Xiang, T. Todd, M. Courtot, R. R. Brinkman, J. Zheng, et al., "Ontology representation and ANOVA analysis of vaccine protection investigation," in Bio-Ontologies 2010: Semantic Applications in Life Sciences, Boston, MA, USA, 2010, pp. 1-8 [http://ceur-ws.org/Vol-754/he_krmed2010.pdf].
[10] J. Zheng, M. R. Harris, A. M. Masci, Y. Lin, A. Hero, B. Smith, et al., "OBCS: The Ontology of Biological and Clinical Statistics," in The 2014 International Conference on Biomedical Ontologies (ICBO 2014), Houston, TX, USA, 2014, pp. 1-6.

Ontobat: An Ontology-based Semantic Web Approach for Linked Data Processing and Analysis
Zuoshuang "Allen" Xiang, Yu "Asiyah" Lin, and Yongqun "Oliver" He
University of Michigan Medical School, Ann Arbor, MI 48109, USA
Tel: (734) 615 8231; yongqunh@umich.edu; http://www.hegroup.org
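The Ontovert convention described under Current Ontobat Development (the header row lists ontology class term URIs, and every other row supplies instances of those classes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Ontovert implementation: the function names, the `http://example.org/instance/` namespace, the row/column naming of individuals, the use of `rdfs:label` to carry the cell value, and the N-Triples output (rather than the OWL file that Ontovert generates) are all hypothetical choices. The class URI in the example assumes the usual OBO Foundry PURL pattern for VO_0001203.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_NAMED_INDIVIDUAL = "http://www.w3.org/2002/07/owl#NamedIndividual"


def literal(value):
    """Quote a cell value as an N-Triples string literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def tsv_to_triples(tsv_text, instance_base="http://example.org/instance/"):
    """Turn tab-delimited text into (subject, predicate, object) triples.

    The header row holds one ontology class URI per column. Every
    non-empty cell below it becomes a named individual typed with the
    class of its column. instance_base is a hypothetical namespace.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return []
    class_uris = [cell.strip() for cell in rows[0]]
    triples = []
    for row_no, row in enumerate(rows[1:], start=1):
        for col_no, value in enumerate(row[: len(class_uris)], start=1):
            value = value.strip()
            class_uri = class_uris[col_no - 1]
            if not value or not class_uri:
                continue  # skip empty cells and unlabeled columns
            subject = f"<{instance_base}r{row_no}c{col_no}>"
            triples.append((subject, f"<{RDF_TYPE}>", f"<{OWL_NAMED_INDIVIDUAL}>"))
            triples.append((subject, f"<{RDF_TYPE}>", f"<{class_uri}>"))
            triples.append((subject, f"<{RDFS_LABEL}>", literal(value)))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


# One column (the VO class from Fig. 2) and one data row with the value 42.
example = "http://purl.obolibrary.org/obo/VO_0001203\n42\n"
print(to_ntriples(tsv_to_triples(example)), end="")
```

N-Triples was chosen here only because it needs no RDF library; a file in this form can be loaded by common triple stores or converted to RDF/XML with standard tools.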
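The query step that Lodquery and Ontostat rely on (run a SPARQL query, then hold the results in a table-like structure, as R-Sparql does with an R data frame) can be sketched without contacting any endpoint. The sketch below builds a SELECT query for the individuals of one class and flattens a result document in the standard SPARQL 1.1 JSON results format into a list of rows. The function names, the assumption that each individual carries an `rdfs:label`, and the hand-written sample payload are hypothetical; nothing here reflects the actual Lodquery or Ontostat code, and Python stands in for R.

```python
import json


def build_instance_query(class_uri, limit=100):
    """Return a SPARQL SELECT query listing individuals of class_uri."""
    if any(ch in class_uri for ch in '<>" {}|\\^`'):
        raise ValueError("class_uri is not a plain absolute URI")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?individual ?label WHERE {\n"
        f"  ?individual a <{class_uri}> .\n"
        "  OPTIONAL { ?individual rdfs:label ?label }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def rows_from_sparql_json(text):
    """Flatten a SPARQL 1.1 JSON results document into a list of dicts.

    Each dict maps a variable name to the bound value as a string;
    variables left unbound in a solution map to None.
    """
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in doc["results"]["bindings"]
    ]


# Hand-written sample response (hypothetical data, not from a real store).
SAMPLE = json.dumps({
    "head": {"vars": ["individual", "label"]},
    "results": {"bindings": [
        {"individual": {"type": "uri", "value": "http://example.org/instance/r1c1"},
         "label": {"type": "literal", "value": "42"}},
        {"individual": {"type": "uri", "value": "http://example.org/instance/r2c1"}},
    ]},
})

print(build_instance_query("http://purl.obolibrary.org/obo/VO_0001203", limit=5))
for row in rows_from_sparql_json(SAMPLE):
    print(row)
```

In practice the query string would be sent to a SPARQL endpoint over HTTP and the JSON response passed to `rows_from_sparql_json`; the resulting rows can then be handed to whatever statistics package is in use.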