Semantic integration of the genome annotations Toshiaki Katayama1, Shinobu Okamoto1, Shuichi Kawashima1, Hiroshi Mori2, and Takatomo Fujisawa3 1 Database Center for Life Science, Research Organization of Information and Systems , Tokyo, Japan ktym@dbcls.jp, {so,kwsm}@dbcls.rois.ac.jp 2 Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, Tokyo, Japan hmori@bio.titech.ac.jp 3 National Institute of Genetics, Research Organization of Information and Systems, Mishima, Japan tf@nig.ac.jp Abstract. Integration of the genome annotations is gaining more importance to interpret the biological meanings of large scale sequence data produced by the new sequencing technologies. Because existing annotations from public databases and literatures varies in terms of both categories and species, the Semantic Web technology has a great advantage for accumulating those wide- ranging information without having difficulty in the integration process. During the BioHackathon 2012, a new ontology to define the location of the annotations was defined as the Feature Annotation Location Description Ontology (FALDO). This ontology will be used to integrate annotations from INSDC (DDBJ/EMBL/GenBank) and UniProt databases, GFF3 formatted files, and many other bioinformatics resources. In parallel, Database Center for Life Science (DBCLS) and DNA Data Bank of Japan (DDBJ) has been jointly developing a RDF-based genome database which consists of the five layers. 1) RDF-based annotation data store, 2) SPARQL-based query engine, 3) RESTful API to retrieve genomic information, 4) HTML/CSS-based reusable web components and 5) integrated Web user interface. Our proposal is to standardize those layers so that every researcher can jointly use and/or update distributed genome annotations. Keywords: genome annotation, database integration, semantic web, ontology Introduction Reliable genome annotations are essential for understanding the existing and newly sequenced organisms. A number of model organism databases have been already developed for from human to pathogens including prokaryotes. However, thanks to the rapid evolution of the next generation sequencers, the number and volume of the 2 Toshiaki Katayama1, Shinobu Okamoto1, Shuichi Kawashima1, Hiroshi Mori2, and Takatomo Fujisawa3 sequenced organisms are growing exponentially. To understand the meaning of this vast amount of data, the importance of the reference annotation database is increasing. Even with the existence of many public databases, accumulation of those annotations is a difficult task because 1) annotations are still hidden in the literatures for most organisms and annotations of genes often require experimental verifications, 2) data formats used for the annotations were not always standardized, 3) there are no standard system to annotate any regions on the genomic sequence with a flexible yet controlled manner. The first problem requires text mining technologies and collaboration with biologists, the latter two issues can be resolved if our community agreed on a common standard. One of those efforts is the Generic Feature Format (GFF) and the Distributed Annotation System (BioDAS) which were introduced by the Generic Model Organism Database (GMOD) projects initially developed for the WormBase, FlyBase and some other model organism databases. Although the GFF format and the BioDAS protocol have been considered as standards for sharing genomic annotations, there still are some limitations. One is the lack of semantics in the GFF format which brought local variations such as the Gene Transfer Format (GTF). Also, especially for curators, it is very difficult to use the GFF format for describing sequence features not directly linked with the Sequence Ontology (SO) terms, or the semantic relations among sequence features such as interactions or regulations. Results To describe the missing semantics in the existing genome annotations or to add new heterogeneous annotations, we introduced the Semantic Web technology. It utilizes ontologies consisting of controlled vocabularies and their semantic relations, for describing objects to be annotated. Initially, we gathered reference knowledge from existing public data sources such as RefSeq, GFF3, and UniProt. We also merged annotations for prokaryote genomes from GTPS and MBGD databases and annotations of animal genomes from the H-inv database. All of those information are converted into RDF format and stored in a dedicated triple store. Along with this development, we are also developing a common genome annotation ontology. During the NBDC/DBCLS BioHackathon 2012 held in Japan, developers of life science databases and applications gathered and agreed on to develop new ontology for describing locations of the objects. As a result, the Annotation Location Description Ontology (FALDO) was proposed and we converted the locations and positions of all annotations stored in our triple store to comply with this new standard. Our proposed new RDF-based genome database is consisted of the five layers. 1) RDF-based annotation data store, 2) SPARQL-based query engine, 3) RESTful API to retrieve genomic information, 4) HTML/CSS-based reusable web components and 5) integrated Web user interface (Figure 1). To make the database generic as much as possible for any organisms, we introduced the following design principles. First, to identify any object in the database, data provider must assign a unique URI for any object in the dataset. However, designing well-formatted URIs for every heterogeneous object without any confusion is a difficult task. Therefore, for this Semantic integration of the genome annotations 3 purpose, we recommend to use unique URIs based on the Universally Unique Identifiers (UUIDs). Because a UUID can be independently and locally generated which will not collapse with the other ID in theory. One might think blank nodes in the RDF can be also applicable for this purpose, however, the use of UUID-based URIs can assure that the annotation for any object can be globally identifiable. Second, we recommend to use existing ontologies such as Sequence Ontology (SO) and FALDO to identify the type and location of the annotated object on the sequence. Third, it is recommended to develop a new ontology for each dataset which describes data source specific annotations such as the Feature/Qualifier used in the INSDC databases, specific tags in the H-inv database or the GFF3O ontology for the GFF3 data format. Then, each ontology should be linked with the new common genome annotation ontology so that users can use existing reasoning tools to generate interoperable triples which are subjected to be consumed by the genome database interface. Figure 1. Layers of the RDF-based semantic genome database. Discussions We are still in a very early stage of the development. Therefore, we are inviting collaborators for standardizing each of the five layers within the life science community so that every researcher can jointly use and/or update distributed genome annotation databases.