=Paper=
{{Paper
|id=Vol-1320/paper_26
|storemode=property
|title=Prototype Implementation of SPARQL Builder for Life-science Databases by Intelligent Schema Analysis on RDF datasets
|pdfUrl=https://ceur-ws.org/Vol-1320/paper_26.pdf
|volume=Vol-1320
|dblpUrl=https://dblp.org/rec/conf/swat4ls/KobayashiLWKY14
}}
==Prototype Implementation of SPARQL Builder for Life-science Databases by Intelligent Schema Analysis on RDF datasets==
Norio Kobayashi<sup>1</sup>, Kai Lenz<sup>1</sup>, Hongyan Wu<sup>2</sup>, Kouji Kozaki<sup>3</sup>, and Atsuko Yamaguchi<sup>2</sup>

<sup>1</sup> Advanced Center for Computing and Communication (ACCC), RIKEN, 2-1 Hirosawa, Wako, Saitama, 351-0198 Japan, {norio.kobayashi, kai.lenz}@riken.jp

<sup>2</sup> Database Center for Life Science (DBCLS), Research Organization of Information and Systems, 178-4-4 Wakashiba, Kashiwa, Chiba, 277-0871 Japan, {wu, atsuko}@dbcls.rois.ac.jp

<sup>3</sup> The Institute of Scientific and Industrial Research (ISIR), Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047 Japan, kozaki@ei.sanken.osaka-u.ac.jp

Abstract. Publishing metadata as semantic web databases is a growing trend for providing and integrating diverse life-science data. These metadata are published through SPARQL endpoints, a standardised API for RDF datasets. Because life-science data are highly diverse and are described using various ontologies and data classes, writing an efficient SPARQL query for a SPARQL endpoint is difficult for biologists. To address this problem, we propose an intelligent SPARQL query builder that enables users to build a query without knowledge of SPARQL or the data schema. We have developed a prototype version of the SPARQL builder that is accessible via users' web browsers. The system crawls SPARQL endpoints in advance to analyse the data schema of large amounts of data, and the resultant crawled data are stored as RDF datasets. This paper focuses on the implementation, including the system overview and the data structure of the resultant crawled data.

Keywords: SPARQL, RDF schema, metadata of RDF datasets, life-science databases

===1 Introduction===
With the development of life-science research fields and measurement technologies for biological phenomena, the diversity of research data has been increasing.
For efficient circulation, intelligent analysis and integration of such heterogeneous data, semantic web technologies including RDF and SPARQL have been adopted, and life-science metadata datasets have already been published as SPARQL endpoints such as the European Bioinformatics Institute (EBI) RDF platform [1], Bio2RDF [2] and BioPortal [3]. However, because these metadata in RDF are described using specialised ontology terms or data classes from subdivided research fields, writing an efficient SPARQL query, which requires a complete understanding of the data schema of the RDF metadata, is a difficult task for biologists as well as bio-informaticians. Efforts have been made to make building a SPARQL query easier; for instance, many SPARQL endpoints provide typical example queries and figures of data schemata. However, these aids are not enough to cover the wide-ranging interests of biological researchers.

To address this problem, we propose an intelligent web tool named SPARQL Builder that enables users to build a SPARQL query without understanding the RDF data schema or SPARQL. We have implemented a prototype version of SPARQL Builder that enables users to build a SPARQL query for existing life-science SPARQL endpoints, including EBI's service. This paper reports on the implementation of the prototype system.

===2 System overview===
SPARQL Builder is an intelligent tool that assists a user with no knowledge of SPARQL in generating a query on the basis of a triple path. To be more precise, a triple path <math>i_1 \xrightarrow{p_1} i_2 \xrightarrow{p_2} \cdots \xrightarrow{p_n} i_{n+1}</math>, where <math>n \geq 1</math> (<math>n = 3</math> is our default), is a sequence of instances <math>i_1, i_2, \ldots, i_{n+1}</math> of classes <math>C_1, C_2, \ldots, C_{n+1}</math> respectively, connected by properties <math>p_1, p_2, \ldots, p_n</math>. A list <math>C_1, C_2, \ldots, C_{n+1}</math> of classes is called a class path if a triple path for the list exists.
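As an illustration of this definition, a class path can be turned into a SPARQL graph pattern whose solutions are triple paths. The following is a minimal sketch in Python under stated assumptions: the function name `build_triple_path_query`, the `limit` parameter and the example class IRIs are hypothetical, and the query shape is one plausible encoding of the definition, not necessarily the query that SPARQL Builder itself generates.

```python
from typing import List


def build_triple_path_query(class_path: List[str], limit: int = 100) -> str:
    """Return a SPARQL SELECT query whose solutions are triple paths
    i1 -p1-> i2 -p2-> ... -pn-> i(n+1) for the given class path.

    Hypothetical helper for illustration only; class_path holds class IRIs.
    """
    if len(class_path) < 2:
        raise ValueError("a class path needs at least two classes (n >= 1)")
    n = len(class_path) - 1  # number of properties in the path
    variables = []
    patterns = []
    for k, class_iri in enumerate(class_path, start=1):
        variables.append(f"?i{k}")
        # Instance i_k must be an instance of class C_k.
        patterns.append(f"?i{k} a <{class_iri}> .")
        if k <= n:
            variables.append(f"?p{k}")
            # Property p_k connects instance i_k to instance i_(k+1).
            patterns.append(f"?i{k} ?p{k} ?i{k + 1} .")
    body = "\n  ".join(patterns)
    return (
        f"SELECT {' '.join(variables)}\n"
        f"WHERE {{\n  {body}\n}}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # The IRIs below are placeholders, not classes from a real endpoint.
    print(build_triple_path_query(
        ["http://example.org/Gene", "http://example.org/Protein"]
    ))
```

Leaving each property as a variable (`?p1`, `?p2`, and so on) keeps the sketch faithful to the definition, in which a class path fixes only the classes; a real tool might instead restrict each property to those recorded for the corresponding pair of classes.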
When a user specifies a SPARQL endpoint, a start class <math>C_1</math> and an end class <math>C_{n+1}</math>, the system analyses the metadata of the SPARQL endpoint obtained by a crawler in advance (cf. Section 3) and displays the possible class paths <math>C_1, C_2, \ldots, C_{n+1}</math>. The user then selects a class path, and the system generates a SPARQL query that searches for a triple path corresponding to the selected class path.

Figure 1 shows part of a screen capture of the SPARQL Builder client running in a web browser. Our prototype system is implemented as a Java servlet, and a user accesses it through a client written in JavaScript that runs in a web browser. To obtain the possible class paths between the user's start and end classes within a practical time, we use a data schema for the SPARQL endpoints called endpoint metadata. To construct the endpoint metadata for a SPARQL endpoint, SPARQL Builder sends numerous small SPARQL queries to the endpoint in advance. The endpoint metadata are written in the vocabulary called SPARQL Builder metadata, published at http://sparqlbuilder.org/doc/, and they are stored on the servlet server in RDF. As of September 2014, we have retrieved endpoint metadata from five EBI SPARQL endpoints for large-scale databases used in cutting-edge research: Expression Atlas<sup>1</sup>, BioModels<sup>2</sup>, BioSamples<sup>3</sup>, ChEMBL<sup>4</sup> and Reactome<sup>5</sup>.

<sup>1</sup> http://www.ebi.ac.uk/rdf/services/atlas/sparql

<sup>2</sup> http://www.ebi.ac.uk/rdf/services/biomodels/sparql

<sup>3</sup> http://www.ebi.ac.uk/rdf/services/biosamples/sparql

<sup>4</sup> http://www.ebi.ac.uk/rdf/services/chembl/sparql

<sup>5</sup> http://www.ebi.ac.uk/rdf/services/reactome/sparql

Fig. 1. Graphical user interface of SPARQL Builder. (1) A user first selects a SPARQL endpoint, and then class lists for selecting a start class and an end class are displayed. (2) When the user selects start and end classes, all possible class paths are displayed as a tree. (3) The user then selects a path, and the system generates the corresponding SPARQL query. (4) Finally, when the user clicks the SPARQL button, the system sends the generated SPARQL query to the SPARQL endpoint, and the result is displayed in a new window of the web browser.

Although SPARQL Builder is a standalone application, it is designed to work in conjunction with TogoTable [4], a web application that enables biological researchers to upload a table of their own data and to add annotations obtained from SPARQL endpoints. SPARQL Builder assists users in obtaining annotations from SPARQL endpoints without knowledge of SPARQL. The TogoTable service built with SPARQL Builder will be released to the public as the next version and will be evaluated with regard to the practicality of the tool.

===3 SPARQL Builder metadata===
SPARQL Builder metadata briefly and comprehensively describes the RDF graph schema of SPARQL endpoint datasets. Other specifications defined for a similar purpose include VoID<sup>6</sup>, the vocabulary of interlinked datasets, and SPARQL 1.1 Service Description<sup>7</sup>, the vocabulary for describing SPARQL services. SPARQL Builder metadata is based on these existing specifications but adds our original vocabularies, which describe metadata for constructing class paths and statistics for determining how comprehensively the data can be handled by our search method on the basis of class paths.

<sup>6</sup> http://www.w3.org/TR/void/

<sup>7</sup> http://www.w3.org/TR/sparql11-service-description/

For instance, our original class ClassRelation is used to describe a relationship between two classes linked by a property p, which is essential for building a class path. To improve the comprehensiveness of our triple path search, the class–class relationship described as a ClassRelation covers not only the domain and range classes of property p but also the classes of the subject and object instances of triples having property p. As described above, SPARQL Builder metadata is our original specification, but it is defined for arbitrary SPARQL endpoints.
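To illustrate how class-to-class relationships of this kind can yield class paths, the following Python sketch enumerates class paths by a depth-limited search. It rests on several assumptions: each ClassRelation is simplified to a `(subject class, property, object class)` tuple rather than the actual RDF vocabulary, the function name `find_class_paths` and the sample data are hypothetical, and the default n = 3 is treated as a maximum path length. The search procedure actually used by SPARQL Builder is not described here and may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Simplified stand-in for one ClassRelation record:
# (subject class, property, object class). This is an assumed shape.
ClassRelation = Tuple[str, str, str]


def find_class_paths(
    relations: Iterable[ClassRelation],
    start: str,
    end: str,
    max_len: int = 3,
) -> List[List[str]]:
    """Enumerate class paths from start to end using at most max_len properties."""
    # Map each class to the classes reachable from it through one property.
    successors: Dict[str, Set[str]] = defaultdict(set)
    for subject_class, _prop, object_class in relations:
        successors[subject_class].add(object_class)

    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        current = path[-1]
        # A class path needs at least one property, hence len(path) > 1.
        if current == end and len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == max_len:
            return  # the path already uses max_len properties
        for nxt in sorted(successors.get(current, ())):
            path.append(nxt)
            extend(path)
            path.pop()

    extend([start])
    return paths


if __name__ == "__main__":
    # Made-up relations for demonstration; not taken from any endpoint.
    sample = [
        ("Gene", "encodes", "Protein"),
        ("Protein", "participatesIn", "Pathway"),
        ("Gene", "involvedIn", "Pathway"),
    ]
    for class_path in find_class_paths(sample, "Gene", "Pathway"):
        print(" -> ".join(class_path))
```

Because the relations are gathered in advance, a search like this runs entirely on the stored metadata and needs no round trip to the remote endpoint, which is consistent with precomputing the metadata to obtain class paths within a practical time.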
Some life-science SPARQL endpoints provide metadata for their datasets. EBI publishes such metadata in their framework called Lodestar<sup>8</sup>, and Bio2RDF publishes Bio2RDF Dataset Metrics<sup>9</sup>. We hope that these metadata specifications will be integrated into a global standard, promoting the distribution of metadata for advanced intelligent semantic web data processing.

<sup>8</sup> http://www.ebi.ac.uk/fgpt/sw/lodestar/

<sup>9</sup> https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-Dataset-Metrics

===4 Conclusions===
We discussed our prototype version of the SPARQL Builder tool, which enables users to discover a sequentially connected triple path for an arbitrary SPARQL endpoint without knowledge of the data schema or SPARQL. To build a SPARQL query through interaction with a user within a practical time, the metadata of the datasets provided by SPARQL endpoints are retrieved in advance, and the results are stored as RDF datasets following the SPARQL Builder metadata specification. Our future work includes supporting SPARQL queries not only for triple paths, which are sequences of instances, but also for more general structures such as trees, as well as verifying and improving the practicality of our prototype system.

Acknowledgments. We thank Dr Yasunori Yamamoto for useful comments on improving the SPARQL Builder metadata specification.

===References===
1. Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., Gaulton, A., Gehant, S., Laibe, C., Redaschi, N., Wimalaratne, S. M., Martin, M., Le Novère, N., Parkinson, H., Birney, E., Jenkinson, A. M.: The EBI RDF platform: linked open data for the life sciences. Bioinformatics 30(9), 1338–1339 (2014)

2. Belleau, F., Nolin, M. A., Tourigny, N., Rigault, P., Morissette, J.: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform. 41(5), 706–716 (2008)

3. Whetzel, P. L., Noy, N. F., Shah, N. H., Alexander, P. R., Nyulas, C., Tudorache, T., Musen, M. A.: BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucl. Acids Res. 39(Web Server issue), W541–545 (2011)

4. Kawano, S., Watanabe, T., Mizuguchi, S., Araki, N., Katayama, T., Yamaguchi, A.: TogoTable: cross-database annotation system using the Resource Description Framework (RDF) data model. Nucl. Acids Res. 42(W1), W442–W448 (2014)