UmakaData extension: Toward Realization of a Practical SPARQL Endpoint Discovery Service for Life Sciences Norio Kobayashi?1 , Yasunori Yamamoto?2 , and Atsuko Yamaguchi2 1 Head Office for Information Systems and Cybersecurity (ISC), RIKEN, 2-1 Hirosawa, Wako, Saitama, 351-0198 Japan norio.kobayashi@riken.jp 2 Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems 178-4-4, Wakashiba, Kashiwa, Chiba 277-0871, Japan {yy,atsuko}@dbcls.rois.ac.jp Abstract. UmakaData shows a list of SPARQL endpoints that provide life science data with reliability scores, called Umaka scores, concerned with properties such as data freshness, accessibility, and performance. UmakaData monitors 72 SPARQL endpoints and scores these endpoints by executing SPARQL queries daily. Recently, in order to realize a class and property catalogue service for each endpoint that helps users write suitable SPARQL queries, an RDF data schema explorer called LOD Surfer crawler accessed SPARQL endpoints that were ranked in the top 50 for Umaka scores. This poster presents our current progress on the Umaka data service and its recent extension. Keywords: SPARQL endpoint discovery, endpoint federation, RDF data quality check 1 Introduction Discovering SPARQL endpoints publishing RDF data that is suitable for a user's data analysis is an essential function. In the life sciences, since a wide variety of RDF data is published having classes and properties defined by various ontolo- gies, SPARQL endpoint discovery is a difficult task. In particular, when writing a federated search query, a user may find that classes and properties have differ- ent URIs even though the URIs should be the same among SPARQL endpoints. However, checking whether there are differences in classes and properties for each instance is generally quite an expensive task. In order to solve these prob- lems comprehensively, we introduce upper level ontologies by extracting at most several hundred classes from a single ontology having various kinds of classes. Another problem is practical availability of SPARQL endpoints. In order to address this problem, we have already developed a service called ‘UmakaData’ [1] ? These two authors contributed equally to this work. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). that shows a list of life science SPARQL endpoints and their properties, including availability, performance and data freshness. Our current issue is the selection of the best properties and computational method for a ranking score that re- flects users' practical data analysis. This poster reports our trial extension of UmakaData to address the issues described above. 2 UmakaData extension with detail SPARQL endpoint metadata The UmakaData currently provides endpoint metadata to both RDF data con- sumers and providers for their mutual understanding. These metadata include their running history, update information, processing speed, support for the four principals of Linked Data, and usage of ontologies that are well known or more common in life science RDF data. In addition, since UmakaData also ob- tains inter-endpoint relationships, it can provide information on links between RDF data of any pair of SPARQL endpoints. Therefore, UmakaData could pro- vide relationships among classes and properties once it finds a triple which has owl:sameAs as its predicate or any classes whose instance's URI is identical over SPARQL endpoints. Furthermore, in order to achieve more powerful class-discovering function- ality when writting a SPARQL query, we have been working on an extension of Umaka metadata by introducing LOD Surfer1 metadata that describes the LOD graph structure of a SPARQL endpoint including class-class relationships with statistics including numbers of triples and instances. Since a single instance may relate to different concept classes among different SPARQL endpoints, we introduce upper-level conceptual classes using a part of public ontology that covers wide and deep concepts. For the SPARQL endpoints ranked in the top 50 Umaka scores, the LOD Surfer metadata crawler was executed to extract upper- level concepts. This resulted in the selection of the top 114 Medical Subject Headings and 42 semanticscience integrated ontology concepts associated with 2,724 and 1,133 concepts among 35,248 concepts extracted from the 50 SPARQL endpoints without crawler error. Our future work will include periodical execution of the LOD Surfer meta- data crawler, its tuning to reduce computational complexity, introduction of other upper-level ontologies, and evaluation of the effectiveness of our extended UmakaData metadata using practical applications such as the LOD Surfer. Acknowledgements This work has been supported by JSPS KAKENHI grant numbers 17K00434, 17K00424 and 18K19766. References 1. Yamamoto, Y., Yamaguchi, A., and Splendiani, A.: YummyData: providing high- quality open life science data. Database, Vol. 2018, bay022, 2018. 1 http://github.com/LODSurfer/lodsurfer-metadata