Yasunori Yamamoto

Takatomo Fujisawa

Web of Data

Data curation

0 Database Center for Life Science, ROIS-DS, 178-4-4 Wakashiba, Kashiwa, Chiba 277-0871, JAPAN
1 National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, JAPAN

2024

RDF data are most valuable when they are built in a distributed manner and linked to each other from several aspects, with URIs as their keys. However, we have seen several mismatches across RDF datasets among URIs that should be identical, such as cases of using different prefixes and code systems. In this situation, we need to develop an infrastructure in which these URIs are treated identically by using a URI rewriting dictionary tailored to each RDF dataset. Here, we show some examples of these synonymous URIs and propose an architecture that rewrites URIs when retrieving RDF data from multiple SPARQL endpoints. As a result, users can obtain the properties of a consolidated URI, whereas they would otherwise obtain only those explicitly asserted as triples.

https://researchmap.jp/yayamamo (Y. Yamamoto); https://researchmap.jp/takatomo (T. Fujisawa)

1. Introduction

Efforts to represent large and diverse life science data in the Resource Description Framework (RDF) have been under way since the late 2000s, and the number of newly built RDF datasets is still increasing. Currently, 65 SPARQL endpoints are listed at Umaka-Yummy Data1, which reports the status of each endpoint, such as how stable it is and how fast it returns results. Because a URI is a global identifier, RDF reaches its full potential when each URI denotes exactly one concept and each concept is denoted by exactly one URI. When this holds, multiple RDF datasets built in a distributed manner can be joined easily. In practice, however, there are several kinds of URI discrepancies among datasets. First of all, there are typos and misprints within a dataset, such as the following:

All of these URIs denote Homo sapiens. We attribute this issue to the distributed way in which RDF datasets are built: multiple groups and institutions are involved. Therefore, in addition to calling the community’s attention to the issue, we need to construct an infrastructure that minimizes these mismatches as far as possible with the help of machines. Here, we propose an infrastructure in which synonymous URIs are treated as identical. While there are related works such as sameAs3, Identifiers.org 4, and TogoID5, to date there has been no attempt in the life science domain to provide consolidated results by rewriting URIs.
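To make the idea of treating synonymous URIs as identical more concrete, the following is a minimal sketch in Python, not the implementation proposed here. It assumes that a rewriting dictionary can be expressed as a list of regular-expression rules, and that triples retrieved from two endpoints are available as plain (subject, predicate, object) tuples. The rule list, the function names `rewrite_uri` and `consolidate`, the choice of the identifiers.org form as the canonical URI, the sample URIs, and the `http://example.org/rank` predicate are all illustrative assumptions.

```python
import re
from collections import defaultdict

# Hypothetical rewriting dictionary: (pattern, canonical replacement) pairs.
# The patterns and the canonical form are assumptions made for illustration.
CANONICAL = r"http://identifiers.org/taxonomy/\1"
REWRITE_RULES = [
    (re.compile(r"^https?://purl\.uniprot\.org/taxonomy/(\d+)$"), CANONICAL),
    (re.compile(r"^https?://purl\.obolibrary\.org/obo/NCBITaxon_(\d+)$"), CANONICAL),
    (re.compile(r"^https://identifiers\.org/taxonomy/(\d+)$"), CANONICAL),
]


def rewrite_uri(term):
    """Return the canonical URI for term, or term unchanged if no rule matches."""
    for pattern, replacement in REWRITE_RULES:
        if pattern.match(term):
            return pattern.sub(replacement, term)
    return term


def consolidate(triples):
    """Group (predicate, object) pairs under the rewritten subject URI."""
    merged = defaultdict(set)
    for subject, predicate, obj in triples:
        merged[rewrite_uri(subject)].add((predicate, rewrite_uri(obj)))
    return merged


# Simulated results from two endpoints that use different URIs for one taxon.
endpoint_a = [
    ("http://purl.uniprot.org/taxonomy/9606",
     "http://www.w3.org/2000/01/rdf-schema#label", "Homo sapiens"),
]
endpoint_b = [
    ("http://purl.obolibrary.org/obo/NCBITaxon_9606",
     "http://example.org/rank", "species"),
]

merged = consolidate(endpoint_a + endpoint_b)
for subject, properties in merged.items():
    print(subject, sorted(properties))
```

In this sketch, both properties end up under a single consolidated subject, although neither endpoint asserts both of them for the same URI. A per-dataset dictionary, as suggested in the abstract, could be modeled by keeping a separate rule list for each endpoint.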

2. URI consolidation

Acknowledgments

This work was supported under the Life Science Database Integration Project, NBDC of Japan Science and Technology Agency.