<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UmakaData extension: Toward Realization of a Practical SPARQL Endpoint Discovery Service for Life Sciences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Norio Kobayashi*</string-name>
          <email>norio.kobayashi@riken.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasunori Yamamoto*</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Atsuko Yamaguchi</string-name>
          <email>atsukog@dbcls.rois.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems</institution>
          <addr-line>178-4-4, Wakashiba, Kashiwa, Chiba 277-0871</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Head Office for Information Systems and Cybersecurity (ISC), RIKEN</institution>
          ,
          <addr-line>2-1 Hirosawa, Wako, Saitama, 351-0198</addr-line>
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>UmakaData shows a list of SPARQL endpoints that provide life science data, together with reliability scores, called Umaka scores, that reflect properties such as data freshness, accessibility, and performance. UmakaData monitors 72 SPARQL endpoints and scores them by executing SPARQL queries daily. Recently, in order to realize a class and property catalogue service for each endpoint that helps users write suitable SPARQL queries, the crawler of an RDF data schema explorer called LOD Surfer accessed the SPARQL endpoints ranked in the top 50 by Umaka score. This poster presents our current progress on the UmakaData service and its recent extension.</p>
      </abstract>
      <kwd-group>
        <kwd>SPARQL endpoint discovery</kwd>
        <kwd>endpoint federation</kwd>
        <kwd>RDF data quality check</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Discovering SPARQL endpoints that publish RDF data suitable for a user's
data analysis is an essential function. In the life sciences, a wide variety of
RDF data is published with classes and properties defined by various
ontologies, which makes SPARQL endpoint discovery a difficult task. In particular,
when writing a federated search query, a user may find that classes and properties
have different URIs among SPARQL endpoints even though the URIs should be the same.</p>
      <p>However, checking whether there are differences in classes and properties for
each instance is generally quite an expensive task. In order to solve these
problems comprehensively, we introduce upper-level ontologies by extracting at most
several hundred classes from a single ontology having various kinds of classes.</p>
      <p>
        Another problem is the practical availability of SPARQL endpoints. In order to
address this problem, we have already developed a service called `UmakaData' [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
that shows a list of life science SPARQL endpoints and their properties, including
availability, performance and data freshness. Our current issue is the selection
of the best properties and computational method for a ranking score that
reflects users' practical data analysis. This poster reports our trial extension of
UmakaData to address the issues described above.
      </p>
      <p>* Norio Kobayashi and Yasunori Yamamoto contributed equally to this work.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2 UmakaData extension with detailed SPARQL endpoint metadata</title>
      <p>
UmakaData currently provides endpoint metadata to both RDF data
consumers and providers for their mutual understanding. These metadata include
each endpoint's running history, update information, processing speed, support for the
four principles of Linked Data, and usage of ontologies that are well known or
common in life science RDF data. In addition, since UmakaData also
obtains inter-endpoint relationships, it can provide information on links between
the RDF data of any pair of SPARQL endpoints. Therefore, UmakaData could
provide relationships among classes and properties once it finds a triple that has
owl:sameAs as its predicate, or classes whose instance URIs are identical across
SPARQL endpoints.</p>
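      <p>The relationship discovery described above can be illustrated with a minimal Python sketch. It assumes that (class URI, instance URI) pairs and owl:sameAs triples have already been retrieved from two endpoints. The function name related_classes, the input layout and the example.org URIs are hypothetical and are not part of UmakaData; the sketch also follows owl:sameAs links for one hop only.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def related_classes(typed_a, typed_b, triples=()):
    """Return sorted pairs (class_a, class_b) that share at least one instance.

    typed_a, typed_b: iterables of (class_uri, instance_uri) pairs, one per
    endpoint (hypothetical input layout, not the UmakaData format).
    triples: optional (subject, predicate, object) triples; only those whose
    predicate is owl:sameAs are used, to treat two URIs as one instance.
    """
    # Map each URI to its owl:sameAs partners (symmetric, one hop only).
    same = defaultdict(set)
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            same[s].add(o)
            same[o].add(s)

    # Index endpoint B: instance URI to the classes it belongs to.
    classes_b = defaultdict(set)
    for cls, inst in typed_b:
        classes_b[inst].add(cls)

    pairs = set()
    for cls_a, inst in typed_a:
        # The instance itself plus anything declared owl:sameAs to it.
        for candidate in {inst} | same[inst]:
            for cls_b in classes_b.get(candidate, ()):
                pairs.add((cls_a, cls_b))
    return sorted(pairs)


# Hypothetical data: one shared instance URI and one owl:sameAs link.
endpoint_a = [("http://example.org/a/Protein", "http://example.org/id/P1"),
              ("http://example.org/a/Gene", "http://example.org/a/g1")]
endpoint_b = [("http://example.org/b/Polypeptide", "http://example.org/id/P1"),
              ("http://example.org/b/Locus", "http://example.org/b/l1")]
links = [("http://example.org/a/g1", OWL_SAME_AS, "http://example.org/b/l1")]

for pair in related_classes(endpoint_a, endpoint_b, links):
    print(pair)
```
      </preformat>
      <p>In this example, Protein and Polypeptide are paired because they share an instance URI, while Gene and Locus are paired only through the owl:sameAs triple.</p>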
      <p>Furthermore, to achieve more powerful class-discovery
functionality for users writing SPARQL queries, we have been working on an extension
of the Umaka metadata that introduces LOD Surfer<sup>1</sup> metadata, which describe the
LOD graph structure of a SPARQL endpoint, including class-class relationships
with statistics such as the numbers of triples and instances. Since a single instance
may relate to different concept classes in different SPARQL endpoints, we
introduce upper-level conceptual classes using part of a public ontology that
covers wide and deep concepts. For the SPARQL endpoints ranked in the top 50
by Umaka score, the LOD Surfer metadata crawler was executed to extract
upper-level concepts. This resulted in the selection of the top 114 Medical Subject
Headings (MeSH) concepts and 42 Semanticscience Integrated Ontology (SIO) concepts,
associated with 2,724 and 1,133 concepts, respectively, among the 35,248 concepts
extracted from the 50 SPARQL endpoints without crawler error.</p>
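      <p>How extracted concepts might be associated with upper-level concepts can be sketched as follows. The text does not specify how the association is computed, so this Python sketch simply assumes that it follows the superclass hierarchy (for example, rdfs:subClassOf statements). The function names, the parents mapping and the toy class names are hypothetical and do not describe the LOD Surfer crawler itself.</p>
      <preformat preformat-type="code">
```python
from collections import Counter


def upper_level_of(concept, parents, upper_level):
    """Return the upper-level classes reachable from `concept`.

    parents: dict mapping a class to its direct superclasses (assumed input,
    e.g. collected from rdfs:subClassOf statements).
    upper_level: set of classes chosen as upper-level concepts.
    """
    found, seen, stack = set(), set(), [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the hierarchy
        seen.add(current)
        if current in upper_level:
            found.add(current)
            continue  # stop climbing once an upper-level class is reached
        stack.extend(parents.get(current, ()))
    return found


def count_associated(concepts, parents, upper_level):
    """Count how many extracted concepts map to each upper-level class."""
    counts = Counter()
    for concept in concepts:
        for top in upper_level_of(concept, parents, upper_level):
            counts[top] += 1
    return counts


# Toy hierarchy with hypothetical class names.
parents = {
    "Insulin": ["Hormone"],
    "Hormone": ["Chemical"],
    "Liver": ["Organ"],
    "Organ": ["Anatomy"],
}
upper = {"Chemical", "Anatomy"}
extracted = ["Insulin", "Hormone", "Liver", "Unmapped"]

for top, n in sorted(count_associated(extracted, parents, upper).items()):
    print(top, n)
```
      </preformat>
      <p>Concepts with no path to a selected upper-level class, such as Unmapped here, are simply not counted, which is one possible reason why only some of the extracted concepts end up associated with an upper-level concept.</p>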
      <p>Our future work will include periodic execution of the LOD Surfer
metadata crawler, tuning of the crawler to reduce computational complexity, introduction of
other upper-level ontologies, and evaluation of the effectiveness of our extended
UmakaData metadata using practical applications such as LOD Surfer.</p>
      <p><sup>1</sup> http://github.com/LODSurfer/lodsurfer-metadata</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgements</title>
      <p>This work has been supported by JSPS KAKENHI grant numbers 17K00434,
17K00424 and 18K19766.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yamamoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Splendiani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>YummyData: providing high-quality open life science data</article-title>
          .
          <source>Database</source>
          , Vol.
          <volume>2018</volume>
          ,
          <elocation-id>bay022</elocation-id>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>