<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Acquisition by Traversing Class–Class Relationships over the Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Atsuko Yamaguchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kouji Kozaki</string-name>
          <email>kozaki@ei.sanken.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Lenz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasunori Yamamoto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroshi Masuya</string-name>
          <email>hmasuya@brc.riken.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norio Kobayashi</string-name>
          <email>norio.kobayashi@riken.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Advanced Center for Computing and Communication (ACCC), RIKEN</institution>
          ,
          <addr-line>2-1 Hirosawa, Wako, Saitama, 351-0198</addr-line>
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Database Center for Life Science (DBCLS), Research Organization of Information and Systems</institution>
          ,
          <addr-line>178-4-4 Wakashiba, Kashiwa, Chiba, 277-0871</addr-line>
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RIKEN BioResource Center (BRC)</institution>
          ,
          <addr-line>3-1-1 Koyadai, Tsukuba, Ibaraki, 305-0074</addr-line>
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>RIKEN CLST-JEOL Collaboration Center</institution>
          ,
          <addr-line>6-7-3 Minatojima-minamimachi, Chuo-ku, Kobe 650-0047</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>The Institute of Scientific and Industrial Research (ISIR), Osaka University</institution>
          ,
          <addr-line>8-1 Mihogaoka, Ibaraki, Osaka, 567-0047</addr-line>
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linked Open Data (LOD) is a powerful mechanism for linking different datasets published on the Web, and it is expected to create new value from data through mash-ups across these datasets. An important requirement when extracting data from LOD is to find a path of resources connecting two given classes, each of which contains one end resource of the path. Based on this concept, we have been developing a data acquisition system named SPARQL Builder that assists users in writing semantic queries for LOD. Through this development, we introduced two technologies for the approach: a labeled multigraph named class graph, used to compute class–class relationships, and an RDF specification named SPARQL Builder Metadata, used to obtain and store the metadata required to construct a class graph.</p>
      </abstract>
      <kwd-group>
        <kwd>linked data</kwd>
        <kwd>class–class relationships</kwd>
        <kwd>data integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>To use databases published as Linked Open Data (LOD) efficiently,
users need to be able to obtain data flexibly according to
their interests. An important case, for integrative data analysis with
semantics, is finding paths of links between instances (resources) whose
types are two given classes. These paths can be obtained by retrieving chains of properties (links)
that connect instances of classes. In other words, they can be obtained
by traversing paths of class–class relationships over the LOD.</p>
      <p>Therefore, based on class–class relationships, we have been developing a
system named SPARQL Builder that obtains data from LOD flexibly by assisting
users in writing SPARQL queries for SPARQL endpoints. To realize our
approach, we needed to develop the following two techniques: 1) a method to collect
profiles related to class–class relations through the SPARQL endpoints of RDF
datasets. This is implemented as SPARQL Builder Metadata (SBM), which describes
comprehensive metadata including not only class definitions but also statistics
such as the number of instances, which existing metadata vocabularies do not support. 2) a
method to obtain chains of properties and classes by computing paths on a labeled
multigraph named class graph. For this, we propose an efficient method to compute paths
and a measure to remove paths of classes that have no instance path.</p>
      <p>
        Related applications include Visor [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which enables users to browse RDF
datasets in the light of class–class relationships. However, Visor does not provide a
method to find an end-to-end path through multiple resources. Another
related work is RelFinder [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which computes paths between resources in LOD; however, it
is based not on class–class relationships but on instance–instance relationships.
      </p>
    </sec>
    <sec id="sec-2">
      <title>SPARQL Builder</title>
      <p>
        We have been developing a practical LOD search tool named SPARQL Builder
for life-science data analysis (http://www.sparqlbuilder.org/). This tool provides
an interactive GUI that allows users who are not familiar with the SPARQL language
to generate SPARQL queries without knowledge of SPARQL or the RDF data schema [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. An overview of the system architecture is shown in Fig. 1.
      </p>
      <p>Fig. 1. Overview of the SPARQL Builder system.</p>
      <p>SPARQL Builder manages SBM generated by accessing SPARQL
endpoints in advance (1). When a
user accesses the SPARQL Builder system via a web browser as a GUI, SPARQL
Builder obtains a list of classes by analysing SBM (2) and displays the list in
the user's web browser (3). Then, when the user selects "input" and "output"
classes, SPARQL Builder constructs class paths by traversing the class graph
constructed using the information described in SBM (4) and draws them in the web
browser. Using this GUI, users can explore datasets according to their interests by
specifying classes. If a user is interested in the interrelationships between molecular
pathways and proteins, the first step is to select protein as the input class
and pathway as the output class. SPARQL Builder then shows all possible paths,
including paths involving pathways that are constituted by chemical reactions catalysed by proteins.
Such a path consists of two sequentially connected relationships of the form
"Protein -(left/right)- BiochemicalReaction -(pathwayComponent)- Pathway". When
the user selects one of the class paths, SPARQL Builder creates a SPARQL query that
can be used to retrieve the data of interest. SPARQL Builder is used as a support service
to generate SPARQL queries for 38 SPARQL endpoints as of July 2016.</p>
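      <p>As a rough illustration of the final step, the following Python sketch turns a selected class path into a SPARQL query. It is a minimal sketch under stated assumptions, not the query generator implemented in SPARQL Builder: the names Step and class_path_to_sparql, the variable naming scheme, and the BioPAX Level 3 namespace and edge directions used in the example are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One hop of a class path (hypothetical helper type)."""
    property_iri: str     # IRI of the property linking the two instances
    target_class: str     # IRI of the class reached by this hop
    forward: bool = True  # False when the triple points from target to source


def class_path_to_sparql(start_class: str, steps: List[Step]) -> str:
    """Build a SELECT query whose variables ?n0, ?n1, ... follow the path."""
    lines = ["SELECT DISTINCT * WHERE {", f"  ?n0 a <{start_class}> ."]
    for i, step in enumerate(steps, start=1):
        src, dst = f"?n{i - 1}", f"?n{i}"
        if step.forward:
            lines.append(f"  {src} <{step.property_iri}> {dst} .")
        else:
            lines.append(f"  {dst} <{step.property_iri}> {src} .")
        lines.append(f"  {dst} a <{step.target_class}> .")
    lines.append("}")
    return "\n".join(lines)


# Assumed namespace and edge directions, for illustration only.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
query = class_path_to_sparql(
    BP + "Protein",
    [
        Step(BP + "left", BP + "BiochemicalReaction", forward=False),
        Step(BP + "pathwayComponent", BP + "Pathway", forward=False),
    ],
)
```
]]></preformat>
      <p>Each hop contributes one triple pattern for the property and one rdf:type constraint for the class reached, so the resulting query retrieves only chains of instances that follow the chosen class path.</p>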
    </sec>
    <sec id="sec-3">
      <title>SPARQL Builder Metadata</title>
      <p>SPARQL Builder Metadata (SBM) is a summary of the RDF datasets provided via
a SPARQL endpoint. SBM is defined as an extension of VoID (https://www.w3.org/TR/void/)
and SPARQL 1.1 Service Description (https://www.w3.org/TR/sparql11-service-description/)
with our original vocabulary, whose namespace prefix is sbm:.
SBM contains statistical summary data, called a "graph summary",
for the default graph and each named graph provided by the SPARQL endpoint.
A graph summary is an extension of the VoID vocabulary related to the void:Dataset
class with detailed statistical parameters, as follows. A property partition is
a subset of an RDF dataset associated with a property. In addition to the original
VoID properties, three properties, sbm:subjectClasses, sbm:objectClasses,
and sbm:objectDatatypes, are used to describe the numbers of classes and datatypes.
A class relation is a distinct pair of a subject class and an object class or datatype,
where the subject class is the class of the subject instance and the object class or
datatype is the class or datatype of the object instance or literal, over all triples associated with the
concerned property partition. The sbm:classRelation property is introduced to
describe each class relation, using the properties sbm:subjectClass, sbm:objectClass,
and sbm:objectDatatype as our original extension together with properties of the VoID
vocabulary.</p>
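      <p>To make the definition of a class relation concrete, the following Python sketch derives the class relations of one property partition from a small in-memory set of triples. The function name class_relations, the ex: identifiers, and the per-relation triple count are illustrative assumptions. Literal objects and datatypes are omitted for brevity, and the sketch does not reproduce how SBM is actually generated through SPARQL endpoints.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object) as plain strings


def class_relations(
    triples: Iterable[Triple],
    types: Dict[str, List[str]],
    prop: str,
) -> Counter:
    """Count triples per (subject class, object class) pair for one property.

    The triples whose predicate equals `prop` form the property partition.
    Each distinct key of the returned Counter is one class relation.
    """
    counts: Counter = Counter()
    for subj, pred, obj in triples:
        if pred != prop:
            continue  # triple belongs to another property partition
        for subj_class in types.get(subj, []):
            for obj_class in types.get(obj, []):
                counts[(subj_class, obj_class)] += 1
    return counts


# Toy data with hypothetical identifiers.
TRIPLES = [
    ("ex:r1", "ex:left", "ex:p1"),
    ("ex:r1", "ex:left", "ex:p2"),
    ("ex:w1", "ex:pathwayComponent", "ex:r1"),
]
TYPES = {
    "ex:r1": ["ex:BiochemicalReaction"],
    "ex:p1": ["ex:Protein"],
    "ex:p2": ["ex:Protein"],
    "ex:w1": ["ex:Pathway"],
}
left_relations = class_relations(TRIPLES, TYPES, "ex:left")
```
]]></preformat>
      <p>In this toy dataset the ex:left partition yields a single class relation, from ex:BiochemicalReaction to ex:Protein, which is the kind of record that sbm:classRelation describes.</p>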
    </sec>
    <sec id="sec-4">
      <title>Class Graph</title>
      <p>
        To compute paths between two classes efficiently, we employed a specialized
graph whose nodes and edges correspond to classes and to class–class relations
with predicates, respectively. We call this graph a class graph. A class graph can be
constructed from SBM efficiently because SBM includes a list of all the classes
and a list of all the class–class relationships. Given a class graph, an undirected
path on the graph is called a class path. Note that a class path is not always
a simple path, because the same class may appear twice or more in the path with
different properties. Class paths between two classes can be found in a practically
short time using the algorithm described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], even though a class graph is a labeled
multi-edge graph and a class path is not necessarily simple.
      </p>
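      <p>The definitions above can be sketched in Python as follows. The sketch stores a class graph as a list of labeled edges and enumerates undirected class paths up to a maximum length with a plain depth-first search in which each labeled edge is used at most once, so a class may be revisited through a different property. This is a simplified stand-in, not the algorithm of [<xref ref-type="bibr" rid="ref4">4</xref>]; the function names and the toy edges are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject class, property, object class)


def build_class_graph(relations: List[Edge]) -> Dict[str, List[Tuple[int, str]]]:
    """Index every labeled edge by both endpoints for undirected traversal."""
    adjacency: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for edge_id, (subj, _prop, obj) in enumerate(relations):
        adjacency[subj].append((edge_id, obj))
        if obj != subj:  # register a self-loop only once
            adjacency[obj].append((edge_id, subj))
    return adjacency


def class_paths(
    relations: List[Edge], start: str, goal: str, max_length: int
) -> List[List[Edge]]:
    """Enumerate class paths from start to goal with at most max_length edges."""
    adjacency = build_class_graph(relations)
    found: List[List[Edge]] = []

    def walk(node: str, used: Set[int], path: List[Edge]) -> None:
        if node == goal and path:
            found.append(list(path))  # stop at the first arrival at the goal
            return
        if len(path) == max_length:
            return
        for edge_id, neighbour in adjacency[node]:
            if edge_id in used:
                continue  # each labeled edge appears at most once per path
            used.add(edge_id)
            path.append(relations[edge_id])
            walk(neighbour, used, path)
            path.pop()
            used.remove(edge_id)

    walk(start, set(), [])
    return found


# Toy class graph with hypothetical edges.
RELATIONS = [
    ("BiochemicalReaction", "left", "Protein"),
    ("BiochemicalReaction", "right", "Protein"),
    ("Pathway", "pathwayComponent", "BiochemicalReaction"),
]
paths = class_paths(RELATIONS, "Protein", "Pathway", 2)
```
]]></preformat>
      <p>Because parallel edges with different properties are kept as separate entries, the two relations between Protein and BiochemicalReaction give rise to two distinct class paths in this example.</p>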
      <p>
        For relatively large datasets, too many paths may be found for a user to choose from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, for Reactome in the EBI RDF Platform [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as of December
2014, the average number of paths between classes with a maximum length of four
was 609. In addition, our preliminary investigation found that for some class paths
no sequence of instances can be obtained by traversing triples along the class path.
We call such a class path an empty path. Because a user cannot
obtain any data using an empty path, it is important to provide a method that
removes as many such class paths as possible. To do so, we employed a measure
that uses the statistical values described in SBM. The
probability that a path is not empty is estimated from SBM and used to
remove empty paths and to extract non-empty paths from all the paths. For
example, for Reactome, although the average ratio of non-empty paths among all the
class paths with a maximum length of four is 0.393, the average ratio of non-empty
paths among the 10 class paths with the highest probability is 0.893.
      </p>
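      <p>The measure itself is not specified here, so the following Python sketch only illustrates the general idea of ranking class paths by an estimated probability of being non-empty. It assumes, purely for illustration, that for each class shared by two consecutive relations we know the number of instances of that class and how many of them take part in each relation, that these instance subsets are drawn uniformly at random, and that junctions are independent. The function names, the Junction tuple, and the example counts are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from math import comb
from typing import Dict, List, Tuple

# (instances of the shared class, instances used by relation 1, by relation 2)
Junction = Tuple[int, int, int]


def overlap_probability(n_instances: int, a: int, b: int) -> float:
    """Chance that random subsets of sizes a and b of n instances intersect."""
    if n_instances <= 0 or a <= 0 or b <= 0:
        return 0.0
    a, b = min(a, n_instances), min(b, n_instances)
    # comb(n - a, b) counts the size-b subsets that avoid the first subset.
    return 1.0 - comb(n_instances - a, b) / comb(n_instances, b)


def non_empty_estimate(junctions: List[Junction]) -> float:
    """Multiply junction probabilities (independence is an assumption)."""
    estimate = 1.0
    for n_instances, a, b in junctions:
        estimate *= overlap_probability(n_instances, a, b)
    return estimate


def top_paths(candidates: Dict[str, List[Junction]], k: int) -> List[str]:
    """Return the names of the k paths with the highest estimate."""
    ranked = sorted(
        candidates,
        key=lambda name: non_empty_estimate(candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical candidate paths, each with a single junction.
CANDIDATES = {
    "dense": [(10, 5, 6)],
    "sparse": [(1000, 1, 1)],
}
best = top_paths(CANDIDATES, 1)
```
]]></preformat>
      <p>Under these assumptions, a path whose relations cover most instances of the shared class is ranked above a path whose relations each touch only a few instances, which mirrors the intent of presenting likely non-empty paths first.</p>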
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this study, we discussed a novel methodology for exploring LOD and its
application, SPARQL Builder, which enables practical searching of LOD through a
SPARQL endpoint. Although the system was originally designed for biological
databases, the technologies used in the system, including SBM and class graphs,
are applicable to other domains. Therefore, our future work includes
expanding our application to multiple domains and evaluating the generality of our
approach. In addition, we will consider extending class paths to more general
types of subgraphs of the class graph, to support more styles of SPARQL queries.</p>
      <p>Acknowledgments. This work was supported by JSPS KAKENHI Grant
Numbers 25280081 and 24120002 and by the National Bioscience Database Center (NBDC)
of the Japan Science and Technology Agency (JST).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Popov</surname>
          </string-name>
          , IO.,
          <string-name>
            <surname>Schraefel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shadbolt</surname>
          </string-name>
          , N.:
          <article-title>Connecting the dots: a multi-pivot approach to data exploration</article-title>
          .
          In:
          <source>The Semantic Web – ISWC</source>
          <year>2011</year>
          ,
          <fpage>553</fpage>
          -
          <lpage>568</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Heim</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stegemann</surname>
          </string-name>
          , T.:
          <article-title>RelFinder: Revealing Relationships in RDF Knowledge Bases</article-title>
          .
          <source>4th International Conference on Semantic and Digital Media Technologies, SAMT</source>
          <year>2009</year>
          , LNCS
          <volume>5887</volume>
          ,
          <fpage>182</fpage>
          -
          <lpage>187</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Yamaguchi</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozaki</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenz</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
            <given-names>N.:</given-names>
          </string-name>
          <article-title>An Intelligent SPARQL Query Builder for Exploration of Various Life-science Databases</article-title>
          ,
          <source>CEUR Workshop Proceedings 1279, The 3rd International Workshop on Intelligent Exploration of Semantic Data (IESD</source>
          <year>2014</year>
          ),
          <source>Riva del Garda</source>
          , Italy.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozaki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenz</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamamoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
          </string-name>
          , N.:
          <article-title>Efficiently finding paths between classes to build a SPARQL query for life-science databases</article-title>
          .
          <source>5th Joint International Conference (JIST</source>
          <year>2015</year>
          ), LNCS
          <volume>9544</volume>
          ,
          <fpage>321</fpage>
          -
          <lpage>330</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jupp</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malone</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolleman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandizi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaulton</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehant</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laibe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redaschi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wimalaratne</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Novere</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parkinson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birney</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jenkinson</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          :
          <article-title>The EBI RDF platform: linked open data for the life sciences</article-title>
          .
          <source>Bioinformatics</source>
          <volume>30</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1338</fpage>
          -
          <lpage>1339</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>