<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Intelligent SPARQL Query Builder for Exploration of Various Life-science Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Atsuko Yamaguchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kouji Kozaki</string-name>
          <email>kozaki@ei.sanken.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Lenz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongyan Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norio Kobayashi</string-name>
          <email>norio.kobayashig@riken.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Advanced Center for Computing and Communication (ACCC), RIKEN</institution>
          ,
          <addr-line>2-1 Hirosawa, Wako, Saitama, 351-0198</addr-line>
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Database Center for Life Science (DBCLS), Research Organization of Information and Systems</institution>
          ,
          <addr-line>178-4-4 Wakashiba, Kashiwa, Chiba, 277-0871</addr-line>
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>The Institute of Scientific and Industrial Research (ISIR), Osaka University</institution>
          ,
          <addr-line>8-1 Mihogaoka, Ibaraki, Osaka, 567-0047</addr-line>
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Database integration of a wide variety of life-science data is an important issue for comprehensive data analysis. Since Semantic Web technologies, such as the Resource Description Framework (RDF), are expected to provide efficient data integration technologies, many life-science databases are published in RDF with SPARQL Protocol and RDF Query Language (SPARQL) endpoints as search application programming interfaces on the web. However, although SPARQL supports very useful functions for exploring and integrating various datasets, many biologists find SPARQL difficult to use. To overcome this problem, we propose an intelligent SPARQL query builder that aids users with no knowledge of SPARQL in building queries. This paper discusses the methods used by this tool, its system design and its implementation. The tool assists users in generating queries for cross-database annotations based on RDF and enhances the value of life-science data by supporting the exploration and integration of these data.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic web</kwd>
        <kwd>SPARQL</kwd>
        <kwd>intelligent query generation</kwd>
        <kwd>database integration</kwd>
        <kwd>life-science databases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
The high-throughput measurement apparatuses recently developed for the life
sciences have generated enormous volumes of diverse data. Such data are stored
and published in databases around the world. To enhance the value of such
databases and to acquire innovative knowledge from the extracted information,
these databases must be integrated. The interoperability of heterogeneous,
widely distributed biological databases has recently been improved by
Semantic Web technologies. Currently, many important life-science databases provide
their data in the Resource Description Framework (RDF) model. RDF is a standard
Semantic Web data model and one of the key technologies for Linked
Data. For example, UniProt [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the largest protein database, has employed an
RDF data model since 2008 to handle the many interlinks to various existing
databases. Around the same time, the Bio2RDF project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] published Linked
Data originally generated from numerous major biological databases in RDF. In
October 2013, an RDF platform [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was made publicly available by the European
Bioinformatics Institute (EBI). Through this platform, users can access the RDF
data of six EBI databases. Data are retrieved by SPARQL Protocol and RDF
Query Language (SPARQL) endpoints, a type of web application programming
interface.
      </p>
<p>Under these circumstances, if life-science researchers are to easily access RDF
data in Linked Data, their requirements should be accommodated by SPARQL.
Since constructing a SPARQL query is intractable for biologists who are unfamiliar
with programming languages and RDF data schemas, these researchers require
informatics support for constructing SPARQL queries. SPARQL queries
corresponding to standard requirements may be prepared in advance; indeed, many
SPARQL endpoints provide typical example queries on their RDF datasets. However,
because the interests of biological researchers are wide-ranging, they are not
easily covered by SPARQL queries constructed in advance.</p>
      <p>
In this study, we present an intelligent tool named SPARQL Builder that
assists biological researchers in building SPARQL queries. This prototype version
assumes that target users have no knowledge of RDF and SPARQL, a situation
that is typical of biological researchers. As is peculiar to the life sciences,
comprehensive datasets are frequently viewed, edited and analyzed in table form, where
each column of a table is associated with an RDF class as a set of instances. Therefore,
our initial prototype system is designed to be compatible with TogoTable [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an
RDF-based cross-database annotation system. TogoTable implements a function
that retrieves annotations from many RDF databases by accessing up-to-date
user-specified SPARQL endpoints with their corresponding SPARQL queries. Our
SPARQL Builder was designed as a SPARQL query construction tool for
TogoTable before being trialed on various cases.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
Many SPARQL-based tools and semantic search methods have been proposed
to date. The most popular method for instance searching is faceted search. For
example, Ferre et al. proposed query-based faceted search (QFS) as a
navigational support tool for faceted searching by logical information system query
language (LISQL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. QFS employs a SPARQL endpoint, enabling searching of
large datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
Ferre et al. also developed a web-based tool named Sparklis
(http://www.irisa.fr/LIS/ferre/sparklis/osparklis.html), which supports complex
queries and exploratory searching for SPARQL endpoints. The tool presents users
with lists of classes in its target endpoints and allows users to build queries
through facet-based graphical user interfaces (GUIs). Although the tool presents
queries in a logical language format, interactive GUIs for
building SPARQL queries are provided by other systems such as NITELIGHT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
iSPARQL (http://oat.openlinksw.com/isparql/index.html) and RDF-GL [
        <xref ref-type="bibr" rid="ref8">8</xref>
]. These systems build queries through the user's
interactive selection among candidate query-specification options.
      </p>
      <p>
        Popov proposed an exploratory search called Multi-Pivot [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This search
method extracts concepts and relationships from the ontologies of interest to the
user. The extracts are visualized and used for semantic searches among instances
(data) associated with ontology terms. Kozaki et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
] also proposed a
user-guided, divergent ontology exploration tool. Multi-Pivot and Kozaki et al.'s tool
are good examples of semantic searching for instances based on ontologies as
conceptual structures.
      </p>
      <p>
Following Popov's approach, our proposed SPARQL Builder is designed for
quick discovery of possible paths between instances of selected classes. Although
some systems are designed to accelerate RDF data retrieval [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
], the proposed
system accelerates only the computation of possible paths, using preprocessed
metadata.
      </p>
      <p>
        Currently, users can build SPARQL queries for life-science databases using
GUI-based support tools, such as facet-based searching. For example, BioSPARQL
[
        <xref ref-type="bibr" rid="ref13">13</xref>
] supports finding paths among instances in selected classes, although it limits
the target to local data.
      </p>
    </sec>
    <sec id="sec-3">
      <title>SPARQL Builder</title>
      <p>SPARQL Builder is an intelligent tool by which users with no knowledge of
SPARQL can generate SPARQL queries and retrieve results satisfying their
requirements. In this section, we discuss a prototype version of SPARQL Builder
that collaborates with TogoTable and explain the core software components of
the tool, including the controller and crawler modules.
</p>
      <sec id="sec-3-1">
        <title>System requirement</title>
<p>Building a SPARQL query in TogoTable. TogoTable is a web application
enabling biological researchers to upload their data in table form and to add
annotations obtained from SPARQL endpoints. More precisely, when a user selects
one column of an uploaded table, TogoTable displays the candidate databases
containing the SPARQL endpoint of the annotation search and the candidate
annotation types. If a user selects one database and one annotation type in a
system-prepared SPARQL query, TogoTable obtains annotations of that type
from the selected endpoint and adds these annotations as a column to the user's
original table. Presently, approximately 50 queries are built into TogoTable for
access to the major life-science databases; however, these queries are insufficient
to satisfy the diverse interests of biological researchers. They also exclude newly
published databases.</p>
<p>To cover the SPARQL endpoints of interest to diverse biologists, TogoTable
has a function that sets SPARQL queries for arbitrary SPARQL endpoints. Our
first goal is to support users in constructing such SPARQL queries through TogoTable.</p>
<p>Limitation of SPARQL queries. As discussed above, the SPARQL queries implemented
in TogoTable obtain an annotation for each element in the user's table. Because
SPARQL itself handles lists of instances, a query outputs annotations
corresponding to a list of all data elements in a user-specified input column. Since
each column of a TogoTable table is an extension of instances (data) corresponding
to a class, a user's specification of input and output columns corresponds to
specifying input and output classes defined in a dataset.</p>
<p>Therefore, we assume that SPARQL queries in TogoTable are used to search for
instances in an output class, given some instances in an input class. Note that
the instances of input and output classes may be related in multiple ways. Since
classes may not be directly related, our system should display all possible
relationships between the instances of input and output classes. The relational
expressions between two classes are detailed in Subsection 3.3.
</p>
      </sec>
      <sec id="sec-3-2">
        <title>System overview</title>
<p>The GUI module. The SPARQL endpoint and the input and output classes
are specified in a text box in the GUI module. A panel displays a rooted
tree generated from the possible relationships between the input and output
classes, and the system-generated SPARQL query is displayed in another
text box.</p>
<p>The path finder module. This module includes a function that computes the
possible relationships between input and output classes.</p>
<p>The query constructor module. This module includes the SPARQL query
generator, which generates SPARQL queries based on the user-specified
relationship between input and output classes.</p>
<p>The crawler module. This module accelerates the path finder module by
extracting metadata from the SPARQL endpoints in advance, as described in
Subsection 3.4.</p>
<p>The controller module. This is the core module of the system; it
manages and integrates the activities of the other modules.</p>
<p>Our system proceeds through the following steps: 1) the user selects a SPARQL
endpoint from the endpoint list containing the URLs of SPARQL endpoints
preprocessed by the crawler module. If the desired SPARQL endpoint is not in
the list, the user can provide the URL of the SPARQL endpoint or update the</p>
<p>SPARQL endpoint list at the starting process of the crawler module. 2) After
specifying the SPARQL endpoint, the user is presented with a list of classes
contained in the dataset of the SPARQL endpoint. From the input and output
classes specified by the user, the system then finds the relationships among the
start and end classes. 3) Using metadata extracted by the crawler module, the
path finder module dynamically constructs a tree whose root and leaves
correspond to the start and end classes, respectively. The resulting tree is displayed
to the user. 4) When the user selects one leaf of the tree, representing a single
relationship, the SPARQL query is constructed and displayed to the user. 5)
Finally, the generated query is executed.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Generating SPARQL by browsing a class graph</title>
<p>To enumerate the relationships between input and output classes, we introduce
a class graph whose nodes and edges correspond to classes and to class-class
relations with predicates, respectively. Given an RDF dataset R, we denote by C
the set of all classes in R. A class graph G_R = (V, E, c, p) of R is a directed labeled
multigraph defined as follows: V is a set of |C| nodes and c is a one-to-one
mapping from V to the set of URIs of C. E is a multiset of directed edges between
the nodes of V, and p maps E to the set of URIs of predicates in R. To construct
E and p from R, we add to E a directed edge e_pred from node n_d to n_r, where
c(n_d) = class_d and c(n_r) = class_r, and define p(e_pred) = pred if pred satisfies
either of the following two conditions: (1) both the triples "pred rdfs:domain
class_d" and "pred rdfs:range class_r" exist in R for some classes class_d and class_r;
(2) there exist three triples "sub pred ob", "sub rdf:type class_d" and "ob rdf:type
class_r" in R, where sub and ob are resources and class_d and class_r are classes.</p>
<p>Given a class graph G_R, we define a class path from a start class start to
an end class end as a sequence (n_1, e_1, n_2, e_2, ..., n_m), where the nodes n_i and
edges e_i of G_R satisfy the following conditions: (1) c(n_1) = start and c(n_m) = end;
(2) c(n_i) ≠ end for any i ≠ m; (3) e_i is a directed edge from n_i to n_{i+1} or from
n_{i+1} to n_i; (4) if c(n_i) = c(n_{i+2}), then e_i ≠ e_{i+1}. An edge e_i directed from n_i to n_{i+1}
or from n_{i+1} to n_i is called forward or reverse directed, respectively. Note that
a class path corresponds to a SPARQL query that obtains the instances of the end
class from the instances of the start class by relating a sequence of predicates p(e_i).
By searching the possible class paths from the start class to the end class, we
can obtain candidate SPARQL queries that match the user's purpose.</p>
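The path search over a class graph described above can be sketched as a depth-first enumeration. The following is a minimal illustration and not the authors' implementation; the edge-list representation, the `max_len` bound, and the class and predicate names in the usage note are hypothetical choices.

```python
def enumerate_class_paths(edges, start, end, max_len=3):
    """edges: list of (src_class, predicate, dst_class) in the class graph.

    Returns class paths as lists [class, (predicate, is_forward), class, ...],
    where is_forward records whether the edge is forward or reverse directed.
    """
    # Index each multigraph edge so it can be traversed in both directions.
    adj = {}
    for i, (s, p, o) in enumerate(edges):
        adj.setdefault(s, []).append((i, p, o, True))   # forward traversal
        adj.setdefault(o, []).append((i, p, s, False))  # reverse traversal
    paths = []

    def dfs(node, path, last_edge):
        if node == end:
            paths.append(path)      # condition (2): the end class only at the end
            return
        if (len(path) - 1) // 2 >= max_len:
            return                  # bound the search depth
        for i, p, nxt, fwd in adj.get(node, []):
            if i == last_edge:
                continue            # condition (4): no immediate backtracking
            dfs(nxt, path + [(p, fwd), nxt], i)

    dfs(start, [start], None)
    return paths
```

For example, with edges ("Gene", "encodes", "Protein") and ("Protein", "partOf", "Pathway"), searching from "Gene" to "Pathway" yields a single two-step path, and searching in the opposite direction yields the same path with both edges reverse directed.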
<p>We now explain how a SPARQL query is constructed from a class path
(n_1, e_1, n_2, e_2, ..., n_m). Because a class path indicates a relationship
between a start class and an end class, the WHERE clause of the SPARQL query
should include "?s_i p(e_i) ?o_i" or "?o_i p(e_i) ?s_i" if the direction of e_i is forward
or reverse, respectively. In addition, because s_i and o_i should be restricted to
instances of the classes c(n_i) and c(n_{i+1}), the WHERE clause should also include the two
triples "?s_i rdf:type c(n_i)" and "?o_i rdf:type c(n_{i+1})" for every i. Therefore, for a
class path (n_1, e_1, n_2, e_2, ..., n_m), the following SPARQL query is constructed.</p>
<p>SELECT ?r_m WHERE {
?r_1 p(e_1) ?r_2. (or ?r_2 p(e_1) ?r_1.)
?r_2 p(e_2) ?r_3. (or ?r_3 p(e_2) ?r_2.)
...
?r_{m-1} p(e_{m-1}) ?r_m. (or ?r_m p(e_{m-1}) ?r_{m-1}.)
?r_1 rdf:type c(n_1).
?r_2 rdf:type c(n_2).
...
?r_m rdf:type c(n_m).
}</p>
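The construction above maps directly to a small query generator. The sketch below is illustrative only, not the authors' code; the class path is given as a list of class URIs plus a list of (predicate, is_forward) pairs, and all URIs in the usage note are hypothetical placeholders.

```python
def build_query(classes, predicates):
    """classes: [c(n_1), ..., c(n_m)] as URIs;
    predicates: [(p(e_i), is_forward)] for edges e_1 .. e_{m-1}.

    Returns a SELECT query over variables ?r1 .. ?rm as described above."""
    m = len(classes)
    lines = []
    for i, (pred, forward) in enumerate(predicates, start=1):
        s, o = f"?r{i}", f"?r{i + 1}"
        if not forward:
            s, o = o, s  # reverse-directed edge: swap subject and object
        lines.append(f"{s} <{pred}> {o} .")
    for i, cls in enumerate(classes, start=1):
        # Restrict each variable to instances of its class on the path.
        lines.append(f"?r{i} rdf:type <{cls}> .")
    return f"SELECT ?r{m} WHERE {{\n  " + "\n  ".join(lines) + "\n}"
```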
      </sec>
      <sec id="sec-3-4">
        <title>Preprocessing method for metadata acquisition</title>
<p>If the path finder module sends a SPARQL query to obtain the adjacent classes
in a class graph each time the user searches a path through the GUI module of
SPARQL Builder, the search time may become unacceptably long.</p>
        <p>Therefore, we implemented the crawler module, which extracts metadata
from the major SPARQL endpoints in advance. The crawler module gathers the
necessary metadata to construct a class graph, namely the classes, properties,
property domains and ranges, by executing low-load SPARQL queries even if
the SPARQL endpoint contains large amounts of data.</p>
<p>Property schema generated by the crawler module. To construct a class
graph from an RDF dataset, we should extract the relationship between classes
c_s and c_o with property p appearing as triples "s p o" in the RDF dataset,
where s and o are instances of c_s and c_o, as explained in Subsection 3.3. We now
define the property schema of an RDF dataset R as a quintuple (p, c_s, c_o, I_s^p, I_o^p),
where p is the property, c_s and c_o are classes, and the sets I_s^p and I_o^p contain the
instances s in c_s and the instances o in c_o, respectively, that appear in triples "s p o" in R.
Note that each property schema corresponds to one edge of a class graph. Ideally,
the property schema should be derived from explicit information written into
the RDF dataset using rdfs:domain and rdfs:range. In practice, however, the
property domains and ranges in RDF datasets may not be defined. Moreover,
even when the domain d and range r are defined for a property p, erroneous
triples "s p o" may exist for which the classes of s and o written into the RDF
dataset using rdf:type do not match d and r.</p>
<p>Therefore, for a property p, multiple pairs of classes may exist. For
each property p, the crawler module infers the domain and the range of p from
all triples in an RDF dataset R as follows:
1. Collect quadruples (c_s, c_o, s, o) of classes c_s and c_o and resources s and o
such that the three triples "s p o", "s rdf:type c_s" and "o rdf:type c_o" reside in
R, and neither c_s nor c_o is defined as a class in RDF Schema 1.1
(http://www.w3.org/TR/rdf-schema/), such as rdfs:Resource and rdfs:Class.
Then, for each pair (c_s, c_o) of classes appearing in some collected quadruple
(c_s, c_o, s, o), a property schema (p, c_s, c_o, I_s^p, I_o^p) is generated by computing
I_s^p = {s | (c_s, c_o, s, o) for some o} and I_o^p = {o | (c_s, c_o, s, o) for some s}
for the classes c_s and c_o.
2. Collect the domain classes c_d of p explicitly described by "p rdfs:domain c_d"
triples and store them in a set C_d of classes. Similarly, collect the range
classes c_r of p explicitly described by "p rdfs:range c_r" triples and store
them in a set C_r of classes. Then, for every pair (c_d, c_r) of classes with c_d ∈
C_d and c_r ∈ C_r, generate a property schema (p, c_d, c_r, I_d^p, I_r^p), where I_d^p =
{s | a triple "s p o" exists in R for some o} and I_r^p = {o | a triple "s p o" exists in R for
some s}.</p>
<p>If the domain or range of p of an "s p o" triple is undefined in R and if
the classes of s and o are not described using rdf:type, that triple is disregarded
when generating the property schema, and data related to the triple cannot be
retrieved by our system-generated SPARQL queries. We refer to such a triple as
a junk triple. In addition, our system disregards classes c for which no property
schema (p, c_s, c_o, I_s^p, I_o^p) exists with c_s = c or c_o = c. Such a class is called a
junk class.</p>
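Step 1 of the crawler's inference, together with the junk-triple rule, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the in-memory triple list and the `rdf_type` dictionary are simplifying assumptions, since the real crawler gathers this information by executing SPARQL queries against an endpoint.

```python
from collections import defaultdict

def infer_property_schemas(triples, rdf_type):
    """triples: iterable of (s, p, o) tuples;
    rdf_type: dict mapping a resource to its class URI (from rdf:type).

    Returns {(p, c_s, c_o): (I_s, I_o)}, one entry per property schema.
    Triples whose subject or object has no known class are skipped,
    corresponding to junk triples in the text above."""
    schemas = defaultdict(lambda: (set(), set()))
    for s, p, o in triples:
        cs, co = rdf_type.get(s), rdf_type.get(o)
        if cs is None or co is None:
            continue  # junk triple: class of s or o is unknown
        subjects, objects = schemas[(p, cs, co)]
        subjects.add(s)
        objects.add(o)
    return dict(schemas)
```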
      </sec>
      <sec id="sec-3-5">
        <title>Statistical indicators of SPARQL endpoints</title>
<p>The coverage of the RDF data searchable by a SPARQL Builder query depends
on the comprehensiveness of the data schema declared in an RDF dataset.
Especially in the life sciences, the comprehensiveness and numbers of entities, such
as genes and proteins, can be crucially important for data analysis.
Therefore, we introduce statistical indicators called data schema categories, comprising
property, class and endpoint categories, by which users can intuitively understand
the statistics of their data.</p>
<p>Property category. Let T_p be the set of triples with predicate p, where p is a
property other than rdf:type, rdfs:subClassOf, or any other property defined in
RDF Schema 1.1. The property category of p is defined using T_p as follows:
property category 1 (complete): For all triples t ∈ T_p, the classes of both the
subject and the object of t are explicitly declared using rdf:type. Moreover, both
the domain and range classes of p are explicitly declared using the properties
rdfs:domain and rdfs:range. Then, the property category of p is 1.
property category 2 (complete by inference): When the property
category of p is not 1, if for all t ∈ T_p each subject or object class of t is defined
using rdf:type or can be inferred as shown below, the property category of p
is 2.</p>
<p>- When the subject s of t is not defined using rdf:type but a domain class c_s
of the predicate of t is defined using rdfs:domain, c_s is taken as
the class of s.
- When the object o of t is not defined using rdf:type but a range class c_o of
the predicate of t is defined using rdfs:range, c_o is taken as the
class of o.
property category 3 (partial): When the property category of p is not 1 or
2, if there exists a triple t ∈ T_p such that each subject or object class of
t is defined using rdf:type or can be inferred as shown in category 2, the
property category of p is 3.
property category 4 (none): When the property category of p is not 1, 2, or
3, it is set to 4.</p>
<p>Note that if a property p is in category 4, all the triples in T_p are junk triples.
Class category. For an RDF dataset R, the class category of R is an index
specifying the coverage of classes in R that are not junk classes. The class category
is defined as follows:
class category 1 (complete): If R contains no junk classes, it is category 1.
class category 2 (partial): When the category of R is not 1 but R contains
at least one non-junk class, it is category 2.
class category 3 (none): When R contains only junk classes, it is category 3.
Endpoint category. For an RDF dataset R, the endpoint category of R is
an index specifying the coverage of non-junk properties and classes. The endpoint
category is defined as follows:
endpoint category 1 (complete): If R satisfies the following two conditions,
its endpoint category is 1.</p>
<p>- Every property p in R not defined in the RDF Schema is in either category
1 or 2.</p>
<p>- The class category of R is 1.
endpoint category 3 (none): If the class category of R is 3, the endpoint
category of R is 3.
endpoint category 2 (partial): If the endpoint category of R is neither 1 nor
3, it is set to 2.</p>
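Under the definitions above, classifying a property into categories 1-4 can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the `declared` and `inferable` predicates are assumptions that abstract away how class information for a resource is obtained from the dataset.

```python
def property_category(tp, declared, inferable, has_domain, has_range):
    """tp: list of (s, o) pairs for the triples with predicate p (the set T_p).
    declared(x): True if the class of x is explicitly given via rdf:type.
    inferable(x): True if the class of x can be recovered via the predicate's
                  rdfs:domain / rdfs:range declarations.
    has_domain / has_range: whether p has explicit rdfs:domain / rdfs:range.

    Returns the property category 1-4 as defined in the text above."""
    def known(x):
        return declared(x) or inferable(x)

    # Category 1 (complete): every triple fully typed, domain and range declared.
    if all(declared(s) and declared(o) for s, o in tp) and has_domain and has_range:
        return 1
    # Category 2 (complete by inference): every triple typed or inferable.
    if all(known(s) and known(o) for s, o in tp):
        return 2
    # Category 3 (partial): at least one triple typed or inferable.
    if any(known(s) and known(o) for s, o in tp):
        return 3
    # Category 4 (none): all triples in T_p are junk triples.
    return 4
```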
<p>Resultant RDF files. As discussed earlier, the crawler obtains data schema and
statistics by accessing a SPARQL endpoint. These data are stored as RDF files
in our SPARQL Builder server and allow the path finder module to rapidly find
relationships among classes. The RDF files are written in standardized
vocabularies, including SPARQL 1.1 Service Description
(http://www.w3.org/TR/sparql11-service-description/) and VoID
(http://www.w3.org/TR/void/), with our original vocabulary adding classes that
describe the relationships among subject-object classes extracted from triples.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Statistical analysis of data preprocessed by the crawler module</title>
<p>To evaluate the performance of the crawler module of SPARQL Builder, we
selected five self-maintained, large-scale databases used in cutting-edge biological
research as SPARQL endpoints; namely, Expression Atlas
(http://www.ebi.ac.uk/rdf/services/atlas/sparql), BioModels
(http://www.ebi.ac.uk/rdf/services/biomodels/sparql), BioSamples
(http://www.ebi.ac.uk/rdf/services/biosamples/sparql), ChEMBL
(http://www.ebi.ac.uk/rdf/services/chembl/sparql) and Reactome
(http://www.ebi.ac.uk/rdf/services/reactome/sparql). As described in
Subsection 3.4, the property schemata of SPARQL endpoints are computed by the
crawler module. As shown in Table 1, the crawler module successfully obtained
the property schemata of the five SPARQL endpoints valid as of July 2014.</p>
<p>Because all five of the SPARQL endpoints are classified as endpoint category
2, only part of the RDF data provided by the endpoints can be retrieved by the inbuilt
SPARQL queries. For a SPARQL endpoint in endpoint category 2, the RDF data
coverage can be evaluated in more detail by the resulting property categories.</p>
<p>The number of properties in property category 2, for which complete
class-class relationships are inferred from the subject-object classes of triples, is
worthy of special mention. The inferred class-class relationships cannot be
retrieved dynamically from the SPARQL Builder GUI within a practical length
of time, since all triples of the corresponding properties must be analyzed by
executing SPARQL queries. As shown in the "property categories" column of
Table 1, the crawler successfully inferred the class-class relationships for 55%
of the properties (378 out of 693) obtained from the five endpoints.
These inferred relationships account for 68% of all triples (488,539,717 out of
710,506,582) in the five RDF databases. Each such property requires 48
seconds of runtime, which is unacceptably long for an interactive system.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Application example</title>
<p>As mentioned above, SPARQL Builder was originally designed to obtain new
annotations from SPARQL endpoints via TogoTable. Therefore, we connected
SPARQL Builder and TogoTable, and evaluated whether users could extract
their desired annotations from arbitrary RDF databases.</p>
<p>Figure 2 shows how SPARQL Builder is connected to TogoTable. TogoTable
contains the user-uploaded table data, a start class corresponding
to a key column of the table data, and the URL of the SPARQL endpoint of
a database containing the user's desired annotations. All this information is
sent to SPARQL Builder. Based on the start class and the URL of the SPARQL
endpoint, SPARQL Builder displays candidate end classes. If the user selects
an end class, SPARQL Builder shows the possible paths from the start class to
the end class. If the user selects a path, a SPARQL query is generated and
sent to TogoTable. TogoTable then adds a new column containing the desired
annotations returned by the query. SPARQL Builder will be supported in the
next release of TogoTable.</p>
<p>Using SPARQL Builder, a user can discover a sequentially connected triple
path in an arbitrary SPARQL endpoint. An intuitive GUI specifies the class-class
relationships as the data schema of the path. To obtain these data schema,
users access the corresponding SPARQL endpoint directly via the GUI or rely on
the crawler, which obtains comprehensive data schema in advance and stores the
results in an RDF file. Although the RDF file is designed to rapidly find the
user's data schema, it also captures the essence of the data semantics as a
profile of the corresponding SPARQL endpoint. Life-science data are especially
large-scale and comprehensive, and are stored in diverse classes such as genes
and phenotypes. The crawler result files will assist biologists in understanding data
semantics as a general data retrieval tool.</p>
<p>The SPARQL endpoints of EBI and similar sources have begun to
provide manually drawn data schema, enabling users to intuitively program their
SPARQL queries. However, we observed that the output data schema
generated by the crawler module is comprehensive but cannot emphasize important
structures. The search results of the queries generated by SPARQL Builder
are also imperfect, as a query may yield no solutions. This occurs because our
tool investigates only the relationships between classes connected by a
property. Such class-class relationships cannot prove the existence of a sequentially
connected instance (triple) path. To ensure that the generated queries always
discover solutions, the crawler would need to traverse all possible triple paths of two
or more steps, which is not practically feasible for arbitrary SPARQL endpoints.
Our method provides the best compromise between intelligent exploration of a
SPARQL endpoint and the convenience of generating SPARQL queries.</p>
<p>In the future, we hope that our tool will support queries for discovering class-class
relationships, such as ontology terms. We also hope to support additional
structures such as blank nodes, class inference and the OWL vocabulary.</p>
<p>Acknowledgments. This work was supported by JSPS KAKENHI Grant
Number 25280081 and by the National Bioscience Database Center (NBDC) of
the Japan Science and Technology Agency (JST).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
1.
<article-title>The UniProt Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt)</article-title>
          .
          <source>Nucl. Acids Res</source>
          .
          <volume>40</volume>
          (
          <issue>D1</issue>
          ),
<fpage>D71</fpage>
-
<lpage>D75</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Belleau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nolin</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tourigny</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigault</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morissette</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>Bio2RDF: towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>J. Biomed. Inform</source>
          .
          <volume>41</volume>
          (
          <issue>5</issue>
          ),
<fpage>706</fpage>
-
<lpage>716</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jupp</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malone</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolleman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandizi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaulton</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehant</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laibe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redaschi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wimalaratne</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Novere</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parkinson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birney</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jenkinson</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          :
          <article-title>The EBI RDF platform: linked open data for the life sciences</article-title>
          .
          <source>Bioinformatics</source>
          <volume>30</volume>
          (<issue>9</issue>),
          <fpage>1338</fpage>-<lpage>1339</lpage>
          (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kawano</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watanabe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mizuguchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Araki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katayama</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>TogoTable: cross-database annotation system using the Resource Description Framework (RDF) data model</article-title>
          .
          <source>Nucl. Acids Res.</source>
          <volume>42</volume>
          (<issue>W1</issue>),
          <fpage>W442</fpage>-<lpage>W448</lpage>
          (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ferre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reconciling faceted search and query languages for the semantic web</article-title>
          .
          <source>IJMSO</source>
          <volume>7</volume>
          (<issue>1</issue>),
          <fpage>37</fpage>-<lpage>54</lpage>
          (<year>2012</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Guyonvarch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Scalewelis: a scalable query-based faceted search system</article-title>
          .
          <source>Multilingual Question Answering over Linked Data (QALD-3)</source>
          , Valencia, Spain
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smart</surname>
            ,
            <given-names>P. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braines</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shadbolt</surname>
            ,
            <given-names>N. R.</given-names>
          </string-name>
          :
          <article-title>NITELIGHT: a graphical tool for semantic query construction</article-title>
          .
          <source>Semantic Web User Interaction Workshop (SWUI)</source>
          , Florence, Italy (<year>2008</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hogenboom</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milea</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasincar</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaymak</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>RDF-GL: a SPARQL-based graphical query language for RDF</article-title>
          . In:
          <string-name>
            <surname>Chbeir</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badr</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassanien</surname>
            ,
            <given-names>A.-E.</given-names>
          </string-name>
          (eds.)
          <source>Emergent Web Intelligence: Advanced Information Retrieval</source>
          ,
          <fpage>87</fpage>-<lpage>116</lpage>
          , Springer, London (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Popov</surname>
            ,
            <given-names>I. O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schraefel</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shadbolt</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Connecting the dots: a multipivot approach to data exploration</article-title>
          .
          <source>International Semantic Web Conference (ISWC 2011), LNCS 7031</source>
          ,
          <fpage>553</fpage>-<lpage>568</lpage>
          (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kozaki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirota</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mizoguchi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Understanding an ontology through divergent exploration</article-title>
          .
          <source>Extended Semantic Web Conference (ESWC 2011)</source>
          ,
          <fpage>305</fpage>-<lpage>320</lpage>
          , Heraklion, Greece (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kementsietsidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Scalable keyword search on large RDF data</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          , doi:10.1109/TKDE.2014.2302294
          (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladwig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudolph</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Managing structured and semistructured RDF data using structure indexes</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>25</volume>
          (<issue>9</issue>),
          <fpage>2076</fpage>-<lpage>2089</lpage>
          (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toyoda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>BioSPARQL: ontology-based smart building of SPARQL queries for biological linked open data</article-title>
          .
          <source>SWAT4LS</source>
          ,
          <fpage>47</fpage>-<lpage>49</lpage>
          , London, UK
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>