<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Approach Towards Classifying and Navigating RDF data based on Pattern Structures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>LORIA (CNRS - Universit´e de Lorraine) BP</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vandoeuvre-les-Nancy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>mehwish.alam</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>amedeo.napoli@loria.fr}</string-name>
        </contrib>
      </contrib-group>
      <fpage>33</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>With an increased interest in machine processable data, more and more data is now published in RDF (Resource Description Framework) format. This RDF data is present in independent and distributed resources which needs to be centralized, navigated and searched for domain specific applications. This paper proposes a new approach based on Formal Concept Analysis (FCA) to create a navigation space over semantic web data. This approach uses an extension of FCA and takes RDF triples and RDF Schema present on several independent sources and provide centralized access over the data resulting from several resources. Afterwards, SPARQL queries can be posed over this navigation space to access these distributed resources from one platform for information retrieval purposes.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>Navigation Space</kwd>
        <kwd>Linked Open Data</kwd>
        <kwd>RDF</kwd>
        <kwd>RDF Schema</kwd>
        <kwd>Semantic Pattern Structures</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        With the efforts of Semantic Web community many technologies have been
offered for publishing machine-readable data on web. It annotates textual data
with meta-data and makes it available in the form of ontologies and RDF graphs.
One of the emerging source of such data are published in the form of Linked Open
Data (LOD) cloud [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As a contrast to textual resources, LOD does not need
extensive preprocessing as it is already annotated in the form of entities and
relationships.
      </p>
      <p>
        This structured format leads to other kind of challenges. One of the
basic characteristics of Semantic Web is that it follows a decentralized
publication model, meaning that the RDF graphs are published in several distributed
resources, instead of creating one knowledge-base of statements any one can
contribute new statements and make it publicly available. These resources have
nothing in common except some shared terms. These decentralized graphs should
be integrated through machine/software agents to provide domain specific
applications. Moreover, external schemas in the form of ontologies or taxonomies
can be linked to these data to make sense based on real world conception. Some
of the resources may only consist of ontological information and may not
contain the instantial information such as SWRC ontology [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and some resources
may only contain instantial information such as DBLP. The problem of how to
provide one platform for accessing several and related semantic web resources
to build domain specific applications still persists. This paper focuses on how
to link several related heterogeneous RDF resources present on distributed
locations containing RDF data and RDF Schema into one space which can further
be navigated and searched by the end user. We call it a “navigation space”
because it provides centralization over RDF triples and also over RDF Schema
belonging to several resources. This space provides navigation and querying over
RDF triples as well as RDF Schema.
      </p>
      <p>
        The current paper introduces a new framework based on Formal Concept
Analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which focuses on how a navigation space is created over RDF data
by taking into account an existing RDF Schema to provide search capabilities
over distributed and heterogeneous Semantic Web resources. In the current study
we only consider a part of RDF Schema which keeps subclass-superclass
relation. Other constructs such as sub-property and relation between classes are not
considered here. FCA alone may not be able to handle the complex data such as
RDF data and RDF Schema as it considers only binary data. To deal with this
shortcoming, we employ an extension of FCA, namely Pattern Structures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We
propose a new framework called Semantic Pattern Structures, that takes RDF
triples and the RDF Schema as an input and produces a “navigation space”.
This navigation space provides centralization over distributed RDF resources
and keeps a partially ordered organization of the classes of RDF triples with
respect to RDF Schema from data sources which may or may not be part of
LOD. The lattice structure allows for querying over RDF graph w.r.t. subject,
predicate and object and makes use of the partial order relation between the
concepts. This enables simultaneous search over schema and the facts contained
in an RDF graph. This navigation space allows querying using well-documented
and well-know query language, SPARQL. Several data sources are
simultaneously queried through one platform. To date this is the first attempt to deal
with RDF graph and RDF Schema using pattern structures.
      </p>
      <p>The paper is structured as follows: section 2 gives the motivating example,
section 3 introduces the basic notion of Pattern Structures with an example.
Section 4 details the framework of semantic pattern structures for creating schema
rich navigation space over RDF data. Section 5 defines the querying mechanism
over the obtained navigation space. Section 6 describes the experimental results
of the framework. Section 7 discusses the related work while Section 8 concludes
the paper.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Motivating Example</title>
      <p>For instance, if a patient has been prescribed some Cardiovascular Agent (CVA)
and after a while he is looking for a replacement for his drug because it causes
some allergic condition as a side effect. For finding solution, the doctor should
be able to access this information for obtaining the suggestions for replacing
patient’s drug which is used for the same purpose but does not cause the side
effect reported by the patient. All the required information is present in different
data sources:
– Drugbank1 contains information about drugs with their categories and the
proteins it targets.
– SIDER2 contains drug with their specific side effects.
– The category of drugs CVA exists in MeSH 3.</p>
      <p>– Allergic conditions which is a class of side effects exists in MedDRA4.
The information required by the domain expert cannot be obtained by posing a
simple query on Google. SPARQL queries over each resource should be written
to obtain this answer, which further needs manual work to obtain the required
suggestions as the answers are in the form of list. In order to access such kind of
information, all the data sets must be accessed simultaneously and there should
be a single information space for accessing all the desired information through
navigation and simple SPARQL queries.</p>
      <p>In the current study, we propose a generic framework that provides one space
for accessing several data sources related to some domain. The framework
proposed in this study is generic because it can be applied to any domain having
related information scattered over several resources. Based on the requirements
of the user and the applications, the navigation space can be defined on a limited
number of factual RDF triples. For example, in case of a cardiologist, he will only
be interested in Cardiovascular Agents or some related categories of drugs. This
way the application will include the drugs which are Cardiovascular Agents and
its related categories. Only a subset of datasets should be considered according
to the requirements of the applications. These specification are described by the
domain expert as a starting point for narrowing down the huge cloud of Linked
Data present on-line.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <sec id="sec-3-1">
        <title>Linked Open Data</title>
        <p>
          Recently, Linked Open Data (LOD) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] has become a standard for publishing
data on-line in the form of Resource Description Framework (RDF)5 which can
further be linked to other data sources published in the same format. The
idea underlying RDF is based on storing statements about a particular
resource, where each statement is represented as xsubject, predicate, objecty. A
1 http://www.drugbank.ca/
2 http://sideeffects.embl.de/
3 http://www.ncbi.nlm.nih.gov/mesh/
4 http://meddra.org/
5 http://www.w3.org/RDF/
set of linked statements is referred to as RDF graph which constitutes an RDF
triple store. For instance, Table 2 keeps RDF triples for drugs with their side
effects and drug categories. The prefixes and full forms of all the abbreviations
used in this paper are shown in Table 1. Consider t1 i.e., xs1, p1, o1y, here s1
is subject, p1 is predicate and o1 is the object. A URI U in an RDF graph
is a general form of URL. The actual representation of the drug s1 in the
form of URI is 1http://wifo5-04.informatik.uni-mannheim.de/drugbank/
resource/drugs/DB006691, which is a URI of the DrugBank provided by
University of Mannheim containing all the information about the drug in the form
of triples. DB00669 is the id provided by DrugBank to drug s1 and the name of
the drug is obtained by using the rdfs:label predicate defined in RDFS
vocabulary6. The subject denotes the resource and the predicate denotes properties of
the resource and defines relationship between the subject and the object. In the
rest of the paper we use dereferenced resources i.e., s1 instead of complete URI.
        </p>
        <p>Practically, we consider RDF data w.r.t. three levels according to the value
of predicate.</p>
        <p>– An RDF triple is at the schema level when the predicate has a reserved
value from RDFS vocabulary such as rdfs:subClassOf, rdfs:Class
(represented as sc and Class respectively in the rest of the paper). The keyword
rdfs:Class is used to declare that a set of resources is constituting a class
and rdfs:subClassOf states that all the instances of one class are the
instances of another class. For example, in Table 2, the schema level is given by
triples t6, t8. An example of the tree structure contained in an RDF schema
(in the rest of the paper when we use RDF Schema/schema we mean the
tree structure created by rdfs:subClassOf) related to the drug side effect
and their categories is shown in Figures 1 and 2 respectively.
– An RDF triple is at the type level if it keeps the mappings from an instance
to its class and states that a resource is an instance of a class. The RDF
triple is at this level when the predicate has a reserved value from RDF
vocabulary such as rdf:type (represented as type in the rest of the paper).</p>
        <p>For example, in Table 2, the type level is given by triples t4, t5, t7.
– An RDF triple is at factual level when the predicate is related to the domain
knowledge and is not a keyword from RDF vocabulary. For example, in
Table 2, the factual level is given by t1, t2, t3.</p>
        <p>
          Table 2 keeps RDF triples from several resources. The information about the
side effect of the drugs shown in triples t1, t2 is present on SIDER7 database
which keeps the side effects of the drugs contained on the packaging of the drug.
The information about the categories of the drugs shown in triples t3 is present
on DrugBank8, which keeps detailed information about each of the drugs and
the proteins it targets. The schema level information about the drug side effects
shown in t5, t6 are present in The Medical Dictionary for Regulatory Activities
6 http://www.w3.org/TR/rdf-schema/
7 http://sideeffects.embl.de/
8 http://www.drugbank.ca/
(MedDRA) Terminology9 which is the international medical terminology. Finally
the schema level information related to the drug category shown in triples t7, t8
is present in MeSH (Medical Subject Headings)10 is a controlled vocabulary
thesaurus which along with other tree structures keeps classification of drug
categories.
A standard query language for accessing RDF graphs is SPARQL11 which mainly
focuses on graph matching. A SPARQL query is composed of two parts the head
and the body. The body of the query contains the Basic Graph Patterns (it is
contained in the WHERE clause of the query). It is composed of complex graph
patterns that may include RDF triples with variables, conjunctions,
disjunctions and constraints over the values of the variables. These graph patterns are
matched against the RDF graph and the matched graph is retrieved and
manipulated according to the conditions given in the query. The head of the query is an
expression which indicates how the answers of the query should be constructed.
For instance, consider a simple SPARQL query SELECT ?x ?y WHERE { ?x p1
?y}, then the basic graph pattern ?x p1 ?y will be matched against the triples
containing p1 as predicate i.e., t1 and t2 (see Table 2). Thus the answer of the
query will be (s1, o1), (s1, C2).
9 http://www.meddra.org/
10 http://www.ncbi.nlm.nih.gov/mesh
11 http://www.w3.org/TR/rdf-sparql-query/
Formal Concept Analysis [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] can process only binary context, more complex
data such as graphs, RDF triples can not be directly processed by FCA. Some
of the complex data require scaling for these to be considered as binary data.
The concept lattice obtained by a binary context mixes between several types of
attributes. Pattern structures [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], an extension of FCA, allows direct processing
of such kind of complex data such as RDF triples (as we show in this paper).
The pattern structures were introduced in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>A pattern structure is a triple pG, pD, [q, δq, where G is the set of entities12,
pD, [q is a meet-semilattice of descriptions D and δ : G Ñ D maps an entity
to a description. More intuitively, a pattern structure is the set of objects with
their descriptions with a similarity operation [ on them which represents the
similarity of objects. This similarity measure is idempotent, commutative and
associative. If pG, pD, [q, δq is the pattern structure then the derivation operators
can be defined as:</p>
        <p>A :
¦ δpgq
gPA
d :
tg P G|d  δpgqu
f or A  G
f or d P D
Now the pattern concept can be defined as follows:
Definition 1 (Pattern Concept). A pattern concept of a pattern structure
“G, pD, [q, δ” is a pair pA, dq where A  G and d P D such that A d and
A d , where A is called the concept extent and d is called the concept intent.</p>
        <p>Finally, the obtained concept lattice is called as pattern concept lattice, which
keeps partially ordered relations between the pattern concepts.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Towards Semantic Pattern Structures</title>
      <p>The decentralization issue is raised when many entities are contributing
statements and making it publicly available. Either they are publishing same data
using different vocabularies or related data which is distributed over several
resources. To provide centralized access over distributed resources this section
discusses a framework called Semantic Pattern Structures. This new framework
is called Semantic Pattern Structures (SPS) because it is based on Pattern
Structures and it deals with data in the form of triples and it classifies these triples
with respect to RDF Schema present in the form of taxonomy. This way the final
navigation space also keeps the semantic information i.e., background knowledge.
The idea underlying the SPS is to create a concept lattice directly from an RDF
graph and use the associated RDF Schema present on Linked Open Data to
12 We rename an object in Pattern Structures as an entity to avoid confusion with the
object in an RDF triple.
be accessed within one concept lattice. This concept lattice provides search and
navigation capabilities over RDF data as well as over the RDF Schema to answer
certain queries of the domain expert. The obtained lattice does not only provides
access to several data sources but is also a materialization of the associated RDF
Schema.</p>
      <p>For creating one navigation space over RDF as well as RDF Schema, the
first task is to define RDF triples in terms of patterns structures i.e., specifying
the entities, their descriptions and the mapping from entities to description.
The second task is to select a suitable RDF Schema associated to these RDF
triples from distributed data sources based on domain knowledge. After these two
preprocessing steps, we define a similarity measure [ over two descriptions which
we generalize to the set of descriptions. After defining the similarity measure,
we explain how a semantic pattern concept lattice is built using this similarity
measure. Finally, we interpret, provide navigation and querying capabilities over
the obtained pattern concept lattice.
4.1</p>
      <sec id="sec-4-1">
        <title>From RDF Triples to Semantic Pattern Structures</title>
        <p>Firstly, we represent factual RDF triples about certain domain as entities and
their descriptions. According to section 3.3, pattern structures are given as
pG, pD, [q, δq. The descriptions D are termed as semantic descriptions and are
denoted as Ds. An RDF triple store containing factual level information consists
of statements of the form xsi, pj , oky, where i t1, . . . , nu, j t1, . . . , qu, k
t1, . . . , mu. Then, the set of subjects si are mapped to the entities in G. As the
set of entities G represent the subjects in triples, we represent it as S.</p>
        <p>Regarding the type of each object set which is the range of same predicate
a suitable RDF Schema is obtained by locating the terms in the closely
related RDF Schema. This selection is done based on domain knowledge. RDF
Schema contains many constructs such as property, sub-property etc. along with
sc and information but in this work we use the RDF Schema predicates such
as type and sc introduced in section 3.1. First, the mapping from object to
its type is obtained such as xok, type, C1y i.e., ok is an instance of class C1.
Each object is then replaced by its corresponding class i.e., ok is replace by
C1. Now, the considered taxonomy information used for defining the similarity
measured is present in the RDF Schema by following the predicates such as
rdf s : subClassOf, skos : broader e.g., xC1, sc, C6y, i.e., C1 is subclass of C6.
Note that we do not extract the trees, we follows this predicate while defining
the similarity in the next section.</p>
        <p>For instance, to obtain generic drug categories such as cardiovascular agents
or generic side effects such as allergic conditions we further add the class
information related to each of the objects in the triples. For example, (ok,type,C2)
meaning that ok is an instance of class C2. Afterwards, we select all the super
classes of the current class in the RDF Schema. For example, for C2, we have
C7, C11 and C13. An RDF Schema for drug side effect and drug category is shown
in Figure 1 and Figure 2 respectively.</p>
        <p>C13
C6
C1</p>
        <p>C11</p>
        <p>C7
C2</p>
        <p>C12</p>
        <p>C10
C8
C3</p>
        <p>C9
C4</p>
        <p>C5</p>
        <p>D6
D4
D1</p>
        <p>D9
D8</p>
        <p>D7</p>
        <p>D5
D2</p>
        <p>D3
Fig. 1: RDF Schema for Side Effects (MedDRA) Fig. 2: RDF Schema for Drug Category (MeSH)
pT1q. pT2q.</p>
        <p>The pair ppj : Ckq gives a description dij P Ds. The mapping of entities to
description δ : S Ñ Ds is given as follows: let si P S then δpsiq tdi1, . . . , diqu
where i P t1, . . . , nu and dij tpj : tC1, C4, . . . , Cmuu where j P t1, . . . , qu.
Finally, the semantic pattern structures are given as pS, pDs, [q, δq. Consider the
running scenario described before. As the drugs in the DrugBank and SIDER are
same and are the subjects then the triples t1, t2, t3 in Table 2 are represented as
pS, pDs, [q, δq where the entity S has the description δps1q = {(p1 : tC1, C2, C3u),
(p2 : tD2u) }. The descriptions are based on same subjects having different
descriptions belonging to different resources i.e., side effect and category. Finally,
with respect to the triples in Table 2, the subjects in the RDF triples are mapped
to the entities, S {s1, s2, . . . }, the descriptions are given as Ds = {(p1 :
tC1, C2, C3, . . . }) , (p2 : tD1, D2, . . . u)}.</p>
        <p>Entities S Descriptions Ds
{(p1 : tC1, C2, C3u), (p2 : tD2u)}
{(p1 : tC1, C4, C5u), (p2 : tD3u)}
{(p1 : tC2u), (p2 : tD3u)}
{(p1 : tC3, C4, C5u), (p2 : tD8u)}
{(p1 : tC1, C3u), (p2 : tD2u)}</p>
        <p>
          Similarity Operation ([) over Descriptions
For defining the similarity operation over the semantic description Ds, we use
the notion of least common subsumer (lcs). Here we re-write the definition of a
least common subsumer of two classes in a RDF Schema T given in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Definition 2 (Least Common Subsumer). Given a partially ordered set</title>
        <p>pS, ¤q, a least common subsumer E of two classes C and D (lcs(C,D) for short)
in a partially ordered set is a class such that C  E and D  E and E is least
i.e., if there is a class E1 such that C  E1 and D  E1 then E  E1.
Example 1. Let us consider C3 and C5. Then according to the RDF Schema in
Figure 1, the common subsumers of C3 and C5 are C10, C12, C13 and lcspC3, C5q
C10.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Similarity Operation over Set of Descriptions: For describing the similar</title>
        <p>ity over set of descriptions, LCS is computed pairwise for classes in two different
sets of description and then only the most specific classes are picked.</p>
        <p>Let s1 and s2 be two entities such that s1, s2 P S and the mapping δ is given
as follows δps1q d1 tp1 : tC1, C2, C3uu and δps2q d2 tp1 : tC1, C4, C5uu.
The similarity is always computed w.r.t the same predicate, here p1. A dual
similarity measure is defined to ensure classification of triples with respect to
objects as well as with respect to classes from RDF Schema. The similarity
measure is computed based on the taxonomy given in Figure 1. According to
the current approach a pairwise similarity measure is computed between the two
sets of classes connected through some predicate. For example, δps1q [ δps2q
xp1 : tC1, C2, C3uy [ xp1 : tC1, C4, C5uy. Meaning that δps1q [ δps2q p1 :
xlcspC1, C1q, lcspC1, C5q, . . . y. Then, δps1q [ δps2q p1 : tC1, C10, C11, C13u. As
we have C1 ¤ C11 ¤ C13 and C10 ¤ C13 then we pick only the specific classes
i.e., C1 and C10. Finally, δps1q [ δps2q p1 : tC1, C10u.</p>
        <p>Now let us consider the complete descriptions of the two entities i.e., δps1q
{(p1 : tC1, C2, C3u), (p2 : tD2u)} and δps2q {(p1 : tC1, C4, C5u), (p2 : tD3u)}.
Then, δps1q [ δps2q {(p1 : tC1, C2, C3u), (p2 : tD2u)} [ {(p1 : tC1, C4, C5u),
(p2 : tD3u)}. The set of least common subsumers for p1 will be C1, C10. The
least common subumer between p2 : tD3u and p2 : tD2u is p2 : tD5u. Finally,
the similarity between these two drugs will be {(p1 : tC1, C10u), (p2 : tD5u)}.</p>
        <p>
          The proposed similarity measure fulfills the property of a similarity operator
i.e., commutative, associative and idempotent [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The number of RDF Schema
considered are equal to number of predicates in the entity-description
representation of an RDF graph. In the running example, we use two schemas to show that
our method can also be easily applied to multiple RDF Schemas which are either
independent or parts of the same schema distributed over several resources.
4.3
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Building the Semantic Pattern Concept Lattice</title>
        <p>For building the semantic pattern concept lattice using the above similarity
measure over several RDF Schemas, according to the running example, δps1q [
δps2q tpp1 : tC1, C10uq, pp2 : tD5uqu. The semantic pattern concept lattice is
built according to definition 1. For instance, ts1, s2ul tpp1 : tC1, C10uq, pp2 :
tD5uqu but tpp1 : tC1, C10uq, pp2 : tD5uqu l ts1, s2, s5u. This way we obtain
K#2 (see Figure 3) with 3 drugs giving the concept a support of 3. Finally,
the pattern concept lattice is built for Table 3 with respect to T1(Figure 1) and
T2(Figure 2). This pattern concept lattice is shown in Figure 3 and the details
of each of the pattern concept is shown in Table 4.</p>
        <p>K#5</p>
        <p>K#4</p>
        <p>K#6
K#7</p>
        <p>K#3</p>
        <p>K#8
K#2
K#0</p>
        <p>K#9
K#10
K#13
K#1</p>
        <p>K#11</p>
        <p>K#12</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.4 Interpreting and Navigating the Pattern Concept Lattice</title>
        <p>This pattern concept lattice not only serves as a navigation space but also
provides clustering of RDF triples w.r.t. some Schema. Each of the pattern concepts
is a cluster of triples. For instance, consider the simplest example in Table 4 i.e.,
K#1, where there is only one entity in the extent, then this cluster contains the
triples tps1, p1, C1q, ps1, p2, D2q, . . . u. Here it can be seen that all the classes are
connected to single subject S through two predicates. This space can further be
navigated to provide the required information to the user without the
technical background of Semantic Web. Let us consider, if user is looking for drugs
causing side effect C10, the located pattern concept will be K#9 which contains
all the drugs causing some allergic conditions. Afterwards, if the user is looking
for some very specific allergic condition such as C3 then he can simply navigate
downwards to obtain more specific concept i.e., K#10 which contains all the
drugs which cause the side effect C3. Let us consider the scenario described in
section 2, here the user will be looking for cardiovascular agents (D5) which do
not cause any allergic conditions. In this case first the concept containing all
the cardiovascular agents are located i.e., K#4 afterwards the lattice is further
navigated downwards to obtain the drugs causing C10 (allergic conditions) i.e.,
K#2. Now the drugs which are in K#4 and not in K#2 are the CVAs which
do not cause allergic conditions.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>SPARQL Queries over Pattern Concept Lattice</title>
      <p>For the user with the basic knowledge of SPARQL queries, this navigation space
can be easily accessed by posing simpler queries. This sub-lattice gives additional
information related to the query (detailed below). The obtained pattern concept
lattice keeps the materialized RDF Schema meaning that it does not require the
query to perform reasoning. It also allows for accessing several resources from
one platform. This navigation space allows queries which are subject, predicate
or object bound. In this section we describe simple as well as complex SPARQL
queries that can be posed over the this navigation space. Simple queries refer to
the queries with simple Basic Graph Patterns such a single triple pattern in the
WHERE clause of the SPARQL query, on the other hand, complex queries allow
joins and negations.
5.1</p>
      <sec id="sec-5-1">
        <title>Simple Queries</title>
        <p>Simple queries include subject, predicate or object bound queries. Subject-bound
queries refer to the queries where the value of the subject in an RDF triple
xs, p, oy is known while the values of predicate, object and classes of objects
are to be retrieved. Let a SPARQL query be SELECT ?p ?o WHERE {si ?p ?o},
where si is a known subject and a pattern concept be pA, dq such that si P A.
The BGP in this query is (si ?p ?o), this BGP is matched against the RDF
graph present in the pattern concept lattice. The answer of the query will be
the object concept tsiul tdiu where di P d where d is the intent of the
concept pA, dq. Note that this object concept is already computed while building
a pattern concept lattice. In other words, all the predicate-object pairs tdiu
connected to the subject si in the RDF graph G are returned as a result. For
example, if user wants to know complete information about the drug s1 i.e.,
its side effect and the category, then SELECT ?p ?o WHERE {s1 ?p ?o}. It will
search for object concept in the pattern concept lattice given in Figure 3. For
instance, s1l pp1 : tC1, C2, C3u, p2 : tD2uq (K#1) and the answer would
be pp1 : tC1, C2, C3u, p2 : tD2uq. The first part of the answer pp1 : tC1, C2, C3u
belongs to SIDER while p2 : tD2u belongs to DrugBank. However, because of the
strong lattice-structure a sub-lattice is retrieved by navigating to more general
concepts (upward navigation) present in the pattern concept lattice i.e., all the
concepts connected to K#1 i.e., K#2, K#4, K#5, K#9, K#10, K#13, K#8.
Here, K#1 is the focus concept because it keeps the exact answer to the posed
SPARQL query and the rest of the concepts contain additional information.
This way the user not only gets the desired answer but also some additional
information about the drug i.e., the classification of side effects from MedDRA
and the features it shares with other drugs and with which drugs it shares certain
features. Object-bound queries are answered in the same way as described above
by retrieving the attribute concept and the linked concepts. However, in this case
the additional information will be retrieved by navigating downwards from the
focus concept because it is the object-bound query and objects are contained in
the intent of the pattern concepts.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Queries with Joins</title>
        <p>Now let us consider a slightly more complex query which includes a join and
a class resources to be retrieved. If it is specified that the drugs to be
retrieved should be Cardiovascular Agent causing some Allergic Condition. Then
the SPARQL query with subject-subject join will be as follows:</p>
        <sec id="sec-5-2-1">
          <title>SELECT ?drug WHERE {</title>
          <p>?drug p2 D5.
?drug p1 C10 }</p>
          <p>For obtaining the answer, the BGP in the above query is matched against
the concept which contains both the predicates p1 and p2, along with this
predicate it keeps the classes of objects CardiovascularAgents (D5) connected to
predicate p2 as well as AllergicConditions (C10) connected to the predicate
p1. The answer will be ts1, s2, s5u i.e., K#2, which will be the focus concept.
However, to obtain more specific information regarding these classes of drug
categories and the side effects will be returned by navigating downwards in
the concept lattice. Finally, a sub-lattice containing K#2, K#3, K#13, K#1 is
returned as an answer to the above query. In this case, the user will have an
additional information regarding more specific categories related to Cardiovascular
Agents causing Allergic Conditions such as according to K#13 drugs ts1, s5u are
D2 and K#3 ts2u is D3. Similarly, specific allergic conditions are also retrieved.
5.3</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Queries with Filtering Criteria</title>
        <p>Now let us move towards the scenario where the doctor is looking for drug
replacement for the patient having heart conditions based on its side effects. For
replacing the drugs, the system can be queried by locating all the CVAs (this
information comes from DrugBank, MeSH in the current navigation space) and
then excluding all the CVAs causing Allergic Conditions. The related SPARQL
query can be given as follows:</p>
        <sec id="sec-5-3-1">
          <title>SELECT ?drug WHERE {</title>
          <p>?drug p2 D5.
?drug p1 ?se
FILTER (?se != C10) }</p>
          <p>The condition is given in the filter clause because SPARQL does not support
negation operation directly. In order to do so, all the drugs which are CVAs are
retrieved by SPARQL and then it filters the drugs which do not cause allergic
conditions. Same is the case with the lattice operation where concept K#4 is
selected because it contains all the cardiovascular agents and then a more
specific concept is obtained containing the CVAs causing allergic conditions i.e.,
K#2 and minus operator is applied to perform the desired filtering. The answer
obtained is s3.</p>
          <p>In a sense, the navigation space follows the query mechanism provided by
Google where a query is posed by the user and this query is mapped to a
preexisting cluster of documents. In the same way, when a query is posed by the
user, our query mechanism maps the query to a cluster of RDF triples, where
each element in an RDF triple keeps the URI of a web-page.
6</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experimentation</title>
      <p>The current approach was applied to biomedical data. This section details the
experimental evaluation for the creation of the navigation space The proposed
algorithm was coded in C++ and the experiments was performed using 3GB
RAM on Ubuntu version 12.04.
6.1</p>
      <sec id="sec-6-1">
        <title>Drug Search</title>
        <p>The datasets containing information about drugs are DrugBank, SIDER,
MedDRA and MeSH as described in section 3.1. DrugBank keeps detailed
information about each of the drugs, their categories and the proteins it targets. The
other database is SIDER which keeps the side effects of the drugs contained
on the packaging of the drug. The access to these two data sets is provided by
University of Mannheim through two different endpoints1314.</p>
        <p>
          The schema associated to the side effects of the drug is available on
BioPortal [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which is a web portal that provides access to a repository of biomedical
ontologies. BioPortal is developed by National Center for Biomedical Ontology
(NCBO) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The Medical Dictionary for Regulatory Activities (MedDRA)
Terminology is the international medical terminology. During this experiment we
will enrich side effects with schema level information using MedDRA
terminology. In case of the drug categories MeSH vocabulary thesaurus was taken into
account. MeSH (Medical Subject Headings) is a controlled vocabulary thesaurus.
The drug categories from DrugBank will be enriched with the tree numbers from
MeSH vocabulary. The tree numbers arrange the terms from MeSH in a
hierarchical manner known as MeSH Tree Structures. In the current experiment we
13 http://wifo5-04.informatik.uni-mannheim.de/drugbank/snorql/
14 http://wifo5-04.informatik.uni-mannheim.de/sider/snorql/
used the MeSH vocabulary already present in the form of RDF in Bio2RDF [
          <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
          ],
which makes the public databases related to Bioinformatics such Kegg, MeSH,
HGNC etc. available in the RDF format.
        </p>
        <p>During this experiment, two subsets of the dataset were considered. Both
belonging to two major classes of drugs i.e., Cardiovascular Agents (CVA) and
Central Nervous System (CNS).
6.2</p>
      </sec>
      <sec id="sec-6-2">
        <title>Evaluation</title>
        <p>In the following, we study the scalability of Semantic Pattern Structures over
large dataset. Table 5 precises the statistics of the data. Pattern concept lattices
over both the chosen data sources was built in 0-25 seconds for the maximum of
31098 triples. Figure 4b shows the variation in the size of the navigation space
for both data sets. The navigation space contains a maximum of around 500000
clusters of triples which were generated in 25 seconds. However, there are several
ways to reduce these number of concepts. The filtering based on the depth of
classes considered in the taxonomy which allows the reduction in the number of
clusters while generating the concept lattice and hence causes decrease in the
runtime of creating the navigation space. Most of these very general classes are
not very interesting for the domain expert. Moreover, there are several measures
such as support, stability and lift which allow post-filtering of the navigation
space.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Related work</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the author focuses on allowing conceptual navigation to RDF graphs,
where each concept is accessed through SPARQL-like queries. However, in our
case several RDF graphs are considered and we use the already existing,
wellestablished and well-documented query language, SPARQL. Moreover, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
introduces ontological pattern structures for enriching raw data with E L
ontologies. But both the approaches consider only one resource at a time, hence not
targeting the problem of decentralization. As a contrast, our approach provide
navigation space over RDF graphs as well as schema level information from
several resources allowing user to access information from one platform. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the
authors introduce a new clause Categorize By which clusters SPARQL query
answers w.r.t to background knowledge present as ontologies. Like our approach,
only taxonomy is used for enriching, however, unlike our approach if clusters the
answers after obtaining the query answers. As a contrast, our approach provides
classification of the RDF triples as well as RDF Schema before hand.
Afterwards, SPARQL queries are mapped to the existing clusters and the answer is
shown to the user. In such a case, the query response time is faster than the
Categorize By clause. Moreover, as it provides clustering over answers only, it
lacks the capability to provide user with additional information.
8
      </p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion and Future Work</title>
      <p>This paper proposes a new approach for navigating semantic web data and
targets the capabilities of FCA to deal with RDF data. It provides navigational
and search capabilities over RDF triples as well as RDF Schema distributed
over several resources. This paper proposes a new similarity measure for pattern
structures to deal with RDF data as well as RDF Schema simultaneously, termed
as Semantic Pattern Structures. The pattern concepts in the concept lattice are
considered as clusters of of RDF triples which can then be navigated and queried
by the user.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Franc¸ois Belleau,
          <string-name>
            <surname>Marc-Alexandre</surname>
            <given-names>Nolin</given-names>
          </string-name>
          , Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2rdf:
          <article-title>Towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>41</volume>
          (
          <issue>5</issue>
          ):
          <fpage>706</fpage>
          -
          <lpage>716</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Tom Heath, and
          <string-name>
            <surname>Tim</surname>
          </string-name>
          Berners-Lee.
          <article-title>Linked data - the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Adrien</given-names>
            <surname>Coulet</surname>
          </string-name>
          , Florent Domenach, Mehdi Kaytoue, and
          <string-name>
            <given-names>Amedeo</given-names>
            <surname>Napoli</surname>
          </string-name>
          .
          <article-title>Using pattern structures for analyzing ontology-based annotations of biomedical data</article-title>
          .
          <source>In 11th International Conference on Formal Concept Analysis</source>
          ,,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Claudia d'Amato</surname>
            ,
            <given-names>Nicola</given-names>
          </string-name>
          <string-name>
            <surname>Fanizzi</surname>
            , and
            <given-names>Agnieszka</given-names>
          </string-name>
          <string-name>
            <surname>Lawrynowicz</surname>
          </string-name>
          .
          <article-title>Categorize by: Deductive aggregation of semantic web query results</article-title>
          . In Lora Aroyo, Grigoris Antoniou, Eero Hyv¨onen, Annette ten Teije, Heiner Stuckenschmidt, Liliana Cabral, and Tania Tudorache, editors,
          <source>ESWC (1)</source>
          , volume
          <volume>6088</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>91</fpage>
          -
          <lpage>105</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Michel</given-names>
            <surname>Dumontier</surname>
          </string-name>
          , Alison Callahan, Jose Cruz-Toledo,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Ansell</surname>
          </string-name>
          , Vincent Emonet, Franc¸ois Belleau, and Arnaud Droit.
          <article-title>Bio2rdf release 3: A larger, more connected network of linked data for the life sciences</article-title>
          .
          <source>In Posters &amp; Demonstrations Track ISWC.</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. S´ebastien Ferr´e.
          <article-title>Conceptual navigation in RDF graphs with sparql-like queries</article-title>
          .
          <source>In Formal Concept Analysis, 8th International Conference, ICFCA</source>
          <year>2010</year>
          , Agadir, Morocco, March
          <volume>15</volume>
          -18,
          <year>2010</year>
          . Proceedings, pages
          <fpage>193</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          .
          <article-title>Pattern structures and their projections</article-title>
          .
          <source>In Harry S. Delugach and Gerd Stumme</source>
          , editors,
          <source>ICCS</source>
          , volume
          <volume>2120</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer, Berlin/Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Ku</surname>
          </string-name>
          <article-title>¨sters and Ralf Molitor</article-title>
          .
          <article-title>Computing least common subsumers in ALEN</article-title>
          .
          <source>In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI</source>
          , pages
          <fpage>219</fpage>
          -
          <lpage>224</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mark</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Musen</surname>
          </string-name>
          , Natalya Fridman Noy,
          <string-name>
            <surname>Nigam H. Shah</surname>
          </string-name>
          , Patricia L. Whetzel, Christopher G. Chute,
          <string-name>
            <surname>Margaret-Anne D. Storey</surname>
            , and
            <given-names>Barry</given-names>
          </string-name>
          <string-name>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>The national center for biomedical ontology</article-title>
          .
          <source>JAMIA</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>190</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. York Sure, Stephan Bloehdorn,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Haase</surname>
          </string-name>
          , Jens Hartmann, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Oberle</surname>
          </string-name>
          .
          <article-title>The swrc ontology - semantic web for research communities</article-title>
          .
          <source>In EPIA</source>
          , volume
          <volume>3808</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>218</fpage>
          -
          <lpage>231</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Patricia L. Whetzel</surname>
          </string-name>
          , Natalya Fridman Noy,
          <string-name>
            <surname>Nigam H. Shah</surname>
            , Paul R. Alexander, Csongor Nyulas, Tania Tudorache, and
            <given-names>Mark A.</given-names>
          </string-name>
          <string-name>
            <surname>Musen</surname>
          </string-name>
          .
          <article-title>Bioportal: enhanced functionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>39</volume>
          (
          <string-name>
            <surname>WebServer-Issue</surname>
          </string-name>
          ):
          <fpage>541</fpage>
          -
          <lpage>545</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>