<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Ranking-Based Approach to Discover Semantic Associations Between Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mar´ıa-Esther Vidal</string-name>
          <email>mvidal@ldc.usb.ve</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louiqa Rashid</string-name>
          <email>louiqa@umiacs.umd.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Iba´n˜ ez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean Carlo Rivera</string-name>
          <email>jrivera@ldc.usb.ve</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>He´ctor Rodr´ıguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edna Ruckhaus</string-name>
          <email>ruckhaus@ldc.usb.ve</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Simo ́n Bol ́ıvar Caracas</institution>
          ,
          <country country="VE">Venezuela</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Maryland</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Under the umbrella of the Semantic Web, Linked Data projects have the potential to discover links between datasets and make available a large number of semantically inter-connected data. Particularly, Health Care and Life Sciences have taken advantage of this research area, and publicly hyper-connected data about disorders and disease genes, drugs and clinical trials, are accessible on the Web. In addition, existing health care domain ontologies are usually comprised of large sets of facts, which have been used to annotate scientific data. For instance, annotations of controlled vocabularies such as MeSH or UMLS, describe the topics treated in PubMed publications, and these annotations have been successfully used to discover associations between drugs and diseases in the context of the Literature-Based Discovery area. However, given the size of the linked datasets, users have to spend uncountable hours or days, to traverse the links before identifying a new discovery. In this paper we provide an authority-flow based ranking technique that is able to assign high scores to terms that correspond to potential novel discoveries, and to efficiently identify these highly scored terms. We propose a graph-sampling method that models linked data as a Bayesian network and implements a Direct Sampling reasoning algorithm to approximate the ranking scores of the network. An initial experimental study reveals that our ranking techniques are able to reproduce state-of-the-art discoveries; additionally, the sampling-based approach is able to reduce the exact solution evaluation time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>During the last decade, emerging technologies such as the Semantic Web, the
Semantic Grid, Linked Data projects, and affordable computation and network access, have
made available a great number of publicly inter-connected data sources. Life science
is a good example of this phenomenon. This domain constantly evolves, and has
generated publicly available information resources and services whose number and size,
have dramatically increased during the last years. For example, the amount of gene
expression data has grown exponentially, and most of the biomedical sources that publish
this information have been gaining data at a rate of 300 % per year. The same trend
is observed in biomedical literature where the two largest interconnected bibliographic
databases in biomedicine, PubMed and BIOISIS, illustrate the extremely large size of
the scientific literature today. PubMed publishes at least 16 million references to journal
articles, while BIOSIS makes available more than 18 million abstracts.</p>
      <p>
        On the other hand, a great number of ontologies and controlled vocabularies have
become available under the umbrella of the Semantic Web. Ontologies are specified in
different standard languages, such as XML, OWL or RDF, and regular requirements are
expressed using query languages such as SPARQL. Ontologies play an important role
and provide the basis for the definition of concepts and relationships that make global
interoperability among available Web resources possible. In the Health Care and Life
Sciences domains, large ontologies have been defined; for example, we can mention
MesH [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Disease [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Galen [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], EHR RM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], RxNorm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and GO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Ontologies are commonly applied in these domains to annotate publications, documents,
and images; also ontologies can be used to distinguish similar concepts, to generalize
and specialize concepts, and to derive new properties. To fully take advantage from the
linked data sources and their ontology annotations, and to be able to recognize novel
discoveries, scientists have to navigate through the inter-connected sources, and
compare, correlate and mine some of these annotated data. Nevertheless, because the size
and number of available sources and the set of possible annotations are very large, users
may have to spend countless hours or days before recognizing relevant findings.
      </p>
      <p>In order to facilitate the specification of scientist’s semantic connection needs, we
present a ranking technique able to assign high scores to potential novel associations.
Furthermore, given the size of the search space and to reduce the effect of the
number of available linked data sources and ontology annotations on the performance, we
also propose an approximate solution named graph-sampling. This approximate
ranking technique samples events in a Bayesian network that models the topology of the
data connections; it also estimates ranking scores that measure how important and
relevant are the associations between two terms. In addition, the approximate technique
exploits information about the topology of the hyperlinks and their ontology
annotations, to guide the ranking process into the space of relevant and important terms.</p>
      <p>
        In this paper we describe our ranking techniques and show their effectiveness and
efficiency. The paper is composed of five additional sections. In Section 2, we compare
existing approaches. Section 3 illustrates techniques proposed in the area of Literature
Based Discovery (LBD) by showing the discovery reported in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] where curcumin
longa was associated with retinal diseases. Section 4 describes our proposed sampling
technique. Section 5 reports our experimental results. Finally, we give our conclusions
and future work in Section 6.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Under the umbrella of the Semantic Web, Linked Data projects have proposed
algorithms to discover links between datasets. Particularly, the Linking Open Drug Data
(LODD) task has connected a list of datasets that includes disorders and disease genes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
clinical trials [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and drug banks [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Some of these link discovery or generation tools
apply similarity metrics to detect potential similar concepts and their relationships [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
However, none of the existing link discovery techniques make use of information about
the link structure to identify potential novel associations. Also, the ontology voID [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
has been proposed to describe interlinked datasets and enable their discovery and usage,
and provides the basis for our proposed approach.
      </p>
      <p>
        The discovery of associations between data entries implies descriptive and
predictive inference tasks based on the link structure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and on semantics suggested by
relevant ontologies. In general the idea is to perform random walks in the space of
possible associations and discover those that satisfy a particular pattern; correspondences
between the discovered patterns are measured in terms of similarity functions. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
heuristics are used to discover relevant subgraphs within RDF graphs; relationships
among the metadata describing nodes is used to discover relevant relationships among
entities. To decide if two objects are semantically similar, Jeh et. al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] propose a
measure that reflects when two objects are similar based on the relationships that they
hold with similar objects. Yan et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] propose strategies to efficiently search
subgraphs that are similar to a given query graph. Finally, Hu et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Kuramochi
and Karypis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] describe efficient algorithms to discover subgraphs (patterns) that
occur in graphs and to aggregate them.
      </p>
      <p>
        Sampling techniques have been successfully applied to a variety of approximation
techniques. For example, in the context of query optimization, different sampling-based
algorithms have been proposed to estimate the cardinality of a query efficiently [
        <xref ref-type="bibr" rid="ref13 ref14 ref18">13,
14, 18</xref>
        ]. The challenge of these methods is to reach estimates that satisfy the required
confidence levels while the size of the sample remains small. A key decision involves
when to stop sampling the population and this is determined by the mean and variance of
the sample in comparison to the target population. In this paper we propose a technique
that samples paths in an acyclic directed graph that models a dataset of linked data.
Paths are sampled based on the joint probability which is computed as the multiplication
of the authority transfer flow value of the edges that comprise the path. Similarly, we
define the stop condition of the sampling, based on an estimate of the metric score
mean. Related to the problem of estimating authority flow metrics, Fogaras et. al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
implement a Monte-Carlo based method to approximate personalized PageRank scores.
They sample paths whose length is determined by a geometric distribution. Paths are
sampled from a Web graph based on a probability that represents whether objects in
the paths can be visited by a random surfer. This approach may provide a solution to
PageRank; however, it is not applicable to our proposed approach because the length
of the paths is determined by the number of layers in the results graph and cannot be
randomly chosen. In contrast, graph-sampling samples objects layer by layer, until the
last layer in the result graph is visited. Objects with higher probability to be visited
by a random surfer and links between these objects, will have greater chance to be
chosen during the sampling process. Thus, graph-sampling may be able to only traverse
relevant paths that correspond to relevant discoveries.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Motivating Example</title>
      <p>
        Consider the area of LiteratureB-ased Discovery (LBD) where by traversing scientific
literature annotated with the controlled vocabularies like MesH, drugs have been
associated with diseases [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ]. LBD can perform Open or Closed discoveries, where
a scientific problem is represented by a set of articles that discuss an input problem
(Topic A), and the goal is to prove the significance of the associations between A and
some other C topics discussed in the set of publications reachable from the initial set
of publications relevant to A. Srinivasan et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] followed this idea and improved the
Open and Closed techniques by recognizing that articles in PubMed have been curated
and heavily annotated with controlled vocabulary terms from the MeSH (Medical
Subject Heading) ontology. Relationships between publications and terms are annotated
with weights or scores that represent the relevance of the term in the document. MeSH
term weights are a slight modification of the commonly used T F/IDF scores. Figure 1
illustrates a directed graph that represents the terms and publications visited during the
evaluation of an Open discovery. Topic A is used to search on the PubMed site and
retrieve relevant publications, named PubA. Then, MeSH term annotations are extracted
from publications in PubA, and filtered by using a given set of semantic types of the
ontology Unified Medical Language System (UMLS) 3; this new set of MeSH terms is
named B and is used to repeat the search on the PubMed site. Similarly, sets PubB, C
and PubC are built.
      </p>
      <p>Curcumin
Topic A
9856863
7602115
10363643
10557090
12807725</p>
      <p>Fibroblast
Growth Factor 2
Nerve Growth
Factors
Protein
Kinases
Membrane
Proteins
B Topics
9278996
9278995
11245094
3651539
6116594
11481236</p>
      <p>Retina
Spinal
Cord
Testis
Pituitary
Gland
Sciatic
Nerve</p>
      <p>C Topics</p>
      <p>The Srinivasan’s algorithm considerably reduces the space of intermediate results
while identifying novel relationships; however, it still requires human intervention to
create the intermediate datasets as well as to rank the terms that may not conduce to
potential novel discoveries. We propose a sampling-based ranking technique that is able
to estimate which are the nodes that will conduce to novel discoveries, and thus, reduce
the discovery evaluation time. We illustrate the usage of this technique in the context of
Literature-based Discovery. However, we hypothesize that this technique can be used
to efficiently discover associations between the data published in the Cloud of Linked
Data.</p>
      <sec id="sec-3-1">
        <title>3 http://www.nlm.nih.gov/research/umls/</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A Ranking-based Solution to Discover Semantic Associations</title>
      <p>
        We propose ranking-based solutions to the problem of the semantic association
discovery. The proposed techniques take advantage of existing links between data published
on the Cloud of Linked Data, or make use of annotations with controlled vocabularies
such as MeSH, GO, PO, etc. We present an exact solution, and an approximate
technique; both methods have been implemented in BioNav [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
4.1
      </p>
      <sec id="sec-4-1">
        <title>An Exact Ranking Technique</title>
        <p>
          The exact ranking technique extends existing authority-flow based metrics like
PageRank, ObjectRank or any of their extensions [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This ranking approach assumes that
the linked data comprise a layered graph, named layered Discovery Graph, where nodes
represent published data and edges correspond to hyperlinks.
        </p>
        <p>Formally, a layered Discovery Graph, lgDG=(Vlg, Elg) is a layered directed acyclic
graph, comprised of k layers, L1, . . . , Lk. Layers are composed of data entries which
point to data entries in the next layer of the graph. Data entries in the k-th layer or last
layer of the graph, are called target objects. Authority-flow based metrics are used to
rank the target objects, and we use these scores to identify relevant associations between
objects in the first layer and target objects.</p>
        <p>Figure 2 illustrates an example of a layered Discovery Graph that models the Open
Discovery Graph in Figure 1. In this example, odd layers are composed of MeSH terms
while even layers are sets of publications. Also, an edge from a term b to a publication
p indicates that p is retrieved by the PubMed search engine when b is the search term.
Finally, an edge from a publication p to a term b represents that p is annotated with b.
Each edge e = (b, p)(resp., e = (p, b)) between the layers li and li+1 is annotated with
the T F/IDF score; this value either represents how relevant is a term b in the collection
of documents in li+1, or a document relevance regarding to a set of terms. The path of
thick edges connects Topic A with C3; the value 0.729 corresponds to the authority-flow
score and represents the relevance of the association between Topic A and C3.
1
1
1
Topic1A
p1
p2
p3
p4</p>
        <p>Pub_A
Documents
in PubMed
0.1
where, M is a transition matrix and Rini is a vector with the scores of the objects in
the first layer of the graph. An entry M[u, v] in the transition matrix M, where u and
v are two data objects in lgDG, corresponds to α(u, v) or is 0.0. The value α(u, v) is
calculated according to the metric used to compute the ranking score of the data.</p>
        <p>M[u, v] = ( 0α.(0u, v) iofth(ue,rwv)is∈e.Elg,</p>
        <p>
          For instance, the layered graph Weighted Path Count (lgWP) is an extension of
ObjectRank and Path Count and the value of α(u, v) corresponds to the T F/IDF score
that denotes how relevant is the object u with respect to the object v. Nodes with high
lgWP scores are linked by many nodes or linked by highly scored nodes; for example,
in Figure 2, C3 is pointed by relevant nodes. In the context of LBD, we use this metric
to discover novel associations between a topic A and MeSH terms in the last layer of
the lgDG, and we have been able to discover the associations identified by Srinivasan
et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>A Sampling-based Ranking Solution</title>
        <p>Although the ranking induced by an authority-flow based metric is able to distinguish
relevant associations, the computation of this ranking may be costly. Thus, to speed
up this task, we propose a sampling-based technique that traverses only nodes in the
layered graph that may conduce to highly ranked target objects.</p>
        <p>Given a layered Discovery Graph lgDG = (Vlg, Elg), the computation of highly
ranked target objects is reduced to estimating a subgraph lgDG0 of lgDG, so that with
high confidence (at least δ), the relative error of the distance between the approximate
highly ranked target objects in lgDG0 and the exact highly ranked target objects, is at
least .</p>
        <p>
          A set S S ={lgDG1, ..., lgDGm} of independent and identically distributed (i.i.d.)
subgraphs of lgDG is generated. Then, lgDG0 is computed as the union of the m subgraphs.
Each subgraph lgDGi is generated using a graph-sampling technique. This sampling
approach is based on a Direct Sampling method for a Bayesian network [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. This
network represents all the navigational information encoded in lgDG and in the transition
matrix M of the authority-flow metric. The Direct Sampling technique generates events
from a Bayesian network [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>A Bayesian network BN = (V B, EB) for a layered Discovery Graph lgDG, is built
as follows:
– BN and lgDG are homomorphically equivalent, i.e., there is a mapping f : V B →</p>
        <p>Vlg, such that, ( f (u), f (v)) ∈ Elg iff (u, v) ∈ EB.
– Nodes in V B correspond to discrete random variables that represent if a node is
visited or not during the discovery process, i.e., V B = {X | X takes the value 1
(true) if the node X is visited and 0 (false), otherwise}.
– Each node X in V B has a conditional probability distribution:
n
Pr(X | Parents(X)) = X α( f (Y j), f (X))
j=1
where, Y j is the value of the random variable that represents the j-th parent of the
node X in the previous layer of the Bayesian network and n corresponds to the number
of parents of X. The value α( f (Y j), f (X)) represents the weight or score of the edge
( f (Y j), f (X)) in the layered Discovery Graph and corresponds to an entry in the
transition matrix M; it is seen as the probability to move from Y j to X in the Bayesian
network. Furthermore, the conditional probability distribution of a node X represents
the collective probability that X is visited by a random surfer starting from the objects
in the first layer of the layered Discovery Graph. Finally, the probability of the nodes in
the first layer of the Bayesian network corresponds to a score that indicates the relevance
of these objects with respect to the discovery process; these values are represented in
the Rini vector of the ranking metric.</p>
        <p>Given a Bayesian network generated from the layered Discovery Graph lgDG, the
Direct Sampling generates each subgraph lgDGi. Direct Sampling selects nodes in
lgDGi by sampling the variables from the Bayesian network based on the conditional
probability of each random variable or node. Algorithm 1 describes the Direct
Sampling algorithm.</p>
        <sec id="sec-4-2-1">
          <title>Algorithm 1 The Direct Sampling Algorithm</title>
          <p>Input: BN = (V B, EB) A Bayesian network f or a layered discovery graph
Output: A subgraph lgDGi</p>
          <p>T P ⇐ topologicalOrder(BN);
for X ∈ T P do</p>
          <p>Pr(X | Parents(X)) ⇐ Pnj=1 α( f (Y j), f (X));
if (randomNumber &gt;= Pr(X | Parents(X))) then</p>
          <p>Xi ⇐ 1;
else</p>
          <p>Xi ⇐ 0;
end if
end for</p>
          <p>Variables are sampled in turn following a topological order starting from the
variables in the first layer of the Bayesian network; this process is repeated until variables
in the last layer are reached. The values assigned to the parents of a variable define the
probability distribution from which the variable is sampled. The conditional probability
of each node in the last layer of lgDGi corresponds to the approximate value of the
implemented metric.</p>
          <p>Once an iteration i of the Direct Sampling is finalized, the sampled layered
Discovery Graph lgDGi = (Vi, Ei) is created. Nodes in Vi correspond to the variables sampled
during the Direct Sampling process that are connected to a visited variable in the last
layer of the Bayesian network. Additionally, for each edge (u, v) in the Bayesian
network that connects nodes f (u) and f (v) in Vi, an edge ( f (u), f (v)) is added to Ei. The
conditional probabilities of the target objects of each subgraph lgDGi correspond to the
approximate values of the ranking metric. After all the subgraphs lgDG1, . . . , lgDGm are
computed, an estimate lgDG0 is obtained as the union of these m subgraphs. The
approximation of the ranking metric in the graph lgDG0 is computed as the average of the
approximate ranking metric values of target objects in the subgraphs lgDG1, . . . , lgDGm.
A bound of the number of iterations or sampled subgraphs is defined in terms of the
Chernoff-Hoe ffdings bound.</p>
          <p>Theorem: Let lgDG be an exact layered Discovery Graph and lgDGi be one of
the m sampled subgraphs. Let T be a list of the target objects in lgDG ranked with
respect to exact values of the ranking metric RM. Let Ti be a list of the target objects
in lgDGi ranked with respect to the approximation of RM. Let J(lgDG1, lgDG, β),. . . ,
J(lgDGm, lgDG, β) be independent identically distributed (i.i.d.) random variables with
values in the set {0,1}. Each random variable J(lgDGi, lgDG, β) has value=1 if a
distance metric value between the ranking list Ti and the list T is at least β; otherwise,
value=0. Let S denote the average of these variables, i.e., X = m1 Pm
i=1 J(lgDGi, lgDG, β)
and E(S ) the expectation of S . Then, the size m of the sample has to satisfy the
following formula to ensure that the relative error of E(S ) is greater than
probability:
with some</p>
          <p>P(|S − E(S )| ≥ ) ≤ 2 exp(−2m 2) .
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>
        In this section we show the quality of our proposed discovery techniques. First, we
compare the results obtained by our ranking technique with respect to the results
obtained by the Manjal system [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Then, we show the behavior of this technique in
the DBLP dataset. Experiments were executed on a Sun Fire V440 equipped with two
UltraSPARC IIIi processors running at 1.593 GHZ with 16 GB RAM. The ranking and
sampling techniques were implemented in Java 1.6.1.
      </p>
      <p>To conduct the first experiment, we have created a catalog populated with the PubMed
publications from the NCBI source4, all the MeSH terms, and all the links between
Mesh terms and PubMed publications. We stored the downloaded data in two tables,
Pub-MeSH and MeSH-Pub . Table Pub-MeSH relates a publication p with all the MeSH
terms that correspond to annotations of p in PubMed; these annotations are manually
done by experts at the National Library of Medicine site. Table MeSH-Pub relates a
MeSH term m with all publications that are retrieved when the term m is used to search
on PubMed. Both tables have an attribute score that represents the relevance of the
relationships represented in the table. Suppose there is a tuple (p, m, s) in table Pub-MeSH ,
then the score s = A × T × C, where:
– A: is the augmented document frequency of the publication p, i.e., A = 0.5 + 0.5 +
t f , where, t f is the frequency of p in table Pub-MeSH , and t fmax is the maximum
t fmax
document frequency of any publication in Pub-MeSH .
– T: inverse term frequency log2( NNp ), where N is the number of collected MeSH
terms, i.e., 20,652, and Np corresponds to the number of MeSH terms associated
with the publication p in the table Pub-MeSH .
– C: is a cosine normalization factor.</p>
      <p>
        Similarly, scores in table MeSH-Pub were computed. To reproduce the results
reported by Srinivasan et al. in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], we ran the metric lgWP on a layered Discovery
Graph lgDG comprised of 5 layers, 3,107,901 nodes and 10,261,791 edges. Sets PubA,
B, PubB and C and were built following the criteria proposed by Srinivasan et al., and
by selecting data from tables Pub-MeSH and MeSH-Pub . We ranked the target objects
in the graph, and we could observe that our ranking technique was able to produce 4 of
the top-5 semantic associations identified by Srinivasan et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Table 1 compares
the top-5 target objects discovered by [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and the ones discovered by our ranking
technique, i.e., our ranking technique exhibits a precision and recall of 80%.
      </p>
      <p>
        We have also studied the benefits of performing the graph-sampling technique, and
we ran the sampling process for 5 iterations, i.e., 5 sampled subgraphs were computed.
Table 2 reports on the top-10 MeSH terms identified by graph-sampling. We can
observe that 4 of the top-5 MeSH terms identified by the Srinivasan’s algorithm [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
4 http://www.ncbi.nlm.nih.gov/
      </p>
      <p>lgWP
1
2
3
4
5</p>
      <p>Retina
Spinal Cord</p>
      <p>Testis
Pituitary Gland
Sciatic Nerve</p>
      <p>Testis</p>
      <p>Retina
Spinal Cord</p>
      <p>Obesity</p>
      <p>Pituitary Gland
are also identified. We note that iterations do not improve the quality of the discovery
process.</p>
      <p>k
i=1
i=2
i=3
i=4
i=5
1 Spinal Cord Spinal Cord Spinal Cord Spinal Cord Spinal Cord
2 Pituitary Gland Pituitary Gland Pituitary Gland Pituitary Gland Pituitary Gland
3 Celiac Disease Celiac Disease Celiac Disease Celiac Disease Disease
4 Hepatic Encepha. Hepatic Encepha. Hepatic Encepha. Hepatic Encepha. Hepatic Encepha.
5 Uremia Uremia Uremia Uremia Uremia
6 Retina Anemia Anemia Anemia Anemia
7 Obesity Retina Retina Retina Retina
8 Testis Obesity Phenylketonurias Phenylketonurias Phenylketonurias
9 Hypothalamus Testis Obesity Obesity Obesity
10 Osteoporosis Hypothalamus Testis Testis Testis</p>
      <p>
        Finally, we report on the number of target MeSH terms produced by the Srinivasan’s
algorithm and the ones produced during each iteration of graph-sampling (Table 3). We
can observe that graph-sampling is able to discover 80% of the top novel MeSH terms,
while the number of target terms is reduced by up to one order of magnitude.
# Srinivasan’s target MeSH Terms [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] i=1 i=2 i=3 i=4 i=5
570
      </p>
      <p>24 38 49 61 71</p>
      <p>In the second experiment, we downloaded the DBLP file in a relational database.
We ran the graph-sampling technique to discover associations between a given author
and the most relevant conferences where this author has published at least one paper.
We ran 3 sets of 30 queries and compared the ranking produced by the exact solution
and the one produced by graph-sampling; layered Discovery Graphs were comprised
of 5 layers and at most 876,110 nodes and 4,166,626 edges. Author’s names with high,
medium and low selectivity were considered, where high selectivity means that the
author has few publications while low selectivity represents that the author is very
productive. The top-5 conferences associated with each author were computed by using the
exact ranking and the approximation produced by graph-sampling during 6 iterations.
Table 4 reports the average precision of the approximate top-5 conferences with respect
to the exact top-5. We can observe that graph-sampling is able to identify almost 65% of
the top-5 conferences after iteration 3. The time required to execute the graph-sampling
technique was reduced at least by half. These results suggest that the proposed
discovery techniques provide an effective and efficient solution to the problem of identifying
associations between terms.
In this paper we have presented a sampling-based technique that supports the discovery
of semantic associations between linked data. We have reported the results of an
empirical study where we have observed that our proposed techniques are able to efficiently
reproduce the behavior of existing LBD techniques. This observed property of our
discovery technique may be particularly important in the context of large datasets as the
ones published in the Cloud of Linked Data. In the future we plan to extend this study
to identify potential associations between other sources of the Cloud of Linked Data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Disease</given-names>
            <surname>Ontology</surname>
          </string-name>
          . http://diseaseontology.sourceforge.net.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>EHR</given-names>
            <surname>Ontology</surname>
          </string-name>
          . http://trajano.us.es./ isabel/EHR/EHRRM.owl.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Fogaras</surname>
          </string-name>
          , B. Ra´cz, K. Csaloga´ny, and
          <string-name>
            <surname>T.</surname>
          </string-name>
          <article-title>Sarlo´s. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments</article-title>
          .
          <source>Internet Mathematics</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Diehl</surname>
          </string-name>
          .
          <article-title>Introduction to the special issue on link mining</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>7</volume>
          (
          <issue>2</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. The Gene Ontology. http://www.geneontology.org/.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>K.-I. Goh</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          <string-name>
            <surname>Cusick</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Valle</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Childs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Barabasi</surname>
          </string-name>
          .
          <article-title>The human disease network</article-title>
          .
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          ,
          <volume>104</volume>
          :
          <fpage>8685</fpage>
          -
          <lpage>8690</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C.</given-names>
            <surname>Halaschek-Wiener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Aleman-Meza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. B.</given-names>
            <surname>Arpinar</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          .
          <article-title>Discovering and ranking semantic associations over a large rdf metabase</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>1317</fpage>
          -
          <lpage>1320</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          . Mining, indexing, and
          <article-title>similarity search in graphs and complex structures</article-title>
          .
          <source>In ICDE, page 106</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kementsietsidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Linkedct: A linked data space for clinical trials</article-title>
          .
          <source>In Proceedings of the WWW2009 workshop on Linked Data on the Web (LDOW2009)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. H.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>Y. H.</given-names>
          </string-name>
          0003, J. Han, and
          <string-name>
            <given-names>X. J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Mining coherent dense subgraphs across massive biological networks for functional discovery</article-title>
          .
          <source>In ISMB (Supplement of Bioinformatics)</source>
          , pages
          <fpage>213</fpage>
          -
          <lpage>221</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. G. Jeh and
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Simrank: a measure of structural-context similarity</article-title>
          .
          <source>In KDD</source>
          , pages
          <fpage>538</fpage>
          -
          <lpage>543</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuramochi</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Finding frequent patterns in a large sparse graph*</article-title>
          .
          <source>Data Min. Knowl. Discov.</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>243</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ling</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>A supplement to sampling-based methods for query size estimation in a database system</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          ):
          <fpage>12</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>R.</given-names>
            <surname>Lipton</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Naughton</surname>
          </string-name>
          .
          <article-title>Query size estimation by adaptive sampling (extended abstract)</article-title>
          .
          <source>In PODS '90: Proceedings of the ninth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>46</lpage>
          . New York, NY, USA: ACM Press,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Medical Subject</surname>
          </string-name>
          <article-title>Heading (MeSH)</article-title>
          . http://www.nlm.nih/gov/mesh.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>O. C.</surname>
          </string-name>
          <article-title>Organization. GALEN common reference model</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. L.
          <string-name>
            <surname>Raschid</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Tsaparas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Sehgal</surname>
          </string-name>
          .
          <article-title>Ranking target objects of navigational queries</article-title>
          .
          <source>In WIDM</source>
          , pages
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. E.
          <string-name>
            <surname>Ruckhaus</surname>
            , E. Ruiz, and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          .
          <article-title>Query optimization in the semantic web</article-title>
          .
          <source>In Theory and Practice of Logic Programming</source>
          .
          <source>Special issue on Logic Programming and the Web</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig. Artificial Intelligence</surname>
          </string-name>
          :
          <string-name>
            <given-names>A Modern</given-names>
            <surname>Approach. Second Edition</surname>
          </string-name>
          . Princeton Hall,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <article-title>An Overview to RxNorm</article-title>
          . http://www.nlm.nih.gov/research/umls/rxnorm/.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. P. Srinivasan, b. Libbus,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Mining medline: Postulating a beneficial role for curcumin longa in retinal diseases</article-title>
          . In L. Hirschman and J. Pustejovsky, editors,
          <source>LT-NAACL 2004 Workshop: BioLINK</source>
          <year>2004</year>
          , Linking Biological Literature,
          <source>Ontologies and Databases</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>D.</given-names>
            <surname>Swanson</surname>
          </string-name>
          .
          <article-title>Migraine and magnesium: Eleven neglected connections</article-title>
          .
          <source>In Perspective in Biology and Medicine</source>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>M.-E. Vidal</surname>
            , E. Ruckhaus, and
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Marquez</surname>
          </string-name>
          .
          <article-title>BioNav: A System to Discover Semantic Web Associations in the Life Sciences</article-title>
          .
          <source>In ESWC 09-Poster Session</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24. voiD Guide -
          <article-title>Using the Vocabulary of Interlinked Datasets</article-title>
          . http: //rdfs.org/ns/void-guide.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>J. Volz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Gaedke</surname>
            , and
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kobilarov</surname>
          </string-name>
          .
          <article-title>Discovering and maintaining links on the web of data</article-title>
          .
          <source>In International Semantic Web Conference (ISWC)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Wishart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassanali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stothard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <surname>J. Woolsey.</surname>
          </string-name>
          <article-title>DrugBank: a comprehensive resource for in silico drug discovery and exploration</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>34</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>