<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analyzing Semantic Interoperability in ? Bioinformatic Database Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philippe Cudre-Mauroux</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julien Gaugaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriana Budura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karl Aberer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer and Communication Sciences EPFL</institution>
        </aff>
      </contrib-group>
      <fpage>82</fpage>
      <lpage>91</lpage>
      <abstract>
        <p>We consider the problem of analytically evaluating semantic interoperability in large-scale networks of schemas interconnected through pairwise schema mappings. Our heuristics are based on a graphtheoretic framework capturing important statistical properties of the graphs. We validate our heuristics on a real collection of interconnected bioinformatic databases registered with the Sequence Retrieval System (SRS). Furthermore, we derive and provide experimental evaluations of query propagation on weighted semantic networks, where weights model the quality of the various schema mappings in the network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Introduction</p>
      <p>The Sequence Retrieval System (SRS)
SRS, short for \Sequence Retrieval System", is a commercial information
indexing and retrieval system designed for bioinformatic libraries such as the EMBL
nucleotide sequence databank, the SwissProt protein sequence databank or the
Prosite library of protein subsequence consensus patterns. It is a distributed,
interoperable data management system which was initially developed at the
European Molecular Biology Laboratory in Heidelberg, and which allows the
querying of one or several databases simultaneously, regardless of their format
or schemas.</p>
      <p>Administrators wishing to register new databases with SRS rst have to
de ne the schema they have adopted to store data, using a custom language
called Icarus. Once their schemas have been de ned, administrators can import
schema instances (i.e., text les) whose data will be correctly parsed, indexed and
processed thanks to the corresponding schema de nitions. Additionally,
administrators can manually de ne relationships between their database schema and
similar schemas. In SRS, these relationships are represented as links relating one
entry of a database schema to one entry of another schema. Thanks to this
structure of links between databases, users can propagate queries they pose locally
against one speci c schema to other schemas available in the system (for
technical details, we refer the interested reader to http://www.lionbioscience.com/ )
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Graph analysis of an SRS repository</title>
      <p>Conceptually, the model described above is very close to what one could expect
from a subgraph of the semantic web itself, i.e., a collection of related schemas
(or ontologies) linked one to another through pairwise mappings. The graph
which can be extracted from a SRS repository has two main advantages over
those which can be built from current RDFS / OWL repositories: i) it is based
on a real-world collection of schemas which are being used on a daily basis by
numerous independent parties and ii) it is of a reasonable size (several hundreds
of semantically related complex schemas). Thus, after having been rather
unsuccessful at nding reasonable semantic networks from the Semantic Web itself,
we decided to build a specialized crawler to analyze the semantic graph of an
SRS repository and to test our heuristics on the resulting network.</p>
      <p>We chose to analyze the semantic network from the European Bioinformatics
Institute SRS repository, publicly available at http://srs.ebi.ac.uk/. We built a
custom crawler which traverses the entire network of databases and extracts
schema mapping links stored in the schema de nition les. The discussion below
is based on the state of the SRS repository as of May 2005.</p>
      <p>The graph resulting from our crawling process is an undirected graph of
388 nodes (database schemas) and 518 edges (pairwise schema mappings). We
chose to represent links as undirected edges since they are used in both
directions by SRS (they basically represent cross-references between two entries of
two database schemas). We identi ed all connected components in the graph
(two nodes are in the same connected component if there is a path from one to
the other following edges). The analysis revealed one giant connected
component (i.e., a relatively large set of interconnected schemas) of 187 nodes, which
represent roughly half of the nodes and 498 edges. Besides the giant connected
component, the graph also has two smaller components, each consisting of two
vertices. The rest of the nodes are isolated, representing mostly result databases
or databases for which no link to other databases was de ned.</p>
      <p>The average degree of the nodes is 2.2 for the whole graph and 4.6 for the
giant component. Real networks di er from random graphs in that often their
degree distribution follows a power law, or has a power law tail, while random
graphs have a Poisson distribution of degrees [2]. Unsurprisingly, our semantic
network is no exception as can be seen in Figure 1 below, which depicts an
accurate approximation of the degree distribution of our network by a
powerlaw distribution y(x) = x with = 0:21 and = 1:51.</p>
      <p>70
60
50
se 40
d
o
N
f#o 30
20
10
0 0</p>
      <p>Measured degree distribution</p>
      <p>Approximated degree distribution
10
20</p>
      <p>30
Node Degree
40
50</p>
      <p>Another interesting property which we explored was the tendency of the
schemas to form clusters, quanti ed by the average clustering coe cient.
Intuitively, the clustering coe cient of a vertex measures the degree to which its
neighbors are neighbors of each other. More precisely, the clustering coe cient
indicates the ratio of existing edges connecting a node's neighbors to each other
to the maximum possible number of such edges. The network we considered has
a high average clustering coe cient of 0.32 for the whole graph and of 0.54 for
the giant component. The diameter (maximum shortest paths between any two
vertices) of the giant connected component is 9. These data indicate that our
network can be characterized as scale-free (power-law distribution of degrees) or
small-world (small diameter, high clustering coe cient).</p>
      <p>Analyzing semantic interoperability in the large
In [5], we introduced a graph-theoretic framework for analyzing semantic
interoperability in large-scale networks. As described above, we model database
schemas (or ontologies) as nodes, interconnected by edges (schema mappings).
Schema mappings are used to iteratively propagate queries posed against a local
schema to other related schemas (see [1] for how this can be implemented in
practice). Note that links can be directed or undirected, weighted or non-weighted
depending on the schema mappings being used.</p>
      <p>In such a network, the density of mappings is important in order to propagate
a local query from one database (schema) to the other databases. A query can
only be propagated to all databases if the semantic network is connected, that is if
there exists a path from one schema to any other schema following schema
mapping links. If some schemas are isolated, queries cannot be propagated to/from
the rest of the graph, thus making it impossible to have a semantically
interoperable network of databases. This observation motivated us to take advantage of
percolation theory to determine when a semantic network could be connected or
not. Our framework for analyzing semantic interoperability takes advantage of
generatingfunctionologic [7] functions for the degree distribution of the semantic
graph:
where pk represents the probability that a randomly chosen vertex has degree k.
We showed (by extending results from [6]) that a network cannot be semantically
interoperable in the large unless the connectivity indicator ci = Pk k(k 2 cc)pk
is greater than zero, with cc representing the clustering coe cient. Also, we
provided heuristics for estimating the relative size S of the biggest semantically
interoperable cluster of schemas by solving
where u is the smallest non-negative real solution of</p>
      <p>S = 1</p>
      <p>G0(u);
u = G1(u)
and G1(u), the distribution of outgoing edges from
bors, is
rst to second-order
neighWe applied our heuristics to the SRS graph we obtained from the crawling
process. The results are as follows: we get a connectivity indicator ci of 25.4,
indicating that the semantic network is clearly in a super-critical state and that
a giant component interlinking most of the databases has appeared. The size
of this giant component as estimated by our heuristics (see above) is of 0:47,
meaning that 47% of the schemas are part of the giant connected component.
This is surprisingly close to the real value of 0:48 as observed in the graph.
3.2</p>
    </sec>
    <sec id="sec-3">
      <title>Generating a Graph with a given Power-Law Degree</title>
    </sec>
    <sec id="sec-4">
      <title>Distribution</title>
      <p>Going slightly further, we want to analyze the dynamics of semantic graphs
with varying numbers of edges. Our aim is to generate graphs with the same
statistical properties as the SRS graphs, that is, graphs following a power-law
degree distribution:</p>
      <p>
        P (k) =
k
but with a varying number of edges. We take from [3] a graph-building algorithm
yielding a power-law degree distribution with a given exponent . It goes as
following: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) create a (large) number N of vertices. (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) to each vertex i, assign
an \importance" xi, which is a real number taken from a distribution (x). (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
for each pair of vertices, draw a link with probability f (xi; xj ), which is function
of the importance of both vertices.
      </p>
      <p>Now if f (xi; xj ) = (xixj =x2M ) (where xM is the largest value of x in the
graph), we know from [3] that the degree distribution of a graph will be
P (k) =</p>
      <p>2
xM
N hxi</p>
      <p>2
xM k</p>
      <p>N hxi
(x) =
(m1
de ned over the interval [m; Q]. However, we still have to nd values for m and Q
such that the scale of the resulting degree distribution equals . Using equation
7, we nd the expected importance value as
hxi =
(
(</p>
      <p>2
1)(m Q
2)(mQ
m Q2)
m Q)
:
Replacing (x) in equation 6, the degree distribution of the resulting graph
becomes</p>
      <p>P (k) =</p>
      <p>
        2
xM
N hxi (m1
such that, equating with equation 5, we get
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
(
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
(
        <xref ref-type="bibr" rid="ref7">7</xref>
        )
(8)
(9)
= Nxh2Mxi (m1 1Q1 ) Nxh2Mxi : (10)
      </p>
      <p>We can then arbitrarily choose m &gt; 0 and nd Q by numerical
approximation, since the right-hand side of equation 10 is de ned and continue for values
of Q &gt; m.</p>
      <p>Figures 2 and 3 show the results of our heuristics on networks of respectively
388 (i.e., mimicking the original SRS graph) and 3880 edges (i.e., 10 times bigger
than the original SRS graph but with the same statistical properties) constructed
in the way presented above with a varying number of edges. The curves are
averaged over 50 consecutive runs. As for the original SRS network, we see that
we can accurately predict the size of the giant semantic component, even for
very dense graphs.</p>
      <p>Estimated Maximum Relative Component Size
Measured Maximum Relative Component Size
g (Approximated Power−Law Exponent)
a (Approximated Power−Law Scale)</p>
      <p>Clustering Coefficient
2084.3 563.7 828.1 1084.7 1338.9 1583 1823.4 2058.1 2289.92512.4</p>
      <p>Mean Number of Edge</p>
      <p>Connectivity Indicator and Giant Component Size in
Weighted Graphs
So far, we only analyzed the presence and size of a giant connected component
in order to determine which portion of a semantic network could potentially be
semantically interoperable. In large-scale decentralized networks, however, one
should not only look into the giant semantic component itself, but also analyze
the quality of the mappings used to propagate queries from one schema to the
other (see [1] for a discussion on that topic). Indeed, in any large, decentralized
network, it is very unlikely that all schema mappings could correctly map queries
from one schema to the other, because of the lack of expressivity of the mapping
languages, and of the fact that some (most?) of the mappings might be generated
automatically.</p>
      <p>Thus, as considered by more and more semantic query routing algorithms,
we introduce weights for the schema mappings to capture the quality of a given
mapping. Weights range from zero (indicating a really poor mapping unable to
semantically keep any information while translating the query) to one (for
perfect mappings, keeping the semantics of the query intact from one schema to the
other). We then iteratively forward a query posed against a speci c schema to
other schemas through schema mappings if and only if a given mapping has a
weight (i.e., quality) greater than a prede ned threshold . = 0 corresponds
to sending the query through any schema mapping, irrespective of its quality.
On the contrary, when we set to one, the query gets only propagated to
semantically perfect mappings, while even slightly faulty mappings are ignored.
Previous works in statistical physics and graph theory have looked into
percolation for weighted graphs. We present hereafter an extension of our heuristics for
weighted semantic networks inspired by [4].
4.1</p>
    </sec>
    <sec id="sec-5">
      <title>Connectivity Indicator</title>
      <p>As before, we consider a generating function for the degree distribution
where pk is the probability that a randomly chosen vertex has degree k in the
network. We then de ne tjk as the probability that an edge has a weight above
given that it binds vertices of degree j and k. Thus, wk = Pj1=0 tjk is the
probability that an edge transmits, given that it is attached to a vertex of degree
k. The generating function for the probability that a vertex we arrive at by
following a randomly chosen edge is of degree k and transmits is
where cc is the clustering coe cient. Now, from [4], we know that the generating
function for the probability that one end of a randomly chosen edge on the graph</p>
      <p>Similarly, the generating function for the probability that a randomly chosen
vertex belongs to a percolation cluster of a given number of vertices is
such that the mean component size corresponding to a randomly chosen vertex
is
leads to a percolation cluster of a given number of vertices is</p>
      <p>
        1. However,
such that a giant connected component appears if
hsi = H 00(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) = G0(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) +
      </p>
      <p>
        G00(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )G1(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
1 G01(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
G01(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) =
      </p>
      <p>1
Pk=0 wkpkk(k</p>
      <p>1
Pk=0 kpk
1</p>
      <p>cc)
1
ci = X kpk [wk(k
k=0
1
cc)
0
4.2</p>
    </sec>
    <sec id="sec-6">
      <title>Giant Component Size</title>
      <p>
        As seen above, H0(x) represents the distribution for the cluster size which a
randomly chosen vertex belongs to, excluding the giant component. Thus,
according to [4], H0(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) is equal to the fraction of the nodes which are not in the
giant component. The fraction of the nodes which are in the giant component is
hence S = 1 H0(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ). Using equation 14 we can write
      </p>
      <p>S = 1</p>
      <p>
        H0(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) = G0(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
      </p>
      <p>
        G0 [H1(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )] :
with H1(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) = 1 G1(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) + xG1 [H1(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )]. Thus H1(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) = u where u is the smallest
non-negative solution of
      </p>
      <p>The relative size of the giant component reached by the query in a weighted
semantic graph follows as</p>
      <p>Figures 4 and 5 show the results of our heuristics on weighted networks
of respectively 388 and 3880 nodes, for a varying number of edges and various
values of . The curves are averaged over 50 consecutive runs, and the weights of
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
individual schema mappings are randomly generated using a uniform distribution
ranging from zero to one. We see that our heuristics can quite adequately predict
the relative size of the graph to which a given query will be forwarded. Also,
as for the unweighted analysis, we observe similar behaviors for the two graphs;
This is rather unsurprising as we are dealing with scale-free networks whose
properties are basically independent of their size.
estimated t = 0.0
measured t = 0.0
estimated t = 0.4
measured t = 0.4
estimated t = 0.8
measured t = 0.8
0278.2 547.2 806.1 1056.2 1300.7 1539.6 1771.9 2000.4 2224.4 2442.7</p>
      <p>Mean Number of Edge
In this paper, we tested graph-theoretic heuristics to evaluate semantic
interoperability on a real semantic network. The results con rm the validity of our
heuristics beyond our initial hopes as we could predict quite accurately the size
of the giant semantic component in the network. Also, we extended our analysis
to apply our heuristics on larger networks enjoying similar statistical properties
and on weighted semantic networks. It was for us quite important to test our
heuristics using real-world data, as semantic network analyses mostly consider
arti cial networks today, due to the current lack of semantically enriched
websites or deployed semantic infrastructures. In the future, we plan to extend our
analyses to take into account the dynamicity (churn) of the network of schema
mappings, and to consider more accurate query forwarding schemes based on
transitive closures of mapping operations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudre-Mauroux</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauswirth</surname>
          </string-name>
          .
          <article-title>The Chatty Web: Emergent Semantics Through Gossiping</article-title>
          .
          <source>In International World Wide Web Conference (WWW)</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Albert</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Barabasi</surname>
          </string-name>
          .
          <article-title>Statistical mechanics of complex networks</article-title>
          .
          <source>Reviews of Modern Physics</source>
          ,
          <volume>74</volume>
          (
          <issue>47</issue>
          ),
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G.</given-names>
            <surname>Caldarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Capocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D. L.</given-names>
            <surname>Rios</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Muoz</surname>
          </string-name>
          .
          <article-title>Scale-free networks from varying vertex intrinsic tness</article-title>
          .
          <source>Phys. Rev. Lett.</source>
          ,
          <volume>89</volume>
          ,
          <fpage>258702</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Callaway</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Newmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Strogatz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Watts</surname>
          </string-name>
          .
          <article-title>Network robustness and fragility: Percolation on random graphs</article-title>
          .
          <source>Phys. Rev. Lett.</source>
          ,
          <volume>85</volume>
          ,
          <fpage>54685471</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudre-Mauroux</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          .
          <article-title>A Necessary Condition For Semantic Interoperability In The Large</article-title>
          . In International Conference Ontologies, DataBases, and Applications of Semantics (ODBASE),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M. E. J.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Strogatz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Watts</surname>
          </string-name>
          .
          <article-title>Random graphs with arbitrary degree distributions and their applications</article-title>
          .
          <source>Phys. Rev</source>
          .,
          <source>E64(026118)</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Wilf</surname>
          </string-name>
          .
          <source>Generatingfunctionology. 2nd Edition</source>
          , Academic Press, London,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>