<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph Summarization in Annotated Data Using Probabilistic Soft Logic</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alex Memory</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelika Kimmig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen H. Bach</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louiqa Raschid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lise Getoor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Maryland</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Graph summaries, which provide a higher-level, abstracted view on the data, can help alleviate this problem, but new methods for graph summarization are needed that handle uncertainty present within and across these sources. Here, we propose the use of probabilistic soft logic (PSL) [1] as a general framework for reasoning about annotation graphs, similarities, and the possibly confounding evidence arising from these. We show preliminary results using two simple graph summarization heuristics in PSL for a plant biology domain.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Linked Data initiative and Semantic Web technologies have been very
successful in providing access to a diversity of data collections. Of particular interest
are annotation graphs, where scientific concepts are tagged with controlled-vocabulary
terms from ontologies or thesauri. As these collections grow, tools and
techniques to analyze, explore, and inspect such data become ever more
important. In this paper we consider the problem of mining richly curated
annotation graphs, in conjunction with the wealth of semantic knowledge encoded
within ontologies, to create graph summaries. Graph summaries group entities
and relations based on similarity as well as local graph structure, thus creating
a graph at a higher level of abstraction that can be easier to analyze. This can
help the scientist to understand the underlying evidence, to find patterns, and
to make predictions.</p>
      <p>Linked Data can provide multiple rich and possibly confounding sources of
evidence about concepts. As a motivating example, we consider an annotation
graph from the domain of plant biology. The nodes in this graph are genes from
the model organism Arabidopsis thaliana (these are the concepts) as well as
terms from both the Gene Ontology (GO) and the Plant Ontology (PO) (these
are the annotations). Edges represent annotations of genes with such terms.
Other sources of information of interest include sequence-based similarity
between pairs of genes, co-occurrence frequencies of pairs of GO terms, taxonomic
distances between pairs of PO or pairs of GO terms, etc. This evidence may
be confounding; for example, genes can have high sequence-based similarity to
other genes in their same family. However, more useful evidence may be that
they share high GO functional similarity with genes in unrelated families (with
or without high sequence similarity).</p>
      <p>We propose the use of probabilistic soft logic (PSL) [1] as a general framework
for reasoning about annotation graphs, similarities, and the possibly
confounding evidence arising from these. PSL is a framework for collective, probabilistic
reasoning in relational domains that directly exploits available similarities. It
uses rules to capture the dependency structure of the domain, based on which it
builds a joint probabilistic model over the data. This allows us to easily encode
the annotation graph, similarity information for nodes, and a number of graph
summarization heuristics, and to explore the effect of these heuristics on the
resulting graph summaries. In this work, we show preliminary results from two
simple heuristics.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivating Example and Problem Setting</title>
      <p>We use an example annotation graph from the plant biology domain to present
our goals for graph summarization. We also use this domain for our experimental
evaluation in Section 6. The graph represents gene annotation data for the model
organism Arabidopsis thaliana, which originates in The Arabidopsis Information
Resource (TAIR, http://www.arabidopsis.org). Each gene in TAIR is annotated with terms from the Plant
Ontology (PO) and from the Gene Ontology (GO). A fragment of the resulting
annotation graph is illustrated in Figure 1, with PO terms on the left, genes in
the center, and GO terms on the right.</p>
      <p>For a scientist exploring a set of genes of interest within a biological context,
e.g., genes related to light-mediated development, finding regularities in such a
graph can provide useful information. Our goal is to facilitate this process by
providing summaries of the graph, that is, by grouping together nodes (and
edges). The grouping can exploit multiple sources of evidence, including explicit
similarity between pairs of nodes or shared annotations. For ease of illustration,
we drastically simplify the graph to the topmost part of Figure 1, shown on
the left in Figure 2. On the right, Figure 2 shows a possible graph summary,
where three pairs of nodes have been grouped into three supernodes or clusters,
and sets of edges between all pairs of nodes in adjacent clusters are represented
by single edges between clusters. However, for real-world graphs, many
clusterings are possible, and so different heuristics and combinations of heuristics
may be appropriate for different graphs. In this work, we show how two such
heuristics can be easily incorporated into a probabilistic framework, but others
are certainly possible. Future work can extend this approach by incorporating
additional heuristics and adapting heuristics to different graph-summarization
tasks.</p>
      <p>[Figure 1: annotation graph fragment, with PO terms (cauline leaf, shoot apex, root, petal, petiole, flower, hypocotyl, shoot system) on the left; genes (PHOT1, CRY2, CRY1, SHB1, PHOT2, COP1) in the center; and GO terms (vacuole, stomatal movement, response to water deprivation, phototropism, photomorphogenesis, shade avoidance, red or far-red light signaling pathway) on the right.]</p>
      <p>The first is an annotation link heuristic: we would like to cluster nodes that
share a large fraction of their neighbors in the annotation graph. For instance,
the two PO terms "cauline leaf" and "shoot apex" both annotate genes PHOT1
and CRY2 in our example, and there are no genes that are annotated with only
one of these terms. The terms are thus similar in terms of the link structure
they participate in, which supports clustering them. On the GO side, the same
argument holds for "vacuole" and "stomatal movement", but not for "response to
water deprivation", which only annotates CRY2. Clearly, the direct link structure
alone does not provide sufficient evidence to decide whether the latter term
should be added to the GO cluster or not. Choosing to include the term in the
cluster would correspond to implicitly assuming that the term should actually
annotate PHOT1 as well, and thus allow one to predict a new link, whereas
excluding it would tend more towards accepting the absence of such a link. Finally, we
observe that the two gene nodes share four out of their five neighbors, which can
still be viewed as a relatively strong indication to cluster them.</p>
      <p>Next, we consider explicit similarities between pairs of nodes. Such additional
information could help decide whether the third GO term should be included in
the cluster. For this we use the sequence-based similarity between pairs of genes
and information-retrieval-based metrics between pairs of annotation terms. For
instance, amongst the extensive statistics published by the GO Consortium, the
annotation co-occurrence of pairs of GO terms has significant biological
meaning and is a good predictor of new function. For the GO term "response to
water deprivation", "stomatal movement" is the 5th highest co-occurring term;
for the reverse case the rank is 11. Incorporating a similarity measure between
GO terms into the graph summarization process might thus provide additional
evidence in favor of clustering all three terms in the example and further
predicting the "response to water deprivation" annotation on PHOT1. This would
help the biologist understand the new functional annotation of PHOT1 and also
understand that there is a possible interaction between CRY2 and PHOT1.</p>
      <p>To be able to exploit this similarity information, we introduce a similarity
heuristic: we prefer to cluster nodes that are similar according to some available
similarity measure. Recall, however, that this may also introduce conflicting
evidence. For instance, the two genes in our example belong to different groups of
blue light receptors and are therefore dissimilar in terms of sequence similarity,
but similar in terms of their annotations in the graph.</p>
      <p>We integrate the multiple types of evidence from the annotation links, the
various similarity metrics, and the two graph summarization heuristics within a
probabilistic model using PSL. We discuss this model in more detail in Section 5,
after an introduction to PSL in Section 4.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>Graph summarization as broadly considered in this paper is a form of
multi-relational clustering that exploits attributes of the nodes or objects to be
clustered, as well as additional relational features or properties in which these nodes
participate [2-4]. Multi-relational clustering aims at grouping nodes in
heterogeneous, multi-relational networks, i.e., networks with both multiple types of
nodes and multiple types of relationships between nodes. The clusters group
nodes based on their similarities, where the value is either given explicitly or
derived from node attributes or relations between nodes of the same or different
type(s). There is a large body of work on multi-relational clustering, and
methods include matrix factorization approaches, generative models, and other
optimization methods. Other work on graph summarization explores
summarization techniques that can be tailored to user needs and which scale to large
graphs with minimal loss [5-8]. Our proposed approach makes use of the
notion of exemplars, used in methods such as affinity propagation [9], to denote
the elements chosen as the canonical representation for nodes in each
cluster. Besides multi-relational clustering and graph summarization, there is a
broad range of other mining and analysis techniques for heterogeneous
information networks, cf. for instance [10].</p>
      <p>Probabilistic soft logic (PSL) [1] combines ideas from fuzzy logic [11] and
graphical models. Similar to Markov Logic [12], it uses first-order logic as a
template language for a graphical model. However, its use of soft truth values
turns inference from a discrete into a continuous optimization task, which can
be solved efficiently.</p>
    </sec>
    <sec id="sec-4">
      <title>Probabilistic Soft Logic</title>
      <p>Probabilistic soft logic (PSL) [1] is a framework for collective, probabilistic
reasoning in relational domains. PSL uses rules to capture the dependency structure
of the domain, based on which it builds a joint probabilistic model over all atoms.
Each rule has an associated non-negative weight that captures the rule's relative
importance. Furthermore, PSL uses soft truth values in the interval [0, 1], which
allows one to directly incorporate similarity functions into the logical model. We
refer to Broecheler et al. [1] for full technical details and instead illustrate the
key concepts in the context of the following example program:</p>
      <p>w1 : exemplar(A, B) → similar(A, B) (1)
w2 : link(A, B) ∧ exemplar(A, C) ∧ exemplar(D, C) → link(D, B) (2)
Here, for simplicity of presentation, we assume w1 = w2 = 1. Consider any
concrete nodes a, b, c, and d instantiating logical variables A, B, C, and D,
respectively. The first rule states that if a is in the cluster exemplified by b, they
should be similar (similarity heuristic), whereas the second states that if a and d
are both in the cluster exemplified by c, and a has a link to b, then d should also
have a link to b (link heuristic). While PSL shares the syntax of its rules with
first-order logic, PSL uses soft truth values from the interval [0, 1] instead of only its
extremes 0 (false) and 1 (true). Given a set of atoms ℓ = {ℓ1, …, ℓn}, we call
the mapping I : ℓ → [0, 1]^n from atoms to soft truth values an interpretation.
PSL defines a probability distribution over interpretations that makes those
satisfying more ground rule instances more probable.</p>
      <p>To determine the degree to which a ground rule is satisfied, PSL uses the
Lukasiewicz t-norm and its corresponding co-norm as the relaxations of the
logical AND and OR, respectively. These relaxations are exact at the extremes,
but provide a consistent mapping for values in between. Given an interpretation
I, the formulas for the relaxation of the logical conjunction (∧), disjunction (∨),
and negation (¬) are as follows:
ℓ1 ∧̃ ℓ2 = max{0, I(ℓ1) + I(ℓ2) − 1},
ℓ1 ∨̃ ℓ2 = min{I(ℓ1) + I(ℓ2), 1},
¬̃ ℓ1 = 1 − I(ℓ1),
where we use ˜ to indicate the relaxation from the Boolean domain. For a ground
rule r ≡ rbody → rhead ≡ ¬̃ rbody ∨̃ rhead, where rbody and rhead are logical
formulas composed of atoms and the logical operators defined above, an interpretation
I over the atoms in r determines whether r is satisfied, and, if not, its distance
to satisfaction. Abusing notation, we expand the usage of I to also denote
the truth assignments to logical formulas induced by assignments to atoms and
application of the definitions of the logical operators in the formula, i.e., I(r) is the
truth value that results from applying the logical operators in r to the truth
values of atoms in r given by I. Then, given I, r is satisfied, i.e., I(r) = 1, if and
only if I(rbody) ≤ I(rhead), that is, the head has at least the same truth value
as the body. Again, this coincides with the usual definition of satisfaction of a
rule when truth values are restricted to 0 and 1. The rule's distance to
satisfaction under interpretation I then measures the degree to which this condition is
violated:
dr(I) = max{0, I(rbody) − I(rhead)} (3)
For instance, consider the interpretation I = {link(a, b) ↦ 1, exemplar(a, c) ↦ 0.9,
exemplar(d, c) ↦ 0.8, link(d, b) ↦ 0} and let r be the corresponding ground
instance of Rule (2) above. We get I(rbody) = max{0, 1 + 0.8 + 0.9 − 2} = 0.7
and thus dr(I) = max{0, 0.7 − 0} = 0.7, whereas the distance would be 0 if the
head had truth value 0.7 or greater.</p>
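      <p>As an illustration (a sketch for the reader, not part of the PSL implementation), the Lukasiewicz relaxations and the distance to satisfaction for the worked example above can be written in Python:

```python
# Sketch of the Lukasiewicz relaxations used by PSL (illustrative only).

def soft_and(*vals):
    # n-ary Lukasiewicz t-norm: max{0, sum - (n - 1)}
    return max(0.0, sum(vals) - (len(vals) - 1))

def soft_or(*vals):
    # co-norm: min{sum, 1}
    return min(sum(vals), 1.0)

def soft_not(v):
    return 1.0 - v

def distance_to_satisfaction(body, head):
    # d_r(I) = max{0, I(r_body) - I(r_head)}
    return max(0.0, body - head)

# Worked example from the text: ground instance of Rule (2).
I = {"link(a,b)": 1.0, "exemplar(a,c)": 0.9, "exemplar(d,c)": 0.8, "link(d,b)": 0.0}
body = soft_and(I["link(a,b)"], I["exemplar(a,c)"], I["exemplar(d,c)"])
d = distance_to_satisfaction(body, I["link(d,b)"])
# body and d are both approximately 0.7, matching the text
```

Raising the truth value of link(d, b) to 0.7 or above drives the distance to zero, as noted above.</p>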
      <p>Given a set of atoms ℓ of interest, a PSL program induces a distribution over
possible interpretations I. ℓ first induces a set of ground rules R, which contains
every possible ground rule r such that r can be obtained by performing variable
substitution on one of the rules in the program and each atom mentioned in r
is in ℓ. The probability density function f over I is:
f(I) = (1/Z) exp[−Σr∈R λr (dr(I))^p] (4)
where λr is the weight of the rule r, Z is the continuous version of the
normalization constant used in discrete Markov random fields, and p ∈ {1, 2} determines
the loss function for minimizing the distance from satisfaction. If p = 2 the loss
function is quadratic and the distance from satisfaction for each ground rule is
squared. Constraints can be imposed on interpretations and the domain updated
accordingly, for instance, requiring a predicate to be functional. Also, the
density function can be conditioned on a partial interpretation and the domain and
definitions of distances to satisfaction updated accordingly.</p>
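      <p>The unnormalized part of this density (the exponent, before dividing by Z) can be sketched as follows, assuming the per-rule weights and distances to satisfaction are already computed:

```python
import math

def unnormalized_density(weights, distances, p=1):
    # exp of the negated weighted sum of distances to satisfaction;
    # p=1 gives the linear loss, p=2 the quadratic loss
    return math.exp(-sum(w * d ** p for w, d in zip(weights, distances)))

# Example: two ground rules with weight 1 and distances 0.7 and 0.0.
val_linear = unnormalized_density([1.0, 1.0], [0.7, 0.0], p=1)
val_quad = unnormalized_density([1.0, 1.0], [0.7, 0.0], p=2)
# quadratic loss penalizes this violation less, since 0.49 is less than 0.7
```

Interpretations with smaller total weighted distance to satisfaction thus receive higher (unnormalized) density.</p>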
      <p>Finding the most probable interpretation in PSL is an instance of MPE
inference. Maximizing the density function f(I) is equivalent to minimizing the
summation in the exponent. This optimization problem, if subject only to linear
equality and inequality constraints on the interpretation, can be solved efficiently
by casting it as a second-order cone program [1].</p>
    </sec>
    <sec id="sec-5">
      <title>A PSL Model for Graph Summarization</title>
      <p>Figure 3 lists the set of PSL rules used in this work for graph summarization
in annotation data. Different subsets of these rules are experimentally evaluated
and compared in Section 6. We model similarity of pairs of nodes of the same type
with the predicate similar/2 and relations between pairs of nodes of different types
with the predicate link/2. Both predicates are symmetric. Note that while these
predicates allow us to easily write general rules for all types of links and nodes
appearing in the data, the inference engine takes into account the node types
during grounding and thus ensures that clustering respects the types. Given
truth values for all relevant atoms of these two predicates, the task of inference
is to infer truth values of the remaining predicate exemplar/2, which encodes
clusters. More specifically, the truth value of an atom exemplar(a, b) indicates
whether node a is a member of the cluster that has node b as its exemplar. We
constrain exemplar/2 to be a functional predicate, that is, the truth values of
all its groundings using a given node a as first argument have to sum to one. We
also set a small prior on exemplar/2, further limiting its groundings.</p>
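      <p>To illustrate the effect of the functional constraint (a sketch only; this is not how the PSL engine enforces it internally): each node's exemplar memberships form a soft assignment over candidate exemplars that sums to one.

```python
def enforce_functional(memberships):
    # memberships: candidate exemplar -> soft truth value for one node a;
    # rescale so the groundings exemplar(a, .) sum to one
    total = sum(memberships.values())
    if total == 0:
        raise ValueError("node must have at least one candidate exemplar")
    return {b: v / total for b, v in memberships.items()}

# Hypothetical memberships for one node over candidate exemplars b, c, d.
soft = enforce_functional({"b": 0.6, "c": 0.2, "d": 0.2})
# the soft memberships now sum to one
```

Under this constraint, a node cannot fully belong to two exemplars at once, which is what makes the inferred values behave as soft cluster memberships.</p>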
      <p>In the following, we discuss the individual rules in more detail, showing how
they encode the clustering heuristics introduced in Section 2 as probabilistic
dependencies. We start with the similarity heuristic, which indicates that pairs of
similar nodes of the same type should probably be clustered. It is modeled by the
first PSL rule:
exemplar(A, B) → similar(A, B) (5)
This rule connects truth values of similar/2, which are given, to those of
exemplar/2, which are inferred. For a pair of nodes (a, b) with low similarity,
the rule is only satisfied for low truth values of exemplar(a, b). In other words,
it encourages node a to choose a different, more similar exemplar. If a and b are
highly similar, on the other hand, a wider range of truth values for exemplar(a, b)
will satisfy the rule, making it possible for a to choose b or another node as its
exemplar without penalty.</p>
      <p>We further encourage clusters with a single exemplar, which is modeled by
the second PSL rule:
exemplar(A, B) → exemplar(B, B) (6)
This rule breaks chains of exemplar choices by penalizing situations where a node
that is chosen as exemplar by some node in the cluster does not choose itself as
exemplar. As truth values of exemplar/2 atoms are inferred during clustering,
this rule can propagate information in both directions. If the truth value of
exemplar(b, b) is low for a given node b, it will encourage low truth values for
all atoms exemplar(a, b) with other nodes a as first argument. Conversely, each
atom exemplar(a, b) with high truth value encourages a high truth value for
exemplar(b, b).</p>
      <sec id="sec-5-1">
        <title>Annotation Link Heuristic</title>
        <p>The following two PSL rules model the annotation link heuristic:
link(A, B) ∧ link(C, B) ∧ exemplar(A, D) → exemplar(C, D) (7)
link(A, B) ∧ exemplar(A, C) ∧ exemplar(D, C) → link(D, B) (8)</p>
        <p>Rule (7) states that a shared neighbor is an indication that two nodes should
be clustered. Consider a pair of candidate nodes a and c for clustering, and keep
the exemplar d and the node b on the other side fixed. Due to symmetry, we
get two groundings of the rule, one replacing A with a and C with c, the other
replacing A with c and C with a:
link(a, b) ∧ link(c, b) ∧ exemplar(a, d) → exemplar(c, d)
link(c, b) ∧ link(a, b) ∧ exemplar(c, d) → exemplar(a, d)
During clustering, the truth values of link/2 atoms are fixed to either 1 (link
exists) or 0 (link does not exist). If one of the two links does not exist, both
groundings are trivially satisfied, as their bodies will have the minimal truth
value 0. If they both exist, the rules simplify to
exemplar(a, d) → exemplar(c, d)
exemplar(c, d) → exemplar(a, d)
and thus encourage the truth value of exemplar(a, d) to be at most and at
least that of exemplar(c, d), respectively. In other words, the two nodes should
agree in the degree to which they choose that specific exemplar and its
corresponding cluster. Note that the influence of this rule grows with the number of
joint neighbors the two candidates for clustering share, as those will produce
individual groundings.</p>
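        <p>The growth of Rule (7)'s influence with the number of shared neighbors can be sketched by enumerating its groundings for a fixed candidate pair; the toy link set below is illustrative and not drawn from the paper's data sets.

```python
def rule7_groundings(links, a, c, d):
    # one grounding of Rule (7) per neighbor b shared by both a and c:
    # link(a,b) AND link(c,b) AND exemplar(a,d) -> exemplar(c,d)
    neighbors_a = {b for (x, b) in links if x == a}
    neighbors_c = {b for (x, b) in links if x == c}
    shared = neighbors_a.intersection(neighbors_c)
    return [f"link({a},{b}) AND link({c},{b}) AND exemplar({a},{d}) -> exemplar({c},{d})"
            for b in sorted(shared)]

# Toy example: two genes annotated with overlapping sets of terms.
links = {("PHOT1", "cauline leaf"), ("PHOT1", "vacuole"),
         ("CRY2", "cauline leaf"), ("CRY2", "vacuole"),
         ("CRY2", "response to water deprivation")}
g = rule7_groundings(links, "PHOT1", "CRY2", "CRY2")
# two shared neighbors produce two groundings
```

Each shared neighbor contributes one grounding, and hence one term to the weighted sum in the exponent of the density, which is what makes the pressure to agree on an exemplar grow with overlap.</p>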
        <p>While Rule (8) again involves a pair of nodes that are candidates for
clustering (a neighboring node and an exemplar), due to its different form, it encodes
a different dependency. Consider the grounding
link(a, b) ∧ exemplar(a, c) ∧ exemplar(d, c) → link(d, b) (13)
As truth values of link/2 are fixed to either 0 or 1, this grounding is trivially
satisfied if there is no link between a and b (in which case the truth value of the
body is minimal) or if there is a link between d and b (in which case the truth
value of the head is maximal). The interesting case is thus the one where there
is a link between a and b, but no link between d and b (note that a symmetric
grounding affects the opposite case as well). In this case, the rule
increases the probability that
exemplar(a, c) ∧̃ exemplar(d, c) = 0 (14)
In words, the rule will be satisfied in this case if and only if the truth values of
the two exemplar/2 atoms sum to at most 1, thus encouraging the two nodes
not to strongly agree on a joint exemplar. The influence of this rule grows with
the number of neighbors on which a and d disagree. Together, the two rules
thus allow one to take into account both shared and unshared neighbors during
clustering.</p>
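        <p>A small numeric check of this case (illustrative only): with link(a, b) = 1 and link(d, b) = 0, the grounding of Rule (8) is satisfied exactly when the two exemplar truth values sum to at most one.

```python
def rule8_distance(link_ab, ex_ac, ex_dc, link_db):
    # body is the Lukasiewicz conjunction of the three body atoms:
    # max{0, link_ab + ex_ac + ex_dc - 2}
    body = max(0.0, link_ab + ex_ac + ex_dc - 2)
    return max(0.0, body - link_db)

# interesting case: link(a,b) = 1, link(d,b) = 0
# exemplar values summing to at most 1, so the grounding is satisfied
d_ok = rule8_distance(1.0, 0.4, 0.6, 0.0)
# strong agreement on a joint exemplar, so the grounding is violated
d_bad = rule8_distance(1.0, 0.9, 0.8, 0.0)
```

The first call yields distance (approximately) zero, while the second yields a positive distance, discouraging the pair from strongly sharing an exemplar when their neighborhoods disagree.</p>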
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <p>The goals of graph summarization include identifying patterns and making
predictions in the annotation graph. We use prediction, specifically the task of
predicting missing links, to explore the utility of the simple heuristics from
Section 5. In our experimental setting, the missing links are links between genes and
GO terms in the annotation graph for the model organism Arabidopsis thaliana,
as described in Section 2. We begin by generating graph summaries using the
rules described earlier. Next, we use this model to predict gene-GO annotations.</p>
      <p>In addition to the GO, gene, and PO annotation graph, we consider gene-gene
sequence-based similarity (computed pairwise using the Nucleotide-Nucleotide
BLAST 2.2.26+ package), as well as PO-PO path-based distances from the PO
ontology and GO-GO path-based distances from the GO ontology. These are
represented as similar/2 atoms between nodes of each of the three types within the
graph: PO terms, genes, and GO terms. In our model, all instances of similar/2
atoms are treated uniformly; however, they are computed using different
similarity metrics. All other relations from the data are link/2 atoms between nodes
of different types. Although the type of each node in the graph could be
represented explicitly and used in the rules to control the graph summaries, e.g., to
ensure that no cluster contains both PO and GO terms, in our implementation
we only consider relations between nodes where a relation of that type might
exist between nodes of those types. Further, in the grounding of the atoms in
the data, we make each similar/2 atom symmetric by asserting its inverse, and
we do the same for link/2 atoms.</p>
      <p>Using each graph summarization program, we infer the exemplar/2 atoms
forming clusters with soft membership, which is the input to our link
prediction program. To then evaluate our link prediction using graph summaries, we
perform leave-one-out evaluation. Specifically, for each link in the original
annotation graph we first remove the link and compute the graph summary. We then
predict missing links. We sort the predicted links based on their truth values,
and we interpret the truth values as the confidence in the prediction. We
calculate the link prediction precision with recall of one, and we report the mean
average precision computed over all leave-one-out links.</p>
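      <p>For reference, mean average precision over ranked predictions can be computed as follows; this is a generic sketch, and the names are illustrative rather than taken from the paper's implementation.

```python
def average_precision(ranked, relevant):
    # precision at each rank where a relevant link is retrieved, averaged
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    # queries: list of (ranked_predictions, relevant_set) pairs,
    # e.g. one pair per held-out link in leave-one-out evaluation
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# in leave-one-out, each query has a single relevant (held-out) link,
# so its average precision is simply 1 / rank of that link
ap = average_precision(["l2", "l1", "l3"], {"l1"})  # held-out link at rank 2
```

Since each leave-one-out query has exactly one relevant link, MAP here reduces to the mean reciprocal rank of the held-out links.</p>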
      <p>Many combinations of heuristics could be explored for the summarization and
prediction programs, so we chose the following four configurations to evaluate.
The first configuration (LINK1) uses Rule (7) for graph summarization and
link prediction. The second configuration (LINK2) is the same as LINK1 but
uses Rule (8) in place of Rule (7). The third configuration (SIM1) uses only the
similarity rules for graph summarization and adds Rule (7) for link prediction.
The fourth configuration (SIM2) is the same as SIM1 but uses Rule (8) in
place of Rule (7). Each configuration also uses Rule (5) and Rule (6). All rules
have weight one except Rule (6), which has a high weight, 1000, to encourage
distinct clusters by breaking chains of exemplar/2 atoms. Learning individual
weights for these rules may be beneficial, but we do not consider it in this work.</p>
      <p>Finally, we choose the loss function for minimizing the distance from
satisfaction, which is set by p in (4). In each configuration, we use the linear loss
function for the graph summarization program and the quadratic loss function
for the link prediction program. We use the quadratic loss function for the link
prediction programs for two reasons. First, inference with quadratic loss is more
expensive than with linear loss, so in the interest of time we only use it for link
prediction, which is less expensive than graph summarization. Our link
prediction programs are less expensive, in part, because inferring link/2 atoms involves
a smaller number of rules than inferring exemplar/2 atoms. Second, quadratic
loss tends to assign link/2 atoms truth values between, rather than at, the
extremes 0 and 1 more often than linear loss does, and this is helpful when ranking
predicted links to calculate precision.</p>
      <sec id="sec-6-1">
        <title>Results</title>
        <p>Table 1 describes the TAIR annotation graph data sets we used for evaluation.
The first two data sets (DS1 and DS2) have fewer genes but more terms and
annotations overall than the last data set (DS3). Table 2 reports mean average
precision (MAP) on the evaluation data sets for each PSL model configuration.
Considering the LINK1 and LINK2 configurations across all data sets, we see
that Rule (8), used in the latter, consistently achieves higher precision than Rule (7),
used in the former. Rule (8) is also used in SIM2, where it similarly achieves higher
precision than the other link rule in SIM1 across all data sets.</p>
        <p>Now considering precision across data sets, on DS3 both LINK2 and SIM2
perform well compared to previous link prediction results on similar annotation
data [5]. However, none of the configurations perform well on DS1 and DS2. To
interpret this result, we convert the clusters to hard membership, calculate the
average number of clusters of size greater than one produced by each configuration,
and normalize by the number of nodes in the data set. This is shown on the
right side of Table 2.</p>
        <p>Using this information, we see a small number of clusters formed in DS1
and DS2 and a larger number formed in DS3, except where Rule (7) is used.
(A similar pattern of cluster sizes emerges when a version of these graphs is clustered
using a separate method similar to [5].) This suggests that Rule (8) is helpful for
link prediction on this data and may be helpful for clustering; on the other hand,
Rule (7) is not helpful for clustering or link prediction on this data, and may
interfere with the use of similarity attributes in clustering. Finally, this also
suggests that neither annotation link heuristic rule works well for prediction on
graphs where we find few clusters of size greater than one. Since, for example in
Rule (8), the truth value of inferred link/2 atoms is bounded only when there are
multiple nodes A and D in the same cluster, this result is not surprising.</p>
        <p>In this work, we demonstrated an exploratory use of graph summarization
heuristics in probabilistic soft logic (PSL) on annotation graph data, combining
relational and similarity evidence from multiple, heterogeneous sources. The power
of the approach is the ease with which a variety of clustering criteria can be
declaratively expressed. Our work, which is ongoing, will continue to explore the space
of graph summarization rules for combining data from rich sources, such as the
gene sequences, annotations, and term ontologies used in this work, and other
sources now made available through the Linked Data initiative and Semantic
Web.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgements</title>
        <p>Angelika Kimmig is a postdoctoral fellow of the Research Foundation
Flanders (FWO Vlaanderen). This work is partially supported by NSF under grants
IIS0746930 and CCF0937094.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Broecheler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihalkova</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Getoor</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Probabilistic similarity logic</article-title>
          .
          <source>In: Conference on Uncertainty in Artificial Intelligence (UAI)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Segal</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Probabilistic classification and clustering in relational data</article-title>
          .
          <source>In: International Joint Conference on Artificial Intelligence (IJCAI)</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kemp</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffiths</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamada</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ueda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Learning systems of concepts with an infinite relational model</article-title>
          .
          <source>In: National Conference on Artificial Intelligence (AAAI)</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          :
          <article-title>A probabilistic framework for relational clustering</article-title>
          .
          <source>In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Thor</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raschid</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navlakha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khuller</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.N.</given-names>
          </string-name>
          :
          <article-title>Link prediction for annotation graphs using graph summarization</article-title>
          .
          <source>In: International Semantic Web Conference (ISWC)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Discovery-driven graph summarization</article-title>
          .
          <source>In: International Conference on Data Engineering (ICDE)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hankins</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Efficient aggregation for graph summarization</article-title>
          .
          <source>In: ACM SIGMOD International Conference on Management of Data (SIGMOD)</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Navlakha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rastogi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Graph summarization with bounded error</article-title>
          .
          <source>In: ACM SIGMOD International Conference on Management of Data (SIGMOD)</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Frey</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dueck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Clustering by passing messages between data points</article-title>
          .
          <source>Science</source>
          <volume>315</volume>
          (
          <year>2007</year>
          )
          <fpage>972</fpage>
          -
          <lpage>976</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Berthold</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , ed.:
          <source>Bisociative Knowledge Discovery. Volume 7250 of Lecture Notes in Computer Science</source>
          . Springer Berlin / Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>R.C.T.</given-names>
          </string-name>
          :
          <article-title>Fuzzy logic and the resolution principle</article-title>
          .
          <source>J. ACM</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ) (
          <year>1972</year>
          )
          <fpage>109</fpage>
          -
          <lpage>119</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov logic networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>62</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>2006</year>
          )
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>