=Paper=
{{Paper
|id=None
|storemode=property
|title=Finding and Explaining Similarities in Linked Data
|pdfUrl=https://ceur-ws.org/Vol-808/STIDS2011_CR_T7_OlssonEtAl.pdf
|volume=Vol-808
|dblpUrl=https://dblp.org/rec/conf/stids/OlssonPSP11
}}
==Finding and Explaining Similarities in Linked Data==
Finding and Explaining Similarities in Linked Data

Catherine Olsson
Raytheon BBN Technologies and Massachusetts Institute of Technology
catherio@mit.edu

Plamen Petrov, Jeff Sherman, Andrew Perez-Lopez
Raytheon BBN Technologies, Arlington, VA
{ppetrov,jsherman,aperezlo}@bbn.com
Abstract—Today's computer users and system designers face increasingly vast amounts of data, yet lack good tools to find pertinent information within those datasets. Linked data technologies add invaluable structure to data, but challenges remain in helping users understand and exploit that structure. One important question users might ask about their data is "What entities are similar to this one, and why?" or "How similar are these two entities to one another, and why?". Our work focuses on using the semantic content of linked data not only to facilitate the process of finding similar entities, but also to produce automatically-generated and human-understandable explanations of what makes those entities similar. In this paper, we formulate a definition of an "explanation" of similarity, we describe a system that can produce such explanations efficiently, and we present a methodology to allow the user to tailor how "obvious" or "obscure" the provided explanations are.

I. INTRODUCTION

Today's world is a world of data. As technology advances, it becomes easier and easier to collect and store vast amounts of data. Much of this data can be viewed in terms of nodes with properties and relationships, or edges, among those nodes — that is to say, it can be represented as a graph. Once a dataset has been represented in a graph format, such as with Semantic Web [1] or other linked data technologies [2][3], it can easily be combined with data from different sources. In this way, linked data allows already-vast datasets to be readily combined and connected, giving users and programs access to more data than ever before. The challenge, then, is in making sense of this data.

Some of the data analysis questions that are emerging include the following: How does one entity in the linked data relate to another entity, possibly derived from a different source? How does a given entity relate to the rest of the data? What are the similarities between two entities, and why? There are also related data search and retrieval questions to be tackled, such as "Find all entities similar to this entity" and "Find groups of entities that are similar to each other."

To solve these problems, work has been done at Raytheon BBN Technologies (BBN) to devise a similarity measure called the Structural Semantic Distance Measure (SSDM), which leverages both the structural and semantic content of linked data to find similar entities. SSDM is based on SimRank [4], a highly domain-general similarity measure with an efficient approximate calculation. SSDM improves on SimRank by incorporating the semantic content of edge labels, and by achieving greater independence of ontological choices.

Raw numerical similarity scores provide very little insight to users about what those scores mean, so users often want an explanation of how a score should be understood and interpreted. In this paper, we formulate a definition of an "explanation" of an SSDM score that ensures that the explanation is both human-understandable and well-grounded in how SSDM scores are calculated. We also describe a system that can produce such explanations efficiently.

Additionally, not all users will desire the same level of detail in their explanations. Therefore, we present a methodology for allowing the user to tailor how "obvious" or "obscure" the provided explanations are. We expect that users who are investigating an unfamiliar domain will prefer "obvious" explanations that refer to common and well-known properties, while expert users will prefer "obscure" explanations that shed light on less well-known relationships and details.

A. Motivation

Our work is motivated by a number of problems in the intelligence and military research communities. Many of those problems are ubiquitous and have direct translations to business intelligence, logistics, and planning. Take, for example, a model of a large organization C with its associates, their interactions, locations they visit, resources they use or produce, and events in which they participate. Given this information, one could explore the stated relationships among the constituents of C, such as "show all transactions that involve person X." Beyond these simple information-retrieval tasks, analysts might want to examine more complex (or less crisply defined) interactions. For example, "show all associates similar to Y" could be a very useful query when trying to learn more about person Y. Finally, given a subset S = s1, s2, ..., sn of members in the organization C, which might represent a group suspected of participating in nefarious activities, a query like "show all subsets of C similar to S" might be an excellent way to discover other suspicious clusters in the organization.

Note that in the example above we did not have any a priori knowledge of the organization other than its structure (which in general is a directed graph) and the elements for which we were searching. In particular, we did not assume any hierarchy, types of relationships present, or any statistical properties of the graph. It was also important that the queries were phrased in a general way, using the word "similar" to indicate a degree of likeness, but not (necessarily) an exact match. Such problems occur every day in the military, intelligence, and defense communities as well as in the business and civilian worlds.

We have structured our algorithms and methodologies to be applicable to any data expressing entities and relationships between entities. Notably, much of the data encountered in the military and intelligence domains deals with entities and the relationships between them, and can therefore benefit from our contributions.

Throughout this paper, we include examples from the movie industry, drawn from a popular and widely accessible dataset about movies, actors, directors, film genres, and so on [5]. One can easily find direct analogies between this data and the types of data encountered in the intelligence and defense domains.

II. BACKGROUND

The general problem of similarity is twofold: first, to construct a measure of pairwise similarity so that a meaningful similarity score can be calculated for any pair of entities in a dataset; and second, to devise a method for efficiently retrieving the entities that are most similar to a given entity.

In this section, we discuss a similarity measure developed at BBN called the Structural Semantic Distance Measure (SSDM). SSDM is an extension of existing work on calculating similarity over unlabeled, directed graphs. The contribution of SSDM is to incorporate the semantic content contained in edge labels, and to achieve a greater independence of ontological choices for edge labeling.

Our work on SSDM builds off the SimRank algorithm by Jeh and Widom [4]. We chose to base our work on SimRank for the following reasons:
• SimRank is domain-independent in that it can be applied to any data representing relationships between entities. This is in contrast with domain-specific similarity algorithms, such as those that can only be used to compare documents [6], ontological categories [7], or some other domain-specific data type.
• SimRank can be computed efficiently in approximation, even over very large datasets, in contrast with measures that rely on Singular Value Decomposition or other computations that scale poorly [8].
• The approximate computation of SimRank can not only determine the similarity between two entities efficiently, but can also generate a list of entities that are most similar to a given entity.
• The computation behind SimRank can be understood on a conceptual level, which makes it possible to explain the similarity score by referring directly to the computation performed. This would not be possible using a similarity measure that relied on more abstract calculations.
• SimRank looks beyond an entity's immediate neighborhood and features when determining similarity, which enables it to incorporate a broader scope of information about the structural context of entities.

All these positive attributes are retained in SSDM, along with several additional improvements.

A. SimRank

SimRank is based on the intuition that "Two entities are similar if they are related to similar entities." While this statement may seem trivial at first, it leads directly to a simple mathematical definition of similarity: the similarity score between two entities is the average pairwise similarity of their neighbors, scaled by a decay factor.
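In the notation of Jeh and Widom [4], writing I(a) for the set of in-neighbors of entity a, I_i(a) for its i-th member, and C for the decay factor, this intuition is captured by the recurrence below (s(a, b) is taken to be 0 whenever I(a) or I(b) is empty):

```latex
s(a, a) = 1, \qquad
s(a, b) = \frac{C}{|I(a)|\,|I(b)|}
          \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a),\, I_j(b)\bigr)
          \quad \text{for } a \neq b .
```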
Consider the example in Fig. 1. Intuitively, one would imagine that Movie 1 and Movie 2 should be similar, because they have two actors in common and they are both in the same genre. Additionally, Director 1 and Director 2 should be similar even though they have no immediate connections in common, because they directed similar movies. SimRank captures and formalizes this intuition.

Fig. 1: This figure depicts relationships that exist between entities in a movie dataset. Director 1 and Director 2 have no immediate neighbors in common, but they are similar because they are related to similar movies.

Each pair's similarity is dependent on many other pairs, which may seem to be a barrier to computing their scores. Fortunately, this barrier is readily surmountable. On small datasets the system can be solved with an iterative algorithm, and on large datasets it can be solved using an efficient approximate method outlined by Fogaras and Rácz in [9]. Our implementation of the SSDM calculation is based on this efficient approximate method.

The algorithm outlined by Fogaras and Rácz relies on the mathematical notion of a random walk through a graph, in which an abstract walker steps from node to node through the graph by following random edges [10]. In the original SimRank paper, Jeh and Widom observed that the SimRank score of two nodes can be approximated from the expected meeting time of two random walkers starting at those two nodes; a higher expected meeting time corresponds with a lower SimRank score. Fogaras and Rácz used this observation to develop an efficient and scalable algorithm for calculating similarity scores.

In the algorithm proposed in [9], one random walker is initialized per node in the graph, and each walker moves along one edge per time step. To reduce the amount of computation required, walkers are allowed to converge at their meeting point, and are thenceforth treated as a single walker without loss of correctness in the approximate calculation of expected meeting times. Fig. 2 demonstrates how walkers converge. Once the maximum number of steps has elapsed, the run is halted. Repeated runs are performed and the data are aggregated.

Fig. 2: Walkers a, b, c, and d begin as independent walkers. As they walk (shown progressing from left to right), they meet one another. c and d meet at the first time step and converge. a and b meet next. Finally, on the far right of the diagram, all walkers have converged.

Additionally, to repair some deficiencies with the original formulation of SimRank, walkers in Fogaras and Rácz's algorithm are incentivized to converge if they are near one another. This is accomplished by randomly permuting all vertices in the graph at the start of each time step, with each walker stepping to the neighbor with the smallest index in the permutation.

The end result of one run of Fogaras and Rácz's scalable SimRank algorithm is a fingerprint graph which encodes the first meeting times for each pair of walkers. Several runs are conducted, and the fingerprint graphs are compiled into a larger fingerprint database. The fingerprint database is pre-computed and can be efficiently queried thereafter, either to retrieve a similarity score between any two nodes, or to retrieve the set of nodes with a similarity score greater than some threshold with any given node.

The implementation we devised for calculating the SSDM retains the basic structure of the computation described above, including converging random walkers and permutation-based convergence incentivization.
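As a rough illustration of how such a calculation can be organized, the sketch below (our own Python illustration, not BBN's implementation) starts one walker per node, draws one shared random ranking of nodes each step so that nearby walkers tend to pick the same neighbor and, once co-located, never separate, and records first meeting times over repeated runs. Letting a walker stay put at a node with no in-edges, and converting meeting times to scores with a decay constant, are simplifying assumptions of ours.

```python
import random
from collections import defaultdict

def sample_first_meeting_times(in_neighbors, nodes, max_steps=10, num_runs=100, seed=0):
    """One walker per node; each step draws one shared random ranking of nodes and
    every walker moves to its lowest-ranked in-neighbor, so walkers in the same
    neighborhood tend to step together and, once co-located, never separate.
    Records the first step at which each pair of walkers meets, per run."""
    rng = random.Random(seed)
    meetings = defaultdict(list)                        # frozenset({u, v}) -> meeting steps
    for _ in range(num_runs):
        position = {v: v for v in nodes}                # walker v currently sits at position[v]
        first_met = {}
        for step in range(1, max_steps + 1):
            rank = {v: rng.random() for v in nodes}     # the per-step random permutation
            moved = {}
            for walker, node in position.items():
                nbrs = in_neighbors.get(node, [])
                # simplification: a walker with no in-edges simply stays put
                moved[walker] = min(nbrs, key=rank.__getitem__) if nbrs else node
            position = moved
            at_node = defaultdict(list)
            for walker, node in position.items():
                at_node[node].append(walker)
            for group in at_node.values():              # record first meetings only
                for i in range(len(group)):
                    for j in range(i + 1, len(group)):
                        first_met.setdefault(frozenset((group[i], group[j])), step)
        for pair, step in first_met.items():
            meetings[pair].append(step)
    return meetings

def similarity_estimate(meetings, num_runs, decay=0.8):
    """SimRank-style score from the samples: earlier meetings count more, and runs
    in which a pair never met contribute zero."""
    return {pair: sum(decay ** t for t in times) / num_runs
            for pair, times in meetings.items()}
```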
B. SSDM — The Structural Semantic Distance Measure

The Structural Semantic Distance Measure developed at BBN is closely related to SimRank as described above, with two key enhancements: SSDM incorporates the semantic content of edge labels, and SSDM is independent of ontological choices — namely, which edge directionality each proposition should have.

Whereas SimRank is a measure over unlabeled directed graphs, SSDM incorporates edge labels. This makes it well-suited to any data in subject-predicate-object format, such as RDF or other linked data; subjects and objects are equivalent to nodes in the graph, and predicates are equivalent to labeled edges. The most important use of edge labels is in the way expected meeting time is calculated. The original SimRank computation defines "meeting time" as the time step when two walkers step to the same node, at which point they converge. The SSDM computation has a stricter condition on convergence: namely, when two walkers meet, it only counts as a convergence if they arrive at the same node on the same step and they have traversed an identical sequence of predicates to that node. The reasoning behind this modification is that the semantic meaning of edge labels is critical to the similarity calculation. For example, two entities A and B may both be related to a third entity C, but they are certainly not similar if the relations in question are A isA C and B isNever C.

Additionally, SimRank only allows similarity to propagate along in-edges, which means that the original computation of SimRank only allows walkers to step backwards (that is, from objects to subjects). This makes SimRank highly dependent on ontological choices, because it is an arbitrary choice in a directed graph whether each label should be phrased in the forward or reverse direction. For example, the relation A isComponentOf B could be equally expressed as B hasComponent A; the choice of which direction is used is arbitrary and can vary from dataset to dataset. Choosing and enforcing consistent edge directionality is a difficult issue in ontologies in general, so we did not want SSDM to be heavily dependent on arbitrary edge direction choices. As a result, SSDM allows walkers to walk in both directions.

Note that allowing walks in both directions requires us to distinguish a walker traversing A isComponentOf B from a walker traversing B isComponentOf A, as these two steps have very different semantic meanings. Therefore, in SSDM, it is not enough for walkers to simply have traversed identical predicates in order to converge; they must have traversed those predicates in the same direction (in or out).

To illustrate the conditions on convergence required by SSDM, consider the example of calculating the similarity between two movies, War of The Worlds and Gladiator, as shown in Fig. 3. Suppose Walker 1, starting from War of The Worlds, traverses the predicates directed (in), directed (out), hasActor (out) to reach Harrison Ford. If Walker 2 follows the same predicates in the same order to reach Harrison Ford, as is shown in Path A, then the two walkers will converge with a meeting time of three steps. If Walker 2 instead follows a different sequence of predicates, such as hasActor (out), hasActor (in), hasActor (out) as shown in Path B, it will not converge with Walker 1 because the two walkers did not follow identical predicates to get there.

Fig. 3: This diagram depicts two possible ways — path A and path B — by which Walker 2 could meet Walker 1 at the Harrison Ford node. If Walker 2 takes Path A, they will converge. If Walker 2 takes Path B, they will not.

SSDM was designed as a domain-independent similarity measure that would easily account for the semantics of labeled graph data without being dependent on ontological choices. This ability to incorporate semantic information is especially important in domains with rich semantic context, and allows SSDM to capture semantic nuances that are missed by less sophisticated similarity measures. The significance of this extra information in the measure is as-yet unassessed, but we expect that SSDM should perform better than SimRank for semantic graphs. In short, SSDM is an efficient, semantically-grounded, and ontology-independent algorithm for discovering similar entities in a linked dataset.
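To make the convergence rule concrete, the following sketch (our own illustration, not the SSDM implementation) shows the bookkeeping a walker could carry and the test applied when two walkers are compared; the Walker class and its field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Walker:
    node: str
    history: tuple = ()   # (predicate, direction) pairs in traversal order, e.g. ("hasActor", "out")

    def step(self, predicate: str, direction: str, target: str) -> "Walker":
        """Traverse one labeled edge; direction is 'out' (subject -> object) or 'in'."""
        return Walker(target, self.history + ((predicate, direction),))

def ssdm_converged(a: Walker, b: Walker) -> bool:
    """SSDM's rule: same node, same step, and an identical sequence of
    (predicate, direction) pairs; equal histories imply the same step count.
    Plain SimRank would only ask for the same node at the same step."""
    return a.node == b.node and a.history == b.history
```

Under this test, the two walkers of Path A in Fig. 3 converge because their (predicate, direction) histories are identical, while Path B fails even though both walkers end at Harrison Ford.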
III. EXPLAINING SIMILARITIES

A similarity measure is useful for ranking items or pairs of items, but a numerical score alone gives little insight into why two entities are similar. The user can easily retrieve the similarity score for two entities but is then left wondering: what about those entities and their relations caused them to receive a high or low similarity score? What is the nature of their similarity? In addition, users may want more or less depth in the explanations provided.

In order to enable users to answer these questions, we sought to build a system that could provide explanations for similarity scores. Our three main contributions to the area of similarity explanations are as follows:
1) We formulated a definition of an "explanation" for a similarity score that is human-understandable, as well as appropriately grounded in the way that the similarity score was originally calculated.
2) We wrote a program to efficiently produce such explanations.
3) We further developed a methodology for biasing explanations towards either more "obvious" or more "obscure" facts.

IV. DEFINITION OF AN EXPLANATION

A good explanation of a similarity score must be both human-understandable and grounded in the original calculation of the similarity measure. An explanation that is not human-understandable is hardly an explanation at all, and an explanation that is generated by a computation which is entirely unlike the original calculation of the similarity score can hardly be considered an explanation of why that score was produced.

Recall that the SSDM computation calculates similarity scores based on repeated runs of converging random walkers. The faster two random walkers tend to converge, the higher the similarity score of their starting nodes will be.

It follows that for an explanation of a score to be well-grounded in how SSDM scores are calculated, it must somehow elucidate where and how walkers from the nodes in question tend to converge, and whether these convergences tend to happen rapidly or whether the walks are long. In order for this information to also be human-understandable, it must be relatively concise.

For these reasons, we decided that an explanation should consist of a brief list of common convergence points, along with a handful of concise chains of statements per convergence point describing the "best" relationships linking each starting node to that point. The "best" relationships may be the shortest chains of statements, or the ones that tend to be traversed most frequently, or (as explained later) relationships that are appropriately obvious or obscure. Figure 4 shows an example of an explanation of this form.

John Williams
{A New Hope, hasMusicContributor, John Williams}
{The Empire Strikes Back, hasMusicContributor, John Williams}
Harrison Ford
{A New Hope, hasActor, Harrison Ford}
{The Empire Strikes Back, hasActor, Harrison Ford}
Star Wars (Film Collection)
{A New Hope, inCollection, Star Wars}
{The Empire Strikes Back, inCollection, Star Wars}
{Revenge of the Sith, hasSequel, A New Hope}
{Revenge of the Sith, inCollection, Star Wars}
{A New Hope, hasSequel, The Empire Strikes Back}
{A New Hope, inCollection, Star Wars}
...

Fig. 4: An excerpt from an explanation for the similarity between Star Wars Episode IV: A New Hope and Star Wars Episode V: The Empire Strikes Back, showing both one-relationship chains and multi-relationship chains.

By defining an "explanation" this way, we ensure that users are presented with a coherent explanation of where and how walkers from the nodes in question tend to converge. Additionally, such explanations are also readily understandable as explanations of what commonalities the nodes have, and how they are related to each commonality.
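Viewed as data, an explanation of this form is a small nested structure: a list of convergence points, each carrying a few chains of statements that tie a starting entity to that point. One plausible in-memory shape, using the entries of Fig. 4 (our own illustration, not the system's actual data model):

```python
# One chain is an ordered list of (subject, predicate, object) statements
# linking a starting entity to a convergence point.
Chain = list[tuple[str, str, str]]

explanation: dict[str, list[Chain]] = {
    "Harrison Ford": [
        [("A New Hope", "hasActor", "Harrison Ford")],
        [("The Empire Strikes Back", "hasActor", "Harrison Ford")],
    ],
    "Star Wars (Film Collection)": [
        [("A New Hope", "inCollection", "Star Wars")],
        [("Revenge of the Sith", "hasSequel", "A New Hope"),
         ("Revenge of the Sith", "inCollection", "Star Wars")],
    ],
}
```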
V. OUR APPROACH

The most obvious way to extract an explanation for a computation seems to be to inspect the path that the computation followed to obtain its result. Unfortunately, in the case of our SSDM calculation, such a strategy is inadequate. We have established that we would like to present chains of relations to the user. However, the fingerprint graphs produced by the SSDM computation record only when and where the walkers converged, discarding all information about the path taken by the walkers; furthermore, discarding this information is essential to the calculation as a whole to maintain acceptable space and performance characteristics. While simply listing convergence points does provide the user with some intelligible information, it does not provide as rich an explanation as we would like.

Therefore, our approach to explanation generation is to re-run the SSDM calculation on a smaller scale at query time, and explicitly store the relations traversed rather than condensing the results into a terse fingerprint graph. Two alterations were required to make the modified SSDM calculation efficient enough to provide an acceptable user experience at query time. Both performance improvements were achievable because the modified calculation uses just two walkers rather than starting one walker at every node in the graph. Recall that the similarity of two entities is derived from the expected meeting times of walkers starting at those two entities. If we know in advance which two entities we will be comparing, there is no need to start walkers at any other entities. Because it is so efficient to generate a pair of walks compared to a whole graph's worth of walks, we can afford to run the computation many times per explanation request, and then choose from among the possible explanations to display relevant results to users.

The first performance improvement strategy relates to the permutation-based convergence strategy described in Section II-A. Each step of the ordinary SSDM calculation begins by randomly shuffling all edges in the graph. In the modified calculation, only the two walkers' immediate out-edges need to be shuffled, which is almost always a vastly smaller number of edges, and takes a negligible amount of time.

The second performance improvement strategy relates to the strict edge-label requirements for convergence. In the ordinary SSDM calculation, walkers frequently meet at the same node but do not actually converge because they did not follow the same predicates. In the modified calculation, we instead require the two walkers to follow the same predicates as one another. So, in the example given in Fig. 3, path B would never be generated; instead, either Walker 1 and Walker 2 would both follow a directed in-edge, or both would follow a hasActor out-edge, or both would follow some other shared edge not shown. If the two walkers were ever unable to follow the same predicate in the same direction, then that run of the computation would end. Coupling the edge options of the two walkers greatly reduces the number of trials that fail to converge, leading to a much more efficient calculation.

As a proof-of-concept for this methodology, we implemented an explanation-generating component of Parliament, an open-source triple store developed and maintained by BBN [11]. In Parliament, users are able to browse and view entities in the knowledge base, explore other triples containing an entity, and view a list of similar entities and their SSDM scores. As part of our work on similarity explanations, we added a "why" button for each score, which users can click to produce an on-demand explanation of that score. The software behind the "why" button interacts with the underlying Jena (http://openjena.org) model of the data in Parliament to walk the graph and produce an explanation using the methodology described in this section. Even for datasets with millions of triples, such as the movie dataset, preliminary findings show that accurate explanations could be produced and displayed to the user within seconds.

In summary, explanations for SSDM scores can be efficiently computed on-demand at query time by re-running a modified version of the SSDM random walker calculation.
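The sketch below is our own compact illustration of this modified calculation, not the Parliament/Jena component: only the two query entities get walkers, only their local edges are examined each step, and the two walkers are forced to take the same (predicate, direction) choice, so a run either dies out or ends in a convergence whose traversed statements are recorded. The triple-list graph representation and the simplification of the permutation step to a random shared choice are assumptions.

```python
import random
from collections import defaultdict

def local_moves(triples, node):
    """Every move available from a node: out-edges (node is the subject) and
    in-edges (node is the object), each tagged with the statement it came from."""
    moves = []
    for s, p, o in triples:
        if s == node:
            moves.append((p, "out", o, (s, p, o)))
        if o == node:
            moves.append((p, "in", s, (s, p, o)))
    return moves

def explain_pair(triples, start_a, start_b, max_steps=5, runs=200, seed=0):
    """Coupled two-walker runs: each step both walkers must take the same
    (predicate, direction); if no shared choice exists the run ends, and if the
    walkers land on the same node the traversed statement chains are recorded
    against that convergence point."""
    rng = random.Random(seed)
    chains = defaultdict(list)                          # convergence point -> (chain_a, chain_b)
    for _ in range(runs):
        a, b, path_a, path_b = start_a, start_b, [], []
        for _ in range(max_steps):
            moves_a = local_moves(triples, a)
            moves_b = local_moves(triples, b)
            rng.shuffle(moves_a)                        # only the two walkers' local edges are shuffled
            choice = None
            for p, d, nxt, stmt in moves_a:
                matching = [m for m in moves_b if m[0] == p and m[1] == d]
                if matching:
                    choice = ((p, d, nxt, stmt), rng.choice(matching))
                    break
            if choice is None:                          # no shared (predicate, direction): run dies out
                break
            (_, _, a, stmt_a), (_, _, b, stmt_b) = choice
            path_a.append(stmt_a)
            path_b.append(stmt_b)
            if a == b:                                  # converged: remember where and how
                chains[a].append((list(path_a), list(path_b)))
                break
    return chains
```

Because each run is cheap, many runs can be performed per request, and the collected chains can then be ranked — by length, frequency, or salience — before being shown to the user.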
VI. USING SALIENCE TO BIAS EXPLANATIONS

The final contribution we describe in this paper is a method for incorporating salience into explanation generation. Salience is a measure of how rare a fact is in a dataset. Our goal was to produce the most "useful" explanations, which we believed would be the explanations with the most salient facts, because salient facts are rare and therefore highly descriptive. We instead discovered that high-salience explanations often come across as obscure because they can contain extremely rare facts. Similarly, low-salience explanations often come across as obvious because they contain extremely common facts. Nonetheless, just as high-salience explanations often have the upside of being very descriptive, low-salience explanations often have the upside of revealing the broadest and most general similarities rather than obscure trivia. Which flavor of explanation is more "useful" likely depends on the goals of the user and requires more research in an application domain.

In this section we present the definition of salience as applied to facts and explain how we constructed a salience-weighted edge generator so that the explanations generated would contain more or fewer high-salience facts. Note that while the following descriptions will focus on biasing explanations by salience, the same mechanism can be used to favor facts based on any numerical property of those facts.

A. Fact Salience

Salience is a measure of how rare a fact is in a dataset. A fact in this case refers not to a whole statement, but to a statement missing its subject or object; that is to say, structures of the form subject predicate blank or blank predicate object (such as Spielberg directed blank or blank directed Gladiator). Facts with a subject and predicate are called left facts because all the information they retain is on the left side of the statement. Similarly, facts with a predicate and an object are called right facts [12].

We now describe how the salience of a fact is calculated. Consider o(f) to be the number of times a fact is expressed in a set of unique subject predicate object triples. Consider also subj to be the number of unique subjects present in those triples, and obj to be the number of unique objects. For left facts, salience is calculated as follows:

salience(fact) = 1 − log(o(fact)) / log(obj)   (1)

And for right facts, the calculation is as follows:

salience(fact) = 1 − log(o(fact)) / log(subj)   (2)

The conceptual significance of subj and obj in these equations is to count the number of times each fact could potentially occur, since each left fact could potentially appear with every object in the data set, and each right fact could potentially appear with every subject. Facts that actually do appear with almost every available subject or object are extremely common, and are thus not very salient. Facts that are expressed about very few of the available subjects or objects are very rare and therefore highly salient. This intuition, and the resultant calculation, is grounded in the information theoretic concept of relative entropy, discussed in [13].
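Equations (1) and (2) translate directly into code. The sketch below is our own; the ratio of logarithms makes the logarithm base irrelevant, and the counts are taken over a set of unique (subject, predicate, object) triples as described above.

```python
import math
from collections import Counter

def salience_tables(triples):
    """Salience for every left fact (subject, predicate, _) and right fact
    (_, predicate, object) in a set of unique (s, p, o) triples, per Eqs. (1)-(2).
    A unique fact (o(fact) = 1) gets salience exactly 1; a fact that appears with
    almost every available object or subject gets salience near 0.
    Assumes more than one distinct subject and more than one distinct object."""
    left_counts = Counter((s, p) for s, p, o in triples)    # o(fact) for left facts
    right_counts = Counter((p, o) for s, p, o in triples)   # o(fact) for right facts
    num_subjects = len({s for s, _, _ in triples})
    num_objects = len({o for _, _, o in triples})
    left = {f: 1 - math.log(n) / math.log(num_objects) for f, n in left_counts.items()}
    right = {f: 1 - math.log(n) / math.log(num_subjects) for f, n in right_counts.items()}
    return left, right

# Example: ("Spielberg", "directed") occurs twice below, so its salience is below 1,
# while the unique right fact ("directed", "Gladiator") has salience exactly 1.
triples = {("Spielberg", "directed", "War of the Worlds"),
           ("Spielberg", "directed", "Raiders of the Lost Ark"),
           ("Ridley Scott", "directed", "Gladiator")}
left_salience, right_salience = salience_tables(triples)
```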
B. Weighting

The objective of salience weighting is to favor high- or low-salience facts in the generated explanations. This bias can be incorporated into the random permutation that is calculated at the beginning of each time step. In the unweighted explanation calculation, the random walkers choose which edge to take by stepping to the neighbor with the smallest index in this permutation. The original reason for the permutation was to encourage walkers that are near one another to converge, but it can also be modified to add other weights and biases into the explanation-generation process.

In order to encourage walkers' edge choices to favor more salient edges, we would like high-salience edges to be more likely to occur at low indices in the permutation. However, we do not want the distribution of salience in the permutation to be too consistent from trial to trial, otherwise low-salience edges will never reach the top of the rankings, restricting the edge choice of the walkers and severely limiting the breadth of explanations produced.

The permutation as originally described in Fogaras and Rácz ranks nodes randomly; however, the salience of a node is not a well-defined concept, and so to enable a salience-weighted permutation algorithm it was necessary to switch to ranking facts. This modification was justified for the following reason. Under Fogaras and Rácz's formulation, convergence required only that two walkers meet at the same point, and so to encourage convergence it was enough to encourage walkers to choose the same nodes to step to — hence nodes were what was ranked. However, under our formulation, convergence requires that the two walkers also follow the same edge label in the same direction. An edge label plus one endpoint makes up a fact, so to encourage convergence, we are justified in encouraging walkers to choose the same facts to walk along.

The weighted-permutation algorithm we developed works according to the common permutation strategy of assigning a random number to each element to be permuted and then sorting by those numbers. We devised a method to incorporate salience into the generation of those random numbers.

We made two attempts at designing the weighting algorithm. Our first, unsuccessful attempt at a biased algorithm contained an oversight which we corrected in the second version. The first, naive approach worked as follows:
1) Each fact fi is associated with a nonnegative weight wi
2) Each fact fi is assigned a random number ri between 0 and wi
3) Elements are ranked in ascending order according to ri

Using this algorithm, elements with a low weight are more likely to get a low number relative to other elements, and are therefore more likely to be ranked first. (Footnote 2: "Weight" might be a misnomer here, as it implies that elements with high weights are favored, but the opposite is the case with this algorithm.) To use this algorithm to favor salient facts, the weight function used to assign the wi values could be set to wi = √(1 − salience(fi)), or some other function such as wi = 1 − salience(fi).

When testing this algorithm using wi = 1 − salience(fi), a failure mode arose: namely, the very same facts would appear at the top time and time again. We determined that the problem occurred with unique facts because their salience is precisely 1, and their weight was therefore 0. When unique facts were assigned a random number between 0 and their weight they were always assigned 0. This meant that all the unique facts were always first in the list, and so only unique facts were ever generated.

To remedy this problem, we added a parameter to increase the randomness. The modified, successful algorithm works as follows:
1) Each fact fi is associated with a nonnegative weight wi
2) Given a parameter b, each fact is assigned a random number ri between 0 and b + (1 − b) ∗ wi
3) Elements are ranked in ascending order according to ri

In this way, the parameter b can be tuned to increase or decrease the randomness of the permutation.

The properties of the weight function do not constrain the calculation, so most any weight function could be used. To favor low-salience facts, for example, salience or the square root of salience can be used directly. To favor medium-salience facts, wi = salience(fi) ∗ (1 − salience(fi)) could be used. Any number of other functions are possible depending on the desired salience characteristics of the resulting explanation.
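A small sketch of the weighted permutation described above (our own code): setting b = 0 recovers the naive first attempt, while b > 0 gives the corrected algorithm. The helper names are ours; the two weight functions shown are among those suggested in the text, keeping in mind (per footnote 2) that a low weight means a fact is favored.

```python
import random

def weighted_permutation(facts, weight, b=0.1, rng=random):
    """Rank facts by drawing r_i uniformly from [0, b + (1 - b) * w_i] and sorting
    ascending.  Low-weight facts tend to draw small numbers and sort first, and
    b > 0 keeps zero-weight facts (e.g. unique facts under w = 1 - salience) from
    always monopolizing the front of the ranking.  b = 0 is the naive first attempt."""
    keyed = [(rng.uniform(0, b + (1 - b) * weight(f)), f) for f in facts]
    keyed.sort(key=lambda pair: pair[0])
    return [f for _, f in keyed]

# Weight functions mentioned in the text (salience values lie in [0, 1]):
def favor_high_salience(salience_of):
    return lambda f: 1 - salience_of(f)      # salient (rare) facts get low weight

def favor_low_salience(salience_of):
    return lambda f: salience_of(f)          # common facts get low weight
```

At each time step, the two walkers then follow the fact that appears earliest in this permutation among the choices available to both of them.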
C. Relevance of Salience-Weighting

As for whether salient facts are actually more useful, it seems to depend on the intended use case. In some cases, obvious paths are more useful, and in other cases, obscure paths are more useful. If the user knows very little about the area of inquiry, he or she is likely to prefer explanations that refer to common and well-understood facts and properties. Conversely, if the user is an expert in the domain, he or she is likely to prefer more obscure data. A user with average expertise will probably want only middle-salience explanations.

For example, consider a movie-watcher who has never seen any Star Wars movies. He or she may be interested to know about low-salience (i.e. common) facts like these:

{ofGenre, Science Fiction}
{hasMusicContributor, John Williams}
{hasActor, James Earl Jones}

These facts reference well-known people and broad genres, which could help give a novice a grasp of what relates the Star Wars movies to one another. High-salience facts such as the following:

{hasProducer, Gary Kurtz}
{hasDirector, Irvin Kershner}
{hasEditor, T. M. Christopher}

would be too obscure; an unfamiliar viewer is unlikely to know who, say, Irvin Kershner is, as he is not well-known for directing any other blockbusters. However, a Star Wars aficionado who already knows that the Star Wars movies are Science Fiction films will find the low-salience facts too obvious. He or she may be interested in knowing about the more unusual details that relate Star Wars Episode IV: A New Hope and Star Wars Episode V: The Empire Strikes Back to one another, and would be pleased to discover that the relatively-unknown editor T. M. Christopher was involved in the production.

Fortunately, our method enables the salience to be weighted by a custom weight function, so that the user can tune the salience of the resulting explanations to his or her needs. Additionally, because explanations are generated on-demand, the desired obscurity could be specified at query time, which would enable the user to ask for more or less obscure explanations in real time as they explore their data.
VII. APPLICATIONS

Our work on semantic similarity has been developed with an eye towards a variety of applications, primarily in the military and intelligence domains. In general terms, similarity measures and explanations are very useful tools for analysis of large graph-based datasets.

Consider a simple example. Suppose we are collating information on Libya and we encounter the profiles for the following individuals:
• Muammar al-Gaddafi
• Muammar El-Gadhafi
• Moammar Kadaffi

Despite considerable variation in their spellings, these names all refer to the same Libyan former head of state. In fact, some sources report over one hundred ways to spell this person's name [14] due to ambiguities in the transliteration from Arabic. It would be ideal if we could use additional information to disambiguate the names in order to ascertain that these profiles represent the same person. Using information about the relationships (and actions) of the person from each profile, we could derive similarity scores for each pair of profiles, and merge those profiles as appropriate.

In another setting, suppose we are monitoring the network activities of a group of employees of company X. Each employee belongs to one of four departments: Engineering, Sales, Finance, or Corporate. We can monitor email traffic, access to corporate applications, printers, file repositories and other network activities. Let's focus on George, who is in Sales. We receive alerts that George has been accessing financial software and financial projection data files. Is this activity unusual? Our algorithm could be applied to compare George and the profile of his activities with those of employees at his and other departments. Does George seem to be behaving more like employees in Finance than he was before, or less similarly to his fellow employees at Sales than we might expect? If so, it certainly indicates that George's behavior has changed for some reason, and may warrant investigation.

In essence, our similarity work can be applied to any data expressing relationships between entities. Our algorithm is highly scalable, so it can be applied on very large datasets, and our work on explanations allows the similarity results to be clearly communicated to end users of the data analysis. For these reasons, we believe our work to be highly applicable to a wide variety of military and intelligence tasks.
VIII. FUTURE WORK

The future of this work lies in two directions. The first is to perform experiments to assess the improvement of using SSDM as compared to SimRank and other approaches. We would also like to perform user tests to determine the perceived utility of different explanations in a real-world environment. The second direction lies in further research on extensions to the work and on new approaches to enhance it.

Experimentation with SSDM will likely take place in a relevant application domain such as social network analysis (in the intelligence domain) or computer network activity analysis (the cyber domain). Metrics for the meaning of similarity will have to be developed for each domain before the algorithm can be evaluated. This is especially true with user testing, where the experiments need to account for subjectivity and prior domain knowledge. Selecting and vetting the appropriate data sets for automated evaluation is another challenge — graph-based data that is manually annotated for similarity does not appear to be very common. One approach we may take is to compare structural similarity generated by SSDM to similarity derived from entity attribute comparison (presence and value of certain attributes). Many well-established algorithms exist in this area to provide a baseline for attribute-based similarity.

The most promising direction of further research we envision lies in the domain of calculating predicate similarity. With our current algorithm, walkers must traverse identical predicates — a strategy designed to prevent relations such as A is C and B isNever C from contributing positively to A's similarity with B. However, it intuitively seems that A is C and B isOften C should certainly contribute to A and B's similarity. Doing so would require calculating the similarity between predicates (in this case, is and isOften) before calculating the similarities between entities. One possible way to do this would be to run SSDM on the ontology to calculate predicate similarity before moving on to calculate object similarity. Another possible option would be to use a language-based metric as a source of predicate similarity, such as by using WordNet [15] similarity.

A third option considers the insight that "similar predicates are those that connect similar entities to other similar entities," for example, the predicates teaches and hasInstructor are considered similar because they appear in relationships such as Dr. Smith teaches Chemistry and CH1301 hasInstructor Dr. Jones, where Chemistry is known to be similar to CH1301 and Dr. Smith is similar to Dr. Jones. This formulation of predicate similarity is obviously recursive with respect to entity similarity: in order to calculate predicate similarity, we must first calculate entity similarity, and vice versa. However, SimRank, SSDM, and many other algorithms are also based on recursive definitions and derive iterative approximations to the optimal result set. Logically then, we could apply this recursive definition of predicate similarity in order to simultaneously derive both entity (subject and object) and predicate similarity. This is reminiscent of similar iterative and approximate approaches in robotic navigation to solve simultaneous localization and mapping (SLAM) [16][17] problems. It is reasonable to expect that predicate similarity may be derived analogously.

Additional directions also include determining what other weighting schemes besides salience-weighting might be useful. For example, if a dataset includes metadata about provenance or other information about the trustworthiness of each fact (node in the graph), then the SSDM calculation could be weighted to favor more trusted facts over facts from less reliable sources. Other possibilities surely exist.

And finally, while the current explanation format is certainly human-understandable, it does not yet read as easily as text. Fortunately, the data in question is already in subject-predicate-object form; that is to say, it is already in a sentence-like structure, and therefore natural language report generation is a goal well within reach. The additional work required would include determining what sentence structures each predicate fits with, and determining how to ensure that the subjects and objects were tagged with human-readable labels that could be included in a generated report (and not just URIs or other machine-readable descriptors).

IX. CONCLUSION

Questions of similarity crop up any time users want to make sense of data describing relationships between entities, and data of this form (i.e. graph data or linked data) is ubiquitous. The contributions we describe in this paper help users find similarities in graph data efficiently using SSDM, and understand those similarities using similarity explanations. We defined what an explanation of a similarity score should convey; we implemented a system that can produce such scores and explanations efficiently; and we enabled the obscurity of our explanations to be tuned to meet users' needs. We believe that these contributions will help users in the intelligence and defense communities to make sense of their data, by enabling them to not only find relevant similarities more efficiently, but also to understand those similarities on an intuitive level.

REFERENCES

[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The semantic web," Scientific American, May 2008. [Online]. Available: http://www.ds3web.it/miscellanea/the semantic web.pdf
[2] C. Bizer, T. Heath, and T. Berners-Lee, "Linked data — the story so far," International Journal on Semantic Web and Information Systems (IJSWIS), vol. 5, 2009. [Online]. Available: http://eprints.ecs.soton.ac.uk/21285/1/bizer-heath-berners-lee-ijswis-linked-data.pdf
[3] T. Heath. (2011) Linked data — connect distributed data across the web. [Online]. Available: http://linkeddata.org/
[4] G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD '02. New York, NY: ACM, 2002, pp. 538–543. [Online]. Available: http://doi.acm.org/10.1145/775047.775126
[5] M. P. Consens, O. Hassanzadeh, and A. M. Teisanu. (2008, September) Linked movie database. [Online]. Available: http://www.linkedmdb.org/
[6] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1–25, July 2008. [Online]. Available: http://doi.acm.org/10.1145/1376815.1376819
[7] F. M. Couto, M. J. Silva, and P. M. Coutinho, "Measuring semantic similarity between Gene Ontology terms," Data and Knowledge Engineering, vol. 61, no. 1, pp. 137–152, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0169023X06000875
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. [Online]. Available: http://books.google.com/books?id=si3R3Pfa98QC
[9] D. Fogaras and B. Rácz, "Scaling link-based similarity search," in Proceedings of the 14th international conference on World Wide Web, ser. WWW '05. New York, NY: ACM, 2005, pp. 641–650. [Online]. Available: http://doi.acm.org/10.1145/1060745.1060839
[10] D. Aldous and J. Fill, "Reversible Markov chains and random walks on graphs," 2002, in preparation. [Online]. Available: http://stat-www.berkeley.edu/users/aldous/RWG/book.html
[11] D. Kolas, I. Emmons, and M. Dean, "Efficient Linked-List RDF Indexing in Parliament," in Proceedings of the Fifth International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), ser. Lecture Notes in Computer Science, vol. 5823. Washington, DC: Springer, October 2009, pp. 17–32. [Online]. Available: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-517/ssws09-paper2.pdf
[12] Commonsense Computing Group. (2011, June) Tutorial: Making matrices from your own data. [Online]. Available: http://csc.media.mit.edu/docs/divisi1/tutorial make.html
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY: Wiley, 1991. [Online]. Available: http://www.elementsofinformationtheory.com/
[14] S. Bass, "How many different ways can you spell 'Gaddafi'?" ABC News, September 2009. [Online]. Available: http://abcnews.go.com/blogs/headlines/2009/09/how-many-different-ways-can-you-spell-gaddafi/
[15] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990. [Online]. Available: http://wordnet.princeton.edu/
[16] R. C. Smith and P. Cheeseman, "On the representation and estimation of spatial uncertainty," The International Journal of Robotics Research, vol. 5, pp. 56–68, December 1986. [Online]. Available: http://dl.acm.org/citation.cfm?id=33838.33842
[17] J. J. Leonard and H. F. Durrant-Whyte, "Simultaneous map building and localization for an autonomous mobile robot," in IEEE/RSJ International Workshop on Intelligent Robot Systems (IROS), November 1991, pp. 1442–1447. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=174711&tag=1