Finding and Explaining Similarities in Linked Data

Catherine Olsson
Raytheon BBN Technologies and Massachusetts Institute of Technology
catherio@mit.edu

Plamen Petrov, Jeff Sherman, Andrew Perez-Lopez
Raytheon BBN Technologies, Arlington, VA
{ppetrov,jsherman,aperezlo}@bbn.com




Abstract—Today's computer users and system designers face increasingly vast amounts of data, yet lack good tools to find pertinent information within those datasets. Linked data technologies add invaluable structure to data, but challenges remain in helping users understand and exploit that structure. One important question users might ask about their data is "What entities are similar to this one, and why?" or "How similar are these two entities to one another, and why?" Our work focuses on using the semantic content of linked data not only to facilitate the process of finding similar entities, but also to produce automatically generated and human-understandable explanations of what makes those entities similar. In this paper, we formulate a definition of an "explanation" of similarity, we describe a system that can produce such explanations efficiently, and we present a methodology that allows the user to tailor how "obvious" or "obscure" the provided explanations are.

I. INTRODUCTION

Today's world is a world of data. As technology advances, it becomes easier and easier to collect and store vast amounts of data. Much of this data can be viewed in terms of nodes with properties and relationships, or edges, among those nodes — that is to say, it can be represented as a graph. Once a dataset has been represented in a graph format, such as with Semantic Web [1] or other linked data technologies [2][3], it can easily be combined with data from different sources. In this way, linked data allows already-vast datasets to be readily combined and connected, giving users and programs access to more data than ever before. The challenge, then, is in making sense of this data.

Some of the data analysis questions that are emerging include the following: How does one entity in the linked data relate to another entity, possibly derived from a different source? How does a given entity relate to the rest of the data? What are the similarities between two entities, and why? There are also related data search and retrieval questions to be tackled, such as "Find all entities similar to this entity" and "Find groups of entities that are similar to each other."

To solve these problems, work has been done at Raytheon BBN Technologies (BBN) to devise a similarity measure called the Structural Semantic Distance Measure (SSDM), which leverages both the structural and semantic content of linked data to find similar entities. SSDM is based on SimRank [4], a highly domain-general similarity measure with an efficient approximate calculation. SSDM improves on SimRank by incorporating the semantic content of edge labels, and by achieving greater independence from ontological choices.

Raw numerical similarity scores provide very little insight to users about what those scores mean, so users often want an explanation of how a score should be understood and interpreted. In this paper, we formulate a definition of an "explanation" of an SSDM score that ensures that the explanation is both human-understandable and well-grounded in how SSDM scores are calculated. We also describe a system that can produce such explanations efficiently.

Additionally, not all users will desire the same level of detail in their explanations. Therefore, we present a methodology for allowing the user to tailor how "obvious" or "obscure" the provided explanations are. We expect that users who are investigating an unfamiliar domain will prefer "obvious" explanations that refer to common and well-known properties, while expert users will prefer "obscure" explanations that shed light on less well-known relationships and details.

A. Motivation

Our work is motivated by a number of problems in the intelligence and military research communities. Many of those problems are ubiquitous and translate directly to business intelligence, logistics, and planning. Take, for example, a model of a large organization C with its associates, their interactions, locations they visit, resources they use or produce, and events in which they participate. Given this information, one could explore the stated relationships among the constituents of C, such as "show all transactions that involve person X." Beyond these simple information-retrieval tasks, analysts might want to examine more complex (or less crisply defined) interactions. For example, "show all associates similar to Y" could be a very useful query when trying to learn more about person Y. Finally, given a subset S = {s1, s2, ..., sn} of members in the organization C, which might represent a group suspected of participating in nefarious activities, a query like "show all subsets of C similar to S" might be an excellent way to discover other suspicious clusters in the organization.

Note that in the example above we did not have any a priori knowledge of the organization other than its structure (which in general is a directed graph) and the elements for which we were searching. In particular, we did not assume any hierarchy, types of relationships present, or any statistical properties of the graph. It was also important that the queries were phrased in a general way using the word "similar" to indicate a degree of likeness, but not (necessarily) an exact match. Such problems occur every day in the military, intelligence, and
defense communities, as well as in the business and civilian worlds.

We have structured our algorithms and methodologies to be applicable to any data expressing entities and relationships between entities. Notably, much of the data encountered in the military and intelligence domains deals with entities and the relationships between them, and can therefore benefit from our contributions.

Throughout this paper, we include examples from the movie industry, drawn from a popular and widely accessible dataset about movies, actors, directors, film genres, and so on [5]. One can easily find direct analogies between this data and the types of data encountered in the intelligence and defense domains.

II. BACKGROUND

The general problem of similarity is twofold: first, to construct a measure of pairwise similarity so that a meaningful similarity score can be calculated for any pair of entities in a dataset; and second, to devise a method for efficiently retrieving the entities that are most similar to a given entity.

In this section, we discuss a similarity measure developed at BBN called the Structural Semantic Distance Measure (SSDM). SSDM is an extension of existing work on calculating similarity over unlabeled, directed graphs. The contribution of SSDM is to incorporate the semantic content contained in edge labels, and to achieve a greater independence from ontological choices for edge labeling.

Our work on SSDM builds on the SimRank algorithm by Jeh and Widom [4]. We chose to base our work on SimRank for the following reasons:

• SimRank is domain-independent in that it can be applied to any data representing relationships between entities. This is in contrast with domain-specific similarity algorithms, such as those that can only be used to compare documents [6], ontological categories [7], or some other domain-specific data type.
• SimRank can be computed efficiently in approximation, even over very large datasets, in contrast with measures that rely on Singular Value Decomposition or other computations that scale poorly [8].
• The approximate computation of SimRank can not only determine the similarity between two entities efficiently, but can also generate a list of the entities that are most similar to a given entity.
• The computation behind SimRank can be understood on a conceptual level, which makes it possible to explain a similarity score by referring directly to the computation performed. This would not be possible using a similarity measure that relied on more abstract calculations.
• SimRank looks beyond an entity's immediate neighborhood and features when determining similarity, which enables it to incorporate a broader scope of information about the structural context of entities.

All these positive attributes are retained in SSDM, along with several additional improvements.

A. SimRank

SimRank is based on the intuition that "two entities are similar if they are related to similar entities." While this statement may seem trivial at first, it leads directly to a simple mathematical definition of similarity: the similarity score between two entities is the average pairwise similarity of their neighbors, scaled by a decay factor.

Consider the example in Fig. 1. Intuitively, one would imagine that Movie 1 and Movie 2 should be similar, because they have two actors in common and they are both in the same genre. Additionally, Director 1 and Director 2 should be similar even though they have no immediate connections in common, because they directed similar movies. SimRank captures and formalizes this intuition.

[Fig. 1 omitted: a small graph in which Actor 1, Actor 2, and Genre 1 are each connected to both Movie 1 and Movie 2, while Movie 1 is connected to Director 1 and Movie 2 to Director 2.]
Fig. 1: This figure depicts relationships that exist between entities in a movie dataset. Director 1 and Director 2 have no immediate neighbors in common, but they are similar because they are related to similar movies.

Each pair's similarity is dependent on many other pairs, which may seem to be a barrier to computing their scores. Fortunately, this barrier is readily surmountable. On small datasets the system can be solved with an iterative algorithm, and on large datasets it can be solved using an efficient approximate method outlined by Fogaras and Rácz in [9]. Our implementation of the SSDM calculation is based on this efficient approximate method.
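To make the iterative formulation concrete, the following is a minimal Python sketch of the naive computation (our illustration, not code from [4] or from our implementation). The map in_nbrs from each node to its in-neighbors and the decay factor C are notational choices of this sketch; SimRank as defined in [4] propagates similarity along in-edges:

    from itertools import product

    def simrank(in_nbrs, C=0.8, iters=10):
        """Naive iterative SimRank over an unlabeled directed graph."""
        nodes = list(in_nbrs)
        sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
        for _ in range(iters):
            new = {}
            for a, b in product(nodes, nodes):
                if a == b:
                    new[(a, b)] = 1.0
                elif in_nbrs[a] and in_nbrs[b]:
                    # average pairwise similarity of in-neighbors, decayed by C
                    total = sum(sim[(x, y)] for x in in_nbrs[a] for y in in_nbrs[b])
                    new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
                else:
                    new[(a, b)] = 0.0  # a node with no in-neighbors matches nothing
            sim = new
        return sim

    # Fig. 1, assuming edges run from actors and genres into movies and
    # from movies into directors:
    g = {"Actor1": [], "Actor2": [], "Genre1": [],
         "Movie1": ["Actor1", "Actor2", "Genre1"],
         "Movie2": ["Actor1", "Actor2", "Genre1"],
         "Director1": ["Movie1"], "Director2": ["Movie2"]}
    scores = simrank(g)
    # scores[("Director1", "Director2")] is positive even though the two
    # directors share no immediate neighbors, matching the Fig. 1 intuition.

On a graph with n nodes this sketch stores O(n^2) scores and is far more expensive in time, which is exactly why the approximate method described next matters.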
The algorithm outlined by Fogaras and Rácz relies on the mathematical notion of a random walk through a graph, in which an abstract walker steps from node to node through the graph by following random edges [10]. In the original SimRank paper, Jeh and Widom observed that the SimRank score of two nodes can be approximated from the expected meeting time of two random walkers starting at those two nodes; a higher expected meeting time corresponds to a lower SimRank score. Fogaras and Rácz used this observation to develop an efficient and scalable algorithm for calculating similarity scores.

In the algorithm proposed in [9], one random walker is initialized per node in the graph, and each walker moves along one edge per time step. To reduce the amount of computation required, walkers are allowed to converge at their meeting point, and are thenceforth treated as a single walker without loss of correctness in the approximate calculation of expected meeting times. Fig. 2 demonstrates how walkers converge. Once the maximum number of steps has elapsed, the run is halted. Repeated runs are performed and the data are aggregated.
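The meeting-time idea can be rendered as a small Monte Carlo sketch (ours; it uses independent lockstep walks and omits the convergence and permutation machinery described next). Following the random-surfer formulation in [4], a run in which the walkers first meet after t steps contributes C**t, and runs with no meeting contribute nothing:

    import random

    def estimate_similarity(u, v, in_nbrs, C=0.8, max_steps=10, runs=10000):
        """Monte Carlo estimate of SimRank from first meeting times."""
        total = 0.0
        for _ in range(runs):
            a, b = u, v
            for t in range(1, max_steps + 1):
                if not in_nbrs[a] or not in_nbrs[b]:
                    break  # a walker is stuck; no meeting on this run
                a = random.choice(in_nbrs[a])  # both walkers step backwards,
                b = random.choice(in_nbrs[b])  # from objects toward subjects
                if a == b:
                    total += C ** t  # first meeting at step t
                    break
        return total / runs

Higher scores correspond to walkers that tend to meet quickly; the fingerprint machinery described below makes this computation scale to all pairs at once.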
[Fig. 2 omitted: a diagram of four walkers merging over successive time steps.]
Fig. 2: Walkers a, b, c, and d begin as independent walkers. As they walk (shown progressing from left to right), they meet one another. c and d meet at the first time step and converge. a and b meet next. Finally, on the far right of the diagram, all walkers have converged.

Additionally, to repair some deficiencies in the original formulation of SimRank, walkers in Fogaras and Rácz's algorithm are incentivized to converge when they are near one another. This is accomplished by randomly permuting all vertices in the graph at the start of each time step, with each walker stepping to the neighbor with the smallest index in the permutation.
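A sketch of one permutation-coupled time step (our illustration; positions is a hypothetical map from walker ids to current nodes, and nbrs maps each node to its neighbors):

    import random

    def coupled_step(positions, nbrs, all_nodes):
        """Advance every walker one step using a shared random permutation."""
        order = list(all_nodes)
        random.shuffle(order)  # fresh permutation at each time step
        rank = {node: i for i, node in enumerate(order)}
        # Every walker steps to the neighbor with the smallest index in the
        # permutation, so walkers on the same node move identically and
        # walkers with overlapping neighborhoods tend to land together.
        return {w: min(nbrs[node], key=lambda n: rank[n]) if nbrs[node] else node
                for w, node in positions.items()}

Because the same permutation is shared by all walkers, each individual walk is still random, but walkers whose neighborhoods overlap are likely to choose the same target node, which is what incentivizes nearby walkers to converge.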
The end result of one run of Fogaras and Rácz's scalable SimRank algorithm is a fingerprint graph, which encodes the first meeting times for each pair of walkers. Several runs are conducted, and the fingerprint graphs are compiled into a larger fingerprint database. The fingerprint database is precomputed and can be efficiently queried thereafter, either to retrieve a similarity score between any two nodes, or to retrieve the set of nodes whose similarity score with a given node exceeds some threshold.
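The fingerprint database can be pictured as nothing more than those recorded first meeting times. A sketch of the query side (ours), under the same scoring assumption as before, namely that a meeting at step t contributes C**t and a miss contributes 0:

    def similarity_from_fingerprints(fingerprints, u, v, C=0.8):
        """Score a node pair from precomputed first-meeting-time tables.

        fingerprints is a list with one dict per run, mapping a canonically
        ordered node pair to the step at which its walkers first met; pairs
        whose walkers never met are simply absent from that run's dict.
        """
        key = (u, v) if u <= v else (v, u)  # assumes comparable node labels
        hits = [C ** fp[key] for fp in fingerprints if key in fp]
        return sum(hits) / len(fingerprints)

Retrieving all nodes whose similarity with a given node exceeds a threshold then amounts to scanning that node's entries across the same tables.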
The implementation we devised for calculating SSDM retains the basic structure of the computation described above, including converging random walkers and permutation-based convergence incentivization.
B. SSDM — The Structural Semantic Distance Measure

The Structural Semantic Distance Measure developed at BBN is closely related to SimRank as described above, with two key enhancements: SSDM incorporates the semantic content of edge labels, and SSDM is independent of ontological choices — namely, which edge directionality each proposition should have.

Whereas SimRank is a measure over unlabeled directed graphs, SSDM incorporates edge labels. This makes it well-suited to any data in subject-predicate-object format, such as RDF or other linked data: subjects and objects are equivalent to nodes in the graph, and predicates are equivalent to labeled edges. The most important use of edge labels is in the way expected meeting time is calculated. The original SimRank computation defines "meeting time" as the time step at which two walkers step to the same node, at which point they converge. The SSDM computation has a stricter condition on convergence: when two walkers meet, it only counts as a convergence if they arrive at the same node on the same step and have traversed an identical sequence of predicates to reach that node. The reasoning behind this modification is that the semantic meaning of edge labels is critical to the similarity calculation. For example, two entities A and B may both be related to a third entity C, but they are certainly not similar if the relations in question are A isA C and B isNever C.

Additionally, SimRank only allows similarity to propagate along in-edges, which means that the original computation of SimRank only allows walkers to step backwards (that is, from objects to subjects). This makes SimRank highly dependent on ontological choices, because it is an arbitrary choice in a directed graph whether each label should be phrased in the forward or reverse direction. For example, the relation A isComponentOf B could be equally expressed as B hasComponent A; the choice of direction is arbitrary and can vary from dataset to dataset. Choosing and enforcing consistent edge directionality is a difficult issue in ontologies in general, so we did not want SSDM to be heavily dependent on arbitrary edge direction choices. As a result, SSDM allows walkers to walk in both directions.

Note that allowing walks in both directions requires us to distinguish a walker traversing A isComponentOf B from a walker traversing B isComponentOf A, as these two steps have very different semantic meanings. Therefore, in SSDM, it is not enough for walkers simply to have traversed identical predicates in order to converge; they must have traversed those predicates in the same direction (in or out).
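Rendered as code, the condition is compact. The sketch below (ours, not BBN's implementation) models a walker as its current node plus the trail of (predicate, direction) pairs it has traversed; the trails shown are the two paths of Fig. 3, discussed next:

    def ssdm_converged(w1, w2):
        """SSDM convergence test for two walkers advancing in lockstep.

        Each walker is a dict holding its current node and its trail of
        (predicate, direction) pairs, direction being "in" or "out".
        Lockstep advancement makes "same step" implicit; convergence also
        demands the same node and an identical trail.
        """
        return w1["node"] == w2["node"] and w1["trail"] == w2["trail"]

    w1 = {"node": "Harrison Ford",
          "trail": [("directed", "in"), ("directed", "out"), ("hasActor", "out")]}
    w2 = {"node": "Harrison Ford",
          "trail": [("hasActor", "out"), ("hasActor", "in"), ("hasActor", "out")]}
    assert not ssdm_converged(w1, w2)  # same node, different trails: no convergence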
To illustrate the conditions on convergence required by SSDM, consider the example of calculating the similarity between two movies, War of the Worlds and Gladiator, as shown in Fig. 3. Suppose Walker 1, starting from War of the Worlds, traverses the predicates directed_in, directed_out, hasActor_out to reach Harrison Ford. If Walker 2 follows the same predicates in the same order to reach Harrison Ford, as shown in Path A, then the two walkers will converge with a meeting time of three steps. If Walker 2 instead follows a different sequence of predicates, such as hasActor_out, hasActor_in, hasActor_out as shown in Path B, it will not converge with Walker 1, because the two walkers did not follow identical predicates to get there.

SSDM was designed as a domain-independent similarity measure that easily accounts for the semantics of labeled graph data without being dependent on ontological choices. This ability to incorporate semantic information is especially important in domains with rich semantic context, and allows SSDM to capture semantic nuances that are missed by less sophisticated similarity measures. The significance of this extra information in the measure is as yet unassessed, but we expect that SSDM should perform better than SimRank for semantic graphs. In short, SSDM is an efficient, semantically grounded, and ontology-independent algorithm for discovering similar entities in a linked dataset.
[Fig. 3 omitted: Walker 1 starts at War of the Worlds and Walker 2 at Gladiator; both can reach Harrison Ford. Walker 1 travels via Steven Spielberg and Raiders of the Lost Ark (directed, directed, hasActor); Walker 2 travels either via Ridley Scott and Blade Runner along the same predicate sequence (Path A) or via Richard Harris and Patriot Games along a different sequence (Path B).]
Fig. 3: This diagram depicts two possible ways — Path A and Path B — by which Walker 2 could meet Walker 1 at the Harrison Ford node. If Walker 2 takes Path A, they will converge. If Walker 2 takes Path B, they will not.

III. EXPLAINING SIMILARITIES

A similarity measure is useful for ranking items or pairs of items, but a numerical score alone gives little insight into why two entities are similar. The user can easily retrieve the similarity score for two entities but is then left wondering: what about those entities and their relations caused them to receive a high or low similarity score? What is the nature of their similarity? In addition, users may want more or less depth in the explanations provided.

In order to enable users to answer these questions, we sought to build a system that could provide explanations for similarity scores. Our three main contributions to the area of similarity explanations are as follows:

1) We formulated a definition of an "explanation" for a similarity score that is human-understandable, as well as appropriately grounded in the way that the similarity score was originally calculated.
2) We wrote a program to efficiently produce such explanations.
3) We further developed a methodology for biasing explanations towards either more "obvious" or more "obscure" facts.

IV. DEFINITION OF AN EXPLANATION

A good explanation of a similarity score must be both human-understandable and grounded in the original calculation of the similarity measure. An explanation that is not human-understandable is hardly an explanation at all, and an explanation that is generated by a computation entirely unlike the original calculation of the similarity score can hardly be considered an explanation of why that score was produced.

Recall that the SSDM computation calculates similarity scores based on repeated runs of converging random walkers. The faster two random walkers tend to converge, the higher the similarity score of their starting nodes will be.

It follows that for an explanation of a score to be well-grounded in how SSDM scores are calculated, it must somehow elucidate where and how walkers from the nodes in question tend to converge, and whether these convergences tend to happen rapidly or whether the walks are long. In order for this information to also be human-understandable, it must be relatively concise.

For these reasons, we decided that an explanation should consist of a brief list of common convergence points, along with a handful of concise chains of statements per convergence point describing the "best" relationships linking each starting node to that point. The "best" relationships may be the shortest chains of statements, or the ones that tend to be traversed most frequently, or (as explained later) relationships that are appropriately obvious or obscure. Fig. 4 shows an example of an explanation of this form.

John Williams
  {A New Hope, hasMusicContributor, John Williams}
  {The Empire Strikes Back, hasMusicContributor, John Williams}
Harrison Ford
  {A New Hope, hasActor, Harrison Ford}
  {The Empire Strikes Back, hasActor, Harrison Ford}
Star Wars (Film Collection)
  {A New Hope, inCollection, Star Wars}
  {The Empire Strikes Back, inCollection, Star Wars}
  {Revenge of the Sith, hasSequel, A New Hope}
    {Revenge of the Sith, inCollection, Star Wars}
  {A New Hope, hasSequel, The Empire Strikes Back}
    {A New Hope, inCollection, Star Wars}
  ...

Fig. 4: An excerpt from an explanation for the similarity between Star Wars Episode IV: A New Hope and Star Wars Episode V: The Empire Strikes Back, showing both one-relationship chains and multi-relationship chains.

By defining an "explanation" this way, we ensure that users are presented with a coherent explanation of where and how walkers from the nodes in question tend to converge. Additionally, such explanations are readily understandable as explanations of what commonalities the nodes have, and how they are related to each commonality.

V. OUR APPROACH

The most obvious way to extract an explanation for a computation seems to be to inspect the path that the computation followed to obtain its result. Unfortunately, in the case of our
SSDM calculation, such a strategy is inadequate. We have established that we would like to present chains of relations to the user. However, the fingerprint graphs produced by the SSDM computation record only when and where the walkers converged, discarding all information about the path taken by the walkers; furthermore, discarding this information is essential to maintaining acceptable space and performance characteristics for the calculation as a whole. While simply listing convergence points does provide the user with some intelligible information, it does not provide as rich an explanation as we would like.

Therefore, our approach to explanation generation is to re-run the SSDM calculation on a smaller scale at query time, and explicitly store the relations traversed rather than condensing the results into a terse fingerprint graph. Two alterations were required to make the modified SSDM calculation efficient enough to provide an acceptable user experience at query time. Both performance improvements were achievable because the modified calculation uses just two walkers rather than starting one walker at every node in the graph. Recall that the similarity of two entities is derived from the expected meeting times of walkers starting at those two entities. If we know in advance which two entities we will be comparing, there is no need to start walkers at any other entities. Because generating a pair of walks is so efficient compared to generating a whole graph's worth of walks, we can afford to run the computation many times per explanation request, and then choose from among the possible explanations to display relevant results to users.

The first performance improvement strategy relates to the permutation-based convergence strategy described in Section II-A. Each step of the ordinary SSDM calculation begins by randomly shuffling all edges in the graph. In the modified calculation, only the two walkers' immediate out-edges need to be shuffled, which is almost always a vastly smaller number of edges and takes a negligible amount of time.

The second performance improvement strategy relates to the strict edge-label requirements for convergence. In the ordinary SSDM calculation, walkers frequently meet at the same node but do not actually converge because they did not follow the same predicates. In the modified calculation, we instead require the two walkers to follow the same predicates as one another. So, in the example given in Fig. 3, Path B would never be generated; instead, either Walker 1 and Walker 2 would both follow a directed in-edge, or both would follow a hasActor out-edge, or both would follow some other shared edge not shown. If the two walkers were ever unable to follow the same predicate in the same direction, then that run of the computation would end. Coupling the edge options of the two walkers greatly reduces the number of trials that fail to converge, leading to a much more efficient calculation.
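A stripped-down sketch of one such query-time run (our illustration; uniform random choices stand in for the weighted permutation, and edges[n] is assumed to be a list of (predicate, direction, neighbor) triples covering both edge directions):

    import random

    def explanation_walk(edges, start1, start2, max_steps=8):
        """One re-run of the modified SSDM walk for a known pair of entities.

        Both walkers are forced to follow the same (predicate, direction)
        label at every step; the run ends when they meet (returning the
        meeting point and the two statement chains that explain it) or
        when no shared label remains.
        """
        n1, n2 = start1, start2
        path1, path2 = [], []
        for _ in range(max_steps):
            shared = list({(p, d) for p, d, _ in edges[n1]} &
                          {(p, d) for p, d, _ in edges[n2]})
            if not shared:
                return None  # the coupled walk has failed for this run
            p, d = random.choice(shared)  # coupled choice of edge label
            n1 = random.choice([t for q, e, t in edges[n1] if (q, e) == (p, d)])
            n2 = random.choice([t for q, e, t in edges[n2] if (q, e) == (p, d)])
            path1.append((p, d, n1))
            path2.append((p, d, n2))
            if n1 == n2:
                return {"meeting_point": n1, "chains": (path1, path2)}
        return None

Repeating this walk many times, tallying the meeting points, and keeping a few of the best chains per point yields an explanation in the format of Fig. 4.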
As a proof-of-concept for this methodology, we implemented an explanation-generating component for Parliament, an open-source triple store developed and maintained by BBN [11]. In Parliament, users are able to browse and view entities in the knowledge base, explore other triples containing an entity, and view a list of similar entities and their SSDM scores. As part of our work on similarity explanations, we added a "why" button for each score, which users can click to produce an on-demand explanation of that score. The software behind the "why" button interacts with the underlying Jena (http://openjena.org) model of the data in Parliament to walk the graph and produce an explanation using the methodology described in this section. Even for datasets with millions of triples, such as the movie dataset, preliminary findings show that accurate explanations can be produced and displayed to the user within seconds.

In summary, explanations for SSDM scores can be efficiently computed on demand at query time by re-running a modified version of the SSDM random walker calculation.

VI. USING SALIENCE TO BIAS EXPLANATIONS

The final contribution we describe in this paper is a method for incorporating salience into explanation generation. Salience is a measure of how rare a fact is in a dataset. Our goal was to produce the most "useful" explanations, which we believed would be the explanations with the most salient facts, because salient facts are rare and therefore highly descriptive.

We instead discovered that high-salience explanations often come across as obscure because they can contain extremely rare facts. Similarly, low-salience explanations often come across as obvious because they contain extremely common facts. Nonetheless, just as high-salience explanations often have the upside of being very descriptive, low-salience explanations often have the upside of revealing the broadest and most general similarities rather than obscure trivia. Which flavor of explanation is more "useful" likely depends on the goals of the user and requires more research in an application domain.

In this section we present the definition of salience as applied to facts, and explain how we constructed a salience-weighted edge generator so that the explanations generated would contain more or fewer high-salience facts. Note that while the following descriptions focus on biasing explanations by salience, the same mechanism can be used to favor facts based on any numerical property of those facts.

A. Fact Salience

Salience is a measure of how rare a fact is in a dataset. A fact in this case refers not to a whole statement, but to a statement missing its subject or object; that is to say, structures of the form subject predicate blank or blank predicate object (such as Spielberg directed blank or blank directed Gladiator). Facts with a subject and predicate are called left facts, because all the information they retain is on the left side of the statement. Similarly, facts with a predicate and an object are called right facts [12].

We now describe how the salience of a fact is calculated. Consider o(f) to be the number of times a fact f is expressed in a set of unique subject predicate object triples.
Consider also subj to be the number of unique subjects present in those triples, and obj to be the number of unique objects. For left facts, salience is calculated as follows:

    salience(fact) = 1 − log(o(fact)) / log(obj)        (1)

And for right facts, the calculation is as follows:

    salience(fact) = 1 − log(o(fact)) / log(subj)        (2)

The conceptual significance of subj and obj in these equations is to count the number of times each fact could potentially occur, since each left fact could potentially appear with every object in the dataset, and each right fact could potentially appear with every subject. Facts that actually do appear with almost every available subject or object are extremely common, and are thus not very salient. Facts that are expressed about very few of the available subjects or objects are very rare and therefore highly salient. This intuition, and the resultant calculation, is grounded in the information-theoretic concept of relative entropy, discussed in [13].
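Equations (1) and (2) translate directly into code. A sketch (ours), assuming a dataset with more than one distinct subject and more than one distinct object:

    import math
    from collections import Counter

    def fact_saliences(triples):
        """Compute left-fact and right-fact saliences over unique triples."""
        triples = list(triples)
        left_counts = Counter((s, p) for s, p, o in triples)   # o(f), left facts
        right_counts = Counter((p, o) for s, p, o in triples)  # o(f), right facts
        n_subj = len({s for s, _, _ in triples})
        n_obj = len({o for _, _, o in triples})
        left = {f: 1 - math.log(c) / math.log(n_obj)
                for f, c in left_counts.items()}
        right = {f: 1 - math.log(c) / math.log(n_subj)
                 for f, c in right_counts.items()}
        return left, right

A left fact expressed only once has o(fact) = 1 and hence salience exactly 1; one expressed with every available object has o(fact) = obj and salience 0.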
B. Weighting

The objective of salience weighting is to favor high- or low-salience facts in the generated explanations. This bias can be incorporated into the random permutation that is calculated at the beginning of each time step. In the unweighted explanation calculation, the random walkers choose which edge to take by stepping to the neighbor with the smallest index in this permutation. The original reason for the permutation was to encourage walkers that are near one another to converge, but it can also be modified to add other weights and biases to the explanation-generation process.

In order to encourage walkers' edge choices to favor more salient edges, we would like high-salience edges to be more likely to occur at low indices in the permutation. However, we do not want the distribution of salience in the permutation to be too consistent from trial to trial; otherwise, low-salience edges will never reach the top of the rankings, restricting the edge choice of the walkers and severely limiting the breadth of explanations produced.

The permutation as originally described by Fogaras and Rácz ranks nodes randomly; however, the salience of a node is not a well-defined concept, so to enable a salience-weighted permutation algorithm it was necessary to switch to ranking facts. This modification is justified for the following reason. Under Fogaras and Rácz's formulation, convergence required only that two walkers meet at the same point, and so to encourage convergence it was enough to encourage walkers to choose the same nodes to step to — hence nodes were what was ranked. However, under our formulation, convergence requires that the two walkers also follow the same edge label in the same direction. An edge label plus one endpoint makes up a fact, so to encourage convergence, we are justified in encouraging walkers to choose the same facts to walk along.

The weighted-permutation algorithm we developed works according to the common permutation strategy of assigning a random number to each element to be permuted and then sorting by those numbers. We devised a method to incorporate salience into the generation of those random numbers.

We made two attempts at designing the weighting algorithm. Our first, unsuccessful attempt contained an oversight, which we corrected in the second version. The first, naive approach worked as follows:

1) Each fact f_i is associated with a nonnegative weight w_i.
2) Each fact f_i is assigned a random number r_i between 0 and w_i.
3) Elements are ranked in ascending order according to r_i.

Using this algorithm, elements with a low weight are more likely to get a low number relative to other elements, and are therefore more likely to be ranked first. ("Weight" might be a misnomer here, as it implies that elements with high weights are favored, but the opposite is the case with this algorithm.) To use this algorithm to favor salient facts, the weight function used to assign the w_i values could be set to w_i = 1 − salience(f_i), or some other function such as w_i = sqrt(1 − salience(f_i)).

When testing this algorithm using w_i = 1 − salience(f_i), a failure mode arose: namely, the very same facts would appear at the top time and time again. We determined that the problem occurred with unique facts, because their salience is precisely 1, and their weight was therefore 0. When unique facts were assigned a random number between 0 and their weight, they were always assigned 0. This meant that all the unique facts were always first in the list, and so only unique facts were ever generated.

To remedy this problem, we added a parameter to increase the randomness. The modified, successful algorithm works as follows:

1) Each fact f_i is associated with a nonnegative weight w_i.
2) Given a parameter b, each fact is assigned a random number r_i between 0 and b + (1 − b) · w_i.
3) Elements are ranked in ascending order according to r_i.

In this way, the parameter b can be tuned to increase or decrease the randomness of the permutation.

The properties of the weight function do not constrain the calculation, so almost any weight function could be used. To favor low-salience facts, for example, salience or the square root of salience can be used directly. To favor medium-salience facts, w_i = salience(f_i) · (1 − salience(f_i)) could be used. Any number of other functions are possible, depending on the desired salience characteristics of the resulting explanation.
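A sketch of the corrected scheme (ours; weight is any function mapping a fact to a nonnegative number, typically into [0, 1]):

    import random

    def weighted_permutation(facts, weight, b=0.1):
        """Rank facts so that low-weight facts tend to surface first.

        Each fact draws a random key from [0, b + (1 - b) * weight(fact)]
        and facts are sorted by key. The blending parameter b keeps
        zero-weight (fully salient) facts from always winning, which was
        the failure mode of the naive version.
        """
        return sorted(facts,
                      key=lambda f: random.uniform(0, b + (1 - b) * weight(f)))

    # Example weight functions over a salience value s in [0, 1]:
    #   favor salient facts:   weight = lambda f: 1 - salience[f]
    #   favor common facts:    weight = lambda f: salience[f]
    #   favor middling facts:  weight = lambda f: salience[f] * (1 - salience[f])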
C. Relevance of Salience-Weighting

As for whether salient facts are actually more useful, the answer seems to depend on the intended use case. In some cases obvious paths are more useful, and in other cases obscure paths are more useful. If the user knows very little about the area of inquiry, he or she is likely to prefer explanations that refer to common and well-understood facts and properties. Conversely, if the user is an expert in the domain, he or she is likely to
prefer more obscure data. A user with average expertise will probably want only middle-salience explanations.

For example, consider a movie-watcher who has never seen any Star Wars movies. He or she may be interested to know about low-salience (i.e., common) facts like these:

  {ofGenre, Science Fiction}
  {hasMusicContributor, John Williams}
  {hasActor, James Earl Jones}

These facts reference well-known people and broad genres, which could help give a novice a grasp of what relates the Star Wars movies to one another. High-salience facts such as the following:

  {hasProducer, Gary Kurtz}
  {hasDirector, Irvin Kershner}
  {hasEditor, T. M. Christopher}

would be too obscure; an unfamiliar viewer is unlikely to know who, say, Irvin Kershner is, as he is not well known for directing any other blockbusters. However, a Star Wars aficionado who already knows that the Star Wars movies are Science Fiction films will find the low-salience facts too obvious. He or she may be interested in knowing about the more unusual details that relate Star Wars Episode IV: A New Hope and Star Wars Episode V: The Empire Strikes Back to one another, and would be pleased to discover that the relatively unknown editor T. M. Christopher was involved in the production.

Fortunately, our method enables the salience to be weighted by a custom weight function, so that the user can tune the salience of the resulting explanations to his or her needs. Additionally, because explanations are generated on demand, the desired obscurity could be specified at query time, which would enable the user to ask for more or less obscure explanations in real time as they explore their data.

VII. APPLICATIONS

Our work on semantic similarity has been developed with an eye towards a variety of applications, primarily in the military and intelligence domains. In general terms, similarity measures and explanations are very useful tools for the analysis of large graph-based datasets.

Consider a simple example. Suppose we are collating information on Libya and we encounter profiles for the following individuals:

• Muammar al-Gaddafi
• Muammar El-Gadhafi
• Moammar Kadaffi

Despite considerable variation in their spellings, these names all refer to the same Libyan former head of state. In fact, some sources report over one hundred ways to spell this person's name [14], due to ambiguities in the transliteration from Arabic. It would be ideal if we could use additional information to disambiguate the names in order to ascertain that these profiles represent the same person. Using information about the relationships (and actions) of the person from each profile, we could derive similarity scores for each pair of profiles, and merge those profiles as appropriate.

In another setting, suppose we are monitoring the network activities of a group of employees of company X. Each employee belongs to one of four departments: Engineering, Sales, Finance, or Corporate. We can monitor email traffic and access to corporate applications, printers, file repositories, and other network resources. Let's focus on George, who is in Sales. We receive alerts that George has been accessing financial software and financial projection data files. Is this activity unusual? Our algorithm could be applied to compare George and the profile of his activities with those of employees in his and other departments. Does George seem to be behaving more like employees in Finance than he was before, or less similarly to his fellow employees in Sales than we might expect? If so, it certainly indicates that George's behavior has changed for some reason, and may warrant investigation.

In essence, our similarity work can be applied to any data expressing relationships between entities. Our algorithm is highly scalable, so it can be applied to very large datasets, and our work on explanations allows the similarity results to be clearly communicated to end users of the data analysis. For these reasons, we believe our work to be highly applicable to a wide variety of military and intelligence tasks.

VIII. FUTURE WORK

The future of this work lies in two directions. The first is to perform experiments to assess the improvement of using SSDM as compared to SimRank and other approaches. We would also like to perform user tests to determine the perceived utility of different explanations in a real-world environment. The second direction lies in further research on extensions to the work and on new approaches to enhance it.

Experimentation with SSDM will likely take place in a relevant application domain such as social network analysis (in the intelligence domain) or computer network activity analysis (in the cyber domain). Metrics for the meaning of similarity will have to be developed for each domain before the algorithm can be evaluated. This is especially true for user testing, where the experiments need to account for subjectivity and prior domain knowledge. Selecting and vetting appropriate data sets for automated evaluation is another challenge — graph-based data that is manually annotated for similarity does not appear to be very common. One approach we may take is to compare the structural similarity generated by SSDM to similarity derived from entity attribute comparison (the presence and value of certain attributes). Many well-established algorithms exist in this area to provide a baseline for attribute-based similarity.

The most promising direction of further research we envision lies in the domain of calculating predicate similarity. With our current algorithm, walkers must traverse identical predicates — a strategy designed to prevent relations such as A is C and B isNever C from contributing positively to A's
similarity with B. However, it intuitively seems that A is C and B isOften C should certainly contribute to A and B's similarity. Doing so would require calculating the similarity between predicates (in this case, is and isOften) before calculating the similarities between entities. One possible way to do this would be to run SSDM on the ontology to calculate predicate similarity before moving on to calculate object similarity. Another possible option would be to use a language-based metric as a source of predicate similarity, such as WordNet [15] similarity.

A third option considers the insight that "similar predicates are those that connect similar entities to other similar entities." For example, the predicates teaches and hasInstructor are considered similar because they appear in relationships such as Dr. Smith teaches Chemistry and CH1301 hasInstructor Dr. Jones, where Chemistry is known to be similar to CH1301 and Dr. Smith is similar to Dr. Jones. This formulation of predicate similarity is obviously recursive with respect to entity similarity: in order to calculate predicate similarity, we must first calculate entity similarity, and vice versa. However, SimRank, SSDM, and many other algorithms are also based on recursive definitions and derive iterative approximations to the optimal result set. Logically, then, we could apply this recursive definition of predicate similarity in order to simultaneously derive both entity (subject and object) and predicate similarity. This is reminiscent of the iterative and approximate approaches used in robotic navigation to solve simultaneous localization and mapping (SLAM) [16][17] problems, and it is reasonable to expect that predicate similarity may be derived analogously.
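As one speculative sketch of how the recursion might be closed (ours, consistent with the insight above but not an implemented system), a predicate-similarity update could score two predicates by the similarity of the entity pairs they connect, and then alternate with an entity-similarity update until both settle:

    def update_predicate_sim(entity_sim, usage, p, q):
        """Score predicates p and q from current entity similarities.

        usage[p] is the set of (subject, object) pairs that predicate p
        connects, and entity_sim maps entity pairs to current scores;
        two predicates are similar when they connect similar subjects
        to similar objects.
        """
        pairs = [(entity_sim[(s1, s2)] + entity_sim[(o1, o2)]) / 2.0
                 for (s1, o1) in usage[p] for (s2, o2) in usage[q]]
        return sum(pairs) / len(pairs) if pairs else 0.0

Alternating this update with an entity-similarity pass (for example, an SSDM variant in which walkers may traverse sufficiently similar rather than strictly identical predicates) would realize the simultaneous derivation suggested above.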
Additional directions also include determining what other weighting schemes besides salience-weighting might be useful. For example, if a dataset includes metadata about provenance or other information about the trustworthiness of each fact (node in the graph), then the SSDM calculation could be weighted to favor more trusted facts over facts from less reliable sources. Other possibilities surely exist.

And finally, while the current explanation format is certainly human-understandable, it does not yet read as easily as text. Fortunately, the data in question is already in subject-predicate-object form; that is to say, it is already in a sentence-like structure, and therefore natural language report generation is a goal well within reach. The additional work required would include determining what sentence structures each predicate fits with, and determining how to ensure that the subjects and objects are tagged with human-readable labels that could be included in a generated report (and not just URIs or other machine-readable descriptors).

IX. CONCLUSION

Questions of similarity crop up any time users want to make sense of data describing relationships between entities, and data of this form (i.e., graph data or linked data) is ubiquitous. The contributions we describe in this paper help users find similarities in graph data efficiently using SSDM, and understand those similarities using similarity explanations. We defined what an explanation of a similarity score should convey; we implemented a system that can produce such scores and explanations efficiently; and we enabled the obscurity of our explanations to be tuned to meet users' needs. We believe that these contributions will help users in the intelligence and defense communities to make sense of their data, by enabling them not only to find relevant similarities more efficiently, but also to understand those similarities on an intuitive level.

REFERENCES

[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The semantic web," Scientific American, May 2001. [Online]. Available: http://www.ds3web.it/miscellanea/the_semantic_web.pdf
[2] C. Bizer, T. Heath, and T. Berners-Lee, "Linked data — the story so far," International Journal on Semantic Web and Information Systems (IJSWIS), vol. 5, 2009. [Online]. Available: http://eprints.ecs.soton.ac.uk/21285/1/bizer-heath-berners-lee-ijswis-linked-data.pdf
[3] T. Heath. (2011) Linked data — connect distributed data across the web. [Online]. Available: http://linkeddata.org/
[4] G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '02. New York, NY: ACM, 2002, pp. 538–543. [Online]. Available: http://doi.acm.org/10.1145/775047.775126
[5] M. P. Consens, O. Hassanzadeh, and A. M. Teisanu. (2008, September) Linked movie database. [Online]. Available: http://www.linkedmdb.org/
[6] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1–25, July 2008. [Online]. Available: http://doi.acm.org/10.1145/1376815.1376819
[7] F. M. Couto, M. J. Silva, and P. M. Coutinho, "Measuring semantic similarity between Gene Ontology terms," Data and Knowledge Engineering, vol. 61, no. 1, pp. 137–152, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0169023X06000875
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. [Online]. Available: http://books.google.com/books?id=si3R3Pfa98QC
[9] D. Fogaras and B. Rácz, "Scaling link-based similarity search," in Proceedings of the 14th International Conference on World Wide Web, ser. WWW '05. New York, NY: ACM, 2005, pp. 641–650. [Online]. Available: http://doi.acm.org/10.1145/1060745.1060839
[10] D. Aldous and J. Fill, "Reversible Markov chains and random walks on graphs," 2002, in preparation. [Online]. Available: http://stat-www.berkeley.edu/users/aldous/RWG/book.html
[11] D. Kolas, I. Emmons, and M. Dean, "Efficient linked-list RDF indexing in Parliament," in Proceedings of the Fifth International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), ser. Lecture Notes in Computer Science, vol. 5823. Washington, DC: Springer, October 2009, pp. 17–32. [Online]. Available: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-517/ssws09-paper2.pdf
[12] Commonsense Computing Group. (2011, June) Tutorial: Making matrices from your own data. [Online]. Available: http://csc.media.mit.edu/docs/divisi1/tutorial_make.html
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY: Wiley, 1991. [Online]. Available: http://www.elementsofinformationtheory.com/
[14] S. Bass, "How many different ways can you spell 'Gaddafi'?" ABC News, September 2009. [Online]. Available: http://abcnews.go.com/blogs/headlines/2009/09/how-many-different-ways-can-you-spell-gaddafi/
[15] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990. [Online]. Available: http://wordnet.princeton.edu/
[16] R. C. Smith and P. Cheeseman, "On the representation and estimation of spatial uncertainty," The International Journal of Robotics Research, vol. 5, pp. 56–68, December 1986. [Online]. Available: http://dl.acm.org/citation.cfm?id=33838.33842
[17] J. J. Leonard and H. F. Durrant-Whyte, "Simultaneous map building and localization for an autonomous mobile robot," in IEEE/RSJ International Workshop on Intelligent Robot Systems (IROS), November 1991, pp. 1442–1447. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=174711