On Principles of Egocentric Person Search in
                                  Social Networks

                         Sara Cohen                                           Benny Kimelfeld Georgia Koutrika                   Jan Vondrák
     Dept. of Computer Science and Engineering                                                   IBM Research–Almaden
           Hebrew University of Jerusalem                                                       San Jose, CA 95120, USA
                    sara@cs.huji.ac.il                                               {kimelfeld,gkoutri,jvondrak}@us.ibm.com


ABSTRACT                                                                              topics such as social search [3, 9, 10] and link prediction [8].
Person search is the problem of finding, by means of key-                                The focus of this paper is on egocentric person search.
word search, relevant people in a social network. In egocen-                          Person search is the problem of finding, by means of keyword
tric person search, the search query is issued by a person s                          search, relevant people in a social network. Person search
participating in the social network, and the goal is to find                          is an important type of query over a social network, as it
people that possess two qualities: relevancy to the query,                            is an aid in finding people of interest. In egocentric person
and relevancy to s herself. This position paper considers                             search, the search query is issued by a person s participating
the latter quality, and specifically, scoring functions that                          in the social network, and the goal is to find people that
rank persons by their relevancy to s. In particular, the pa-                          possess two qualities: relevancy to the query, and relevancy
per proposes general principles (i.e., properties) that should                        to s herself. This position paper considers the latter quality,
be held by such scoring functions. Several functions, which                           and specifically, scoring functions that rank persons by their
were proposed in the past for measuring node connectivity,                            relevancy to a given node s. Results that are highly ranked
are analyzed with respect to the proposed principles. It is                           by relevancy to the query poser s are people (transitively)
shown that none of these functions sufficiently satisfy the                           trusted by s. Hence, s can be less wary of entering into a
principles. In contrast, the paper presents two additional                            real-life relationship (social or otherwise) with these people.
functions that satisfy the principles in a strong sense.                                 Suppose, for example, that the searcher s is node Sally in
                                                                                      the small fragment of a social network in Figure 1, and she
                                                                                      poses the query “oral surgeon” (or “car mechanic”, “immi-
1.     INTRODUCTION                                                                   gration lawyer”, “really nice guy”). Obviously, her goal is to
   Online social networks have grown in popularity at an                              find a person satisfying the query, who is also trusted or rec-
extraordinary pace over the last few years. In fact, social                           ommended by people who she trusts. Assuming that nodes
networks, such as Facebook, MySpace and Twitter, have                                 Tim, Ted and Tony are relevant to the keywords, our goal is
become so widespread that they currently boast hundreds                               to measure their relevancy to s by taking into consideration
of millions of users. The graph structure defined by a social                         the graph structure.
network encodes interesting and useful information about                                 Egocentric person search highly differs from social (web)
the social relations between users. Leveraging this data to                           search [3,9,10]. The latter generally refers to the problem of
effectively answer different types of queries is an interesting                       ranking web pages while taking into consideration social rela-
and challenging problem.                                                              tions. In contrast, the former problem is that of ranking so-
   Abstractly, a social network is simply a graph of people.                          cial network nodes. The problem studied in this paper bears
Directed edges indicate that one person (node) likes/trust-                           similarity to expert search [4]. However, the latter has not
s/recommends another. (We use a directed model, as in                                 taken into consideration the egocentric aspect of this prob-
Twitter, to model possibly asymmetric relations.) In ad-                              lem. Another different yet related problem is efficient search
dition, each node is associated with textual data, such as                            within a social network [1, 2, 13]; there, the focus is usually
personal information, posts, etc.                                                     on efficiently finding nodes with given properties (and not
   Social networks have been the focus of extensive research,                         on sophisticated scoring functions). Link prediction, which
studying metrics like centrality and cohesion, as well as phe-                        is the problem of predicting which social relations are likely
nomena like the small world property. See [5] for a history of                        to be added to a social network [8], is also highly related.
the development of social network analysis. More recently,                            (This relationship is discussed further in Section 4). Intu-
online social networks have been studied in the context of                            itively, egocentric person search differs from link prediction
                                                                                      in that we must rank nodes t who are relevant to the search
                                                                                      keywords, even if it would a priori seem unlikely for s and t
                                                                                      to form a social relation.
Permission to make digital or hard copies of all or part of this work for                As mentioned previously, when ranking results of an ego-
personal or classroom use is granted without fee provided that copies are             centric person search, one must measure relevancy to the
not made or distributed for profit or commercial advantage and that copies            query, and relevancy to s herself. The former can be quanti-
bear this notice and the full citation on the first page. To copy otherwise, to       fied using standard information retrieval ranking functions.
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. This article was presented at the workshop Very
                                                                                      Thus, the focus of this work is on the latter. As an aid to
Large Data Search (VLDS) 2011.                                                        developing and studying node scoring functions, we present
Copyright 2011.
                                             Tim                    vance of t for s in graph G. Relevance of t for s is important
                                                                    in a social setting; in particular, it measures how much s
             Sally                           Ted                    (her friends, friends of friends, and so forth) like/trust/rec-
                                                                    ommend t. In the following section we present three simple
                                                                    properties that any score(s, t, G) should satisfy.
                                             Tony
                                                                    3.     SCORING FUNCTION PROPERTIES
     Figure 1: Small fragment of a social network                      In this section, we fix a graph G and three distinct nodes
                                                                                                           v
                                                                    s, t, and v of G. We denote by πs,t       (G) the graph that
                                                                    consists of all the simple paths from s to t through v. We
several properties that seem intuitively appealing, which we        are interested in three special types of nodes that are on
expect to hold in any natural node scoring function. These          paths from s to t: connectors, mergers, and splitters. Each
properties are given in the form of graph manipulations, and        of these plays a special part in G, and hence, manipulating
how they effect scoring of nodes. Intuitively, the underlying       such nodes, will yield a new graph, for which we will expect
assumptions of our properties are, all else being equal, (1)        that s and t will be even more closely related.
a node closer to s should score higher than one farther from           We say that v is a connector if v lies on a path from s to
s, and (2) a node with multiple independent paths from s            t, v has a single incoming edge (u, v), a single outgoing edge
should score higher than one with fewer paths. For example,         (v, w), and there is no edge from u to w. Intuitively, v is a
in Figure 1, it would seem that Tim is more relevant to Sally       connecting link on a path from s to t, and has no additional
than Tony, who in turn is more relevant than Ted. All three         interplay with the graph. We say that v is a merger if v
nodes being at distance 2 from Sally, there are more inde-          has multiple incoming edges that are on simple paths from
pendent paths from Sally to Tim than there are to Tony              s to t, v has a single outgoing edge, and v separates s from
(and more to Tony than to Ted). Note that independent                                 v
                                                                    t in the graph πs,t (G). Intuitively, merger nodes serve as a
paths translate, in the real world, to independent opinions,        merging point of multiple paths from s to t. Finally, v is
and hence, are quite valuable.                                      a splitter if v has a single incoming edge, multiple outgoing
   After presenting our general principles (properties), we         edges that are on simple paths from s to t, and v separates
consider several node scoring functions that have been mainly       s from t in the graph πs,t v
                                                                                                  (G). Thus, v can be thought of
used in the context of link prediction. We analyze the degree       as splitting an incoming path from s into multiple diverging
to which these functions satisfy our properties, and show           paths to t.
that this degree is insufficient: each of these functions satis-       Our properties use three types of graph transformations,
fies one or more properties in a trivial manner, or even vio-       as depicted pictorially in Figure 2 and formally defined next.
lates them. We then introduce two additional node scoring           Let v be a node in G.
functions, expected distance and reliability, and show that
these functions satisfy all properties in a strong sense.                1. If v is a connector, with incoming edge (u, v) and out-
                                                                            going edge (v, w), then shortenv (G) denotes the graph
                                                                            that is obtained from G by removing v, and adding an
2.    EGOCENTRIC PERSON SEARCH                                              the edge (u, w). (See Figure 2(a).)1
   A social network is a directed graph G(V, E), where V is
a set of nodes, called people, and E is a set of edges. We               2. If v is a merger with outgoing edge (v, w) and incoming
use a directed model to take into consideration asymmetric                  edges (u1 , v), . . . , (un , v), then unmergev (G) denotes
social relations, as in Twitter. Thus, an edge (u, v) indicates             the graph that is obtained from G by removing v, and
that u views v in a positive light, that is, u likes/trusts/rec-            for all i ∈ {1, . . . , n} adding a new node vi and the
ommends v. In addition, each person v is associated with                    edges (ui , vi ) and (vi , w). (See Figure 2(b).)
textual content, such as personal information, posts and so
forth. A small fragment of a social network appears in Fig-              3. If v is a splitter with incoming edge (u, v) and outgoing
ure 1. Note that most textual content has been omitted in                   edges (v, w1 ), . . . , (v, wn ), then unsplitv (G) denotes the
this figure for simplicity in presentation (we only show the                graph that is obtained from G by removing v, and for
first names of the nodes).                                                  all i ∈ {1, . . . , n} adding a new node vi and the edges
   In egocentric person search, a keyword search query is                   (u, vi ) and (vi , wi ). (See Figure 2(c).)
issued by a person s in the network, and the result is a ranked
list t1 , . . . , tk of nodes. Abstractly, we denote a query as a      We now list the properties that we expect a scoring func-
pair (s, c), where s is the node initiating the query, and c is a   tion score(s, t, G) to satisfy.
string of keywords. For example, the query “tax consultant”,
                                                                         • Shortening property: If node v is a connector, then
issued by Sally in Figure 1, will result in a ranked list of
                                                                           it holds that score(s, t, G) ≤ score(s, t, shortenv (G)).
nodes, relevant to the keywords, from the graph.
   The main challenge in egocentric keyword search is to for-
                                                                         • Unmerge property: If node v is a merger, then it
mulate an effective scoring function for results of a query
                                                                           holds that score(s, t, G) ≤ score(s, t, unmergev (G)).
(s, c). Clearly, the score of a node t should take into consid-
eration two aspects. First, how relevant are the keywords c              • Unsplit property: If node v is a splitter, then it
to t? Second, how relevant is t for s? For the former aspect,              holds that score(s, t, G) ≤ score(s, t, unsplitv (G)).
standard information retrieval scoring mechanisms can be
used. Therefore, we focus on the latter aspect. Formally, we        1
                                                                      Note that when a node v is removed, every edge that is
will consider functions score(s, t, G) that measure the rele-       incident to v is removed as well.
                                                                     of nodes s and t. Due to space limitations, this is not dis-
           u       v     w              u     w                      cussed further.
                   (a) Shortening: shortenv (G)
                                                                        In the upcoming section we explore several scoring func-
                                                                     tions, and determine which of the properties they satisfy,
           u1                          u1     v1                     and thus, whether they are appropriate for use in ranking
                                                                     results of egocentric person search.
           u2      v    w              u2     v2    w
                                                                     4.     SCORING FUNCTIONS
            ...


                                        ...
                                                                        The goal of a scoring function is to quantify the relation-
           un                          un     vn                     ship between s and t in a graph G, for the purpose of ego-
                  (b) Unmerging: unmergev (G)                        centric person search. Attempts to quantify the relationship
                                                                     between nodes have been made in the past, for different
                                                                     goals. For example, the link prediction problem is defined
                        w1                    v1    w1               as follows: Given a snapshot of a social network at time T ,
                                                                     link prediction is to accurately predict the edges that will be
           u       v    w2              u     v2    w2               added to the network during the interval from time T to a
                                                                     given future time T 0 . Intuitively, a link is more likely to be
                         ...


                                              ...

                                                    ...
                                        ...


                                                                     added from s to t during this interval, if they are already
                        wn                    vn    wn               well-related at time T . Hence, scoring functions for link
                                                                     prediction measure the relatedness of s and t.
                   (c) Unsplitting: unsplitv (G)                        Several link prediction measures have been studied in the
                                                                     past. We consider some of the more prominent functions:
                  Figure 2: Transformations
                                                                          • rdist(s, t, G) is the reciprocal of the distance from s to
                                                                            t in G, i.e., (dist(s, t, G))−1 , where dist(s, t, G) is the
  Intuitively, the shortening property should be satisfied, as              length of the shortest path from s to t.
shorter paths from s to t obviously reflect a closer relation-                                              P∞ l           l
ship between the two. The unmerge and unsplit properties                  • allPaths(s, t, G) is the sum       l=1 β |pathss,t,G | where
                                                                                  l
should be satisfied, as they intuitively state that introducing             pathss,t,G is the set of all length-l paths from s to t in
more independent opinions (i.e., disjoint edges), should only               G [7, 8]. Thus, allPaths(s, t, G) directly sums over all
improve the ranking of t.                                                   paths from s to t, exponentially dampening by length
                                                                            to count short paths more heavily.
   Example 3.1. To demonstrate these properties, let G be
the graph of Figure 1, and let s be the node Sally. The node              • rootPR(s, t, G) (i.e., rooted PageRank [8]) is the sta-
on the path from Sally to Ted is a connector. Hence, remov-                 tionary probability of t in a random walk that returns
ing this node, and directly connecting Sally to Ted, should                 to s with probability α at each step, moving to a ran-
raise Ted’s score (shortening property). Observe the merger                 dom neighbor with probability 1 − α.
node on the paths from Sally to Tony. Applying an unmerge
transformation to this node would result in the graph struc-           Table 1 shows which properties are satisfied (and to which
ture currently existing between Sally and Tim, and hence,            degree) by the three functions.
should raise Tony’s score (unmerge property).
                                                                        Example 4.1. Consider again the graph in Figure 1. Sup-
   Note that all properties use the rather weak “≤” to in-           pose that s is the node Sally, and we are interested in ranking
dicate that the change in G should not lower score(s, t, G).         nodes Tim, Ted and Tony. The function rdist gives the same
Given a specific scoring function, the shortening property is        score to all three nodes, as all are at distance 2 from Sally.
satisfied in the strong sense if for all s, t, G and connector       The function allPaths will score Tim and Tony equally (and
nodes v, score(s, t, G) < score(s, t, shortenv (G)). (Note the       above Ted), as they both have precisely one path of length 2,
strict inequality.) Similarly, this property is satisfied in the     and two paths of length 3. Similarly, rootPR will score Tim
trivial sense if score(s, t, G) = score(s, t, shortenv (G)) always   and Tony equally (and above Tim). Rooted PageRank can-
holds. Finally, this property is satisfied in the weak sense         not differentiate between the graph structure relating Sally
if the inequality score(s, t, G) ≤ score(s, t, shortenv (G)) is      to Tim, and that to Tony. On the other hand, it would seem
sometimes strict. We define satisfaction in the strong, triv-        that Tim should be the highest scoring, as its paths are in-
ial and weak sense similarly for the other two properties.           dependent (while those to Tony are not). None of the three
                                                                     scoring functions achieve such a scoring.
   Remark 3.2. Due to the small-world property, often ob-
served in social networks [12], there is a high likelihood that
any two given nodes will be connected by a short path.               4.1     Expected Distance and Reliability
Hence, it sometimes may make sense to only take a lim-                 It is easy to see that with respect to our properties, none of
ited neighborhood of a node into account when computing              the functions considered thus far is a good fit for node scor-
the scoring function. We do not directly formulate this re-          ing. Therefore, we introduce two new functions (expected
quirement as one of our properties. However, it can usually          distance and reliability), which can be appropriate for node
be taken into consideration by taking any scoring function,          scoring. (To the best of our knowledge, these functions have
and applying it only to a neighborhood-bounded projection            not been considered in the past for related social network
                          Shortening    Unmerge       Unsplit                                               ...
                                                                                     t1      s              ...         t2
        rdist(s, t, G)          weak      trivial     trivial


                                                                                                    ...

                                                                                                           ...

                                                                                                                  ...
    allPaths(s, t, G)          strong     trivial     trivial                                               ...
     rootPR(s, t, G)           strong     trivial      fails                                              k−1

                                                                          Figure 3: Expected distance vs. reliability
       Table 1: Table of property satisfaction.

                                                                    rel (s, t2 , G) > p. Since rel (s, t1 , G) = p, we get that rel ranks
scoring problems.)                                                  t1 lower than t2 . However, the expected distance from s to
                                                                    t1 is p + (1 − p)m, and that from s to t2 is at least k; hence,
Expected Distance. We first consider the expected dis-              choosing the parameters to be such that k > p + (1 − p)m
tance function. Intuitively, this function measures the ex-         would make expd rank t1 higher than t2 .
pected distance from s to t, when each edge is removed with
probability p. Note that for this value to be well defined,
we will chose a number m, that is returned by the function,         5.    CONCLUSION
when no path from s to t exists.                                       This paper presents a first attempt at defining principles
   We fix a parameter m ∈ R, and a probability p ∈ (0, 1).          that should guide node relevance ranking in egocentric per-
We will implicitly assume that m is larger than the number          son search. Three properties, determining how graph manip-
of nodes in the graph G. The m-bounded distance from s to           ulations should affect node scoring, were presented. Tradi-
t, denoted δ̂G (s, t), is defined by                                tional node scoring functions were analyzed with respect to
                         def                                        these properties, as well as two additional functions, which
               δ̂G (s, t) = min{dist(s, t, G), m} .                 are shown to strongly satisfy the properties.
Thus, if G has no path from s to t, then δ̂G (s, t) = m. Note          For future work we intend to experimentally test the ef-
                                                                    fectiveness of various node scoring schemes and validate the
that if s 6= t, then δ̂G (s, t) is always in the interval [1, m].
                                                                    given properties. We also intend to study additional models
   We denote by Gr a random subgraph of G that is obtained
                                                                    of social networks, such as the undirected model (e.g., to
by removing each edge of G, independently, with probability
                                                                    represent social relationships in Facebook), as well as social
1−p. The expected m-bounded distance, denoted by δG (s, t),
                                                                    networks with weak and strong (or more generally, weighted)
is defined as follows.
                                    h        i                      edges [6].
                               def
                    δG (s, t) = E δ̂Gr (s, t) .
                                                                    Acknowledgements
That is, δG (s, t) is the expected m-bounded distance from
                                                                    The research of Sara Cohen was partially supported by the
s to t in a random subgraph Gr of G. Finally, our scoring
                                                                    ISF (Grant 143/09) and the Israeli Ministry of Science and
function is the reciprocal of δG (s, t), namely
                                                                    Technology (Grant 3-6472).
                 expd (s, t, G) = (δG (s, t))−1 .
                                  def

                                                                    6.[1] L.REFERENCES
Reliability. Reliability is another function that strongly                   Adamic and E. Adar. How to search a social network. Social
                                                                         Networks, 27(3):187–203, 2005.
satisfies all properties. Intuitively, reliability measures the
                                                                     [2] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. Search
likelihood that a random subgraph Gr of G contains a path                in power-law networks. Phys. Rev., E 64(046135):187–203,
from s to t. Intuitively, reliability satisfies the shortening           2001.
property, since longer paths are more likely to be discon-           [3] P. Briggs and B. Smyth. Harnessing trust in social search. In
                                                                         ECIR, pages 525–532, Berlin, Heidelberg, 2007.
nected when a random subgraph is chosen. Similarly, re-                  Springer-Verlag.
liability satisfies the unmerge and unsplit properties, since        [4] R. D’Amore. Expertise community detection. In SIGIR, 2004.
they give preference to graphs with disjoint paths, which in         [5] L. C. Freeman. The Development of Social Network Analysis:
turn, increase the likelihood of s and t being connected in              A Study in the Sociology of Science. Empirical Press, 2004.
Gr . Fixing a probability p ∈ (0, 1), we define                      [6] S. Granovetter. The strength of weak ties. American Journal
                                                                         of Sociology, 78:1360–80, 1973.
        rel (s, t, G) = Pr [Gr has a path from s to t] .             [7] L. Katz. A new status index derived from sociometric analysis.
                                                                         Psychometrika, 18(1):39–43, 1953.
  The following theorem shows that both functions strongly           [8] D. Liben-Nowell and J. M. Kleinberg. The link-prediction
                                                                         problem for social networks. JASIST, 58(7):1019–1031, 2007.
satisfy all three properties. The proof is nontrivial, and is
                                                                     [9] K. McNally, M. P. O’Mahony, B. Smyth, M. Coyle, and
omitted, due to space restrictions. In the case of expected              P. Briggs. Towards a reputation-based model of social web
distance, the proof is based on the notion of stochastic or-             search. In IUI, pages 179–188, New York, NY, USA, 2010.
dering [11].                                                             ACM.
                                                                    [10] R. Schenkel, T. Crecelius, M. Kacimi, S. Michel, T. Neumann,
                                                                         J. X. Parreira, and G. Weikum. Efficient top-k querying over
   Theorem 4.2. The functions expd (s, t, G) and rel (s, t, G)           social-tagging networks. In SIGIR, pages 523–530, New York,
strongly satisfy all three properties.                                   NY, USA, 2008. ACM.
                                                                    [11] M. Shaked and J. G. Shanthikumar. Stochastic orders and
  Remark 4.3. Although expd and rel strongly satisfy all                 their applications. Probability and Mathematical Statistics.
                                                                         Academic Press Inc., Boston, MA, 1994.
properties, they do not necessarily imply the same relative
                                                                    [12] J. Travers and S. Milgram. An experimental study of the small
ranking for all nodes. To demonstrate that, we consider                  world problem. Sociometry, 32(4):425–443, 1969.
the ranking of the nodes t1 and t2 in Figure 3. Assume that         [13] S. B. Yang and H. Garcia-Molina. Improving search in
there are sufficiently many disjoint paths from s to t2 so that          peer-to-peer networks. In ICDCS, pages 5–14, 2002.