On Principles of Egocentric Person Search in Social Networks Sara Cohen Benny Kimelfeld Georgia Koutrika Jan Vondrák Dept. of Computer Science and Engineering IBM Research–Almaden Hebrew University of Jerusalem San Jose, CA 95120, USA sara@cs.huji.ac.il {kimelfeld,gkoutri,jvondrak}@us.ibm.com ABSTRACT topics such as social search [3, 9, 10] and link prediction [8]. Person search is the problem of finding, by means of key- The focus of this paper is on egocentric person search. word search, relevant people in a social network. In egocen- Person search is the problem of finding, by means of keyword tric person search, the search query is issued by a person s search, relevant people in a social network. Person search participating in the social network, and the goal is to find is an important type of query over a social network, as it people that possess two qualities: relevancy to the query, is an aid in finding people of interest. In egocentric person and relevancy to s herself. This position paper considers search, the search query is issued by a person s participating the latter quality, and specifically, scoring functions that in the social network, and the goal is to find people that rank persons by their relevancy to s. In particular, the pa- possess two qualities: relevancy to the query, and relevancy per proposes general principles (i.e., properties) that should to s herself. This position paper considers the latter quality, be held by such scoring functions. Several functions, which and specifically, scoring functions that rank persons by their were proposed in the past for measuring node connectivity, relevancy to a given node s. Results that are highly ranked are analyzed with respect to the proposed principles. It is by relevancy to the query poser s are people (transitively) shown that none of these functions sufficiently satisfy the trusted by s. Hence, s can be less wary of entering into a principles. In contrast, the paper presents two additional real-life relationship (social or otherwise) with these people. functions that satisfy the principles in a strong sense. Suppose, for example, that the searcher s is node Sally in the small fragment of a social network in Figure 1, and she poses the query “oral surgeon” (or “car mechanic”, “immi- 1. INTRODUCTION gration lawyer”, “really nice guy”). Obviously, her goal is to Online social networks have grown in popularity at an find a person satisfying the query, who is also trusted or rec- extraordinary pace over the last few years. In fact, social ommended by people who she trusts. Assuming that nodes networks, such as Facebook, MySpace and Twitter, have Tim, Ted and Tony are relevant to the keywords, our goal is become so widespread that they currently boast hundreds to measure their relevancy to s by taking into consideration of millions of users. The graph structure defined by a social the graph structure. network encodes interesting and useful information about Egocentric person search highly differs from social (web) the social relations between users. Leveraging this data to search [3,9,10]. The latter generally refers to the problem of effectively answer different types of queries is an interesting ranking web pages while taking into consideration social rela- and challenging problem. tions. In contrast, the former problem is that of ranking so- Abstractly, a social network is simply a graph of people. cial network nodes. The problem studied in this paper bears Directed edges indicate that one person (node) likes/trust- similarity to expert search [4]. However, the latter has not s/recommends another. (We use a directed model, as in taken into consideration the egocentric aspect of this prob- Twitter, to model possibly asymmetric relations.) In ad- lem. Another different yet related problem is efficient search dition, each node is associated with textual data, such as within a social network [1, 2, 13]; there, the focus is usually personal information, posts, etc. on efficiently finding nodes with given properties (and not Social networks have been the focus of extensive research, on sophisticated scoring functions). Link prediction, which studying metrics like centrality and cohesion, as well as phe- is the problem of predicting which social relations are likely nomena like the small world property. See [5] for a history of to be added to a social network [8], is also highly related. the development of social network analysis. More recently, (This relationship is discussed further in Section 4). Intu- online social networks have been studied in the context of itively, egocentric person search differs from link prediction in that we must rank nodes t who are relevant to the search keywords, even if it would a priori seem unlikely for s and t to form a social relation. Permission to make digital or hard copies of all or part of this work for As mentioned previously, when ranking results of an ego- personal or classroom use is granted without fee provided that copies are centric person search, one must measure relevancy to the not made or distributed for profit or commercial advantage and that copies query, and relevancy to s herself. The former can be quanti- bear this notice and the full citation on the first page. To copy otherwise, to fied using standard information retrieval ranking functions. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at the workshop Very Thus, the focus of this work is on the latter. As an aid to Large Data Search (VLDS) 2011. developing and studying node scoring functions, we present Copyright 2011. Tim vance of t for s in graph G. Relevance of t for s is important in a social setting; in particular, it measures how much s Sally Ted (her friends, friends of friends, and so forth) like/trust/rec- ommend t. In the following section we present three simple properties that any score(s, t, G) should satisfy. Tony 3. SCORING FUNCTION PROPERTIES Figure 1: Small fragment of a social network In this section, we fix a graph G and three distinct nodes v s, t, and v of G. We denote by πs,t (G) the graph that consists of all the simple paths from s to t through v. We several properties that seem intuitively appealing, which we are interested in three special types of nodes that are on expect to hold in any natural node scoring function. These paths from s to t: connectors, mergers, and splitters. Each properties are given in the form of graph manipulations, and of these plays a special part in G, and hence, manipulating how they effect scoring of nodes. Intuitively, the underlying such nodes, will yield a new graph, for which we will expect assumptions of our properties are, all else being equal, (1) that s and t will be even more closely related. a node closer to s should score higher than one farther from We say that v is a connector if v lies on a path from s to s, and (2) a node with multiple independent paths from s t, v has a single incoming edge (u, v), a single outgoing edge should score higher than one with fewer paths. For example, (v, w), and there is no edge from u to w. Intuitively, v is a in Figure 1, it would seem that Tim is more relevant to Sally connecting link on a path from s to t, and has no additional than Tony, who in turn is more relevant than Ted. All three interplay with the graph. We say that v is a merger if v nodes being at distance 2 from Sally, there are more inde- has multiple incoming edges that are on simple paths from pendent paths from Sally to Tim than there are to Tony s to t, v has a single outgoing edge, and v separates s from (and more to Tony than to Ted). Note that independent v t in the graph πs,t (G). Intuitively, merger nodes serve as a paths translate, in the real world, to independent opinions, merging point of multiple paths from s to t. Finally, v is and hence, are quite valuable. a splitter if v has a single incoming edge, multiple outgoing After presenting our general principles (properties), we edges that are on simple paths from s to t, and v separates consider several node scoring functions that have been mainly s from t in the graph πs,t v (G). Thus, v can be thought of used in the context of link prediction. We analyze the degree as splitting an incoming path from s into multiple diverging to which these functions satisfy our properties, and show paths to t. that this degree is insufficient: each of these functions satis- Our properties use three types of graph transformations, fies one or more properties in a trivial manner, or even vio- as depicted pictorially in Figure 2 and formally defined next. lates them. We then introduce two additional node scoring Let v be a node in G. functions, expected distance and reliability, and show that these functions satisfy all properties in a strong sense. 1. If v is a connector, with incoming edge (u, v) and out- going edge (v, w), then shortenv (G) denotes the graph that is obtained from G by removing v, and adding an 2. EGOCENTRIC PERSON SEARCH the edge (u, w). (See Figure 2(a).)1 A social network is a directed graph G(V, E), where V is a set of nodes, called people, and E is a set of edges. We 2. If v is a merger with outgoing edge (v, w) and incoming use a directed model to take into consideration asymmetric edges (u1 , v), . . . , (un , v), then unmergev (G) denotes social relations, as in Twitter. Thus, an edge (u, v) indicates the graph that is obtained from G by removing v, and that u views v in a positive light, that is, u likes/trusts/rec- for all i ∈ {1, . . . , n} adding a new node vi and the ommends v. In addition, each person v is associated with edges (ui , vi ) and (vi , w). (See Figure 2(b).) textual content, such as personal information, posts and so forth. A small fragment of a social network appears in Fig- 3. If v is a splitter with incoming edge (u, v) and outgoing ure 1. Note that most textual content has been omitted in edges (v, w1 ), . . . , (v, wn ), then unsplitv (G) denotes the this figure for simplicity in presentation (we only show the graph that is obtained from G by removing v, and for first names of the nodes). all i ∈ {1, . . . , n} adding a new node vi and the edges In egocentric person search, a keyword search query is (u, vi ) and (vi , wi ). (See Figure 2(c).) issued by a person s in the network, and the result is a ranked list t1 , . . . , tk of nodes. Abstractly, we denote a query as a We now list the properties that we expect a scoring func- pair (s, c), where s is the node initiating the query, and c is a tion score(s, t, G) to satisfy. string of keywords. For example, the query “tax consultant”, • Shortening property: If node v is a connector, then issued by Sally in Figure 1, will result in a ranked list of it holds that score(s, t, G) ≤ score(s, t, shortenv (G)). nodes, relevant to the keywords, from the graph. The main challenge in egocentric keyword search is to for- • Unmerge property: If node v is a merger, then it mulate an effective scoring function for results of a query holds that score(s, t, G) ≤ score(s, t, unmergev (G)). (s, c). Clearly, the score of a node t should take into consid- eration two aspects. First, how relevant are the keywords c • Unsplit property: If node v is a splitter, then it to t? Second, how relevant is t for s? For the former aspect, holds that score(s, t, G) ≤ score(s, t, unsplitv (G)). standard information retrieval scoring mechanisms can be used. Therefore, we focus on the latter aspect. Formally, we 1 Note that when a node v is removed, every edge that is will consider functions score(s, t, G) that measure the rele- incident to v is removed as well. of nodes s and t. Due to space limitations, this is not dis- u v w u w cussed further. (a) Shortening: shortenv (G) In the upcoming section we explore several scoring func- tions, and determine which of the properties they satisfy, u1 u1 v1 and thus, whether they are appropriate for use in ranking results of egocentric person search. u2 v w u2 v2 w 4. SCORING FUNCTIONS ... ... The goal of a scoring function is to quantify the relation- un un vn ship between s and t in a graph G, for the purpose of ego- (b) Unmerging: unmergev (G) centric person search. Attempts to quantify the relationship between nodes have been made in the past, for different goals. For example, the link prediction problem is defined w1 v1 w1 as follows: Given a snapshot of a social network at time T , link prediction is to accurately predict the edges that will be u v w2 u v2 w2 added to the network during the interval from time T to a given future time T 0 . Intuitively, a link is more likely to be ... ... ... ... added from s to t during this interval, if they are already wn vn wn well-related at time T . Hence, scoring functions for link prediction measure the relatedness of s and t. (c) Unsplitting: unsplitv (G) Several link prediction measures have been studied in the past. We consider some of the more prominent functions: Figure 2: Transformations • rdist(s, t, G) is the reciprocal of the distance from s to t in G, i.e., (dist(s, t, G))−1 , where dist(s, t, G) is the Intuitively, the shortening property should be satisfied, as length of the shortest path from s to t. shorter paths from s to t obviously reflect a closer relation- P∞ l l ship between the two. The unmerge and unsplit properties • allPaths(s, t, G) is the sum l=1 β |pathss,t,G | where l should be satisfied, as they intuitively state that introducing pathss,t,G is the set of all length-l paths from s to t in more independent opinions (i.e., disjoint edges), should only G [7, 8]. Thus, allPaths(s, t, G) directly sums over all improve the ranking of t. paths from s to t, exponentially dampening by length to count short paths more heavily. Example 3.1. To demonstrate these properties, let G be the graph of Figure 1, and let s be the node Sally. The node • rootPR(s, t, G) (i.e., rooted PageRank [8]) is the sta- on the path from Sally to Ted is a connector. Hence, remov- tionary probability of t in a random walk that returns ing this node, and directly connecting Sally to Ted, should to s with probability α at each step, moving to a ran- raise Ted’s score (shortening property). Observe the merger dom neighbor with probability 1 − α. node on the paths from Sally to Tony. Applying an unmerge transformation to this node would result in the graph struc- Table 1 shows which properties are satisfied (and to which ture currently existing between Sally and Tim, and hence, degree) by the three functions. should raise Tony’s score (unmerge property). Example 4.1. Consider again the graph in Figure 1. Sup- Note that all properties use the rather weak “≤” to in- pose that s is the node Sally, and we are interested in ranking dicate that the change in G should not lower score(s, t, G). nodes Tim, Ted and Tony. The function rdist gives the same Given a specific scoring function, the shortening property is score to all three nodes, as all are at distance 2 from Sally. satisfied in the strong sense if for all s, t, G and connector The function allPaths will score Tim and Tony equally (and nodes v, score(s, t, G) < score(s, t, shortenv (G)). (Note the above Ted), as they both have precisely one path of length 2, strict inequality.) Similarly, this property is satisfied in the and two paths of length 3. Similarly, rootPR will score Tim trivial sense if score(s, t, G) = score(s, t, shortenv (G)) always and Tony equally (and above Tim). Rooted PageRank can- holds. Finally, this property is satisfied in the weak sense not differentiate between the graph structure relating Sally if the inequality score(s, t, G) ≤ score(s, t, shortenv (G)) is to Tim, and that to Tony. On the other hand, it would seem sometimes strict. We define satisfaction in the strong, triv- that Tim should be the highest scoring, as its paths are in- ial and weak sense similarly for the other two properties. dependent (while those to Tony are not). None of the three scoring functions achieve such a scoring. Remark 3.2. Due to the small-world property, often ob- served in social networks [12], there is a high likelihood that any two given nodes will be connected by a short path. 4.1 Expected Distance and Reliability Hence, it sometimes may make sense to only take a lim- It is easy to see that with respect to our properties, none of ited neighborhood of a node into account when computing the functions considered thus far is a good fit for node scor- the scoring function. We do not directly formulate this re- ing. Therefore, we introduce two new functions (expected quirement as one of our properties. However, it can usually distance and reliability), which can be appropriate for node be taken into consideration by taking any scoring function, scoring. (To the best of our knowledge, these functions have and applying it only to a neighborhood-bounded projection not been considered in the past for related social network Shortening Unmerge Unsplit ... t1 s ... t2 rdist(s, t, G) weak trivial trivial ... ... ... allPaths(s, t, G) strong trivial trivial ... rootPR(s, t, G) strong trivial fails k−1 Figure 3: Expected distance vs. reliability Table 1: Table of property satisfaction. rel (s, t2 , G) > p. Since rel (s, t1 , G) = p, we get that rel ranks scoring problems.) t1 lower than t2 . However, the expected distance from s to t1 is p + (1 − p)m, and that from s to t2 is at least k; hence, Expected Distance. We first consider the expected dis- choosing the parameters to be such that k > p + (1 − p)m tance function. Intuitively, this function measures the ex- would make expd rank t1 higher than t2 . pected distance from s to t, when each edge is removed with probability p. Note that for this value to be well defined, we will chose a number m, that is returned by the function, 5. CONCLUSION when no path from s to t exists. This paper presents a first attempt at defining principles We fix a parameter m ∈ R, and a probability p ∈ (0, 1). that should guide node relevance ranking in egocentric per- We will implicitly assume that m is larger than the number son search. Three properties, determining how graph manip- of nodes in the graph G. The m-bounded distance from s to ulations should affect node scoring, were presented. Tradi- t, denoted δ̂G (s, t), is defined by tional node scoring functions were analyzed with respect to def these properties, as well as two additional functions, which δ̂G (s, t) = min{dist(s, t, G), m} . are shown to strongly satisfy the properties. Thus, if G has no path from s to t, then δ̂G (s, t) = m. Note For future work we intend to experimentally test the ef- fectiveness of various node scoring schemes and validate the that if s 6= t, then δ̂G (s, t) is always in the interval [1, m]. given properties. We also intend to study additional models We denote by Gr a random subgraph of G that is obtained of social networks, such as the undirected model (e.g., to by removing each edge of G, independently, with probability represent social relationships in Facebook), as well as social 1−p. The expected m-bounded distance, denoted by δG (s, t), networks with weak and strong (or more generally, weighted) is defined as follows. h i edges [6]. def δG (s, t) = E δ̂Gr (s, t) . Acknowledgements That is, δG (s, t) is the expected m-bounded distance from The research of Sara Cohen was partially supported by the s to t in a random subgraph Gr of G. Finally, our scoring ISF (Grant 143/09) and the Israeli Ministry of Science and function is the reciprocal of δG (s, t), namely Technology (Grant 3-6472). expd (s, t, G) = (δG (s, t))−1 . def 6.[1] L.REFERENCES Reliability. Reliability is another function that strongly Adamic and E. Adar. How to search a social network. Social Networks, 27(3):187–203, 2005. satisfies all properties. Intuitively, reliability measures the [2] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. Search likelihood that a random subgraph Gr of G contains a path in power-law networks. Phys. Rev., E 64(046135):187–203, from s to t. Intuitively, reliability satisfies the shortening 2001. property, since longer paths are more likely to be discon- [3] P. Briggs and B. Smyth. Harnessing trust in social search. In ECIR, pages 525–532, Berlin, Heidelberg, 2007. nected when a random subgraph is chosen. Similarly, re- Springer-Verlag. liability satisfies the unmerge and unsplit properties, since [4] R. D’Amore. Expertise community detection. In SIGIR, 2004. they give preference to graphs with disjoint paths, which in [5] L. C. Freeman. The Development of Social Network Analysis: turn, increase the likelihood of s and t being connected in A Study in the Sociology of Science. Empirical Press, 2004. Gr . Fixing a probability p ∈ (0, 1), we define [6] S. Granovetter. The strength of weak ties. American Journal of Sociology, 78:1360–80, 1973. rel (s, t, G) = Pr [Gr has a path from s to t] . [7] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953. The following theorem shows that both functions strongly [8] D. Liben-Nowell and J. M. Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019–1031, 2007. satisfy all three properties. The proof is nontrivial, and is [9] K. McNally, M. P. O’Mahony, B. Smyth, M. Coyle, and omitted, due to space restrictions. In the case of expected P. Briggs. Towards a reputation-based model of social web distance, the proof is based on the notion of stochastic or- search. In IUI, pages 179–188, New York, NY, USA, 2010. dering [11]. ACM. [10] R. Schenkel, T. Crecelius, M. Kacimi, S. Michel, T. Neumann, J. X. Parreira, and G. Weikum. Efficient top-k querying over Theorem 4.2. The functions expd (s, t, G) and rel (s, t, G) social-tagging networks. In SIGIR, pages 523–530, New York, strongly satisfy all three properties. NY, USA, 2008. ACM. [11] M. Shaked and J. G. Shanthikumar. Stochastic orders and Remark 4.3. Although expd and rel strongly satisfy all their applications. Probability and Mathematical Statistics. Academic Press Inc., Boston, MA, 1994. properties, they do not necessarily imply the same relative [12] J. Travers and S. Milgram. An experimental study of the small ranking for all nodes. To demonstrate that, we consider world problem. Sociometry, 32(4):425–443, 1969. the ranking of the nodes t1 and t2 in Figure 3. Assume that [13] S. B. Yang and H. Garcia-Molina. Improving search in there are sufficiently many disjoint paths from s to t2 so that peer-to-peer networks. In ICDCS, pages 5–14, 2002.