Benchmarking the Privacy-Preserving People Search
                                     Shuguang Han, Daqing He and Zhen Yue
                                School of Information Sciences, University of Pittsburgh
                                  135 N Bellefield Ave., Pittsburgh, PA, United States
                                 shh69@pitt.edu, dah44@pitt.edu, zhy18@pitt.edu

ABSTRACT
People search is an important topic in information retrieval. Many previous studies on this topic employed social networks to boost search performance by incorporating local network features (e.g. the common connections between the querying user and candidates in social networks), global network features (e.g. PageRank), or both. However, the available social network information can be restricted by the privacy settings of the users involved, which in turn affects the performance of people search. In this paper, we therefore focus on the privacy issues in people search. Because privacy-concerned networks are not available, we propose simulating different privacy settings with a public social network. Our study examines the influence of privacy concerns on the local and global network features, and their impact on the performance of people search. Our results show that: 1) the privacy concerns of different people in the network have different influences, and people with higher association (i.e. higher degree in the network) have a much greater impact on the performance of people search; 2) local network features are more sensitive to privacy concerns, especially when such concerns come from high-association people in the network who are also related to the querying user. As the first study on this topic, we hope to generate further discussion of these issues.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Search process

Keywords
People Search; Privacy-preserving networks; Privacy-preserving people search

1. INTRODUCTION
Modern search engines often assume that their search algorithms should return the documents most relevant to a query. However, on many occasions, users actually want to look for relevant people rather than documents. For example, company recruiters may need to find appropriate candidates for a job opening [1], or conference chairs may need to invite the right experts to form a program committee [2]. These topics have been studied as expert finding problems in the information retrieval community [3], where an expert is often defined as a person who has domain knowledge of a given topic. However, expert finding is only one type of people search task. Many other scenarios, such as finding appropriate collaborators [4] or thesis committee members [5], require not only topical expertise matching but also social matching [6], because a higher social similarity makes it easier for people to connect.

In order to perform social matching, retrieval systems need to access users’ social networks and return potential candidates who have either direct or indirect connections with the given users. However, privacy has been identified as a major concern in many social network services [7, 8]: users often either opt out from certain social networks or provide incomplete or even fake social network information. Early research has shown that many data mining algorithms may not work, or may even harm the user experience, when equipped with such incomplete and noisy social information [9]. Recently, researchers have started to incorporate social information into people search systems, often using the coauthor networks generated from scholarly publications [4, 5, 10]. However, probably because coauthor networks raise fewer privacy concerns, little attention has been paid to privacy-related issues in people search. Furthermore, there is no study on how incomplete social networks affect the performance of people search systems.

In this paper, we are particularly interested in the privacy issues in people search and the impact of these issues on people search performance. The TREC experience demonstrates that the lack of appropriate test beds is a critical drawback for studying search problems. Considering the difficulty of obtaining an open privacy-concerned social network and the expense of constructing such a network from scratch for research purposes, we propose in this paper to simulate the privacy-concerned social network using publicly available coauthor networks. Note that users in many social network services are able to keep both their profiles and their social connections private. In this paper, we focus on the privacy issues of sharing social connections.

The key assumption of our simulation is that a coauthor network has the same or similar network characteristics as a privacy-concerned social network. Our simulation approach builds on existing studies [11-13], which state that many real-world social networks (including coauthor networks and many privacy-concerned networks such as Facebook social networks) share the same patterns: they are small-world networks and their degree distributions are highly skewed. Newman [14] studied the assortative patterns (the preference for connecting to people who share similar features) of social networks. He found that social networks showed assortatively mixed patterns, whereas technological and biological networks seem to be disassortative. Therefore, it is reasonable to assume that coauthor networks and many privacy-preserving networks (because both are social networks) share some important common characteristics. Consequently, coauthor networks, which are publicly available, can be used as a surrogate for studying privacy-preserving social networks. In the remaining part of this paper, all privacy-related discussions are based on the coauthor network and coauthor network-based people search.

In order to study the impact of privacy concerns on people search performance, we need to examine how social network information is used in existing people search systems. We refer to the features that are propagated through the whole network as global network features, while local network features are those that are directly related to the ego-network of the querying user [15]. Some people search systems adopted only the local network features [4], whereas some others used both the local and
global network features [5, 10]. For example, Han et al. [5] took into consideration both the local social similarity between the querying user and each returned candidate (measured by the proportion of common social connections) and the global authority of each returned candidate (measured by the PageRank value computed on the whole social network). They found that combining both global and local network features with topic relevance provides better support for modeling diverse people search contexts and further improves the search experience. Since both the global and local network features played important roles in people search systems [5], the study of privacy needs to consider both.

Both the local and global network features can be influenced by the completeness of social network information. Therefore, a privacy-preserving network with many private (unrevealed) social connections affects the calculation of the global and local network features, which may in turn affect people search performance. The incomplete social contexts of the querying user and of the network candidates affect the calculation of the proportion of common coauthors between them. This is why we examine the impact of privacy concerns on the local network features for both network candidates and querying users. When analyzing the local network features, we study the privacy settings of querying users and candidates separately. The global network features rely on information propagation through the whole network, which is only related to network candidates. We therefore study global network features for network candidates only.

In summary, we find that privacy-preserving people search is still an almost untouched research topic. In this paper, we make the first attempt to provide some benchmarks by simulating privacy-preserving networks and examining how these networks affect the performance of people search. To achieve this goal, we need to properly simulate different types of privacy-preserving networks. A privacy-preserving network is essentially a subset of the full network, so we model different privacy concerns as different sampling strategies (the purpose is to sample a subset of privacy-concerned people). We discuss sampling strategies in Section 2. Specifically, our research questions are:

 RQ1: How can different types of privacy-preserving social networks be properly simulated?
 RQ2: How does each type of privacy-preserving network affect the global and local network features?
 RQ3: How do the global and local network features derived from privacy-preserving networks further affect people search performance?

2. DATASET AND METHODOLOGY

2.1 Experiment Dataset
Our experiments in this paper reuse the user study data and the publication collection presented in Han et al. [5]. The dataset used in that study was an academic publication collection containing 219,677 conference papers from the ACM Digital Library. These papers were published in academic conferences (the full list of conferences is available at the ACM Digital Library, http://dl.acm.org/proceedings.cfm) between 1990 and 2013. Only publicly available information about a paper (the title, abstract and authors) was collected. The unique identifier assigned by the ACM Digital Library was used to identify each author, and no further author disambiguation step was performed. In total, the collection contains 253,390 unique authors and 953,685 coauthor connection instances. The collection therefore contains both content information about papers (title and abstract) and the social network of authors (i.e., the coauthor network).

The goal of the user study presented in Han et al. [5] was to evaluate a people search system. The study involved four different people search tasks, each of which aimed to find 5 candidates satisfying a querying user’s search need. Two systems were used in the study: a baseline plain content-based people search system and an experimental system that enhances people search with three interactive facets: content relevance, social similarity between the user and a candidate (the local network feature) and the authority of a candidate (the global network feature). The experimental system allowed the querying users to tune the importance associated with each facet in order to generate a better candidate search result. 24 participants were recruited for the user study. At the beginning of the user study, each participant was asked to provide their publications and their social connections (such as advisors). In the post-task questionnaire, the participants were asked to rate the relevance of each marked candidate on a five-point Likert scale (1 as non-relevant and 5 as highly relevant).

We reuse the data from [5] in the following ways. First, we use the same academic publication collection, which contains both the papers and the coauthor networks. Second, we use the marked highly relevant candidates (i.e., those with ratings higher than 3) from the user study as our ground truth, which is further used to measure the effectiveness of the search algorithms under different privacy-preserving network scenarios.

2.2 Configuring Privacy-Preserving Networks
We identify two different types of users in our study: the people who initiate the people search requests (i.e. the participants in the user study, hereafter called querying users) and the candidates in the publication collection and the coauthor networks (hereafter called candidates). We treat them differently because: 1) although many querying users are on the coauthor network, some others may not be; 2) more importantly, we believe that the calculation of local network features can be influenced by the privacy settings of the querying users as well as the candidates, and that the impact of the privacy settings of these two types of users differs.

2.2.1 Modeling Privacy for the Candidates
Although privacy settings are related to various factors, those factors result in a common outcome: a user either has privacy concerns or not. We assume that there is a probability (i.e. p_i) for each candidate being privacy-concerned. Because people can play different roles in a network, we think that modeling privacy concern as being associated with the candidate’s degree of association (i.e. the coauthor relationships) on the network is a reasonable approach to studying the impact of privacy settings for people with different roles. There are two extremes for which candidates have privacy concerns: 1) the candidates with the top degree of association have privacy concerns; or 2) the candidates with the bottom degree of association have privacy concerns.

Suppose that for each candidate i, his/her degree of association on the network is d_i and the maximum degree on the network is d_max. Eq. 1 provides one formula, with a parameter λ, for modeling how candidates with different degrees of association on the network come to have privacy concerns. When λ is set to negative or positive values, we can obtain different simulations for indicating either
top-degree or bottom-degree candidates to have more privacy concerns. The absolute value of λ corresponds to the strength of the emphasis on top-degree or bottom-degree candidates. When λ is set to 0, the distribution is uniform and each candidate has the same probability.

p_i = \left( \frac{d_i}{d_{max}} \right)^{\lambda}    Eq. 1

Besides λ, we need another parameter to control the proportion of candidates on the network who have privacy concerns (denoted as p_b). In this paper, we test nine different values of p_b (from 0.1 to 0.9, in steps of 0.1) and, under each p_b, we also test different values of λ. For each pair <λ, p_b>, we sample 10 different runs to remove the bias. Our reported results are based on the average over those 10 runs. Specifically, suppose that we have N candidates and that N × p_b of them have privacy concerns. The goal of sampling, therefore, is to return N × p_b sampled privacy-concerned candidates. Our sampling algorithm is a “sampling without replacement” (see Figure 1).

Algorithm: Sampling privacy-concerned candidates
Input: N, p_b and λ; Output: N × p_b privacy-concerned candidates U
Procedure:
1: compute p_i using Eq. 1 for every candidate, put the values in array P[] and compute the sum S of P[]
2: for run = 1 : 10
3:   M = N; restore P[] and S from step 1; empty U
4:   for i = 1 : N × p_b   // sampling N × p_b candidates
5:     randomly generate a number r in [0, S)
6:     for a = 1 : M
7:       if P[1] + ... + P[a] > r
8:         put the corresponding candidate into U
9:         S = S - P[a]
10:        remove P[a] from P[]
11:        break
12:    M = M - 1

Figure 1: Algorithm for generating the privacy-concerned candidates

2.2.2 Modeling Privacy for the Querying Users
The local network feature in this paper refers to the proportion of common social connections between the querying user and an existing candidate. Therefore, the privacy settings of both people influence the calculation of the local network feature. Modeling privacy concerns for the candidates has been discussed above; here we present our modeling of the privacy concerns of the querying users. The social connections of the querying users were obtained from the users themselves in the user study (for more details see Han et al. [5]). In that study, each participant was asked to provide his/her personal information as well as his/her close social connections.

Privacy-conscious users may either provide no personal social information or provide only incomplete information. In our study, therefore, we introduce the completeness of the provided information (p_c) as the indicator of the querying user’s privacy concerns. It is measured by the percentage of social connections that a querying user provided over the complete “oracle” social connections of that user. The “oracle” social connections are simulated by the user-provided information from the user study in [5], because the users were explicitly asked to provide complete social connections during the user experiment.

In this paper, we test eleven different values of p_c (from 0.0 to 1.0, in steps of 0.1). Note that p_c = 0.0 corresponds to the scenario in which we do not have any social information for the querying user, and p_c = 1.0 means that the complete social connections of the querying user are available. When p_c is set to any other value, we can only use partial social connections. To remove the sampling bias, we randomly sample the incomplete social connections in 10 runs, and the reported results are based on the average over those 10 runs.

2.3 Experiment Setup
Our study involves two sets of experiments. The first set examines the impact of various privacy settings on the computation of the global and local network features. The second set tests their further influence on people search.

2.3.1 Testing the Impacts on Global Network Feature
Since the local network feature is directly related to the querying users, it is difficult to study it independently. In contrast, the global network feature is computed through propagation on the whole network and is independent of the querying users. Therefore, we only examine the influence of different privacy settings on the global network feature in this section.

The global network feature of a candidate is represented as his/her authority value, which is measured by the PageRank value on the coauthor network. We first compute the authority value (pr_a) for each candidate a using the whole network information. These values are treated as the ground truth. To test the impact of a privacy setting, we re-compute the authority value (pr_a^p) for the candidate a when a given portion of the people on the network do not share their social connections because of privacy concerns. We use the Mean Absolute Error (MAE) between the new authority values and the ground-truth authority values over all of the authors as the indication of the impact of privacy concerns (see Eq. 2, where N is the number of authors).

MAE = \frac{1}{N} \sum_{a} \left| pr_a^p - pr_a \right|    Eq. 2

2.3.2 Testing the Impacts on People Search
When examining the impact of different privacy-preserving networks on people search performance, we adopted the user study data from Han et al. [5]. In that experiment setting, the effectiveness of a people search was affected by three facets: content relevance, the local network feature and the global network feature. The three facets were displayed to the querying users so that the users could directly configure the importance of each facet. To test the influence of using privacy-preserving networks, we could directly test their impact on a live system by comparing system performance in two scenarios: one with the complete network and the other with privacy-preserving networks. However, this would be very time-consuming and might be unable to detect subtle differences. Therefore, we decided to conduct a simulation study based on the queries and marked candidates from [5].

We assume that a querying user u issued several queries in order to finish a task and that, under K of those queries, u marked at least one candidate. We call those K queries the effective queries. We assume that the purpose of each effective query is to retrieve the best-matched candidates (i.e. the ground truth). Although the ordering of those K queries may reveal their importance in the whole search process, we do not consider such information in this paper for simplicity. Therefore, for each effective query in [5], we compute three scores: the query-candidate content match S_C, the local network feature S_L and the global network feature S_G. Those scores are transformed into logarithmic values and combined linearly. In a live system, the querying users can tune the importance of each facet: w_c (for S_C), w_g (for S_G) and w_l (for S_L).
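
As a rough illustration of the log-linear combination described above, the following sketch scores and ranks a few candidates. It assumes that the integration score takes the form S = w_c·log(S_C) + w_g·log(S_G) + w_l·log(S_L). The function names, the example candidates and scores, and the smoothing constant EPS (used only to avoid log(0)) are hypothetical and are not taken from Han et al. [5].

```python
import math

EPS = 1e-9  # hypothetical smoothing constant, only here to avoid log(0)


def integration_score(s_c, s_g, s_l, w_c, w_g, w_l):
    """Weighted sum of log-transformed facet scores (assumed form of S)."""
    return (w_c * math.log(s_c + EPS)
            + w_g * math.log(s_g + EPS)
            + w_l * math.log(s_l + EPS))


def rank_candidates(scores, w_c, w_g, w_l):
    """Rank candidates by descending integration score.

    scores maps a candidate name to a tuple (S_C, S_G, S_L).
    """
    return sorted(
        scores,
        key=lambda c: integration_score(*scores[c], w_c, w_g, w_l),
        reverse=True,
    )


# Made-up facet scores for three candidates: (S_C, S_G, S_L).
example = {
    "alice": (0.80, 0.010, 0.20),
    "bob": (0.75, 0.200, 0.00),
    "carol": (0.10, 0.050, 0.50),
}

content_only = rank_candidates(example, w_c=1.0, w_g=0.0, w_l=0.0)
with_authority = rank_candidates(example, w_c=1.0, w_g=0.1, w_l=0.0)
print(content_only)
print(with_authority)
```

In this toy example, adding a small weight on the global feature is enough to swap the two candidates whose content scores are close, which is the kind of rank change that a perturbed authority value could also cause.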
The computation of each score and their integration are the same        candidates are usually well-connected in networks, we anticipate a
as Han et al. [5]. The Integration score S is computed using Eq. 3.     higher impact from their privacy concerns. We see in Figure 2 that
The candidates are ranked based on this score.                          the MAE curves for both two positive λ values are above the
                                                                        baseline. When sampled more privacy-concerned candidates from
                                                              Eq. 3     high-degree candidates (i.e. compare λ = + 0.5 with λ = + 1.0), we
                                                                        see an increase of the MAE errors.
For each effective query, different configurations of wc, wg and wl         0.50
yield different search performance. Lacking of the real user                               λ = - 1.0
interactions, we cannot obtain how users would set those weights.           0.40           λ= - 0.5
In the simulation study, we assume that users are able to tune the
                                                                                           λ = 0.0
best configurations to achieve the best search performance. The
search performance of each effective query qi is measured by the
                                                                            0.30           λ = + 0.5
Average Precision (AP) under the best configuration of wc, wg and                          λ = + 1.0
wl, as shown in Eq. 4. The AP is computed using the ground-truth            0.20
data (the marked candidates for a task with ratings bigger than
3.0) from the user study in Han et al. [5]. The ground-truth is built       0.10
for user-task pair so that any of the K effective queries within one
user-task pair would share the same ground-truth. The search                0.00
performance of each user-task pair is then measured by the Mean                    0.1   0.2   0.3     0.4   0.5   0.6   0.7   0.8   0.9
Average Precision (MAP) over all of the K effective queries, as
shown in Eq. 5. Then, the search performance depends on the SC,         Figure 2: The impacts of different privacy-preserving networks on the
SG and SL (as shown in Eq. 3.), which are determined by the             calculation of global network feature. We measure the impacts using
                                                                        MAE. X axis: pb in Figure 1; Y axis: the MAE. Each value is
available information in the privacy-preserving networks. The
                                                                        aggregated over 10 runs. (MAE, the smaller the better)
comparison of privacy-preserving networks can be transformed to
compare the MAP.
                                                                        3.2 Impacts on the People Search
                                                                        We further study how different privacy-preserving networks
                                                             Eq. 4      affect the people search performance. We took λ = -1.0 (+1.0) as
                                                                        the upper (lower) bound based on the result in Figure 2 and still
                                                                        used λ = 0.0 as the baseline.
             ∑                                               Eq. 5      To simulate and measure the people search performance, we need
                                                                        to set appropriate parameters (wc, wg and wl) in Eq. 3. Since we
                                                                        only focus on the impact of the global network feature in this
3. IMPACTS ON GLOBAL FEATURE IN                                         section, we set the weight for local network feature wl = 0. We
                                                                        estimate the parameters based on the full network information,
PRIVACY-PRESERVING NETWORKS                                             and assume that parameters are also applied to privacy-preserving
In this section, we study how different privacy-preserving              networks. We acknowledge the limitation of not tuning
networks influence the computation of the global network feature        parameters for each network. We think the parameters reveal
and how it further affects the performance of people search.            users’ objective view of the importance of each facet and it
                                                                        remains the same under different networks. The parameters we
3.1 Impacts on Global Network Feature                                   used in this section are wc = 1.0 and wg = 0.1.
We simulate different privacy-preserving networks by setting
different λ in Eq. 1. We compare five λ in this paper: -1.0, -0.5,      The MAP evaluations under different privacy-preserving
0.0, 0.5 and 1.0. Under each λ, we then adopt the sampling procedure described in Section 2.2.1 to choose a certain percentage (pb in Figure 1) of privacy-concerned candidates. To measure the impact on the computation of the global network feature, we measure the MAE between its values on the full network and on the sampled privacy-preserving networks.

The MAE results are shown in Figure 2. As stated, when λ is set to 0.0, the candidates in the network have a uniform probability (pi in Eq. 1) of being concerned about sharing their social connections. We treat it as one of the baselines. We also set λ to negative values to simulate the scenario in which candidates with low association degrees have more privacy concerns. Since those low-association-degree candidates affect only a small proportion of the connections in the network, we suspected that they would have less impact. The results in Figure 2 confirm our expectation. In addition, smaller negative values of λ (i.e., larger absolute values) result in slightly better MAE, which is not surprising given this expectation.

Setting λ to a positive value corresponds to the scenario in which candidates with a high association degree have a higher chance of having privacy concerns. Since those high association degree

networks (different values of λ and pb) are shown in Figure 3. We also plot the MAP obtained with the full network information (the red solid line) as an upper-bound baseline. We find that the results for λ = -1.0 are very similar to the upper-bound baseline even when pb is as large as 0.9. This is because only low-degree candidates have privacy concerns here, while the core candidates with medium or high degree remain in the network. In contrast, the results for λ = +1.0 (high-degree people have more privacy concerns) show a clear impact on people search performance even when pb is as small as 0.1 or 0.2. This is because many core candidates with the highest degrees of association are removed from the networks.

Although the maximal change in MAP is a 3.87% drop (relative percentage when λ = +1.0 and pb = 0.8, compared to the “Full Networks”), the changes for all pb are still significant under the Wilcoxon Sign Test (e.g., p-value = 0.040 for pb = 0.1, p-value = 0.016 for pb = 0.2, p-value = 0.000 for pb = 0.3, and so on). Again, the results for λ = 0.0 lie between those for λ = +1.0 and λ = -1.0, because high- and low-degree candidates have the same probability of being sampled as privacy-concerned candidates.
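The two measures used above, MAE over feature values and MAP over ranked results, can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the toy inputs, and the choice to treat a candidate missing from the sampled network as having feature value 0.0 are all assumptions made for illustration.

```python
from statistics import mean


def mean_absolute_error(full_values, sampled_values):
    """MAE between a feature's values on the full and a sampled network.

    Both arguments map candidate id -> feature value. ASSUMPTION: a
    candidate absent from the sampled network counts as 0.0.
    """
    return mean(abs(v - sampled_values.get(c, 0.0))
                for c, v in full_values.items())


def average_precision(ranked, relevant):
    """Average precision of one ranked candidate list."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one pair per query."""
    return mean(average_precision(ranked, rel) for ranked, rel in runs)


def relative_change(baseline, value):
    """Relative change in percent with respect to the baseline."""
    return 100.0 * (value - baseline) / baseline


# Toy data (hypothetical, not taken from the experiments).
full = {"a": 0.5, "b": 0.3}
sampled = {"a": 0.4, "b": 0.3}
print(mean_absolute_error(full, sampled))
print(mean_average_precision([(["a", "b", "c", "d"], {"a", "c"})]))
print(relative_change(0.25, 0.24))
```

A relative drop such as the one reported for λ = +1.0 corresponds to `relative_change(map_full, map_sampled)`, where both MAP values are computed over the same set of queries.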
Figure 3: The impact of the new global network feature on people search performance under different privacy-preserving networks. The impact is measured by MAP. X axis: pb in Figure 1; Y axis: MAP. Each value for the different λ (except “Full Networks”) is aggregated over 10 runs. (MAP: the bigger, the better.) Plotted series: Full Networks, λ = -1.0, λ = 0.0, λ = +1.0.

4. IMPACTS ON LOCAL FEATURE IN PRIVACY-PRESERVING NETWORKS
In this section, we try to understand the impact of privacy-preserving networks on the local network feature. Since it is related to both the candidates and the querying users, we study the privacy settings for both types of users.

4.1 Impact of Candidates’ Privacy Setting on Local Network Feature
Since we are focusing on the local network feature in this section, we set wg = 0.0. To find appropriate weights for SC and SL (i.e., the optimal wc and wl), we re-examine users’ people search process based on the user study data and find the parameters that maximize people search performance over all effective queries. As in Section 3.2, we use the full network information in this process and assume that the same parameter setting also applies to the privacy-preserving networks. The best parameters we chose are wc = 1.0 and wl = 0.082. We also use the same parameters in Section 4.2.

The MAP evaluations on different privacy-preserving networks are shown in Figure 4, where we examine the results for three λ values: -1.0, 0.0 and +1.0. In addition, we consider the “Full Networks” as an upper-bound baseline, as we did in Section 3.2. We find that the local network feature yields a larger improvement in people search performance than the global network feature: with the full network information, the MAP is 0.2352 for the global network feature (combined with content relevance) and 0.2752 for the local network feature (combined with content relevance). The difference is significant under the Wilcoxon Sign Test (p = 0.003). However, we observe that the local network feature is more sensitive to the privacy setting than the global network feature: for λ = 0.0, the maximum MAP change is less than 0.01 for the global network feature (as shown in Figure 3), while it is more than 0.035 for the local network feature (as shown in Figure 4).

We further find that removing the high-degree candidates (i.e., λ = +1.0) has a great impact: performance drops substantially even when only a small portion of candidates have privacy concerns (pb = 0.1 or 0.2). This indicates the important role that high-degree candidates play in the computation of the local network feature. We think this may be because most of the desired candidates (i.e., candidates in the ground truth) in our user study are directly or indirectly connected to the top-degree candidates. However, this is not the case when λ = -1.0, where less well-connected candidates are removed. The MAP of randomly selecting candidates (λ = 0.0) to have privacy concerns lies between that of λ = -1.0 and that of λ = +1.0.

Figure 4: The impact of the new local network feature on people search performance under different privacy-preserving networks. The impact is measured by MAP. X axis: pb in Figure 1; Y axis: MAP. Each value for the different λ (except “Full Networks”) is aggregated over 10 runs. (MAP: the bigger, the better.) Plotted series: Full Networks, λ = -1.0, λ = 0.0, λ = +1.0.

4.2 Impacts of Querying Users’ Privacy Setting on Local Network Feature
The last privacy setting we examined relates to the completeness of the social information provided by the querying users; that is, we test the influence of different settings of pc (see Section 2.2.2 for its definition) on people search performance. The MAP evaluations over different pc are shown in Figure 5. “No Social Info.” means that we do not use the local network feature. “Full Social Info.” corresponds to the scenario in which we can obtain the users’ complete social connections and use them to compute the local network feature. “No Social Info.” serves as the lower bound of the MAP, whereas “Full Social Info.” acts as the upper bound.

Figure 5: The impact of the new local network feature on people search performance under different privacy settings of querying users. X axis: pc, i.e., the completeness of user-provided social information; Y axis: MAP. Each point for “Partial Social Info.” is averaged over 10 runs. (MAP: the bigger, the better.) Plotted series: No Social Info., Partial Social Info., Full Social Info.

We observe that the upper bound is significantly better than the lower bound (+15.58%, p-value = 0.001 under the Wilcoxon Sign Test), which indicates the usefulness of involving the local network feature of the querying users in the people search process. We also find that search performance increases steadily as more social information about the querying user becomes available (the dotted red line labeled “Partial Social Info.”).

5. CONCLUSION AND DISCUSSIONS
People search has been extensively studied in recent years. Many researchers have identified social network information as an important resource for improving people search performance
[4, 5, 10, 15]. Social networks can be used to infer the local network feature between the querying users and candidates, as well as the global network feature of the candidates. However, both the local and global network features can be strongly affected by the privacy settings of querying users and candidates. Although privacy issues have become increasingly important in recent years, their impact on people search has not been studied yet.

This may be due to the difficulty of obtaining a privacy-preserving social network and making it openly available for research purposes. Therefore, in this paper, we focus on simulating privacy-preserving social networks using a publicly available network, the academic coauthor network. Privacy concerns could come from either the querying users or the candidates in the networks. We studied their impacts separately. For the querying users, we treated the completeness of social information as a parameter to simulate the scenario in which users do not provide full social information. For the candidates, we introduced the proportion of candidates that have privacy concerns and the strength of association (i.e., a candidate's degree in the network) as two parameters. We assume that candidates’ privacy concerns are correlated with their degree of association in the network.

When using the full network information, we find that both the local and global network features significantly boost people search performance (compared to not using the social network). However, compared to the global network feature, the local network feature provides greater improvements. Using the simulated networks, we also find that privacy-preserving networks have a significant influence on people search performance with both the local and global network features (compared to using the complete network information).

In addition, we observe that candidates in different roles exert different impacts on the computation of the global network feature, and in turn have different influences on the people search process. The privacy concerns of high-degree candidates in the network have more impact. Since the local network feature is related to both the querying users and the candidates in the networks, we find that the privacy concerns of both have significant impacts on search performance. The privacy concerns of high-degree candidates have a bigger influence on people search than those of lower-degree candidates, especially when those high-degree candidates are related to the querying user. We also find that if the querying users provide more social connections, search performance increases steadily.

We acknowledge that this paper still has several limitations. First, our simulation study assumed that the purpose of each query is to find the best-matching candidates, so we did not differentiate the deeper intentions behind different queries. However, it has been observed that users may develop different strategies in their search processes, so some queries may be used only to filter out certain non-relevant candidates. Identifying the search intention behind each query would give us a better understanding of the impact of privacy concerns. This is one future direction.

Second, we assumed that each querying user is able to tune the weights of each feature to their optimal configuration, which may not be the case in a live search system. Users may behave differently from what we expected: they may not necessarily tune for the optimal parameters and find the best-matched candidates. Our next step is to conduct a live user experiment to study how users interact with the search system under different privacy-preserving networks.

Finally, we tested the impacts of the local and global network features separately, whereas privacy concerns affect both features in a people search system such as PeopleExplorer². In addition, we studied the privacy settings of the querying users and the candidates separately. In real settings, all these factors should be studied together.

² http://crystal.exp.sis.pitt.edu:8080/PeopleExplorer/

6. REFERENCES
[1] Rodriguez, M., Posse, C. and Zhang, E. Multiple objective optimization in recommender systems. ACM, 2012.
[2] Han, S., Jiang, J., Yue, Z. and He, D. Recommending program committee candidates for academic conferences. In Proceedings of the 2013 Workshop on Computational Scientometrics: Theory & Applications (San Francisco, California, USA, 2013). ACM.
[3] Balog, K., Azzopardi, L. and de Rijke, M. Formal models for expert finding in enterprise corpora. In SIGIR 2006, Seattle, Washington, USA, 2006. ACM.
[4] Chen, H.-H., Gou, L., Zhang, X. and Giles, C. L. CollabSeer: a search engine for collaboration discovery. In JCDL '11. ACM, New York, NY, USA, 231-240.
[5] Han, S., He, D., Jiang, J. and Yue, Z. Supporting exploratory people search: a study of factor transparency and user control. In CIKM 2013, San Francisco, California, USA, 2013. ACM.
[6] Terveen, L. and McDonald, D. W. Social matching: A framework and research agenda. ACM Transactions on Computer-Human Interaction (TOCHI), 12, 3 (2005), 401-434.
[7] Acquisti, A. and Gross, R. Imagined communities: Awareness, information sharing, and privacy on the Facebook. Springer, 2006.
[8] Dwyer, C., Hiltz, S. R. and Passerini, K. Trust and Privacy Concern Within Social Networking Sites: A Comparison of Facebook and MySpace. 2007.
[9] Agrawal, R. and Srikant, R. Privacy-preserving data mining. ACM SIGMOD Record, 29, 2 (2000), 439-450.
[10] Zhang, J., Tang, J. and Li, J. Expert finding in a social network. Springer, 2007.
[11] Ugander, J., Karrer, B., Backstrom, L. and Marlow, C. The anatomy of the Facebook social graph. arXiv preprint arXiv:1111.4503 (2011).
[12] Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286, 5439 (1999), 509-512.
[13] Watts, D. J. and Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature, 393, 6684 (1998), 440-442.
[14] Newman, M. E. Assortative mixing in networks. Physical Review Letters, 89, 20 (2002), 208701.
[15] Han, S., He, D., Brusilovsky, P. and Yue, Z. Coauthor prediction for junior researchers. In Proceedings of the 6th International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction (Washington, DC, 2013). Springer-Verlag.
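As a supplementary illustration of the simulation summarized in Section 5, the following Python sketch shows one way the three parameters (λ, pb and pc) could drive the construction of a privacy-preserving network. It is a sketch under stated assumptions, not the procedure used in this study. Eq. 1 is not reproduced here, so the sketch assumes a selection weight proportional to degree raised to the power λ, which gives a uniform choice at λ = 0.0 and favors high-degree candidates for positive λ. It also assumes that a privacy-concerned candidate is removed together with all of its edges. All function names and the toy graph are hypothetical.

```python
import random
from collections import Counter


def degree_of(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def sample_privacy_concerned(degrees, lam, pb, rng):
    """Pick a fraction pb of candidates without replacement.

    ASSUMPTION: the weight of candidate i is degree_i ** lam; this stands
    in for Eq. 1 and is not taken from it.
    """
    candidates = [c for c, d in degrees.items() if d > 0]
    k = round(pb * len(candidates))
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw key u ** (1 / w) per item and keep the k largest keys.
    keyed = [(rng.random() ** (1.0 / (degrees[c] ** lam)), c)
             for c in candidates]
    keyed.sort(reverse=True)
    return {c for _, c in keyed[:k]}


def remove_candidates(edges, hidden):
    """Drop every edge that touches a privacy-concerned candidate."""
    return [(u, v) for u, v in edges if u not in hidden and v not in hidden]


def partial_connections(neighbors, pc, rng):
    """Keep a random fraction pc of a querying user's connections."""
    k = round(pc * len(neighbors))
    return set(rng.sample(sorted(neighbors), k))


# Toy coauthor network (hypothetical): one hub with five coauthors.
edges = [("hub", leaf) for leaf in ("a", "b", "c", "d", "e")]
rng = random.Random(0)
hidden = sample_privacy_concerned(degree_of(edges), lam=1.0, pb=0.2, rng=rng)
print(hidden, remove_candidates(edges, hidden))
print(partial_connections({"a", "b", "c", "d", "e"}, pc=0.4, rng=rng))
```

Repeating the sampling with different random seeds and averaging the resulting MAP values would mirror the aggregation over 10 runs described in the figure captions.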