=Paper=
{{Paper
|id=Vol-2083/paper-05
|storemode=property
|title=Sensitive Attribute Prediction for Social Networks Users
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-05.pdf
|volume=Vol-2083
|authors=Younes Abid,Abdessamad Imine,Michaël Rusinowitch
|dblpUrl=https://dblp.org/rec/conf/edbt/AbidIR18
}}
==Sensitive Attribute Prediction for Social Networks Users==
Younes Abid, Lorraine University, Nancy, France (younes.abid@loria.fr)
Abdessamad Imine, Lorraine University, CNRS, Inria, Nancy, France (abdessamad.imine@loria.fr)
Michaël Rusinowitch, Lorraine University, CNRS, Inria, Nancy, France (michael.rusinowitch@loria.fr)
ABSTRACT
Social networks are popular means of data sharing, but they are vulnerable to privacy breaches. For instance, by relating users with similar profiles, an entity can predict personal data with high probability. We present SONSAI, a tool that helps Facebook users protect their private information from such inferences. The system samples a subnetwork centered on the user, cleanses the collected public data, and predicts user sensitive attribute values by leveraging machine learning techniques. Since SONSAI displays the most relevant attributes exploited by each inference, the user can modify them to prevent undesirable inferences. The tool is designed to perform reasonably with the limited resources of a personal computer, by collecting and processing a relatively small but relevant part of the network data.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the 21st International Conference on Extending Database Technology (EDBT) (March 26-29, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION
Data published on social network profiles can be mined to infer sensitive information about users. For instance, it was shown in [10] how musical tastes allow one to predict educational level. To increase user awareness of these privacy threats we have designed a tool, SONSAI, that lets Facebook users audit their profiles. The system crawls the network around the user and predicts the user's sensitive attribute values using a machine learning engine. The results provided by SONSAI also show the public attributes that have oriented the learning algorithm towards a particular sensitive attribute value. The user can therefore modify these public attributes to prevent inference.
For the approach to be feasible, several problems have to be solved. First, data collection by crawling is limited both by the social network and by country regulations; hence the crawler exploration strategy has to focus only on meaningful, representative network nodes. Second, since attributes are numerous, the learning program can only scale if we select the most relevant attributes for inferring sensitive attribute values. We therefore need an attribute relevance measure that is both accurate and easy to compute. Note that we cannot rely on semantic proximity, since we notice that a user who hides a sensitive attribute will probably hide semantically related ones too. Moreover, for fully anonymised datasets the attribute semantics are hard to recover. Therefore we follow an alternative approach: we model attributes as bipartite graphs and measure the relevance of attributes by comparing their bipartite graph structures.
For specific attributes such as gender and relationship status, the sets of values are much smaller than for other attributes like music and movie. Consequently, the graphs that model these attributes have higher connectivity than the other graphs. For instance, the density of the graph that models gender (as most users publish their gender) is close to 0.5. In order to infer hidden links in such graphs we need to learn from highly connected graphs. However, most of the available learning graphs are sparse. To cope with this last problem, we derive new graphs by merging several learning graphs.
All the proposed methods have been implemented for Facebook; however, they can be applied to many other social networks. The system has been tested by several volunteer users for auditing their Facebook profiles. In each case, a dataset was built from real profiles collected in the user's neighborhood network.
Related works. In [15] the authors propose algorithms to detect whether a sensitive attribute value can be inferred from the neighborhood of a target user in a social network. Heatherly et al. [12] infer attribute values in social networks with Bayesian classification techniques. For the same purpose, Estivill-Castro et al. [7] employ decision-tree learning algorithms. In these works, learning is performed off-line on large datasets. In order to perform attribute inference from the sparser datasets that our tool collects in a short time from the user's ego-network, we rather use random walk-based learning. The random walk technique has been applied to social representations in [14] and [11], where the authors analysed friendships and used a skip-gram model. In these works, user profiles that have similar friends are mapped to similar representations. This model helps to detect communities and can predict a set of potential friends for a given user profile. The skip-gram model has also been applied recently to infer social relationships from mobility data [5]. Our work uses a more adapted Continuous Bag of Words (CBOW) model to predict the most likely values of a user's sensitive attribute. In our setting, friends are considered as an attribute among others. Moreover, we first determine automatically a relevant subset of the social network for optimizing random walks. In [3] the relevance of attributes is computed by Bayesian optimisation, which is much less efficient than the graph comparison approach adopted here. In particular, [3] does not succeed in reasonable time (on a standard PC) with attributes like gender and relationship status. Let us note that (unlike [1]) our approach does not need any ontology to perform semantic correlation between attributes. Finally, we notice that the proposed system SONSAI is close in some respects to a recommendation system: an item suggestion can be viewed as an attribute value prediction [4]. However, unlike recommendation systems, SONSAI also provides explanations for the predicted values, namely an ordered list of attributes that have played a significant role in the computation. For enforcing privacy, our final goal will be
to reduce the “recommendation accuracy” by acting on this list of attributes.

2 ARCHITECTURE OF SONSAI

Procedure crawl_nodes(u_t, d, n_c, π)
    crawl_node(u_t);
    while |crawled_nodes| < n_c do
        c ← u_t; r ← {};
        for i ← 1 to d do
            r.addAll(sinks(d − i));
            c ← random_select_in(({c} ∪ c.n) ∖ r);
        end
        crawl_node(c);
    end
Algorithm 1: Crawling nodes around a target user.
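The sampling loop of Algorithm 1 (described in Section 3) can be sketched as follows. This is a minimal illustration rather than the actual crawler: the network is a plain adjacency dict, the type-aware transition probability π is replaced by a uniform choice, and the sink test is simplified to depth 1; all data and helper names other than crawl_nodes and sinks are hypothetical.

```python
import random

def sinks(adj, crawled):
    # Simplified sink test (the depth argument j of the paper is fixed to 1
    # here): a crawled node whose discovered neighbours are all crawled
    # cannot lead the walk anywhere new.
    return {v for v in crawled if set(adj.get(v, ())) <= crawled}

def crawl_nodes(adj, u_t, d, n_c):
    """Sample at most n_c nodes around u_t by random walks of length <= d."""
    crawled = {u_t}                        # crawl the target first
    while len(crawled) < n_c:
        if set(adj) <= crawled:            # nothing left to discover
            break
        c, refused = u_t, set()
        for _ in range(d):
            refused |= sinks(adj, crawled)
            candidates = ({c} | set(adj.get(c, ()))) - refused
            if not candidates:
                break
            # uniform stand-in for the type-aware transition probability pi
            c = random.choice(sorted(candidates))
        crawled.add(c)                     # crawl the final node of the walk
    return crawled

network = {"t": ["a", "b"], "a": ["t", "c"], "b": ["t"], "c": ["a"]}
sample = crawl_nodes(network, "t", d=2, n_c=3)
```

Avoiding sinks keeps each walk able to end on a node that has not been crawled yet, which is what makes the outer loop make progress.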
Figure 1: Architecture of SONSAI.

The architecture of SONSAI is overviewed in Figure 1. The Facebook crawler (about 5k lines of Java 8) drives a Firefox 58.0b4 browser through a Selenium 3.5.3 server (http://www.seleniumhq.org/). The information collected from each profile, group and page is stored in separate XML files. The Anonymizer, Cleanser, Random walker and Ranker components are written in Python 2.7 (about 2.5k lines of code). The Anonymizer component parses all the XML files, generates anonymized graphs and stores them as TSV files. Then the Cleanser selects the most relevant graphs and stores them as adjacency lists. The Random walker component browses the adjacency lists and stores the resulting walks in a text document. We use the Python gensim implementation of Word2Vec (https://radimrehurek.com/gensim/index.html) to parse the text document and compute a vectorial representation of the social network nodes encountered in the walks. Finally, the Ranker component classifies the sensitive nodes according to their similarity to the target user profile.

3 SAMPLING FACEBOOK
We have designed a crawler that explores the social network around a user (up to some distance given as a parameter) and collects public information from the visited web pages (i.e. friends, liked pages, ...) in order to build a representative subnetwork. We distinguish two types of Facebook nodes, user profiles (u) and pages (p), and two types of links: like-ships between user profiles and pages, and friendships between user profiles. Given a node c, c.n denotes the set of nodes that are linked to c. A discovered node is a node whose URL is known by the crawler. For instance, if the crawler retrieves a user profile and collects its public friends list, then all the friends of that particular user profile are discovered. Algorithm 1 crawls at most n_c nodes at distance ≤ d from the target node u_t. Each iteration of the outer loop samples a node, crawls it and updates the sets of discovered and crawled nodes. The sampling is done by random walks of length ≤ d with a transition probability π designed to crawl closer nodes with higher priority and to favour neighbour nodes according to their type. The function sinks(j) returns the set of sinks, i.e. crawled nodes such that all discovered nodes at distance ≤ j are also crawled. Sinks are avoided by the random walks to guarantee that the final node of a walk has not been crawled yet.

4 SOCIAL NETWORK MODEL
Modelling friendship relations. Since friendship on Facebook is symmetric, we model friendship between user profiles by an undirected graph (U, F) where U is a set of user profiles and F is a set of friendship links between them.
Modelling page like-ship. We model like-ship between user profiles and pages by several bipartite graphs (U, P, L) where U is a set of user profiles, P is a set of pages (of a given type) and L is a set of like-ship links between them. Figure 2 shows an example of page like-ship modelled by two graphs: Graph (a) models liked pages of music type and Graph (b) models liked pages of book type. We note that user profiles can like several pages of the same type.

Figure 2: Example of pages like-ship graphs. (Icons made by Smashicons from www.flaticon.com.)

Anonymizing the social network graphs. For ethical and regulation reasons, Facebook identifiers are replaced by fresh identifiers. Each node in the network is then identified by a unique integer ID replacing its Facebook ID. The anonymized IDs are sorted according to the node types. Anonymized graphs are saved under the tab-separated
value (TSV) format, one of the most general delimiter-separated values (DSV) formats. TSV is widely used for graph exchange. In contrast to the dataset released by Netflix (https://www.kaggle.com/netflix-inc/netflix-prize-data), where only user IDs are anonymised (but not movie titles, ratings, or rating dates), all attribute values are anonymised in our datasets.
In the following, a sensitive graph is a graph that models an attribute that is considered sensitive by the user (i.e. the user does not want its value to be predictable). The learning graphs are the attribute graphs available to the learning module in order to predict hidden links in the sensitive graph.

5 MODEL CLEANSING
The task of predicting a sensitive attribute from the other ones is made difficult by the size of the datasets. Therefore, we suggest applying the inference process only to a subset of the attributes most relevant for the task. Our relevance notion does not rely on semantic proximity, since we noticed that i) a user who hides a sensitive attribute will probably hide other semantically related attributes, and ii) for fully anonymised datasets the attribute semantics are hard to recover. Instead, we rely on comparing attribute graph structures.
In the following we assume a fixed sensitive graph s.
Step 1: Computing the learning and confidence rates of learning graphs. In order to compare the structure of a given learning graph to the structure of the sensitive graph, we first split each graph in two parts. The first part contains the user profiles that hide their links in the sensitive graph. The ratio of user profiles that publish their links in the first part of the learning graph is the learning rate lr. The second part contains the user profiles that publish their links in the sensitive graph. The ratio of user profiles that publish their links in the second part of the learning graph is the confidence rate cr.
The function connected_profiles_in() returns the set of user profiles that publish their links in the graph given as argument. Given a learning graph l, a sensitive graph s, and the set U of user profiles in all graphs, we define:

    Ul = connected_profiles_in(l)
    Us = connected_profiles_in(s)
    lr(l) = size(Ul ∩ (U ∖ Us)) / size(U ∖ Us)
    cr(l) = size(Ul ∩ Us) / size(Us)

Figure 3 depicts an example of splitting two graphs for comparison. The graph that models the links between user profiles and pages of politicians is the sensitive graph, and the graph that models the links between user profiles and pages of music is the learning graph. The learning rate lr for this example is equal to 50%, and the confidence rate cr is equal to 75%.

Figure 3: Example of cutting graphs for structure comparison.

Step 2: Computing the distance between a learning graph and the sensitive graph. In this step, we discard user profiles that have a null degree in the learning graph or in the sensitive graph. The Jaccard index between two user nodes u1 and u2 in a given graph A is computed as follows, where the function links_A(u_j) returns the set of nodes to which user node u_j is connected in the graph A:

    J_A(u1, u2) = |links_A(u1) ∩ links_A(u2)| / |links_A(u1) ∪ links_A(u2)|    (1)

The Hamming distance H between graphs l and s is defined by:

    H(l) = Σ_{u_k, u_j ∈ Ul ∩ Us, k ≠ j} |J_l(u_k, u_j) − J_s(u_k, u_j)|    (2)

In order to compare learning graphs with different sets of common connected profiles Ul ∩ Us, we normalize this distance by the maximal Hamming distance that can be obtained on such a set. Hence we define the Hamming rate hr(l) = H(l) / M(l), where:

    M(l) = Σ_{u_k, u_j ∈ Ul ∩ Us, k ≠ j} max(J_s(u_k, u_j), 1 − J_s(u_k, u_j))    (3)

Step 3: Selecting the most relevant graphs for learning sensitive attribute values. We first discard the learning graphs that have a learning rate lr lower than a threshold θ_lr, since they do not convey enough information. We then discard the graphs that have a confidence rate cr lower than θ_cr, since they are considered unreliable. Finally, from the remaining graphs we only select the graphs that have a Hamming rate hr lower than θ_hr, since they are the most similar to the sensitive graph.
Densifying graphs. For some sensitive attributes such as gender, age and relationship status, user profiles are linked to at most one value. Moreover, the sets of values for these particular attributes are much smaller than for other attributes. Consequently, the graphs that model these attributes are denser than the other ones. In this case, to improve the random walk-based learning process (see Section 6), we need to merge several learning graphs in order to obtain a denser one. We explain the method with a simple example of gender prediction: we select attribute graphs with high lr, cr rates, but also with a good rate of discrimination between genders, that is, the gender of the connected users in the graph is unbalanced between male and female. For instance, we can select the jewelry and fast-food graphs. We merge these graphs by grouping all fast-foods in a unique node, and similarly for jewelries, as shown in Figure 4, to obtain a new learning graph.

    u → v  : p_{u,y} / deg_y(u)   if y ≠ 1 and (u, v) ∈ L_y
    v → u  : 1 / deg_y(v)         if y ≠ 1 and (u, v) ∈ L_y    (5)
    u → u′ : p_{u,1} / deg_1(u)   if (u, u′) ∈ F
Figure 4: Merging graphs.
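The merging step just described can be sketched as follows, assuming like-ship graphs stored as dicts from users to liked pages; the data and the merge_graphs helper are illustrative, not part of SONSAI:

```python
def merge_graphs(graphs):
    """Merge several bipartite like-ship graphs into one denser learning
    graph by collapsing all pages of each type into a single super-node."""
    merged = {}
    for page_type, graph in graphs.items():
        for user, pages in graph.items():
            if pages:  # the user liked at least one page of this type
                merged.setdefault(user, set()).add(page_type)
    return merged

jewelry = {"u1": ["j1"], "u2": [], "u3": ["j2", "j3"]}
fast_food = {"u1": [], "u2": ["f1"], "u3": ["f2"]}
dense = merge_graphs({"jewelry": jewelry, "fast_food": fast_food})
# dense == {"u1": {"jewelry"}, "u2": {"fast_food"}, "u3": {"jewelry", "fast_food"}}
```

Every user connected to any page of a type becomes connected to that type's super-node, so the merged graph is denser than each of its components.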
6 RANDOM WALK-BASED ATTRIBUTE LEARNING
Representation generation. We plan to translate latent information from the graphs selected in the previous section into a document. The resulting document holds information about paths in the graphs and their frequencies, which can be exploited to infer the proximity of a user node to some sensitive attribute value node. Following [14], paths in the social graph are sampled by random walks. In our case, the walks are executed only in the subset of selected graphs. Let G1, G2, ..., Gn be the list of selected learning graphs and let U be the set of users in all graphs. We assume that the friendship graph is selected and G1 = (U, F) (otherwise we can simply adapt the computation below). The other graphs are bipartite and we pose Gi = (U, Vi, Li) for i > 1.
We introduce quotas to quantify the importance of each graph Gr for inferring secret values of the target sensitive attribute. Each selected graph Gr is assigned a 3-dimensional vector V_Gr = [lr(Gr), cr(Gr), 1 − hr(Gr)]. The quota of Gr is given by its Mahalanobis distance to the null vector [0, 0, 0], computed as follows:

    q(Gr) = sqrt(V_Gr^T Σ^{-1} V_Gr)

with Σ the 3 × 3 covariance matrix over the set of selected graph vectors.
To specify the random walk transitions, we first define the probability p_{u,y} that, being in node u, the next node in the random walk is in a selected graph G_y:

    p_{u,y} = q(G_y) / Σ_{x : deg_x(u) > 0} q(G_x)   if deg_y(u) > 0,   and p_{u,y} = 0 otherwise    (4)

where deg_x(u) is the degree of user u in graph G_x. A value node v is followed by a user node chosen uniformly at random from the ones connected to v. Assuming the node following user node u is in G_y, it is chosen uniformly at random from the nodes in G_y that are connected to u. Therefore, the transition probabilities are as given in (5) above, where G1 is the friendship graph.

Figure 5: Example of multi graph random walk.

As illustrated in Figure 5, the document is constructed by connecting all graphs through random jumps between their nodes (see also [14]). At each step the walker state changes and a new word is written in the text document. One step amounts to selecting a graph in which the current node occurs with non-null degree, and then selecting a node that is connected to the current node in the selected graph. The selected node then becomes the new current node.
In this example we aim to predict the liked pages of politicians masked by user profile u3. The sensitive graph is Graph 2 and the learning graphs are Graph 1 and Graph 3. Since the values of the sensitive attribute (the pages of politicians) are labelled (each value belongs to a unique cluster), they are represented by the labels of their clusters in the final document. We use a greedy clustering algorithm [3] to define clusters of similar size. Pages of politicians that share many common "likers" end up in the same cluster. For instance, the first walk depicted in Figure 5 is [u1, u4, v2,3, u4]. But for efficiency the walk [u1, u4, c2,2, u4] is stored instead in the document, since the value v2,3 belongs to the cluster c2,2.
Applying word2vec to compute node representations. We have performed multi-graph random walks in the social network and generated a text document. Walks in the document can be interpreted as sentences whose words are network nodes. Hence, inferring a link between a user node and an attribute value node is similar to the problem of estimating the likelihood of word co-occurrence in a corpus. We use word2vec [9, 13] to map one-hot encoded vectors that represent words in a high-dimensional vocabulary space to low-dimensional vectors (see [6]). Word2vec
is employed with the Continuous Bag of Words (CBOW) model for predicting a word given its context, defined by the c − 1 words surrounding it (c is the window size of the context). The inputs of the model in our case are users and published attribute values. Since there is no order between the attribute values, the CBOW model is more adequate than skip-gram. The output of the CBOW model is a vector of size v representing a probability distribution of co-occurrence between all the words of the context and each word from the vocabulary within a window of size c.
Ranking sensitive attribute values. We measure the semantic similarity between two nodes by the cosine of the angle formed by the vectors representing the nodes. Cosine similarity is known to take greater account of context information. We rank all the sensitive values by their cosine similarity to the target user profile. The values that have the lowest rank are most likely the sensitive attribute secret values of the target user profile, where secret values are the true values of the target user that are not published by him on the social network.
Figure 6 depicts an example of 2-dimensional vectors that encode 8 nodes: 3 user profiles (u1, u2 and u3), 2 music pages (v3,1 and v3,2) and 3 clusters of politician pages (c2,1, c2,2 and c2,3). The clusters of politician pages are the sensitive values and their vectors are red. The node u1 is the target user profile and its vector is blue. The clusters of politician pages are ranked according to their distances to u1, and the inference algorithm infers, as the most probable politician pages to be liked by u1, the politician pages of the cluster that has the smallest rank (the closest cluster to u1).
In [16] Schakel et al. show that the word2vec unsupervised learning algorithm encodes word semantics by assigning vectors pointing in the same direction to co-occurrent words during training. Besides, the magnitude of a vector reflects both the frequency of appearance of related words in the corpus and the homogeneity of contexts, where a context is a set of words that have a high co-occurrence probability in the corpus.
In fact, words that appear in the same contexts have small angular distances between them. The less the contexts overlap, the larger the angular distances between their different words. However, words that appear in many contexts are represented by vectors that average vectors pointing in many context directions. Hence, the vector magnitude generally decreases with the number of contexts. Moreover, the higher the word frequency, the higher the chance that the word is used in different contexts. Consequently, the vector magnitude also decreases with frequency. From these remarks, we conclude that the Euclidean distance is not a good measure for our inference purpose. Indeed, words that appear in many contexts have low magnitude; as a result, their Euclidean distances will be small and, by this criterion, they would be considered close even if they do not appear in any common context. For instance, the Euclidean distance between the clusters of pages of the most popular politicians will be small even if they are rivals. In the example depicted in Figure 6, the politicians of the clusters c2,1 and c2,2 are rivals: the angular distance between those two clusters is big, but the Euclidean distance is small. Moreover, the Euclidean distance between a user that has many friends, for instance the user u1 in Figure 6, and a popular music page like "Despacito", for instance the music page v2,1 in Figure 6, will be small. But popular users do not necessarily like popular music.

7 EXPERIMENTS
To build datasets, we have crawled Facebook profiles of people living in North-East France and in Île-de-France (Paris). Table 1 gives statistics about the two crawled datasets.

Table 1: Details about the datasets.
Dataset | # Attributes | # Attribute values | # Crawled user profiles
Dataset 1 (D1), North-East France | 1 929 | 1 022 860 | 15 012
Dataset 2 (D2), Île-de-France | 1 296 | 298 617 | 6 550

7.1 Political orientation
From each node n we perform a random walk of 800 steps. The dimension of the node representation is taken to be 512. The dimension is usually taken between 100 and 300 for natural languages; however, the size of the vocabulary in social networks (equal to the number of nodes) is much higher than in natural languages.
In Dataset 1 the sensitive graph represents the links between 2554 user profiles and 4589 politician pages. For each experiment we generate a new social graph from the dataset by selecting the user profiles that publish their preferences concerning the sensitive attribute (pages of politicians) and at least one other attribute. Then we remove all the links in the sensitive graph of 10% of the selected user profiles. The algorithm makes sure that all the nodes in the resulting social graph remain connected. The experiments then consist in inferring the hidden links based on information from the learning graphs.
Among the 1928 learning graphs, we selected the ones with learning rate greater than θ_lr = 20%, confidence rate greater than θ_cr = 60% and Hamming rate lower than θ_hr = 4%. Table 2 details the relevance measures of the 23 selected graphs.

Table 2: Selected learning graphs in D1 for politicians.
Attribute graph | lr (%) | cr (%) | hr (%) | # Users | # Values
Users | 88.37 | 83.98 | 2.08 | 13155 | 13155
Communities | 44.97 | 98.47 | 1.75 | 8118 | 137338
Musicians/Bands | 38.58 | 91.38 | 2.03 | 7141 | 84762
PublicFigures | 32.86 | 92.44 | 1.80 | 6455 | 28289
NonProfitOrganizations | 31.85 | 86.57 | 1.72 | 6180 | 25847
Artists | 30.65 | 84.22 | 1.92 | 5970 | 31681
Companies | 30.05 | 85.94 | 1.75 | 5939 | 20750
Websites | 29.57 | 83.94 | 1.78 | 5829 | 17931
TVShows | 29.48 | 82.41 | 2.31 | 5778 | 11876
EntertainmentWebsites | 29.26 | 79.20 | 2.84 | 5669 | 8319
Media/NewsCompanies | 29.23 | 87.27 | 1.82 | 5871 | 14042
Products/Services | 27.52 | 80.93 | 1.86 | 5496 | 15986
News/MediaWebsites | 27.44 | 83.43 | 1.86 | 5550 | 9247
Organizations | 26.20 | 80.77 | 1.63 | 5328 | 14738
Movies | 26.09 | 75.17 | 2.26 | 5171 | 16282
LocalBusinesses | 24.91 | 78.58 | 1.69 | 5111 | 17321
Clothings | 23.99 | 68.12 | 1.96 | 4729 | 16090
Gastronomies | 23.52 | 71.73 | 2.24 | 4763 | 8422
Actors/Directors | 23.12 | 74.54 | 2.78 | 4785 | 10425
Magazines | 22.82 | 73.96 | 1.69 | 4733 | 9955
Athletes | 22.68 | 68.79 | 2.35 | 4583 | 14123
ApplicationPages | 21.68 | 66.36 | 3.04 | 4396 | 4244
SportsTeams | 21.48 | 63.93 | 2.35 | 4309 | 10433
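The lr, cr and hr measures of Section 5 reported in Table 2 are simple to compute; below is a minimal sketch on toy dict-based graphs (all data and helper names other than connected_profiles_in are illustrative, not from the SONSAI code):

```python
from itertools import combinations

def connected_profiles_in(graph):
    """User profiles that publish at least one link in the graph."""
    return {u for u, links in graph.items() if links}

def learning_and_confidence_rates(l, s, U):
    """lr and cr of learning graph l with respect to sensitive graph s."""
    Ul, Us = connected_profiles_in(l), connected_profiles_in(s)
    lr = len(Ul & (U - Us)) / len(U - Us)   # hiders of s that publish in l
    cr = len(Ul & Us) / len(Us)             # publishers of s that publish in l
    return lr, cr

def jaccard(g, u1, u2):
    a, b = set(g.get(u1, ())), set(g.get(u2, ()))
    return len(a & b) / len(a | b) if a | b else 0.0

def hamming_rate(l, s):
    """hr(l) = H(l) / M(l), the normalized Hamming distance of (2)-(3)."""
    common = connected_profiles_in(l) & connected_profiles_in(s)
    H = M = 0.0
    for u, v in combinations(sorted(common), 2):
        js = jaccard(s, u, v)
        H += abs(jaccard(l, u, v) - js)
        M += max(js, 1 - js)
    return H / M if M else 0.0

U = {"u1", "u2", "u3", "u4"}
sensitive = {"u1": ["p1"], "u2": ["p1"], "u3": [], "u4": ["p2"]}  # politician pages
learning = {"u1": ["m1"], "u2": [], "u3": ["m1"], "u4": ["m2"]}   # music pages
lr, cr = learning_and_confidence_rates(learning, sensitive, U)
hr = hamming_rate(learning, sensitive)
```

Here u3 hides its politician links but publishes music links, so lr = 1.0, while two of the three publishers of the sensitive graph also publish in the learning graph, so cr = 2/3.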
Figure 6: Example of 2-dimensional vectors that encode 8 nodes.
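The ranking step of Section 6 reduces to cosine similarity between the learned vectors; the following is a minimal sketch with made-up 2-dimensional vectors in the spirit of Figure 6 (all names and values are illustrative):

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_sensitive_values(target, candidates):
    """Rank candidate sensitive values by decreasing cosine similarity to the target user."""
    return sorted(candidates, key=lambda v: cosine(target, candidates[v]), reverse=True)

u1 = [1.0, 0.2]                 # target user vector
clusters = {                    # clusters of politician pages
    "c2,1": [0.9, 0.1],         # nearly aligned with u1
    "c2,2": [-0.8, 0.3],        # opposite direction: big angular distance
    "c2,3": [0.1, 1.0],
}
ranking = rank_sensitive_values(u1, clusters)   # ["c2,1", "c2,3", "c2,2"]
```

Note that c2,2 may be close to u1 in Euclidean distance if both vectors have small magnitude, yet it is ranked last here, which is exactly why the cosine is preferred.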
We note that the communities graph has the second greatest learning rate, lr = 44.97%, which means that it holds much latent information about users who hide their likes in the sensitive graph. It also has the maximal confidence rate, cr = 98.47%, and the fifth lowest Hamming rate (hr = 1.75%) among the 23 selected relevant graphs, which means that its structure is very similar to the structure of the politicians graph. The friendship graph (Users) has the maximal learning rate, lr = 88.37%, and a high confidence rate, cr = 83.98%, since 87.62% of users are connected in this graph. However, this graph has a Hamming rate hr = 2.08%, greater than the average hr of the selected graphs, which means that the information learned from this graph is less reliable than the information learned from the communities graph.
We use the area under the ROC curve (AUC) as defined in [8] to measure the accuracy of the inferred links. For the defined thresholds (θ_lr = 20%, θ_cr = 60%, θ_hr = 4%) the accuracy is equal to 0.79, whereas the inference accuracy when the 23 relevant graphs are selected randomly is only 0.41. We conducted more tests by manually selecting 3 graphs that are semantically close to politics, as follows: graph G1 models the links between 1246 user profiles and 2357 political organizations, G2 models the links between 1120 user profiles and 1758 political parties, and G3 models the links between 39 user profiles and 41 political ideologies. Although the selected graphs seem promising, the inference accuracy is only 0.46. This can be explained by the fact that the selected graphs are very sparse, as users are vigilant when publishing their preferences about those attributes. Consequently, the algorithm cannot learn well from them.
We note that the musicians/bands graph was automatically selected by our relevance-based selection method, confirming that music and politics are correlated, as empirically discovered in previous studies [17].
Table 3 summarizes the results of the conducted experiments.

Table 3: Experimental results.
Selection based on links | accuracy | targets | deleted | nodes
Relevance (23 graphs) | 0.79 | 252 | 409 | 558125
Random (23 graphs, average) | 0.41 | 204 | 351 | 11200

7.2 Gender and relationship status
To produce a text document from the social graph, we perform random walks of 80 steps, each starting from a different node. The dimension of the node vectorial representation is taken to be 128, since in that case the number of sensitive values is smaller. Moreover, the learning graphs are obtained by grouping several attribute values in the merging operation (see Section 5). For the experiments, we have generated a new social graph from the dataset by randomly selecting 10% of the user profiles that publish the value of their sensitive attribute, and we have masked that value. Then SONSAI has tried to infer the masked values of the sensitive attribute for each user, based on the attributes selected by the cleanser module.
Relationship status. The sensitive graph models the relationship status of user profiles. To simplify the presentation, we define two meta-relationship statuses as follows:

    R1 = {Single, Divorced, Separated, Widowed, Complicated}
    R2 = {Domestic partnership, Married, Engaged, Relationship, Civil union, Open relation}

We aim to infer the meta-relationship status of users. Table 4 gives more details about the attributes selected from dataset D2.
We notice that the discriminant attributes toward R1 are focused around education and leisure. On the other hand, the discriminant attributes toward R2 are focused around business. The accuracy in AUC of inferring the meta-relationship status is higher than 0.7 in both datasets D1
and D2, as soon as the target publishes values concerning at least 4 attributes selected by the cleanser.

Table 4: Selected attributes in D2 for relationship status.
Attribute | Importance | Discrimination
Education | 2.75 | 88.41 % R1
Community College | 2.74 | 90.02 % R1
Consulting Agency | 2.71 | 90.70 % R2
Sports & Recreation | 2.56 | 91.18 % R2
Home & Garden Website | 2.49 | 91.89 % R2
Automotive, Aircraft & Boat | 2.48 | 92.86 % R2
Locality | 2.47 | 92.59 % R2
Corporate Office | 2.46 | 91.18 % R2
News & Media Website | 2.42 | 90.32 % R2
Financial Service | 2.41 | 90.00 % R2
Industrial Company | 2.40 | 89.29 % R2
Educational Consultant | 2.02 | 75.00 % R1
Playground | 1.80 | 66.67 % R1
Phone/Tablet | 1.70 | 63.64 % R1
Plastic Surgeon | 1.60 | 60.00 % R1
Consulate & Embassy | 1.60 | 60.00 % R2
School Sports Team | 1.53 | 52.00 % R1
Dive Bar | 1.45 | 54.55 % R1
Video | 1.44 | 51.00 % R1
Playlist | 1.41 | 53.04 % R1

Gender. The sensitive graph models the gender of user profiles. We notice that the discriminant attributes toward male are focused around sports, games and software. On the other hand, the discriminant attributes toward female are focused around health, home and luxury. The accuracy in AUC of inferring the gender is higher than 0.83 in dataset D1 and higher than 0.67 in dataset D2, as soon as the target publishes values concerning at least 2 attributes selected by the cleanser.

Table 5: Selected attributes in D1 for gender.
Attribute | Importance | Discrimination
Sports League | 4.22 | 75.97 % Male
Recreation & Sports Website | 3.80 | 77.09 % Male
Video Game | 3.66 | 80.16 % Male
Cars | 3.25 | 73.15 % Male
Amateur Sports Team | 3.03 | 72.86 % Male
Sport | 2.80 | 73.07 % Male
Jewelry/Watches | 2.72 | 56.26 % Female
Electronics | 2.68 | 73.19 % Male
Software | 2.52 | 77.23 % Male
Outdoor & Sporting Goods | 2.35 | 77.19 % Male
Women’s Clothing Store | 2.35 | 77.28 % Female
Home Decor | 2.29 | 54.60 % Female
Stadium, Arena & Sports Venue | 2.28 | 74.45 % Male
Baby Goods/Kids Goods | 2.14 | 66.61 % Female
Kitchen/Cooking | 2.08 | 55.93 % Female
Bags/Luggage | 2.04 | 59.16 % Female
Beauty, Cosmetic & Personal Care | 2.03 | 60.59 % Female
Cosmetics Store | 1.98 | 66.25 % Female
Hair Salon | 1.92 | 61.44 % Female
Home & Garden Website | 1.72 | 55.18 % Female

inference, since the vocabulary is limited to user profiles, super-values (i.e. clusters of values) and sensitive values. The ranking task has to compute the cosine similarity between a few vectors that represent the sensitive values and the target user vector. The cleansing task selects the most important attributes and reduces the vocabulary. Consequently, this process accelerates the inference tasks (relying on Random walk and Word2Vec). Moreover, cleansing increases accuracy by discarding irrelevant information.

7.4 Parameter sensitivity analysis
Let us investigate the impact of the cleansing parameters lr, cr and hr. All experiments detailed in this section are conducted on dataset D1 to infer users' political orientation.
Table 7 shows that only 3 graphs among the 1928 available graphs have a learning rate lr higher than 30%. Based on those graphs, inference accuracy can be very low. For instance, the inference accuracy based on the gender attribute is only 0.36. Based only on the users (i.e. friendship) graph, accuracy improves to 0.64. The communities graph gives a high accuracy of 0.74 for inferring political views. However, we notice that the best accuracy is obtained when selecting graphs with a learning rate between 10% and 40%. Table 7 shows that the learning rate parameter lr is important for selecting the best graphs for inference. However, accuracy does not depend only on this parameter, since some graphs that have a high learning rate, such as the gender graph, may lead to very low accuracy results.
Table 8 shows that when the Hamming rate hr decreases, accuracy increases. However, most graphs have a low Hamming rate because only a small part of them can be compared to the sensitive graph, as few users publish their preferences in both graphs. Hence, their structure is not fairly comparable to the politicians' graph structure. To cope with this problem, we compute a third parameter, the confidence rate cr, that indicates how reliable the structure comparison is.
Table 9 shows that the confidence rate cr does not give information about the best graph to select when it is considered in isolation; it must be coupled with other parameters. For instance, if a given graph g has a high confidence rate but a low Hamming rate, that means that
it is a good graph for inference. However, if a given graph 𝑔
has a high confidence rate and high Hamming distance rate
7.3 Processing times that means that 𝑔 is probably a bad graph for infering the
sensitive attribute. But a given graph 𝑔 could be interesting
Table 6 displays the processing times. The processor clock
for inference if it has low 𝑐𝑟 and high ℎ𝑟.
is 2.3 GHz. Cleansing and random walk algorithms are
not parallelized. Cleansing takes more time than the other
processes in the case of gender inference since it handles 8 CONCLUSION
hundreds of thousands of nodes, compares hundreds of SONSAI application enables users to predict their sensitive
graphs to the sensitive graph and computes their impor- links in social networks from relatively small amount of data
tance. The random walk, in the case of gender inference, is and computing resources. Indeed, sensitive data inferences
performed on a small graph containing only a few dozen are fast and accurate on typical personal attributes. It
of super-values and a few thousands of user profiles. On should be noted that the friendship graph was not selected
the other hand, in the case of politicians inference, the among important ones to deduce both the gender and
task is performed on larger graphs containing dozens of the relationship status of users. This probably means that
thousands of values. The machine disposes only of 8GB alternative techniques based solely on homophily would
of RAM memory. Each chunk of 5k steps is stored sepa- be inaccurate in this context. Moreover, we have observed
rately in a text file of about 25MB. Those files are then that the privacy of users is threatened as soon as they start
parsed by Word2Vec. Word2Vec speed depends on the size publishing at least three important attributes. These ones
of the document vocabulary. It is fast in the case of gender are automatically brought to light by SONSAI, regardless
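The graph-selection heuristic that emerges from Section 7.4 (prefer graphs whose learning rate falls in the reported 10%–40% window, and whose structure is reliably close to the sensitive graph) can be sketched as a simple filter. This is an illustrative interpretation, not SONSAI’s actual code: the record layout and the ℎ𝑟/𝑐𝑟 thresholds below are assumptions, loosely guided by the best buckets in Tables 7–9.

```python
# One record per learning graph, with the three cleansing parameters
# in percent: learning rate lr, Hamming rate hr, confidence rate cr.
def select_graphs(graphs, lr_window=(10.0, 40.0), hr_max=5.0):
    """Keep graphs whose lr lies in the best reported window and whose
    structure comparison is favorable: a low hr, or a cr low enough
    that a high hr is not conclusive (cf. Section 7.4)."""
    selected = []
    for g in graphs:
        lr_ok = lr_window[0] <= g["lr"] < lr_window[1]
        structure_ok = g["hr"] < hr_max or (g["cr"] < 10.0 and g["hr"] >= hr_max)
        if lr_ok and structure_ok:
            selected.append(g["name"])
    return selected

# Hypothetical parameter values for three graphs.
graphs = [
    {"name": "communities", "lr": 45.0, "hr": 3.0, "cr": 95.0},
    {"name": "music",       "lr": 25.0, "hr": 2.0, "cr": 92.0},
    {"name": "genders",     "lr": 75.0, "hr": 4.0, "cr": 90.0},
]
print(select_graphs(graphs))  # ['music']
```

With these toy values only the music graph survives: the communities and genders graphs are rejected by the learning-rate window alone, matching the observation that high-𝑙𝑟 graphs can still yield poor accuracy.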
Process                     Cleansing   Random walk   Word2Vec   Ranking
Gender inference, D1        423         34            50         0.12
Gender inference, D2        243         25            30         0.12
Politicians inference, D1   782         523           924        1
Politicians inference, D2   574         451           733        1
Table 6: Processing times (in seconds).
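The tiny Ranking column in Table 6 reflects the fact that, after cleansing, ranking reduces to a handful of cosine similarities between the target user’s embedding and the embeddings of the candidate sensitive values. A minimal NumPy sketch (the vector names and 4-dimensional toy embeddings are illustrative, not taken from the paper):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_sensitive_values(user_vec, value_vecs):
    # Score every candidate sensitive value against the target user
    # and return them sorted from most to least similar.
    scored = [(name, cosine(user_vec, vec)) for name, vec in value_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings for a target user and two candidate sensitive values.
user = np.array([1.0, 0.0, 1.0, 0.0])
candidates = {
    "value_A": np.array([0.9, 0.1, 1.0, 0.0]),
    "value_B": np.array([0.0, 1.0, 0.0, 1.0]),
}
ranking = rank_sensitive_values(user, candidates)
print(ranking[0][0])  # top-ranked (predicted) sensitive value: value_A
```

Since there are only a few candidate values per sensitive attribute, this step is O(number of values × embedding dimension), which is consistent with the sub-second ranking times reported above.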
𝑙𝑟 (in %)   Accuracy in AUC   ♯ selected graphs   ♯ attacked targets   ♯ masked links
[0, 10[     0.61              1873                252                  409
[10, 20[    0.80              31                  254                  411
[20, 30[    0.86              16                  254                  418
[30, 40[    0.80              5                   253                  410
[40, 50[    0.74              1 (Communities)     251                  408
[70, 80[    0.36              1 (Genders)         213                  353
[80, 90]    0.64              1 (Users)           214                  350
Table 7: Impact of 𝑙𝑟 on inference accuracy.

ℎ𝑟 (in %)   Accuracy in AUC   ♯ selected graphs   ♯ attacked targets   ♯ masked links
[0, 5[      0.68              1744                253                  410
[5, 10[     0.59              87                  177                  304
[10, 20[    0.53              58                  87                   167
[20, 30[    0.45              11                  83                   129
[30, 40[    0.42              13                  11                   21
[40, 50[    0.42              5                   2                    3
[50, 100]   0.41              10                  211                  351
Table 8: Impact of ℎ𝑟 on inference accuracy.

𝑐𝑟 (in %)   Accuracy in AUC   ♯ selected graphs   ♯ attacked targets   ♯ masked links
[0, 10[     0.63              1711                245                  409
[10, 20[    0.43              94                  246                  407
[20, 30[    0.74              37                  245                  405
[30, 40[    0.54              28                  248                  404
[40, 50[    0.72              16                  247                  410
[50, 60[    0.38              14                  250                  407
[60, 70[    0.63              8                   248                  403
[70, 80[    0.60              6                   248                  393
[80, 90[    0.65              11                  255                  419
[90, 100]   0.82              3                   253                  410
Table 9: Impact of 𝑐𝑟 on inference accuracy.

SONSAI uncovers these attributes through a purely structural analysis of the social network graph. As future work, we plan to incorporate countermeasures into our tool to protect users against posts that might compromise their privacy. We also plan to enhance the SONSAI prediction engine with a tool that discloses hidden friendship links using adequate combinations of queries provided by the social network [2].

Acknowledgments. This work is supported by the MAIF Foundation (www.fondation-maif.fr/).

REFERENCES
[1] Chaabane Abdelberi, Gergely Ács, and Mohamed Ali Kâafar. 2012. You are what you like! Information leakage through users’ Interests. In 19th Annual Network and Distributed System Security Symposium, NDSS 2012, San Diego, California, USA, February 5-8, 2012. https://www.ndss-symposium.org/ndss2012/you-are-what-you-information-leakage-through-users-interests
[2] Younes Abid, Abdessamad Imine, Amedeo Napoli, Chedy Raïssi, and Michaël Rusinowitch. 2016. Online Link Disclosure Strategies for Social Networks. In Risks and Security of Internet and Systems - 11th International Conference, CRiSIS 2016, Roscoff, France, September 5-7, 2016, Revised Selected Papers. 153–168. https://doi.org/10.1007/978-3-319-54876-0_13
[3] Younes Abid, Abdessamad Imine, Amedeo Napoli, Chedy Raïssi, and Michaël Rusinowitch. 2017. Two-Phase Preference Disclosure in Attributed Social Networks. In Database and Expert Systems Applications - 28th International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part I. 249–263. https://doi.org/10.1007/978-3-319-64468-4_19
[4] Anton Alekseev and Sergey I. Nikolenko. 2017. Word Embeddings for User Profiling in Online Social Networks. Computación y Sistemas 21, 2 (2017). http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2734
[5] Michael Backes, Mathias Humbert, Jun Pang, and Yang Zhang. 2017. walk2friends: Inferring Social Links from Mobility Profiles. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017. 1943–1957. https://doi.org/10.1145/3133956.3133972
[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3 (2003), 1137–1155.
[7] Vladimir Estivill-Castro, Peter Hough, and Md Zahidul Islam. 2014. Empowering users of social networks to assess their privacy risks. In 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014. 644–649. https://doi.org/10.1109/BigData.2014.7004287
[8] Fei Gao, Katarzyna Musial, Colin Cooper, and Sophia Tsoka. 2015. Link Prediction Methods and Their Accuracy for Different Social Networks and Network Metrics. Scientific Programming 2015 (2015), 172879:1–172879:13. https://doi.org/10.1155/2015/172879
[9] Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. CoRR abs/1402.3722 (2014).
[10] Virgil Griffith. 2007. music that makes you dumb. (2007). http://musicthatmakesyoudumb.virgil.gr/
[11] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 855–864. https://doi.org/10.1145/2939672.2939754
[12] R. Heatherly, M. Kantarcioglu, and B. Thuraisingham. 2013. Preventing Private Information Inference Attacks on Social Networks. IEEE Transactions on Knowledge and Data Engineering 25, 8 (Aug 2013), 1849–1862. https://doi.org/10.1109/TKDE.2012.120
[13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. CoRR abs/1310.4546 (2013).
[14] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: online learning of social representations. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, August 24-27, 2014. 701–710.
[15] Eunsu Ryu, Yao Rong, Jie Li, and Ashwin Machanavajjhala. 2013. curso: protect yourself from curse of attribute inference: a social network privacy-analyzer. In Proceedings of the 3rd ACM SIGMOD Workshop on Databases and Social Networks, DBSocial 2013, New York, NY, USA, June 23, 2013. 13–18. https://doi.org/10.1145/2484702.2484706
[16] Adriaan M. J. Schakel and Benjamin J. Wilson. 2015. Measuring Word Significance using Distributed Representations of Words. CoRR abs/1508.02297 (2015).
[17] John Street. 2012. Music & Politics. Polity Press.