<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Aug</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sensitive attribute prediction for social networks users</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Younes Abid</string-name>
          <email>younes.abid@loria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdessamad Imine</string-name>
          <email>abdessamad.imine@loria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michaël Rusinowitch</string-name>
          <email>michael.rusinowitch@loria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lorraine University</institution>
          ,
          <addr-line>Cnrs, Inria, Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lorraine University</institution>
          ,
          <addr-line>Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1849</volume>
      <fpage>153</fpage>
      <lpage>168</lpage>
      <abstract>
        <p>Social networks are popular means of data sharing but they are vulnerable to privacy breaches. For instance, by relating users with similar profiles, an entity can predict personal data with high probability. We present SONSAI, a tool that helps Facebook users protect their private information from such inferences. The system samples a subnetwork centered on the user, cleanses the collected public data and predicts user sensitive attribute values by leveraging machine learning techniques. Since SONSAI displays the most relevant attributes exploited by each inference, the user can modify them to prevent undesirable inferences. The tool is designed to perform reasonably with the limited resources of a personal computer, by collecting and processing a relatively small relevant part of the network data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Data published on social networks profiles can be mined
for inferring sensitive information about users. For instance
it was shown in [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] how musical tastes allow one to
predict educational level. To increase user awareness about
these privacy threats we have designed a tool, SONSAI, for
Facebook users to audit their profiles. The system crawls
the network around the user and predicts the user's sensitive
attribute values using a machine learning engine. The results
provided by SONSAI also show the public attributes of
the user that have oriented the learning algorithm towards
a particular sensitive attribute value. The user can therefore
modify these public attributes to prevent inference.
      </p>
      <p>For the approach to be feasible several problems have
to be solved: First, data collection by crawling is limited
both by the social network and by country regulations.
Hence the crawler exploration strategy has to focus only on
meaningful representative network nodes. Since attributes
are numerous, for the learning program to scale one has
to select only the most relevant ones for inferring sensitive
attribute values. Hence the second problem is to find an
attribute relevance measure that is both accurate and easy
to compute. Note that we cannot rely on semantic proximity
since a user who hides a sensitive attribute will probably
hide semantically related ones too. Moreover, for fully
anonymised datasets the attribute semantics is hard to
recover. Therefore we follow an alternative approach
by modelling attributes as bipartite graphs and measuring
the relevance of attributes by comparing their bipartite graph
structures.</p>
      <p>For specific attributes such as gender and relationship
status, the sets of values are much smaller than for other
attributes like music and movies. Consequently, the graphs
that model these attributes have higher connectivity than
the other graphs. For instance, the density of the graph
that models gender (as most users publish their gender) is
close to 0.5. In order to infer hidden links in such graphs
we need to learn from highly connected graphs. However
most of the available learning graphs are sparse. To cope
with this last problem, we derive new graphs by merging
several learning graphs.</p>
      <p>All the proposed methods have been implemented for
Facebook, however they can be applied to many other
social networks. The system has been tested by several
volunteer users for auditing their Facebook profiles. In each
case a dataset was built from real profiles collected in the
user's neighborhood network.</p>
      <p>
        Related works. In [15] the authors propose algorithms to
detect whether a sensitive attribute value can be inferred
from the neighborhood of a target user in a social
network. Heatherly et al. [12] infer attribute values in social
networks with Bayesian classification techniques. For the
same purpose Estivill-Castro et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employ decision-tree
learning algorithms. In these works learning is performed
off-line on large datasets. In order to perform attribute
inference from sparser datasets collected in a short time by
our tool in the user's ego-network, we rather use random
walk-based learning. The random walk technique has been
applied to social representations in [14] and [11] where the
authors analysed friendships and used a skip-gram model.
In these works, user profiles that have similar friends will
be mapped to similar representations. This model helps
to detect communities and can predict a set of potential
friends for a given user profile. Skip-gram model has also
been applied recently to infer social relationship from
mobility data [5]. Our work uses a more adapted Continuous
Bag of Words (CBOW) model to predict the most likely
values for a user sensitive attribute. In our setting friends
are considered as an attribute among others. Moreover
we first determine automatically a relevant subset of the
social network for optimizing random walks. In [3] the
relevance of attributes is computed by Bayesian optimisation,
which is much less efficient than the graph comparison
approach adopted here. In particular [3] does not succeed
in reasonable time (on a standard PC) with attributes like
gender and relationship status. Let us note that (unlike
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) our approach does not need any ontology to perform
semantic correlation between attributes. Finally we notice
that the proposed system SONSAI is close in some
aspects to a recommendation system: an item suggestion
can be viewed as an attribute value prediction [4].
However, unlike recommendation systems, SONSAI also provides
explanations for the predicted values, namely an ordered
list of attributes that have played a significant role in the
computation. For enforcing privacy, our final goal will be
to reduce the “recommendation accuracy” by acting on
this list of attributes.
      </p>
    </sec>
    <sec id="sec-2">
      <title>ARCHITECTURE OF SONSAI</title>
      <p>The architecture of SONSAI is overviewed in Figure 1.
The Facebook crawler (about 5k lines of Java 8) drives a
Firefox 58.0b4 browser through a Selenium 3.5.3 server1.
The information collected from each profile, group and page is
stored in separate XML files. The Anonymizer, Cleanser,
Random walker and Ranker components are written in
Python 2.7 (about 2.5k lines of code). The Anonymizer
component parses all the XML files, generates anonymized
graphs and stores them as TSV files. Then, the Cleanser
selects the most relevant graphs and stores them as
adjacency lists. The Random walker component browses the
adjacency lists and stores the resulting walks in a text
document. We use the Python gensim2 implementation of
Word2Vec to parse the text document and compute a
vectorial representation of the social network nodes encountered
in the walks. Finally, the Ranker component classifies the
sensitive nodes according to their similarity to the target
user profile.</p>
    </sec>
    <sec id="sec-3">
      <title>SAMPLING FACEBOOK</title>
      <p>We have designed a crawler that explores the social network
around a user (to some distance given as a parameter) and
collects public information from the visited web pages (i.e.
friends, liked pages . . . ) in order to build a representative
subnetwork. We distinguish two types of Facebook nodes:
user profiles (u) and pages (p), and two types of links:
like-ships between user profiles and pages, and friendships
between user profiles. Given a node v, N(v) denotes the
set of nodes that are linked to v. A discovered node is a
node whose URL is known by the crawler. For instance, if
the crawler retrieves a user profile and collects its public
friends list, then all the friends of that particular user
profile are discovered. Algorithm 1 crawls at most n nodes
at distance ≤ d from the target node u0. Each iteration of
the outer loop samples a node, crawls it and updates the sets
of discovered and crawled nodes. The sampling is done by
random walks of length ≤ d with a transition probability p
designed to crawl closer nodes with higher priority and to
favour neighbour nodes according to their type. The function
sinks(j) returns the set of sinks, i.e. crawled nodes such
that all discovered nodes at distance ≤ j are also crawled.
Sinks are avoided by the random walks to guarantee that
the final node has not been crawled yet.
1http://www.seleniumhq.org/
2https://radimrehurek.com/gensim/index.html</p>
      <p>1 Procedure crawl_nodes(u0, n, d, p)
2   crawl_node(u0);
3   while |crawled_nodes| &lt; n do
4     v ← u0; S ← {};
5     for i ← 1 to d by 1 do
6       S.addAll(sinks(d − i));
7       v ← random_select_in({N(v) ∪ {v}} ∖ S);
8     end
9     crawl_node(v);
10  end</p>
      <p>Algorithm 1: Crawling nodes around a target user.</p>
    </sec>
    <sec id="sec-4">
      <title>SOCIAL NETWORK MODEL</title>
      <p>Modelling friendship relations. Since friendship on
Facebook is symmetric, we model friendship between user
profiles by an undirected graph (U, F) where U is a set of
user profiles and F is a set of friendship links between
them.</p>
      <p>Modelling page like-ship. We model like-ship between
user profiles and pages by several bipartite graphs (U, P, L)
where U is a set of user profiles, P is a set of pages (of a given
type) and L is a set of like-ship links between them. Figure
2 shows an example of page like-ship modelled by two
graphs3. One graph models liked pages of music type and
the other models liked pages of book type. We note that
user profiles can like several pages of the same type.</p>
      <p>Anonymizing the social network graphs. For ethical and
regulatory reasons Facebook identifiers are replaced by fresh
identifiers. Each node in the network is then identified
by a unique integer ID replacing its Facebook ID. The
anonymized IDs are sorted according to the node types.
Anonymized graphs are saved under the tab-separated
value (TSV) format, one of the most general
delimiter-separated values (DSV) formats. TSV is widely used in
graph exchange.
3Icon made by Smashicons from www.flaticon.com</p>
      <p>In contrast to the dataset released by4, where only user
IDs are anonymised (but not movie titles, ratings and dates
of rating), all attribute values are
anonymised in our datasets.</p>
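      <p>A minimal sketch of this anonymization step, assuming edges come as pairs of raw identifiers (the type-based sorting of IDs is omitted, and the function names are illustrative):</p>

```python
import csv, io

def anonymize_edges(edges):
    """Replace raw identifiers by fresh integer IDs (first-seen order)."""
    ids = {}
    fresh = lambda node: ids.setdefault(node, len(ids))
    return [(fresh(a), fresh(b)) for a, b in edges]

edges = [("alice.fb", "page/rockband"), ("bob.fb", "page/rockband")]
anon = anonymize_edges(edges)

# Dump the anonymized edge list in TSV form
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(anon)
print(buf.getvalue())
```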
      <p>In the following a sensitive graph is a graph that models
an attribute that is considered sensitive by the user (i.e.
the user does not want its value to be predictable). The
learning graphs are the attribute graphs available to the
learning module in order to predict hidden links in the sensitive
graph. We do not rely on attribute semantics since i) a user
that hides a sensitive attribute will probably also hide
other semantically-related attributes, and moreover ii) for
fully anonymised datasets the attributes semantics is hard
to recover. On the contrary we will rely on comparing
attribute graph structures.</p>
      <p>In the following we assume a fixed sensitive graph S.</p>
      <p>Step 1: Computing the learning and the confidence rates
of learning graphs. In order to compare the structure of
a given learning graph to the structure of the sensitive
graph, we first split each graph in two parts. The first part
contains the user profiles that hide their links in the sensitive
graph. The ratio of user profiles that publish their links in
the first part of the learning graph is the learning rate l.
The second part contains the user profiles that publish
their links in the sensitive graph. The ratio of user profiles
that publish their links in the second part of the learning
graph is the confidence rate c.</p>
      <p>The function connected_profiles_in(G) returns the set
of user profiles that publish their links in the graph G given
as argument. Given a learning graph G, a sensitive graph S
and the set U of user profiles in all graphs, we define:</p>
      <p>U_G = connected_profiles_in(G), U_S = connected_profiles_in(S),
l(G) = |U_G ∩ (U ∖ U_S)| / |U ∖ U_S| and c(G) = |U_G ∩ U_S| / |U_S|.</p>
      <p>Step 2: Computing the Hamming rates of learning graphs.
The similarity between two user profiles u1 and u2 in a given
graph G is computed as follows, where the function N_G(u)
returns the set of nodes to which user node u is connected
in the graph G:</p>
      <p>sim_G(u1, u2) = |N_G(u1) ∩ N_G(u2)| / |N_G(u1) ∪ N_G(u2)| (1)</p>
      <p>The Hamming distance H between graphs G and S is
defined by:</p>
      <p>H(G) = ∑_{ui ≠ uj; ui, uj ∈ U_G ∩ U_S} |sim_G(ui, uj) − sim_S(ui, uj)| (2)</p>
      <p>In order to compare learning graphs with different sets
of common connected profiles U_G ∩ U_S, we normalize this
distance by the maximal Hamming distance that can be
obtained on such a set. Hence we define the Hamming rate
h(G) = H(G)/M(G), where M(G) is given by:</p>
      <p>M(G) = ∑_{ui ≠ uj; ui, uj ∈ U_G ∩ U_S} max(sim_S(ui, uj), 1 − sim_S(ui, uj)) (3)</p>
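      <p>These cleansing metrics can be computed directly from adjacency lists. A toy sketch (the data, symbols and function names are illustrative, assuming the notation above: U is the set of all user profiles, N_G maps each user to its neighbours in graph G, and S is the sensitive graph):</p>

```python
from itertools import combinations

def rates(N_G, N_S, U):
    """Learning rate l(G) and confidence rate c(G) of graph G w.r.t. S."""
    U_G = {u for u in U if N_G.get(u)}
    U_S = {u for u in U if N_S.get(u)}
    l = len(U_G & (U - U_S)) / len(U - U_S)   # users hiding sensitive links
    c = len(U_G & U_S) / len(U_S)             # users publishing them
    return l, c

def jaccard(N, u1, u2):
    a, b = set(N.get(u1, ())), set(N.get(u2, ()))
    return len(a & b) / len(a | b) if a | b else 0.0

def hamming_rate(N_G, N_S, U):
    """h(G) = H(G) / M(G) over the common connected profiles."""
    common = {u for u in U if N_G.get(u)} & {u for u in U if N_S.get(u)}
    pairs = list(combinations(sorted(common), 2))
    H = sum(abs(jaccard(N_G, u, v) - jaccard(N_S, u, v)) for u, v in pairs)
    M = sum(max(s, 1 - s)
            for u, v in pairs for s in [jaccard(N_S, u, v)])
    return H / M if M else 0.0

U = {"u1", "u2", "u3", "u4"}
N_S = {"u1": ["p1"], "u2": ["p1"]}               # u3, u4 hide their links
N_G = {"u1": ["m1"], "u2": ["m1"], "u3": ["m2"]}
print(rates(N_G, N_S, U))
print(hamming_rate(N_G, N_S, U))
```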
    </sec>
    <sec id="sec-5">
      <title>Densifying graphs</title>
      <p>Step 3: Selecting the most relevant graphs for learning
sensitive attribute values. We first discard the learning graphs
that have a learning rate l lower than a threshold t_l since
they do not convey enough information. We then discard
the graphs that have a confidence rate c lower than t_c
since they are considered as unreliable. Finally, from the
remaining graphs we only select the graphs that have a
Hamming rate h no higher than t_h since they are the most
similar to the sensitive graph.</p>
      <p>For some sensitive attributes such as gender, age and
relationship status, user profiles are linked to at most one value.
Moreover the sets of values for these particular attributes
are much smaller than for other attributes. Consequently,
the graphs that model these attributes are denser than the
other ones. In this case, for improving the random-walk
based learning process (see Section 6) we need to merge
several learning graphs in order to obtain a denser one.
We explain the method with a simple example of gender
prediction: we select attribute graphs with high l and c rates
but also with a good rate of discrimination between
genders, that is, the gender of connected users in the graph is
unbalanced between male and female. For instance we can
select the jewelry and fast-food graphs. We merge these graphs
by grouping all fast-foods in a unique node and similarly
for jewelries, as shown in Figure 4, to obtain a new learning
graph.</p>
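      <p>A small sketch of the threshold-based selection and of the merging trick. The thresholds, graph names and data below are made-up examples, not the paper's values:</p>

```python
def select_graphs(stats, t_l, t_c, t_h):
    """Keep graphs whose learning rate l, confidence rate c and
    Hamming rate h pass the three thresholds."""
    return [g for g, (l, c, h) in stats.items()
            if l >= t_l and c >= t_c and h <= t_h]

def merge_as_supernodes(likes_by_type):
    """Collapse all pages of a type into one super-node, keeping one
    link per user who liked at least one page of that type."""
    merged = {}
    for page_type, likes in likes_by_type.items():
        for user in likes:
            merged.setdefault(user, set()).add(page_type)
    return merged

stats = {"jewelry": (0.3, 0.7, 0.1),      # (l, c, h) per attribute graph
         "fast-food": (0.4, 0.8, 0.2),
         "tv-shows": (0.05, 0.2, 0.6)}
kept = select_graphs(stats, t_l=0.2, t_c=0.6, t_h=0.3)

likes = {"jewelry": {"u1": ["ring-page"]},
         "fast-food": {"u1": ["burger-page"], "u2": ["pizza-page"]}}
merged = merge_as_supernodes(likes)
print(kept, merged)
```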
      <p>To specify the random walk transitions we first define
the probability p_{u,k} that, being in node u, the next node in
the random walk is in a selected graph G_k:</p>
      <p>p_{u,k} = d_k(u) / ∑_{j | d_j(u) &gt; 0} d_j(u) if d_k(u) &gt; 0, and
p_{u,k} = 0 otherwise, (4)</p>
      <p>where d_k(u) is the degree of user u in graph G_k. A value
node v is followed by a user node chosen uniformly at
random from the ones connected to v. Assuming the node
following user node u is in G_k, then it will be chosen
uniformly at random from the nodes in G_k that are connected
to u. Therefore, the transition probabilities are (G_1 is the
friendship graph):</p>
      <p>u → u′ : p_{u,1} / d_1(u) if (u, u′) ∈ G_1;
u → v : p_{u,k} / d_k(u) if k ≠ 1 and (u, v) ∈ G_k;
v → u : 1 / d_k(v) if k ≠ 1 and (u, v) ∈ G_k. (5)</p>
      <p>As illustrated in Figure 5 the document is constructed
by connecting all graphs through random jumps between
their nodes (see also [14]). At each step the walker state
changes and a new word is written in the text document.
One step amounts to select a graph where the current node
occurs with non null degree, and then to select a node that
is connected to the current node in the selected graph. The
selected node then becomes the new current node.</p>
      <p>In this example we aim to predict the liked pages of
politicians masked by user profile u3. The sensitive graph is
Graph 2 and the learning graphs are Graph 1 and Graph
3. Since the values of the sensitive attribute (the pages
of politicians) are labelled (each value belongs to a unique
cluster), they are represented by the label of their cluster
in the final document. We use a greedy clustering
algorithm [3] to define clusters of similar size. Pages of politicians
that share many common "likers" end up in the same
cluster. For instance the first walk depicted by Figure 5 is
[u1, u4, v2,3, u4]. But for efficiency the walk [u1, u4, c2,2, u4]
is stored instead in the document since the value v2,3
belongs to the cluster c2,2.</p>
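      <p>This substitution of sensitive values by their cluster labels amounts to one lookup per walk token (the node and cluster names below are illustrative):</p>

```python
# Map each sensitive value node to the label of its cluster
cluster_of = {"v2,1": "c2,1", "v2,2": "c2,2", "v2,3": "c2,2"}

def relabel(walk):
    """Store cluster labels instead of raw sensitive value nodes."""
    return [cluster_of.get(node, node) for node in walk]

print(relabel(["u1", "u4", "v2,3", "u4"]))
```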
      <p>Applying word2vec to compute node representations. We
have performed multi-graph random walks in the social
network and generated a text document. Walks in the
document can be interpreted as sentences, where the words
are network nodes. Hence, inferring a link between a user
node and an attribute value node is similar to the problem
of estimating the likelihood of word co-occurrences in a
corpus. We use word2vec [9, 13] to map one-hot encoded
vectors that represent words in a high-dimensional
vocabulary space to low-dimensional vectors (see [6]). Word2vec
is employed with the Continuous Bag of Words (CBOW)
model for predicting a word given its context, defined by
the w − 1 words surrounding it (w is the window size of the
context). The inputs of the model in our case are users and
published attribute values. Since there is no order between the
attribute values, the CBOW model is more adequate than
Skip-gram. The output of the CBOW model is a vector of
size |V| (the vocabulary size) representing a probability
distribution of co-occurrence between all the words of the
context and each word from the vocabulary within a
window of size w.</p>
      <p>Ranking sensitive attribute values. We measure semantic
similarity between two nodes by the cosine of the angle
formed by the vectors representing the nodes. Cosine
similarity is known to take greater account of context
information. We rank all the sensitive values by their cosine
similarity to the target user profile. The values that have
the lowest rank are most likely the sensitive attribute
secret values of the target user profile, where secret values
are actually the true values of the target user but are not
published by him on the social network.</p>
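      <p>The ranking step reduces to sorting candidate vectors by cosine similarity to the target vector. A minimal sketch with made-up 2-dimensional vectors:</p>

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank(target, candidates):
    """Most similar candidate first (rank 1 = most likely secret value)."""
    return sorted(candidates,
                  key=lambda name: cosine(target, candidates[name]),
                  reverse=True)

u1 = (1.0, 0.2)                                  # target user vector
clusters = {"c2,1": (0.9, 0.1),                  # sensitive value vectors
            "c2,2": (-0.5, 0.8),
            "c2,3": (0.1, 1.0)}
print(rank(u1, clusters))
```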
      <p>Figure 6 depicts an example of 2-dimensional vectors
that encode 8 nodes: 3 user profiles (u1, u2 and u3), 2
music pages (p3,1 and p3,2) and 3 clusters of pages of
politicians (c2,1, c2,2 and c2,3). The clusters of the pages of
politicians are the sensitive values and their vectors are red.
The node u1 is the target user profile and its vector is blue.
The clusters of pages of politicians are ranked according
to their distances to u1, and the inference algorithm infers,
as the most probable pages of politicians to be liked by u1,
the pages of politicians of the cluster that has the smallest
rank (the closest cluster to u1).</p>
      <p>In [16] Schakel et al. show that the word2vec unsupervised
learning algorithm encodes word semantics by orienting
the vectors of co-occurring words in the same direction during
training. Besides, the magnitude of a vector reflects both
the frequency of appearance of the related word in the corpus
and the homogeneity of its contexts, where a context is a set
of words that have a high co-occurrence probability in the
corpus.</p>
      <p>In fact, the words that appear in the same contexts have
small angular distances between them. The less overlapping
the contexts are, the larger the angular distances between
their different words are. However, words that appear in
many contexts are represented by vectors that average
vectors pointing in many context directions. Hence, the
vector magnitude generally decreases with respect to the
number of contexts. Moreover, the higher the word
frequency is, the higher the chance that it is used in different
contexts. Consequently, the vector magnitude also
decreases with respect to frequency. From these remarks, we
conclude that the Euclidean distance is not a good measure
for our inference purpose. Actually, words that appear in
many contexts have low magnitude. As a result, their
Euclidean distances will be small and, using this criterion, they
would be considered close even if they do not appear in
any common context. For instance, the Euclidean distance
between the clusters of pages of the most popular
politicians will be small even if they are rivals. In the example
depicted by Figure 6 the politicians of the clusters c2,1 and
c2,2 are rivals. The angular distance between those two
clusters is large. However, the Euclidean distance is small.
Moreover, the Euclidean distance between a user that has
many friends, for instance the user u1 in Figure 6, and
a popular song like “Despacito”, for instance one of the
music pages in Figure 6, will be small. But popular users
do not necessarily like popular music.</p>
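      <p>The remark above can be checked numerically: two low-magnitude vectors pointing in opposite directions are close in Euclidean distance yet maximally far apart angularly (the vectors are made up for illustration):</p>

```python
import math

a, b = (0.1, 0.0), (-0.1, 0.0)   # e.g. two popular but rival clusters

euclid = math.dist(a, b)          # small Euclidean distance
cos = sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

print(euclid, cos)                # cosine is ~ -1: opposite directions
```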
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTS</title>
      <p>To build datasets we have crawled Facebook profiles of
people that live in North-East France and in Île-de-France
(Paris). Table 1 gives statistics about the two crawled
datasets.</p>
      <p>Dataset 1 (D1) was collected in North-East France and
Dataset 2 (D2) in Île-de-France (Paris). We set the
dimension of the node vectors to 512. The dimension is
usually taken between 100 and 300
for natural languages. However the size of the vocabulary
in social networks (equal to the number of nodes) is much
higher than in natural languages.</p>
      <p>In Dataset 1 the sensitive graph represents the links
between 2554 user profiles and 4589 politician pages. For
each experiment we generate a new social graph from the
dataset by selecting the user profiles that publish their
preferences concerning the sensitive attribute (pages of
politicians) and at least another attribute. Then we remove
all the links in the sensitive graph of 10% of the selected user
profiles. The algorithm makes sure that all the nodes in the
resulting social graph remain connected. The experiments
then consisted in inferring the hidden links based on
information from the learning graphs.</p>
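      <p>The evaluation protocol above (hide the sensitive links of 10% of the eligible users, then try to recover them) can be sketched as follows. The data and helper names are illustrative, and the connectivity check is omitted:</p>

```python
import random

def mask_links(sensitive, fraction, seed=0):
    """Hide all sensitive links of a random fraction of users and return
    the visible graph plus the hidden ground truth."""
    rng = random.Random(seed)
    users = sorted(sensitive)
    hidden = set(rng.sample(users, int(fraction * len(users))))
    visible = {u: links for u, links in sensitive.items() if u not in hidden}
    truth = {u: sensitive[u] for u in hidden}
    return visible, truth

sensitive = {f"u{i}": [f"pol{i % 3}"] for i in range(20)}
visible, truth = mask_links(sensitive, 0.10)
print(len(visible), len(truth))
```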
      <p>Among the 1928 learning graphs, we selected the ones
with a learning rate greater than t_l = 20%, a confidence rate
greater than t_c = 60% and a Hamming rate lower than the
threshold t_h.</p>
      <p>Table 2 lists the selected attribute graphs: Users
(friendship), Communities, Musicians/Bands, PublicFigures,
NonProfitOrganizations, Artists, Companies, Websites, TVShows,
EntertainmentWebsites, Media/NewsCompanies, Products/Services,
News/MediaWebsites, Organizations, Movies, LocalBusinesses,
Clothings, Gastronomies, Actors/Directors, Magazines, Athletes,
ApplicationPages and SportsTeams.</p>
      <p>G1 models the links between user profiles and
2357 political organizations, G2 models the links between
1120 user profiles and 1758 political parties and G3 models
the links between 39 user profiles and 41 political ideologies.
Although the selected graphs seem promising, the inference
accuracy is only 0.46. This can be explained by the fact that
the selected graphs are very sparse and users are vigilant
when publishing their preferences about those attributes.
Consequently, the algorithm cannot learn well from them.</p>
      <p>We note that the musicians/bands graph was
automatically selected by our relevance-based selection method,
confirming that music and politics are correlated, as was
empirically discovered in previous studies [17].</p>
      <p>Table 3 summarizes the results of the conducted
experiments.</p>
      <sec id="sec-6-1">
        <title>Selection based on relevance</title>
        <p>Relevance-based selection of 23 graphs yields an inference
accuracy of 0.79 over 252 attacked targets and 409 deleted links,
against an accuracy of 0.41 over 204 targets and 351 deleted
links for a random selection of 23 graphs (with on average
558125 versus 11200 nodes).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Gender and relationship status</title>
        <p>For sensitive attributes such as gender and relationship status, the
number of sensitive values is smaller. Moreover learning
graphs are obtained by grouping several attribute values in
the merging operation (see Section 5). For the experiments,
we have generated a new social graph from the dataset
by randomly selecting 10% of user profiles that publish
the value of their sensitive attribute and we have masked
it. Then SONSAI has tried to infer the masked values of
the sensitive attribute for each user based on the
attributes selected by the cleanser module.</p>
        <p>Relationship status. The sensitive graph models the
relationship status of user profiles. To simplify the presentation
we define two meta-relationship statuses, R1 and R2, each
grouping several of the Facebook relationship status values.</p>
        <p>We aim to infer the meta-relationship status of users.
Table 4 gives more details about the selected attributes
from dataset D2.</p>
        <p>We notice that the discriminant attributes toward R1 are
focused around education and leisure. On the other hand,
the discriminant attributes toward R2 are focused around
business. The accuracy in AUC of inferring the
meta-relationship status is higher than 0.7 in both datasets D1
and D2 as soon as the target publishes values concerning
at least 4 of the attributes selected by the cleanser.</p>
        <p>Gender. The sensitive graph models the gender of user
profiles. We notice that the discriminant attributes toward
male are focused around sports, games and software. On
the other hand, the discriminant attributes toward female
are focused around health, home and luxury. The accuracy
in AUC of inferring the gender is higher than 0.83 in dataset
D1 and higher than 0.67 in dataset D2 as soon as the target
publishes values concerning at least 2 of the attributes
selected by the cleanser.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7.3 Processing times</title>
      <p>Table 6 displays the processing times. The processor clock
is 2.3 GHz. Cleansing and random walk algorithms are
not parallelized. Cleansing takes more time than the other
processes in the case of gender inference since it handles
hundreds of thousands of nodes, compares hundreds of
graphs to the sensitive graph and computes their
importance. The random walk, in the case of gender inference, is
performed on a small graph containing only a few dozen
of super-values and a few thousands of user profiles. On
the other hand, in the case of politicians inference, the
task is performed on larger graphs containing dozens of
thousands of values. The machine disposes only of 8GB
of RAM memory. Each chunk of 5k steps is stored
separately in a text file of about 25MB. Those files are then
parsed by Word2Vec. Word2Vec speed depends on the size
of the document vocabulary. It is fast in the case of gender
inference since the vocabulary is limited to user profiles,
super-values (i.e. clusters of values) and sensitive values.</p>
      <p>The ranking task computes the cosine similarity between
a few vectors that represent the sensitive values and the
target user vector. The cleansing task selects the most
important attributes and reduces the vocabulary. Consequently,
this process accelerates the inference tasks (relying on the
random walk and Word2Vec). Moreover, cleansing increases
accuracy by discarding irrelevant information.</p>
    </sec>
    <sec id="sec-8">
      <title>7.4 Parameter sensitivity analysis</title>
      <p>Let us investigate the impact of the cleansing parameters
l, c and h. All experiments detailed in this section are
conducted on dataset D1 to infer users’ political orientation.</p>
      <p>Table 7 shows that only 3 graphs among the 1928
available graphs have a learning rate l higher than 30%. Based
on those graphs, inference accuracy can be very low. For
instance, inference accuracy based on the gender attribute is
only 0.36. Based only on the users (i.e. friendship) graph the
accuracy improves to 0.64. The communities graph gives a
high accuracy of 0.74 for inferring political views.
However, we notice that the best accuracy is obtained
when selecting graphs with a learning rate between 10% and
40%. Table 7 shows that the learning rate parameter l is
important to select the best graphs for inference. However,
accuracy does not depend only on this parameter since
some graphs, such as the gender graph, that have a high
learning rate may lead to very low accuracy results.</p>
      <p>Table 8 shows that when the Hamming rate h decreases,
accuracy increases. However, most graphs have a low
Hamming rate because only a small part of them can be
compared to the sensitive graph, as few users publish their
preferences in both graphs. Hence, their structure is not
fairly comparable to the politicians’ graph structure. To
cope with this problem we compute a third parameter, the
confidence rate c, that indicates how reliable the structure
comparison is.</p>
      <p>Table 9 shows that the confidence rate c does not
give information about the best graph to select when it
is considered in isolation; it must be coupled with the other
parameters. For instance, if a given graph G has a high
confidence rate but a low Hamming rate, that means that
it is a good graph for inference. However, if a given graph G
has a high confidence rate and a high Hamming rate, that
means that G is probably a bad graph for inferring the
sensitive attribute. But a given graph G could still be
interesting for inference if it has a low c and a high h.</p>
    </sec>
    <sec id="sec-9">
      <title>8 CONCLUSION</title>
      <p>The SONSAI application enables users to predict their sensitive
links in social networks with a relatively small amount of data
and computing resources. Indeed, sensitive data inferences
are fast and accurate on typical personal attributes. It
should be noted that the friendship graph was not selected
among important ones to deduce both the gender and
the relationship status of users. This probably means that
alternative techniques based solely on homophily would
be inaccurate in this context. Moreover, we have observed
that the privacy of users is threatened as soon as they start
publishing at least three important attributes. These ones
are automatically brought to light by SONSAI, regardless</p>
      <p>♯ attacked
targets
♯ masked
links</p>
      <p>Inference
accuracy in AUC
♯ selected
graphs
1744
♯ attacked
targets
♯ masked
links</p>
      <p>on inference accuracy.</p>
      <p>Inference
accuracy in AUC
♯ selected
graphs
1711
♯ attacked
targets
♯ masked
links
ℎ</p>
      <p>Foundation5.
their semantics and only through a structural analysis
of the social network graph. As future work, we plan to
incorporate countermeasures into our tool to protect users
against posts that might compromise their privacy. We
also plan to enhance SONSAI prediction engine with a tool
that permits one to disclose hidden friendship links using
adequate combinations of queries provided by the social
Acknowledgments. This work is supported by MAIF</p>
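      <p>The homophily-only baseline that the conclusion argues would be inaccurate here can be sketched as a simple majority vote over a user's friends (a hypothetical illustration with made-up data, not the SONSAI method):</p>

```python
from collections import Counter

def homophily_predict(user, friends_of, attribute_of):
    """Majority vote over a user's friends' disclosed attribute values.

    `friends_of` maps a user to a list of friends; `attribute_of` maps
    a user to a disclosed attribute value.  Returns None when no friend
    discloses the attribute.
    """
    votes = Counter(attribute_of[f]
                    for f in friends_of.get(user, [])
                    if f in attribute_of)
    return votes.most_common(1)[0][0] if votes else None

friends_of = {"alice": ["bob", "carol", "dan"]}
attribute_of = {"bob": "female", "carol": "female", "dan": "male"}
print(homophily_predict("alice", friends_of, attribute_of))  # prints female
```

      <p>Such a baseline relies entirely on the friendship graph, which is precisely the graph that was not selected as important here.</p>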
    </sec>
  </body>
</article>