        Sensitive attribute prediction for social networks users
               Younes Abid                             Abdessamad Imine                           Michaël Rusinowitch
           Lorraine University                   Lorraine University, Cnrs, Inria             Lorraine University, Cnrs, Inria
             Nancy, France                               Nancy, France                                Nancy, France
           younes.abid@loria.fr                    abdessamad.imine@loria.fr                   michael.rusinowitch@loria.fr
ABSTRACT
Social networks are popular means of data sharing but they are vulnerable to privacy breaches. For instance, by relating users with similar profiles, an entity can predict personal data with high probability. We present SONSAI, a tool that helps Facebook users protect their private information from such inferences. The system samples a subnetwork centered on the user, cleanses the collected public data and predicts the user's sensitive attribute values by leveraging machine learning techniques. Since SONSAI displays the most relevant attributes exploited by each inference, the user can modify them to prevent undesirable inferences. The tool is designed to perform reasonably with the limited resources of a personal computer, by collecting and processing a relatively small but relevant part of the network data.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the 21st International Conference on Extending Database Technology (EDBT) (March 26-29, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1    INTRODUCTION
Data published on social network profiles can be mined to infer sensitive information about users. For instance, it was shown in [10] how musical tastes allow one to predict educational level. To increase user awareness about these privacy threats we have designed a tool, SONSAI, for Facebook users to audit their profiles. The system crawls the network around the user and predicts the user's sensitive attribute values using a machine learning engine. The results provided by SONSAI also show the public attributes of the user that have oriented the learning algorithm towards a particular sensitive attribute value. The user can therefore modify these public attributes to prevent inference.
   For the approach to be feasible, several problems have to be solved. First, data collection by crawling is limited both by the social network and by country regulations. Hence the crawler exploration strategy has to focus only on meaningful, representative network nodes. Second, since attributes are numerous, the learning program can only scale if it selects the attributes most relevant for inferring sensitive attribute values; we therefore need an attribute relevance measure that is both accurate and easy to compute. Note that we cannot rely on semantic proximity, since we notice that a user who hides a sensitive attribute will probably hide semantically related ones too, and for fully anonymised datasets the attribute semantics is hard to recover. Therefore we follow an alternative approach: we model attributes as bipartite graphs and measure the relevance of attributes by comparing their bipartite graph structures.
   For specific attributes such as gender and relationship status, the sets of values are much smaller than for other attributes like music and movies. Consequently, the graphs that model these attributes have higher connectivity than the other graphs. For instance, the density of the graph that models gender (as most users publish their gender) is close to 0.5. In order to infer hidden links in such graphs we need to learn from highly connected graphs. However, most of the available learning graphs are sparse. To cope with this problem, we derive new graphs by merging several learning graphs.
   All the proposed methods have been implemented for Facebook, but they can be applied to many other social networks. The system has been tested by several volunteer users for auditing their Facebook profiles. In each case a dataset was built from real profiles collected in the user's neighborhood network.

   Related works. In [15] the authors propose algorithms to detect whether a sensitive attribute value can be inferred from the neighborhood of a target user in a social network. Heatherly et al. [12] infer attribute values in social networks with Bayesian classification techniques. For the same purpose Estivill-Castro et al. [7] employ decision-tree learning algorithms. In these works learning is performed off-line on large datasets. In order to perform attribute inference from the sparser datasets collected in a short time by our tool in the user's ego-network, we rather use random walk-based learning. The random walk technique has been applied to social representations in [14] and [11], where the authors analysed friendships and used a skip-gram model. In these works, user profiles that have similar friends are mapped to similar representations. This model helps to detect communities and can predict a set of potential friends for a given user profile. The skip-gram model has also been applied recently to infer social relationships from mobility data [5]. Our work uses a more adapted Continuous Bag of Words (CBOW) model to predict the most likely values for a user's sensitive attribute. In our setting, friends are considered as an attribute among others. Moreover, we first determine automatically a relevant subset of the social network for optimizing random walks. In [3] the relevance of attributes is computed by Bayesian optimisation, which is much less efficient than the graph comparison approach adopted here. In particular [3] does not succeed in reasonable time (on a standard PC) with attributes like gender and relationship status. Let us note that (unlike [1]) our approach does not need any ontology to perform semantic correlation between attributes. Finally, we notice that the proposed system SONSAI is close in some aspects to a recommendation system: an item suggestion can be viewed as an attribute value prediction [4]. However, unlike recommendation systems, SONSAI also provides explanations for the predicted values, namely an ordered list of attributes that have played a significant role in the computation.
For enforcing privacy, our final goal will be to reduce the "recommendation accuracy" by acting on this list of attributes.

2    ARCHITECTURE OF SONSAI

Figure 1: Architecture of SONSAI.

   The architecture of SONSAI is overviewed in Figure 1. The Facebook crawler (about 5k lines of Java 8) drives a Firefox 58.0b4 browser through a Selenium 3.5.3 server (http://www.seleniumhq.org/). The information collected from each profile, group and page is stored in separate XML files. The Anonymizer, Cleanser, Random walker and Ranker components are written in Python 2.7 (about 2.5k lines of code). The Anonymizer component parses all the XML files, generates anonymized graphs and stores them as TSV files. Then, the Cleanser selects the most relevant graphs and stores them as adjacency lists. The Random walker component browses the adjacency lists and stores the resulting walks in a text document. We use the Python gensim implementation of Word2Vec (https://radimrehurek.com/gensim/index.html) to parse the text document and compute a vectorial representation of the social network nodes encountered in the walks. Finally, the Ranker component ranks the sensitive nodes according to their similarity to the target user profile.

3    SAMPLING FACEBOOK
We have designed a crawler that explores the social network around a user (up to some distance given as a parameter) and collects public information from the visited web pages (i.e. friends, liked pages, ...) in order to build a representative subnetwork. We distinguish two types of Facebook nodes, user profiles (u) and pages (p), and two types of links: like-ships between user profiles and pages, and friendships between user profiles. Given a node 𝑐, 𝑐.𝑛 denotes the set of nodes that are linked to 𝑐. A discovered node is a node whose URL is known by the crawler. For instance, if the crawler retrieves a user profile and collects its public friends list, then all the friends of that particular user profile are discovered. Algorithm 1 crawls at most 𝑛𝑐 nodes at distance ≤ 𝑑 from the target node 𝑢𝑡. Each iteration of the outer loop samples a node, crawls it and updates the sets of discovered and crawled nodes. The sampling is done by random walks of length ≤ 𝑑 with a transition probability 𝜋 designed to crawl closer nodes with higher priority and to favour neighbour nodes according to their type. The function sinks(𝑗) returns the set of sinks, i.e. crawled nodes such that all discovered nodes at distance ≤ 𝑗 are also crawled. Sinks are avoided by the random walks to guarantee that the final node has not been crawled yet.

    Procedure crawl_nodes(𝑢𝑡, 𝑑, 𝑛𝑐, 𝜋)
       crawl_node(𝑢𝑡);
       while |𝑐𝑟𝑎𝑤𝑙𝑒𝑑_𝑛𝑜𝑑𝑒𝑠| < 𝑛𝑐 do
          𝑐 ← 𝑢𝑡; 𝑟 ← {};
          for 𝑖 ← 1 to 𝑑 by 1 do
             𝑟.addAll(sinks(𝑑 − 𝑖));
             𝑐 ← random_select_in({𝑐 ∪ 𝑐.𝑛} ∖ 𝑟);
          end
          crawl_node(𝑐);
       end
Algorithm 1: Crawling nodes around a target user.
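As a rough illustration, the Python sketch below mimics Algorithm 1 on an in-memory adjacency dictionary standing in for the public friend and like lists. It is only a sketch under our own assumptions: the sink test is approximated, node identifiers are assumed comparable, and the biased transition probability 𝜋 is replaced by a uniform choice, so it is not the actual crawler implementation.

    import random

    def crawl_nodes(network, u_t, d, n_c, max_iter=100000):
        # network: dict mapping a node to the set of its publicly visible neighbours
        crawled, discovered = set(), set()

        def crawl_node(node):
            crawled.add(node)
            discovered.update(network.get(node, ()))

        def is_sink(node):
            # Approximation of sinks(j): every discovered neighbour is already crawled.
            return node in crawled and all(v in crawled for v in network.get(node, ()))

        crawl_node(u_t)
        for _ in range(max_iter):                        # bounded, since this is only a sketch
            if len(crawled) >= n_c or len(crawled) >= len(network):
                break
            c, forbidden = u_t, set()
            for _ in range(d):                           # random walk of length <= d
                forbidden |= {v for v in discovered if is_sink(v)}
                candidates = ({c} | set(network.get(c, ()))) - forbidden
                if not candidates:
                    break
                c = random.choice(sorted(candidates))    # uniform stand-in for the biased pi
            crawl_node(c)
        return crawled, discovered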
4    SOCIAL NETWORK MODEL
   Modelling friendship relations. Since friendship on Facebook is symmetric, we model friendship between user profiles by an undirected graph (𝑈, 𝐹) where 𝑈 is a set of user profiles and 𝐹 is a set of friendship links between them.
   Modelling page like-ship. We model like-ship between user profiles and pages by several bipartite graphs (𝑈, 𝑃, 𝐿) where 𝑈 is a set of user profiles, 𝑃 is a set of pages (of a given type) and 𝐿 is a set of like-ship links between them. Figure 2 shows an example of page like-ship modelled by two graphs. Graph (𝑎) models liked pages of music type and Graph (𝑏) models liked pages of book type. We note that user profiles can like several pages of the same type.

Figure 2: Example of pages like-ship graphs. (Icons made by Smashicons from www.flaticon.com.)

   Anonymizing the social network graphs. For ethical and regulation reasons Facebook identifiers are replaced by fresh identifiers. Each node in the network is then identified by a unique integer ID replacing its Facebook ID. The anonymized IDs are sorted according to the node types. Anonymized graphs are saved in the tab-separated values (TSV) format, one of the most general delimiter-separated values (DSV) formats, widely used for graph exchange. In contrast to the dataset released by Netflix (https://www.kaggle.com/netflix-inc/netflix-prize-data), where only user IDs are anonymised (but not movie titles, ratings and dates of rating), all attribute values are anonymised in our datasets.
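A minimal sketch of this anonymization step is shown below. The function and variable names (anonymize_to_tsv, edges) are ours, not the tool's, and the real Anonymizer additionally keeps user and page IDs in separate ranges per node type.

    import csv

    def anonymize_to_tsv(edges, path):
        # edges: iterable of (facebook_user_id, facebook_page_id) like-ship links
        fresh_ids = {}                                   # Facebook ID -> fresh integer ID

        def fresh(fb_id):
            return fresh_ids.setdefault(fb_id, len(fresh_ids))

        with open(path, "w") as f:
            writer = csv.writer(f, delimiter="\t")
            for user, page in edges:
                writer.writerow([fresh(user), fresh(page)])
        return fresh_ids                                 # mapping kept aside, never published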
   In the following, a sensitive graph is a graph that models an attribute considered sensitive by the user (i.e. the user does not want its value to be predictable). The learning graphs are the attribute graphs available to the learning module in order to predict hidden links in the sensitive graph.

5    MODEL CLEANSING
The task of predicting a sensitive attribute from the other ones is made difficult by the size of the datasets. Therefore, we apply the inference process only to a subset of the attributes most relevant for the task. Our relevance notion does not rely on semantic proximity since we noticed that i) a user that hides a sensitive attribute will probably hide other semantically related attributes, and moreover ii) for fully anonymised datasets the attribute semantics is hard to recover. Instead, we rely on comparing attribute graph structures. In the following we assume a fixed sensitive graph 𝑠.
   Step 1: Computing the learning and the confidence rates of learning graphs. In order to compare the structure of a given learning graph to the structure of the sensitive graph, we first split each graph in two parts. The first part contains the user profiles that hide their links in the sensitive graph; the ratio of these profiles that publish their links in the learning graph is the learning rate 𝑙𝑟. The second part contains the user profiles that publish their links in the sensitive graph; the ratio of these profiles that publish their links in the learning graph is the confidence rate 𝑐𝑟.
   The function connected_profiles_in() returns the set of user profiles that publish their links in the graph given as argument. Given a learning graph 𝑙, a sensitive graph 𝑠 and the set 𝑈 of user profiles in all graphs, we define:

   𝑈𝑙 = connected_profiles_in(𝑙)
   𝑈𝑠 = connected_profiles_in(𝑠)
   𝑙𝑟(𝑙) = size(𝑈𝑙 ∩ (𝑈 ∖ 𝑈𝑠)) / size(𝑈 ∖ 𝑈𝑠)
   𝑐𝑟(𝑙) = size(𝑈𝑙 ∩ 𝑈𝑠) / size(𝑈𝑠)

   Figure 3 depicts an example of splitting two graphs for comparison. The graph that models the like-ship between user profiles and pages of politicians is the sensitive graph, and the graph that models the like-ship between user profiles and pages of music is the learning graph. The learning rate 𝑙𝑟 in this example is equal to 50% and the confidence rate 𝑐𝑟 is equal to 75%.

Figure 3: Example of cutting graphs for structure comparison.
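The two rates translate directly from these definitions; the sketch below assumes each graph has already been summarized by its set of connected profiles (the helper name is ours, not the tool's).

    def learning_and_confidence_rates(U, U_l, U_s):
        # U: all user profiles; U_l, U_s: profiles connected (publishing links)
        # in the learning and in the sensitive graph respectively (Python sets)
        hidden = U - U_s                               # profiles hiding their sensitive links
        lr = len(U_l & hidden) / float(len(hidden)) if hidden else 0.0
        cr = len(U_l & U_s) / float(len(U_s)) if U_s else 0.0
        return lr, cr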
   Step 2: Computing the distance between a learning graph and the sensitive graph. In this step, we discard user profiles that have a null degree in the learning graph or in the sensitive graph. The Jaccard index between two user nodes 𝑢1 and 𝑢2 in a given graph 𝐴 is computed as follows, where the function 𝑙𝑖𝑛𝑘𝑠𝐴(𝑢𝑗) returns the set of nodes to which user node 𝑢𝑗 is connected in the graph 𝐴:

   𝐽𝐴(𝑢1, 𝑢2) = |𝑙𝑖𝑛𝑘𝑠𝐴(𝑢1) ∩ 𝑙𝑖𝑛𝑘𝑠𝐴(𝑢2)| / |𝑙𝑖𝑛𝑘𝑠𝐴(𝑢1) ∪ 𝑙𝑖𝑛𝑘𝑠𝐴(𝑢2)|    (1)

   The Hamming distance 𝐻 between graphs 𝑙 and 𝑠 is defined by:

   𝐻(𝑙) = Σ_{𝑢𝑘,𝑢𝑗 ∈ 𝑈𝑙 ∩ 𝑈𝑠, 𝑘 ≠ 𝑗} |𝐽𝑙(𝑢𝑘, 𝑢𝑗) − 𝐽𝑠(𝑢𝑘, 𝑢𝑗)|    (2)

   In order to compare learning graphs with different sets of common connected profiles 𝑈𝑙 ∩ 𝑈𝑠, we normalize this distance by the maximal Hamming distance that can be obtained on such a set. Hence we define the Hamming rate ℎ𝑟(𝑙) = 𝐻(𝑙)/𝑀(𝑙), where

   𝑀(𝑙) = Σ_{𝑢𝑘,𝑢𝑗 ∈ 𝑈𝑙 ∩ 𝑈𝑠, 𝑘 ≠ 𝑗} Max(𝐽𝑠(𝑢𝑘, 𝑢𝑗), 1 − 𝐽𝑠(𝑢𝑘, 𝑢𝑗))    (3)
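Under the assumption that each graph is stored as a dictionary from user profile to the set of its neighbours, Equations (1)-(3) translate into a few lines of Python (a naive quadratic sketch, not the tool's optimized code):

    def jaccard(links, u, v):
        a, b = links.get(u, set()), links.get(v, set())
        union = a | b
        return len(a & b) / float(len(union)) if union else 0.0   # Eq. (1)

    def hamming_rate(links_l, links_s, common_profiles):
        # common_profiles: user profiles connected in both l and s (U_l & U_s)
        users = sorted(common_profiles)
        H = M = 0.0
        for i, u in enumerate(users):
            for v in users[i + 1:]:
                j_s = jaccard(links_s, u, v)
                H += abs(jaccard(links_l, u, v) - j_s)   # Eq. (2)
                M += max(j_s, 1.0 - j_s)                 # Eq. (3)
        return H / M if M else 0.0                       # hr(l) = H(l) / M(l)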
   Step 3: Selecting the most relevant graphs for learning sensitive attribute values. We first discard the learning graphs that have a learning rate 𝑙𝑟 lower than a threshold 𝜃𝑙𝑟, since they do not convey enough information. We then discard the graphs that have a confidence rate 𝑐𝑟 lower than 𝜃𝑐𝑟, since they are considered unreliable. Finally, from the remaining graphs we only select the graphs that have a Hamming rate ℎ𝑟 lower than 𝜃ℎ𝑟, since they are the most similar to the sensitive graph.

Densifying graphs
For some sensitive attributes such as gender, age and relationship status, user profiles are linked to at most one value. Moreover, the sets of values for these particular attributes are much smaller than for other attributes. Consequently, the graphs that model these attributes are denser than the other ones. In this case, for improving the random walk-based learning process (see Section 6) we need to merge several learning graphs in order to obtain a denser one. We explain the method with a simple example of gender prediction: we select attribute graphs with high 𝑙𝑟 and 𝑐𝑟 rates but also with a good rate of discrimination between genders, that is, the gender of the connected users in the graph is unbalanced between male and female. For instance we can select the jewelry and fast-food graphs. We merge these graphs by grouping all fast-foods into a unique node, and similarly for jewelries, as shown in Figure 4, to obtain a new learning graph.

Figure 4: Merging graphs.
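As a sketch of the merging operation (our own helper, under the assumption that each per-type like-ship graph is given as a set of (user, page) links), grouping every page of a type into one "super-value" node amounts to:

    from collections import defaultdict

    def merge_graphs(likeship_by_type):
        # likeship_by_type: e.g. {"jewelry": {(user, page), ...}, "fast-food": {...}}
        merged = defaultdict(set)
        for page_type, links in likeship_by_type.items():
            for user, _page in links:
                merged[user].add(page_type)      # one super-value node per page type
        return merged                            # user profile -> set of super-values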



6    RANDOM WALK-BASED ATTRIBUTE LEARNING
   Representation generation. We translate latent information from the selected graphs (see the previous section) into a text document. The resulting document holds information about paths in the graphs and their frequencies, which can be exploited to infer the proximity of a user node to some sensitive attribute value node. Following [14], paths in the social graph are sampled by random walks. In our case, the walks are executed only in the subset of selected graphs.
   Let 𝐺1, 𝐺2, ..., 𝐺𝑛 be the list of selected learning graphs and let 𝑈 be the set of users in all graphs. We assume that the friendship graph is selected and that 𝐺1 = (𝑈, 𝐹) (otherwise we can simply adapt the computation below). The other graphs are bipartite and we pose 𝐺𝑖 = (𝑈, 𝑉𝑖, 𝐿𝑖) for 𝑖 > 1. We introduce quotas to quantify the importance of each graph 𝐺𝑟 for inferring the secret values of the target sensitive attribute. Each selected graph 𝐺𝑟 is assigned a 3-dimensional vector 𝑉𝐺𝑟 = [𝑙𝑟(𝐺𝑟), 𝑐𝑟(𝐺𝑟), 1 − ℎ𝑟(𝐺𝑟)]. The quota of 𝐺𝑟 is given by its Mahalanobis distance to the null vector [0, 0, 0]. It is computed as follows:

   𝑞(𝐺𝑟) = √( 𝑉𝐺𝑟ᵀ Σ⁻¹ 𝑉𝐺𝑟 )

with Σ the 3 × 3 covariance matrix over the set of selected graph vectors.
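A possible implementation of the quota with numpy is sketched below, under the assumption that the three rates have already been computed for every selected graph (the function name and input layout are ours):

    import numpy as np

    def quotas(rates):
        # rates: array of shape (n_graphs, 3) holding [lr, cr, 1 - hr] per graph
        V = np.asarray(rates, dtype=float)
        sigma_inv = np.linalg.pinv(np.cov(V, rowvar=False))   # inverse covariance matrix
        # Mahalanobis distance of each graph vector to the null vector
        return np.sqrt(np.einsum("ij,jk,ik->i", V, sigma_inv, V))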
   To specify the random walk transitions, we first define the probability 𝑝𝑢,𝑦 that, being in node 𝑢, the next node in the random walk is in a selected graph 𝐺𝑦:

   𝑝𝑢,𝑦 = 𝑞(𝐺𝑦) / Σ_{𝑥 | 𝑑𝑒𝑔𝑥(𝑢) > 0} 𝑞(𝐺𝑥)   if 𝑑𝑒𝑔𝑦(𝑢) > 0,   and 𝑝𝑢,𝑦 = 0 otherwise    (4)

where 𝑑𝑒𝑔𝑥(𝑢) is the degree of user 𝑢 in graph 𝐺𝑥. A value node 𝑣 is followed by a user node chosen uniformly at random from the ones connected to 𝑣. Assuming the node following user node 𝑢 is in 𝐺𝑦, it is chosen uniformly at random from the nodes in 𝐺𝑦 that are connected to 𝑢. Therefore, the transition probabilities are (𝐺1 being the friendship graph):

   𝑢 → 𝑣 : 𝑝𝑢,𝑦 / 𝑑𝑒𝑔𝑦(𝑢)    if 𝑦 ≠ 1 and (𝑢, 𝑣) ∈ 𝐿𝑦
   𝑣 → 𝑢 : 1 / 𝑑𝑒𝑔𝑦(𝑣)       if 𝑦 ≠ 1 and (𝑢, 𝑣) ∈ 𝐿𝑦        (5)
   𝑢 → 𝑢′ : 𝑝𝑢,1 / 𝑑𝑒𝑔1(𝑢)   if (𝑢, 𝑢′) ∈ 𝐹
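The sketch below performs one step of such a multi-graph walk from a user node, assuming each selected graph is an adjacency dictionary, that quota[g] holds 𝑞(𝐺𝑔), and that 𝑢 has non-null degree in at least one selected graph; the step back from a value node is simply a uniform choice among its likers. It is an illustration of Equations (4)-(5), not the tool's code.

    import random

    def step_from_user(u, graphs, quota):
        # graphs: {graph_id: {node: set(neighbours)}}; quota: {graph_id: q(G)}
        active = [g for g, adj in graphs.items() if adj.get(u)]
        weights = [quota[g] for g in active]     # Eq. (4): graph chosen proportionally to its quota
        r, acc = random.uniform(0, sum(weights)), 0.0
        chosen = active[-1]
        for g, w in zip(active, weights):
            acc += w
            if r <= acc:
                chosen = g
                break
        # Eq. (5): uniform choice among u's neighbours in the chosen graph
        return chosen, random.choice(sorted(graphs[chosen][u]))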
Figure 5: Example of multi graph random walk.

   As illustrated in Figure 5, the document is constructed by connecting all graphs through random jumps between their nodes (see also [14]). At each step the walker state changes and a new word is written in the text document. One step amounts to selecting a graph where the current node occurs with non-null degree, and then selecting a node that is connected to the current node in the selected graph. The selected node then becomes the new current node.
   In this example we aim to predict the liked pages of politicians masked by user profile 𝑢3. The sensitive graph is Graph 2 and the learning graphs are Graph 1 and Graph 3. Since the values of the sensitive attribute (the pages of politicians) are labelled (each value belongs to a unique cluster), they are represented by the label of their cluster in the final document. We use a greedy clustering algorithm [3] to define clusters of similar size. Pages of politicians that share many common "likers" end up in the same cluster. For instance, the first walk depicted in Figure 5 is [𝑢1, 𝑢4, 𝑣2,3, 𝑢4]. But for efficiency the walk [𝑢1, 𝑢4, 𝑐2,2, 𝑢4] is stored instead in the document, since the value 𝑣2,3 belongs to the cluster 𝑐2,2.
   Applying word2vec to compute node representations. We have performed multi-graph random walks in the social network and generated a text document. Walks in the document can be interpreted as sentences whose words are network nodes. Hence, inferring a link between a user node and an attribute value node is similar to the problem of estimating the likelihood of word co-occurrence in a corpus. We use word2vec [9, 13] to map the one-hot encoded vectors that represent words in a high-dimensional vocabulary space to low-dimensional vectors (see [6]). Word2vec is employed with the Continuous Bag of Words (CBOW) model for predicting a word given its context, defined by the 𝑐 − 1 words surrounding it (𝑐 is the window size of the context). The inputs of the model in our case are users and published attribute values. Since there is no order between the attribute values, the CBOW model is more adequate than skip-gram. The output of the CBOW model is a vector of size 𝑣 representing a probability distribution of co-occurrence between all the words of the context and each word of the vocabulary within a window of size 𝑐.
   Ranking sensitive attribute values. We measure the semantic similarity between two nodes by the cosine of the angle formed by the vectors representing the nodes. Cosine similarity is known to take greater account of context information. We rank all the sensitive values by their cosine similarity to the target user profile. The values with the lowest rank (i.e. the closest ones) are the most likely secret values of the sensitive attribute of the target user profile, where secret values are the true values of the target user that are not published by him on the social network.
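A compact sketch of these two steps with gensim is shown below. Parameter names follow the gensim 4.x API, where sg=0 selects the CBOW model (older versions use size instead of vector_size); the function name and its arguments are ours.

    from gensim.models import Word2Vec

    def rank_sensitive_values(walks, target_user, sensitive_values, dim=128, window=5):
        # walks: list of random walks, each a list of node identifiers (strings)
        model = Word2Vec(sentences=walks, vector_size=dim, window=window,
                         sg=0,            # CBOW model
                         min_count=1, workers=4)
        scores = [(v, model.wv.similarity(target_user, v))
                  for v in sensitive_values if v in model.wv]
        # highest cosine similarity first: the top values are the predicted secrets
        return sorted(scores, key=lambda s: s[1], reverse=True)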
   Figure 6 depicts an example of 2-dimensional vectors that encode 8 nodes: 3 user profiles (𝑢1, 𝑢2 and 𝑢3), 2 pages of music (𝑣3,1 and 𝑣3,2) and 3 clusters of pages of politicians (𝑐2,1, 𝑐2,2 and 𝑐2,3). The clusters of pages of politicians are the sensitive values and their vectors are red. The node 𝑢1 is the target user profile and its vector is blue. The clusters of pages of politicians are ranked according to their distances to 𝑢1, and the inference algorithm infers, as the most probable pages of politicians to be liked by 𝑢1, the pages of politicians of the cluster with the smallest rank (the closest cluster to 𝑢1).
   In [16] Schakel et al. show that the word2vec unsupervised learning algorithm encodes word semantics by pushing the vectors of co-occurrent words in the same direction during training. Besides, the magnitude of a vector reflects both the frequency of appearance of the related word in the corpus and the homogeneity of its contexts, where a context is a set of words that have a high co-occurrence probability in the corpus.
   In fact, words that appear in the same contexts have small angular distances between them. The less the contexts overlap, the larger the angular distances between their words. However, words that appear in many contexts are represented by vectors that average vectors pointing in many context directions. Hence, the vector magnitude generally decreases with the number of contexts. Moreover, the higher the word frequency, the higher the chance that it is used in different contexts; consequently, the vector magnitude also decreases with frequency. From these remarks, we conclude that the Euclidean distance is not a good measure for our inference purpose. Indeed, words that appear in many contexts have low magnitude; as a result, their Euclidean distances will be small and, using this criterion, they would be considered close even if they do not appear in any common context. For instance, the Euclidean distance between the clusters of pages of the most popular politicians will be small even if they are rivals. In the example depicted in Figure 6 the politicians of the clusters 𝑐2,1 and 𝑐2,2 are rivals: the angular distance between those two clusters is large, but the Euclidean distance is small. Moreover, the Euclidean distance between a user that has many friends, for instance the user 𝑢1 in Figure 6, and a popular music page like "despacito", for instance the page of music 𝑣2,1 in Figure 6, will be small. But popular users do not necessarily like popular music.

7    EXPERIMENTS
To build the datasets we have crawled Facebook profiles of people living in North-East France and in Île-de-France (Paris). Table 1 gives statistics about the two crawled datasets.

   Dataset                               ♯ Attributes   ♯ Attribute values   ♯ Crawled user profiles
   Dataset 1 (D1), North-East France        1 929           1 022 860             15 012
   Dataset 2 (D2), Île-de-France            1 296             298 617              6 550
Table 1: Details about the datasets.

7.1    Political orientation
From each node 𝑛 we perform a random walk of 800 steps. The dimension of the node representation is taken to be 512. The dimension is usually taken between 100 and 300 for natural languages; however, the size of the vocabulary in social networks (equal to the number of nodes) is much higher than in natural languages.
   In Dataset 1 the sensitive graph represents the links between 2554 user profiles and 4589 politician pages. For each experiment we generate a new social graph from the dataset by selecting the user profiles that publish their preferences concerning the sensitive attribute (pages of politicians) and at least one other attribute. Then we remove all the links in the sensitive graph of 10% of the selected user profiles. The algorithm makes sure that all the nodes in the resulting social graph remain connected. The experiments then consist in inferring the hidden links based on information from the learning graphs.
   Among the 1928 learning graphs, we selected the ones with a learning rate greater than 𝜃𝑙𝑟 = 20%, a confidence rate greater than 𝜃𝑐𝑟 = 60% and a Hamming rate lower than 𝜃ℎ𝑟 = 4%. Table 2 details the relevance measures of the 23 selected graphs.

   Attribute graph            𝑙𝑟 (in %)   𝑐𝑟 (in %)   ℎ𝑟 (in %)   ♯ Users   ♯ Values
   Users                        88.37       83.98       2.08       13155      13155
   Communities                  44.97       98.47       1.75        8118     137338
   Musicians/Bands              38.58       91.38       2.03        7141      84762
   PublicFigures                32.86       92.44       1.80        6455      28289
   NonProfitOrganizations       31.85       86.57       1.72        6180      25847
   Artists                      30.65       84.22       1.92        5970      31681
   Companies                    30.05       85.94       1.75        5939      20750
   Websites                     29.57       83.94       1.78        5829      17931
   TVShows                      29.48       82.41       2.31        5778      11876
   EntertainmentWebsites        29.26       79.20       2.84        5669       8319
   Media/NewsCompanies          29.23       87.27       1.82        5871      14042
   Products/Services            27.52       80.93       1.86        5496      15986
   News/MediaWebsites           27.44       83.43       1.86        5550       9247
   Organizations                26.20       80.77       1.63        5328      14738
   Movies                       26.09       75.17       2.26        5171      16282
   LocalBusinesses              24.91       78.58       1.69        5111      17321
   Clothings                    23.99       68.12       1.96        4729      16090
   Gastronomies                 23.52       71.73       2.24        4763       8422
   Actors/Directors             23.12       74.54       2.78        4785      10425
   Magazines                    22.82       73.96       1.69        4733       9955
   Athletes                     22.68       68.79       2.35        4583      14123
   ApplicationPages             21.68       66.36       3.04        4396       4244
   SportsTeams                  21.48       63.93       2.35        4309      10433
Table 2: Selected learning graphs in D1 for politicians.
Figure 6: Example of 2-dimensional vectors that encode 8 nodes.

   We note that the communities graph has the second greatest learning rate (𝑙𝑟 = 44.97%), which means that it holds much latent information about users who hide their likes in the sensitive graph. It also has the maximal confidence rate (𝑐𝑟 = 98.47%) and the fifth lowest Hamming rate (ℎ𝑟 = 1.75%) among the 23 selected relevant graphs, which means that its structure is very similar to the structure of the politicians graph. The friendship graph (Users) has the maximal learning rate (𝑙𝑟 = 88.37%) and a high confidence rate (𝑐𝑟 = 83.98%), since 87.62% of users are connected in this graph. However, this graph has a Hamming rate (ℎ𝑟 = 2.08%) greater than the average ℎ𝑟 of the selected graphs, which means that the information learned from this graph is less reliable than the information learned from the communities graph.
   We use the area under the ROC curve (AUC) as defined in [8] to measure the accuracy of the inferred links. For the defined thresholds (𝜃𝑙𝑟 = 20%, 𝜃𝑐𝑟 = 60%, 𝜃ℎ𝑟 = 4%) the accuracy is equal to 0.79. However, the inference accuracy when 23 graphs are selected randomly is only 0.41. We conducted more tests by manually selecting 3 graphs that are semantically close to politics, as follows: graph 𝐺1 models the links between 1246 user profiles and 2357 political organizations, 𝐺2 models the links between 1120 user profiles and 1758 political parties, and 𝐺3 models the links between 39 user profiles and 41 political ideologies. Although the selected graphs seem promising, the inference accuracy is only 0.46. This can be explained by the fact that the selected graphs are very sparse, as users are vigilant when publishing their preferences about those attributes. Consequently, the algorithm cannot learn well from them.
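For reference, the AUC can be computed with scikit-learn; the scores below are made up for illustration and are not the paper's data.

    from sklearn.metrics import roc_auc_score

    # y_true marks the truly liked (hidden) politician clusters of a target user,
    # y_score the cosine similarities produced by the ranker.
    y_true = [1, 0, 0, 1, 0]
    y_score = [0.83, 0.10, 0.45, 0.77, 0.30]
    print(roc_auc_score(y_true, y_score))   # 1.0: every true value outranks every false one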
   We note that the musicians/bands graph was automatically selected by our relevance-based selection method, confirming that music and politics are correlated, as was empirically discovered in previous studies [17].
   Table 3 summarizes the results of the conducted experiments.

   Selection based on        accuracy   targets       deleted links   nodes
   Relevance (23 graphs)       0.79       252              409        558125
   Random (23 graphs)          0.41       204 (average)    351 (average)   11200 (average)
Table 3: Experimental results.

7.2    Gender and relationship status
To produce a text document from the social graph, we perform random walks of 80 steps, each starting from a different node. The dimension of the node vectorial representation is taken to be 128, since in that case the number of sensitive values is smaller. Moreover, the learning graphs are obtained by grouping several attribute values in the merging operation (see Section 5). For the experiments, we have generated a new social graph from the dataset by randomly selecting 10% of the user profiles that publish the value of their sensitive attribute, and we have masked it. Then SONSAI has tried to infer the masked value of the sensitive attribute for each such user, based on the attributes selected by the cleanser module.
   Relationship status. The sensitive graph models the relationship status of user profiles. To simplify the presentation we define two meta-relationship statuses as follows:

   𝑅1 = {Single, Divorced, Separated, Widowed, Complicated}
   𝑅2 = {Domestic partnership, Married, Engaged, Relationship, Civil union, Open relation}

   We aim to infer the meta-relationship status of users. Table 4 gives more details about the attributes selected from dataset D2. We notice that the discriminant attributes toward 𝑅1 are focused around education and leisure, while the discriminant attributes toward 𝑅2 are focused around business. The accuracy (in AUC) of inferring the meta-relationship status is higher than 0.7 in both datasets D1 and D2, as soon as the target publishes values concerning at least 4 attributes selected by the cleanser.
      Attribute                       Importance     Discrimination
      Education                          2.75         88.41 % R1             inference since the vocabulary is limited to user profiles,
      Community College
      Consulting Agency
                                         2.74
                                         2.71
                                                      90.02 % R1
                                                      90.70 % R2             super-values (i.e. clusters of values) and sensitive values.
      Sports & Recreation
      Home & Garden Website
                                         2.56
                                         2.49
                                                      91.18 % R2
                                                      91.89 % R2
                                                                                Ranking task has to compute cosine similarity between
      Automotive, Aircraft & Boat
      Locality
                                         2.48
                                         2.47
                                                      92.86 % R2
                                                      92.59 % R2
                                                                             a few vectors that represent the sensitive values and the
      Corporate Office                   2.46         91.18 % R2             target user vector. Cleansing task selects most important
      News & Media Website               2.42         90.32 % R2
      Financial Service                  2.41         90.00 % R2             attributes and reduces the vocabulary. Consequently, this
      Industrial Company                 2.40         89.29 % R2
      Educational Consultant             2.02         75.00 % R1             process accelerates the inference tasks (relying on Ran-
      Playground
      Phone/Tablet
                                         1.80
                                         1.70
                                                      66.67 % R1
                                                      63.64 % R1             dom walk and Word2Vec). Moreover, Cleansing increases
      Plastic Surgeon
      Consulate & Embassy
                                         1.60
                                         1.60
                                                      60.00 % R1
                                                      60.00 % R2
                                                                             accuracy by discarding irrelevant information.
      School Sports Team                 1.53         52.00 % R1
      Dive Bar                           1.45         54.55 % R1
      Video
      Playlist
                                         1.44
                                         1.41
                                                      51.00 % R1
                                                      53.04 % R1
                                                                             7.4    Parameter sensitivity analysis
Table 4: Selected attributes in D2 for relationship                          Let us investigate the impact of the cleansing parameters
status.                                                                      𝑙𝑟, 𝑐𝑟 and ℎ𝑟. All experiments detailed in this section are
                                                                             conducted on dataset D1 to infer users’ political orientation.
                                                                                 Table 7 shows that only 3 graphs among the 1928 avail-
                                                                             able graphs have a learning rate 𝑙𝑟 higher than 30%. Based
and D2 as soon as the target publishes values concerning                     on those graphs, inference accuracy can be very low. For
at least 4 selected attributes by the cleanser.                              instance, inference accuracy based on gender attribute is
   Gender. The sensitive graph models the gender of user                     only 0.36. Based only on the users (i.e. friendship) graph
profiles. We notice that discriminant attributes toward                      accuracy is getting better to 0.64. The communities graph
male are focused around sports, games and software. On                       gives high accuracy of 0.74 for inferring political views.
the other hand, discriminant attributes toward female are                    However, we notice that the best accuracy is obtained
focused around health, home and luxury. The accuracy in                      when selecting graphs with learning rate between 10% and
AUC of inferring the gender is higher than 0.83 in dataset                   40%. Table 7 shows that the learning rate parameter 𝑙𝑟 is
D1 and higher than 0.67 in dataset D2 as soon as the target                  important to select the best graphs for inference. However,
publishes values concerning at least 2 selected attributes                   accuracy does not depend only on this parameter since
by the cleanser.                                                             some graphs such as gender graph that have high learning
                                                                             rate may lead to very low accuracy results.

    Attribute                              Importance    Discrimination
    Sports League                             4.22       75.97 % Male
    Recreation & Sports Website               3.80       77.09 % Male
    Video Game                                3.66       80.16 % Male
    Cars                                      3.25       73.15 % Male
    Amateur Sports Team                       3.03       72.86 % Male
    Sport                                     2.80       73.07 % Male
    Jewelry/Watches                           2.72       56.26 % Female
    Electronics                               2.68       73.19 % Male
    Software                                  2.52       77.23 % Male
    Outdoor & Sporting Goods                  2.35       77.19 % Male
    Women’s Clothing Store                    2.35       77.28 % Female
    Home Decor                                2.29       54.60 % Female
    Stadium, Arena & Sports Venue             2.28       74.45 % Male
    Baby Goods/Kids Goods                     2.14       66.61 % Female
    Kitchen/Cooking                           2.08       55.93 % Female
    Bags/Luggage                              2.04       59.16 % Female
    Beauty, Cosmetic & Personal Care          2.03       60.59 % Female
    Cosmetics Store                           1.98       66.25 % Female
    Hair Salon                                1.92       61.44 % Female
    Home & Garden Website                     1.72       55.18 % Female
    Table 5: Selected attributes in D1 for gender.

7.3    Processing times

Table 6 displays the processing times. The processor clock is 2.3 GHz. The cleansing and
random walk algorithms are not parallelized. Cleansing takes more time than the other
processes in the case of gender inference, since it handles hundreds of thousands of nodes,
compares hundreds of graphs to the sensitive graph and computes their importance. The
random walk, in the case of gender inference, is performed on a small graph containing only
a few dozen super-values and a few thousand user profiles. On the other hand, in the case
of politicians inference, the task is performed on larger graphs containing tens of thousands
of values. The machine has only 8 GB of RAM. Each chunk of 5k random-walk steps is stored
separately in a text file of about 25 MB. Those files are then parsed by Word2Vec, whose
speed depends on the size of the document vocabulary; it is fast in the case of gender
inference.
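
For concreteness, here is a minimal sketch of how such chunk files could be fed to Word2Vec
with the gensim library. The directory name, file layout and hyper-parameters are illustrative
assumptions, not SONSAI's actual configuration.

    # Illustrative only: embed the random-walk chunk files with gensim's Word2Vec.
    # Assumes each 5k-step chunk was dumped to "walks/" as a plain-text file whose
    # lines are space-separated node identifiers (names and hyper-parameters are
    # guesses, not the exact setup used by SONSAI).
    from gensim.models import Word2Vec
    from gensim.models.word2vec import PathLineSentences

    walks = PathLineSentences("walks/")    # streams every chunk file, line by line
    model = Word2Vec(
        sentences=walks,
        vector_size=128,   # embedding dimension (named `size` in gensim < 4.0)
        window=5,          # context window over each walk
        min_count=1,       # keep every node: rare values still need a vector
        workers=4,         # training threads
    )
    model.save("embeddings.w2v")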

    Process                       Cleansing   Random walk   Word2Vec   Ranking
    Gender inference        D1       423           34           50       0.12
                            D2       243           25           30       0.12
    Politicians inference   D1       782          523          924       1
                            D2       574          451          733       1
    Table 6: Processing times (in seconds).


Accuracy does not depend only on the learning rate 𝑙𝑟 (Table 7): some graphs with a high
learning rate, such as the gender graph, may nevertheless lead to very low accuracy results.
   Table 8 shows that accuracy increases when the Hamming rate ℎ𝑟 decreases. However, most
graphs have a low Hamming rate because only a small part of them can be compared to the
sensitive graph, as few users publish their preferences in both graphs. Hence, their structure
is not fairly comparable to the politicians' graph structure. To cope with this problem we
compute a third parameter, the confidence rate 𝑐𝑟, which indicates how reliable the structure
comparison is.
   Table 9 shows that the confidence rate 𝑐𝑟 does not indicate the best graph to select when
it is considered in isolation; it must be coupled with the other parameters. For instance, if
a given graph 𝑔 has a high confidence rate but a low Hamming rate, then 𝑔 is a good graph
for inference. However, if 𝑔 has a high confidence rate and a high Hamming rate, then 𝑔 is
probably a bad graph for inferring the sensitive attribute. Conversely, a graph 𝑔 with a low
𝑐𝑟 and a high ℎ𝑟 could still be interesting for inference.

       𝑙𝑟          Inference        ♯ selected       ♯ attacked    ♯ masked
     (in %)    accuracy in AUC        graphs           targets       links
     [0, 10[         0.61              1873              252          409
    [10, 20[         0.80                31              254          411
    [20, 30[         0.86                16              254          418
    [30, 40[         0.80                 5              253          410
    [40, 50[         0.74       1 (Communities)          251          408
    [70, 80[         0.36         1 (Genders)            213          353
    [80, 90]         0.64          1 (Users)             214          350
    Table 7: Impact of 𝑙𝑟 on inference accuracy.

       ℎ𝑟          Inference        ♯ selected       ♯ attacked    ♯ masked
     (in %)    accuracy in AUC        graphs           targets       links
      [0, 5[         0.68              1744              253          410
     [5, 10[         0.59                87              177          304
    [10, 20[         0.53                58               87          167
    [20, 30[         0.45                11               83          129
    [30, 40[         0.42                13               11           21
    [40, 50[         0.42                 5                2            3
   [50, 100]         0.41                10              211          351
    Table 8: Impact of ℎ𝑟 on inference accuracy.

       𝑐𝑟          Inference        ♯ selected       ♯ attacked    ♯ masked
     (in %)    accuracy in AUC        graphs           targets       links
     [0, 10[         0.63              1711              245          409
    [10, 20[         0.43                94              246          407
    [20, 30[         0.74                37              245          405
    [30, 40[         0.54                28              248          404
    [40, 50[         0.72                16              247          410
    [50, 60[         0.38                14              250          407
    [60, 70[         0.63                 8              248          403
    [70, 80[         0.60                 6              248          393
    [80, 90[         0.65                11              255          419
   [90, 100]         0.82                 3              253          410
    Table 9: Impact of 𝑐𝑟 on inference accuracy.
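
To make this rule of thumb concrete, the following minimal sketch filters candidate learning
graphs on ℎ𝑟 and 𝑐𝑟. It is purely illustrative: the thresholds, field names and example
numbers are assumptions and do not reproduce the selection procedure implemented in SONSAI.

    # Illustrative only: a toy filter combining the Hamming rate (hr) and the
    # confidence rate (cr) of a candidate learning graph, following the rule of
    # thumb above. Thresholds and example numbers are invented for the example.
    from dataclasses import dataclass

    @dataclass
    class GraphStats:
        name: str
        lr: float  # learning rate
        hr: float  # Hamming rate w.r.t. the sensitive graph
        cr: float  # confidence rate of the structure comparison

    def worth_keeping(g: GraphStats, hr_max: float = 0.10, cr_min: float = 0.50) -> bool:
        # High cr with low hr: the structures reliably match -> keep.
        reliably_similar = g.cr >= cr_min and g.hr <= hr_max
        # Low cr with high hr: the comparison itself is unreliable, so the
        # dissimilarity is not conclusive -> the graph may still be interesting.
        unreliably_dissimilar = g.cr < cr_min and g.hr > hr_max
        return reliably_similar or unreliably_dissimilar

    candidates = [
        GraphStats("Video Game", lr=0.25, hr=0.04, cr=0.92),
        GraphStats("Gender", lr=0.75, hr=0.60, cr=0.95),  # high lr alone is not enough
    ]
    print([g.name for g in candidates if worth_keeping(g)])  # -> ['Video Game']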

8     CONCLUSION
The SONSAI application enables users to predict their sensitive links in social networks from
a relatively small amount of data and computing resources. Indeed, sensitive data inferences
are fast and accurate on typical personal attributes. It should be noted that the friendship
graph was not selected among the important graphs for deducing either the gender or the
relationship status of users. This probably means that alternative techniques based solely on
homophily would be inaccurate in this context. Moreover, we have observed that the privacy of
users is threatened as soon as they start publishing at least three important attributes.
These attributes are automatically brought to light by SONSAI, regardless of their semantics
and only through a structural analysis of the social network graph. As future work, we plan
to incorporate countermeasures into our tool to protect users against posts that might
compromise their privacy. We also plan to enhance the SONSAI prediction engine with a tool
that permits one to disclose hidden friendship links using adequate combinations of queries
provided by the social network [2].

Acknowledgments. This work is supported by the MAIF Foundation (www.fondation-maif.fr).

REFERENCES
[1] Chaabane Abdelberi, Gergely Ács, and Mohamed Ali Kâafar. 2012. You are what you like! Information leakage through users' interests. In 19th Annual Network and Distributed System Security Symposium, NDSS 2012, San Diego, California, USA, February 5-8, 2012. https://www.ndss-symposium.org/ndss2012/you-are-what-you-information-leakage-through-users-interests
[2] Younes Abid, Abdessamad Imine, Amedeo Napoli, Chedy Raïssi, and Michaël Rusinowitch. 2016. Online Link Disclosure Strategies for Social Networks. In Risks and Security of Internet and Systems - 11th International Conference, CRiSIS 2016, Roscoff, France, September 5-7, 2016, Revised Selected Papers. 153–168. https://doi.org/10.1007/978-3-319-54876-0_13
[3] Younes Abid, Abdessamad Imine, Amedeo Napoli, Chedy Raïssi, and Michaël Rusinowitch. 2017. Two-Phase Preference Disclosure in Attributed Social Networks. In Database and Expert Systems Applications - 28th International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part I. 249–263. https://doi.org/10.1007/978-3-319-64468-4_19
[4] Anton Alekseev and Sergey I. Nikolenko. 2017. Word Embeddings for User Profiling in Online Social Networks. Computación y Sistemas 21, 2 (2017). http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2734
[5] Michael Backes, Mathias Humbert, Jun Pang, and Yang Zhang. 2017. walk2friends: Inferring Social Links from Mobility Profiles. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017. 1943–1957. https://doi.org/10.1145/3133956.3133972
[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3 (2003), 1137–1155.
[7] Vladimir Estivill-Castro, Peter Hough, and Md Zahidul Islam. 2014. Empowering users of social networks to assess their privacy risks. In 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014. 644–649. https://doi.org/10.1109/BigData.2014.7004287
[8] Fei Gao, Katarzyna Musial, Colin Cooper, and Sophia Tsoka. 2015. Link Prediction Methods and Their Accuracy for Different Social Networks and Network Metrics. Scientific Programming 2015 (2015), 172879:1–172879:13. https://doi.org/10.1155/2015/172879
[9] Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. CoRR abs/1402.3722 (2014).
[10] Virgil Griffith. 2007. music that makes you dumb. (2007). http://musicthatmakesyoudumb.virgil.gr/
[11] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 855–864. https://doi.org/10.1145/2939672.2939754
[12] R. Heatherly, M. Kantarcioglu, and B. Thuraisingham. 2013. Preventing Private Information Inference Attacks on Social Networks. IEEE Transactions on Knowledge and Data Engineering 25, 8 (Aug 2013), 1849–1862. https://doi.org/10.1109/TKDE.2012.120
[13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. CoRR abs/1310.4546 (2013).
[14] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: online learning of social representations. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014. 701–710.
[15] Eunsu Ryu, Yao Rong, Jie Li, and Ashwin Machanavajjhala. 2013. curso: protect yourself from curse of attribute inference: a social network privacy-analyzer. In Proceedings of the 3rd ACM SIGMOD Workshop on Databases and Social Networks, DBSocial 2013, New York, NY, USA, June 23, 2013. 13–18. https://doi.org/10.1145/2484702.2484706
[16] Adriaan M. J. Schakel and Benjamin J. Wilson. 2015. Measuring Word Significance using Distributed Representations of Words. CoRR abs/1508.02297 (2015).
[17] John Street. 2012. Music & Politics. Polity Press.