<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Sensitive attribute prediction for social networks users</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Younes</forename><surname>Abid</surname></persName>
							<email>younes.abid@loria.fr</email>
							<affiliation key="aff0">
								<orgName type="institution">Lorraine University Nancy</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abdessamad</forename><surname>Imine</surname></persName>
							<email>abdessamad.imine@loria.fr</email>
							<affiliation key="aff1">
								<orgName type="institution">Lorraine University</orgName>
								<address>
									<settlement>Cnrs, Inria Nancy</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michaël</forename><surname>Rusinowitch</surname></persName>
							<email>michael.rusinowitch@loria.fr</email>
							<affiliation key="aff2">
								<orgName type="institution">Lorraine University</orgName>
								<address>
									<settlement>Cnrs, Inria Nancy</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Sensitive attribute prediction for social networks users</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073)</idno>
					</monogr>
					<idno type="MD5">70F27609892C3C4AF152E50DA26923BC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T23:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Social networks are popular means of data sharing but they are vulnerable to privacy breaches. For instance, relating users with similar profiles an entity can predict personal data with high probability. We present SONSAI a tool to help Facebook users to protect their private information from these inferences. The system samples a subnetwork centered on the user, cleanses the collected public data and predicts user sensitive attribute values by leveraging machine learning techniques. Since SONSAI displays the most relevant attributes exploited by each inference, the user can modify them to prevent undesirable inferences. The tool is designed to perform reasonably with the limited resources of a personal computer, by collecting and processing a relatively small relevant part of network data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Data published on social networks profiles can be mined for inferring sensitive information about users. For instance it was shown in <ref type="bibr" target="#b9">[10]</ref> how musical tastes allow one to predict educational level. To increase user awareness about these privacy threats we have designed a tool, SONSAI, for Facebook users to audit their profiles.The system crawls the network around the user and predicts its sensitive attributes values using a machine learning engine. The results provided by SONSAI, also shows the public attributes of the user that have oriented the learning algorithm towards a particular sensitive attribute value. The user can therefore modify these public attributes to prevent inference.</p><p>For the approach to be feasible several problems have to be solved: First, data collection by crawling is limited both by the social network and by country regulations. Hence the crawler exploration strategy has to focus only on meaningful representative network nodes. Since attributes are numerous, for the learning program to scale one has to select only the most relevant ones for inferring sensitive attribute values. Hence the second problem is to find an attribute relevance measure that is both accurate and easy to compute. Note that we cannot rely on semantic proximity since we notice that a user that hides a sensitive attribute probably will hide semantically related ones, too. Moreover for fully anonymised datasets the attributes semantics is hard to recover. Therefore we follow an alternative approach by modelling attributes as bipartite graphs and measuring relevance of attributes by comparing their bipartite graph structures.</p><p>For specific attributes such as gender and relationship status, the sets of values are much smaller than for other attributes like music and movie. 
Consequently, the graphs that model these attributes have higher connectivity than the other graphs. For instance, the density of the graph that models gender is close to 0.5 (as most users publish their gender). In order to infer hidden links in such graphs we need to learn from highly connected graphs. However, most of the available learning graphs are sparse. To cope with this last problem, we derive new graphs by merging several learning graphs.</p><p>All the proposed methods have been implemented for Facebook, but they can be applied to many other social networks. The system has been tested by several volunteer users for auditing their Facebook profiles. In each case a dataset was built from real profiles collected in the user's neighborhood network.</p><p>Related works. In <ref type="bibr" target="#b14">[15]</ref> the authors propose algorithms to detect whether a sensitive attribute value can be inferred from the neighborhood of a target user in a social network. Heatherly et al. <ref type="bibr" target="#b11">[12]</ref> infer attribute values in social networks with Bayesian classification techniques. For the same purpose Estivill-Castro et al. <ref type="bibr" target="#b6">[7]</ref> employ decision-tree learning algorithms. In these works learning is performed off-line on large datasets. In order to perform attribute inference from the sparser datasets that our tool collects in a short time from the user's ego-network, we rather use random walk-based learning. The random walk technique has been applied to social representations in <ref type="bibr" target="#b13">[14]</ref> and <ref type="bibr" target="#b10">[11]</ref>, where the authors analysed friendships and used a skip-gram model. In these works, user profiles that have similar friends are mapped to similar representations. This model helps to detect communities and can predict a set of potential friends for a given user profile.
The skip-gram model has also been applied recently to infer social relationships from mobility data <ref type="bibr" target="#b4">[5]</ref>. Our work uses the better adapted Continuous Bag of Words (CBOW) model to predict the most likely values of a user's sensitive attribute. In our setting, friends are considered as one attribute among others. Moreover, we first automatically determine a relevant subset of the social network in order to optimize the random walks. In <ref type="bibr" target="#b2">[3]</ref> the relevance of attributes is computed by Bayesian optimisation, which is much less efficient than the graph comparison approach adopted here. In particular, <ref type="bibr" target="#b2">[3]</ref> does not succeed in reasonable time (on a standard PC) with attributes like gender and relationship status. Let us note that, unlike <ref type="bibr" target="#b0">[1]</ref>, our approach does not need any ontology to perform semantic correlation between attributes. Finally, we note that SONSAI is in some respects close to a recommendation system: an item suggestion can be viewed as an attribute value prediction <ref type="bibr" target="#b3">[4]</ref>. However, unlike recommendation systems, SONSAI also provides explanations for the predicted values, namely an ordered list of attributes that have played a significant role in the computation. For enforcing privacy, our final goal is to reduce the "recommendation accuracy" by acting on this list of attributes. The architecture of SONSAI is overviewed in Figure <ref type="figure" target="#fig_0">1</ref>. The Facebook crawler (about 5k lines of Java 8) drives a Firefox 58.0b4 browser through a Selenium 3.5.3 server 1 . Information collected from each profile, group and page is stored in separate XML files. The Anonymizer, Cleanser, Random walker and Ranker components are written in Python 2.7 (about 2.5k lines of code).
The Anonymizer component parses all the XML files, generates anonymized graphs and stores them as TSV files. Then, the Cleanser selects the most relevant graphs and stores them as adjacency lists. The Random walker component browses the adjacency lists and stores the resulting walks in a text document. We use the Python gensim 2 implementation of Word2Vec to parse the text document and compute a vectorial representation of social network nodes encountered in the walk. Finally, the Ranker component classifies the sensitive nodes according to their similarity to the target user profile.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">ARCHITECTURE OF SONSAI</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">SAMPLING FACEBOOK</head><p>We have designed a crawler that explores the social network around a user (to some distance given as a parameter) and collects public information from the visited web pages (i.e. friends, liked pages . . . ) in order to build a representative subnetwork. We distinguish two types of Facebook nodes: user profiles (u) and pages (p) and two types of links: like-ships between user profiles and pages, and friendships between user profiles. Given a node 𝑐, 𝑐.𝑛 denotes the set of nodes that are linked to 𝑐. A discovered node is a node whose URL is known by the crawler. For instance, if the crawler retrieves a user profile and collects its public friends list, then all the friends of that particular user profile are discovered. Algorithm 1 crawls at most 𝑛𝑐 nodes at distance ≤ 𝑑 from the target node 𝑢𝑡. Each iteration of the outer loop samples a node, crawls it and update the sets of discovered and crawled nodes. The sampling is done by random walks of length ≤ 𝑑 with a transition probability 𝜋 designed to crawl with higher priority closer nodes and to favour neighbour nodes according to their type. Function sinks(𝑗) returns the set of sinks, i.e. crawled nodes such that all discovered nodes at distance ≤ 𝑗 are also crawled. Sinks are avoided by the random walks to guarantee that the final node has not been crawled yet.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SOCIAL NETWORK MODEL</head><p>Modelling friendship relations. Since friendship on Facebook is symmetric, we model friendship between user profiles by an undirected graph (𝑈, 𝐹 ) where 𝑈 is a set of users' profiles and 𝐹 is a set of friendship links between them.</p><p>Modelling page like-ship. We model like-ship between user profiles and pages by several bipartite graphs (𝑈, 𝑃, 𝐿) where 𝑈 is a set of users profiles, 𝑃 is a set of pages (a type) and 𝐿 is a set of like-ship links between them. Figure <ref type="figure" target="#fig_2">2</ref> shows an example of page like-ship modelled by two graphs <ref type="foot" target="#foot_0">3</ref> . Graph (𝑎) models liked pages of music type and Graph (𝑏) models liked pages of book type. We note that user profiles can like several pages of the same type. Anonymizing the social network graphs. For ethical and regulation reasons Facebook identifiers are replaced by fresh identifiers. Each node in the network is then identified by a unique integer ID replacing its Facebook ID. The anonymized IDs are sorted according to the node types. Anonymized graphs are saved under the tab-separated value (TSV) format, one of the most general delimiterseparated values format (DSV). TSV is widely used in graph exchange. In contrast to the dataset released by Netflix 4 where only user IDs are anonymised (but not movies title, rating, date of rating), all attribute values are anonymised in our datasets.</p><p>In the following a sensitive graph is a graph that models an attribute that is considered sensitive by the user (i.e. the user does not want its value to be predictable). The learning graphs are attribute graphs available for the learning module in order to predict hidden links in the sensitive graph.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">MODEL CLEANSING</head><p>The task of predicting a sensitive attribute from the other ones is made difficult by the size of datasets. Therefore, we suggest to apply the inference process only on a subset of most relevant attributes for the task. Our relevance notion does not rely on semantic proximity since we noticed that i) a user that hides a sensitive attribute will probably hide other semantically-related attributes, and moreover ii) for fully anonymised datasets the attributes semantics is hard to recover. On the contrary we will rely on comparing attributes graph structures. In the following we assume a fixed sensitive graph 𝑠.</p><p>Step 1: Computing the learning and the confidence rates of learning graphs. In order to compare the structure of a given learning graph to the structure of the sensitive graph, we first split each graph in two parts. The first part contains user profiles that hide their links in the sensitive graph. The ratio of user profiles that publish their links in the first part of the learning graph represents the learning rate 𝑙𝑟. The second part contains user profiles that publish their links in the sensitive graph. The ratio of user profiles that publish their links in the second part of the learning graph represents the confidence rate 𝑐𝑟.</p><p>The function connected_profiles_in() returns the set of user profiles that publish their links in the graph given as argument. Given 𝑙, 𝑠 respectively a learning graph and a sensitive graph and 𝑈 the set of user profiles in all graphs, we define:</p><formula xml:id="formula_0">𝑈 𝑙 = connected_profiles_in(𝑙) 𝑈 𝑠 = connected_profiles_in(𝑠) 𝑙𝑟(𝑙) = size(𝑈 𝑙 ∩ (𝑈 ∖ 𝑈 𝑠)) / size(𝑈 ∖ 𝑈 𝑠) 𝑐𝑟(𝑙) = size(𝑈 𝑙 ∩ 𝑈 𝑠) / size(𝑈 𝑠) Figure 3</formula><p>depicts an example of splitting two graphs for comparison. The graph that models the link-ship between user profiles and pages of politicians is the sensitive graph. 
The graph that models the like-ship between user profiles and music pages is the learning graph. In this example the learning rate 𝑙𝑟 is equal to 50% and the confidence rate 𝑐𝑟 is equal to 75%.</p><p>Step 2: Computing the distance between a learning graph and the sensitive graph. In this step, we discard user profiles that have a null degree in the learning graph or in the sensitive graph. The Jaccard index between two user nodes 𝑢1 and 𝑢2 in a given graph 𝐴 is computed as follows, where the function 𝑙𝑖𝑛𝑘𝑠𝐴(𝑢𝑗) returns the set of nodes to which user node 𝑢𝑗 is connected in the graph 𝐴.</p><formula xml:id="formula_1">𝐽𝐴(𝑢1, 𝑢2) = |𝑙𝑖𝑛𝑘𝑠𝐴(𝑢1) ∩ 𝑙𝑖𝑛𝑘𝑠𝐴(𝑢2)| / |𝑙𝑖𝑛𝑘𝑠𝐴(𝑢1) ∪ 𝑙𝑖𝑛𝑘𝑠𝐴(𝑢2)|<label>(1)</label></formula><p>The Hamming distance 𝐻 between graphs 𝑙 and 𝑠 is defined by:</p><formula xml:id="formula_2">𝐻(𝑙) = ∑_{𝑢𝑘, 𝑢𝑗 ∈ 𝑈𝑙 ∩ 𝑈𝑠, 𝑘 ≠ 𝑗} |𝐽𝑙(𝑢𝑘, 𝑢𝑗) − 𝐽𝑠(𝑢𝑘, 𝑢𝑗)|<label>(2)</label></formula><p>In order to compare learning graphs with different sets of common connected profiles 𝑈𝑙 ∩ 𝑈𝑠, we normalize this distance by the maximal Hamming distance that can be obtained on such a set. Hence we define the Hamming rate:</p><formula xml:id="formula_3">ℎ𝑟(𝑙) = 𝐻(𝑙) / 𝑀(𝑙) where 𝑀(𝑙) = ∑_{𝑢𝑘, 𝑢𝑗 ∈ 𝑈𝑙 ∩ 𝑈𝑠, 𝑘 ≠ 𝑗} 𝑀𝑎𝑥(𝐽𝑠(𝑢𝑘, 𝑢𝑗), 1 − 𝐽𝑠(𝑢𝑘, 𝑢𝑗)) (3)</formula><p>Step 3: Selecting the most relevant graphs for learning sensitive attribute values. We first discard the learning graphs that have a learning rate 𝑙𝑟 lower than a threshold 𝜃𝑙𝑟 since they do not convey enough information. We then discard the graphs that have a confidence rate 𝑐𝑟 lower than 𝜃𝑐𝑟 since they are considered unreliable. Finally, from the remaining graphs we only select those that have a Hamming rate ℎ𝑟 lower than 𝜃ℎ𝑟 since they are the most similar to the sensitive graph.</p></div>
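The three rates of Steps 1 and 2 are straightforward to compute from set operations. Below is a minimal sketch, assuming each graph is represented as a dictionary mapping a user to the set of values the user publicly links to (an empty set means the user hides them); the helper names are ours, not SONSAI's.

```python
from itertools import combinations

def lr_cr(learning, sensitive, users):
    """Learning rate lr and confidence rate cr of Step 1."""
    U_l = {u for u in users if learning.get(u)}
    U_s = {u for u in users if sensitive.get(u)}
    hidden = users - U_s
    return len(U_l & hidden) / len(hidden), len(U_l & U_s) / len(U_s)

def jaccard(graph, u1, u2):
    """Jaccard index of Equation (1) between two user nodes."""
    a, b = graph.get(u1, set()), graph.get(u2, set())
    return len(a & b) / len(a | b) if a | b else 0.0

def hamming_rate(learning, sensitive):
    """Normalized Hamming distance hr of Equations (2)-(3),
    computed over the common connected profiles."""
    common = [u for u in learning if learning.get(u) and sensitive.get(u)]
    H = M = 0.0
    for u, v in combinations(common, 2):
        js = jaccard(sensitive, u, v)
        H += abs(jaccard(learning, u, v) - js)
        M += max(js, 1 - js)
    return H / M if M else 0.0
```

Step 3 then reduces to filtering the learning graphs against the three thresholds 𝜃𝑙𝑟, 𝜃𝑐𝑟 and 𝜃ℎ𝑟.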
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Densifying graphs</head><p>For some sensitive attributes such as gender, age and relationship status, user profiles are linked to at most one value. Moreover the sets of values for these particular attributes are much smaller than for other attributes. Consequently, the graphs that model these attributes are denser than the other ones. In this case, for improving the random-walk based learning process (see Section 6) we need to merge several learning graphs in order to obtain a denser one. We explain the method with a simple example of gender prediction: we select attribute graphs with high 𝑙𝑟, 𝑐𝑟 rates but with also a good rate of discrimination between genders, that is the gender of connected users in the graph is unbalanced between male and female. For instance we can select jewelry and fast-food graphs. We merge these graphs by grouping all fast-foods in a unique node and similarly for jewelries as shown in Figure <ref type="figure" target="#fig_4">4</ref> to obtain a new learning graph. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">RANDOM WALK-BASED ATTRIBUTE LEARNING</head><p>Representation generation. We plan to translate latent information from the selected graphs (in previous section) to a document. The resulting document holds information about paths in the graphs and their frequencies that can be exploited for infering proximity of a user node to some sensitive attribute value node. Following <ref type="bibr" target="#b13">[14]</ref>, paths in the social graph are sampled by random walks. In our case, the walks are executed only in the subset of selected graphs.</p><p>Let 𝐺1, 𝐺2, . . . 𝐺𝑛 be the list of selected learning graphs.Let 𝑈 be the set of users in all graphs. We assume that the friendship graph is selected and 𝐺1 = (𝑈, 𝐹 ) (otherwise we can simply adapt the computation below). The other graphs are bipartite and we pose 𝐺𝑖 = (𝑈, 𝑉𝑖, 𝐿𝑖) for 𝑖 &gt; 1.</p><p>We introduce quotas to quantify the importance of each graph 𝐺𝑟 for inferring secret values of the target sensitive attribute. Each selected graph 𝐺𝑟 is assigned a 3 dimensional vector</p><formula xml:id="formula_4">𝑉𝐺 𝑟 = [𝑙𝑟(𝐺𝑟), 𝑐𝑟(𝐺𝑟), 1 − ℎ𝑟(𝐺𝑟)].</formula><p>The quota of 𝐺𝑟 is given by its Mahalanobis distance to the null vector [0,0,0]. It is computed as follows:</p><formula xml:id="formula_5">𝑞(𝐺𝑟) = √︀ 𝑉 𝑇 𝐺𝑟 Σ −1</formula><p>𝑉𝐺 𝑟 with Σ the 3 × 3 covariance matrix over the set of selected graph vectors.</p><p>To specify the random walk transitions we first define the probability 𝑝𝑢,𝑦 that being in node 𝑢 the next node in the random walk is in a selected graph 𝐺𝑦:</p><formula xml:id="formula_6">𝑝𝑢,𝑦 = {︃ 𝑞(𝐺𝑦 ) ∑︀ {𝑥|𝑑𝑒𝑔𝑥 (𝑢)&gt;0} 𝑞(𝐺𝑥) if 𝑑𝑒𝑔𝑦(𝑢) &gt; 0 0 otherwise<label>(4)</label></formula><p>where 𝑑𝑒𝑔𝑥(𝑢) is the degree of user 𝑢 in graph 𝐺𝑥. A value node 𝑣 is followed by a user node chosen uniformly at random from the ones connected to 𝑣. 
Assuming the node following user node 𝑢 is in 𝐺𝑦, it is chosen uniformly at random from the nodes in 𝐺𝑦 that are connected to 𝑢. The transition probabilities therefore combine 𝑝𝑢,𝑦 with these uniform choices (𝐺1 is the friendship graph). As illustrated in Figure <ref type="figure">5</ref>, the document is constructed by connecting all graphs through random jumps between their nodes (see also <ref type="bibr" target="#b13">[14]</ref>). At each step the walker state changes and a new word is written in the text document. One step amounts to selecting a graph in which the current node has non-null degree, and then selecting a node that is connected to the current node in the selected graph. The selected node then becomes the new current node.</p><p>In this example we aim to predict the liked politician pages masked by user profile 𝑢3. The sensitive graph is Graph 2 and the learning graphs are Graph 1 and Graph 3. Since the values of the sensitive attribute (the politician pages) are labelled (each value belongs to a unique cluster), they are represented by the labels of their clusters in the final document. We use a greedy clustering algorithm <ref type="bibr" target="#b2">[3]</ref> to define clusters of similar size. Politician pages that share many common "likers" end up in the same cluster. For instance, the first walk depicted in Figure <ref type="figure">5</ref> is [𝑢1, 𝑢4, 𝑣2,3, 𝑢4]. But for efficiency the walk [𝑢1, 𝑢4, 𝑐2,2, 𝑢4] is stored in the document instead, since the value 𝑣2,3 belongs to the cluster 𝑐2,2.</p><p>Applying word2vec to compute node representations. We have performed multi-graph random walks in the social network and generated a text document. Walks in the document can be interpreted as sentences whose words are network nodes. Hence, inferring a link between a user node and an attribute value node is similar to the problem of estimating the likelihood of word co-occurrence in a corpus.
We use word2vec <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b12">13]</ref> to map one-hot encoded vectors that represent words in a high-dimensional vocabulary space to low-dimensional vectors (see <ref type="bibr" target="#b5">[6]</ref>). Word2vec is employed with the Continuous Bag of Words (CBOW) model for predicting a word given its context, defined by the 𝑐 − 1 words surrounding it (𝑐 is the window size of the context). In our case, the inputs of the model are users and published attribute values. Since there is no order between the attribute values, the CBOW model is more adequate than skip-gram. The output of the CBOW model is a vector of size 𝑣 representing a probability distribution of co-occurrence between all the words of the context and each word from the vocabulary within a window of size 𝑐.</p><p>Ranking sensitive attribute values. We measure the semantic similarity between two nodes by the cosine of the angle formed by the vectors representing the nodes. Cosine similarity is known to take greater account of context information. We rank all the sensitive values by their cosine similarity to the target user profile. The values with the smallest rank (i.e. the most similar) are most likely the secret values of the target user's sensitive attribute, where secret values are the true values of the target user that he does not publish on the social network.</p><p>Figure <ref type="figure" target="#fig_6">6</ref> depicts an example of 2-dimensional vectors that encode 8 nodes: 3 user profiles (𝑢1, 𝑢2 and 𝑢3), 2 music pages (𝑣3,1 and 𝑣3,2) and 3 clusters of politician pages (𝑐2,1, 𝑐2,2 and 𝑐2,3). The clusters of politician pages are the sensitive values and their vectors are red. The node 𝑢1 is the target user profile and its vector is blue. The clusters of politician pages are ranked according to their distances to 𝑢1.
The inference algorithm then infers, as the most probable politician pages liked by 𝑢1, the politician pages of the cluster that has the smallest rank (the closest cluster to 𝑢1).</p><p>In <ref type="bibr" target="#b15">[16]</ref> Schakel et al. show that the word2vec unsupervised learning algorithm encodes word semantics by assigning vectors with similar directions to co-occurring words during training. Besides, the magnitude of a vector reflects both the frequency of appearance of the related word in the corpus and the homogeneity of its contexts, where a context is a set of words that have a high co-occurrence probability in the corpus.</p><p>In fact, words that appear in the same contexts have small angular distances between them. The less the contexts overlap, the larger the angular distances between their words are. However, words that appear in many contexts are represented by vectors that average vectors pointing in many context directions. Hence, vector magnitude generally decreases with the number of contexts. Moreover, the higher the word frequency is, the higher the chance that the word is used in different contexts. Consequently, vector magnitude also decreases with frequency. From these remarks, we conclude that the Euclidean distance is not a good measure for our inference purpose. Indeed, words that appear in many contexts have low magnitude; as a result, their Euclidean distances will be small and, using this criterion, they would be considered close even if they do not appear in any common context. For instance, the Euclidean distance between the clusters of pages of the most popular politicians will be small even if those politicians are rivals. In the example depicted in Figure <ref type="figure" target="#fig_6">6</ref> the politicians of the clusters 𝑐2,1 and 𝑐2,2 are rivals: the angular distance between those two clusters is large, but the Euclidean distance is small.
Moreover, the Euclidean distance between a user that has many friends, for instance the user 𝑢1 in Figure <ref type="figure" target="#fig_6">6</ref>, and a popular song like "Despacito", for instance the music page 𝑣2,1 in Figure <ref type="figure" target="#fig_6">6</ref>, will be small. But popular users do not necessarily like popular songs.</p></div>
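The ranking step above can be sketched with toy embeddings in the spirit of Figure 6. All coordinates below are invented for illustration; only the principle matters: the sensitive clusters are sorted by cosine similarity to the target user vector, and the first cluster is the predicted one.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 2-D embeddings in the spirit of Figure 6 (all values invented)
vectors = {
    "u1":   np.array([1.0, 0.2]),   # target user profile
    "c2_1": np.array([0.9, 0.1]),   # sensitive clusters of politician pages
    "c2_2": np.array([-0.2, 1.0]),
    "c2_3": np.array([0.1, -1.0]),
}

# rank the sensitive clusters by similarity to the target user;
# the cluster ranked first is the inferred secret value
ranking = sorted(["c2_1", "c2_2", "c2_3"],
                 key=lambda c: cosine(vectors["u1"], vectors[c]),
                 reverse=True)
```

Note that cosine similarity ignores vector magnitudes entirely, which is exactly why it avoids the popularity artefacts of the Euclidean distance discussed above.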
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">EXPERIMENTS</head><p>To build datasets we have crawled Facebook profiles of people that live in North-East France and in Île-de-France (Paris). Table <ref type="table" target="#tab_1">1</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Political orientation</head><p>From each node 𝑛 we perform a random-walk of 800 steps. The dimension of the node representation is taken to be 512. The dimension is usually taken between 100 and 300 for natural languages. However the size of the vocabulary in social networks (equal to the number of nodes) is much higher than in natural languages.</p><p>In Dataset 1 the sensitive graph represents the links between 2554 user profiles and 4589 politician pages. For each experiment we generate a new social graph from the dataset by selecting the user profiles that publish their preferences concerning the sensitive attribute (pages of politicians) and at least another attribute. Then we remove all the links in the sensitive graph of 10% of the selected user profiles. The algorithm makes sure that all the nodes in the resulting social graph remain connected. The experiments have consisted then in inferring the hidden links based on information from the learning graphs.</p><p>Among the 1928 learning graphs, we selected the ones with learning rate greater than 𝜃 𝑙𝑟 = 20%, confidence rate greater than 𝜃𝑐𝑟 = 60% and Hamming rate lower than</p><formula xml:id="formula_7">𝜃 ℎ𝑟 = 4%.</formula><p>Table <ref type="table" target="#tab_2">2</ref>   We note that the communities graph has the second greatest learning rate 𝑙𝑟 = 44.97%, which means that it holds much latent information about users who hide their likes in the sensitive graph. It also has the maximal confidence rate 𝑐𝑟 = 98.47 and the fifth lowest Hamming rate (ℎ𝑟 = 1.75%) among the 23 selected relevant graphs, which means that its structure is very similar to the structure of the politicians graph. The friendship graph (Users) has the maximal learning rate 𝑙𝑟 = 88.37% and a high confidence rate 𝑐𝑟 = 83.98% since 87.62% of users are connected to this graph. 
However, this graph has a Hamming rate ℎ𝑟 = 2.08%, greater than the average ℎ𝑟 of the selected graphs, which means that information learned from this graph is less reliable than information learned from the communities graph.</p><p>We use the area under the ROC curve (AUC), as defined in <ref type="bibr" target="#b7">[8]</ref>, to measure the accuracy of the inferred links. For the defined thresholds (𝜃𝑙𝑟 = 20%, 𝜃𝑐𝑟 = 60%, 𝜃ℎ𝑟 = 4%) the inference accuracy is equal to 0.79. However, the inference accuracy when 23 relevant graphs are selected randomly is only 0.41. We conducted more tests by manually selecting 3 graphs that are semantically close to politics, as follows: graph 𝐺1 models the links between 1246 user profiles and 2357 political organizations, 𝐺2 models the links between 1120 user profiles and 1758 political parties, and 𝐺3 models the links between 39 user profiles and 41 political ideologies. Although the selected graphs seem promising, the inference accuracy is only 0.46. This can be explained by the fact that the selected graphs are very sparse: users are vigilant when publishing their preferences about those attributes, so the algorithm cannot learn well from them.</p><p>We note that the musicians/bands graph was automatically selected by our relevance-based selection method, confirming that music and politics are correlated, as was empirically discovered in previous studies <ref type="bibr" target="#b16">[17]</ref>.</p><p>Table <ref type="table" target="#tab_4">3</ref> summarizes the results of the conducted experiments.  </p></div>
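The AUC measure used above can be computed without any library by its probabilistic interpretation: the probability that a true hidden link receives a higher score than a non-link. This is a simple stand-in for the ROC-based definition cited as [8], shown here for illustration.

```python
def auc(pos_scores, neg_scores):
    """AUC estimated by pairwise comparison: the probability that a
    hidden (true) link is scored higher than a non-link, with ties
    counted as one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 1.0 means every hidden link outranks every non-link, 0.5 corresponds to random guessing, which makes the reported 0.79 (relevance-based selection) versus 0.41 (random selection) directly comparable.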
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Gender and relationship status</head><p>To produce a text document from the social graph, we perform random-walks of 80 steps. Each random walk starts from a different node. The dimension of the node vectorial representation is taken to be 128 since in that case the number of sensitive values is smaller. Moreover learning graphs are obtained by grouping several attribute values in the merging operation (see <ref type="bibr">Section 5)</ref>. For the experiments, we have generated a new social graph from the dataset by randomly selecting 10% of user profiles that publish the value of their sensitive attribute and we have masked it. Then SONSAI has tried to infer the masked values of the sensitive attribute for each user based on the selected attributes by the cleanser module.</p><p>Relationship status. The sensitive graph models the relationship status of user profiles. To simplify the presentation we define two meta-relationship status as follows: 𝑅1= {𝑆𝑖𝑛𝑔𝑙𝑒, 𝐷𝑖𝑣𝑜𝑟𝑐𝑒𝑑, 𝑆𝑒𝑝𝑎𝑟𝑎𝑡𝑒𝑑, 𝑊 𝑖𝑑𝑜𝑤𝑒𝑑, 𝐶𝑜𝑚𝑝𝑙𝑖𝑐𝑎𝑡𝑒𝑑} 𝑅2= {𝐷𝑜𝑚𝑒𝑠𝑡𝑖𝑐 𝑝𝑎𝑟𝑡𝑛𝑒𝑟𝑠ℎ𝑖𝑝, 𝑀 𝑎𝑟𝑟𝑖𝑒𝑑, 𝐸𝑛𝑔𝑎𝑔𝑒𝑑, 𝑅𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝, 𝐶𝑖𝑣𝑖𝑙 𝑢𝑛𝑖𝑜𝑛, 𝑂𝑝𝑒𝑛 𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛}</p><p>We aim to infer the meta-relationship status of users. Table <ref type="table" target="#tab_6">4</ref> gives more details about the selected attributes from dataset D2.</p><p>We notice that discriminant attributes toward 𝑅1 are focused around educations and leisures. On the other hand, discriminant attributes toward 𝑅2 are focused around business. The accuracy in AUC of inferring the metarelationship status is higher than 0.7 in both datasets D1  and D2 as soon as the target publishes values concerning at least 4 selected attributes by the cleanser.</p><p>Gender. The sensitive graph models the gender of user profiles. We notice that discriminant attributes toward male are focused around sports, games and software. On the other hand, discriminant attributes toward female are focused around health, home and luxury. 
The accuracy in AUC of inferring the gender is higher than 0.83 in dataset D1 and higher than 0.67 in dataset D2 as soon as the target publishes values concerning at least 2 of the attributes selected by the cleanser. The ranking task only has to compute the cosine similarity between the few vectors that represent the sensitive values and the target user vector. The cleansing task selects the most important attributes and thereby reduces the vocabulary; consequently, it accelerates the inference tasks (relying on the Random walker and Word2Vec). Moreover, cleansing increases accuracy by discarding irrelevant information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Processing times</head></div>
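The ranking step described in Section 7.2 only needs a handful of cosine similarities between the target user's vector and the vectors of the candidate sensitive values. A minimal sketch, with made-up 2-dimensional vectors standing in for the learned 128-dimensional embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_sensitive_values(target_vec, value_vecs):
    """Order candidate sensitive values by similarity to the target user."""
    scored = [(name, cosine(target_vec, vec)) for name, vec in value_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative vectors only: this target leans toward R2.
target = [0.9, 0.1]
candidates = {"R1": [0.0, 1.0], "R2": [1.0, 0.0]}
ranking = rank_sensitive_values(target, candidates)  # R2 ranked first
```

Because only a few candidate values exist per sensitive attribute, this step is essentially free compared to cleansing and embedding, which matches the processing times reported in Table 6.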
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Parameter sensitivity analysis</head><p>Let us investigate the impact of the cleansing parameters 𝑙𝑟, 𝑐𝑟 and ℎ𝑟. All experiments detailed in this section are conducted on dataset D1 to infer users' political orientation.</p><p>Table <ref type="table" target="#tab_11">7</ref> shows that only 3 of the 1928 available graphs have a learning rate 𝑙𝑟 higher than 40%. Inference based on those graphs can be very inaccurate: accuracy based on the gender graph is only 0.36, it improves to 0.64 when based only on the users (i.e. friendship) graph, and the communities graph yields a high accuracy of 0.74 for inferring political views. However, we notice that the best accuracy is obtained when selecting graphs with a learning rate between 10% and 40%. Table <ref type="table" target="#tab_11">7</ref> thus shows that the learning rate 𝑙𝑟 is important for selecting the best graphs for inference. However, accuracy does not depend on this parameter alone, since some graphs with a high learning rate, such as the gender graph, may still lead to very low accuracy.</p><p>Table <ref type="table" target="#tab_12">8</ref> shows that accuracy increases as the Hamming rate ℎ𝑟 decreases. However, most graphs have a low Hamming rate only because a small part of them can be compared to the sensitive graph, as few users publish their preferences in both graphs; their structure is therefore not fairly comparable to the politicians' graph structure. To cope with this problem we compute a third parameter, the confidence rate 𝑐𝑟, which indicates how reliable the structure comparison is.</p><p>Table <ref type="table" target="#tab_13">9</ref> shows that the confidence rate 𝑐𝑟 gives no information about the best graph to select when considered in isolation; it must be coupled with the other parameters. 
For instance, if a graph 𝑔 has a high confidence rate but a low Hamming rate, then 𝑔 is a good candidate for inference. Conversely, if 𝑔 has a high confidence rate and a high Hamming rate, then 𝑔 is probably a bad graph for inferring the sensitive attribute. A graph 𝑔 with a low 𝑐𝑟 and a high ℎ𝑟 could nevertheless be interesting for inference.</p></div>
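The observations above suggest a simple rule combining the three parameters when selecting learning graphs. The sketch below is one illustrative reading of the trends in Tables 7 to 9 (keep graphs with 𝑙𝑟 in the best band, a low ℎ𝑟 and a high 𝑐𝑟); the thresholds are assumptions for illustration, not parameters fixed by SONSAI, and the sample statistics are taken from Table 2:

```python
def select_graphs(stats, lr_band=(10.0, 40.0), hr_max=5.0, cr_min=60.0):
    """Filter candidate attribute graphs by their cleansing statistics.

    stats maps a graph name to its (lr, cr, hr) tuple, all in percent.
    The thresholds are illustrative readings of Tables 7-9: keep graphs
    whose learning rate falls in the best-performing band and whose
    structure comparison is both close (low hr) and reliable (high cr).
    """
    kept = []
    for name, (lr, cr, hr) in stats.items():
        if lr >= lr_band[0] and lr_band[1] > lr and hr_max > hr and cr >= cr_min:
            kept.append(name)
    return kept

# (lr, cr, hr) values taken from Table 2 (politicians, dataset D1).
stats = {
    "Users":       (88.37, 83.98, 2.08),
    "Communities": (44.97, 98.47, 1.75),
    "Musicians":   (38.58, 91.38, 2.03),
    "SportsTeams": (21.48, 63.93, 2.35),
}
selected = select_graphs(stats)  # keeps "Musicians" and "SportsTeams"
```

Under these illustrative thresholds the users and communities graphs are rejected for their out-of-band learning rates, matching the observation that a high 𝑙𝑟 alone does not guarantee accurate inference.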
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSION</head><p>The SONSAI application enables users to predict their sensitive links in social networks with a relatively small amount of data and computing resources. Indeed, sensitive data inferences are fast and accurate on typical personal attributes. It should be noted that the friendship graph was not selected among the important graphs for deducing either the gender or the relationship status of users. This probably means that alternative techniques based solely on homophily would be inaccurate in this context. Moreover, we have observed that the privacy of users is threatened as soon as they publish at least three important attributes. These attributes are automatically brought to light by SONSAI, regardless of their semantics and only through a structural analysis of the social network graph. As future work, we plan to incorporate countermeasures into our tool to protect users against posts that might compromise their privacy. We also plan to enhance the SONSAI prediction engine with a tool that discloses hidden friendship links using adequate combinations of queries provided by the social network <ref type="bibr" target="#b1">[2]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Architecture of SONSAI.</figDesc><graphic coords="2,60.00,145.03,219.46,74.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Crawling nodes around a target user.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example of pages like-ship graphs.</figDesc><graphic coords="2,310.93,409.00,229.22,224.34" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Example of cutting graphs for structure comparison.</figDesc><graphic coords="3,310.93,83.67,229.22,180.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Merging graphs.</figDesc><graphic coords="4,84.39,174.57,170.70,114.42" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>5 )Figure 5 :</head><label>55</label><figDesc>Figure 5: Example of multi graph random walk.</figDesc><graphic coords="4,310.93,145.68,229.22,238.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Example of 2-dimensional vectors that encode 8 nodes.</figDesc><graphic coords="6,126.94,83.66,341.40,218.29" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Procedure crawl_nodes(𝑢𝑡, 𝑑, 𝑛𝑐, 𝜋)</head><label></label><figDesc></figDesc><table><row><cell>1 http://www.seleniumhq.org/</cell></row><row><cell>2 https://radimrehurek.com/gensim/index.html</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 : Details about the datasets.</head><label>1</label><figDesc>gives statistics about the two crawled datasets.</figDesc><table><row><cell>Dataset</cell><cell>♯ Attributes</cell><cell>♯ Attribute values</cell><cell>♯ Crawled user profiles</cell></row><row><cell>Dataset 1 (D1), North-East France</cell><cell>1 929</cell><cell>1 022 860</cell><cell>15 012</cell></row><row><cell>Dataset 2 (D2), Île-de-France</cell><cell>1 296</cell><cell>298 617</cell><cell>6 550</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 : Selected learning graphs in D1 for politicians.</head><label>2</label><figDesc>details the 23 selected graphs relevance measures.</figDesc><table><row><cell>Attribute graph</cell><cell>𝑙𝑟 (in %)</cell><cell>𝑐𝑟 (in %)</cell><cell>ℎ𝑟 (in %)</cell><cell>♯ Users</cell><cell>♯ Values</cell></row><row><cell>Users</cell><cell>88.37</cell><cell>83.98</cell><cell>2.08</cell><cell>13155</cell><cell>13155</cell></row><row><cell>Communities</cell><cell>44.97</cell><cell>98.47</cell><cell>1.75</cell><cell>8118</cell><cell>137338</cell></row><row><cell>Musicians/Bands</cell><cell>38.58</cell><cell>91.38</cell><cell>2.03</cell><cell>7141</cell><cell>84762</cell></row><row><cell>PublicFigures</cell><cell>32.86</cell><cell>92.44</cell><cell>1.80</cell><cell>6455</cell><cell>28289</cell></row><row><cell>NonProfitOrganizations</cell><cell>31.85</cell><cell>86.57</cell><cell>1.72</cell><cell>6180</cell><cell>25847</cell></row><row><cell>Artists</cell><cell>30.65</cell><cell>84.22</cell><cell>1.92</cell><cell>5970</cell><cell>31681</cell></row><row><cell>Companies</cell><cell>30.05</cell><cell>85.94</cell><cell>1.75</cell><cell>5939</cell><cell>20750</cell></row><row><cell>Websites</cell><cell>29.57</cell><cell>83.94</cell><cell>1.78</cell><cell>5829</cell><cell>17931</cell></row><row><cell>TVShows</cell><cell>29.48</cell><cell>82.41</cell><cell>2.31</cell><cell>5778</cell><cell>11876</cell></row><row><cell>EntertainmentWebsites</cell><cell>29.26</cell><cell>79.20</cell><cell>2.84</cell><cell>5669</cell><cell>8319</cell></row><row><cell>Media/NewsCompanies</cell><cell>29.23</cell><cell>87.27</cell><cell>1.82</cell><cell>5871</cell><cell>14042</cell></row><row><cell>Products/Services</cell><cell>27.52</cell><cell>80.93</cell><cell>1.86</cell><cell>5496</cell><cell>15986</cell></row><row><cell>News/MediaWebsites</cell><cell>27.44</cell><cell>83.43</cell><cell>1.86</cell><cell>5550</cell><cell>9247</cell></row><row><cell>Organizations</cell><cell>26.20</cell><cell>80.77</cell><cell>1.63</cell><cell>5328</cell><cell>14738</cell></row><row><cell>Movies</cell><cell>26.09</cell><cell>75.17</cell><cell>2.26</cell><cell>5171</cell><cell>16282</cell></row><row><cell>LocalBusinesses</cell><cell>24.91</cell><cell>78.58</cell><cell>1.69</cell><cell>5111</cell><cell>17321</cell></row><row><cell>Clothings</cell><cell>23.99</cell><cell>68.12</cell><cell>1.96</cell><cell>4729</cell><cell>16090</cell></row><row><cell>Gastronomies</cell><cell>23.52</cell><cell>71.73</cell><cell>2.24</cell><cell>4763</cell><cell>8422</cell></row><row><cell>Actors/Directors</cell><cell>23.12</cell><cell>74.54</cell><cell>2.78</cell><cell>4785</cell><cell>10425</cell></row><row><cell>Magazines</cell><cell>22.82</cell><cell>73.96</cell><cell>1.69</cell><cell>4733</cell><cell>9955</cell></row><row><cell>Athletes</cell><cell>22.68</cell><cell>68.79</cell><cell>2.35</cell><cell>4583</cell><cell>14123</cell></row><row><cell>ApplicationPages</cell><cell>21.68</cell><cell>66.36</cell><cell>3.04</cell><cell>4396</cell><cell>4244</cell></row><row><cell>SportsTeams</cell><cell>21.48</cell><cell>63.93</cell><cell>2.35</cell><cell>4309</cell><cell>10433</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 : Experimental results</head><label>3</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 4 : Selected attributes in D2 for relationship status.</head><label>4</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 5 : Selected attributes in D1 for gender.</head><label>5</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 6</head><label>6</label><figDesc>displays the processing times. The processor clock is 2.3 GHz; the cleansing and random walk algorithms are not parallelized. Cleansing takes more time than the other processes in the case of gender inference, since it handles hundreds of thousands of nodes, compares hundreds of graphs to the sensitive graph and computes their importance. The random walk, in the case of gender inference, is performed on a small graph containing only a few dozen super-values and a few thousand user profiles; in the case of politicians inference, it is performed on larger graphs containing tens of thousands of values. The machine has only 8 GB of RAM. Each chunk of 5k steps is stored separately in a text file of about 25 MB; those files are then parsed by Word2Vec. Word2Vec speed depends on the size of the document vocabulary: it is fast in the case of gender inference since the vocabulary is limited to user profiles, super-values (i.e. clusters of values) and sensitive values.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>Table 6 :</head><label>6</label><figDesc>Processing times.</figDesc><table><row><cell>Inference</cell><cell>Dataset</cell><cell>Cleansing (s)</cell><cell>Random walk (s)</cell><cell>Word2Vec (s)</cell><cell>Ranking (s)</cell></row><row><cell>Gender</cell><cell>D1</cell><cell>423</cell><cell>34</cell><cell>50</cell><cell>0.12</cell></row><row><cell>Gender</cell><cell>D2</cell><cell>243</cell><cell>25</cell><cell>30</cell><cell>0.12</cell></row><row><cell>Politicians</cell><cell>D1</cell><cell>782</cell><cell>523</cell><cell>924</cell><cell>1</cell></row><row><cell>Politicians</cell><cell>D2</cell><cell>574</cell><cell>451</cell><cell>733</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_11"><head>Table 7 :</head><label>7</label><figDesc>Impact of 𝑙𝑟 on inference accuracy.</figDesc><table><row><cell>𝑙𝑟 (in %)</cell><cell>Inference accuracy in AUC</cell><cell>♯ selected graphs</cell><cell>♯ attacked targets</cell><cell>♯ masked links</cell></row><row><cell>[0, 10[</cell><cell>0.61</cell><cell>1873</cell><cell>252</cell><cell>409</cell></row><row><cell>[10, 20[</cell><cell>0.80</cell><cell>31</cell><cell>254</cell><cell>411</cell></row><row><cell>[20, 30[</cell><cell>0.86</cell><cell>16</cell><cell>254</cell><cell>418</cell></row><row><cell>[30, 40[</cell><cell>0.80</cell><cell>5</cell><cell>253</cell><cell>410</cell></row><row><cell>[40, 50[</cell><cell>0.74</cell><cell>1 (Communities)</cell><cell>251</cell><cell>408</cell></row><row><cell>[70, 80[</cell><cell>0.36</cell><cell>1 (Genders)</cell><cell>213</cell><cell>353</cell></row><row><cell>[80, 90]</cell><cell>0.64</cell><cell>1 (Users)</cell><cell>214</cell><cell>350</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_12"><head>Table 8 :</head><label>8</label><figDesc>Impact of ℎ𝑟 on inference accuracy.</figDesc><table><row><cell>ℎ𝑟 (in %)</cell><cell>Inference accuracy in AUC</cell><cell>♯ selected graphs</cell><cell>♯ attacked targets</cell><cell>♯ masked links</cell></row><row><cell>[0, 5[</cell><cell>0.68</cell><cell>1744</cell><cell>253</cell><cell>410</cell></row><row><cell>[5, 10[</cell><cell>0.59</cell><cell>87</cell><cell>177</cell><cell>304</cell></row><row><cell>[10, 20[</cell><cell>0.53</cell><cell>58</cell><cell>87</cell><cell>167</cell></row><row><cell>[20, 30[</cell><cell>0.45</cell><cell>11</cell><cell>83</cell><cell>129</cell></row><row><cell>[30, 40[</cell><cell>0.42</cell><cell>13</cell><cell>11</cell><cell>21</cell></row><row><cell>[40, 50[</cell><cell>0.42</cell><cell>5</cell><cell>2</cell><cell>3</cell></row><row><cell>[50, 100]</cell><cell>0.41</cell><cell>10</cell><cell>211</cell><cell>351</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_13"><head>Table 9 :</head><label>9</label><figDesc>Impact of 𝑐𝑟 on inference accuracy.</figDesc><table><row><cell>𝑐𝑟 (in %)</cell><cell>Inference accuracy in AUC</cell><cell>♯ selected graphs</cell><cell>♯ attacked targets</cell><cell>♯ masked links</cell></row><row><cell>[0, 10[</cell><cell>0.63</cell><cell>1711</cell><cell>245</cell><cell>409</cell></row><row><cell>[10, 20[</cell><cell>0.43</cell><cell>94</cell><cell>246</cell><cell>407</cell></row><row><cell>[20, 30[</cell><cell>0.74</cell><cell>37</cell><cell>245</cell><cell>405</cell></row><row><cell>[30, 40[</cell><cell>0.54</cell><cell>28</cell><cell>248</cell><cell>404</cell></row><row><cell>[40, 50[</cell><cell>0.72</cell><cell>16</cell><cell>247</cell><cell>410</cell></row><row><cell>[50, 60[</cell><cell>0.38</cell><cell>14</cell><cell>250</cell><cell>407</cell></row><row><cell>[60, 70[</cell><cell>0.63</cell><cell>8</cell><cell>248</cell><cell>403</cell></row><row><cell>[70, 80[</cell><cell>0.60</cell><cell>6</cell><cell>248</cell><cell>393</cell></row><row><cell>[80, 90[</cell><cell>0.65</cell><cell>11</cell><cell>255</cell><cell>419</cell></row><row><cell>[90, 100]</cell><cell>0.82</cell><cell>3</cell><cell>253</cell><cell>410</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">Icon made by Smashicons from www.flaticon.com</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. This work is supported by MAIF Foundation (www.fondation-maif.fr).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">You are what you like! Information leakage through users&apos; Interests</title>
		<author>
			<persName><forename type="first">Chaabane</forename><surname>Abdelberi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gergely</forename><surname>Ács</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohamed</forename><forename type="middle">Ali</forename><surname>Kâafar</surname></persName>
		</author>
		<ptr target="https://www.ndss-symposium.org/ndss2012/you-are-what-you-information-leakage-through-users-interests" />
	</analytic>
	<monogr>
		<title level="m">19th Annual Network and Distributed System Security Symposium, NDSS 2012</title>
				<meeting><address><addrLine>San Diego, California, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012-02-05">2012. February 5-8, 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Online Link Disclosure Strategies for Social Networks</title>
		<author>
			<persName><forename type="first">Younes</forename><surname>Abid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abdessamad</forename><surname>Imine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amedeo</forename><surname>Napoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chedy</forename><surname>Raïssi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michaël</forename><surname>Rusinowitch</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-54876-0_13</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-54876-0_13" />
	</analytic>
	<monogr>
		<title level="m">Risks and Security of Internet and Systems -11th International Conference, CRiSIS 2016</title>
				<meeting><address><addrLine>Roscoff, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-09-05">2016. September 5-7, 2016</date>
			<biblScope unit="page" from="153" to="168" />
		</imprint>
	</monogr>
	<note>Revised Selected Papers</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Two-Phase Preference Disclosure in Attributed Social Networks</title>
		<author>
			<persName><forename type="first">Younes</forename><surname>Abid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abdessamad</forename><surname>Imine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amedeo</forename><surname>Napoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chedy</forename><surname>Raïssi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michaël</forename><surname>Rusinowitch</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-64468-4_19</idno>
		<ptr target="https://doi.org/10.1007/978-3-319-64468-4_19" />
	</analytic>
	<monogr>
		<title level="m">Database and Expert Systems Applications - 28th International Conference, DEXA 2017</title>
				<meeting><address><addrLine>Lyon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-08-28">2017. August 28-31, 2017</date>
			<biblScope unit="page" from="249" to="263" />
		</imprint>
	</monogr>
	<note>Proceedings, Part I</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Word Embeddings for User Profiling in Online Social Networks</title>
		<author>
			<persName><forename type="first">Anton</forename><surname>Alekseev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sergey</forename><forename type="middle">I</forename><surname>Nikolenko</surname></persName>
		</author>
		<ptr target="http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2734" />
	</analytic>
	<monogr>
		<title level="j">Computación y Sistemas</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">2</biblScope>
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">walk2friends: Inferring Social Links from Mobility Profiles</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Backes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mathias</forename><surname>Humbert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yang</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3133956.3133972</idno>
		<ptr target="https://doi.org/10.1145/3133956.3133972" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017</title>
				<meeting>the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017<address><addrLine>Dallas, TX, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-10-30">2017. October 30 -November 03, 2017</date>
			<biblScope unit="page" from="1943" to="1957" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Neural Probabilistic Language Model</title>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Réjean</forename><surname>Ducharme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pascal</forename><surname>Vincent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Janvin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1137" to="1155" />
			<date type="published" when="2003">2003. 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Empowering users of social networks to assess their privacy risks</title>
		<author>
			<persName><forename type="first">Vladimir</forename><surname>Estivill-Castro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Hough</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Md Zahidul</forename><surname>Islam</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData.2014.7004287</idno>
		<ptr target="https://doi.org/10.1109/BigData.2014.7004287" />
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Big Data, Big Data 2014</title>
				<meeting><address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-10-27">2014. 2014. October 27-30, 2014</date>
			<biblScope unit="page" from="644" to="649" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Link Prediction Methods and Their Accuracy for Different Social Networks and Network Metrics</title>
		<author>
			<persName><forename type="first">Fei</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katarzyna</forename><surname>Musial</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Colin</forename><surname>Cooper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sophia</forename><surname>Tsoka</surname></persName>
		</author>
		<idno type="DOI">10.1155/2015/172879</idno>
		<ptr target="https://doi.org/10.1155/2015/172879" />
	</analytic>
	<monogr>
		<title level="j">Scientific Programming</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<biblScope unit="page">13</biblScope>
			<date type="published" when="2015">2015. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">word2vec Explained: deriving Mikolov et al.&apos;s negative-sampling word-embedding method</title>
		<author>
			<persName><forename type="first">Yoav</forename><surname>Goldberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<idno>CoRR abs/1402.3722</idno>
		<imprint>
			<date type="published" when="2014">2014. 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Virgil</forename><surname>Griffith</surname></persName>
		</author>
		<ptr target="http://musicthatmakesyoudumb.virgil.gr/" />
		<title level="m">music that makes you dumb</title>
				<imprint>
			<date type="published" when="2007">2007. 2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">node2vec: Scalable Feature Learning for Networks</title>
		<author>
			<persName><forename type="first">Aditya</forename><surname>Grover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jure</forename><surname>Leskovec</surname></persName>
		</author>
		<idno type="DOI">10.1145/2939672.2939754</idno>
		<ptr target="https://doi.org/10.1145/2939672.2939754" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-08-13">2016. August 13-17, 2016</date>
			<biblScope unit="page" from="855" to="864" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Preventing Private Information Inference Attacks on Social Networks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Heatherly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kantarcioglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thuraisingham</surname></persName>
		</author>
		<idno type="DOI">10.1109/TKDE.2012.120</idno>
		<ptr target="https://doi.org/10.1109/TKDE.2012.120" />
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1849" to="1862" />
			<date type="published" when="2013-08">2013. Aug 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Distributed Representations of Words and Phrases and their Compositionality</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<idno>CoRR abs/1310.4546</idno>
		<imprint>
			<date type="published" when="2013">2013. 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Deep-Walk: online learning of social representations</title>
		<author>
			<persName><forename type="first">Bryan</forename><surname>Perozzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rami</forename><surname>Al-Rfou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><surname>Skiena</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD &apos;14</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-08-24">2014. August 24 -27, 2014</date>
			<biblScope unit="page" from="701" to="710" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">curso: protect yourself from curse of attribute inference: a social network privacy-analyzer</title>
		<author>
			<persName><forename type="first">Eunsu</forename><surname>Ryu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yao</forename><surname>Rong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jie</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashwin</forename><surname>Machanavajjhala</surname></persName>
		</author>
		<idno type="DOI">10.1145/2484702.2484706</idno>
		<ptr target="https://doi.org/10.1145/2484702.2484706" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd ACM SIGMOD Workshop on Databases and Social Networks</title>
				<meeting>the 3rd ACM SIGMOD Workshop on Databases and Social Networks<address><addrLine>DBSocial; New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013-06-23">2013. 2013. June, 23, 2013</date>
			<biblScope unit="page" from="13" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Measuring Word Significance using Distributed Representations of Words</title>
		<author>
			<persName><forename type="first">Adriaan</forename><forename type="middle">M J</forename><surname>Schakel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benjamin</forename><forename type="middle">J</forename><surname>Wilson</surname></persName>
		</author>
		<idno>CoRR abs/1508.02297</idno>
		<imprint>
			<date type="published" when="2015">2015. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">John</forename><surname>Street</surname></persName>
		</author>
		<title level="m">Music &amp; Politics</title>
				<imprint>
			<publisher>Polity Press</publisher>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
