Deanonymizing Users in Social Networking Services: an Ego-Network Analysis Approach Vladislav Chesnokov Peter Klyucharev Information Security Department Information Security Department Bauman Moscow State Technical University Bauman Moscow State Technical University Moscow, Russia Moscow, Russia v.o.chesnokov@yandex.ru pk.iu8@yandex.ru Abstract — One of the main problems in social networking reposting publications they interested in and performing other services monitoring systems is the incompleteness of analyzed actions revealing their identity. data. Anonymous users may participate in information warfare tampering public opinion. A new method for user profiling in Depending on quality and amount of data SNS users social networking services is proposed. It is based on analysis of deanonimization methods can be divided into four categories: user ego-network communities. The core idea of the method is  ones that use only publicly available data: user profiles, that each user has a unique set of attributes and user attributes are strongly related to his ego-network communities. User profile photos, publications, etc; attributes can be obtained as the union of attributes relevant to  technical methods (e.g. browser fingerprints); each community. The method was compared with majority voting and two community detection based approaches.  cross-SNS analysis; Experiments on four datasets from Facebook, Twitter, LiveJournal and VKontakte social networking services showed  social engineering. that the proposed method outperforms others and some user Of course, these methods can be combined. Methods from attributes can be determined with high precision and recall. The the first category are based on user attributes inferring and method is tolerant to node attributes partial absence. User their further analysis. It is quite obvious that a set of attributes attributes determined by the proposed method combined with like age, sex, location, education, workplace, etc is unique for additional sources of information may lead to user each person, so having them one can obtain user identity. deanonimization. Using additional data sources like graduates lists or other user Keywords — deanonimization; user profiling; ego-network; databases one can even get their full name. social graph; user profile attributes; community detection; information warfare; social networking service II. PROBLEM DEFINITION Consider unweighted undirected graph G’(V’, E’) with I. INTRODUCTION D(G’) = 2 which has such that Social networking service (SNS) monitoring is one of the key instruments for public opinion analysis, which is used in , (1) marketing, politics, science, etc. A good report must contain statistics about its target audience attributes. However, some where D(G’) is the diameter of graph G’. Then it is called an users in SNS may be anonymous, and some people may create ego-network for node u, while u is an ego or center. Denote profiles just to tamper the public opinion regarding some topic. In recent years there were many examples how SNS can , be used to manipulate its users and influence “the real world” (2) [1, 2, 3]. Revealing and deanonimization such profiles are the key problems of information warfare. . (3) SNS allow users to create profiles filled with personal Each vertex from V’ has a set of attributes (sometimes data. Most users publish a huge amount of personal attributes called features) from set F. I.e. there exists a map f: V → 2F. such as age, location, education, favorites, photos, etc. In practice, only a part of vertex attribute set is observable, so However, not everyone does that. There are several reasons to denote it by f’ such that have empty or scarcely filled profile: privacy issue, laziness, paranoia and others. Some users may think that fake names and no profile photo can assure their anonymity. Meanwhile . they can still use SNS to its full extent, communicating with (4) their friends and family, listing social links in their profile, This work was supported by RFBR (grant № 16-29-09517). 40 The task of profile inferring is to obtain such user profile of redundant and may lead to low results. More efficient methods vertex u that are based on ego-network community detection. For example, in [19] a generative model was used to obtain user circles (5) (local communities). In [18] some attributes of user profile were determined by ego-network communities with 70% where p* = f(u) and δ – some similarity measure of two sets, accuracy on LinkedIn dataset. One of the main ideas of the e.g. F-measure. method is to determine a type of links between ego-user and others with respect to vertex attributes. It was shown that this III. RELATED WORK method outperforms methods from [17] and [19] by accuracy There are quite lot methods which can be used to infer user of user profiling and community detection. However, this profile attributes. Most of them are based on machine learning. method requires knowledge about attributes nature and Others can use attribute inferring, majority voting, user supposes that communities may not intersect. preferences or community detection. The most basic method is the majority voting: the more IV. THE PROPOSED METHOD neighbors (or “friends”) of vertex has some attribute, the It is well known that social ties of a person higher the probability that this vertex has this attribute. This are not random. Users connected to him can be separated into method was used to determine the geolocation of Twitter users several groups formed by some attribute: classmates, in [4]. It has quite high precision (86%) but low recall (20%), colleagues, relatives, close friends, etc. Consider ego-network because only small fraction of users provides their location. communities. The result of the community detection algorithm More complex methods include attribute transfer. For proposed in [20] is a cover with corresponding label sets example, in [5] user geolocation was determined by median describing them. Suppose that the central vertex value of its neighbors location. The error of the method was belongs to all its ego-network communities. So, most likely, it less than 10 kilometers for half of the users from Twitter would have all attributes of them. The proposed method is of dataset. In [6] user preferences were transferred through their determining user profile is simply a union of attributes sets social ties and in [7] Facebook user geolocation was corresponding to these communities: determined in the same way. (6) Methods based on user preferences use more complex The main problem is to obtain this correspondence. The information about a user, which is harder to obtain. In [8] algorithm from [20, 21] can be simplified as we don’t need the profile attributes were determined by music preferences, and community membership, just labels. First, associate with each in [9] – by user reactions (“likes”). PGPI method proposed vertex v an empty set of key attributes Kv. Next, start an in [10] allows determining some attributes of user profile with iterative process of updating sets of key attributes. At each accuracy higher than 90%. More than that, it needs only iteration all nodes are visited. For each vertex v with neighbors limited amount of information (number of “facts”) and uses set and each attribute a sets users attributes, group membership, publications view count and likes count. (7) The quality of user profiles got by machine learning based and methods highly depends on quality of training data. In [11] a comparative analysis of several models was conducted. The task was to determine age and sex of 7 million telecom (8) provider clients by calls and SMS data. The best model are computed. If the sum is greater than the showed 0.85 for sex and 0.72 for age values of F-measure threshold , where αa is a fraction defining with 90% data as training. One of the common approaches is qualified majority for attribute a, then attribute a is added to to use user attributes with features extracted from their the set of key attributes Kv. The process stops when there were publications [12-15]. Accuracy higher than 89% for age and no changes at the last iteration. The original community sex detection can be reached with this method [12]. Machine detection algorithm has four more steps to determine learning can be used to combine several approaches. For community membership from key attributes sets and filter example, in [16] features like distance between user favorite some attributes. However without great loss of accuracy only publications and profiles were used to determine his described steps are required in the simplified algorithm. So, geolocation. Of course one can build an ensemble of the profile can be determined as classifiers to determine user attributes. (9) The proposed method is based on community detection approach. In [17] a greedy community detection algorithm Of course, not all attributes can be determined reliably regarding each attribute based on conductivity maximization is using this method, e.g. sex or last name. So, the attributes proposed. Given that value of some attribute is known for at should be filtered before applying the method. Some attributes least 20% of users, the accuracy for this attribute for other can be grouped if their nature is known – e.g. education. users is 80% on Facebook dataset for students and professors Unfortunately, attribute filtering is quite hard task nowadays, of two universities [17]. In [18] it was concluded that applying and may depend on SNS type. Some solutions for that may community detection algorithm to full social graph is include statistical analysis of obtained profiles or machine 41 learning. Most frequently, a data scientist may filter them by TABLE III. METHOD EVALUATION RESULTS (WITH ATTRIBUTE FILTERING). AVERAGE VALUES OF F-MEASURE hand on in semi-automatic way. Facebook Twitter VKontakte LiveJournal V. METHOD EVALUATION simple 0.769 0.642 0.424 0.678 majority The method was evaluated on four ego-networks datasets ground-truth 0.759 0.756 n/a n/a from Facebook1, Twitter2, VKontakte3 and LiveJournal4. The communities first two datasets are from SNAP dataset [22]. Two other Infomap 0.726 0.641 0.557 0.504 datasets were crawler by author. All datasets characteristics communities modularity are shown in Table 1. Each graph in all datasets has a profile maximization 0.793 0.675 0.578 0.625 (a set of binary attributes) for each vertex, including ego. communities AGM-fit Precision, recall and F-measure were used to measure the communities 0.780 0.669 0.624 0.668 quality of obtained profiles. The proposed method was also BigCLAM 0.860 0.720 0.728 0.750 compared with some others: simple majority; majority by communities communities (obtained by one of the five community CESNA 0.842 0.719 0.683 0.727 detection algorithms: modularity maximization [23], communities CESNA Infomap [24], AGM-fit [25], BigCLAM [26], CESNA [27] or attribute 0.466 0.752 0.767 0.874 ground-truth communities); method based on attribute weights weights in communities obtained by CESNA. For each method several proposed 0.841 0.866 0.860 0.908 threshold values were used, and for each graph threshold with method the best value of F-measure was used. For each SNS and profile detection method quality measures were averaged. The results for all methods without attributes filtering are shown in Table I. The values for all methods are quite low. So, TABLE I. DATASETS for each datasets attributes were filtered. They were grouped avg. avg. avg. by similarity and description. For Facebook dataset attributes number corresponding to education and hometown were kept (612 SNS number number number of of graphs of vertices of edges attributes total). In Twitter dataset attributes are hashtags and replies, so Facebook 10 417 17 017 228 only most popular hashtags were kept (86 total). For VKontakte dataset, education and workplace attributes were Twitter 973 138 2 350 665 used (61 361 total). For LiveJournal dataset attributes VKontakte 2 000 164 1 561 1 665 corresponding to education and hometown were kept (47 106 total). LiveJournal 700 149 1 335 2 368 The results for datasets with filtered attributes are shown in TABLE II. METHOD EVALUATION RESULTS (NO ATTRIBUTE FILTERING). Table III. Average values for F-measure are significantly AVERAGE VALUES OF F-MEASURE higher for all methods. Perhaps they can be improved even further with another attribute filtering. First of all, the result Facebook Twitter VKontakte LiveJournal for Twitter, VKontakte and LiveJournal datasets are similar simple majority 0.462 0.155 0.058 0.210 for all methods, but different for Facebook datasets. It can be ground-truth connected with small size of Facebook dataset or its nature (it 0.515 0.190 n/a n/a was created by volunteers [22], while others were crawled communities Infomap 0.491 0.183 0.170 0.099 from random profiles). communities modularity Simple majority method showed quite low results for all maximization 0.532 0.190 0.154 0.152 datasets except Facebook. Precision and recall tests communities demonstrated that this method lacks precision. One of the best AGM-fit results was given by ground-truth communities. Of course, 0.516 0.189 0.187 0.209 communities BigCLAM this method cannot be applied in practice, but high results for 0.570 0.234 0.260 0.264 it confirm hypothesis that community structure is strongly communities CESNA connected with profiles attributes, and improvement of 0.535 0.222 0.218 0.236 communities community detection algorithms may lead to increase of CESNA precision and recall of profile attributes detection methods. attribute 0.312 0.187 0.225 0.300 weights Methods based on communities obtained by graph proposed clustering algorithms (modularity maximization and Infomap) 0.600 0.274 0.323 0.369 method are not suitable for attributes detection. One of the probable reasons for that is that communities do not intersect and their 1 https://facebook.com union must contain all graph vertices. Similar results for 2 https://twitter.com community detection evaluation were obtained in [20], so the 3 https://vk.com better community detection, the more accurate user profiling. 4 Community detection algorithms based on affiliation model https://livejournal.com 42 outperform graph clustering algorithms, and the similar results VII. CONCLUSION can be declared for user profiling based on communities A user profile detection method was proposed. It is based obtained by these algorithms. on community detection algorithm. The core idea of the Second best method is user profiling by attributes weights method is that community structure of user ego-network is in communities given by CESNA. However it showed the strongly connected with attributes of his profile. worst result on Facebook dataset. The experiments showed that the proposed method The proposed method ties for the best result with CESNA outperforms other attribute transfer based methods by F- communities method for Facebook dataset, and the top results measure, precision and recall. Some attributes can be for Twitter, VKontakte and LiveJournal datasets determined with precision and recall close to one. The method outperforming closest competitors by 14.5%, 12% and 4% preserves its performance if node attributes are partially respectively. absent. Considering some chosen groups of attributes (see table This method can be used to determine user identity and IV) the results given by the proposed method by F-measure, deanonymize users in social networking services. Method precision and recall are close to the maximum, 1. application could improve public opinion analysis and help to detect and counteract users who tries to tamper it. TABLE IV. F-MEASURE, PRECISION AND RECALL OF THE PROPOSED METHOD FOR CHOSEN ATTRIBUTE GROUPS References Dataset Attribute group F-measure Precision Recall [1] J. Ratkiewicz et al.Detecting and Tracking Political Abuse in Social Media. Proc. 5th International AAAI Conference on Weblogs and Social Facebook education 0.902 0.967 0.875 Media (ICWSM). — North America, 2011. DOI: 10.1.1.646.5073 Facebook hometown 1.000 1.000 1.000 [2] Metaxas P. T., Mustafaraj E. Social Media and the Elections. Science. — 2012. — Vol. 338, No. 6106. — P. 472–473. DOI: VKontakte middle school 0.919 0.910 0.973 10.1126/science.1230456. VKontakte faculty 0.994 0.995 0.996 [3] Khondker H. H. Role of the New Media in the Arab Spring. Globalizations. — 2011. — Vol. 8, no. 5. — P. 675–679. DOI: VKontakte work place 0.994 0.997 0.994 10.1080/14747731.2011.621287. [4] Inferring the Location of Twitter Messages Based on User LiveJournal education 0.975 0.991 0.970 Relationships. / C. A. J. Davis [et al.]. T. GIS. — 2011. — Vol. 15, no. 6. — P. 735–751. DOI: 10.1111/j.1467-9671.2011.01297.x . LiveJournal hometown 0.984 0.977 1.000 [5] Jurgens D. That’s What Friends Are For: Inferring Location in Online Social Media Platforms Based on Social Relationships.. Proceedings of VI. TOLERANCE TO ATTRIBUTE ABSENCE the AAAI International Conference on Weblogs and Social Media. — 2013. Real datasets often lack some information about nodes attributes. So, the methods’ tolerance to node attributes partial [6] Pennacchiotti M., Popescu A.M. Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter. Proceedings of the 17th absence was tested. Note that the experiments were performed ACM SIGKDD International Conference on Knowledge Discovery and on crawled datasets, so they have already have been imperfect. Data Mining. — San Diego, California, USA: ACM, 2011. — P. 430– 438. DOI:10.1145/2020408.2020477. For each graph in each dataset 10, 20 … 90% random [7] Backstrom L., Sun E., Marlow C. Find Me if You Can: Improving attributes values were removed. In other words, each dataset Geographical Prediction with Social and Spatial Proximity. Proceedings had nine copies of it with less and less data. of the 19th International Conference on World Wide Web. — Raleigh, North Carolina, USA: ACM, 2010. — P. 61–70.DOI: The precision of most methods with exception of the 10.1145/1772690.1772698. proposed method and CESNA weights method sufficiently [8] Chaabane A., Acs G., Kaafar M. You Are What You Like! Information drops on Facebook dataset after deleting 60% of node Leakage Through Users’ Interests. Proc. Annual Network and attributes, but recall grows. On Twitter dataset precision of all Distributed System Security Symposium. — San Diego, United States, methods grows but recall drops because the obtained profiles 2012. have less “extra” attributes. After removing up to 80% of [9] Kosinski M., Stillwell D., Graepel T. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the attribute values the proposed method and CESNA weights National Academy of Sciences. — 2013. — Vol. 110, no. 15. — P. method significantly outperform others. For VKontakte and 5802–5805.DOI: 10.1073/pnas.1218772110. LiveJournal datasets results are similar to Twitter dataset. [10] Dougnon R. Y., Fournier-Viger P., Nkambou R. Inferring User Profiles in Online Social Networks Using a Partial Social Graph. Advances in Summing up, the experiments showed two obvious leaders Artificial Intelligence: 28th Canadian Conference on Artificial which preserve high values of precision and recall Intelligence, Canadian AI 2015, Halifax, Nova Scotia, Canada, June 25, outperforming other methods: the proposed method and 2015, Proceedings. — Cham: Springer International Publishing, 2015. CESNA weights method. Both of them are highly tolerant to — P. 84–99.DOI: 10.1007/978-3-319-18356-5_8. attribute partial absence. So, methods based on joint analysis [11] Inferring User Demographics and Social Strategies in Mobile Social of ego-network structure and nodes attributes are better suited Networks / Y. Dong [et al.]. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. — for user profiling than others. New York, New York, USA: ACM, 2014. — P. 15–24.DOI: 10.1145/2623330.2623703. 43 [12] Analiz socialnikh setei: metody i prilozaniya / A.V. Korshunov et al.. [20] Chesnokov V. Overlapping Community Detection in Social Networks Trudi Instatuta sistemnogo programmirovaniya RAN. -2014. – v.26. – with Node Attributes by Neighborhood Influence. Models, Algorithms, issue 1. – P. 357-394. and Technologies for Network Analysis: NET 2016, Nizhny Novgorod, [13] Cheng Z., Caverlee J., Lee K. You Are Where You Tweet: A Russia, May 2016 / ed. by V. A. Kalyagin [et al.]. — Cham: Springer Contentbased Approach to Geolocating Twitter Users. Proceedings of International Publishing, 2017. — P. 187–203. DOI: 10.1007/978-3- the 19th ACM International Conference on Information and Knowledge 319-56829-4_14. Management. — Toronto, ON, Canada: ACM, 2010. — P. 759–768. [21] Chesnokov V. Application of the community allocation algorithm in the DOI: 10.1145/1871437.1871535. information confrontation in the social networks. Voprosy [14] Discriminating gender on Twitter / J. D. Burger [et al.]. Proceedings of kiberbezopasnosti [Cybersecurity issues], 2017, N 1(19), pp. 37-44. the Conference on Empirical Methods in Natural Language Processing. DOI: 10.21681/2311-3456-2017-1-37-44. — Association for Computational Linguistics. 2011. — P. 1301–1309. [22] Leskovec J., Krevl A. SNAP Datasets: Stanford Large Network Dataset [15] Ikawa Y., Enoki M., Tatsubori M. Location inference using microblog Collection. — June 2014. — http://snap.stanford.edu/data. messages.. WWW (Companion Volume). — ACM. - 2012. — P. 687– [23] Clauset A., Newman M. E. J., Moore C. Finding community structure in 690. DOI: 10.1145/2187980.2188181. very large networks. Phys. Rev. E. — 2004. — Vol. 70, issue 6. — P. 1– [16] Towards Social User Profiling: Unified and Discriminative Influence 6. DOI: 10.1103/PhysRevE.70.066111 Model for Inferring Home Locations / R. Li [et al.]. Proceedings of the [24] Rosvall M., Bergstrom C. T. Maps of random walks on complex 18th ACM SIGKDD International Conference on Knowledge Discovery networks reveal community structure. Proceedings of the National and Data Mining. — Beijing, China: ACM, 2012. — P. 1023–1031. Academy of Sciences. — 2008. — Vol. 105, no. 4. — P. 1118–1123. DOI: 10.1145/2339530.2339692. DOI: 10.1073/pnas.0706851105 [17] You Are Who You Know: Inferring User Profiles in Online Social [25] Yang J., Leskovec J. Community Affiliation Graph Model for Networks / A. Mislove [et al.]. Proceedings of the Third ACM Overlapping Network Community Detection. 12th IEEE International International Conference on Web Search and Data Mining. — New Conference on Data Mining, ICDM 2012, Brussels, Belgium, December York, New York, USA: ACM, 2010. — P. 251–260. DOI: 1013, 2012. — 2012. — P. 1170–1175. DOI: 10.1109/ICDM.2012.139 10.1145/1718487.1718519. [26] Yang J., Leskovec J. Overlapping Community Detection at Scale: A [18] Li R., Wang C., Chang K. C.C. User Profiling in an Ego Network: Nonnegative Matrix Factorization Approach. Proceedings of the Sixth Coprofiling Attributes and Relationships. Proceedings of the 23rd ACM International Conference on Web Search and Data Mining. — International Conference on World Wide Web. — Seoul, Korea: ACM, Rome, Italy: ACM, 2013. — P. 587–596. DOI: 2014. — P. 819–830.DOI: 10.1145/2566486.2568045. 10.1145/2433396.2433471 [19] McAuley J., Leskovec J. Discovering Social Circles in Ego Networks. [27] Yang J., McAuley J. J., Leskovec J. Community Detection in Networks ACM Trans. Knowl. Discov. Data. — 2014. — Vol. 8, no. 1. — P. 1– with Node Attributes. 2013 IEEE 13th International Conference on Data 28.DOI: 10.1145/2556612. Mining. – 2013. – P. 1151-1156. DOI: 10.1109/ICDM.2013.167 44