Deanonymizing Users in Social Networking Services:
       an Ego-Network Analysis Approach

                   Vladislav Chesnokov                                                               Peter Klyucharev
            Information Security Department                                                Information Security Department
        Bauman Moscow State Technical University                                       Bauman Moscow State Technical University
                    Moscow, Russia                                                                 Moscow, Russia
               v.o.chesnokov@yandex.ru                                                           pk.iu8@yandex.ru


    Abstract — One of the main problems in social networking              reposting publications they interested in and performing other
services monitoring systems is the incompleteness of analyzed             actions revealing their identity.
data. Anonymous users may participate in information warfare
tampering public opinion. A new method for user profiling in                  Depending on quality and amount of data SNS users
social networking services is proposed. It is based on analysis of        deanonimization methods can be divided into four categories:
user ego-network communities. The core idea of the method is
                                                                              ones that use only publicly available data: user profiles,
that each user has a unique set of attributes and user attributes
are strongly related to his ego-network communities. User profile
                                                                               photos, publications, etc;
attributes can be obtained as the union of attributes relevant to             technical methods (e.g. browser fingerprints);
each community. The method was compared with majority
voting and two community detection based approaches.                          cross-SNS analysis;
Experiments on four datasets from Facebook, Twitter,
LiveJournal and VKontakte social networking services showed                   social engineering.
that the proposed method outperforms others and some user                     Of course, these methods can be combined. Methods from
attributes can be determined with high precision and recall. The          the first category are based on user attributes inferring and
method is tolerant to node attributes partial absence. User               their further analysis. It is quite obvious that a set of attributes
attributes determined by the proposed method combined with
                                                                          like age, sex, location, education, workplace, etc is unique for
additional sources of information may lead to user
                                                                          each person, so having them one can obtain user identity.
deanonimization.
                                                                          Using additional data sources like graduates lists or other user
    Keywords — deanonimization; user profiling; ego-network;              databases one can even get their full name.
social graph; user profile attributes; community detection;
information warfare; social networking service                                                II.     PROBLEM DEFINITION
                                                                             Consider unweighted undirected graph G’(V’, E’) with
                       I.    INTRODUCTION
                                                                          D(G’) = 2 which has       such that
    Social networking service (SNS) monitoring is one of the
key instruments for public opinion analysis, which is used in                                                                     ,       (1)
marketing, politics, science, etc. A good report must contain
statistics about its target audience attributes. However, some            where D(G’) is the diameter of graph G’. Then it is called an
users in SNS may be anonymous, and some people may create                 ego-network for node u, while u is an ego or center. Denote
profiles just to tamper the public opinion regarding some
topic. In recent years there were many examples how SNS can
                                                                                                                        ,
be used to manipulate its users and influence “the real world”
                                                                                                                                (2)
[1, 2, 3]. Revealing and deanonimization such profiles are the
key problems of information warfare.                                         .                 (3)
    SNS allow users to create profiles filled with personal                   Each vertex from V’ has a set of attributes (sometimes
data. Most users publish a huge amount of personal attributes             called features) from set F. I.e. there exists a map f: V → 2F.
such as age, location, education, favorites, photos, etc.                 In practice, only a part of vertex attribute set is observable, so
However, not everyone does that. There are several reasons to             denote it by f’ such that
have empty or scarcely filled profile: privacy issue, laziness,
paranoia and others. Some users may think that fake names
and no profile photo can assure their anonymity. Meanwhile                                                                  .
they can still use SNS to its full extent, communicating with                    (4)
their friends and family, listing social links in their profile,

   This work was supported by RFBR (grant № 16-29-09517).


                                                                     40
The task of profile inferring is to obtain such user profile of         redundant and may lead to low results. More efficient methods
vertex u that                                                           are based on ego-network community detection. For example,
                                                                        in [19] a generative model was used to obtain user circles
                                                            (5)         (local communities). In [18] some attributes of user profile
                                                                        were determined by ego-network communities with 70%
where p* = f(u) and δ – some similarity measure of two sets,
                                                                        accuracy on LinkedIn dataset. One of the main ideas of the
e.g. F-measure.
                                                                        method is to determine a type of links between ego-user and
                                                                        others with respect to vertex attributes. It was shown that this
                     III.   RELATED WORK                                method outperforms methods from [17] and [19] by accuracy
   There are quite lot methods which can be used to infer user          of user profiling and community detection. However, this
profile attributes. Most of them are based on machine learning.         method requires knowledge about attributes nature and
Others can use attribute inferring, majority voting, user               supposes that communities may not intersect.
preferences or community detection.
    The most basic method is the majority voting: the more                               IV.   THE PROPOSED METHOD
neighbors (or “friends”) of vertex has some attribute, the                  It     is          well known that social ties of a person
higher the probability that this vertex has this attribute. This        are not random. Users connected to him can be separated into
method was used to determine the geolocation of Twitter users           several groups formed by some attribute: classmates,
in [4]. It has quite high precision (86%) but low recall (20%),         colleagues, relatives, close friends, etc. Consider ego-network
because only small fraction of users provides their location.           communities. The result of the community detection algorithm
More complex methods include attribute transfer. For                    proposed in [20] is a cover with corresponding label sets
example, in [5] user geolocation was determined by median                         describing them. Suppose that the central vertex
value of its neighbors location. The error of the method was            belongs to all its ego-network communities. So, most likely, it
less than 10 kilometers for half of the users from Twitter              would have all attributes of them. The proposed method is of
dataset. In [6] user preferences were transferred through their         determining user profile is simply a union of attributes sets
social ties and in [7] Facebook user geolocation was                    corresponding to these communities:
determined in the same way.
                                                                                                                                     (6)
    Methods based on user preferences use more complex
                                                                            The main problem is to obtain this correspondence. The
information about a user, which is harder to obtain. In [8]
                                                                        algorithm from [20, 21] can be simplified as we don’t need the
profile attributes were determined by music preferences, and
                                                                        community membership, just labels. First, associate with each
in [9] – by user reactions (“likes”). PGPI method proposed
                                                                        vertex v an empty set of key attributes Kv. Next, start an
in [10] allows determining some attributes of user profile with
                                                                        iterative process of updating sets of key attributes. At each
accuracy higher than 90%. More than that, it needs only
                                                                        iteration all nodes are visited. For each vertex v with neighbors
limited amount of information (number of “facts”) and uses
                                                                        set and each attribute a sets
users attributes, group membership, publications view count
and likes count.
                                                                                                                                     (7)
    The quality of user profiles got by machine learning based
                                                                        and
methods highly depends on quality of training data. In [11] a
comparative analysis of several models was conducted. The
task was to determine age and sex of 7 million telecom                                                                               (8)
provider clients by calls and SMS data. The best model                  are computed. If the sum is                 greater than the
showed 0.85 for sex and 0.72 for age values of F-measure                threshold                , where αa is a fraction defining
with 90% data as training. One of the common approaches is              qualified majority for attribute a, then attribute a is added to
to use user attributes with features extracted from their               the set of key attributes Kv. The process stops when there were
publications [12-15]. Accuracy higher than 89% for age and              no changes at the last iteration. The original community
sex detection can be reached with this method [12]. Machine             detection algorithm has four more steps to determine
learning can be used to combine several approaches. For                 community membership from key attributes sets and filter
example, in [16] features like distance between user favorite           some attributes. However without great loss of accuracy only
publications and profiles were used to determine his                    described steps are required in the simplified algorithm. So,
geolocation. Of course one can build an ensemble of                     the profile can be determined as
classifiers to determine user attributes.
                                                                                                                                    (9)
    The proposed method is based on community detection
approach. In [17] a greedy community detection algorithm                    Of course, not all attributes can be determined reliably
regarding each attribute based on conductivity maximization is          using this method, e.g. sex or last name. So, the attributes
proposed. Given that value of some attribute is known for at            should be filtered before applying the method. Some attributes
least 20% of users, the accuracy for this attribute for other           can be grouped if their nature is known – e.g. education.
users is 80% on Facebook dataset for students and professors            Unfortunately, attribute filtering is quite hard task nowadays,
of two universities [17]. In [18] it was concluded that applying        and may depend on SNS type. Some solutions for that may
community detection algorithm to full social graph is                   include statistical analysis of obtained profiles or machine


                                                                   41
learning. Most frequently, a data scientist may filter them by                         TABLE III.        METHOD EVALUATION RESULTS (WITH ATTRIBUTE
                                                                                                   FILTERING). AVERAGE VALUES OF F-MEASURE
hand on in semi-automatic way.
                                                                                                      Facebook     Twitter    VKontakte   LiveJournal
                       V.      METHOD EVALUATION                                    simple
                                                                                                           0.769      0.642       0.424         0.678
                                                                                    majority
    The method was evaluated on four ego-networks datasets                          ground-truth
                                                                                                           0.759      0.756         n/a              n/a
from Facebook1, Twitter2, VKontakte3 and LiveJournal4. The                          communities
first two datasets are from SNAP dataset [22]. Two other                            Infomap
                                                                                                           0.726      0.641       0.557         0.504
datasets were crawler by author. All datasets characteristics                       communities
                                                                                    modularity
are shown in Table 1. Each graph in all datasets has a profile                      maximization           0.793      0.675       0.578         0.625
(a set of binary attributes) for each vertex, including ego.                        communities
                                                                                    AGM-fit
    Precision, recall and F-measure were used to measure the                        communities
                                                                                                           0.780      0.669       0.624         0.668
quality of obtained profiles. The proposed method was also                          BigCLAM
                                                                                                           0.860      0.720       0.728         0.750
compared with some others: simple majority; majority by                             communities
communities (obtained by one of the five community                                  CESNA
                                                                                                           0.842      0.719       0.683         0.727
detection     algorithms:    modularity    maximization [23],                       communities
                                                                                    CESNA
Infomap [24], AGM-fit [25], BigCLAM [26], CESNA [27] or
                                                                                    attribute              0.466      0.752       0.767         0.874
ground-truth communities); method based on attribute weights                        weights
in communities obtained by CESNA. For each method several                           proposed
                                                                                                           0.841      0.866       0.860         0.908
threshold values were used, and for each graph threshold with                       method
the best value of F-measure was used. For each SNS and
profile detection method quality measures were averaged.
                                                                                       The results for all methods without attributes filtering are
                                                                                   shown in Table I. The values for all methods are quite low. So,
                         TABLE I.          DATASETS                                for each datasets attributes were filtered. They were grouped
                                       avg.          avg.         avg.             by similarity and description. For Facebook dataset attributes
                      number                                                       corresponding to education and hometown were kept (612
         SNS                         number        number      number of
                     of graphs
                                    of vertices    of edges    attributes          total). In Twitter dataset attributes are hashtags and replies, so
     Facebook                 10           417       17 017           228          only most popular hashtags were kept (86 total). For
                                                                                   VKontakte dataset, education and workplace attributes were
     Twitter                 973           138        2 350           665          used (61 361 total). For LiveJournal dataset attributes
     VKontakte              2 000          164        1 561         1 665          corresponding to education and hometown were kept (47 106
                                                                                   total).
     LiveJournal             700           149        1 335         2 368
                                                                                       The results for datasets with filtered attributes are shown in
TABLE II.          METHOD EVALUATION RESULTS (NO ATTRIBUTE FILTERING).             Table III. Average values for F-measure are significantly
                      AVERAGE VALUES OF F-MEASURE                                  higher for all methods. Perhaps they can be improved even
                                                                                   further with another attribute filtering. First of all, the result
                     Facebook       Twitter       VKontakte    LiveJournal
                                                                                   for Twitter, VKontakte and LiveJournal datasets are similar
    simple
    majority
                            0.462       0.155          0.058          0.210        for all methods, but different for Facebook datasets. It can be
    ground-truth                                                                   connected with small size of Facebook dataset or its nature (it
                            0.515       0.190            n/a            n/a        was created by volunteers [22], while others were crawled
    communities
    Infomap
                            0.491       0.183          0.170          0.099
                                                                                   from random profiles).
    communities
    modularity                                                                         Simple majority method showed quite low results for all
    maximization            0.532       0.190          0.154          0.152        datasets except Facebook. Precision and recall tests
    communities                                                                    demonstrated that this method lacks precision. One of the best
    AGM-fit                                                                        results was given by ground-truth communities. Of course,
                            0.516       0.189          0.187          0.209
    communities
    BigCLAM
                                                                                   this method cannot be applied in practice, but high results for
                            0.570       0.234          0.260          0.264        it confirm hypothesis that community structure is strongly
    communities
    CESNA                                                                          connected with profiles attributes, and improvement of
                            0.535       0.222          0.218          0.236
    communities                                                                    community detection algorithms may lead to increase of
    CESNA                                                                          precision and recall of profile attributes detection methods.
    attribute               0.312       0.187          0.225          0.300
    weights                                                                            Methods based on communities obtained by graph
    proposed                                                                       clustering algorithms (modularity maximization and Infomap)
                            0.600       0.274          0.323          0.369
    method                                                                         are not suitable for attributes detection. One of the probable
                                                                                   reasons for that is that communities do not intersect and their
1
  https://facebook.com                                                             union must contain all graph vertices. Similar results for
2
  https://twitter.com                                                              community detection evaluation were obtained in [20], so the
3
  https://vk.com                                                                   better community detection, the more accurate user profiling.
4                                                                                  Community detection algorithms based on affiliation model
  https://livejournal.com


                                                                              42
outperform graph clustering algorithms, and the similar results                                       VII. CONCLUSION
can be declared for user profiling based on communities                         A user profile detection method was proposed. It is based
obtained by these algorithms.                                               on community detection algorithm. The core idea of the
   Second best method is user profiling by attributes weights               method is that community structure of user ego-network is
in communities given by CESNA. However it showed the                        strongly connected with attributes of his profile.
worst result on Facebook dataset.                                              The experiments showed that the proposed method
    The proposed method ties for the best result with CESNA                 outperforms other attribute transfer based methods by F-
communities method for Facebook dataset, and the top results                measure, precision and recall. Some attributes can be
for Twitter, VKontakte and LiveJournal datasets                             determined with precision and recall close to one. The method
outperforming closest competitors by 14.5%, 12% and 4%                      preserves its performance if node attributes are partially
respectively.                                                               absent.
   Considering some chosen groups of attributes (see table                     This method can be used to determine user identity and
IV) the results given by the proposed method by F-measure,                  deanonymize users in social networking services. Method
precision and recall are close to the maximum, 1.                           application could improve public opinion analysis and help to
                                                                            detect and counteract users who tries to tamper it.
  TABLE IV.         F-MEASURE, PRECISION AND RECALL OF THE PROPOSED
                   METHOD FOR CHOSEN ATTRIBUTE GROUPS                                                    References
   Dataset         Attribute group    F-measure   Precision   Recall        [1]  J. Ratkiewicz et al.Detecting and Tracking Political Abuse in Social
                                                                                 Media. Proc. 5th International AAAI Conference on Weblogs and Social
 Facebook                 education       0.902      0.967     0.875             Media (ICWSM). — North America, 2011. DOI: 10.1.1.646.5073
 Facebook                hometown         1.000      1.000     1.000        [2] Metaxas P. T., Mustafaraj E. Social Media and the Elections. Science.
                                                                                 — 2012. — Vol. 338, No. 6106. — P. 472–473. DOI:
 VKontakte            middle school       0.919      0.910     0.973             10.1126/science.1230456.
 VKontakte                  faculty       0.994      0.995     0.996        [3] Khondker H. H. Role of the New Media in the Arab Spring.
                                                                                 Globalizations. — 2011. — Vol. 8, no. 5. — P. 675–679. DOI:
 VKontakte               work place       0.994      0.997     0.994             10.1080/14747731.2011.621287.
                                                                            [4] Inferring the Location of Twitter Messages Based on User
 LiveJournal              education       0.975      0.991     0.970             Relationships. / C. A. J. Davis [et al.]. T. GIS. — 2011. — Vol. 15, no.
                                                                                 6. — P. 735–751. DOI: 10.1111/j.1467-9671.2011.01297.x .
 LiveJournal             hometown         0.984      0.977     1.000
                                                                            [5] Jurgens D. That’s What Friends Are For: Inferring Location in Online
                                                                                 Social Media Platforms Based on Social Relationships.. Proceedings of
             VI.     TOLERANCE TO ATTRIBUTE ABSENCE                              the AAAI International Conference on Weblogs and Social Media. —
                                                                                 2013.
    Real datasets often lack some information about nodes
attributes. So, the methods’ tolerance to node attributes partial           [6] Pennacchiotti M., Popescu A.M. Democrats, Republicans and Starbucks
                                                                                 Afficionados: User Classification in Twitter. Proceedings of the 17th
absence was tested. Note that the experiments were performed                     ACM SIGKDD International Conference on Knowledge Discovery and
on crawled datasets, so they have already have been imperfect.                   Data Mining. — San Diego, California, USA: ACM, 2011. — P. 430–
                                                                                 438. DOI:10.1145/2020408.2020477.
    For each graph in each dataset 10, 20 … 90% random                      [7] Backstrom L., Sun E., Marlow C. Find Me if You Can: Improving
attributes values were removed. In other words, each dataset                     Geographical Prediction with Social and Spatial Proximity. Proceedings
had nine copies of it with less and less data.                                   of the 19th International Conference on World Wide Web. — Raleigh,
                                                                                 North Carolina, USA: ACM, 2010. — P. 61–70.DOI:
    The precision of most methods with exception of the                          10.1145/1772690.1772698.
proposed method and CESNA weights method sufficiently                       [8] Chaabane A., Acs G., Kaafar M. You Are What You Like! Information
drops on Facebook dataset after deleting 60% of node                             Leakage Through Users’ Interests. Proc. Annual Network and
attributes, but recall grows. On Twitter dataset precision of all                Distributed System Security Symposium. — San Diego, United States,
methods grows but recall drops because the obtained profiles                     2012.
have less “extra” attributes. After removing up to 80% of                   [9] Kosinski M., Stillwell D., Graepel T. Private traits and attributes are
                                                                                 predictable from digital records of human behavior. Proceedings of the
attribute values the proposed method and CESNA weights                           National Academy of Sciences. — 2013. — Vol. 110, no. 15. — P.
method significantly outperform others. For VKontakte and                        5802–5805.DOI: 10.1073/pnas.1218772110.
LiveJournal datasets results are similar to Twitter dataset.                [10] Dougnon R. Y., Fournier-Viger P., Nkambou R. Inferring User Profiles
                                                                                 in Online Social Networks Using a Partial Social Graph. Advances in
    Summing up, the experiments showed two obvious leaders                       Artificial Intelligence: 28th Canadian Conference on Artificial
which preserve high values of precision and recall                               Intelligence, Canadian AI 2015, Halifax, Nova Scotia, Canada, June 25,
outperforming other methods: the proposed method and                             2015, Proceedings. — Cham: Springer International Publishing, 2015.
CESNA weights method. Both of them are highly tolerant to                        — P. 84–99.DOI: 10.1007/978-3-319-18356-5_8.
attribute partial absence. So, methods based on joint analysis              [11] Inferring User Demographics and Social Strategies in Mobile Social
of ego-network structure and nodes attributes are better suited                  Networks / Y. Dong [et al.]. Proceedings of the 20th ACM SIGKDD
                                                                                 International Conference on Knowledge Discovery and Data Mining. —
for user profiling than others.                                                  New York, New York, USA: ACM, 2014. — P. 15–24.DOI:
                                                                                 10.1145/2623330.2623703.


                                                                       43
[12] Analiz socialnikh setei: metody i prilozaniya / A.V. Korshunov et al..          [20] Chesnokov V. Overlapping Community Detection in Social Networks
     Trudi Instatuta sistemnogo programmirovaniya RAN. -2014. – v.26. –                   with Node Attributes by Neighborhood Influence. Models, Algorithms,
     issue 1. – P. 357-394.                                                               and Technologies for Network Analysis: NET 2016, Nizhny Novgorod,
[13] Cheng Z., Caverlee J., Lee K. You Are Where You Tweet: A                             Russia, May 2016 / ed. by V. A. Kalyagin [et al.]. — Cham: Springer
     Contentbased Approach to Geolocating Twitter Users. Proceedings of                   International Publishing, 2017. — P. 187–203. DOI: 10.1007/978-3-
     the 19th ACM International Conference on Information and Knowledge                   319-56829-4_14.
     Management. — Toronto, ON, Canada: ACM, 2010. — P. 759–768.                     [21] Chesnokov V. Application of the community allocation algorithm in the
     DOI: 10.1145/1871437.1871535.                                                        information confrontation in the social networks. Voprosy
[14] Discriminating gender on Twitter / J. D. Burger [et al.]. Proceedings of             kiberbezopasnosti [Cybersecurity issues], 2017, N 1(19), pp. 37-44.
     the Conference on Empirical Methods in Natural Language Processing.                  DOI: 10.21681/2311-3456-2017-1-37-44.
     — Association for Computational Linguistics. 2011. — P. 1301–1309.              [22] Leskovec J., Krevl A. SNAP Datasets: Stanford Large Network Dataset
[15] Ikawa Y., Enoki M., Tatsubori M. Location inference using microblog                  Collection. — June 2014. — http://snap.stanford.edu/data.
     messages.. WWW (Companion Volume). — ACM. - 2012. — P. 687–                     [23] Clauset A., Newman M. E. J., Moore C. Finding community structure in
     690. DOI: 10.1145/2187980.2188181.                                                   very large networks. Phys. Rev. E. — 2004. — Vol. 70, issue 6. — P. 1–
[16] Towards Social User Profiling: Unified and Discriminative Influence                  6. DOI: 10.1103/PhysRevE.70.066111
     Model for Inferring Home Locations / R. Li [et al.]. Proceedings of the         [24] Rosvall M., Bergstrom C. T. Maps of random walks on complex
     18th ACM SIGKDD International Conference on Knowledge Discovery                      networks reveal community structure. Proceedings of the National
     and Data Mining. — Beijing, China: ACM, 2012. — P. 1023–1031.                        Academy of Sciences. — 2008. — Vol. 105, no. 4. — P. 1118–1123.
     DOI: 10.1145/2339530.2339692.                                                        DOI: 10.1073/pnas.0706851105
[17] You Are Who You Know: Inferring User Profiles in Online Social                  [25] Yang J., Leskovec J. Community Affiliation Graph Model for
     Networks / A. Mislove [et al.]. Proceedings of the Third ACM                         Overlapping Network Community Detection. 12th IEEE International
     International Conference on Web Search and Data Mining. — New                        Conference on Data Mining, ICDM 2012, Brussels, Belgium, December
     York, New York, USA: ACM, 2010. — P. 251–260. DOI:                                   1013, 2012. — 2012. — P. 1170–1175. DOI: 10.1109/ICDM.2012.139
     10.1145/1718487.1718519.                                                        [26] Yang J., Leskovec J. Overlapping Community Detection at Scale: A
[18] Li R., Wang C., Chang K. C.C. User Profiling in an Ego Network:                      Nonnegative Matrix Factorization Approach. Proceedings of the Sixth
     Coprofiling Attributes and Relationships. Proceedings of the 23rd                    ACM International Conference on Web Search and Data Mining. —
     International Conference on World Wide Web. — Seoul, Korea: ACM,                     Rome,      Italy:   ACM,      2013.     —     P.     587–596.    DOI:
     2014. — P. 819–830.DOI: 10.1145/2566486.2568045.                                     10.1145/2433396.2433471
[19] McAuley J., Leskovec J. Discovering Social Circles in Ego Networks.             [27] Yang J., McAuley J. J., Leskovec J. Community Detection in Networks
     ACM Trans. Knowl. Discov. Data. — 2014. — Vol. 8, no. 1. — P. 1–                     with Node Attributes. 2013 IEEE 13th International Conference on Data
     28.DOI: 10.1145/2556612.                                                             Mining. – 2013. – P. 1151-1156. DOI: 10.1109/ICDM.2013.167


                                                                                44