=Paper= {{Paper |id=Vol-2667/paper22 |storemode=property |title=Analysis of open data of a social network in order to identify deviant communities |pdfUrl=https://ceur-ws.org/Vol-2667/paper22.pdf |volume=Vol-2667 |authors=Rostislav Mikherskii,Dmitry Kuznetsov }} ==Analysis of open data of a social network in order to identify deviant communities == https://ceur-ws.org/Vol-2667/paper22.pdf
Analysis of open data of a social network in order to
           identify deviant communities
                      Rostislav Mikherskii                                                           Dmitry Kuznetsov
                    Physico-technical Institute                                                   Physico-technical Institute
           V.I. Vernadsky Crimean Federal University                                     V.I. Vernadsky Crimean Federal University
                       Simferopol, Russia                                                            Simferopol, Russia
                        mrm03@mail.ru                                                          dimabrayankuznetsov@mail.ru

    Abstract—A system for analyzing open data of the social network Vkontakte is developed and implemented in software. Two ways of identifying deviant communities are proposed. The first is by the number of community subscribers blocked by the social network for violating the rules. The second is by the presence of subscribers shared between the studied community and a community that is known for certain to be deviant. It is experimentally established that the second method of identifying deviant communities gives the better result.

    Keywords—big data, open data, social network

                        I. INTRODUCTION

    Analysis of open data from social networks is a significant area in the field of big data processing. In particular, an important task for both law enforcement agencies and social network administrators is to identify communities in these networks that disseminate socially dangerous content. Many recent works have been devoted to this problem. The work [1] is devoted to the development of a method for assessing the degree of connectedness of social network user profiles based on open data. The degree of connectedness of user profiles is understood as the probability that the profile owners meet in real life. In [2], methods that detect the demographic attributes of a user from their profile and messages are reviewed. In [3,4], forms of deviant behavior of users of the Russian-language segment of the Internet are examined in detail. In particular, it was shown in [4] that the main reasons for deviant behavior in social networks are virtuality and anonymity. In [5], the main methods of social network analysis, as applied to the task of identifying suspicious and criminal communities, are reviewed on the basis of foreign sources.

    To study social networks in terms of social relationships, the Social Network Analysis (SNA) method is often used. The SNA method is described in detail in [6–8]. In this method, the objects of research are nodes and the relationships between them. Nodes can be communities, users of social networks, etc. The connections between these nodes can be money transfers, communication, friendship, etc. This method has been successfully used to study the organizational structure of the Al-Qaida terrorist network [9], to study the network of terrorist organizations operating in India [10], and to analyze the topological structure of criminal networks, in particular a methamphetamine trafficking network [11]. These studies are mainly motivated by the need to find effective methods to undermine criminal or terrorist organizations.

    Anorexia-oriented online communities have been studied in [12–14]. A wide range of issues was studied in these works, including the construction and management of member identities, the processes of social recognition, the emergence of group norms, and the use of linguistic markers. Similar studies have been conducted for groups promoting suicidal behavior [15].

    In [16-20], the categorization of pornographic content and the frequency of its use were studied.

    In [21], the authors also focused on consumption networks for adult content, which is present in many online social networks and on the Internet as a whole. The authors of this work investigated how such communities interact with the entire social network. They found that a few small and closely connected communities are responsible for much of the content production. The produced content is distributed through the rest of the network mainly directly or through bridge communities, reaching at least 450 times more users. In this work, a demographic analysis of the networks of producers and consumers of adult content was also carried out. It has been shown that it is possible to easily identify several key users in order to eradicate the distribution of pornographic content.

    The issue of community polarization in social networks was studied in detail in [22]. It proposed a new polarization metric based on the analysis of the boundary of a pair of (potentially polarized) communities, which better reflects the concepts of antagonism and polarization.

    Cyber aggression, as a form of deviant behavior in the Internet environment, was studied in detail in [23-28]. This socio-psychological phenomenon has many forms, the main ones being trolling, cybermobbing and astroturfing.

    As can be seen from the above review of published scientific papers, the search for deviant communities is an important task both for scientists who study such communities and for law enforcement agencies. Unfortunately, deviant communities are most often identified manually, often only on the basis of user complaints.

    The aim of this work was to develop a methodology for automatically identifying deviant communities in the social network Vkontakte. To achieve this goal, two options for searching for such communities have been proposed.

                        II. RESULTS

    In the first option, the following algorithm for searching for such communities is proposed and implemented in software. For the studied community, the number l of subscribers blocked by the social network for breaking the rules and the total number L of subscribers of this community are determined. The coefficient k = l/L is found. It is assumed that if the coefficient k is greater than some critical value k_d, then the community under study is deviant.

    The above algorithm was implemented in the Python programming language. During


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

the implementation of this program, 50,704 communities of the Vkontakte social network were randomly selected. To reduce the influence of statistical error, only communities with 100 or more subscribers were selected from the general list. Owing to system and API limitations, only communities with relatively few members were considered. The coefficient k was calculated for each of these communities. All communities were then sorted in descending order of this coefficient. Table I presents the first 20 communities from the list.

  TABLE I.  COMMUNITIES WITH A HIGH PERCENTAGE OF BLOCKED SUBSCRIBERS

  №    Community        Number of     Number of     Percentage of blocked
       identification   subscribers   blocked       subscribers from the total
       number           in the        subscribers   number of community
                        community                   subscribers, k·100%
   1   172017411        104           101           97.1154
   2   171896750        122           114           93.4426
   3   41398959         107           98            91.5888
   4   125043269        1017          904           88.8889
   5   19613748         960           852           88.75
   6   176328754        226           193           85.3982
   7   148023353        495           419           84.6465
   8   188941498        530           438           82.6415
   9   23811356         1116          921           82.5269
  10   164252296        152           123           80.9211
  11   150230769        198           157           79.2929
  12   130381011        200           157           78.5
  13   154988787        410           317           77.3171
  14   155397881        847           654           77.2137
  15   149830913        107           81            75.7009
  16   170030633        577           428           74.1768
  17   ***              174           129           74.1379
  18   164288533        153           113           73.8562
  19   143657800        424           312           73.5849
  20   157513161        420           309           73.5714

    In order to avoid promoting deviant communities, the identification numbers of all such communities are replaced by the symbols “***” in this table and in Table II.

    As can be seen from this list, it contains only one deviant community (No. 17). This community was classified as deviant because it contains pornographic material.

    Thus, the hypothesis that the percentage of blocked users in deviant communities is greater than in non-deviant communities has not been experimentally confirmed. Furthermore, it is clear that some communities are abandoned and can contain many banned users because of a lack of moderation and of new subscribers. Other communities can be related to advertising or temporary events. They are still not deviant, even though Vkontakte has special rules that restrict the creation of such communities, as it does for communities with inappropriate content.

    The second option for searching for deviant communities is based on the following algorithm. One community is found that is known for certain to be deviant. For this community, the list of subscribers is obtained. For each of these subscribers, the communities to which they are subscribed are determined. For each of the communities in the resulting list, the number of subscribers who are also subscribers of the studied deviant community is determined. It is assumed that a sufficiently large number of communities from this list will also be deviant. This algorithm was implemented in the Python programming language.

    To test the performance of this program, the deviant community “Mom Anarchy”, with identification number 177615404, was chosen. This community is engaged in popularizing the ideas of anarchism and has 32097 subscribers. The data processing time was 18 hours. Followers of this community are also subscribed to 940512 other communities. All of them were sorted in descending order by the number of users who are also subscribed to the Mama Anarchy community. Table II presents the first 20 communities from this list.

  TABLE II.  COMMUNITIES WHOSE SUBSCRIBERS ARE ALSO SUBSCRIBERS OF THE MOM ANARCHY COMMUNITY

  №    Community        Number of     Number of subscribers who
       identification   subscribers   are also subscribers of the
       number           in the        Mama Anarchy community
                        community
   1   ***              5539982       15035
   2   91050183         9356399       12924
   3   ***              707327        12712
   4   159146575        1162785       12521
   5   ***              563784        11987
   6   ***              4403183       11644
   7   ***              2768306       11317
   8   ***              2508543       11246
   9   57846937         11275065      11224*
  10   ***              2684988       11154
  11   ***              2586853       11145
  12   150550417        937052        10916
  13   149094324        2076903       10832*
  14   30316056         1809325       10451
  15   66678575         4976245       10299
  16   12353330         3555825       10167
  17   154168174        1264550       10145
  18   173556111        641480        10005
  19   ***              3802683       9576
  20   133180305        3116645       9540

    As can be seen from this table, 9 of the 20 communities in the presented list are deviant. The main reasons these communities are classified as deviant are propaganda of violence, criticism of the existing constitutional system, and the use of profanity. These results show that the output of the algorithm should be treated not as a final prediction but as an assumption. The results must still be reviewed by a specialist in order to draw a conclusion about community content. For now, the main aim of this algorithm is to narrow the search for deviant communities.

    Furthermore, this algorithm makes it possible to consider communities with a larger number of subscribers. However, it should be mentioned that the API has a strong impact on the performance of the algorithm. Such systems therefore have portability limitations. Nevertheless, the core idea of this system is to show dependencies between the number of blocked users and community content.

    The scientific novelty of this work lies in the proposed algorithm, which helps to identify deviant communities. Although the current algorithm can only help us to



VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                     99

make a suggestion about community content, there could be ways to improve it by using additional algorithms and tools, such as image recognition tools and text analyzers. A holistic recognition system could therefore be developed to make more accurate predictions about deviant communities in social networks with an open API.

                         III. CONCLUSION

    Thus, the second way of identifying deviant communities is much more effective than the first. This technique for automatically identifying deviant communities can be applied not only to the social network Vkontakte but also to other social networks. We also note that the second method can be applied not only to search for deviant communities, but also to search for communities related to the studied community, for example, in marketing research. In such studies, it is possible to determine the interests of community users and, accordingly, to build a policy for attracting new users to this community.

    Another possible use of this method is to conduct an advertising campaign for a certain community. In this case, the community to be advertised is chosen as the studied community. A list of communities associated with it is then determined, and advertising messages are placed in these communities.

    It should also be noted that machine learning methods, such as artificial immune systems [29-31] or convolutional neural networks [32-41], may be useful in the search for deviant communities. However, even when machine learning is used, the method of identifying deviant communities by the presence of subscribers shared between the studied community and a community that is known for certain to be deviant will not lose its relevance. This is primarily because the results of this method are highly transparent to interpret, in contrast to machine learning methods, which are often a black box whose results are hard to explain.

    Thus, in this study, a new method is proposed that makes it possible to search for deviant communities quickly, cheaply and efficiently.

                         ACKNOWLEDGMENT

    In conclusion, we would like to thank Marina Vsevolodovna Glumova, Director of the Physico-technical Institute of the V. I. Vernadsky Crimean Federal University, and Victor Vasilyevich Milyukov, head of the Department of computer engineering and modeling of the Physico-technical Institute of the V. I. Vernadsky Crimean Federal University, for their assistance in organizing research.

                             REFERENCES
[1] V. Kataeva, I. Pantyukhin and I. Yurin, “Methods for assessing the degree of connectivity of social network user profiles based on open data,” Open education, vol. 21, no. 6, pp. 14-22, 2017.
[2] A. Gomzin and S. Kuznetsov, “Methods for constructing socio-demographic profiles of Internet users,” Proceedings of the ISP RAS, vol. 27, no. 4, pp. 129-142, 2015.
[3] A. Baklantseva, “Transformation of social norms and deviations in the Russian-language Internet,” News of universities in the North Caucasus region. Social sciences, vol. 3, pp. 21-25, 2014.
[4] D. Cherenkov, “Deviant behavior in social networks: causes, forms, consequence,” Nauka-Rastudent.Ru, vol. 7, no. 19, pp. 29, 2015.
[5] M. Basarab, I. Ivanov, A. Kolesnikov and V. Matveev, “Detection of illegal activities in cyberspace based on the analysis of social networks: algorithms, methods and tools (review),” Cybersecurity issues, vol. 4, no. 17, pp. 11-19, 2016.
[6] L.C. Freeman, “The development of social network analysis: A study in the sociology of science,” Social Networks, vol. 27, no. 4, pp. 377-384, 2005.
[7] A. Hopkins, “Graph theory, social networks and counter terrorism,” Univ. of Massachusetts Dartmouth, pp. 22, 2010.
[8] L.C. Freeman, “Centrality in social networks conceptual clarification,” Social networks, vol. 1, no. 3, pp. 215-239, 1979.
[9] V. Krebs, “Mapping networks of terrorist cells,” Connections, vol. 24, no. 3, pp. 43-52, 2002.
[10] P. Choudhary and U. Singh, “A survey on social network analysis for counter-terrorism,” Int. Journal of Computer Applications, vol. 112, no. 9, pp. 24-29, 2015.
[11] J. Xu, H. Chen, “The topology of dark networks,” Communications of the ACM, vol. 51, no. 10, pp. 58-65, 2008.
[12] J. Gavin, K. Rodham and H. Poyer, “The presentation of “pro-anorexia” in online group interactions,” Qualitative Health Research, vol. 18, no. 3, pp. 325-333, 2008.
[13] J.D.S. Ramos, A.D.F PereiraNeto and M. Bagrichevsky, “Pro-anorexia cultural identity: characteristics of a lifestyle in a virtual community,” Interface (Botucatu), vol. 15, no. 37, pp. 447-460, 2011.
[14] N. Boero and C.J. Pascoe, “Pro-anorexia communities and online interaction: Bringing the pro-ana body online,” Body & Society, vol. 18, no. 2, pp. 27-57, 2012.
[15] S.M. Haas, M.E. Irr, N.A. Jennings and L.M. Wagner, “Communicating thin: A grounded model of Online Negative Enabling Support Groups in the pro-anorexia movement,” New Media & Society, vol. 13, no. 1, pp. 40-57, 2010.
[16] M. Schuhmacher, C. Zirn and J. Volker, “Exploring youporn categories, tags, and nicknames for pleasant recommendations,” Workshop on Search and Exploration of X-Rated Information. ACM, pp. 27-28, 2013.
[17] G. Tyson, Y. Elkhatib, N. Sastry and S. Uhlig, “Are People Really Social in Porn 2.0?” Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM), pp. 436-444, 2015.
[18] G.M. Hald and A. Stulhofer, “What types of pornography do people use and do they cluster? Assessing types and categories of pornography consumption in a large-scale online sample,” Journal of Sex Research, pp. 1-11, 2015.
[19] G.M. Hald, N.N. Malamuth, T. Lange, “Pornography and sexist attitudes among heterosexuals,” Journal of Communication, vol. 63, no. 4, pp. 638-660, 2013.
[20] G.M. Hald, “Gender differences in pornography consumption among young heterosexual danish adults,” Archives of sexual behavior, vol. 35, no. 5, pp. 577-585, 2006.
[21] M. Coletto, L.M. Aiello, C. Lucchese and F. Silvestri, “On the Behaviour of Deviant Communities in Online Social Networks,” Proceedings of the 10th International AAAI Conference on Web and Social Media (ICWSM), pp. 72-81, 2016.
[22] P.H.C. Guerra, Jr.W. Meira, C. Cardie, R. Kleinberg, “A measure of polarization on social media networks based on community boundaries,” Proceedings of the 7th International AAAI Conference on Web and Social Media (ICWSM), pp. 215-224, 2013.
[23] J.S. Chibbaro, “School counselors and the cyberbully: interventions and implications,” Journal of Professional School Counseling, vol. 11, no. 1, pp. 65-68, 2007.
[24] R. Gable, J. Snakenborg and R. Van Acker, “Cyberbullying: Prevention and Intervention to Protect Our Children and Youth,” Preventing School Failure, vol. 55, no. 2, pp. 88-95, 2011.
[25] W. Heirman and M. Walrave, “Cyberbullying: Predicting Victimisation and Perpetration,” Children & Society, vol. 25, pp. 59-72, 2011.
[26] J.S. Donath, “Identity and Deception in the Virtual Community,” Communities in Cyberspace, London: Routledge, pp. 26, 1999.
[27] N.E. Willard, “From Cyberbullying and Cyberthreats: Responding to the Challenge of Online Social Aggression, Threats, and Distress,” Champaign, IL: Research Press, pp. 303, 2007.
[28] R.A. Vnebrachnykh, “Trolling as a form of social aggression in virtual communities,” Bulletin of the Udmurt University. Philosophy. Sociology. Psychology. Pedagogy, vol. 1, pp. 48-51, 2012.




[29] R. Mikherskii, “Application of an artificial immune system for visual pattern recognition,” Computer Optics, vol. 42, no. 1, pp. 113-117, 2018. DOI: 10.18287/2412-6179-2018-42-1-113-117.
[30] G. Luh, “Face recognition based on artificial immune networks and principal component analysis with single training image per person,” Immune Computation, vol. 2, no. 1, pp. 21-34, 2014.
[31] D. Dasgupta, S. Yu and F. Nino, “Recent advances in artificial immune systems: Models and applications,” Applied Soft Computing, vol. 11, no. 2, pp. 1574-1587, 2011. DOI: 10.1016/j.asoc.2010.08.024.
[32] Y. Li, X. Zhang and D. Chen, “CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1091-1100, 2018.
[33] M. Kalayeh and M. Shah, “Training Faster by Separating Modes of Variation in Batch-Normalized Models,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 42, no. 6, pp. 1483-1500, 2020. DOI: 10.1109/TPAMI.2019.2895781R.
[34] A. Farrugia and C. Guillemot, “Light Field Super-Resolution Using a Low-Rank Prior and Deep Convolutional Neural Networks,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 42, no. 05, pp. 1162-1175, 2020. DOI: 10.1109/TPAMI.2019.2893666.
[35] Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway and J. Liang, “Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7340-7349, 2017.
[36] C. Lian, M. Liu, J. Zhang and D. Shen, “Hierarchical Fully Convolutional Network for Joint Atrophy Localization and Alzheimer's Disease Diagnosis Using Structural MRI,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 42, no. 04, pp. 880-893, 2020. DOI: 10.1109/TPAMI.2019.2895781.
[37] A. Bulat and G. Tzimiropoulos, “Hierarchical Binary CNNs for Landmark Localization with Limited Resources,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 42, no. 02, pp. 343-356, 2020. DOI: 10.1109/TPAMI.2018.2866051.
[38] V.A. Sindagi and V.M. Patel, “A survey of recent advances in CNN-based single image crowd counting and density estimation,” Pattern Recognit. Lett., vol. 107, pp. 3-16, 2017.
[39] K. He, G. Gkioxari, P. Dollar and R. Girshick, “Mask R-CNN,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 42, no. 02, pp. 386-397, 2020. DOI: 10.1109/TPAMI.2018.2844175.
[40] S. Lin, R. Ji, C. Chen, D. Tao and J. Luo, “Holistic CNN Compression via Low-Rank Decomposition with Knowledge Transfer,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 41, no. 12, pp. 2889-2905, 2019. DOI: 10.1109/TPAMI.2018.2873305.
[41] I. Rocco, R. Arandjelovic and J. Sivic, “Convolutional Neural Network Architecture for Geometric Matching,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 41, no. 11, pp. 2553-2567, 2019. DOI: 10.1109/TPAMI.2018.2865351.
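
    The first identification procedure of Section II (the coefficient k = l/L, the cut-off of 100 subscribers, and the descending sort) can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the Community class, the function names, and the threshold value 0.9 are assumptions made for the example, and the paper does not state a value for the critical coefficient. Three of the sample rows are taken from Table I; the fourth is invented to show the size filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Community:
    community_id: str   # kept as a string so that masked ids such as "***" also fit
    subscribers: int    # total number of subscribers, L
    blocked: int        # subscribers blocked by the social network, l


def blocked_ratio(c: Community) -> float:
    """Return k = l / L for one community."""
    return c.blocked / c.subscribers


def rank_by_blocked_ratio(communities, min_subscribers=100):
    """Drop communities below the size cut-off and sort the rest by k, largest first."""
    eligible = [c for c in communities if c.subscribers >= min_subscribers]
    return sorted(eligible, key=blocked_ratio, reverse=True)


def flag_candidates(communities, k_d, min_subscribers=100):
    """Return the communities whose k exceeds the critical value k_d."""
    return [c for c in rank_by_blocked_ratio(communities, min_subscribers)
            if blocked_ratio(c) > k_d]


# Three rows from Table I plus one invented community that is too small to count.
sample = [
    Community("130381011", 200, 157),
    Community("172017411", 104, 101),
    Community("171896750", 122, 114),
    Community("hypothetical-small", 50, 49),
]

ranked = rank_by_blocked_ratio(sample)
for c in ranked:
    print(c.community_id, f"{100 * blocked_ratio(c):.4f}%")

# 0.9 is an arbitrary illustrative threshold, not a value from the paper.
candidates = flag_candidates(sample, k_d=0.9)
```

    As Section II reports, a high k alone turned out to be a weak signal (only one of the top 20 communities was deviant), so any threshold chosen this way yields candidates for manual review rather than a verdict.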




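
    The second identification procedure of Section II (counting, for every community, how many of its subscribers also follow a known deviant community) can be sketched as below. This is a hedged sketch under stated assumptions: members_of and subscriptions_of are hypothetical callables that stand in for whatever API requests return a community's subscribers and a user's subscriptions, and the toy data are invented. It is not the authors' implementation.

```python
from collections import Counter


def rank_by_shared_subscribers(seed_id, members_of, subscriptions_of):
    """Rank communities by the number of subscribers they share with the seed.

    members_of(community_id) and subscriptions_of(user_id) are hypothetical
    callables returning iterables of user ids and community ids respectively.
    """
    overlap = Counter()
    for user in set(members_of(seed_id)):
        # set() guards against a community being listed twice for one user
        for community in set(subscriptions_of(user)):
            if community != seed_id:
                overlap[community] += 1
    # Largest overlap first; ties are broken by id so the order is deterministic.
    return sorted(overlap.items(), key=lambda item: (-item[1], item[0]))


# Invented in-memory data standing in for the social network.
toy_members = {"seed": ["u1", "u2", "u3"]}
toy_subscriptions = {
    "u1": ["seed", "A", "B"],
    "u2": ["seed", "A"],
    "u3": ["seed", "A", "C"],
}

ranking = rank_by_shared_subscribers(
    "seed",
    members_of=lambda cid: toy_members[cid],
    subscriptions_of=lambda uid: toy_subscriptions[uid],
)
print(ranking)
```

    The head of such a ranking corresponds to Table II, where 9 of the first 20 communities were found to be deviant, compared with 1 of 20 in Table I. The cost is one request per subscriber of the seed community, which is consistent with the reported 18 hours of processing for 32097 subscribers.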