Hidden Communities in the Russian Social Network Corpus: a Comparative Study of Detection Methods Ivan Mamaev1[0000-0003-3362-9131] and Olga Mitrofanova1[0000-0002-3008-5514] 1 Saint Petersburg State University, Universitetskaya emb. 11, Saint Petersburg, Russia mamaev_96@mail.ru, o.mitrofanova@spbu.ru Abstract. The paper presents a comparative study of different methods that help to detect hidden communities within social networks. The tested approach- es were divided into three main groups: a graph-based method, a clustering method, and a hybrid method. The experiments were conducted on the Russian corpus of posts from VKontakte social network. We discuss advantages and disadvantages of all the methods, and predict the ways of their improving. Keywords: Social Networks, Corpus Linguistics, Jaccard Index, Cluster analy- sis, Topic Modeling, Automatic Topic Labeling 1 Introduction In recent decades, social networks have been actively developing through the perspec- tive of network analysis: sociologists study relationships between users, linguists study texts on the Internet, etc. Despite significant progress in investigating social networks, one of the existing gaps is the detection of hidden communities. A hidden community can be defined as a set of users that have a lot of implicit connections. Unlike “friends” on social networks or members of a real community, users of hidden communities may not know each other, they may live in different regions of the same country, but their interests coincide. These interests may be reflected in users’ posts. Using various similarity metrics, one can unite posts in a common semantic group, they forming a semantic network, which conventionally looks as follows (Figure 1). Fig. 1. An example of a semantic network There are a great number of algorithms for analyzing the structure of social network communities. However, one of the problems is the impossibility of determining the Copyright © 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 2 effectiveness, reproducibility, flexibility, and universality of a method because schol- ars often carry out experiments involving a single dataset and a single method. In our opinion, the most productive way is to compare several methods using one dataset. A qualitative and quantitative comparison of methods and their results can reveal the best approach that is applicable to the posts of social networks. In our study, we will analyze several algorithms, identify the structural properties of the obtained commu- nities, and also focus on their advantages and disadvantages. 2 Related works Contemporary algorithms for detecting hidden communities in social networks can be divided into three main groups:  methods based on graphs;  methods based on clustering;  hybrid methods. The first papers dedicated to detecting hidden communities on the Internet appeared in the early 2000s. The papers [2, 8] describe the use of the hidden Markov model and a random graph to detect communities, the accuracy of the results reached 90%. It should be noted that the approach did not include semantic information about target users which could have improved the accuracy of the method. In [6], a two-stage algorithm HICODE (Hidden Community Detection) was devel- oped. The first stage consists in finding the number of community layers. The first layer is a group with the strongest links, each subsequent layer has a lower degree of connections. The second stage is called the refinement stage. Its idea is in improving the quality of the layers. In this case one can obtain complete data because a stronger community structure can distort information about the structure of weaker ones. The HICODE algorithm has also been applied to hidden communities on Reddit [13]. The authors of the paper concluded that the boundaries of communities are implicit for several reasons: users of forums communicate a lot with each other, the structure of the site itself is not regulated, the constant expansion of topics on the forums leads to adjustments in the results. With the growth of machine learning algorithms, social media researchers have be- gun to use cluster analysis more often. In [3, 7, 10, 12], clustering results are dis- cussed as regards both English and Russian data. In particular, [7] provides an analy- sis of the relations of Kinopoisk users with topic communities in VKontakte social network. The authors used the silhouette metric to evaluate clustering performance, results showed that k-means was the most efficient algorithm. The next step in the development of methods for detecting hidden communities was a combination of various approaches. [1] describes a hybrid method: the edges of the graph are compared based on the Jaccard index; they are united into hidden com- munity clusters. In [14], a clustering experiment based on the semantic similarity of English Twitter posts was conducted. To identify the degree of text similarity, Word- Net was used. Computational linguistics methods were also applied in [9]: researchers 3 described the process of detecting communities in experiments with topic modeling and automatic topic labeling. 3 Experiments 3.1 Corpus collection All of the above methods have been tested on different datasets. In this study, to com- pare existing algorithms for detecting hidden communities in social networks, we used the corpus of Russian posts (8 679 402 tokens) on the VKontakte social network proposed in [9]. The choice of this corpus is connected with the fact that algorithms for detecting hidden communities based on Russian data are not fully described. The following algorithms were selected for comparison: The Jaccard index (graph method), clustering using doc2vec (clustering method), topic modeling and automatic topic labeling (hybrid method). There is a brief description of procedures for creating models of hidden communities. 3.2 Graph-based method The corpus of posts consists of more than 25 000 users’ texts. For the convenience of analysis, we presented them in the form VK_USER_X_POST_Y, where X is a user id, Y is an ordinal number of the post. Using Python 3.7, a script was created to calcu- late the values of the Jaccard index. We considered it necessary to set the minimum similarity threshold to 0.75, i.e. ¾ of the content of the posts should match. Pairs of users that met these requirements were written to a .csv file. Using the Gephi applica- tion, we built a resultant graph that shows the state of hidden communities. 3.3 Clustering analysis There are a lot of cluster analysis methods: k-means, DBSCAN, hierarchical method, etc. In our research, to deal with a large amount of data, we will take k-means from the scikit-learn library1 due to a rather high degree of work and visualization of the implementation. To improve the quality of the obtained models, we will use a pre- trained doc2vec model of the corpus with the following parameters: the size of the context window – 5, the dimension of the vector – 100. 3.4 Hybrid method We took the method described in [9]. It is necessary to search for the optimal number of user topics, build topic models, and generate topic labels. In contrast to the previ- ous algorithms, in topic modeling it is necessary to implement the basic NLP proce- dures: tokenization, lemmatization, processing with the help of a stop-list, adding 1 https://scikit-learn.org/stable/ 4 bigrams and trigrams. Topic labeling is used to improve the interpretability of topics, the procedures are carried with the help of word2vec, RuWordNet2, the Russian Na- tional Corpus3 and a Frequency Dictionary of Contemporary Russian 4 by O.N. Lyash- evskaya and S.A. Sharov. The Gephi application was also used to visualize the graph. 4 Results and Evaluation 4.1 Models of hidden communities There are three models of hidden communities created with the help of different algo- rithms (Figures 2–4). Fig. 2. Hidden communities created with the help of the graph method Fig. 3. Hidden communities created with the help of the cluster analysis 2 https://ruwordnet.ru/ru 3 https://ruscorpora.ru/new/index.html 4 http://dict.ruslang.ru/freq.php 5 Fig. 4. Hidden communities created with the help of the hybrid method 4.2 Graph method Using the Jaccard index, 344 pairs of semantically close posts were detected, they being grouped into 34 communities. Figure 2 shows a fragmentary structure of the relationships that exist in the corpus. The first obvious result of graphical analysis is the ability to select several subsets: a giant component (a great number of central nodes of the graph that are interconnected) and stand-alone nodes. The presence of a large number of intersections indicates close relationships within a given semantic field. It should be noted that the names of topical hidden communities are not displayed when they are formed, as a result, further analysis is carried out manually. For in- stance, users 696481, 39371, and 2500528 are members of the central hidden com- munity. After analyzing the posts, it was found that, despite the difference in topics, they all have one common topic – online conferences. When carrying out further analysis, we also identified that other users of this community also had information about online meetings. As for fragmented communities, they have more stable social topics. Users 981916, 118634, 72972 and 1417120 are united by a concert topic. The situation is similar for users 62729 and 4583: they are united by the topic of health. At the same time, users 1361373 and 892245 are also interested in healthcare, but they were not in the health community. The above factors may indicate the following peculiarities when working with graph methods. 6 1. Graph methods partially allow you to track trends in society. In particular, it ap- plies to online meetings caused by the coronavirus pandemic: user 696481 held remote meetings dedicated to religious topics, and user 2500528 commented on distance education. 2. If the life in the society had been calmer, the final representation could have been more fragmented, users of the big community having no connections. This indi- cates the impossibility of taking into account semantic variations of posts. 4.3 Cluster analysis The k-means method analyzes the sets of given objects and creates k optimal groups, but the nature of these groups is not known in advance. As a result, the clustering process must be repeated a number of times with different parameters in order to find the most stable variant of hidden communities. 7 groups of hidden communities were detected. For a more detailed analysis, the graphic data were labeled corresponding to user ids (Figure 5). Fig. 5. Example of data labels As we have already noted, the k-means method does not require much time; it is suit- able for preliminary procedures, after which more powerful algorithms are required. At the same time, we can note some shortcomings of this algorithm when analyzing hidden communities. First, the algorithm is outlier sensitive: i.e. theoretically, those users whose topics do not coincide with others, will be assigned to any class. User 1415502 is keen on subcultures, but other users from the corpus do not write posts on the same topic. At the same time, using the k-means method, it was observed that this user and user 167175 are in the topical community dedicated to games. It should also be noted that the algorithm cannot cope with the task in which ob- jects can belong to different topical groups. Unlike the graph method, all users can belong to the only hidden community. So, for instance, the posts of user 1955327 are dedicated to books and health, although in Figure 5 the text of the user were assigned to the only topical community. A possible solution to the problem is to combine dif- ferent clustering methods. In particular, the DBSCAN algorithm will take into ac- count the “noise” when constructing models, while the c-means algorithm – a fuzzy clustering method, which is the improvement of the k-means method – assigns texts to different clusters with a certain probability. It is an equivalent to some elements of the model based on graphs when one node can be included in several communities. 7 4.4 Hybrid method When working with the hybrid approach, there are also advantages and disadvantages. A lot of algorithms for topic modeling and their variations have already been success- fully adapted to Russian corpora of social networks [4, 5, 11]. It should also be noted that, while improving the interpretation of models, we used a double ranking algorithm (Google PageRank and ipm in a Frequency Dictionary of Contemporary Russian by O.N. Lyashevskaya and S.A. Sharov) to avoid low- frequency topic labels. In the final model of communities, users can belong to differ- ent topic communities, or have unique interests that do not intersect with the interests of other people. The approach allows taking into account the various interests of a user, so some people are members of several hidden communities. For instance, 820468, 14846, and 243903 are interested in linguistics, while 62729, 4583, 1361373, and 892245 pay attention to health. Mind that the graph-based method also united 62729 and 4583 in the same community, but 1361373 and 892245 were not in it. At the same time, we faced certain difficulties. For instance, texts on social net- works have some peculiarities of spelling of words or their graphical representation, they making us improve algorithms of tokenization and normalization or edit texts manually (remove diacritical symbols, correct misspellings, etc.). Moreover, approaches for automatic topic labeling in Russian and English do not have a gold standard that allows assessing the quality of the developed methods. We used Google Forms and 10 independent experts to assess the candidates: 0 is for irrel- evant labels, and 1 is for relevant ones (Figure 6, Table 1). Fig. 6. Comparative diagram of candidates for user 3854113 8 Table 1. Results of an expert assessment of candidates for topic labels of user 3854113 Topic of user 3854113: год, диктант, тотальный, язык, акция, два, вид, новость, человек, жизнь (year, dictation, total, language, action, two, kind, news, person, life). Bi- and trigrams РИА новости тотальный дик- число пятёрок (RIA news) тант (number of A’s) (total dictation) Assessing results 3 10 1 Unigrams текст диктант новости (text) (dictation) (news) Assessing results 4 5 4 In the example above, assessors are sure that bigram total dictation is the best one, while number of A’s is the worst one. There was also a need to unify different labels that are in the same semantic field: for instance, when we found adjective and noun modifiers in collocations, we chose the first one as a dominant collocation: student organizations, organizations of stu- dents – student organizations. It is also a time-consuming process. Further studies can involve the development of the gold standard, which will sig- nificantly save time for building models of hidden communities, and a semantic ana- lyzer that will unify different lexical-semantic variants of labels. Below we present a summary of the methods based on some parameters (Table 2). Table 2. Comparison of methods Parameters Graphs Clustering Hybrid methods Is the method time- Yes Yes No saving? Can users belong to Depending on the Yes Yes several communities? algorithm Are lexical and seman- tic features taken in No No Yes account? Is it possible to find out the name of hidden No No Yes communities? 5 Summary With the development of social networks, the focus and scope of users’ interaction are getting expanded. The analysis of the corpus of posts provides an insight into the structure of hidden communities existing in any online platform. The discovery of hidden communities is used in biology, sociology, as well as computational linguis- tics. In this paper, we have used a corpus of VKontakte posts and made a detailed re- view of some approaches to community detection in social networks. The obtained results of comparison show that, unfortunately, the state-of-the-art approaches are far from perfect: some have difficulties with analyzing linguistic properties of texts, and 9 others are rather time-consuming. We may expect that the manual creation of hidden community models of the corpus will have certain differences compared to our re- sults. Nonetheless, the combination of the methods, described in previous sections, can still be used for preliminary experiments. We hope to develop the research in the following ways:  improving the doc2vec model of the corpus for obtaining more precise results on clustering;  combining various clustering methods for choosing the most optimal one;  expanding the size of the already existed corpus including posts of other Russian social networks in order to detect latent links between a great number of users;  improving basic NLP procedures to simplify training topic models;  developing and assessing a gold standard for topic labeling procedures in the Rus- sian segment of topic modeling. References 1. Ahn, Y., Bagrow, J., Lehmann, S.: Link communities reveal multiscale complexity in net- works. Nature, 466, 761–764 (2010). 2. Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Discovering Hidden Groups in Communication Networks. In: Chen, H., Moore, R., Zeng, D.D., Leavitt, J. (eds.). Intel- ligence and Security Informatics. ISI 2004, Lecture Notes in Computer Science, 3073, 378–389. Springer, Berlin, Heidelberg (2004). 3. Bedi, P., Sharma, C.: Community detection in social networks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(3), 1–22 (2016). 4. Bodrunova, S., Blekanov, I., Kukarkin, M.: Topic modeling for Twitter discussions: Model selection and quality assessment. In: Proceedings of the 6th SGEM International Multidis- ciplinary Scientific Conferences on SOCIAL SCIENCES and ARTS SGEM2018, Science and Humanities, 207–214. STEF92 Technology Ltd., Sofia, Bulgaria (2019). 5. Bodrunova, S., Blekanov, I., Kukarkin, M.: Topics in the Russian Twitter and relations be- tween their interpretability and sentiment. In: Sixth International Conference on Social Networks Analysis, Management and Security, 549–554 (2019). 6. He, K., Li, Y., Soundarajan, S., Hopcroft, J.: Hidden community detection in social net- works. Inf. Sci., 425, 92–106 (2018). 7. Khlopotov, M., Startseva, N., Makarenko, A.: Analysis of movie lovers’ preferences and their thematic communities in social networks. The Eurasian Scientific Journal, 11(2), 1– 11 (2019). 8. Magdon-Ismail, M., Goldberg, M., Wallace, W., Siebecker D.: Locating Hidden Groups in Communication Networks Using Hidden Markov Models. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C., Schroeder, J., Madhusudan, T. (eds.) Intelligence and Security Informatics. ISI 2003, Lecture Notes in Computer Science, 2665, 126–137. Springer, Ber- lin, Heidelberg (2003). 9. Mamaev, I., Mitrofanova, O.: Automatic Detection of Hidden Communities in the Texts of Russian Social Network Corpus. In: Filchenkov, A., Kauttonen, J., Pivovarova, L. (eds.) Artificial Intelligence and Natural Language. AINL 2020, Communications in Computer and Information Science, 1292, 17–33. Springer, Cham (2020). 10 10. Mishra, N., Schreiber, R., Stanton, I., Tarjan, R.: Clustering Social Networks. In: Bonato, A., Chung, F.R.K. (eds.) Algorithms and Models for the Web-Graph. WAW 2007, Lecture Notes in Computer Science, 4863, 56–67. Springer, Berlin, Heidelberg (2007). 11. Nagornyy, O., Koltsova, O.: Redefining media agendas: topic problematization in online reader comments. Media and Communication, 7(3), 145–156 (2019). 12. Rytsarev, I., Kupriyanov, A., Kirsh, D., Liseckiy, K.: Clustering of social media content with the use of BigData technology. Journal of Physics Conference Series, 921–927 (2018). 13. Salz, D., Benavides, N., Li, J.: Hidden Community Detection in Online Forums. CS224W: Machine Learning with Graphs, 1–10 (2019). 14. Singh, K., Shakya, H., Biswas, B.: Clustering of People in Social Networks based on Tex- tual Similarity. In: Perspectives in Science, 570–573 (2016).