=Paper=
{{Paper
|id=Vol-2482/paper39
|storemode=property
|title=Make Social Networks Clean Again: Graph Embedding and Stacking Classifiers for Bot Detection
|pdfUrl=https://ceur-ws.org/Vol-2482/paper39.pdf
|volume=Vol-2482
|authors=Kirill Skorniakov,Denis Turdakov,Andrey Zhabotinsky
|dblpUrl=https://dblp.org/rec/conf/cikm/SkorniakovTZ18
}}
==Make Social Networks Clean Again: Graph Embedding and Stacking Classifiers for Bot Detection==
Make Social Networks Clean Again: Graph Embedding and Stacking Classifiers for Bot Detection Kirill Skorniakov Denis Turdakov Andrey Zhabotinsky Ivannikov Institute for System Ivannikov Institute for System Ivannikov Institute for System Programming of the Russian Programming of the Russian Programming of the Russian Academy of Sciences Academy of Sciences Academy of Sciences Moscow Institute of Physics and National Research University Lomonosov Technology (State University) Higher School of Economics Moscow State University Moscow, Russia Moscow, Russia Moscow, Russia kirill.skorniakov@ispras.ru turdakov@ispras.ru zhabotinsky@ispras.ru Effective automatic methods are required to search for bots on the scale of the entire social network. Abstract Many papers addressed this problem in recent years [10, 13, 18, 1, 8]. Most of them are based on su- The paper introduces a novel approach to the pervised machine learning and produce valuable re- detection of social bots using ensembling of sults. Authors use features extracted from informa- classifiers. We also studied the impact of tion about particular profile available in most OSNs different feature sets and demonstrated the such as the publication of posts and comments, the power of graph embedding which is underused formation of non-directional (friendship) and directed by the existing methods. The main contri- (subscription) links. As far as we know, there are no bution of this work is a creating of a stack- works that use the global network structure for this ing based ensemble, which effectively exploits purpose. In this paper, we correct this omission with text and graph features. Empirical evalua- the aid of graph embedding. tion proved the effectiveness of the proposed Any development of a supervised classifier for bot method for bots detection and showed im- detection meets several difficulties. First of all, it’s provement in comparison to existing solutions hard to determine the definition of a “bot” and receive by 4-9 points of AUC. a labeled dataset. Another problem is the creation of a qualitative classifier that can effectively use and combine text, graph and other types of data. In this 1 Introduction work, we explore and utilize the existing solutions for Online social networks (OSNs) are an important part the first problem and focus on improving solutions for of life for many people. More than 2 billion people the second problem. use social network services. Unfortunately, online so- Many good algorithms that use different feature ex- cial networks like many other technologies provide op- traction methods were developed recently. It seems portunities for illegal and undesirable activities. Thus reasonable to combine them in order to achieve state- OSNs can be used to spread spam, phishing links, fake of-the-art result and stacking of algorithms can help news. In addition, there are many malefactors and achieve a quality gain in this task. fraudsters who extort money from users or engage in To summarize, we make the following contributions: illegal advertising. They disturb and defraud upstand- ing users. That is why the administration of social • We applied graph embedding techniques to ex- networks tries to find and block their profiles. tract additional features from the entire network; Copyright © CIKM 2018 for the individual papers by the papers' • We also created an efficient stacking based bot authors. Copyright © CIKM 2018 for the volume as a collection classifier, which combines graph and text infor- by its editors. This volume and its papers are published under mation. the Creative Commons License Attribution 4.0 International (CC BY 4.0). The rest of the paper is organized as follows. In statistics of a user’s followers and tweets. Onur Varol Section 2 we describe existing solutions. Section 3 et.al. [15] proposed framework with many different fea- is devoted to our stacking-based classifier. Section 4 tures. They are statistics of emotions, pos tags; post- presents our experiments. At the Section 5 we sum- ing time features, simple statistics of retweets, men- marize and conclude. tions, and hashtag co-occurrence networks. We will use works [8, 10] as baselines. Unfortunately, we can’t 2 Related Work directly compare with the [15] because they used many tools and features specific in English, but in our sam- 2.1 Bot Detection Labels ples, we have many different languages. An important issue is to collect a labeled dataset to de- To the best of our knowledge, the researchers did tect bots using machine learning algorithms The main not apply graph embedding techniques to the entire difficulties arise from the definition of the concept of a social graph to improve their solution to bots detec- bot. However, despite the ambiguity of the definition, tion. there are ways to obtain a rather precise labeling of bot dataset. 2.3 Graph Embedding According to [10] all labeling techniques can be cat- In the classical approach to building a feature vector egorized into three groups. They are: for objects, features are usually invented by experts • Manual annotation; in the field. This approach has several limitations. First of all, for each new area, an involvement of ex- • Lists of suspended users – users blocked by the perts leads to additional expenses for development. In social network administration; addition, features created by them could be compu- tationally complex. Therefore, methods of learning • Honeypots [8] – bots created by researches to lure representations, which automatically obtain qualita- other bots. tive representations of small dimensions became pop- ular in recent years. Such representations are usually We use the second approach because it requires the called “embedding” (word embedding, graph embed- least human resources for the markup and utilizes an ding, etc.). There are a large number of such meth- underlying information about the rules used by social ods for graphs. DeepWalk [12], Node2vec [4], and network administration which is unobtainable to ex- LINE [14] are the most popular of them due to the ternal researchers (for example, the number of com- linear complexity of the number of edges or vertices. plaints about the user). Since in our experience the quality of the above al- gorithms does not differ much for unweighted large 2.2 Bot Detection Features graphs, we used in our research LINE because of its A wide range of various features can be extracted from good computational speed. a social network and used for bot detection. Based In the LINE two probability models of the appear- on [10, 13, 18, 1] we can group these features in the ance of edges in the graph were proposed. The first following way: model, which preserve first-order proximity maximizes the joint probability of the observed edges {(vi , vj )}i,j • Text features which include the number of hash- in accordance with the following model: tags, links, geo-tags, words from spam list, statis- tics of posts sentiment and topics; 1 p1 (vi , vj ) = , 1 + exp(−~uTi ~uj ) • User-profile features usually imply the username, the number of friends, subscribes, photos, audios, where ~ui – vector representations of nodes. retweets; In the second model, vertices have two different rep- resentations (as in word2vec [9] model): ~ui when ver- • Time features include statistics of user online tex treated as vertex itself and ~u0i – when vertex treated time, user publications time; as the context of other vertices. Then joint link prob- ability would be p(vi , vj ) = p(vi )p2 (vj |vi ), where • Graph features involve information extracted from friendship and subscribes graphs such as T exp(u~0j ~ui ) PageRank and centrality. p2 (vj |vi ) = P . |V | ~0 T ui ) k=1 exp(uk ~ Fred Morstatter et. al. [10] used topics distribution, obtained by the Latent Dirichlet Allocation (LDA) al- The hidden parameters of these models are vector gorithm on tweets. Kyumin Lee et. al. [8] utilized representations of the vertices. These parameters are Bag of LogReg Subscribes Table 1: Bot Datasets Social Network Active Banned word2vec GRU VKontakte 158806 21263 Twitter 69602 7886 TF-IDF LogReg GB word dictionary, |D3char |, |D4char | — sizes of 3- and 4- LogReg grams dictionaries. Then TF-IDF transformation was KNN applied to this feature vector and logistic regression LINE was used for classification. GB Another classifier was GRU recurrent neural net- work [2] that receives word embeddings created by MLP well-known word2vec model [9] as input. We used word vectors with 300 dimensions preliminary trained on the our large corpus (3.3GB) of informal text col- Figure 1: Proposed scheme of stacking lected from online social networks. obtained by minimizing the Kullback-Leibler diver- 3.2 Subscribes classifier gence between the model and observed distributions (which is equivalent to maximizing the likelihood). To Information about subscribes we handle the following accelerate calculations, the method of negative sam- way: pling [9] is used. 1. Select largest N groups by number of subscribers; 3 Stacking for bot detection 2. Each user is represented as “bag of subscribes” – by the selected at previous step groups. In this section, we turn to the construction of a clas- sifier for prediction of bots. We use three types of “Bag of subscribes” is a binary vector with information attributes for each user: friendship graph, subscrip- about subscribes. The i-th element of this vector is tion information, and user’s texts. We produce various 1 if the user is subscribed to the i-th group and 0 transformations of these attributes to obtain a great otherwise. Length of final vector is N . set of vector representations of a user. Then we train logistic regression on this feature vec- Then there are two main options for exploring dif- tor with N = 10000. ferent feature spaces: combine these views with one large classifier or train a separate classifier on each 3.3 Graph classifiers of them and combine these classifiers with ensemble techniques. The first option considers relations be- For friendship analysis we use graph embedding fea- tween features in various spaces which could improve tures with size d = 100 obtained from LINE as input the predictive power of a model. But at the same time, to different classifiers: such classifiers have less flexibility and a tendency to overfitting. By flexibility, we mean the ability to use • Multilayer perceptron (MLP) the classifier or its parts in the case when for some features don’t exist for certain users or reusability of • K nearest neighbours with cosine distance (KNN) parts of the classifiers in the new domain. In addition, • Gradient Boositng Classifier (GB) the effectiveness of the method of ensemble algorithms was repeatedly proved in the data analysis competition • Logistic Regression (LogReg) and papers. Therefore, we chose the second option. For each of the three types of attributes, we build Machine learning algorithms were implemented their transformations into vector representations and with scikit-learn [11], lightGBM [6] and keras [3]. train on them different classifiers. 3.4 Stacking 3.1 Text classifiers There are several common ways to combine classifiers As one text representation we use the concatena- predictions. The most popular are the weighted aver- tion of bags of words and char n-grams representa- age and stacking. In case of weighted average result- tions, where n = (3, 4). Resulting feature vector has ing prediction is a convex combination of all classifiers |D| + |D3char | + |D4char | dimensions, where |D| — size scores. The main idea of stacking [16] is to use predic- For obtaining labels for VKontakte dataset we use tions of existing classifiers (which are called “first- the following process: layer” classifiers) as new features. Then a new clas- sifier, called a meta-classifier, is trained on them. 1. All users profiles from the social network are col- We append vector representations of the vertices of lected; the friendship graph as the most “powerful” features to 2. Profiles with the “deleted” and “banned” statuses the predictions of classifiers of the first level. Then we are excluded from the sample; use gradient boosting as a meta-classifier. The scheme of the proposed method is shown in Figure 1. 3. Statuses of all remaining accounts are recollected; 4. Users with the “banned” status after recollection 4 Evaluation receive a label “1”; In this section, we empirically evaluate the proposed 5. Users with the “active” status receive a label “0”. method on two real-life datasets and compare our re- sults with existing solutions. We use data collected from 2 until 16 November 2017 for our experiments 5 . 4.1 Twitter Dataset Of all users, we select users with more than 10 friends and texts in the open groups for the last month. There are several papers, which share their Twitter- The restriction on the number of friends is introduced based datasets ([15, 8]) 1 2 . Unfortunately, these to accelerate the computation of vector representa- datasets don’t contain followers graphs, because this tions. The second restriction is necessary in order data can’t be easily collected due to Twitter’s API to use text attributes in the final algorithm. Note limit. Also, we can’t obtain any information from sus- that both these restrictions only affect the speed of pended accounts. That’s why we construct dataset in the experiment, because if one of the attributes (text the following way: we use tweets from [17] 3 , graph or graph of friendship) is absent, the user can be classi- from [7] 4 and relations between sceen name and fied by another available attribute using only one clas- user id (for suspended users) from [5]. The result- sifier from the ensemble. The resulting graph contains ing dataset consists of 77488 users that are present in 110 million of vertices and 7 billion of edges. all these three datasets. To reduce the required computational resources for We used Twitter API to determine the suspended machine learning algorithms we select a small sample user’s. Users who have been suspending received a from the original dataset. We take all banned user label of 1, the rest received a label of 0. among the selected above as positive labels and 5% of active users as negative labels. The parameters of the 4.2 VKontakte Dataset final dataset are shown in the Table 1. Note that the In order to show that our method is applicable on algorithms of the graph embedding algorithms were different social networks, we also collect data from still trained on the whole friendship graph for obtain- VKontakte (Russian social network similar to Face- ing good node representations. book) that had friendly open API with fewer restric- 4.3 Metric and Baselines tions than in most OSNs. We use only public infor- mation, such as user’s posts and comments in open As a measure of quality, we use the area under ROC- groups, their friendship information, group subscribes. curve (AUC), that is a common metric for binary clas- Data is downloaded using VKontakte API. We sifications tasks. use information about the status of users as labels. As a first baseline algorithm for comparison, we Users can have one of the following statuses: “active”, took the results of the detection of bots from the pa- “deleted”, ”banned”. Users with the “deleted” status per by Zegzhda et.al. [18]. The choice of the paper have deleted their accounts themselves. Users with is justified by using the same task, quality measures, the “banned” status were blocked by VKontakte ad- and social network. We also used Twitter bot detec- ministration. We can’t obtain any information about tion algorithms BoostOR [10] and Boosting [8]. Our non-active profiles due to restrictions of VKontakte. implementation of the Boosting algorithm didn’t use one feature — dynamic of followers number, because 1 https://botometer.iuni.iu.edu/bot- we don’t have this information in our datasets. For repository/datasets.html 2 http://bit.ly/asonam-bot-data 5 VKontakte dataset is available at 3 http://snap.stanford.edu/data/bigdata/twitter7/tweets2009- http://talisman.ispras.ru/datasets/ 09.txt.gz 6 We did not re-implement the method and only give the re- 4 https://snap.stanford.edu/data/twitter-2010.html sult published by the author [18] Table 2: Bot detection results achieves the best score on the tests. Experimental re- Transformation Classifier VK Twitter sults on real-life datasets show the effectiveness of the tf-idf LogReg 0.714 0.632 proposed method and its applicability to the analysis Word2vec GRU 0.687 0.592 of the different social networks. Our method improved LINE GB 0.779 0.639 existing solutions by 4-9% of AUC score. LogReg 0.760 0.629 KNN 0.756 0.623 5.1 Acknowledgements MLP 0.769 0.638 This work is funded by the Minobrnauki Russia (grant “Bag of number: 14.604.21.0199 (id:RFMEFI60417X0199)). subscribes” LogReg 0.692 - first-layer Weighted classifiers Average 0.802 0.652 References first-layer classifiers + LINE GB 0.820 0.652 [1] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on twitter. Zegzhda et al. [18] MLP 0.73 6 - BoostOR In Collaboration, electronic messaging, anti-abuse LDA [10] 0.659 0.613 and spam conference (CEAS), volume 6, page 12, Boosting 2010. Lee et al. [8] [8] 0.637 0.605 [2] K. Cho, B. Van Merriënboer, D. Bahdanau, and performance estimation, we do standard 5-fold cross- Y. Bengio. On the properties of neural machine validation. translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014. 4.4 Results and Discussions [3] F. Chollet et al. Keras, 2015. The results of the experiments are shown in Table 2. It’s clearly seen that proposed method based on stack- [4] A. Grover and J. Leskovec. node2vec: Scalable ing different classifiers outperform existing approaches feature learning for networks. In Proceedings of by 4-9 points of AUC. It is worth paying attention the 22nd ACM SIGKDD international conference to the features obtained from graph embedding tech- on Knowledge discovery and data mining, pages niques. LINE provides powerful user representations. 855–864. ACM, 2016. It even allows achieving good quality with single clas- sifiers. We believe that the main reason for this is [5] N. O. Hodas and K. Lerman. The simple rules of that the graph structure is the most complex charac- social contagion. Scientific reports, 4:4343, 2014. teristic of a user account. Thus, the creation of a bot with a friendship graph similar to a normal user is a [6] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, complex and time-consuming task and the most bot W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly makers don’t do this. efficient gradient boosting decision tree. In Ad- Results also show that stacking of first layer clas- vances in Neural Information Processing Systems, sifiers with graph embedding features allows boosting pages 3146–3154, 2017. the best single classifier scores by 1-4% of AUC. We consider that the relatively poor results of all algo- [7] H. Kwak, C. Lee, H. Park, and S. Moon. What rithms on the Twitter dataset are caused by peculiar- is twitter, a social network or a news media? In ities of its construction. There was more than 8 years Proceedings of the 19th international conference between the collection of user’s texts and graphs and on World wide web, pages 591–600. AcM, 2010. the moment of the collection of labels. Also, to the best of our knowledge, Twitter does not let you know [8] K. Lee, B. D. Eoff, and J. Caverlee. Seven months if the user deleted his profile or it was blocked. with the devils: A long-term study of content polluters on twitter. In ICWSM, pages 185–192, 5 Conclusion 2011. This paper presents an ensemble approach to bots de- [9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, tection problem. We showed that graph embedding and J. Dean. Distributed representations of words techniques can be used to obtain powerful features to and phrases and their compositionality. In Ad- this task. We proposed stacking algorithm which effec- vances in neural information processing systems, tively combines text- and graph-based classifiers and pages 3111–3119, 2013. [10] F. Morstatter, L. Wu, T. H. Nazer, K. M. Carley, and H. Liu. A new approach to bot detection: striking the balance between precision and recall. In Proceedings of the 2016 IEEE/ACM Interna- tional Conference on Advances in Social Networks Analysis and Mining, pages 533–540. IEEE Press, 2016. [11] F. Pedregosa et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011. [12] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Pro- ceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data min- ing, pages 701–710. ACM, 2014. [13] V. Subrahmanian, A. Azaria, S. Durst, V. Ka- gan, A. Galstyan, K. Lerman, L. Zhu, E. Ferrara, A. Flammini, and F. Menczer. The darpa twitter bot challenge. Computer, 49(6):38–46, 2016. [14] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information net- work embedding. In Proceedings of the 24th Inter- national Conference on World Wide Web, pages 1067–1077. International World Wide Web Con- ferences Steering Committee, 2015. [15] O. Varol, E. Ferrara, C. A. Davis, F. Menczer, A. Flammini, V. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, et al. Online human-bot interactions: Detection, estimation, and characterization. Comm. ACM, 59:7, 2016. [16] D. H. Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992. [17] J. Yang and J. Leskovec. Patterns of tempo- ral variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 177–186. ACM, 2011. [18] P. Zegzhda, E. Malyshev, and E. Y. Pavlenko. The use of an artificial neural network to detect automatically managed accounts in social net- works, 2017.