Research Application of the Spam Filtering and Spammer Detection Algorithms on Social Media Nataliia Liubchenko1, Andrii Podorozhniak1 and Vasyl Oliinyk1 1 National Technical University “Kharkiv Polytechnic Institute”, Kyrpychova str. 2, Kharkiv, 61002, Ukraine Abstract There are a bunch of different social networks and messengers today, which in times of pandemic corona-virus and Russian war in Ukraine have take a really big part of our entire live, especially in our work activities. Besides that, the problem with the spam and spammers is the most relevant than ever, the count of spam in the work text stream is continuously increased. Under spam we understand the text content that is not necessary in the particular text stream, in case of spammer it is meant the person that is sending the spam messages in his or her own purposes. The project was design to solve the scientific and applied problem of detecting spammers and identifying spam messages in the text context of any social network or messenger using various spam detection algorithms and spammer detection approaches. We have implemented four algorithms for spam recognition and the complex majority algorithm for spam recognition and spammer detection: an algorithm using naive Bayesian classifier, Support-vector machine, multilayer perceptron neural network and convolution neural network. The developed approach using a complex majority algorithm can be used not only to remove spam and spammer detecting, but also, for example, to antispam bot messages monitoring for chats that are important for a particular user. Keywords 1 Spam, Spammer Detection, Social Network, Antispam Bot, Complex Majority Algorithm 1. Introduction Thanks to various anti-spam and spammer algorithms, the share of spam in global email traffic in 2021 was down by 4.81 p.p. when compared to the previous reporting period, averaging 45.56% [1] (Figure 1). Figure 1: Percentage of spam in email traffic in 2021 COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland EMAIL: nlubchenko63@gmail.com (N. Liubchenko); andriipodorozhniak@gmail.com (A. Podorozhniak); oleynikwasya@gmail.com (V. Oliinyk) ORCID: 0000-0002-4575-4741 (N. Liubchenko); 0000-0002-6688-8407 (A. Podorozhniak); 0000-0002-7582-3568 (V. Oliinyk) ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) Most probably only inboxes have built-in the anti-spam algorithms, the others chat rooms do not have such functionality. It can be the reason why the spam percentage in the mail-boxes and others message is mostly the same. For instance, the malicious link injected to the message and sent to the company employ can be a big danger for the whole company. Therefore, our today’s world has an issue of monitoring the incoming text stream in social networks and messengers. Is also necessary to identify and ban spammers [2, 3], this facilitates the work of algorithms and complicates the life of spammers, and the most important is that it reduce the share of the spam as we see from Fig. 1. The ability to filter spam messages, identify and ban spammers in messengers and social networks can save a bunch of humanity time and prevent loss of information and money. To solve the problem we used algorithms using a naive Bayesian classifier, support vector method, multilayer perceptron neural network, convolution neural network and complex majority algorithm [4]. We also developed a simple algorithm that identifies and blocks the user that was recognized as a spammer. An approach with integrated application of the investigated algorithms can begin to solve the problem of spam in social networks and messengers. 2. Characterization of spam and spammer and how to deal with it Let’s start and firstly discuss what is the spam actually. Spam is a mass mailing of correspondence of an advertisement to people who have not expressed a desire to receive it [5, 6]. Here is the different types of spam: advertisements; phishing; Nigerian emails; mass mailings of letters with religious content; mass mailings to put the mail system out of service (causing the system crush); mass mailings of letters containing computer viruses (for their initial spread); mass mailings on behalf of another person in order to cause a negative attitude towards that person; The most popular spam spreading methods are the following [5, 7]: e-mail; usernet; messengers; substitution of Internet traffic; SMS messages; phone calls, etc. The receiver of the spam usually has to pay the Internet provider for the time used to receive the spam, in the same time for sender of the spam messages it costs almost nothing. The load traffic is also messed up because of the mass spread of spam, it also complicates the operation of information systems and resources. Due to mass mailings the user has to spend unnecessary time filtering the messages. To avoid this, we use anti-spam filters to save our time. But spam filters can also accidentally erase an important message by recognizing it as spam. The surest way to deal with spam is to prevent spammers from getting your email address. Auto-Spam Detection Software is called Anti-Spam Filters [8]. They can be applied by end-users or on servers. This software has two main approaches [9, 10]. 1. The content of the message is analyzed, based on that it is concluded whether it is spam or not. If a message is classified as spam, it can be flagged, moved to another folder or even deleted. Such software can run both on the server and on the client computer. With this approach you don't see the spam filtered, but you continue to pay the full cost for receiving it, because the anti-spam software receives each spam message anyway (wasting your money) and only then decides whether to show it or not. 2. It classifies the sender as a spammer without looking at the text of the message. This software can only work on the server which directly receives the messages. With this approach it's possible to reduce the cost - money is only spent on communicating with spam mailing programs (i.e. refusing to accept the messages) and on contacting other servers (if any) for verification. The gain, however, is not as great as you might expect. If the recipient refuses to accept the message, the spammer program tries to bypass the protection and send it another way. Each such attempt has to be repelled separately, which adds to the load on the server. Let's also take a look at a few basic spammer detection methods [11, 12]. Usually the existing spam detection options are categorized into two groups, i.e., linguistic-based, behavior-based. Linguistic-based Spam Detection. Linguistic-based methods aim at extracting the discriminative linguistic features to differentiate the fake users from normal ones. For example, these methods identify review spams according to linguistic clue, writing-style feature, syntactic pattern, LDA-based topic model, Bayesian generative model, positive-unlabeled learning, frame-based model, and document-level features. Behavior-based Spam Detection. Behavior-based spam detection aims at detecting a set of collective malicious manipulation of online reviews according to behavior-based features [13]. Our chosen algorithm is related to the Linguistic-based Spam Detection, since we define if the user is spammer or not based on his messages. This project discusses a statistical Bayesian spam filtering method using a support vector method, multilayer perceptron neural network, convolution neural network and complex majority algorithm for the spam filtering and spammer detection in social media. 3. Results As a training datasets were chosen the dataset of spam messages from the Kaggle SMS Spam Collection Dataset [14] and Spam Mails Dataset [15], but the dataset of messages from a particular company can also be used to train the algorithm. To implement the spam filtering algorithms, we used the Python 3.6 programming language, the PyCharm. programming environment and the Keras, NumPy, Sklearn and Pandas libraries [16, 17], MySQL DB for storing spammers and all users of the text stream. The simulation was performed on a LifeBook E744 notebook with 8Gb RAM, an Intel Core i7 CPU (up to 3.2 GHz) and an Intel HD Graphics 4600 video processor. The spam message analyzing process is shown in Figure 2. Figure 2: The spam message analyzing process We used four most popular spam recognition algorithms: Naïve Baes Classifier [18, 19], Multilayer Perceptron Neural Network [20, 21], Convolution Neural Network [22, 23] and Support Vector Machine [24, 25]. We get the message from the user (in our case, form Telegram user) then if the user is unknown in our system, we add him to our database (DB) with all of the users of the application, after that we analyze the message using all of the existing algorithms, passing the results from all algorithms to the Majority algorithm we calculate the spam percentage of the message [4]. Then the result of the Majority Algorithm is passed to the Spam Analyzer, which decides if the user that sent the message is spammer or not based on the provided spam percentage of the message and two last predictions. So to identify the user as a spammer we analyze his 3 last messages and if the average spam percentage is bigger than specified edge, we recognize the user as a spammer and put his id to the DB with spammers. The proposed complex majority algorithm shown in Figure 3 uses as inputs for the majority scheme the solutions of the Bayesian spam filtering method, Multilayer perceptron neural network, Support vector method and Convolutional neural network algorithms. To match the outputs of the algorithmic blocks (0… 1) with the inputs of the majority scheme (0, 1), their binarization with a threshold of 0.95 is performed. Figure 3: The majority algorithm process The results of the complex algorithm of antispam bot in the form of an estimate of the probabilistic of correct spam recognition for the test samples are shown in Figure 4. Figure 4: The results of recognition of the complex algorithm of antispam bot The implementation of the spam analyzing and spammer analyzing are shown in Figure 5. If a user is in the spammers DB his messages are being deleted without even analyzing them. The user receives the message that he was blocked. Only the manager of the application is able to remove users from the spammers. The process of putting spammers to the DB and the communication of the spam analyzer with the DB are shown in Figure 6. Figure 5: The implementation of the spam analyzing and spammer analyzing Figure 6: The process of putting spammers to the DB. The communication of the spam analyzer and the DB The general scheme of execution of the developed software application is given in Figure 7. Figure 7: The general scheme of execution of the developed software application Algorithm of analyzing spam messages and identifying a spammer contains the following steps. 3. The user enters into the software application the initial text that should be analyzed. 4. Software application parses the initial text into array of words, then each word is converted to the infinitive, then the resulting set of words is vectorized and transmitted to the input to the all of the used algorithms. 5. The algorithms analyze the received data and returns the result as the probability of belonging the received data to the class (each algorithm has two classes: spam and non-spam). 6. The received data passed through the Majority Algorithm to calculate the spam percentage. 7. The app decides if the user should be marked as spammer based on the last 3 spam prediction of his messages. 8. If the user was identified as a spammer he is blocked. The algorithm recognizes the user as a spammer only if the average value of the predictions of the last 3 messages sent exceeds the threshold value. So the actual amount of time that the algorithm requires to determinate the if the user is spammer cannot be calculated. We can only talk about the situation when the user sends another spam message which will be the last one before the user is recognized as a spammer and blocked. In this case the reacation time of the algorithm will be within 1 second. We are also independent of the database search time since we use the buffer to store the predictions of the last 3 messages of every user, so we do not go to the database every time user sends a message. 4. Testing And Comparison Also, in addition to the usual accuracy metric for evaluating selected algorithms, we used F1 score. Accuracy is a ratio between the correctly classified samples to the total number of samples. Nowadays it is the most used metric of classification performance. TP  TN (1) Accuracy  TP  FN  TN  FP where TP – (True Positive) correctly classified positive sample; FN – (False Negative) the sample is positive but it is classified as negative; TN – (True Negative) the sample is negative and it is classified as negative; FP – (False Positive) the sample is negative but it is classified as positive. The explanation of accuracy evaluation are shown in Figure 8. Figure 8: The explanation of accuracy evaluation The data set sample that we used for our project are shown in Figure 9. . Figure 9: SMS Spam Collection Dataset The results of the tests using accuracy metric are shown at Table 1. Table 1 The results of the testing algorithms on training and test samples Algorithm Training sample Test sample Bayes 0.988 0.982 SVM 0.998 0.989 CNN 0.990 0.985 Majority 1.000 0.999 In comparison purposes we also test all of the stuff using another dataset, called “Spam Mails Dataset”, so we can know how good our algorithms are when analyzing mail traffic (Figure 10). Figure 10: Spam Mails Dataset The result of the testing on Spam Mails Dataset can be seen on Table 2. Table 2 The results of the testing algorithms on training and test samples on Spam Mails Dataset Algorithm Training sample Test sample Bayes 0.922 0.898 SVM 0.983 0.956 CNN 0.954 0.949 Majority 0.965 0.959 5. Conclusions As part of this research, the scientific and applied problem of determining spam in the textual context of social networking messengers was solved by the example of Kaggle SMS Spam Collection Dataset and Spam Mails Dataset using chatbots in the popular messenger Telegram. Besides that, the basic spam detection algorithms were analyzed and the one was implemented in the application. 1. Considered the relevance of spam detection and possible problems due to spam intervention. 2. Consider the basic methods of spam recognition, namely naive Bayesian classifier, the method of support vectors, multilayer perceptron neural network and convolution neural network. 3. Consider the basic methods of spammer detection. 4. It was developed a program to filter spam and spammers detection in the messenger Telegram, that uses 4 implemented algorithms for spam recognition and proposed complex majority algorithm. All of the text traffic is also checked for the spam and spammers detection. 6. References [1] T. Kulikova, T. Shcherbakova, Spam and phishing in 2021, 2022. URL: https://securelist.ru/spam-and-phishing-in-2021/104407/. [2] S. R. Sahoo, B. B. Gupta, D. Peraković, F. J. G. Peñalvo, I. Cvitić, Spammer Detection Approaches in Online Social Network (OSNs): A Survey, in: L. Knapcikova, D. Peraković, M. Perisa, M. Balog (Eds.), Sustainable Management of Manufacturing Systems in Industry 4.0, EAI/Springer Innovations in Communication and Computing. Springer, Cham, 2022, pp. 159– 180. doi:10.1007/978-3-030-90462-3_11. [3] T. Sudalaimuthu, C. Dheeraj Kumar Reddy, B. Sairam Reddy, M. Lakshmi Sahithya, S. Visalaxi, Detecting spammer and fake user on social networks using machine learning approach, in: AIP Conference Proceedings, volume 2385, 050010, 2022. doi:10.1063/5.0071071. [4] N. Liubchenko, A. Podorozhniak, V. Oliinyk, Research of antispam bot algorithms for social networks, in: CEUR Workshop Proceedings, volume 2870, 2021, pp. 822–831. URL: http://ceur- ws.org/Vol-2870/paper61.pdf. [5] B. Liu, E. Blasch, Y. Chen, D. Shen, G. Chen, Scalable sentiment classification for Big Data analysis using Naïve Bayes Classifier, in: Proceedings of the IEEE International Conference on Big Data, USA, 2013, pp. 99–104. doi:10.1109/BigData.2013.6691740. [6] S. Kaddoura, G. Chandrasekaran, D.A. Popescu, J.H. Duraisamy, A systematic literature review on spam content detection and classification, PeerJ Computer Science, 8, e830, 2022. doi:10.7717/PEERJ-CS.830. [7] S. Chaudhry, S. Dhawan, R. Tanwar, Spam Detection in Social Network Using Machine Learning Approach, in: U. Batra, N. Roy, B. Panda (Eds.), Data Science and Analytics. REDSET 2019, Communications in Computer and Information Science, 2020, pp. 236–245. doi:10.1007/978-981-15-5830-6_20. [8] A. Mykytiuk, V. Vysotska, S. Albota, Spam Filtration System with the Use of Machine Learning Technology, in: Proceedings of the International Scientific and Technical Conference on Computer Sciences and Information Technologies, Lviv, 2021, pp. 124–130. doi:10.1109/CSIT52700.2021.9648757. [9] C. Zhao, Y. Xin, X. Li, Y. Yang, Y. Chen, A Heterogeneous Ensemble Learning Framework for Spam Detection in Social Networks with Imbalanced Data, Applied Sciences, 10, 936, 2020. doi:10.3390/app10030936. [10] What is spam and how to fight it, 2019. URL: https://www.ukraine.com.ua/uk/blog/marketing/chto-takoe-spam-i-kak-s-nim-borotsya.html. [11] D. Koggalahewa, Y. Xu, E. Foo, An unsupervised method for social network spammer detection based on user information interests, Journal of Big Data, 9, 7, 2022. doi:10.1186/s40537-021- 00552-5. [12] F. Masood, G. Ammad, A. Almogren, A. Abbas, M. Zuair, Spammer Detection and Fake User Identification on Social Networks, IEEE Access, volume 7, 2019, pp. 68140–68152. doi:10.1109/ACCESS.2019.2918196. [13] A. Peleshchyshyn, O. Markovets, V. Volodymyr, S. Albota, Identifying specific roles of users of social networks and their influence methods, in: Proceedings of the International Scientific and Technical Conference on Computer Sciences and Information Technologies, Lviv, 2018, pp. 39- 42. doi:10.1109/STC-CSIT.2018.8526635. [14] SMS Spam Collection Dataset [Data set]. URL: https://www.kaggle.com/uciml/sms-spam- collection-dataset. [15] Spam Mails Dataset [Data set]. URL: https://www.kaggle.com/datasets/venky73/spam-mails- dataset. [16] Python For Beginners, Python Software Foundation. URL: https://www.python.org/about/gettingstarted/. [17] Applications for Python, Python Software Foundation. URL: https://www.python.org/about/apps/. [18] S. Sugahara, M. Ueno, Exact Learning Augmented Naïve Bayes Classifier, Entropy, 2021, 23, 1703. doi:doi.org/10.3390/e23121703. [19] T. Wei, Understanding of the naive Bayes classifier in spam filtering, in: AIP Conference Proceedings, volume 1967, 020007, 2018. doi:10.1063/1.5038979. [20] R. Jehad, S. A. Yousif, Classification of fake news using multi-layer perceptron, in: AIP Conference Proceedings, volume 2334, 1, 070004, 2021. doi:10.1063/5.0042264. [21] N. Liubchenko, A. Podorozhniak, V. Bondarchuk, Neural network method of intellectual processing of multispectral images, Advanced Information Systems, volume 1, 2, 2017, pp. 39– 44. doi:10.20998/2522-9052.2017.2.07. [22] C. I. Ejiofor, L. C. Ochei, Application of Convolutional Neural Network (CNN) for the Prediction of Spam Mail, Journal of Computer Science and Its Application, volume 28, 1, 2021. doi:10.4314/jcsia.v28i1.12. [23] V. Yaloveha, D. Hlavcheva, A. Podorozhniak, H. Kuchuk, Spectral Indexes Evaluation for Satellite Images Classification using CNN, Journal of Information and Organizational Sciences, volume 46, 2, 2021, pp. 95-113. doi:10.31341/jios.45.2.5. [24] L. Nguyen, Tutorial on Support Vector Machine, Applied and Computational Mathematics, volume 6, 4, 2017, pp. 1–15. doi:10.11648/j.acm.s.2017060401.11. [25] Z. S. Torabi, M. H. Nadimi-Shahraki, A. Nabiollahi, Efficient Support Vector Machines for Spam Detection: A Survey, International Journal of Computer Science and Information Security, volume 13, 1, 2015, pp. 11–28. URL: https://ia600301.us.archive.org/24/items/JournalOfComputerScienceIJCSISVol.13No.1January2 015/Journal%20of%20Computer%20Science%20IJCSIS%20%20Vol.%2013%20No.%201%20J anuary%202015.pdf.