=Paper=
{{Paper
|id=Vol-3611/paper5
|storemode=property
|title=Application-based spam detection with machine learning algorithms
|pdfUrl=https://ceur-ws.org/Vol-3611/paper5.pdf
|volume=Vol-3611
|authors=Ali Erbey,Necaattin Barişçi
|dblpUrl=https://dblp.org/rec/conf/ivus/ErbeyB22
}}
==Application-based spam detection with machine learning algorithms==
Ali Erbey¹, Necaattin Barışçı²

¹ Uşak University, Distance Education Vocational School, Department of Computer Programming, Uşak, Turkey

² Gazi University, Faculty of Technology, Department of Computer Engineering, Ankara, Turkey

===Abstract===
The use of social media sites such as Facebook, Instagram and Twitter is increasing day by day. Posts on social media can reach huge audiences and can even change the public agenda. On Twitter, agenda topics can be followed through the section called Trend Topic, which spammers may manipulate from time to time. To avoid such unwanted situations, it is necessary to determine whether a user is a spammer. Machine learning algorithms can classify whether a user is spam or not, and they also give successful results in areas such as image processing, speech and voice recognition, and malware detection. In this study, the machine learning algorithms Naive Bayes, k Nearest Neighbors, Random Forest, j48 and Multilayer Perceptron were used to classify users. In the evaluations, the Random Forest algorithm made the most successful classification, with an accuracy rate of 88%.

'''Keywords:''' Twitter, spam detection, machine learning

IVUS 2022: 27th International Conference on Information Technology. EMAIL: alierbey@gmail.com (A. Erbey); nbarisci@gazi.edu.tr (N. Barışçı). ORCID: 0000-0002-0930-4081 (A. Erbey); 0000-0002-8762-5091 (N. Barışçı). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

===1. Introduction===
With the widespread use of the internet and the growing use of mobile devices, online social networking sites such as Facebook, Twitter and LinkedIn are becoming more and more popular [1]. These sites are followed by millions of people. In addition to being places where friends, family or acquaintances can be contacted, they are also used as microblogging services, recommendation services, real-time news sources and content sharing platforms [2]. On Twitter, one of these sites, users share content by creating status messages. On Twitter, a popular microblogging site, these status messages are called tweets [3]. The tweets sent by users form the Trend Topic section, which reflects the current agenda topics.

From time to time, the Trend Topic section is steered in an undesirable direction by messages sent for purposes other than genuine sharing. The heavy use of social media has made it easier for malicious people to abuse these environments [4]. The desire to change the agenda is another avenue that malicious people can exploit. To prevent such abuse, many studies have been carried out in areas such as natural language processing and data mining with data collected from Twitter [5].

There is a large body of existing work on detecting spam in Twitter posts, and these studies cluster in certain areas. Twitter spam detection studies fall mainly into three groups: a) those that examine only the tweets, by text mining; b) those that analyze the tweet text in association with the user who sent the tweet; and c) those that examine the relationships between users and spam.

Text mining-based research mostly focuses on the tweet text. In these studies, researchers first extract features and then classify them with algorithms such as Naive Bayes and j48. Feature extraction is sometimes followed by feature selection to improve classification accuracy and reduce training time. Gupta and Kumar [6] used multiple linear regression to select important features. Other features, such as the tweet's character count, word count, and like or retweet count, are used in most research.

Some research takes user characteristics into account when deciding whether a user is spam. These features can include account age, the number of followers and accounts followed, the follower-to-following ratio, and the format of the profile page. However, since these features can easily be changed by the user, they are considered less reliable.

Because user characteristics can change easily, some researchers have studied the relationships between spammers and real users. By examining follower and following relationships, they created a network for each user. Setting up these networks can be costly in terms of computation time, computing power and data collection time.

The following sections of this study cover obtaining the spam dataset, classification, detection of spam users, and feature selection. The last section evaluates the results.

===2. Material and method===
The dataset is collected before classification. The data collection process plays an important role in the classification process.

====2.1. Spamming Twitter dataset====
In this study, spam was classified by evaluating user characteristics together with the attributes of users' tweets. This method was chosen because it is less costly in terms of data collection and, since it depends on user characteristics, it appears to give better results regardless of tweet content.

For training, a topic was selected from the Turkey Trend Topic list, since a dataset containing spam users was needed. From this trending topic, 15000 tweets were collected with the Twitter public API. Repetitive data, news content and tweets containing only a URL were then removed, and the remaining 3798 tweets were classified as spam or not spam. As a result, the 3798 tweets yielded 1666 spam and 2132 non-spam users.

To evaluate which users are likely to send spam tweets, up to the last 100 tweets of each user were then collected. Since some users did not have 100 tweets, a total of 305,604 tweets was reached. These tweets were processed for each user, and unnecessary data such as smileys were removed from the tweet text. Because tweets are informal texts, autocorrect libraries were used to correct typos. Then, as a more complex step, new features were derived from the tweet data, such as how often the user tweets, the average number of characters in the user's tweets, and the unique words tweeted in those 100 tweets. These features were added as columns in the final dataset.

The data collection and preprocessing stage produced a dataset of 3798 rows, one per user, and 22 columns in total, including features such as the age of the account, whether the account is verified, and whether the user entered a URL on the profile page.

====2.2. Classification====
The next step after data collection is to determine whether a user is a spam user by means of different machine learning algorithms. Weka software was used to classify the data as spam or not spam. Weka is a program developed for machine learning and text mining, intended to assist in the application of machine learning techniques [7]. In this study, the Naive Bayes, k Nearest Neighbor, Random Forest, j48 and Multilayer Perceptron classification algorithms were used in Weka. The algorithms used in other studies in the literature are shown in Table 1.

{| class="wikitable"
|+ Table 1: Algorithms used in other studies
! Authors !! Algorithms
|-
| Diale et al. [8] || SVM, RF, C4.5
|-
| Wang [1] || DT, NN, SVM, NB
|-
| Aydın et al. [4] || DT, LR, SVM
|-
| McCord et al. [9] || RF, SVM, NB, KNN
|}

The NB, KNN, RF, j48 and MLP algorithms were chosen for this study in order to combine algorithms that are widely used in the literature with less frequently used ones. The most used algorithms were chosen to allow comparison with previous studies. The less used algorithms were chosen to offer an alternative to the frequently used ones.

=====2.2.1. Naive Bayes=====
The Naive Bayes (NB) algorithm is a simple probabilistic classifier that calculates a set of probabilities by counting the frequency and combinations of values in a given dataset. It classifies data according to probability principles. Naive Bayes is a popular algorithm used in both commercial and open-source email spam filters [11].

=====2.2.2. k Nearest Neighbors=====
In 1968, Cover and Hart proposed the k Nearest Neighbor (KNN) algorithm, on which they had been working for a long time [12]. The intuition underlying k Nearest Neighbor classification is quite simple: samples are classified according to the class of their nearest neighbors [13]. An efficient algorithm for performing nearest neighbor operations on large datasets can provide rapid improvements for many applications [14]. KNN is one of the useful algorithms in terms of speed.

=====2.2.3. Random Forest=====
Breiman [15] developed the Random Forest (RF) method as an extension of classification trees. In the RF algorithm, each node is split using a randomly selected subset of features [16]. The algorithm aims to improve classification performance by using more than one decision tree.

=====2.2.4. j48=====
The purpose of the decision tree algorithm is to determine how the feature vector behaves for a few samples [17]. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm [17].

=====2.2.5. Multilayer Perceptron=====
The multilayer perceptron (MLP) is an algorithm that can be used effectively for classification and has been widely used recently. In general, MLP is trained with the backpropagation learning technique, which is based on the gradient descent method. With this technique, the error between the desired output and the produced output is minimized [18].

====2.3. Determination of spam users====
The data was divided into 80% training data and 20% test data and evaluated with the NB, KNN, RF, j48 and MLP algorithms. In the dataset, 1666 of the 3798 users are known to be spam users and 2132 are non-spam users. The test set contains 760 users, which is 20% of the data.

The Naive Bayes algorithm classified users with an accuracy rate of 76%. The confusion matrix of the NB algorithm is shown in Table 2.

{| class="wikitable"
|+ Table 2: NB confusion matrix
! !! Predicted class: Positive !! Predicted class: Negative
|-
! True class: Positive
| 382 || 60
|-
! True class: Negative
| 122 || 196
|}

According to Table 2, of the 760 users in the 20% test data, 382 spam users were classified as spam and 60 spam users were classified as non-spam. In addition, 122 non-spam users were classified as spam, while 196 non-spam users were classified as non-spam.

The KNN algorithm classified users with an accuracy rate of 74%. The confusion matrix of the KNN algorithm is shown in Table 3.

{| class="wikitable"
|+ Table 3: KNN confusion matrix
! !! Predicted class: Positive !! Predicted class: Negative
|-
! True class: Positive
| 357 || 85
|-
! True class: Negative
| 107 || 211
|}

According to Table 3, of the 760 users in the 20% test data, 357 spam users were classified as spam and 85 spam users were classified as non-spam. In addition, 107 non-spam users were classified as spam, while 211 non-spam users were classified as non-spam.

The Random Forest algorithm achieved an accuracy rate of 88%. Its confusion matrix is shown in Table 4.

{| class="wikitable"
|+ Table 4: RF confusion matrix
! !! Predicted class: Positive !! Predicted class: Negative
|-
! True class: Positive
| 406 || 36
|-
! True class: Negative
| 53 || 265
|}

According to Table 4, of the 760 users in the 20% test data, 406 spam users were classified as spam and 36 spam users were classified as non-spam. In addition, 53 non-spam users were classified as spam, while 265 non-spam users were classified as non-spam.

An accuracy rate of 85% was observed with the J48 algorithm. The confusion matrix of the J48 algorithm is shown in Table 5.

{| class="wikitable"
|+ Table 5: J48 confusion matrix
! !! Predicted class: Positive !! Predicted class: Negative
|-
! True class: Positive
| 382 || 60
|-
! True class: Negative
| 50 || 268
|}

According to Table 5, of the 760 users in the 20% test data, 382 spam users were classified as spam and 60 spam users were classified as non-spam. In addition, 50 non-spam users were classified as spam, while 268 non-spam users were classified as non-spam.

The MLP algorithm achieved an accuracy rate of 82%. The confusion matrix of the MLP algorithm is shown in Table 6.

{| class="wikitable"
|+ Table 6: MLP confusion matrix
! !! Predicted class: Positive !! Predicted class: Negative
|-
! True class: Positive
| 390 || 52
|-
! True class: Negative
| 82 || 236
|}

According to Table 6, of the 760 users in the 20% test data, 390 spam users were classified as spam and 52 spam users were classified as non-spam. In addition, 82 non-spam users were classified as spam, while 236 non-spam users were classified as non-spam.

====2.4. Feature selection====
Feature selection is one of the important steps of pattern recognition, machine learning and data mining. Its purpose is to eliminate irrelevant and redundant variables in order to understand the data, reduce the computational requirement, reduce the effect of dimensionality, and improve the performance of the predictor [19]. The columns retained by the Weka software after feature selection are shown in Table 7.

{| class="wikitable"
|+ Table 7: Remaining columns after feature selection
! colspan="2" | Column name
|-
| Friend_count || Favourite_freq
|-
| Account_age || Reply_freq
|-
| Tweet_freq || Unique_freq
|-
| Hashtag_freg || Spam
|}

The performance of the algorithms when the data is re-evaluated after feature selection is shown in Figure 1.

Figure 1: Performance of the algorithms

As shown in Figure 1, the algorithms obtained similar results with and without feature selection.

===3. Conclusion and discussion===
This study attempted to detect spam users on Twitter. The dataset consists of information on 3798 users, and different machine learning algorithms were used to classify them. The Random Forest algorithm achieved the highest accuracy rate, 88%. The NB algorithm achieved 76%, KNN 74%, j48 85% and MLP 82% correct classification rates. When the algorithms were applied again after feature selection, the accuracy rate decreased for all algorithms except NB.

In addition to the accuracy rates, the number of users whose real class is negative but who are classified as positive in the confusion matrix is also important. Because these users are classified as spam even though they are not, they would be harmed if the algorithm were trusted, which is an undesirable outcome. The results show that the j48 algorithm gives the lowest number of such users, 50.

As a suggestion for future research, different feature extraction optimizations and different machine learning algorithms can be tried on the collected data to achieve more successful classification results.

===4. References===
[1] A. H. Wang, Detecting spam bots in online social networking sites: a machine learning approach. In IFIP Annual Conference on Data and Applications Security and Privacy. Springer, Berlin, Heidelberg, 2010. pp. 335-342. doi:10.1007/978-3-642-13739-6_25
[2] M. Pennacchiotti and A.M. Popescu, A machine learning approach to twitter user classification. In Fifth International AAAI Conference on Weblogs and Social Media. 2011.
[3] A. Go, R. Bhayani and L. Huang, Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 2009. 1(12): pp. 2009.
[4] İ. Aydın, M. Sevi and M.U. Salur, Detection of Fake Twitter Accounts with Machine Learning Algorithms. In 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), 2018, pp. 1-4. doi:10.1109/IDAP.2018.8620830
[5] E.S. Akgül, C. Ertano and B. Diri, Twitter verileri ile duygu analizi. Pamukkale University Journal of Engineering Sciences, 2016. 22(2): pp. 106-110.
[6] D.K. Gupta and A. Kumar, Spam and Sentiment Analysis Model for Twitter Data Using Statistical Learning. In Proceedings of the Third International Symposium on Computer Vision and the Internet. 2016. ACM.
[7] G. Holmes, A. Donkin and I.H. Witten, Weka: A machine learning workbench. 1994.
[8] M. Diale, T. Celik and C. Van Der Walt, Unsupervised feature learning for spam email filtering. Computers & Electrical Engineering, 2019. 74: pp. 89-104.
[9] M. McCord and M. Chuah, Spam detection on twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing. 2011. Springer.
[10] T.R. Patil and S. Sherekar, Performance analysis of Naive Bayes and J48 classification algorithm for data classification. International Journal of Computer Science and Applications, 2013. 6(2): pp. 256-261.
[11] V. Metsis, I. Androutsopoulos and G. Paliouras, Spam filtering with naive Bayes - which naive Bayes? In CEAS. 2006. Mountain View, CA.
[12] A. Kataria and M. Singh, A review of data classification using k-nearest neighbour algorithm. International Journal of Emerging Technology and Advanced Engineering, 2013. 3(6): pp. 354-360.
[13] P. Cunningham and S.J. Delany, k-Nearest neighbour classifiers. Multiple Classifier Systems, 2007. 34(8): pp. 1-17.
[14] M. Muja and D.G. Lowe, Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014. 36(11): pp. 2227-2240.
[15] L. Breiman, Random forests. Machine Learning, 2001. 45(1): pp. 5-32.
[16] K.J. Archer and R.V. Kimes, Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 2008. 52(4): pp. 2249-2260.
[17] G. Kaur and A. Chhabra, Improved J48 classification algorithm for the prediction of diabetes. International Journal of Computer Applications, 2014. 98(22).
[18] A.R. Yılmaz, O. Yavuz and B. Erkmen, Training multilayer perceptron using differential evolution algorithm for signature recognition application. In 2013 21st Signal Processing and Communications Applications Conference (SIU). 2013. IEEE.
[19] F. Asdaghi and A. Soleimani, An effective feature selection method for web spam detection. Knowledge-Based Systems, 2019. 166: pp. 198-206.
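
The accuracy rates reported in Section 2.3 follow directly from the per-classifier counts given there. The Python sketch below is not part of the original study (which used Weka). It only recomputes accuracy and the false-positive rate from the four counts reported for each algorithm, treating spam as the positive class. The dictionary name `CONFUSION` and the function names are illustrative assumptions, and the J48 counts are taken from the prose description of Table 5 (382, 60, 50, 268).

```python
# Counts reported in Section 2.3 for the 760-user test set.
# Tuple order is (TP, FN, FP, TN), with "spam" as the positive class.
# CONFUSION and the helper names below are illustrative, not from the paper.
CONFUSION = {
    "NB":  (382, 60, 122, 196),
    "KNN": (357, 85, 107, 211),
    "RF":  (406, 36, 53, 265),
    "J48": (382, 60, 50, 268),  # counts as stated in the prose describing Table 5
    "MLP": (390, 52, 82, 236),
}


def accuracy(tp, fn, fp, tn):
    """Share of all test users that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def false_positive_rate(tp, fn, fp, tn):
    """Share of genuine (non-spam) users that were wrongly flagged as spam."""
    return fp / (fp + tn)


def summarize(matrices):
    """Return {name: (accuracy, false_positive_rate)} for each classifier."""
    return {
        name: (accuracy(*counts), false_positive_rate(*counts))
        for name, counts in matrices.items()
    }


if __name__ == "__main__":
    for name, (acc, fpr) in summarize(CONFUSION).items():
        print(f"{name:>3}: accuracy={acc:.3f}  false-positive rate={fpr:.3f}")
```

Truncated to whole percentages, the computed accuracies agree with the rates stated in Section 2.3 (NB 76%, KNN 74%, RF 88%, J48 85%, MLP 82%), and J48 has the fewest non-spam users flagged as spam, which is the criterion discussed in the conclusion.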